Workload thread profiles

A pod is not one uniform workload. Workload profiles describe the thread structure inside a pod — connection threads, I/O chains, background housekeeping — and give each thread group its own scheduling treatment. This granularity exists nowhere above the kernel.

Naming note. Binaries, the helm chart, and annotation keys currently ship under the project’s former name (infera), so the commands and keys below use that name and work as written today. A rename migration is planned.

Why thread groups

Consider MySQL: connection threads want exclusive cores and instant wakeups; InnoDB I/O threads want latency treatment on short wake chains; purge and background threads should yield to everything else. Container-level tools see one CPU number for the whole pod and must treat those threads identically. Temper schedules at the layer where threads actually exist, so a profile can say: this group gets exclusive cores, that group gets latency treatment, the rest yield — all inside one pod, with the pod’s QoS tier still governing its standing against other pods.

The division of labor: tiers arbitrate between workloads; profiles structure the threads within one. Workload identity (image or annotation) selects which profile applies.

Builtin profiles

The agent ships with compiled-in profiles for common workload shapes — for example a PyTorch training profile (keeps DataLoader worker threads fed so the GPU never starves) and a MySQL/InnoDB profile (connection threads exclusive, I/O threads latency-treated, background threads yielding). Builtins apply automatically when detection matches and can be overridden by file-based profiles with the same id.

Detection: how a pod gets a profile

  1. Annotation (explicit): set infera.io/workload-profile: mysql-innodb on the pod. An annotation always wins over an image match.
  2. Image match: each profile carries container-image patterns; a pod whose image matches gets the profile automatically.

Profiles are additionally keyed by machine shape — a profile tuned for an 8-core SMT node is not blindly applied to a 64-core one. Shape-matched file profiles override builtins with the same id, most specific match first.
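
The precedence between the two detection paths can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the agent's code: the Profile class, its image_patterns field, the select_profile function, and the use of glob-style matching are all hypothetical, and machine-shape matching is left out. Only the annotation key and the "annotation wins" rule come from this page.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import Optional

ANNOTATION_KEY = "infera.io/workload-profile"  # annotation key from this page


@dataclass
class Profile:
    # Hypothetical in-memory form of a profile; field names are assumptions.
    id: str
    image_patterns: list[str] = field(default_factory=list)


def select_profile(
    annotations: dict[str, str], image: str, profiles: list[Profile]
) -> Optional[str]:
    """Return the profile id for a pod, or None if nothing matches."""
    # 1. An explicit annotation wins over any image match.
    explicit = annotations.get(ANNOTATION_KEY)
    if explicit:
        return explicit
    # 2. Otherwise take the first profile with a matching image pattern.
    #    Glob matching is an assumption; the real pattern syntax may differ.
    for profile in profiles:
        if any(fnmatch(image, pattern) for pattern in profile.image_patterns):
            return profile.id
    return None
```

In this sketch an annotated pod gets the annotated id even when its image would match another profile, which mirrors the ordering of the list above. How the agent handles an annotation that names an unknown profile is not covered here.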

File-based profiles

Custom profiles are TOML files (schema v1) with four sections: fingerprint (how to detect the workload), machine_shape (what hardware the tuning was measured on), traits (workload-level characteristics), and one or more thread_groups (the per-group treatments). The shape, illustratively:

# my-service.toml — illustrative sketch; the shipped profiles are the
# authoritative schema reference
schema_version = 1
id = "my-service"

[fingerprint]      # how pods are matched to this profile
# image patterns and/or annotation id

[machine_shape]    # the node shape this tuning was measured on
# core count, SMT topology

[traits]           # workload-level characteristics

[[thread_groups]]  # one block per thread group:
# how to identify the group's threads, and its scheduling treatment
# (exclusive cores / latency treatment / yield)

Deploy profiles with the helm chart — they render into a ConfigMap mounted into the agent, and edits roll the DaemonSet automatically:

helm upgrade infera deploy/helm/infera --reuse-values \
  --set-file 'agent.profiles.my-service\.toml'=./my-service.toml

Training mode: profiles you don’t write by hand

Writing a thread-group profile from first principles requires knowing your workload’s thread structure. Training mode measures it instead:

  1. Observe — capture a bounded kernel trace plus an /observe snapshot while the workload runs under representative load.
  2. Analyze — cluster the workload’s threads by runtime distribution, wake rate, and waker→wakee relationships; classify each cluster (sync compute, latency critical, I/O wake chain, sporadic).
  3. Synthesize — emit a profile TOML keyed to the machine shape it was measured on.
  4. Evaluate & refine — benchmark the profile against baseline, hill-climb one parameter at a time, and keep only measured improvements.
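
Step 4 describes a coordinate-wise hill climb: vary one parameter, benchmark, and keep the change only if the measurement improves. A minimal sketch of that loop follows. The hill_climb function, the parameter names, and the benchmark callback are hypothetical, and the pipeline's actual parameters and scoring are not documented on this page.

```python
from typing import Callable

Params = dict[str, int]


def hill_climb(
    baseline: Params,
    candidates: dict[str, list[int]],
    benchmark: Callable[[Params], float],
    max_rounds: int = 5,
) -> tuple[Params, float]:
    """Change one parameter at a time; keep a change only if the
    benchmark score (higher is better) strictly improves."""
    best = dict(baseline)
    best_score = benchmark(best)
    for _ in range(max_rounds):
        improved = False
        for name, values in candidates.items():
            for value in values:
                if value == best[name]:
                    continue  # already the current setting
                trial = {**best, name: value}  # differs in exactly one parameter
                score = benchmark(trial)
                if score > best_score:
                    best, best_score = trial, score
                    improved = True
        if not improved:
            break  # a full pass found nothing better; stop early
    return best, best_score
```

A real benchmark is noisy, so a production version would likely need repeated runs or a minimum-improvement threshold before accepting a change. The sketch assumes a deterministic score to keep the control flow visible.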

The pipeline is automated end to end on a live cluster. Perfetto trace bursts are bounded and run only during training or canary cycles, so the always-on cost stays under 1% CPU.

Measured effect

Tier QoS alone already carries most headline results: the memcached density run held a flat p99 with no profile at all. Profiles are the second stage. The GPU training wedge (+67% samples/s at density) depends on keeping DataLoader threads fed, and profile-tuned runs are how the training pipeline extracts gains from workload-specific structure. See the benchmarks page for which result used what.