Workload thread profiles
A pod is not one uniform workload. Workload profiles describe the thread structure inside a pod — connection threads, I/O chains, background housekeeping — and give each thread group its own scheduling treatment. This granularity exists nowhere above the kernel.
infera); the commands below are what works today. A rename migration is planned.
Why thread groups
Consider MySQL: connection threads want exclusive cores and instant wakeups; InnoDB I/O threads want latency treatment on short wake chains; purge and background threads should yield to everything else. Container-level tools see one CPU number for the whole pod and must treat those threads identically. Temper schedules at the layer where threads actually exist, so a profile can say: this group gets exclusive cores, that group gets latency treatment, the rest yield — all inside one pod, with the pod’s QoS tier still governing its standing against other pods.
The division of labor: tiers arbitrate between workloads; profiles structure the threads within one. Workload identity (image or annotation) selects which profile applies.
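To make the idea of per-group treatment concrete, here is a minimal Python sketch of a profile as an ordered list of thread groups, each with one of the three treatments named above. The class names, the glob-based thread matcher, and the thread names are all hypothetical and chosen only for illustration; they are not the agent's real data model or real MySQL thread names.

```python
from dataclasses import dataclass
from enum import Enum
from fnmatch import fnmatch


class Treatment(Enum):
    EXCLUSIVE = "exclusive cores"
    LATENCY = "latency treatment"
    YIELD = "yield"


@dataclass(frozen=True)
class ThreadGroup:
    name: str
    thread_name_glob: str  # hypothetical matcher; real profiles may identify threads differently
    treatment: Treatment


# Hypothetical MySQL-like profile; the thread-name globs are made up for illustration.
MYSQL_LIKE = [
    ThreadGroup("connections", "conn-*", Treatment.EXCLUSIVE),
    ThreadGroup("innodb-io", "io-*", Treatment.LATENCY),
]


def treatment_for(thread_name: str, groups: list[ThreadGroup]) -> Treatment:
    """Return the first matching group's treatment; unmatched threads yield."""
    for group in groups:
        if fnmatch(thread_name, group.thread_name_glob):
            return group.treatment
    return Treatment.YIELD


for name in ["conn-17", "io-read-2", "purge-1"]:
    print(name, "->", treatment_for(name, MYSQL_LIKE).value)
```

Everything here stays inside one pod: the pod's tier would still decide how the pod as a whole competes with its neighbors.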
Builtin profiles
The agent ships with compiled-in profiles for common workload shapes — for example a PyTorch training profile (keeps DataLoader worker threads fed so the GPU never starves) and a MySQL/InnoDB profile (connection threads exclusive, I/O threads latency-treated, background threads yielding). Builtins apply automatically when detection matches and can be overridden by file-based profiles with the same id.
Detection: how a pod gets a profile
- Annotation (explicit, wins): infera.io/workload-profile: mysql-innodb on the pod.
- Image match: each profile carries container-image patterns; a pod whose image matches gets the profile automatically.
Profiles are additionally keyed by machine shape — a profile tuned for an 8-core SMT node is not blindly applied to a 64-core one. Shape-matched file profiles override builtins with the same id, most specific match first.
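The precedence rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the agent's implementation: image patterns are assumed to be shell-style globs, an annotation naming an unknown profile is assumed to select nothing, machine-shape matching is left out entirely, and the `Profile` class, the `pytorch-training` id, and the `*mariadb*` pattern are hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional

ANNOTATION_KEY = "infera.io/workload-profile"  # annotation key from the text above


@dataclass(frozen=True)
class Profile:
    id: str
    image_patterns: tuple[str, ...]  # assumed to be shell-style globs
    source: str = "builtin"          # "builtin" or "file"


def effective_profiles(builtins, file_profiles):
    """File-based profiles replace builtins that share the same id."""
    merged = {p.id: p for p in builtins}
    merged.update({p.id: p for p in file_profiles})
    return merged


def select_profile(annotations, image, profiles) -> Optional[Profile]:
    """Annotation wins; otherwise fall back to the first image-pattern match."""
    wanted = annotations.get(ANNOTATION_KEY)
    if wanted is not None:
        return profiles.get(wanted)  # assumption: an unknown id selects no profile
    for profile in profiles.values():
        if any(fnmatch(image, pat) for pat in profile.image_patterns):
            return profile
    return None


builtins = [
    Profile("mysql-innodb", ("*mysql*",)),
    Profile("pytorch-training", ("*pytorch*",)),  # hypothetical id
]
files = [Profile("mysql-innodb", ("*mysql*", "*mariadb*"), source="file")]
profiles = effective_profiles(builtins, files)

# Image match picks the file-based override of the builtin.
print(select_profile({}, "docker.io/library/mysql:8.0", profiles))
# An explicit annotation beats the image match.
print(select_profile({ANNOTATION_KEY: "pytorch-training"},
                     "docker.io/library/mysql:8.0", profiles))
```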
File-based profiles
Custom profiles are TOML files (schema v1) with four sections: fingerprint
(how to detect the workload), machine_shape (what hardware the tuning was
measured on), traits (workload-level characteristics), and one or more
thread_groups (the per-group treatments). The shape, illustratively:
# my-service.toml — illustrative sketch; the shipped profiles are the
# authoritative schema reference
schema_version = 1
id = "my-service"
[fingerprint] # how pods are matched to this profile
# image patterns and/or annotation id
[machine_shape] # the node shape this tuning was measured on
# core count, SMT topology
[traits] # workload-level characteristics
[[thread_groups]] # one block per thread group:
# how to identify the group's threads, and its scheduling treatment
# (exclusive cores / latency treatment / yield)
Deploy profiles with the helm chart — they render into a ConfigMap mounted into the agent, and edits roll the DaemonSet automatically:
helm upgrade infera deploy/helm/infera --reuse-values \
--set-file 'agent.profiles.my-service\.toml'=./my-service.toml
Training mode: profiles you don’t write by hand
Writing a thread-group profile from first principles requires knowing your workload’s thread structure. Training mode measures it instead:
- Observe — capture a bounded kernel trace plus an /observe snapshot while the workload runs under representative load.
- Analyze — cluster the workload’s threads by runtime distribution, wake rate, and waker→wakee relationships; classify each cluster (sync compute, latency critical, I/O wake chain, sporadic).
- Synthesize — emit a profile TOML keyed to the machine shape it was measured on.
- Evaluate & refine — benchmark the profile against baseline, hill-climb one parameter at a time, and keep only measured improvements.
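The evaluate-and-refine step is, in spirit, a coordinate-wise hill climb. The Python sketch below shows that general technique under stated assumptions: the parameter names (`knob_a`, `knob_b`), the candidate values, and the toy `benchmark` function are hypothetical stand-ins, and the real pipeline's parameters, scoring, and stopping rules are not documented here.

```python
from typing import Callable, Dict, Sequence


def hill_climb(
    params: Dict[str, int],
    candidates: Dict[str, Sequence[int]],
    benchmark: Callable[[Dict[str, int]], float],
    min_gain: float = 0.0,
    max_rounds: int = 10,
) -> Dict[str, int]:
    """Vary one parameter at a time; keep a change only if the measured
    score beats the current best by more than min_gain."""
    best = dict(params)
    best_score = benchmark(best)  # baseline measurement
    for _ in range(max_rounds):
        improved = False
        for name, values in candidates.items():
            for value in values:
                if value == best[name]:
                    continue
                trial = {**best, name: value}  # change exactly one parameter
                score = benchmark(trial)
                if score > best_score + min_gain:
                    best, best_score = trial, score
                    improved = True
        if not improved:
            break  # a full pass found nothing better
    return best


def toy_benchmark(p: Dict[str, int]) -> float:
    """Hypothetical score (higher is better), peaking at knob_a=3, knob_b=5."""
    return -((p["knob_a"] - 3) ** 2) - ((p["knob_b"] - 5) ** 2)


start = {"knob_a": 0, "knob_b": 0}
space = {"knob_a": range(8), "knob_b": range(8)}
print(hill_climb(start, space, toy_benchmark))
```

With a noisy real benchmark, a positive `min_gain` is one simple way to avoid keeping changes that are within measurement noise.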
The pipeline is automated end to end on a live cluster; Perfetto trace bursts are bounded and run only during training or canary cycles, so the always-on cost stays under 1% CPU.
Measured effect
Tier QoS alone already carries most headline results: the memcached density run held a flat p99 with no profile at all. Profiles are the second stage: the GPU training wedge (+67% samples/s at density) leans on keeping DataLoader threads fed, and profile-tuned runs are how the training pipeline extracts further gains from workload-specific structure. See the benchmarks page for which result used what.