05 Deep dive
Scheduling inside the pod
A pod-level CPU number is an average over threads with completely different needs, and averages hide geometry: a pod can read “plenty of CPU, low utilization” while its two hot threads share one physical core's hyperthreads. This article is the evidence that thread-group scheduling — workload profiles — finds and closes gaps that no container-level tool can see.
What the average could not see
The motivating incident comes from llama.cpp inference on EKS: the pod ran ~20–25% slower under Temper than under CFS, even on an idle node. The pod-scoped aggregate said nothing was wrong: the layer's cpus_used read 1.23, so the workload was not starved of CPU time. The cause was visible only in the per-thread placement from the /observe endpoint: both inference threads were running on CPUs 0 and 2, which on that AWS m5 shape are the two hyperthreads of one physical core. scx_layered allocates whole cores; a 2–3 vCPU Confined layer on a 2-core node is one core, which guarantees sibling-stacking, while CFS happily spreads across cores.
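A placement like this can be checked directly on a Linux node. The sketch below is not Temper's placement linter; it is a minimal illustration that reads the kernel's standard sysfs topology file (`thread_siblings_list`) and reports whether two CPU ids are hyperthreads of the same physical core. The function names are hypothetical, chosen for this example.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,2' or '0-1,4' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_siblings(cpu):
    """Read the SMT siblings of one CPU from sysfs (Linux only)."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return parse_cpu_list(path.read_text())


def share_physical_core(cpu_a, cpu_b, siblings_of=read_siblings):
    """True if cpu_b is listed among cpu_a's hyperthread siblings."""
    return cpu_a != cpu_b and cpu_b in siblings_of(cpu_a)
```

On the m5 shape described above, `share_physical_core(0, 2)` would be expected to return `True`. The `siblings_of` parameter exists only so the check can be exercised with a fake topology off-node.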
Two things fell out of that investigation. First, the fix class: a workload profile declaring the hot threads an exclusive-core group closed ~60% of the idle gap immediately (−12% median). Second, an embarrassing tooling gap we kept in the record: the placement linter's SMT-collision check was silently skipping pinned tier layers, which is exactly where every unprofiled workload lives, so the invariant that should have caught this never evaluated it. That is fixed, with a red-first regression test.
A workload profile groups a pod's threads by name pattern into separate scheduler layers with their own treatment — exclusive cores for the hot path, latency treatment for wake chains, yield for housekeeping — while the pod's QoS tier still governs its standing against other pods. The mechanics are in the workload-profiles doc; what follows is the measured evidence.
ONNX inference: parity at peak, flat under load
ONNX Runtime CPU inference (ResNet-50, 3 intra-op threads, Guaranteed 3 CPU) on a 4-core c2-standard-8 shows the two halves of the story cleanly. Tier-only (no profile), Temper was perfectly flat across the background ladder (±1%, p99 ~45 ms) but paid a large peak gap: CFS reached 43–44 samples/s on the idle node by spreading the three threads over three physical cores, then degraded −38% by bg=4. Temper's whole-core tier layer had SMT-paired the threads — predictable, and slow. The generic tier config cannot express “3 whole cores with idle siblings”; that is precisely what profiles are for.
| Background spinners | CFS samples/s (p99) | Temper + exclusive profile samples/s (p99) |
|---|---|---|
| 0 | 44.45 (26.2 ms) | 44.46 (23.9 ms) |
| 1 | 44.27 (27.9 ms) | 44.60 (22.8 ms) |
| 2 | 28.15 (38.7 ms) | 44.71 (22.7 ms) |
| 4 | 22.65 (45.8 ms) | 44.59 (22.8 ms) |
| 8 | 30.58 (36.0 ms) | 44.61 (22.8 ms) |
With the exclusive-core profile (3 hot threads → 3 whole cores, siblings idle): idle parity with CFS (44.5 vs 44.5) and dead flat ±0.3% under density while CFS loses up to 49%. The earlier 22.9-vs-43 tier-only gap was pure SMT pairing. Single run per arm; c2-standard-8, 2026-07-02. source: docs/training-artifacts/onnx-inference/REPORT.md
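The two headline percentages can be recomputed from the table rows. This is a small sketch, not part of the benchmark tooling. It treats the "±0.3%" figure as the largest deviation from the mean across the five rows, which is an assumption about how the report derived it.

```python
# Throughput (samples/s) copied from the table above, keyed by background spinners.
cfs = {0: 44.45, 1: 44.27, 2: 28.15, 4: 22.65, 8: 30.58}
temper = {0: 44.46, 1: 44.60, 2: 44.71, 4: 44.59, 8: 44.61}


def worst_loss_vs_idle(series):
    """Largest throughput drop relative to the bg=0 row, in percent."""
    idle = series[0]
    return min((v - idle) / idle * 100 for v in series.values())


def spread_around_mean(series):
    """Largest absolute deviation from the series mean, in percent."""
    mean = sum(series.values()) / len(series)
    return max(abs(v - mean) / mean * 100 for v in series.values())


cfs_worst = worst_loss_vs_idle(cfs)         # roughly -49 (the bg=4 row)
temper_spread = spread_around_mean(temper)  # roughly 0.3
```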
The same report carries the negative result that produced a sizing rule: on a 2-core c2-standard-4, no profile can manufacture a third core for three hot threads. The original Confined clamp collapsed to −62% at bg=4; the fix (demoting an over-budget Confined layer to Grouped instead of caging it) bounds the worst case at −27% with a monotonic decline. That is bounded damage, not parity. Hence the guidance: do not put >2-CPU critical pods on 2-core nodes.
llama.cpp: the 12% confinement gap, closed
Back to the llama.cpp story. After the SMT root cause was found, the trail continued across node shapes. On a 4-core m5.2xlarge with no profile at all, the story flips: Temper stays flat (~1.98–2.02 s median) while CFS degrades +46% under background load, so Temper wins by −9% at bg=4 and −21% at bg=8. Whole-core allocation spreads hot threads naturally when there are enough cores, so the SMT stacking was a product of 2-core geometry, not a general defect. What remained was a ~12% idle confinement gap against CFS's free run of the whole node.
The exclusive profile on c2-standard-8 closed that too: idle parity (1512 vs 1514 ms median), and under density a worst-case drift of +3.7% while CFS degraded +27%. At bg=8 Temper measured −18% median and −24% p90 against CFS. One profile detail is worth showing because it is the kind of thing you get wrong by hand: the group's cpu_fraction had to be 0.85, not 0.9, because the pod's aggregate request (2250m) times 0.9 is 2.025 CPUs, which rounds up to three hot threads, that is, three exclusive cores for a two-thread server. The committed profile TOML documents the arithmetic.
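The rounding is easy to reproduce. The sketch below assumes the core count is the ceiling of request times fraction, which is how "rounds up" reads here. The function name is hypothetical, and the committed profile TOML remains the authoritative record of the arithmetic.

```python
import math


def exclusive_cores(request_millicores, cpu_fraction):
    """Whole cores implied by request * fraction, rounded up (assumed rule)."""
    return math.ceil(request_millicores * cpu_fraction / 1000)


at_090 = exclusive_cores(2250, 0.90)  # 2.025 CPUs rounds up to 3
at_085 = exclusive_cores(2250, 0.85)  # 1.9125 CPUs rounds up to 2
```

Any fraction that keeps the product at or below 2.0 CPUs would land on two cores under this rule. The text above records only that 0.85 was the value chosen.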
MySQL: profile validation, and the CommRegex bug
The MySQL run (sysbench OLTP, quota-limited; the throttle side of it is in the CPU-limits article) doubled as the first live validation of the builtin mysql-innodb profile: connection threads in a Confined exclusive layer, InnoDB internals (ib_*) in a weight-boosted Open layer, the remainder in a low-weight Open layer. Result: p99 of 15.6–16.7 ms, dead flat through bg=8, against CFS's 60–66 ms, with ~3.3 background cores reclaimed at 0.99 node utilization.
“First live validation” is doing honest work in that sentence. The builtin profile had shipped emitting a CommRegex thread matcher, a variant scx_layered does not have. The spec parse failed, the scheduler exited before attach, and the node silently stayed on CFS: the profile had never actually run until this benchmark tried it. (The regex also used negative lookahead, which the Rust regex engine rejects, so it was doubly dead.) The fix grouped InnoDB internals by the ib_ comm prefix with an exclude-based remainder, and the incident is why profile-load failures now fail loudly. One caveat stays attached: this run has no tier-only comparison arm, so the profile's marginal contribution over plain tiers is not isolated in this record.
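The idea behind prefix grouping with an exclude-based remainder can be shown without any regex. The sketch below is illustrative Python, not the scx_layered spec format or the shipped profile. Only the `ib_` prefix comes from the text; the group names, the `connection` prefix, and the sample thread names are hypothetical.

```python
# Ordered (group, comm-prefix) rules; only "ib_" is taken from the article.
RULES = [
    ("connections", "connection"),   # hypothetical prefix for connection threads
    ("innodb-internals", "ib_"),
]
REMAINDER = "remainder"


def classify(comm, rules=RULES, remainder=REMAINDER):
    """Return the first group whose prefix matches the thread name (comm).

    A thread lands in the remainder by exclusion: it matched none of the
    listed prefixes, so no negative-lookahead pattern is needed.
    """
    for group, prefix in rules:
        if comm.startswith(prefix):
            return group
    return remainder


def group_threads(comms):
    """Bucket thread names by group, preserving input order within a group."""
    groups = {}
    for comm in comms:
        groups.setdefault(classify(comm), []).append(comm)
    return groups


sample = ["connection", "ib_log_writer", "ib_io_rd-1", "signal_handler"]
grouped = group_threads(sample)
```

Because the remainder is defined by what the earlier rules did not match, adding a new prefix rule never requires rewriting a catch-all pattern.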
Cassandra: the case for a JVM profile
Cassandra ran tier-only (no profile exists for it yet) and still landed −78% p99 on the idle node (2.6 vs 11.8 ms; a re-check puts the honest idle range at −69% to −78%), with every density step better than CFS (−13% to −62%). The interesting part is where the biggest win sits: at idle, where there are no neighbors, the latency driver is the pod's own internal thread contention, with 84 JVM threads (request handlers vs. GC and compaction) bursting past a 3-core quota into refill freezes. That is intra-pod structure, which is exactly what a profile expresses: GC and compaction threads get batch treatment, request threads get latency treatment. A Cassandra/JVM profile is the obvious next increment, and this tier-only run is its baseline.
Where profiles stand
- Tier QoS alone carries most headline results (memcached never had a profile). Profiles are the second stage: they close peak-throughput gaps that whole-core tier confinement creates on small shapes, and they encode intra-pod structure that averages cannot.
- Profiles are measured artifacts, not hand-tuned configs: the training pipeline (observe → analyze → synthesize → refine) generates them from traces. Its own committed cycle record (phase4) is honestly mixed — the synthesized PyTorch profile trailed CFS on that ladder and one refine step went the wrong way — which is why refinement keeps only measured improvements.
- All runs above are single-run-per-arm lab shapes. The ONNX and llama profile wins replicate across two node shapes each; the MySQL and Cassandra results are single-shape so far.
Raw records
- docs/training-artifacts/onnx-inference/REPORT.md (+ onnx-inference.toml)
- docs/training-artifacts/llm-inference/FINDINGS.md
- docs/training-artifacts/llm-inference/smt-fix/FINDINGS.md (+ profile TOMLs)
- docs/training-artifacts/mysql-oltp/REPORT.md
- docs/training-artifacts/cassandra/REPORT.md
- docs/training-artifacts/phase4/REPORT.md (training-mode cycle, mixed result)
Committed benchmark records in the product repository; design partners get the full artifact tree.