05 Deep dive

Scheduling inside the pod

A pod-level CPU number is an average over threads with completely different needs, and averages hide geometry: a pod can read “plenty of CPU, low utilization” while its two hot threads share one physical core's hyperthreads. This article is the evidence that thread-group scheduling — workload profiles — finds and closes gaps that no container-level tool can see.

workloads: ONNX inference · llama.cpp · MySQL · Cassandra
clusters: GKE (c2) · EKS (m5)
records: 5 reports

What the average could not see

The motivating incident comes from llama.cpp inference on EKS: the pod ran ~20–25% slower under Temper than under CFS, including on an idle node. The pod-scoped aggregate said nothing was wrong — the layer's cpus_used read 1.23, so the workload was not CPU-starved by quantity. The cause was only visible in per-thread placement from the /observe endpoint: both inference threads were running on CPUs 0 and 2 — which on that AWS m5 shape are the two hyperthreads of one physical core. scx_layered allocates whole cores; a 2–3 vCPU Confined layer on a 2-core node is one core, which guarantees sibling-stacking, while CFS happily spreads across cores.
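The sibling check described above can be sketched in a few lines. This is a minimal illustration, not the product's linter: it assumes the per-thread placement has already been reduced to a thread-to-CPU mapping (the /observe payload format is not shown here, and the thread names are hypothetical), and it takes sibling lists in the format Linux exposes under /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

```python
from collections import defaultdict


def parse_cpu_list(text: str) -> frozenset[int]:
    """Parse a kernel CPU list such as '0,2' or '0-1,4' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return frozenset(cpus)


def smt_collisions(placement: dict[str, int], siblings: dict[int, str]) -> list[tuple[str, ...]]:
    """Return groups of threads whose CPUs are hyperthreads of the same physical core.

    placement: thread name -> CPU id it was observed on (hypothetical input shape).
    siblings:  CPU id -> contents of that CPU's thread_siblings_list file.
    """
    by_core: dict[frozenset[int], list[str]] = defaultdict(list)
    for thread, cpu in placement.items():
        # The sibling set identifies the physical core the CPU belongs to.
        by_core[parse_cpu_list(siblings[cpu])].append(thread)
    return [tuple(sorted(threads)) for threads in by_core.values() if len(threads) > 1]


# Illustrative topology: a 2-core, 4-vCPU node where CPUs 0 and 2 share a core.
siblings = {0: "0,2", 1: "1,3", 2: "0,2", 3: "1,3"}
placement = {"infer-0": 0, "infer-1": 2}  # hypothetical thread names
print(smt_collisions(placement, siblings))
```

With the two threads on CPUs 0 and 2 the function reports one colliding pair; move either thread to CPU 1 or 3 and the list is empty, which is the placement CFS arrived at by spreading.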

Two things fell out of that investigation. First, the fix class: a workload profile declaring the hot threads an exclusive-core group closed ~60% of the idle gap immediately (−12% median). Second, an embarrassing tooling gap we kept in the record: the placement linter's SMT-collision check was silently skipping pinned tier layers, which is exactly where every unprofiled workload lives, so the invariant that should have caught this never evaluated it. That is fixed, with a red-first regression test.

A workload profile groups a pod's threads by name pattern into separate scheduler layers with their own treatment — exclusive cores for the hot path, latency treatment for wake chains, yield for housekeeping — while the pod's QoS tier still governs its standing against other pods. The mechanics are in the workload-profiles doc; what follows is the measured evidence.

ONNX inference: parity at peak, flat under load

ONNX Runtime CPU inference (ResNet-50, 3 intra-op threads, Guaranteed 3 CPU) on a 4-core c2-standard-8 shows the two halves of the story cleanly. Tier-only (no profile), Temper was perfectly flat across the background ladder (±1%, p99 ~45 ms) but paid a large peak gap: CFS reached 43–44 samples/s on the idle node by spreading the three threads over three physical cores, then degraded −38% by bg=4. Temper's whole-core tier layer had SMT-paired the threads — predictable, and slow. The generic tier config cannot express “3 whole cores with idle siblings”; that is precisely what profiles are for.

bg spinners	CFS samples/s (p99 latency)	Temper + exclusive profile samples/s (p99 latency)
0	44.45 (26.2 ms)	44.46 (23.9 ms)
1	44.27 (27.9 ms)	44.60 (22.8 ms)
2	28.15 (38.7 ms)	44.71 (22.7 ms)
4	22.65 (45.8 ms)	44.59 (22.8 ms)
8	30.58 (36.0 ms)	44.61 (22.8 ms)

With the exclusive-core profile (3 hot threads → 3 whole cores, siblings idle): idle parity with CFS (44.5 vs 44.5) and dead flat ±0.3% under density while CFS loses up to 49%. The earlier 22.9-vs-43 tier-only gap was pure SMT pairing. Single run per arm; c2-standard-8, 2026-07-02. source: docs/training-artifacts/onnx-inference/REPORT.md
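The two summary figures can be rechecked from the table rows. The sketch below assumes "loses up to 49%" means the worst throughput relative to the same arm's idle row, and that the ± figure is half the min-to-max range over its midpoint; the report may define either slightly differently.

```python
# Throughput in samples/s by number of background spinners, copied from the table.
cfs = {0: 44.45, 1: 44.27, 2: 28.15, 4: 22.65, 8: 30.58}
temper = {0: 44.46, 1: 44.60, 2: 44.71, 4: 44.59, 8: 44.61}


def worst_loss_vs_idle(sps: dict[int, float]) -> float:
    """Largest relative drop against the bg=0 row (negative means slower)."""
    idle = sps[0]
    return min((value - idle) / idle for value in sps.values())


def half_spread(sps: dict[int, float]) -> float:
    """Half of the min-to-max range, relative to the midpoint of that range."""
    lo, hi = min(sps.values()), max(sps.values())
    return (hi - lo) / (hi + lo)


print(f"CFS worst case vs idle: {worst_loss_vs_idle(cfs):+.1%}")
print(f"Temper spread: ±{half_spread(temper):.2%}")
```

Under those assumptions the CFS worst case is the bg=4 row at about −49%, and the Temper arm stays within about ±0.3% across the whole ladder.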

The same report carries the negative that produced a sizing rule: on a 2-core c2-standard-4, no profile can manufacture a third core for three hot threads. The original Confined clamp collapsed to −62% at bg=4; the fix (demoting an over-budget Confined layer to Grouped instead of caging it) bounds the worst case at −27% with a monotonic decline — bounded damage, not parity. Hence the guidance: do not put >2-CPU critical pods on 2-core nodes.

llama.cpp: the 12% confinement gap, closed

Back to the llama.cpp story. After the SMT root cause was found, the trail continued across node shapes. On a 4-core m5.2xlarge with no profile at all, the story flips: Temper stays flat (~1.98–2.02 s median) while CFS degrades +46% under background load, which puts Temper ahead by −9% at bg=4 and −21% at bg=8. Whole-core allocation spreads hot threads naturally when there are enough cores, so the SMT stacking was a product of 2-core geometry, not a general defect. What remained was a ~12% idle confinement gap against CFS's free run of the whole node.

The exclusive profile on c2-standard-8 closed that too: idle parity (1512 vs 1514 ms median), and under density a worst-case drift of +3.7% while CFS degraded +27%; at bg=8 Temper measured −18% median and −24% p90 against CFS. One profile detail is worth showing because it is the kind of thing you get wrong by hand: the group's cpu_fraction had to be 0.85, not 0.9, because the pod's aggregate request (2250m) times 0.9 is 2.025 CPUs, which rounds up to three exclusive cores for a two-thread server. At 0.85 the product is 1.9125, which rounds up to two. The committed profile TOML documents the arithmetic.
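The cpu_fraction arithmetic is easy to verify. The sketch below assumes the group's core count is the ceiling of the pod request times cpu_fraction, which is what "rounds up" implies; the function name is hypothetical and the real sizing code may apply further clamps. Exact fractions are used so that binary floating point cannot push a product across an integer boundary.

```python
import math
from fractions import Fraction


def exclusive_cores(request_millicores: int, cpu_fraction: str) -> int:
    """Cores granted to a group: ceil(pod request in CPUs * cpu_fraction).

    Assumed rounding rule, inferred from the text; not taken from the scheduler source.
    cpu_fraction is passed as a string (e.g. "0.85") so it is parsed exactly.
    """
    share = Fraction(request_millicores, 1000) * Fraction(cpu_fraction)
    return math.ceil(share)


for fraction in ("0.9", "0.85"):
    print(fraction, "->", exclusive_cores(2250, fraction), "cores")
```

For the 2250m request, 0.9 yields three cores and 0.85 yields two, so the lower value is what matches a two-thread server.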

MySQL: profile validation, and the CommRegex bug

The MySQL run (sysbench OLTP, quota-limited — the throttle side of it is in the CPU-limits article) doubled as the first live validation of the builtin mysql-innodb profile: connection threads in a Confined exclusive layer, InnoDB internals (ib_*) weight-boosted Open, the remainder low-weight Open. Result: p99 15.6–16.7 ms dead flat through bg=8, against CFS's 60–66 ms, with ~3.3 background cores reclaimed at 0.99 node utilization.

“First live validation” is doing honest work in that sentence. The builtin profile had shipped emitting a CommRegex thread matcher — a variant scx_layered does not have. The spec parse failed, the scheduler exited before attach, and the node silently stayed on CFS: the profile had never actually run until this benchmark tried it. (The regex also used negative lookahead, which the Rust regex engine rejects — doubly dead.) The fix grouped InnoDB internals by the ib_ comm prefix with an exclude-based remainder, and the incident is why profile-load failures now fail loudly. Caveat that stays attached: this run has no tier-only comparison arm, so the profile's marginal contribution over plain tiers is not isolated in this record.
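Prefix matching with an exclude-based remainder needs no regular expressions at all, which is why it sidesteps the lookahead problem. The following is an illustrative model of the grouping logic only, not scx_layered's spec format: the Group type, the group names, and the "connection" prefix for connection threads are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    """A thread group selected by comm-name prefixes (hypothetical model)."""
    name: str
    include_prefixes: tuple[str, ...] = ()  # empty means "any comm"
    exclude_prefixes: tuple[str, ...] = ()

    def matches(self, comm: str) -> bool:
        # str.startswith accepts a tuple; an empty tuple never matches.
        if comm.startswith(self.exclude_prefixes):
            return False
        return not self.include_prefixes or comm.startswith(self.include_prefixes)


# Evaluated in order, first match wins. The remainder is defined by exclusion,
# which replaces the negative lookahead the original regex relied on.
GROUPS = (
    Group("connections", include_prefixes=("connection",)),
    Group("innodb-internals", include_prefixes=("ib_",)),
    Group("remainder", exclude_prefixes=("connection", "ib_")),
)


def classify(comm: str) -> str:
    """Return the name of the first group whose rule matches the thread's comm."""
    for group in GROUPS:
        if group.matches(comm):
            return group.name
    raise ValueError(f"no group matches comm {comm!r}")


for comm in ("connection", "ib_log_writer", "sig_handler"):  # illustrative names
    print(comm, "->", classify(comm))
```

Because the remainder group excludes exactly the prefixes the earlier groups include, every thread lands in one group and the rule order cannot change the outcome.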

Cassandra: the case for a JVM profile

Cassandra ran tier-only — no profile exists for it yet — and still landed −78% p99 on the idle node (2.6 vs 11.8 ms; a re-check puts the honest idle range at −69…−78%), with every density step better than CFS (−13…−62%). The interesting part is where the biggest win sits: at idle, where there are no neighbors, the latency driver is the pod's own internal thread contention — 84 JVM threads (request handlers vs. GC and compaction) bursting past a 3-core quota into refill freezes. That is intra-pod structure, which is exactly the thing a profile expresses: GC and compaction threads to batch treatment, request threads to latency treatment. A cassandra/JVM profile is the obvious next increment, and this tier-only run is its baseline.

Where profiles stand

Tier QoS alone carries most headline results (memcached never had a profile). Profiles are the second stage: they close peak-throughput gaps that whole-core tier confinement creates on small shapes, and they encode intra-pod structure that averages cannot.
Profiles are measured artifacts, not hand-tuned configs: the training pipeline (observe → analyze → synthesize → refine) generates them from traces. Its own committed cycle record (phase4) is honestly mixed — the synthesized PyTorch profile trailed CFS on that ladder and one refine step went the wrong way — which is why refinement keeps only measured improvements.
All runs above are single runs per arm on lab shapes; the ONNX and llama profile wins replicate across two node shapes each, while the MySQL and Cassandra results are single-shape so far.

Raw records

docs/training-artifacts/onnx-inference/REPORT.md (+ onnx-inference.toml)
docs/training-artifacts/llm-inference/FINDINGS.md
docs/training-artifacts/llm-inference/smt-fix/FINDINGS.md (+ profile TOMLs)
docs/training-artifacts/mysql-oltp/REPORT.md
docs/training-artifacts/cassandra/REPORT.md
docs/training-artifacts/phase4/REPORT.md (training-mode cycle, mixed result)

Committed benchmark records in the product repository; design partners get the full artifact tree.