05 Deep dive

Scheduling inside the pod

A pod-level CPU number is an average over threads with completely different needs, and averages hide geometry: a pod can read “plenty of CPU, low utilization” while its two hot threads share one physical core's hyperthreads. This article is the evidence that thread-group scheduling — workload profiles — finds and closes gaps that no container-level tool can see.

What the average could not see

The motivating incident comes from llama.cpp inference on EKS: the pod ran ~20–25% slower under Temper than under CFS, including on an idle node. The pod-scoped aggregate said nothing was wrong — the layer's cpus_used read 1.23, so the workload was not CPU-starved by quantity. The cause was only visible in per-thread placement from the /observe endpoint: both inference threads were running on CPUs 0 and 2 — which on that AWS m5 shape are the two hyperthreads of one physical core. scx_layered allocates whole cores; a 2–3 vCPU Confined layer on a 2-core node is one core, which guarantees sibling-stacking, while CFS happily spreads across cores.
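The sibling relationship behind this incident can be checked mechanically. The following is a minimal sketch, not the placement linter itself: it assumes per-thread placement is already available as a thread-name to CPU mapping (the thread names and the helper names here are hypothetical), and it reads hyperthread siblings from the standard Linux sysfs file thread_siblings_list. The fake topology mirrors the m5 layout described above so the check can be exercised off-node.

```python
from pathlib import Path


def parse_cpulist(text: str) -> set[int]:
    """Parse a kernel cpulist string such as "0,2" or "0-1,4" into CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_siblings(cpu: int) -> set[int]:
    """Read the hyperthread siblings of one CPU from Linux sysfs."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return parse_cpulist(path.read_text())


def smt_collisions(placement: dict[str, int], siblings_of=read_siblings) -> list[tuple[str, str]]:
    """Return pairs of threads on distinct CPUs that share one physical core."""
    names = sorted(placement)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            cpu_a, cpu_b = placement[a], placement[b]
            if cpu_a != cpu_b and cpu_b in siblings_of(cpu_a):
                pairs.append((a, b))
    return pairs


# Hypothetical topology matching the incident: CPUs 0 and 2 are siblings, as are 1 and 3.
FAKE_TOPOLOGY = {0: "0,2", 1: "1,3", 2: "0,2", 3: "1,3"}


def fake_siblings(cpu: int) -> set[int]:
    """Stand-in for read_siblings so the check runs without sysfs."""
    return parse_cpulist(FAKE_TOPOLOGY[cpu])


# Two hot threads on CPUs 0 and 2 are flagged; on CPUs 0 and 1 they are not.
print(smt_collisions({"infer-0": 0, "infer-1": 2}, fake_siblings))
print(smt_collisions({"infer-0": 0, "infer-1": 1}, fake_siblings))
```

On a real node, calling smt_collisions with the default read_siblings uses the kernel's own topology, so the result does not depend on knowing the instance shape in advance.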

Two things fell out of that forensic analysis. First, the fix class: a workload profile declaring the hot threads an exclusive-core group closed ~60% of the idle gap immediately (−12% median). Second, an embarrassing tooling gap we kept in the record: the placement linter's SMT-collision check was silently skipping pinned tier layers, which is exactly where every unprofiled workload lives, so the invariant that should have caught this never evaluated it. That is now fixed, with a red-first regression test.

A workload profile groups a pod's threads by name pattern into separate scheduler layers with their own treatment — exclusive cores for the hot path, latency treatment for wake chains, yield for housekeeping — while the pod's QoS tier still governs its standing against other pods. The mechanics are in the workload-profiles doc; what follows is the measured evidence.

ONNX inference: parity at peak, flat under load

ONNX Runtime CPU inference (ResNet-50, 3 intra-op threads, Guaranteed 3 CPU) on a 4-core c2-standard-8 shows the two halves of the story cleanly. Tier-only (no profile), Temper was perfectly flat across the background ladder (±1%, p99 ~45 ms) but paid a large peak gap: CFS reached 43–44 samples/s on the idle node by spreading the three threads over three physical cores, then degraded −38% by bg=4. Temper's whole-core tier layer had SMT-paired the threads — predictable, and slow. The generic tier config cannot express “3 whole cores with idle siblings”; that is precisely what profiles are for.

bg spinners   CFS sps (p99)      Temper + exclusive profile sps (p99)
0             44.45 (26.2 ms)    44.46 (23.9 ms)
1             44.27 (27.9 ms)    44.60 (22.8 ms)
2             28.15 (38.7 ms)    44.71 (22.7 ms)
4             22.65 (45.8 ms)    44.59 (22.8 ms)
8             30.58 (36.0 ms)    44.61 (22.8 ms)

With the exclusive-core profile (3 hot threads → 3 whole cores, siblings idle), Temper reaches idle parity with CFS (44.5 vs 44.5 sps) and stays dead flat within ±0.3% under density, while CFS loses up to 49%. The earlier 22.9-vs-43 tier-only gap was pure SMT pairing. Single run per arm; c2-standard-8, 2026-07-02. Source: docs/training-artifacts/onnx-inference/REPORT.md
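The two headline percentages can be re-derived from the table. The sketch below assumes that "loses up to" means the largest drop relative to the bg=0 value, and that the "±" figure is half the min-to-max range over its midpoint; the report may define either differently. The function names are illustrative only.

```python
# (bg spinners, CFS samples/s, Temper + exclusive profile samples/s), copied from the table.
ROWS = [
    (0, 44.45, 44.46),
    (1, 44.27, 44.60),
    (2, 28.15, 44.71),
    (4, 22.65, 44.59),
    (8, 30.58, 44.61),
]


def worst_loss_pct(values: list[float]) -> float:
    """Largest drop relative to the first (idle, bg=0) value, in percent."""
    idle = values[0]
    return max((idle - v) / idle * 100.0 for v in values)


def half_spread_pct(values: list[float]) -> float:
    """Half the min-to-max range as a percentage of the range midpoint."""
    lo, hi = min(values), max(values)
    return (hi - lo) / 2.0 / ((hi + lo) / 2.0) * 100.0


cfs = [row[1] for row in ROWS]
temper = [row[2] for row in ROWS]

print(f"CFS worst loss vs idle: {worst_loss_pct(cfs):.1f}%")
print(f"Temper half-spread: {half_spread_pct(temper):.2f}%")
```

Under those definitions the CFS figure comes from the bg=4 row, and the Temper spread stays under the quoted ±0.3%.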

The same report carries the negative that produced a sizing rule: on a 2-core c2-standard-4, no profile can manufacture a third core for three hot threads. The original Confined clamp collapsed to −62% at bg=4; the fix (demoting an over-budget Confined layer to Grouped instead of caging it) bounds the worst case at −27% with a monotonic decline — bounded damage, not parity. Hence the guidance: do not put >2-CPU critical pods on 2-core nodes.

llama.cpp: the 12% confinement gap, closed

Back to the llama.cpp story. After the SMT root cause was found, the trail continued across node shapes. On a 4-core m5.2xlarge with no profile at all, the story flips: Temper stays flat (~1.98–2.02 s median) while CFS degrades +46% under background load, so Temper wins by −9% at bg=4 and −21% at bg=8. Whole-core allocation spreads hot threads naturally when there are enough cores, so the SMT stacking was a product of 2-core geometry, not a general defect. What remained was a ~12% idle confinement gap against CFS's free run of the whole node.

The exclusive profile on c2-standard-8 closed that too: idle parity (1512 vs 1514 ms median), and under density a worst-case drift of +3.7% while CFS degraded +27% — at bg=8 Temper measured −18% median and −24% p90 against CFS. A profile detail worth showing because it is the kind of thing you get wrong by hand: the group's cpu_fraction had to be 0.85, not 0.9, because the pod's aggregate request (2250m) times 0.9 rounds up to three hot threads — three exclusive cores for a two-thread server. The committed profile TOML documents the arithmetic.
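The cpu_fraction arithmetic is easy to reproduce. This is a minimal sketch that assumes the group's core count is the pod request times cpu_fraction, rounded up to the next whole core; that rule is inferred from "rounds up" in the text, and the scheduler's actual formula may differ. The function name is hypothetical.

```python
def exclusive_cores(request_millicores: int, fraction_pct: int) -> int:
    """Whole cores for a group: request * fraction, rounded up to the next core.

    fraction_pct is cpu_fraction expressed in percent (0.85 -> 85), so the
    whole calculation stays in integers and avoids float rounding surprises.
    """
    scaled = request_millicores * fraction_pct  # units: millicore-percent
    per_core = 1000 * 100                       # one full core in the same units
    return -(-scaled // per_core)               # ceiling division


print(exclusive_cores(2250, 90))  # 2250m * 0.90 = 2025m   -> 3 cores
print(exclusive_cores(2250, 85))  # 2250m * 0.85 = 1912.5m -> 2 cores
```

The 25m overshoot at 0.9 is enough to cross the 2-core boundary, which is why the difference between 0.9 and 0.85 costs a whole extra exclusive core.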

MySQL: profile validation, and the CommRegex bug

The MySQL run (sysbench OLTP, quota-limited — the throttle side of it is in the CPU-limits article) doubled as the first live validation of the builtin mysql-innodb profile: connection threads in a Confined exclusive layer, InnoDB internals (ib_*) weight-boosted Open, the remainder low-weight Open. Result: p99 15.6–16.7 ms dead flat through bg=8, against CFS's 60–66 ms, with ~3.3 background cores reclaimed at 0.99 node utilization.

“First live validation” is doing honest work in that sentence. The builtin profile had shipped emitting a CommRegex thread matcher — a variant scx_layered does not have. The spec parse failed, the scheduler exited before attach, and the node silently stayed on CFS: the profile had never actually run until this benchmark tried it. (The regex also used negative lookahead, which the Rust regex engine rejects — doubly dead.) The fix grouped InnoDB internals by the ib_ comm prefix with an exclude-based remainder, and the incident is why profile-load failures now fail loudly. Caveat that stays attached: this run has no tier-only comparison arm, so the profile's marginal contribution over plain tiers is not isolated in this record.
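The shape of the fix, prefix matching with an exclude-based remainder instead of a negative-lookahead regex, can be sketched as follows. This is an illustration of the grouping logic only, not the profile format or scx_layered's matcher: the group names, the rule table, and the use of "connection" as the comm of connection threads are all assumptions.

```python
def classify(comm: str, rules: list[tuple[str, str]], remainder: str) -> str:
    """Return the group of the first rule whose prefix matches, else the remainder."""
    for prefix, group in rules:
        if comm.startswith(prefix):
            return group
    return remainder


# Hypothetical rule set mirroring the three groups described above.
# "connection" as the comm of MySQL connection threads is an assumption.
RULES = [
    ("connection", "hot-exclusive"),  # Confined, exclusive cores
    ("ib_", "innodb-internal"),       # Open, weight-boosted
]
REMAINDER = "housekeeping"            # Open, low weight: everything not matched above


def group_threads(comms: list[str]) -> dict[str, list[str]]:
    """Bucket thread names (comm values) into profile groups."""
    groups: dict[str, list[str]] = {}
    for comm in comms:
        groups.setdefault(classify(comm, RULES, REMAINDER), []).append(comm)
    return groups


# Illustrative thread names; on a live system they would come from
# /proc/<pid>/task/<tid>/comm.
print(group_threads(["connection", "ib_log_writer", "ib_io_rd-1", "mysqld"]))
```

Because the remainder is defined as "whatever no earlier rule matched", there is no need for a pattern that says "not ib_", which is the construct the Rust regex engine rejects.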

Cassandra: the case for a JVM profile

Cassandra ran tier-only — no profile exists for it yet — and still landed −78% p99 on the idle node (2.6 vs 11.8 ms; a re-check puts the honest idle range at −69…−78%), with every density step better than CFS (−13…−62%). The interesting part is where the biggest win sits: at idle, where there are no neighbors, the latency driver is the pod's own internal thread contention — 84 JVM threads (request handlers vs. GC and compaction) bursting past a 3-core quota into refill freezes. That is intra-pod structure, which is exactly the thing a profile expresses: GC and compaction threads to batch treatment, request threads to latency treatment. A cassandra/JVM profile is the obvious next increment, and this tier-only run is its baseline.

Where profiles stand

Raw records

  • docs/training-artifacts/onnx-inference/REPORT.md (+ onnx-inference.toml)
  • docs/training-artifacts/llm-inference/FINDINGS.md
  • docs/training-artifacts/llm-inference/smt-fix/FINDINGS.md (+ profile TOMLs)
  • docs/training-artifacts/mysql-oltp/REPORT.md
  • docs/training-artifacts/cassandra/REPORT.md
  • docs/training-artifacts/phase4/REPORT.md (training-mode cycle, mixed result)

Committed benchmark records in the product repository; design partners get the full artifact tree.