03 Deep dive

Tail amplification in service chains

A request that fans out across nineteen services samples nineteen scheduling delays, and the end-to-end tail compounds every hop's worst case. This is the measurement on a real microservice application — and the honest story of the scheduler bug the run found, how we proved which scheduler was actually live in each window, and the fix.

Why chains amplify tails

A monolith's p99 is one service's p99. A call graph is different: if a request touches n services and each hop independently has probability p of hitting a slow scheduling event, the end-to-end request hits at least one with probability 1 − (1 − p)n — the per-hop tail is effectively multiplied by the fan-out. On a co-located node, the dominant slow event is a runnable thread waiting behind lower-value work; CFS distributes that wait across every service in the chain. That compounding argument is the mechanism; what we measured is the end-to-end result at the load generator, not a hop-by-hop decomposition — the raw records keep the distinction.

The workload: upstream DeathStarBench hotel-reservation — 19 services including five MongoDBs, four memcacheds, consul, and jaeger — all Guaranteed and pinned to one 4-vCPU EKS node (m5.xlarge, requests totaling 3550m), driven open-loop by wrk2 at a fixed 200 requests/s from a second node, with a BestEffort spinner ladder (0/1/2/4/8) as the co-located contention. The CFS arm is the same node with Temper's safe-mode annotation set.

The measurement

DeathStarBench, 19 services, end-to-end p99 growth across the ladder CFS 9.4× · Temper attached ~1.5×

End-to-end p99 growth from bg=0 to the top of the spinner ladder; p50 stayed 1.8–6.2 ms in both arms — the story is entirely in the tail.

0 p99 growth 9.4× ~1.5× CFS (19.6 → 183.9 ms) Temper (18.1 → 28.4 ms)

CFS end-to-end p99 grew from 19.6 ms (bg=0) to 183.9 ms (bg=8) — 9.4×; the report's prose summary states 5–8× across the mid-ladder. Attached Temper windows: 18.1 ms → 28.4 ms (~1.5×). Clean attached operating points: −75% at bg=2 (25.8 vs 104.6 ms) and −83% at bg=4 (28.4 vs 164.2 ms). Caveat inherited from the CPU-limits article: cpu.max is not enforced under sched_ext, so attached-arm wins are labeled “fence + limit-bypass” in the record. source: docs/training-artifacts/deathstarbench/REPORT.md · EKS 1.36, 2× m5.xlarge, scheduler v13

Getting a trustworthy baseline took as much engineering as the win. Two integrity findings from the record are worth repeating because they invalidate a lot of published microservice benchmarks: the Go services saw 4 CPUs while living under sub-core quotas, so GC bursts exhausted the CFS quota window and produced 100–300 ms stalls that made even the baseline drift (fixed with GOMAXPROCS=1); and the upstream workload mix's 0.5% reserve writes grow a Mongo collection that the search path rescans, so p99 climbs with wall-clock time (fixed by making the mix read-only). Without those fixes the ladder measures throttle and dataset artifacts, not scheduling.

The eject saga, told honestly

This run also found a real scheduler bug, and we would rather narrate it than bury it. During the sustained-saturation windows, the scheduler (v13 at the time) was ejected by the kernel eight times. An eject is sched_ext's watchdog doing its job: if any runnable task is not serviced within 30 s, the kernel unloads the BPF scheduler and the node reverts to CFS — the designed fail-safe direction (the full contract is in the failure-engineering article). The workload kept running; the fence silently stopped existing.

The forensics are specific. The stalled tasks were per-CPU kworkers — kernel threads pinned to one CPU with migration disabled. v13's starvation floor guaranteed service at layer granularity: some CPU in the layer's mask would run the layer's work. But a pinned kworker can only run on its own CPU, and on a 4-CPU node where that CPU was monopolized by protected and background traffic, the kworker aged past the 30 s watchdog. The eject dump showed the open-consumption path skipping migration-disabled tasks wholesale (8,357 skip events). A second, agent-side defect compounded it: after an eject the agent took ~3 minutes to notice its scheduler process was dead, advertising a fence that was not there.

How did we know which measurement windows were polluted? PSI fingerprinting. Node-level CPU pressure (cpu some) is captured before and after every window, and the two schedulers have unmistakably different pressure shapes at the same density: attached windows run 46–65% (confined spinners pile up as runnable-but-waiting) while CFS windows run 6–35% (the pain is distributed into the primary). Every window whose fingerprint said CFS was marked, kept in the table with an asterisk, and excluded from attached claims. The report's own words: at bg=2, one polluted “Temper” window read 102.6 ms against CFS's 104.6 — “the ladder converging is literally CFS measuring itself.”

The fix shipped as scheduler v14: a per-CPU starvation guarantee (pinned kernel threads are serviced on their own CPU regardless of layer accounting) plus a scheduler supervisor in the agent that detects an unexpected exit and flips status/respawns in 3–6 s instead of minutes. The v14 re-run on a fresh cluster: zero ejects across the full ~35-minute saturated arm (v13: 8 ejects/11 attaches), Temper's tail flat at 17–20 ms from bg=2 up while CFS climbed to 29.5 ms — −19/−28/−33% at bg=2/4/8. The v13 and v14 campaigns ran on different fresh clusters with materially different CFS baselines, so the record forbids mixing rows across them; we quote each campaign's numbers only against its own baseline.

One honest residual from v14, with its mechanism: on the undersized EKS node (3.55 cores of requests on a 2-physical-core box) the bg=1 point measured +156% against CFS — small-node demotion pinned all 19 services onto one physical core's SMT siblings while CFS exploited the half-idle second core. Re-run on a right-sized GKE node (app scaled to fit cores−1), the catastrophe collapses to +14% at bg=1 and Temper wins from bg=2 (−39% at bg=4) with a dead-flat 8.5–9.2 ms tail. The sizing rule this produced — critical demand must fit within physical cores minus one — is now first-class guidance in operations.

What this run is evidence for

Raw records

  • docs/training-artifacts/deathstarbench/REPORT.md
  • docs/training-artifacts/deathstarbench/v14-validation/REPORT.md
  • docs/training-artifacts/pwb-v14-validation/raw/gke-e2-repro-agent.log
  • docs/training-artifacts/forensics/FINDINGS.md
  • docs/training-artifacts/forensics/FINDINGS-2-stranded-exclusive.md
  • docs/training-artifacts/headroom/gke-c2-http/v10-validation/REPORT.md
  • docs/training-artifacts/headroom/gke-c2-http/v11-validation/REPORT.md (eject-class lineage)

Committed benchmark records in the product repository; design partners get the full artifact tree. Windows that ran after a watchdog eject are asterisked in the raw tables, never silently dropped.