06 Deep dive
Failure and rollback engineering
The question that decides whether kernel-level enforcement is adoptable is not the upside; it is what happens when it breaks. We benchmark the failure paths the way we benchmark the happy path. This article collects every fail-safe measurement in one place, with the mechanisms behind them.
The kernel's contract, not ours
sched_ext (upstream since Linux 6.12) is designed so a BPF scheduler
cannot take the node down with it. When the scheduler is disabled — by request
or by error — the kernel moves all tasks back to the stock fair-class scheduler
atomically. The kernel itself ejects the BPF scheduler on any of: the scheduler process
dying (crash, OOM-kill, pod deletion), any BPF-side error, a runnable-task stall
caught by a kernel-enforced watchdog (a hard 30 s cap: if any runnable task is not
scheduled within the timeout, the scheduler is ejected), or an operator request. The
consequence for risk assessment is structural rather than promised:
the worst case of a Temper scheduling failure is the scheduler you run today —
not a kernel panic, not a hung node; degradation to stock CPU behavior, on that node only.
The same direction holds for nodes that never qualify: where the kernel lacks sched_ext, the scheduler fails to attach and the agent enters safe mode — the node runs stock CFS exactly as it would without Temper. Mixed fleets degrade per-node, never per-cluster.
Measured failover: kill the agent under load
We force-killed the node agent mid-benchmark, with a latency-critical memcached under memtier load on the node:
Two details matter more than the headline. First, scx_layered keeps running
when the agent dies — the scheduler is not on the agent's life support; the EKS
record shows the workload literally not noticing. Second, when the scheduler itself goes,
the fallback is the kernel revert described above, so the whole kill-and-recover cycle on
GKE amounted to roughly a 5% p99 blip and fifteen seconds of DaemonSet rescheduling.
Eight hours of deliberate abuse
A soak run held the full benchmark workload (two Critical memcacheds under load, Normal fillers, Background spinners) for 8 hours while a churn generator created and deleted a pod every ~36 seconds — 399 cycles, which drove 798 scheduler reconfiguration restarts on the node that received them. Verdict, from the committed report: pass — no leak, no SLO drift, no agent instability.
Honest notes from the record: the soak's CFS-gap uses a restart-count × 75 ms estimate because reading 8 hours of logs on the churned node timed out (the restart count itself is exact, from metrics); one of 400 churn cycles was dropped when the wall clock ran out; and all churn landed on one node because the generator used no spread constraint — which is not how production pods behave, but it cleanly bounds the worst case.
What a reconfiguration costs
When QoS assignments change on a node, the agent regenerates the scheduler config and
restarts scx_layered; the node runs stock CFS for the gap. A deliberate churn
storm (50 assignment cycles in 10 minutes → 100 restarts on one node, with 48 no-op
reloads correctly skipped) measured 52.6 ms of CFS fallback per restart from
log kill→spawn timestamps — 0.88% of the busiest node's time at that extreme rate.
The EKS parity run showed the same blast-radius shape (all restarts confined to the churned
node, zero on the others) with a different per-restart cost model (~172 ms implied);
the records note the models differ, so we quote the measured GKE figure and the shape, not a
cross-cloud average. The failure direction during every gap is the same as everywhere else
on this page: absence of benefit, not harm.
The watchdog eject is designed behavior — and it fired
The kernel watchdog is not theoretical for us: during the DeathStarBench campaign it ejected scheduler v13 eight times under sustained saturation, each time reverting the node to CFS exactly as specified while we root-caused a per-CPU kthread starvation bug and shipped the v14 fix (zero ejects in the re-run). The full narrative — forensics, the PSI fingerprinting that identified which windows silently ran CFS, and the fix timeline — is in the service-chains article. What belongs here is the operational conclusion: the eject is the safety system. v14 also added a scheduler supervisor in the agent; a SIGKILL test against the live scheduler showed reap, status flip, and respawn in 3–5 s, closing the v13-era defect where the agent took minutes to notice its scheduler was gone.
The kill switch, and why rollback is two separate speeds
Fleet-wide rollback is one node annotation — temper.codes/safe-mode-requested
— honored directly by each node's agent with no control-plane round trip, so it works
even when everything above the node engine is down. Entering safe mode always succeeds (it
kills the scheduler; the kernel does the rest), and the agent also auto-enters safe mode on
scheduler crash. The operational point, spelled out in
the runbook: enforcement rollback (milliseconds,
kernel-native) is deliberately decoupled from software rollback (minutes, helm) — you
never wait on an image pull to get back to stock scheduling. Safe mode also restores CFS
with CPU quotas, which matters given the
cpu.max disclosure.
For upgrades, the chart ships a canary mechanism: a second DaemonSet with the candidate image on labeled nodes, with hard mutual-exclusion affinity against the main one. A misbehaving candidate's blast radius is the labeled nodes, and each of them fails toward stock CFS — the same direction as every other failure on this page.
What is measured vs. what is guaranteed
- Guaranteed by the kernel: atomic revert to the stock scheduler on scheduler death, BPF error, watchdog stall, or request. This is sched_ext's contract, independent of our code quality.
- Measured by us: the failover blip (GKE and EKS), the soak (memory, SLO envelope, CFS-gap), the churn cost, the supervisor recovery time, and the eject class found and closed on a real workload.
- Known costs, stated: ~52.6 ms of CFS per reconfiguration (debounced and batched; a deployment pattern that flaps PriorityClasses works against itself), and the cpu.max semantic difference while attached.
Raw records
- docs/training-artifacts/reliability/SOAK-REPORT.md
- docs/training-artifacts/reliability/failover-record.json
- docs/training-artifacts/reliability/churn-record.json
- docs/training-artifacts/reliability/soak-churn-record.json
- docs/training-artifacts/reliability/agent-mem.csv · slo.csv
- docs/training-artifacts/reliability/eks/ (failover + churn parity)
- docs/training-artifacts/binpack/SAVINGS-REPORT.md (§4 reliability)
- docs/training-artifacts/deathstarbench/v14-validation/REPORT.md (eject closure + supervisor kill test)
- docs/security/WHITEPAPER.md (§2, the kernel contract and kill switch)
Committed benchmark records in the product repository; design partners get the full artifact tree.