06 Deep dive

Failure and rollback engineering

The question that decides whether kernel-level enforcement is adoptable is not the upside; it is what happens when it breaks. We benchmark the failure paths the way we benchmark the happy path. This article collects every fail-safe measurement in one place, with the mechanisms behind them.

experiments: agent kill · 8h soak · churn storm · eject forensics clusters: GKE (e2) · EKS (m5) records: 4 reports + machine records

The kernel's contract, not ours

sched_ext (upstream since Linux 6.12) is designed so a BPF scheduler cannot take the node down with it. When the scheduler is disabled — by request or by error — the kernel moves all tasks back to the stock fair-class scheduler atomically. The kernel itself ejects the BPF scheduler on any of: the scheduler process dying (crash, OOM-kill, pod deletion), any BPF-side error, a runnable-task stall caught by a kernel-enforced watchdog (a hard 30 s cap: if any runnable task is not scheduled within the timeout, the scheduler is ejected), or an operator request. The consequence for risk assessment is structural rather than promised: the worst case of a Temper scheduling failure is the scheduler you run today — not a kernel panic, not a hung node; degradation to stock CPU behavior, on that node only.

The same direction holds for nodes that never qualify: where the kernel lacks sched_ext, the scheduler fails to attach and the agent enters safe mode — the node runs stock CFS exactly as it would without Temper. Mixed fleets degrade per-node, never per-cluster.

Measured failover: kill the agent under load

We force-killed the node agent mid-benchmark, with a latency-critical memcached under memtier load on the node:

0.607 / 0.639 / 0.607 p99 (ms) before the kill / during the CFS-fallback window / after the replacement agent re-attached docs/training-artifacts/reliability/failover-record.json

15.0 s replacement agent (it is a DaemonSet) Ready and scheduler re-attached docs/training-artifacts/binpack/SAVINGS-REPORT.md

0.271 → 0.239 → 0.215 the same experiment replicated on EKS (p99 ms, before/during/after) — zero SLO blip, agent Ready in 14.8 s docs/training-artifacts/reliability/eks/failover-record.json

Two details matter more than the headline. First, scx_layered keeps running when the agent dies — the scheduler is not on the agent's life support; the EKS record shows the workload literally not noticing. Second, when the scheduler itself goes, the fallback is the kernel revert described above, so the whole kill-and-recover cycle on GKE amounted to roughly a 5% p99 blip and fifteen seconds of DaemonSet rescheduling.

Eight hours of deliberate abuse

A soak run held the full benchmark workload (two Critical memcacheds under load, Normal fillers, Background spinners) for 8 hours while a churn generator created and deleted a pod every ~36 seconds — 399 cycles, which drove 798 scheduler reconfiguration restarts on the node that received them. Verdict, from the committed report: pass — no leak, no SLO drift, no agent instability.

192 Mi flat agent memory, all three agents, stable to within 1 Mi across 8 h of continuous churn (50 samples each) docs/training-artifacts/reliability/agent-mem.csv

0.21% of the churned node’s time spent in CFS fallback across 798 reconfig restarts docs/training-artifacts/reliability/SOAK-REPORT.md

0 crashes the 798 restarts are clean reconfigs, not failures — agent restartCount stayed 0 on every pod; SLO envelope stable first hour to last docs/training-artifacts/reliability/SOAK-REPORT.md

Honest notes from the record: the soak's CFS-gap uses a restart-count × 75 ms estimate because reading 8 hours of logs on the churned node timed out (the restart count itself is exact, from metrics); one of 400 churn cycles was dropped when the wall clock ran out; and all churn landed on one node because the generator used no spread constraint — which is not how production pods behave, but it cleanly bounds the worst case.

What a reconfiguration costs

When QoS assignments change on a node, the agent regenerates the scheduler config and restarts scx_layered; the node runs stock CFS for the gap. A deliberate churn storm (50 assignment cycles in 10 minutes → 100 restarts on one node, with 48 no-op reloads correctly skipped) measured 52.6 ms of CFS fallback per restart from log kill→spawn timestamps — 0.88% of the busiest node's time at that extreme rate. The EKS parity run showed the same blast-radius shape (all restarts confined to the churned node, zero on the others) with a different per-restart cost model (~172 ms implied); the records note the models differ, so we quote the measured GKE figure and the shape, not a cross-cloud average. The failure direction during every gap is the same as everywhere else on this page: absence of benefit, not harm.

The watchdog eject is designed behavior — and it fired

The kernel watchdog is not theoretical for us: during the DeathStarBench campaign it ejected scheduler v13 eight times under sustained saturation, each time reverting the node to CFS exactly as specified while we root-caused a per-CPU kthread starvation bug and shipped the v14 fix (zero ejects in the re-run). The full narrative — forensics, the PSI fingerprinting that identified which windows silently ran CFS, and the fix timeline — is in the service-chains article. What belongs here is the operational conclusion: the eject is the safety system. v14 also added a scheduler supervisor in the agent; a SIGKILL test against the live scheduler showed reap, status flip, and respawn in 3–5 s, closing the v13-era defect where the agent took minutes to notice its scheduler was gone.

The kill switch, and why rollback is two separate speeds

Fleet-wide rollback is one node annotation — temper.codes/safe-mode-requested — honored directly by each node's agent with no control-plane round trip, so it works even when everything above the node engine is down. Entering safe mode always succeeds (it kills the scheduler; the kernel does the rest), and the agent also auto-enters safe mode on scheduler crash. The operational point, spelled out in the runbook: enforcement rollback (milliseconds, kernel-native) is deliberately decoupled from software rollback (minutes, helm) — you never wait on an image pull to get back to stock scheduling. Safe mode also restores CFS with CPU quotas, which matters given the cpu.max disclosure.

For upgrades, the chart ships a canary mechanism: a second DaemonSet with the candidate image on labeled nodes, with hard mutual-exclusion affinity against the main one. A misbehaving candidate's blast radius is the labeled nodes, and each of them fails toward stock CFS — the same direction as every other failure on this page.

What is measured vs. what is guaranteed

Guaranteed by the kernel: atomic revert to the stock scheduler on scheduler death, BPF error, watchdog stall, or request. This is sched_ext's contract, independent of our code quality.
Measured by us: the failover blip (GKE and EKS), the soak (memory, SLO envelope, CFS-gap), the churn cost, the supervisor recovery time, and the eject class found and closed on a real workload.
Known costs, stated: ~52.6 ms of CFS per reconfiguration (debounced and batched; a deployment pattern that flaps PriorityClasses works against itself), and the cpu.max semantic difference while attached.

Raw records

docs/training-artifacts/reliability/SOAK-REPORT.md
docs/training-artifacts/reliability/failover-record.json
docs/training-artifacts/reliability/churn-record.json
docs/training-artifacts/reliability/soak-churn-record.json
docs/training-artifacts/reliability/agent-mem.csv · slo.csv
docs/training-artifacts/reliability/eks/ (failover + churn parity)
docs/training-artifacts/binpack/SAVINGS-REPORT.md (§4 reliability)
docs/training-artifacts/deathstarbench/v14-validation/REPORT.md (eject closure + supervisor kill test)
docs/security/WHITEPAPER.md (§2, the kernel contract and kill switch)

Committed benchmark records in the product repository; design partners get the full artifact tree.