Safety & rollback: the kernel takes back over
The question that decides whether kernel-level enforcement is adoptable is not the upside but the failure modes. This page is the complete, honest list, with measurements.
infera); the commands below are what works today. A rename migration is planned.
The fail-safe is the kernel’s contract
sched_ext was designed so that a BPF scheduler cannot take the system down
with it: if the scheduler misbehaves, stalls, crashes, or detaches for any reason, the
kernel ejects it and atomically resumes scheduling with the stock scheduler. This is not
Temper code — it is the kernel feature Temper is built on. The consequence is structural:
the worst case is the scheduler you already run today, per node, never per cluster.
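One way to observe this fallback on a node is to read the state files that sched_ext exposes in sysfs. The sketch below assumes the upstream sysfs layout (`/sys/kernel/sched_ext/state` and `/sys/kernel/sched_ext/root/ops`). The function name `scx_status` and its optional directory argument are illustrative and not part of Temper.

```shell
#!/bin/sh
# Report whether a sched_ext scheduler is currently attached on this node.
# Assumption: the kernel exposes the upstream sched_ext sysfs files.
# The optional first argument overrides the sysfs directory (handy for dry runs).
scx_status() {
    dir="${1:-/sys/kernel/sched_ext}"
    if [ ! -r "$dir/state" ]; then
        echo "sched_ext not available"
        return 0
    fi
    state=$(cat "$dir/state")
    if [ "$state" = "enabled" ]; then
        # root/ops holds the name of the loaded BPF scheduler
        ops=$(cat "$dir/root/ops" 2>/dev/null || echo unknown)
        echo "attached: $ops"
    else
        echo "stock scheduler (sched_ext state: $state)"
    fi
}
```

Running `scx_status` on a node before and after stopping the agent should show the state move from attached to the stock scheduler, if the kernel behaves as described above.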
Measured failover
We force-killed the node agent mid-benchmark, under load:
| Moment | memcached p99 |
|---|---|
| Before the kill (Temper attached) | 0.607 ms |
| During the kill (kernel reverts to CFS) | 0.639 ms |
| After recovery (agent re-attached) | back to baseline in seconds |
A 0.03 ms blip and no blackout; the replacement agent (it runs as a DaemonSet) was Ready in 15 s. An 8-hour soak ran clean. Full data is on the benchmarks page.
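For scale, the difference between the two p99 figures in the table can be worked out directly. This is plain arithmetic on the published numbers; the helper name `p99_delta` is illustrative.

```shell
# Absolute and relative change between two p99 latencies, both in milliseconds.
p99_delta() {
    awk -v before="$1" -v during="$2" 'BEGIN {
        delta = during - before
        printf "delta: %.3f ms (%.1f%% of baseline)\n", delta, delta / before * 100
    }'
}

p99_delta 0.607 0.639   # prints: delta: 0.032 ms (5.3% of baseline)
```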
The kill switch
Fleet-wide rollback is one annotation — no helm operation, no control-plane dependency, honored directly by each node’s agent:
# stand the scheduler down everywhere; pods run stock CFS
kubectl annotate node --all infera.io/safe-mode-requested=true
# re-engage
kubectl annotate node --all infera.io/safe-mode-requested-
Safe mode can also be targeted at single nodes, toggled from the
dashboard (audit-logged), or driven by the optional controller via
an InferaPolicy resource. Entering safe mode always succeeds: it kills the
scheduler. Exiting safe mode regenerates the configuration and re-attaches.
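Targeting a single node could look like the sketch below. It assumes the per-node path honors the same `infera.io/safe-mode-requested` annotation as the fleet-wide command, which this page does not spell out. The function names and the node name in the usage note are hypothetical.

```shell
# Request safe mode on one node.
# Assumption: the agent reads the same annotation key as the fleet-wide command.
# --overwrite lets the command succeed if the annotation is already present.
safe_mode_on() {
    kubectl annotate node "$1" infera.io/safe-mode-requested=true --overwrite
}

# Remove the annotation from one node (the trailing "-" deletes the key).
safe_mode_off() {
    kubectl annotate node "$1" infera.io/safe-mode-requested-
}

# List every node with its current annotation value (<none> when unset).
safe_mode_list() {
    kubectl get nodes -o custom-columns='NODE:.metadata.name,SAFE_MODE:.metadata.annotations.infera\.io/safe-mode-requested'
}
```

For example, `safe_mode_on worker-1` followed by `safe_mode_list` would show which nodes currently carry the request.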
Reconfiguration churn cost
When QoS assignments change on a node (a pod joins or leaves a tier), the agent regenerates the scheduler configuration and restarts the kernel scheduler. The measured cost is a ~52 ms window per reconfiguration during which the node runs stock CFS — node-local, bounded, and in the same safe failure direction as everything else here: absence of benefit, not harm. Pod churn is debounced and batched so a busy node does not thrash.
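To put the ~52 ms window in proportion, the sketch below estimates how much of each hour a node would spend on the stock scheduler at a given reconfiguration rate. The 52 ms figure is the measurement quoted above. The rate of 60 reconfigurations per hour is a hypothetical input, not a measured one, and `stock_time_share` is an illustrative name.

```shell
# Share of wall-clock time spent on the stock scheduler due to reconfigurations.
# $1 = reconfigurations per hour (hypothetical input, after debouncing).
stock_time_share() {
    awk -v n="$1" 'BEGIN {
        window_s = 0.052                 # ~52 ms per reconfiguration, from the text
        seconds = n * window_s           # seconds per hour on the stock scheduler
        printf "%.2f s/hour (%.3f%%)\n", seconds, seconds / 3600 * 100
    }'
}

stock_time_share 60   # prints: 3.12 s/hour (0.087%)
```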
The cpu.max disclosure
Stated plainly, because it is the one behavioral difference you must know:
while Temper’s scheduler is attached, cgroup cpu.max CPU quotas are not
enforced by the kernel. This is a property of sched_ext scheduling, not a Temper choice.
Containment of greedy workloads comes from Temper’s layer ceilings instead — which
is what the benchmarks exercise — and quota-derived layer ceilings are on the roadmap to
close the semantic gap. Two mitigations are unconditional: memory limits are unaffected (only
CPU quota semantics change), and the kill switch restores CFS with quotas instantly.
If strict CPU quota enforcement is a compliance requirement for a node, do not attach Temper
to that node — mixed fleets are fully supported.
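Before attaching Temper to a node, it may help to see which cgroups there currently rely on a CPU quota. The sketch below scans for `cpu.max` files whose quota is not `max`. It assumes cgroup v2 mounted at `/sys/fs/cgroup`; the function name `quota_cgroups` is illustrative and not part of Temper.

```shell
# List cgroups under a root that have a finite cpu.max quota.
# In cgroup v2, cpu.max holds "<quota> <period>" in microseconds,
# and a quota of "max" means no limit.
# Assumption: cgroup v2 at /sys/fs/cgroup; pass another root to override.
quota_cgroups() {
    root="${1:-/sys/fs/cgroup}"
    find "$root" -name cpu.max -type f 2>/dev/null | sort | while IFS= read -r f; do
        # Read the two fields; skip files that cannot be read or are empty.
        read -r quota period < "$f" || [ -n "$quota" ] || continue
        if [ "$quota" != "max" ]; then
            echo "$(dirname "$f") quota=${quota}us period=${period}us"
        fi
    done
}
```

Each line of output names a cgroup whose quota would stop being enforced while the scheduler is attached, which makes it a reasonable checklist for deciding whether a node belongs in the Temper part of a mixed fleet.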
Privileged DaemonSet posture
Loading a kernel scheduler requires privileged mode, hostPID, and /sys access,
the standard posture of node agents like Falco or Datadog. What bounds it: the agent
serves only in-cluster endpoints, executes no remote code, makes zero external calls, and
modifies only its own scheduler process and node annotations. Every permission is justified
line by line in the security whitepaper. Full posture, supply chain, and disclosure policy:
security & trust.