About Temper
This page describes the system the way an architecture document would: what it is, how it works, what you operate, what it requires, what it does not do, and how teams adopt it.
What Temper is
Temper is a CPU scheduling and capacity platform for Kubernetes. Its core is a node agent
that replaces the Linux scheduler's default arbitration with an explicit, QoS-tiered policy,
using the kernel's sched_ext framework and the scx_layered BPF
scheduler. The practical effect is that a node can run latency-critical services and batch
work together at high utilization: the critical services keep their tail latency because the
kernel enforces their priority at every scheduling decision, and the batch work consumes
whatever is genuinely idle.
The reason this is a product and not a config file is that the policy has to be computed and re-computed from cluster state. Pods arrive and leave; their priorities and resource requests define what the scheduler config should be on each node at each moment; and the whole thing has to fail toward the stock scheduler, observably, every time. Temper is the machinery around that loop, plus the optional placement and management layers that build on it. It is not a rightsizer, an autoscaler, or a dashboard with recommendations — it is an enforcement mechanism, and everything else in the product exists to feed or observe that mechanism.
How it works
On each node, the agent watches the pods scheduled locally through the Kubernetes API.
Every pod is assigned one of five QoS tiers (Critical, High, Normal, Low, Background)
derived from pod.spec.priority — the value its PriorityClass sets —
with pods lacking a PriorityClass defaulted by their Kubernetes QoS class. From the tier
membership and the pods' actual CPU requests and limits, the agent generates a
scx_layered configuration: Critical becomes a confined, protected layer with
exclusive whole cores; High and Normal become grouped layers with preferred CPU sets; Low
and Background become open layers that run on idle capacity and are preempted when a
protected tier wakes. Layer weights, CPU ranges, and utilization bands are computed from the
requests, not hardcoded. When assignments change, the agent regenerates the config and
restarts the scheduler; the node falls back to the stock kernel scheduler for the ~52 ms gap.
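The tier derivation and the tier-to-layer mapping described above can be sketched as follows. This is a minimal illustration, not the agent's code: the priority cutoffs in `PRIORITY_FLOORS` and the fallback table `QOS_DEFAULTS` are hypothetical values chosen for the example, since this page does not state the real boundaries. Only the five tier names and the Confined / Grouped / Open split come from the text.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    NORMAL = "Normal"
    LOW = "Low"
    BACKGROUND = "Background"


# Hypothetical cutoffs on pod.spec.priority, highest first (assumption).
PRIORITY_FLOORS = [
    (1_000_000, Tier.CRITICAL),
    (100_000, Tier.HIGH),
    (10_000, Tier.NORMAL),
    (1_000, Tier.LOW),
]

# Hypothetical defaults for pods that name no PriorityClass (assumption).
QOS_DEFAULTS = {
    "Guaranteed": Tier.HIGH,
    "Burstable": Tier.NORMAL,
    "BestEffort": Tier.BACKGROUND,
}

# Layer kind per tier, as described in the text above.
LAYER_KIND = {
    Tier.CRITICAL: "Confined",
    Tier.HIGH: "Grouped",
    Tier.NORMAL: "Grouped",
    Tier.LOW: "Open",
    Tier.BACKGROUND: "Open",
}


def derive_tier(priority_class_name: Optional[str],
                priority: Optional[int],
                qos_class: str) -> Tier:
    """Map a pod to a tier from its priority, else from its QoS class."""
    if not priority_class_name or priority is None:
        # No PriorityClass: fall back to the Kubernetes QoS class.
        return QOS_DEFAULTS.get(qos_class, Tier.NORMAL)
    for floor, tier in PRIORITY_FLOORS:
        if priority >= floor:
            return tier
    return Tier.BACKGROUND
```

The lookup keys on `priorityClassName` rather than on the numeric priority alone, because Kubernetes fills in `pod.spec.priority` at admission even for pods that name no PriorityClass.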
The components, by layer:
- L0, node enforcement: temper-agent, a Rust daemon deployed as a privileged DaemonSet. It derives tiers, discovers pod cgroups, generates and manages the scx_layered process (a pinned upstream build plus two reviewable carry-patches in the repository), applies workload profiles (per-thread-group layer assignment inside a pod), publishes node annotations, and runs the always-on observation layer: per-thread placement sampling, PSI pressure monitoring, a placement linter, and bounded kernel trace capture. Measured overhead of the observation layer is 0.13% of one core.
- L1, cluster intelligence (optional, off by default): a density-aware kube-scheduler plugin that places pods using the per-tier load annotations L0 publishes, and a mutating admission webhook that scales down the CPU requests of non-critical pods in namespaces you label, recording the original value in an annotation. A small controller translates the TemperPolicy resource into safe-mode node annotations.
- L2, management plane (optional): temper-dashboard, an in-cluster web application with a hierarchy explorer, live logs and manifests, per-pod performance panels, audit-logged operator actions, a savings view that separates measured reclaim from estimated opportunity, and a multi-cluster peer registry. It consumes the same public interfaces the CLI does.
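As an illustration of the L1 webhook's behavior (scale down CPU requests of non-critical pods and record the original value), here is a minimal sketch operating on a pod manifest as a plain dictionary. The annotation key `temper.example/original-cpu-request`, the `factor` parameter, and the function names are assumptions made for this example; the page does not give the real key or the scaling rule.

```python
import copy
import json

# Hypothetical annotation key; the real key is not given on this page.
ORIGINAL_CPU_ANNOTATION = "temper.example/original-cpu-request"


def parse_cpu_millis(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ('250m', '2', '0.5') to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def scale_cpu_requests(pod: dict, factor: float, is_critical: bool) -> dict:
    """Return a copy of the pod with CPU requests scaled by `factor`.

    Critical pods are returned untouched. The original requests are
    recorded per container in an annotation so the change is reversible.
    """
    if is_critical:
        return pod
    mutated = copy.deepcopy(pod)
    originals = {}
    for container in mutated["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        if "cpu" not in requests:
            continue
        originals[container["name"]] = requests["cpu"]
        scaled = max(1, int(parse_cpu_millis(requests["cpu"]) * factor))
        requests["cpu"] = f"{scaled}m"
    if originals:
        annotations = mutated.setdefault("metadata", {}).setdefault("annotations", {})
        # Annotation values must be strings, so the mapping is JSON-encoded.
        annotations[ORIGINAL_CPU_ANNOTATION] = json.dumps(originals, sort_keys=True)
    return mutated
```

Working on a deep copy keeps the incoming object intact, which makes it straightforward to diff the two versions into the patch an admission webhook has to return.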
Each layer consumes only the one below it, and each is separately removable. Safe mode illustrates the design: the fleet-wide kill switch is a node annotation the agent honors directly, so rollback works even when everything above L0 is down.
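The safe-mode decision can be pictured as a pure function of node-local inputs, which is why it keeps working when the upper layers are down. The sketch below assumes a hypothetical annotation key, `temper.example/safe-mode`, and a hypothetical boolean result of the kernel check; neither name comes from the product.

```python
# Hypothetical annotation key, used for illustration only.
SAFE_MODE_ANNOTATION = "temper.example/safe-mode"


def desired_scheduler(node_annotations: dict, kernel_gate_ok: bool) -> str:
    """Decide what the node should run, failing toward the stock scheduler."""
    if not kernel_gate_ok:
        return "stock"
    if node_annotations.get(SAFE_MODE_ANNOTATION, "").lower() == "true":
        return "stock"
    return "scx_layered"
```

Every branch except the last returns the stock scheduler, so a missing or unreadable input never leaves the node in an enforced state by accident.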
What you operate
One Helm chart installs the agent DaemonSet, RBAC, the recommended
temper-* PriorityClasses, and optionally the dashboard, scheduler plugin,
webhook, and controller. Day to day you interact with: the dashboard (or plain
kubectl — all agent state is visible in node annotations); the
temper-cli for direct agent gRPC operations (status, QoS assignment, scheduler
reload, config inspection, trace capture); a Prometheus /metrics endpoint per
node with generated Grafana dashboards; and a JSON GET /observe snapshot per
node with the machine shape, per-layer scheduler statistics, top busy threads, and linter
verdicts. Workload profiles, when you use them, are TOML files deployed through chart
values. Everything runs in your cluster; there is no SaaS control plane and the product
makes no external network calls.
What it requires
Two hard requirements. First, node kernels ≥ 6.12 built with
CONFIG_SCHED_CLASS_EXT=y and BTF — the kernel version alone is not
sufficient, because distributions choose whether to enable the config (two ship 6.12 with it
off), so the agent verifies each node at startup and refuses loudly rather than
half-working. Second, permission to run a privileged, hostPID DaemonSet with
/sys mounts. The privilege is what loading a BPF scheduler into the kernel's
scheduling path costs; it is also used to read cgroup statistics and PSI pressure files and,
during opt-in trace captures, tracefs. This is the standard posture of node-level eBPF
agents (Falco, Datadog's system-probe, Cilium), and the security whitepaper justifies it
operation by operation. Verified platforms today: GKE Standard ≥1.36 and EKS ≥1.33
(AL2023, Bottlerocket, and EKS Auto Mode); the full matrix, including the unsupported
platforms, is in platform support.
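The kernel gate described above amounts to three independent checks. The following is a sketch of that logic under stated assumptions, not the agent's implementation: the function names and the list-of-reasons return shape are invented for the example. It takes its inputs as arguments so it can be exercised without touching the host.

```python
import re

MIN_KERNEL = (6, 12)


def parse_release(release: str) -> tuple:
    """Extract (major, minor) from a kernel release string like '6.12.9-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_gate(release: str, config_text: str, btf_present: bool) -> list:
    """Return the reasons a node fails the gate; an empty list means it passes."""
    problems = []
    # Compare as integer tuples: as strings, '6.6' would sort above '6.12'.
    if parse_release(release) < MIN_KERNEL:
        problems.append(f"kernel {release} is older than 6.12")
    enabled = {
        line.strip()
        for line in config_text.splitlines()
        if not line.lstrip().startswith("#")
    }
    if "CONFIG_SCHED_CLASS_EXT=y" not in enabled:
        problems.append("CONFIG_SCHED_CLASS_EXT is not enabled")
    if not btf_present:
        problems.append("kernel BTF is not available")
    return problems
```

On a real node the inputs would typically come from `os.uname().release`, the kernel config (commonly `/boot/config-<release>` or `/proc/config.gz`, depending on the distribution), and the presence of `/sys/kernel/btf/vmlinux`. Reporting every failed check at once, rather than stopping at the first, matches the "refuse loudly" intent.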
What it does not do
Temper does not provision nodes, resize node pools, or orchestrate spot instances —
Karpenter, Cast AI, and Cluster Autoscaler keep those jobs, and the default install
(complement mode) leaves placement entirely to them. It does not arbitrate memory or
network: enforcement is CPU scheduling, because that is where sched_ext gives kernel-level
leverage; memory limits and PSI-based monitoring behave as before. It does not enforce
Kubernetes CPU limits while attached: cpu.max is fair-class machinery that
sched_ext tasks never charge, so containment comes from Temper's layer ceilings instead
— measured at-or-below quota on the tested shapes, but not semantically identical.
That difference, its kernel mechanics, and the measurement are written up in
the CPU-limits deep dive; if strict quota
enforcement is a compliance requirement on a node, do not attach Temper to it. Finally, it
does not run where the platform forbids the mechanism: AKS currently disables sched_ext in
its kernels, and GKE Autopilot and similar fully-managed modes forbid privileged
DaemonSets. We list those as plain noes rather than working around policy.
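On the CPU-limits point above: in cgroup v2, `cpu.max` holds a quota and a period in microseconds (or the literal `max` for no limit), and a container's effective core limit is their ratio. The sketch below shows that arithmetic and a simple at-or-below-quota check of the kind the measurement implies. The `within_quota` helper and its `tolerance` parameter are hypothetical and not part of Temper.

```python
from typing import Optional


def quota_cores(cpu_max: str) -> Optional[float]:
    """Parse a cgroup v2 cpu.max value ('<quota> <period>') into cores.

    Returns None when the quota is 'max', meaning unlimited.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def within_quota(measured_cores: float, cpu_max: str, tolerance: float = 0.0) -> bool:
    """Check whether measured usage stays at or below the quota (hypothetical helper)."""
    limit = quota_cores(cpu_max)
    return limit is None or measured_cores <= limit * (1 + tolerance)
```

For example, a Kubernetes CPU limit of 2 is written as `200000 100000`: 200 ms of CPU time per 100 ms period, which is two cores.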
How people adopt it
The recommended path is incremental and observation-first. Install the chart on one node
pool with no PriorityClasses assigned: nothing changes for workloads, and the observation
layer starts showing per-thread placement and contention. Then assign
temper-critical to one latency-sensitive service and verify its tail under
your real traffic — that single field in the pod spec is the entire application-side
integration. Once the fence is proven, densify: let batch work share the protected nodes,
and optionally enable the L1 webhook and scheduler plugin to convert the safety into
measured packing. Fleets running Karpenter or Cast AI keep them throughout; Temper arbitrates
the CPU on whatever nodes those tools provision. Mixed fleets are safe by construction
(nodes that fail the kernel gate simply keep running the stock scheduler), and canary
tooling for agent upgrades ships in the chart.
Project status
Temper is in early access, working with design partners. The benchmark harness is open and every performance claim on this site traces to a committed report with its caveats attached — the deep dives are the readable form of that evidence, including the negatives. The repository is not yet public; design partners get source access, and every release ships signed images, an SBOM, and the exact carry-patches applied to the pinned upstream scheduler. SOC 2 Type II is in progress, not complete. If you want to evaluate it, the quickstart is about fifteen minutes on a real cluster, and the Community tier is the full node engine.