About Temper

This page describes the system the way an architecture document would: what it is, how it works, what you operate, what it requires, what it does not do, and how teams adopt it.

What Temper is

Temper is a CPU scheduling and capacity platform for Kubernetes. Its core is a node agent that replaces the Linux scheduler's default arbitration with an explicit, QoS-tiered policy, using the kernel's sched_ext framework and the scx_layered BPF scheduler. The practical effect is that a node can run latency-critical services and batch work together at high utilization: the critical services keep their tail latency because the kernel enforces their priority at every scheduling decision, and the batch work consumes whatever is genuinely idle.

The reason this is a product and not a config file is that the policy has to be computed and re-computed from cluster state. Pods arrive and leave; their priorities and resource requests define what the scheduler config should be on each node at each moment; and the whole thing has to fail toward the stock scheduler, observably, every time. Temper is the machinery around that loop, plus the optional placement and management layers that build on it. It is not a rightsizer, an autoscaler, or a dashboard with recommendations — it is an enforcement mechanism, and everything else in the product exists to feed or observe that mechanism.

How it works

On each node, the agent uses the Kubernetes API to watch the pods scheduled to that node. Every pod is assigned one of five QoS tiers (Critical, High, Normal, Low, Background) derived from pod.spec.priority, the value its PriorityClass sets; pods without a PriorityClass get a default tier from their Kubernetes QoS class. From the tier membership and the pods' actual CPU requests and limits, the agent generates a scx_layered configuration: Critical becomes a confined, protected layer with exclusive whole cores; High and Normal become grouped layers with preferred CPU sets; Low and Background become open layers that run on idle capacity and are preempted when a protected tier wakes. Layer weights, CPU ranges, and utilization bands are computed from the requests, not hardcoded. When assignments change, the agent regenerates the config and restarts the scheduler; during the ~52 ms gap the node runs the kernel's stock scheduler (EEVDF on the 6.12+ kernels Temper requires).
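The tier derivation and layer mapping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the agent's code (the agent is a Rust daemon): the priority cutoffs, the QoS-class fallback table, the rounding rule for exclusive cores, and the function names are all hypothetical, because this page does not state the real values. Only the five tier names and the confined/grouped/open layer kinds come from the text.

```python
from enum import IntEnum
from typing import Optional, Sequence


class Tier(IntEnum):
    CRITICAL = 0
    HIGH = 1
    NORMAL = 2
    LOW = 3
    BACKGROUND = 4


# ASSUMPTION: illustrative cutoffs on pod.spec.priority, checked highest first.
PRIORITY_CUTOFFS = [
    (1_000_000, Tier.CRITICAL),
    (100_000, Tier.HIGH),
    (0, Tier.NORMAL),
    (-1_000, Tier.LOW),
]

# ASSUMPTION: illustrative fallback for pods that have no PriorityClass.
QOS_FALLBACK = {
    "Guaranteed": Tier.HIGH,
    "Burstable": Tier.NORMAL,
    "BestEffort": Tier.BACKGROUND,
}

# Layer kind per tier, as described in the text.
LAYER_KIND = {
    Tier.CRITICAL: "confined",
    Tier.HIGH: "grouped",
    Tier.NORMAL: "grouped",
    Tier.LOW: "open",
    Tier.BACKGROUND: "open",
}


def derive_tier(priority: Optional[int],
                priority_class_name: Optional[str],
                qos_class: str) -> Tier:
    """Map a pod to a tier: by priority if it has a PriorityClass, else by QoS class."""
    if not priority_class_name or priority is None:
        return QOS_FALLBACK[qos_class]
    for cutoff, tier in PRIORITY_CUTOFFS:
        if priority >= cutoff:
            return tier
    return Tier.BACKGROUND


def exclusive_cores(cpu_requests_millicores: Sequence[int]) -> int:
    """Whole cores covering the summed Critical requests (ASSUMPTION: round up)."""
    total = sum(cpu_requests_millicores)
    return -(-total // 1000)  # ceiling division on integers
```

Under these hypothetical cutoffs, derive_tier(2_000_000, "temper-critical", "Burstable") yields Tier.CRITICAL, and Critical requests of 500m, 700m, and 1000m round up to three exclusive cores. Keeping the mapping a pure function of pod fields is what lets the config be recomputed whenever assignments change.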

The components, by layer:

L0, node enforcement: temper-agent, a Rust daemon deployed as a privileged DaemonSet. It derives tiers, discovers pod cgroups, generates the scx_layered configuration and manages the scheduler process (a pinned upstream build plus two reviewable carry-patches in the repository), applies workload profiles (per-thread-group layer assignment inside a pod), publishes node annotations, and runs the always-on observation layer: per-thread placement sampling, PSI pressure monitoring, a placement linter, and bounded kernel trace capture. Measured overhead of the observation layer is 0.13% of one core.
L1, cluster intelligence (optional, off by default): a density-aware kube-scheduler plugin that places pods using the per-tier load annotations L0 publishes, and a mutating admission webhook that scales down the CPU requests of non-critical pods in namespaces you label, recording the original value in an annotation. A small controller translates the TemperPolicy resource into safe-mode node annotations.
L2, management plane (optional): temper-dashboard, an in-cluster web application with a hierarchy explorer, live logs and manifests, per-pod performance panels, audit-logged operator actions, a savings view that separates measured reclaim from estimated opportunity, and a multi-cluster peer registry. It consumes the same public interfaces the CLI does.
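The request-scaling step of the L1 webhook can be sketched as a pure function over a pod manifest. This is a minimal illustration under assumptions: the annotation key, the scaling factor, the per-container JSON encoding of the original values, and the function names are all hypothetical. The text states only that non-critical pods in labeled namespaces have their CPU requests scaled down and the original value recorded in an annotation. Namespace selection is assumed to have happened before this function is called.

```python
import copy
import json

# ASSUMPTION: hypothetical annotation key, not Temper's real one.
ORIGINAL_ANNOTATION = "temper.example/original-cpu-requests"


def parse_cpu_millicores(quantity: str) -> int:
    """Parse the common Kubernetes CPU forms: "250m", "2", "0.5"."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def scale_cpu_requests(pod: dict, factor: float, is_critical: bool) -> dict:
    """Return a copy of the pod with CPU requests scaled by `factor`.

    Critical pods are returned untouched. Original request strings are
    recorded per container in a single JSON annotation.
    """
    if is_critical:
        return pod
    patched = copy.deepcopy(pod)
    originals = {}
    for container in patched["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        if "cpu" not in requests:
            continue
        original = requests["cpu"]
        scaled = max(1, int(parse_cpu_millicores(original) * factor))
        requests["cpu"] = f"{scaled}m"
        originals[container["name"]] = original
    if originals:
        annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
        annotations[ORIGINAL_ANNOTATION] = json.dumps(originals, sort_keys=True)
    return patched
```

Recording the original value next to the mutated one keeps the change reversible and auditable from the pod object alone, which matches the property the text describes.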

Each layer consumes only the one below it, and each is separately removable. Safe mode illustrates the design: the fleet-wide kill switch is a node annotation the agent honors directly, so rollback works even when everything above L0 is down.
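The kill-switch property can be illustrated with a small decision function. This is a sketch under assumptions: the annotation key temper.example/safe-mode and the function name are hypothetical, and the real agent may weigh more inputs. What it shows is that the decision needs only the node's own annotations and the local kernel check, so nothing above L0 sits on the rollback path.

```python
from typing import Mapping

# ASSUMPTION: hypothetical annotation key, not Temper's real one.
SAFE_MODE_ANNOTATION = "temper.example/safe-mode"


def desired_scheduler(node_annotations: Mapping[str, str], kernel_gate_ok: bool) -> str:
    """Decide what should be running on this node.

    Returns "stock" when the node must stay on (or return to) the kernel's
    default scheduler, and "scx_layered" otherwise. When a sched_ext
    scheduler is unloaded, the kernel hands its tasks back to the default
    scheduler, so "stock" only requires stopping the BPF scheduler.
    """
    if not kernel_gate_ok:
        return "stock"
    if node_annotations.get(SAFE_MODE_ANNOTATION, "").strip().lower() == "true":
        return "stock"
    return "scx_layered"
```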

What you operate

One Helm chart installs the agent DaemonSet, RBAC, the recommended temper-* PriorityClasses, and optionally the dashboard, scheduler plugin, webhook, and controller. Day to day you interact with: the dashboard (or plain kubectl — all agent state is visible in node annotations); the temper-cli for direct agent gRPC operations (status, QoS assignment, scheduler reload, config inspection, trace capture); a Prometheus /metrics endpoint per node with generated Grafana dashboards; and a JSON GET /observe snapshot per node with the machine shape, per-layer scheduler statistics, top busy threads, and linter verdicts. Workload profiles, when you use them, are TOML files deployed through chart values. Everything runs in your cluster; there is no SaaS control plane and the product makes no external network calls.

What it requires

Two hard requirements. First, node kernels ≥ 6.12 built with CONFIG_SCHED_CLASS_EXT=y and BTF — the kernel version alone is not sufficient, because distributions choose whether to enable the config (two ship 6.12 with it off), so the agent verifies each node at startup and refuses loudly rather than half-working. Second, permission to run a privileged, hostPID DaemonSet with /sys mounts. The privilege is what loading a BPF scheduler into the kernel's scheduling path costs; it is also used to read cgroup statistics and PSI pressure files and, during opt-in trace captures, tracefs. This is the standard posture of node-level eBPF agents (Falco, Datadog's system-probe, Cilium), and the security whitepaper justifies it operation by operation. Verified platforms today: GKE Standard ≥1.36 and EKS ≥1.33 (AL2023, Bottlerocket, and EKS Auto Mode); the full matrix, including the unsupported platforms, is in platform support.
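The first requirement, the per-node kernel gate, can be sketched as a pure check over three inputs. This is an illustration, not the agent's implementation: the function names and the failure messages are hypothetical. It shows why the version alone is not sufficient, since a 6.12 kernel whose config leaves CONFIG_SCHED_CLASS_EXT unset still fails.

```python
import re
from typing import List, Tuple

MIN_KERNEL = (6, 12)


def parse_kernel_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as "6.12.8-200.fc41.x86_64"."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_gate(release: str, config_text: str, has_btf: bool) -> List[str]:
    """Return the reasons this node fails the gate; an empty list means it passes."""
    failures = []
    if parse_kernel_release(release) < MIN_KERNEL:
        failures.append(f"kernel {release} is older than 6.12")
    # A disabled option appears as "# CONFIG_SCHED_CLASS_EXT is not set",
    # so match the exact "=y" line rather than the option name.
    enabled = any(
        line.strip() == "CONFIG_SCHED_CLASS_EXT=y"
        for line in config_text.splitlines()
    )
    if not enabled:
        failures.append("CONFIG_SCHED_CLASS_EXT is not enabled")
    if not has_btf:
        failures.append("kernel BTF is not available")
    return failures
```

On a real node the inputs would typically come from os.uname().release, the kernel config in /proc/config.gz or /boot/config-<release> where the distribution exposes one, and the presence of /sys/kernel/btf/vmlinux. Returning every failure reason, rather than stopping at the first, is one way to make the refusal loud and specific.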

What it does not do

Temper does not provision nodes, resize node pools, or orchestrate spot instances — Karpenter, Cast AI, and Cluster Autoscaler keep those jobs, and the default install (complement mode) leaves placement entirely to them. It does not arbitrate memory or network: enforcement is CPU scheduling, because that is where sched_ext gives kernel-level leverage; memory limits and PSI-based monitoring behave as before. It does not enforce Kubernetes CPU limits while attached: cpu.max is fair-class machinery that sched_ext tasks never charge, so containment comes from Temper's layer ceilings instead — measured at-or-below quota on the tested shapes, but not semantically identical. That difference, its kernel mechanics, and the measurement are written up in the CPU-limits deep dive; if strict quota enforcement is a compliance requirement on a node, do not attach Temper to it. Finally, it does not run where the platform forbids the mechanism: AKS currently disables sched_ext in its kernels, and GKE Autopilot and similar fully-managed modes forbid privileged DaemonSets. We list those as plain noes rather than working around policy.
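The "at-or-below quota" comparison in the CPU-limits point can be made concrete with a short sketch. The cgroup v2 cpu.max format ("<quota> <period>" in microseconds, with "max" meaning unlimited) is standard; everything else here is an assumption for illustration: the function names are hypothetical, and the sketch takes the consumed CPU time as an input without asserting which counter the real measurement used.

```python
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse a cgroup v2 cpu.max value into a limit in cores.

    "50000 100000" means 50 ms of CPU per 100 ms period, i.e. 0.5 cores.
    Returns None when the quota is "max" (no limit configured).
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def used_cores(cpu_usec_delta: int, wall_usec: int) -> float:
    """Average cores consumed over a window, from CPU time and wall time."""
    return cpu_usec_delta / wall_usec


def at_or_below_quota(cpu_usec_delta: int, wall_usec: int, cpu_max_text: str) -> bool:
    """True when measured usage over the window does not exceed the configured quota."""
    limit = parse_cpu_max(cpu_max_text)
    if limit is None:
        return True
    return used_cores(cpu_usec_delta, wall_usec) <= limit
```

A check like this compares averages over a window. It says nothing about per-period throttling behavior, which is one reason containment by layer ceilings is described as not semantically identical to cpu.max.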

How people adopt it

The recommended path is incremental and observation-first. Install the chart on one node pool with no PriorityClasses assigned: nothing changes for workloads, and the observation layer starts showing per-thread placement and contention. Then assign temper-critical to one latency-sensitive service and verify its tail latency under your real traffic; that single field in the pod spec (priorityClassName) is the entire application-side integration. Once the fence is proven, densify: let batch work share the protected nodes, and optionally enable the L1 webhook and scheduler plugin to convert the safety into measured packing. Fleets running Karpenter or Cast AI keep them throughout; Temper arbitrates the CPU on whatever nodes those tools provision. Mixed fleets are safe by construction: nodes that fail the kernel gate simply run the stock kernel scheduler. Canary tooling for agent upgrades ships in the chart.

Project status

Temper is in early access, working with design partners. The benchmark harness is open and every performance claim on this site traces to a committed report with its caveats attached — the deep dives are the readable form of that evidence, including the negatives. The repository is not yet public; design partners get source access, and every release ships signed images, an SBOM, and the exact carry-patches applied to the pinned upstream scheduler. SOC 2 Type II is in progress, not complete. If you want to evaluate it, the quickstart is about fifteen minutes on a real cluster, and the Community tier is the full node engine.