Kubernetes capacity platform

CPU scheduling, enforced inside the kernel.

Temper is a Kubernetes capacity platform that replaces best-effort CFS arbitration with kernel-enforced QoS (Linux sched_ext). Latency-critical services keep their p99 while batch work soaks up the idle cycles — so you pack nodes tight without losing tail latency. Like tempered steel: stronger and more flexible at once — leaner, never brittle.

One helm install. Kernel-native rollback — kill the agent and the node reverts to the stock scheduler instantly.

[Chart: memcached p99 (ms) vs. number of batch fillers on 3 nodes (0 to 18), CFS vs. Temper. Multi-node GKE, batch-filler ladder, nodes at 0.81–0.92 utilization. CFS rises to 1.52 ms (3.1×); Temper stays flat at 0.48 ms.]

Measured live, single run per arm; the full write-up with methodology and caveats is the sideloading deep dive. source: docs/training-artifacts/binpack/REPORT.md

  • −88% memcached p99 at the knee vs. CFS, heavy operating point (deep dive: sideloading)
  • 9.4× → ~1.5× end-to-end p99 growth under load, 19-service DeathStarBench (deep dive: service chains)
  • +72% pods placed at equal SLO on the same nodes with overcommit (deep dive: sideloading)
  • <1% CPU overhead for always-on observation, measured at 0.13% of one core (docs: observability)

Built on Linux sched_ext · Runs on GKE & EKS today · Zero external calls, fully in-cluster · Kubernetes-native inputs, no CRDs

01 The problem

Every Kubernetes cost tool operates above the kernel.

Rightsizers predict usage. Placement engines move pods. Autoscalers resize fleets. All of it is a prediction — and the moment two pods share a node, the prediction is over and CFS arbitration begins.

The Linux Completely Fair Scheduler has no notion of which pod’s p99 matters. Pack a node tight and CFS decides — microsecond by microsecond — who eats the latency. A batch task that becomes runnable can take a timeslice from your revenue service at exactly the wrong moment, and that one delayed request is the tail spike your SLO dashboard shows.

This is why every cost-optimization product keeps its savings engine conservative: the utilization it can safely reach is capped by a scheduler it does not control. Temper occupies the layer none of them do — CPU arbitration inside the kernel’s scheduling path — and makes dense packing safe by enforcement instead of prediction.

SaaS: Rightsizers · FinOps platforms. Predict usage, trim requests & limits. Container-granular.
k8s: Placement · autoscalers · bin-packing. Move pods, resize fleets. Pack by requests, blind to on-node contention.
Kernel boundary: every other vendor stops here.
L0: Temper, scx_layered (sched_ext). CPU arbitration inside the kernel scheduling path. Fence, loan, preempt on wake.
CFS: Stock Linux scheduler. What every co-located pod inherits without Temper, and the automatic fallback if our agent dies.

02 The platform

Five capabilities. One helm chart.

The node engine is the core; everything else is optional and individually switchable. Install at any depth, from a single protected node to the full multi-cluster management plane.

a · node engine (L0)

Enforce QoS at the kernel scheduler

A node agent maps every pod to one of five QoS tiers — derived from standard Kubernetes PriorityClasses and resource requests, no CRDs and no app changes — and drives scx_layered, a Linux sched_ext scheduler, with a config computed from what your pods actually request. Critical tiers get fenced CPU; batch tiers get whatever is idle, and get preempted the instant a protected workload wakes.

  • Five tiers (Critical → Background) from pod.spec.priority — Kubernetes-native inputs only
  • Layer weights, CPU ranges, and utilization bands computed from real resource requests
  • Fail-safe by construction: agent death = instant kernel revert to the stock scheduler
[Diagram: one node, 8 CPUs, layers computed from pod requests. Critical: confined. High / Normal: grouped. Low / Background: open. Idle cycles are loaned to batch, with a preempt-kick the instant a protected pod wakes. Tiers: temper-critical / -high / -normal / -low / -background PriorityClasses.]
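As a rough illustration of the mapping described above, the sketch below assigns a tier from a pod's numeric priority and sizes the Critical fence from summed CPU requests. The cutoff values, the function names, and the round-up rule are assumptions made for this example; they are not Temper's actual configuration logic.

```python
import math

# Hypothetical priority cutoffs: the real PriorityClass values are not given here.
TIER_CUTOFFS = [
    (1_000_000, "critical"),
    (100_000, "high"),
    (0, "normal"),
    (-100_000, "low"),
]


def tier_for_priority(priority: int) -> str:
    """Return the first tier whose cutoff the priority meets, else background."""
    for cutoff, tier in TIER_CUTOFFS:
        if priority >= cutoff:
            return tier
    return "background"


def fenced_cpus(pods: list, node_cpus: int) -> int:
    """Whole CPUs to fence for Critical pods: summed requests, rounded up."""
    millicores = sum(
        p["cpu_request_m"] for p in pods
        if tier_for_priority(p["priority"]) == "critical"
    )
    return min(node_cpus, math.ceil(millicores / 1000))


pods = [
    {"name": "api", "priority": 1_000_000, "cpu_request_m": 2500},
    {"name": "web", "priority": 0, "cpu_request_m": 1000},
    {"name": "batch", "priority": -500_000, "cpu_request_m": 4000},
]
print(fenced_cpus(pods, node_cpus=8))  # 3
```

Only the Critical pod's 2500m request counts toward the fence here, so three of the eight CPUs would be reserved and the rest stay available to the other tiers.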

b · node engine (L0)

Tune scheduling per thread group, inside one pod

A pod is not one uniform workload — a database has connection threads, I/O threads, and background purge threads with completely different needs. Workload Profiles give each thread group its own scheduling treatment inside a single pod: exclusive cores for the hot path, latency treatment for wake chains, yield for housekeeping. No product operating above the kernel can see thread structure, let alone schedule on it.

  • Builtin profiles for common shapes (e.g. PyTorch dataloaders, MySQL/InnoDB), plus file-based custom profiles
  • Auto-detection by container image or a single pod annotation
  • Training mode measures your workload and synthesizes a profile automatically
[Diagram: one pod running MySQL. Connection threads ×16: exclusive cores. InnoDB I/O threads: latency treatment. Purge / background threads: yield to others.]
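One way to picture a Workload Profile is as an ordered list of thread-name patterns, each paired with a treatment. The sketch below follows that idea; the patterns, thread names, and treatment labels are hypothetical and do not reproduce a builtin Temper profile or its file format.

```python
import re

# Hypothetical profile: thread-name patterns and treatment labels are
# illustrative, not a builtin Temper profile.
MYSQL_PROFILE = [
    (re.compile(r"^conn"), "exclusive"),
    (re.compile(r"^io"), "latency"),
    (re.compile(r"^(purge|bg)"), "yield"),
]


def treatment_for(thread_name: str, profile=MYSQL_PROFILE, default="default") -> str:
    """Return the treatment of the first pattern that matches the thread name."""
    for pattern, treatment in profile:
        if pattern.search(thread_name):
            return treatment
    return default


def group_threads(thread_names):
    """Bucket thread names by treatment, preserving input order."""
    groups = {}
    for name in thread_names:
        groups.setdefault(treatment_for(name), []).append(name)
    return groups


threads = ["conn-1", "conn-2", "io-read-0", "purge-0", "signal"]
print(group_threads(threads))
```

First-match-wins ordering keeps the profile easy to reason about: a thread that matches no pattern falls through to the default treatment instead of being scheduled specially by accident.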

c · cluster intelligence (L1)

Pack more pods without breaking SLOs

Kubernetes bin-packs by declared requests — which are usually padded, because nobody trusts CFS with a tight node. With enforcement underneath, Temper’s optional placement layer packs by what protection capacity actually exists: a density-aware scheduler plugin reads per-tier load from node annotations, and an opt-in admission webhook scales down the CPU requests of non-critical pods so the bin-packer fits more of them. Never limits, never the Critical tier, always reversible.

  • Opt-in per namespace via a label; every mutation annotated with the original value
  • Packing and consolidation results measured live — deep dive: sideloading
  • Complement mode: one helm flag stands this layer down and Karpenter / Cast AI keep placement
[Diagram: packing by declared requests leaves padded headroom (wasted); packing with enforcement keeps Critical fenced at the CPU and fits more pods at the same SLO (measured). The webhook scales requests only, never limits and never Critical, and keeps the original value in an annotation.]
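The webhook rules above (scale CPU requests only, leave limits and the Critical tier alone, record the original value) can be sketched as a pure function over a pod manifest. The annotation key, the 0.5 scale factor, and the function names are placeholders invented for this example, not Temper's real ones.

```python
import copy

# Hypothetical annotation key and scale factor, chosen for illustration only.
ORIGINAL_KEY = "example.com/original-cpu-request"


def parse_cpu_millis(quantity: str) -> int:
    """Parse a Kubernetes CPU quantity such as '500m' or '2' into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def scale_requests(pod: dict, tier: str, factor: float = 0.5) -> dict:
    """Return a copy of the pod with CPU requests scaled; limits are untouched."""
    if tier == "critical":
        return pod  # never mutate the Critical tier
    mutated = copy.deepcopy(pod)
    originals = []
    for container in mutated["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        if "cpu" not in requests:
            continue
        original = requests["cpu"]
        scaled = max(1, round(parse_cpu_millis(original) * factor))
        requests["cpu"] = f"{scaled}m"
        originals.append(f"{container['name']}={original}")
    if originals:
        annotations = mutated.setdefault("metadata", {}).setdefault("annotations", {})
        annotations[ORIGINAL_KEY] = ",".join(originals)
    return mutated


pod = {
    "metadata": {"name": "batch-2"},
    "spec": {"containers": [{
        "name": "worker",
        "resources": {"requests": {"cpu": "2"}, "limits": {"cpu": "4"}},
    }]},
}
out = scale_requests(pod, tier="low")
print(out["spec"]["containers"][0]["resources"])
print(out["metadata"]["annotations"])
```

Because the original request is written into an annotation and the input manifest is never modified, reverting is a matter of reading the annotation back.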

d · management plane (L2)

Run the fleet from one in-cluster plane

A hierarchy explorer walks cluster → node → pod → container with live updates: logs, manifests with revision diffs, per-pod performance panels, and scheduling detail down to the layer a pod landed in. Operator actions — cordon, drain, rollout restart, safe mode, trace download — are role-gated and audit-logged. A savings view splits realized reclaim from identified opportunity, priced per machine type.

  • Viewer / operator / admin roles, named tokens, audit export
  • Multi-cluster hub via a peer registry — per-cluster data planes stay self-contained, air-gap friendly
  • Versioned REST API (/api/v1, OpenAPI); the built-in UI uses the same public API
[Diagram: management plane UI. Explorer tree: cluster → node-a (pod: api-7f, pod: batch-2), node-b; workload kinds deploy · sts · ds · job. Panels: CPU / runqueue / PSI; live logs (follow); savings (realized: measured reclaim; identified: rightsizing slack). Actions: cordon · drain · safe-mode · scale, audit-logged.]

e · observability

See threads, not container averages

An always-on observation layer samples per-pod, per-thread placement and runqueue telemetry at under 1% CPU overhead — measured, not promised. A placement linter continuously checks scheduling invariants, and on-demand kernel trace capture gives you bounded perfetto traces without installing anything extra. The same thread-level data feeds a rightsizer that sees what container averages hide: one hot thread that needs an exclusive core.

  • Prometheus /metrics, a JSON /observe snapshot, and generated Grafana dashboards
  • Placement linter with invariant checks exported as metrics
  • Thread-aware rightsizing recommendations — a class above container-average rightsizers
[Diagram: /metrics (Prometheus + Grafana), /observe (thread-level snapshot), trace capture (bounded kernel traces). Placement linter invariants, all passing: smt_collision, protected_fallback, open_reserve, layer_mismatch. Observation overhead, measured: 0.13% of one core, always on.]
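To make the idea of a placement invariant concrete, here is a small sketch of what a check like smt_collision might look for. The interpretation is an assumption: it treats a collision as a protected-tier thread and a batch-tier thread running on sibling hyperthreads of the same physical core. The data shapes and names are likewise hypothetical.

```python
# Assumed meaning of "smt_collision": a thread from a protected tier and a
# thread from a batch tier sit on sibling hyperthreads of one physical core.
PROTECTED = {"critical"}
BATCH = {"low", "background"}


def smt_collisions(placements, siblings):
    """Return sorted (protected_cpu, batch_cpu) pairs that share a core.

    placements: dict mapping CPU id -> tier of the thread running there.
    siblings:   dict mapping CPU id -> its SMT sibling CPU id.
    """
    found = []
    for cpu, tier in placements.items():
        sibling = siblings.get(cpu)
        if tier in PROTECTED and placements.get(sibling) in BATCH:
            found.append((cpu, sibling))
    return sorted(found)


siblings = {0: 4, 4: 0, 1: 5, 5: 1}
placements = {0: "critical", 4: "background", 1: "critical", 5: "normal"}
print(smt_collisions(placements, siblings))  # [(0, 4)]
```

A count of such violations is the kind of value that could be exported as a metric and alerted on.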

03 The engine

Three layers. Each consumes only the one below.

L0 is a helm-installed DaemonSet and the whole story works with it alone. L1 and L2 are optional and individually switchable — which is why running under Karpenter or Cast AI and running standalone are the same codebase.

[Diagram: layer stack, top to bottom. L2 · Management plane: explorer · actions · savings · thread-aware rightsizer. L1 · Cluster intelligence: density-aware scheduler plugin · overcommit webhook. Kernel boundary. L0 · Node enforcement: scx_layered (sched_ext), a BPF scheduler in the kernel; fence critical layers · loan idle cycles · preempt-kick on wake. CFS · Fail-safe: kill the agent and the node reverts to the stock scheduler (measured, no blackout).]

Inside the scheduler, not above it. Every other capacity product predicts contention and hopes. Temper stands in the kernel’s CPU scheduling path and arbitrates it — the difference between suggesting who should run and deciding who runs.

Thread-group granularity nobody else has. Container-level tools see one number per pod. Temper schedules the threads inside a pod differently — exclusive cores for a hot loop, latency treatment for a wake chain — because at the kernel layer, threads are what actually exist.

Fail-safe is the kernel’s contract, not ours. When the BPF scheduler detaches for any reason — crash, kill, upgrade — the kernel atomically reverts to the stock scheduler. The worst case is the scheduler you already run today. Read the architecture →
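Whether a node is currently on a BPF scheduler or back on the stock one can be read from the sysfs files that upstream sched_ext exposes. The sketch below is a minimal reader for them; the helper name and the "unavailable" fallback are choices made for this example, not part of Temper.

```python
from pathlib import Path

# Upstream sched_ext exposes its status under this sysfs directory; the
# helper name and the "unavailable" fallback are choices made for this sketch.
SCHED_EXT_SYSFS = Path("/sys/kernel/sched_ext")


def sched_ext_status(root: Path = SCHED_EXT_SYSFS) -> dict:
    """Report whether a BPF scheduler is attached, and which one if so."""
    state_file = root / "state"
    if not state_file.exists():
        return {"state": "unavailable", "scheduler": None}
    state = state_file.read_text().strip()
    ops_file = root / "root" / "ops"
    scheduler = ops_file.read_text().strip() if ops_file.exists() else None
    return {"state": state, "scheduler": scheduler}


print(sched_ext_status())
```

Running this before and after stopping the agent is a simple way to observe the revert on a test node.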

04 How it connects

Observe first. Enforce when you say so.

Installation is deliberately boring: a DaemonSet, a ConfigMap, and Kubernetes-native inputs. Nothing changes for your workloads until you assign a PriorityClass.

STEP 01

helm install

One chart deploys the node agent DaemonSet and, optionally, the dashboard. The agent verifies each node’s kernel and attaches the scheduler; nodes that can’t run sched_ext simply stay on the stock scheduler.

helm install temper deploy/helm/temper -n temper --create-namespace
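The kernel verification mentioned in this step can be approximated with a small preflight check against the requirements listed in the FAQ (kernel ≥6.12, CONFIG_SCHED_CLASS_EXT=y, BTF). This is a sketch under those stated requirements; the function names are hypothetical and the agent's real checks may differ.

```python
import re

MIN_KERNEL = (6, 12)  # minimum version stated in the FAQ


def kernel_version(release: str) -> tuple:
    """Extract (major, minor) from a `uname -r` string such as '6.12.0-gke'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def node_supported(release: str, kernel_config: str, has_btf: bool) -> bool:
    """True when the version, sched_ext config option, and BTF are all present."""
    config_lines = {line.strip() for line in kernel_config.splitlines()}
    return (
        kernel_version(release) >= MIN_KERNEL
        and "CONFIG_SCHED_CLASS_EXT=y" in config_lines
        and has_btf
    )


config = "CONFIG_BPF=y\nCONFIG_SCHED_CLASS_EXT=y\n"
print(node_supported("6.12.0-gke", config, has_btf=True))     # True
print(node_supported("6.8.0-generic", config, has_btf=True))  # False
```

On a real node, the release string would come from `platform.release()`, BTF availability from whether `/sys/kernel/btf/vmlinux` exists, and the config text from wherever the distribution publishes it.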
STEP 02

Observe — zero enforcement

With no PriorityClasses assigned, workloads land in default tiers and behave as before. Meanwhile the observation layer streams per-pod, per-thread placement telemetry to /metrics and /observe — you see the contention before you act on it.

kubectl get node NODE -o jsonpath='{.metadata.annotations}'
STEP 03

Assign PriorityClasses

Add priorityClassName: temper-critical to the services whose p99 matters. The agent recomputes the layer config from their real resource requests and the kernel starts enforcing. That is the entire integration.

priorityClassName: temper-critical
Safety first. One annotation — temper.codes/safe-mode-requested — is a fleet-wide kill switch that stands the scheduler down and returns every node to stock CFS. We benchmark the failure paths too — agent kills under load, an 8-hour soak, churn storms. Deep dive: failure & rollback engineering →

06 FAQ

The questions everyone asks first.

Is it safe to put something in the kernel’s scheduling path?
sched_ext was designed for exactly this: the kernel’s contract is that a misbehaving or detached BPF scheduler causes an instant, atomic revert to the stock scheduler, so the worst case is the scheduler you run today. We benchmark the failure paths, not just the happy path. Deep dive: failure & rollback engineering →
What happens if the agent dies?
The node keeps scheduling without it: the kernel reverts to the stock scheduler, the replacement agent (it is a DaemonSet) re-attaches when Ready, and any degradation is per-node, never per-cluster. We force-killed it under load and published the numbers. Deep dive: failure & rollback engineering →
Does it work with Karpenter or Cast AI?
Yes. That is complement mode, and it is the default: your existing tool decides where pods go, and Temper decides who gets the CPU when the node is busy. The measured Karpenter-in-both-arms result is in the capacity write-up. Deep dive: sideloading →
What kernels and platforms does it need?
A node kernel ≥6.12 with CONFIG_SCHED_CLASS_EXT=y plus BTF, and permission to run a privileged DaemonSet. Verified live on GKE Standard ≥1.36 and EKS; AKS and Autopilot-style managed modes are listed plainly as unsupported. Full matrix →

The kernel layer is open.
Nobody else is standing in it.

One helm install. Kubernetes-native inputs. Kernel-enforced p99. Kernel-native rollback.