Kubernetes capacity platform

CPU scheduling, enforced inside the kernel.

Temper is a Kubernetes capacity platform that replaces best-effort CFS arbitration with kernel-enforced QoS (Linux sched_ext). Latency-critical services keep their p99 while batch work soaks up the idle cycles — so you pack nodes tight without losing tail latency.

One helm install. Kernel-native rollback — kill the agent and the node reverts to the stock scheduler instantly.

Figure: memcached p99 (ms) vs. number of batch fillers on 3 nodes (0 to 18). Multi-node GKE, batch-filler ladder, nodes at 0.81–0.92 utilization. CFS reaches 1.52 ms (3.1×); Temper stays flat at 0.48 ms.

Measured live, single run per arm; methodology and caveats on the benchmarks page. source: docs/training-artifacts/binpack/REPORT.md

  • −88% memcached p99 at the knee vs. CFS, heavy operating point (benchmarks · gke-c2-mc-heavy/REPORT.md)
  • 9.4×→1.5× end-to-end microservice tail amplification, 19-service DeathStarBench (benchmarks · DeathStarBench report)
  • +72% pods placed at equal SLO on the same nodes with overcommit (benchmarks · binpack/REPORT.md)
  • <1% CPU overhead for always-on observation, measured 0.13% of one core (docs · observability)

Built on Linux sched_ext · Runs on GKE & EKS today · Zero external calls, fully in-cluster · Kubernetes-native inputs, no CRDs

01 The problem

Every Kubernetes cost tool operates above the kernel.

Rightsizers predict usage. Placement engines move pods. Autoscalers resize fleets. All of it is a prediction — and the moment two pods share a node, the prediction is over and CFS arbitration begins.

The stock Linux fair scheduler (CFS, and its successor EEVDF on kernels 6.6 and later) has no notion of which pod’s p99 matters. Pack a node tight and it decides, microsecond by microsecond, who eats the latency. A batch task that becomes runnable can take a timeslice from your revenue service at exactly the wrong moment, and that one delayed request is the tail spike your SLO dashboard shows.

This is why every cost-optimization product keeps its savings engine conservative: the utilization it can safely reach is capped by a scheduler it does not control. Temper occupies the layer none of them do — CPU arbitration inside the kernel’s scheduling path — and makes dense packing safe by enforcement instead of prediction.

SaaS: Rightsizers · FinOps platforms. Predict usage, trim requests & limits. Container-granular.
k8s: Placement · autoscalers · bin-packing. Move pods, resize fleets. Pack by requests, blind to on-node contention.
kernel boundary — every other vendor stops here
L0: Temper, scx_layered (sched_ext). CPU arbitration inside the kernel scheduling path. Fence, loan, preempt on wake.
CFS: Stock Linux scheduler. What every co-located pod inherits without Temper, and the automatic fallback if our agent dies.

02 The platform

Five capabilities. One helm chart.

The node engine is the core; everything else is optional and individually switchable. Install any depth — from a single protected node to the full multi-cluster management plane.

a · node engine (L0)

Enforce QoS at the kernel scheduler

A node agent maps every pod to one of five QoS tiers — derived from standard Kubernetes PriorityClasses and resource requests, no CRDs and no app changes — and drives scx_layered, a Linux sched_ext scheduler, with a config computed from what your pods actually request. Critical tiers get fenced CPU; batch tiers get whatever is idle, and get preempted the instant a protected workload wakes.

  • Five tiers (Critical → Background) from pod.spec.priority — Kubernetes-native inputs only
  • Layer weights, CPU ranges, and utilization bands computed from real resource requests
  • Fail-safe by construction: agent death = instant kernel revert to the stock scheduler
Diagram: one node, 8 CPUs, layers from pod requests. Critical: confined. High / Normal: grouped. Low / Background: open. Idle cycles are loaned to batch, with a preempt-kick the instant a protected pod wakes. Tiers: infera-critical / -high / -normal / -low / -background PriorityClasses.
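
To make the tier mapping concrete, here is a minimal Python sketch of how pods could be grouped into tiers and their CPU requests summed per tier. It is an illustration, not the agent's code: the PriorityClass names come from this page, but the default tier for pods without a recognized PriorityClass, and the helper names `parse_cpu`, `tier_of`, and `requested_cpu_by_tier`, are assumptions.

```python
from collections import defaultdict

# PriorityClass names as listed on this page; the tier labels are illustrative.
TIER_BY_CLASS = {
    "infera-critical": "critical",
    "infera-high": "high",
    "infera-normal": "normal",
    "infera-low": "low",
    "infera-background": "background",
}
DEFAULT_TIER = "normal"  # assumption: where pods with no known PriorityClass land


def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity ("500m", "2", "0.5") to cores."""
    q = str(quantity)
    if q.endswith("m"):
        return int(q[:-1]) / 1000
    return float(q)


def tier_of(pod):
    """Pick a tier from the pod's priorityClassName."""
    name = pod.get("spec", {}).get("priorityClassName")
    return TIER_BY_CLASS.get(name, DEFAULT_TIER)


def requested_cpu_by_tier(pods):
    """Sum container CPU requests per tier, in cores."""
    totals = defaultdict(float)
    for pod in pods:
        tier = tier_of(pod)
        for container in pod.get("spec", {}).get("containers", []):
            cpu = container.get("resources", {}).get("requests", {}).get("cpu")
            if cpu is not None:
                totals[tier] += parse_cpu(cpu)
    return dict(totals)
```

Per-tier totals like these are the kind of input from which layer weights and CPU ranges could be derived; how the real agent turns them into an scx_layered config is not described here.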

b · node engine (L0)

Tune scheduling per thread group, inside one pod

A pod is not one uniform workload — a database has connection threads, I/O threads, and background purge threads with completely different needs. Workload Profiles give each thread group its own scheduling treatment inside a single pod: exclusive cores for the hot path, latency treatment for wake chains, yield for housekeeping. No product operating above the kernel can see thread structure, let alone schedule on it.

  • Builtin profiles for common shapes (e.g. PyTorch dataloaders, MySQL/InnoDB), plus file-based custom profiles
  • Auto-detection by container image or a single pod annotation
  • Training mode measures your workload and synthesizes a profile automatically
Diagram: one pod (MySQL). Thread groups: conn threads ×16, innodb i/o threads, purge / bg threads. Treatments: exclusive cores, latency treatment, yield to others.
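
As a rough illustration of per-thread-group treatment, the Python sketch below matches thread names against an ordered list of patterns. Everything specific in it is hypothetical: the profile structure, the glob patterns, and the pairing of each thread group with a treatment are assumptions made for the example, not Temper's profile format.

```python
from fnmatch import fnmatchcase

# Hypothetical profile: (thread-name glob, treatment). First match wins.
# Patterns and pairings are illustrative, not a shipped MySQL profile.
MYSQL_PROFILE = [
    ("connection*", "exclusive"),
    ("ib_io_*", "latency"),
    ("ib_purge*", "yield"),
]


def treatment_for(thread_name, profile, default="default"):
    """Return the treatment of the first pattern that matches the thread name."""
    for pattern, treatment in profile:
        if fnmatchcase(thread_name, pattern):
            return treatment
    return default


def group_threads(thread_names, profile):
    """Bucket thread names by the treatment the profile assigns them."""
    groups = {}
    for name in thread_names:
        groups.setdefault(treatment_for(name, profile), []).append(name)
    return groups
```

Ordered first-match rules keep a profile easy to read and let a specific pattern sit ahead of a broader one.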

c · cluster intelligence (L1)

Pack more pods without breaking SLOs

Kubernetes bin-packs by declared requests — which are usually padded, because nobody trusts CFS with a tight node. With enforcement underneath, Temper’s optional placement layer packs by what protection capacity actually exists: a density-aware scheduler plugin reads per-tier load from node annotations, and an opt-in admission webhook scales down the CPU requests of non-critical pods so the bin-packer fits more of them. Never limits, never the Critical tier, always reversible.

  • Opt-in per namespace via a label; every mutation annotated with the original value
  • Measured: +72% pods placed at equal SLO; a 16-pod fleet consolidated 3→2 nodes
  • Complement mode: one helm flag stands this layer down and Karpenter / Cast AI keep placement
Diagram: packing by declared requests leaves padded headroom (wasted); packing with enforcement keeps critical fenced at the CPU and places +72% pods at the same SLO (measured). The webhook scales requests only, never limits, never Critical; the original value is kept in an annotation.
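
The webhook rules above (scale requests only, skip the Critical tier, record the original value) can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the shipped webhook: the annotation key, the 0.5 scaling factor, and the function names are hypothetical.

```python
import copy
import json

# Hypothetical annotation key; the page only says the original value is kept
# in an annotation, not what the key is called.
ORIGINAL_KEY = "example.invalid/original-cpu-requests"


def to_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as "500m" or "1.5" to millicores."""
    q = str(quantity)
    if q.endswith("m"):
        return int(q[:-1])
    return round(float(q) * 1000)


def scale_cpu_requests(pod, tier, factor=0.5):
    """Return a copy of the pod with CPU requests scaled; limits are untouched."""
    out = copy.deepcopy(pod)
    if tier == "critical":
        return out  # the Critical tier is never mutated
    originals = {}
    for container in out["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        if "cpu" not in requests:
            continue
        originals[container["name"]] = requests["cpu"]
        scaled = max(1, int(to_millicores(requests["cpu"]) * factor))
        requests["cpu"] = f"{scaled}m"
    if originals:
        # Record the pre-mutation values so the change can be reversed.
        annotations = out.setdefault("metadata", {}).setdefault("annotations", {})
        annotations[ORIGINAL_KEY] = json.dumps(originals, sort_keys=True)
    return out
```

Working on a deep copy keeps the incoming object intact, which makes it easy to diff the two versions into a patch in a real admission handler.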

d · management plane (L2)

Run the fleet from one in-cluster plane

A hierarchy explorer walks cluster → node → pod → container with live updates: logs, manifests with revision diffs, per-pod performance panels, and scheduling detail down to the layer a pod landed in. Operator actions — cordon, drain, rollout restart, safe mode, trace download — are role-gated and audit-logged. A savings view splits realized reclaim from identified opportunity, priced per machine type.

  • Viewer / operator / admin roles, named tokens, audit export
  • Multi-cluster hub via a peer registry — per-cluster data planes stay self-contained, air-gap friendly
  • Versioned REST API (/api/v1, OpenAPI); the built-in UI uses the same public API
UI sketch: explorer tree (cluster → node-a → pod: api-7f, pod: batch-2; node-b) · workloads (deploy · sts · ds · job) · CPU / runqueue / PSI panels · live logs (follow) · savings (realized: measured reclaim; identified: rightsizing slack) · actions (cordon · drain · safe-mode · scale), audit-logged.

e · observability

See threads, not container averages

An always-on observation layer samples per-pod, per-thread placement and runqueue telemetry at under 1% CPU overhead (measured 0.13% of one core). A placement linter continuously checks scheduling invariants, and on-demand kernel trace capture gives you bounded perfetto traces without installing anything extra. The same thread-level data feeds a rightsizer that sees what container averages hide: one hot thread that needs an exclusive core.

  • Prometheus /metrics, a JSON /observe snapshot, and generated Grafana dashboards
  • Placement linter with invariant checks exported as metrics
  • Thread-aware rightsizing recommendations — a class above container-average rightsizers
Diagram: /metrics (Prometheus + Grafana) · /observe (thread-level snapshot) · trace capture (bounded kernel traces). Placement linter invariants, all passing: smt_collision, protected_fallback, open_reserve, layer_mismatch. Observation overhead, measured: 0.13% of one core, always on.

03 The engine

Three layers. Each consumes only the one below.

L0 is a helm-installed DaemonSet and the whole story works with it alone. L1 and L2 are optional and individually switchable — which is why running under Karpenter or Cast AI and running standalone are the same codebase.

L2 · MANAGEMENT PLANE: explorer · actions · savings · thread-aware rightsizer
L1 · CLUSTER INTELLIGENCE: density-aware scheduler plugin · overcommit webhook
KERNEL BOUNDARY
L0 · NODE ENFORCEMENT: scx_layered (sched_ext), a BPF scheduler in the kernel. Fence critical layers · loan idle cycles · preempt-kick on wake
CFS · FAIL-SAFE: kill the agent → node reverts to the stock scheduler (measured, no blackout)

Inside the scheduler, not above it. Every other capacity product predicts contention and hopes. Temper stands in the kernel’s CPU scheduling path and arbitrates it — the difference between suggesting who should run and deciding who runs.

Thread-group granularity nobody else has. Container-level tools see one number per pod. Temper schedules the threads inside a pod differently — exclusive cores for a hot loop, latency treatment for a wake chain — because at the kernel layer, threads are what actually exist.

Fail-safe is the kernel’s contract, not ours. When the BPF scheduler detaches for any reason (crash, kill, upgrade), the kernel atomically reverts to the stock scheduler. The worst case is the scheduler you already run today.

04 How it connects

Observe first. Enforce when you say so.

Installation is deliberately boring: a DaemonSet, a ConfigMap, and Kubernetes-native inputs. Nothing changes for your workloads until you assign a PriorityClass.

STEP 01

helm install

One chart deploys the node agent DaemonSet and, optionally, the dashboard. The agent verifies each node’s kernel and attaches the scheduler; nodes that can’t run sched_ext simply stay on the stock scheduler.

helm install infera deploy/helm/infera -n infera --create-namespace
STEP 02

Observe — zero enforcement

With no PriorityClasses assigned, workloads land in default tiers and behave as before. Meanwhile the observation layer streams per-pod, per-thread placement telemetry to /metrics and /observe — you see the contention before you act on it.

kubectl get node NODE -o jsonpath='{.metadata.annotations}'
STEP 03

Assign PriorityClasses

Add priorityClassName: infera-critical to the services whose p99 matters. The agent recomputes the layer config from their real resource requests and the kernel starts enforcing. That is the entire integration.

priorityClassName: infera-critical
Safety first. One annotation, infera.io/safe-mode-requested, is a fleet-wide kill switch that stands the scheduler down and returns every node to stock CFS. We force-killed the agent under load to prove the failover: p99 0.61 ms before, 0.64 ms during, 0.61 ms after.

06 FAQ

The questions everyone asks first.

Is it safe to put something in the kernel’s scheduling path?
sched_ext was designed for exactly this: the kernel’s contract is that a misbehaving or detached BPF scheduler causes an instant, atomic revert to the stock scheduler. The worst case is the scheduler you run today. We measured the failover under load (a 0.03 ms p99 blip) and ran an 8-hour soak clean.

What happens if the agent dies?
The kernel detects the detach and reverts the node to stock CFS in milliseconds, with no operator action needed. The replacement agent (it is a DaemonSet) re-attaches when Ready; measured end to end: p99 0.61 → 0.64 → 0.61 ms with the new agent Ready in 15 s. Degradation is per-node, never per-cluster.

Does it work with Karpenter or Cast AI?
Yes. That is complement mode, and it is the default. Temper’s placement layer ships disabled (scheduler.enabled=false); your existing tool decides where pods go, Temper decides who gets the CPU when the node is busy. Measured with Karpenter in both arms: −40% provisioned vCPU at equal load and SLO.

What kernels and platforms does it need?
A node kernel ≥6.12 with CONFIG_SCHED_CLASS_EXT=y plus BTF, and permission to run a privileged DaemonSet. Verified live today: GKE Standard ≥1.36 and EKS (AL2023, Bottlerocket). AKS is currently a no, because Microsoft disables sched_ext in its kernels, and so are Autopilot-style managed modes.
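
For a quick look at whether a node meets the kernel requirements above, a preflight check along these lines may help. It is a sketch, not the agent's verification logic: it assumes that a `/sys/kernel/sched_ext` directory indicates a kernel built with sched_ext, that `/sys/kernel/btf/vmlinux` indicates BTF is available, and the function names are made up for the example.

```python
import os
import platform
import re

MIN_KERNEL = (6, 12)  # minimum kernel version stated on this page


def kernel_at_least(release, minimum=MIN_KERNEL):
    """Compare the major.minor prefix of a kernel release string to a minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        return False  # unparseable release strings are treated as unsupported
    return (int(match.group(1)), int(match.group(2))) >= minimum


def probe_local_node():
    """Report the three prerequisites for the machine this runs on."""
    return {
        "kernel_ok": kernel_at_least(platform.release()),
        # Assumption: this file is present when the kernel exposes BTF.
        "btf": os.path.exists("/sys/kernel/btf/vmlinux"),
        # Assumption: this directory is present when sched_ext is compiled in.
        "sched_ext": os.path.isdir("/sys/kernel/sched_ext"),
    }
```

Running `probe_local_node()` on a node, or in a privileged debug pod that sees the host's `/sys`, gives a quick yes/no per prerequisite before installing the chart.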

The kernel layer is open.
Nobody else is standing in it.

One helm install. Kubernetes-native inputs. Kernel-enforced p99. Kernel-native rollback.