Benchmarks — Temper

00 Methodology

How to read this page.

A benchmark you cannot reproduce is marketing. Every chart carries its source report path and its honest caveat; single-run arms are labeled as such.

Each comparison runs the same workload, same nodes, same load generator in two arms: stock Kubernetes on CFS, then Temper attached. Density tests step a ladder of background workloads and record where the primary’s SLO breaks in each arm. The CFS baseline is real — obtained by putting Temper’s nodes in safe mode, not by a separate cluster — so hardware, images, and noise are held constant.
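The ladder comparison described above can be sketched in a few lines. This is a minimal illustration, not Temper's harness: the function name `first_slo_break`, the 0.30 ms SLO, and every latency value below are hypothetical placeholders that only show the shape of the calculation.

```python
from typing import Optional


def first_slo_break(ladder: dict[int, float], slo_ms: float) -> Optional[int]:
    """Return the lowest density step whose p99 exceeds the SLO, or None."""
    for density in sorted(ladder):
        if ladder[density] > slo_ms:
            return density
    return None


# Hypothetical p99 readings (ms), keyed by number of background workloads.
cfs_arm = {0: 0.20, 2: 0.35, 4: 0.70, 8: 1.90}
temper_arm = {0: 0.18, 2: 0.18, 4: 0.19, 8: 0.19}
SLO_MS = 0.30  # assumed SLO, for illustration only

print("CFS breaks at density:", first_slo_break(cfs_arm, SLO_MS))
print("Temper breaks at density:", first_slo_break(temper_arm, SLO_MS))
```

A result of `None` means the arm held its SLO across the whole ladder, which is what "flat" means in the charts on this page.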

Where a run has an anomaly, it stays in the report and on this page. Where an arm was measured once, the note says so. Where the mechanism can be verified with kernel counters instead of inferred from latency, we verify it with kernel counters. Source paths refer to reports committed in the product repository; design partners get the full artifact tree.

01 memcached

Flat p99 at 0.92 node utilization.

The headline result: latency-critical caching holds its tail while the same nodes run nearly full of batch work.

memcached p99 vs. batch-filler density, multi-node: Temper flat · CFS 3.1×

3-node GKE cluster, two memcached primaries under memtier load, batch-filler ladder 0→18 plus 6 background spinners. Node utilization 0.81–0.92.

Series: CFS (stock kube-scheduler) and Temper (scx_layered).

Live GKE, 3× e2-standard-4, single run per arm (±0.2 ms is noise; the flat-vs-3.1× delta is far outside it). One open anomaly: the second memcached instance tracked CFS in this run (did not reproduce on EKS). source: docs/training-artifacts/binpack/REPORT.md · reproduce: hack/demo-gke.sh step 3

memcached, heavy operating point: −88% p99

Saturating memcached primary (4t×8c) vs. background-spinner ladder. GKE c2 dedicated cores.

−88% p99 at 4 spinners (0.415 vs 3.231 ms), −71% at idle; CFS +202% across the ladder, Temper flat. Same law replicated cross-cloud: EKS heavy point measured −70%. source: docs/training-artifacts/headroom/gke-c2-mc-heavy/REPORT.md · 2× c2-standard-4, 2026-07-02

memcached vs. density, tier QoS only: CFS degrades 12.3× · Temper flat

Single node, density ladder 0→8. No workload profile — PriorityClass-derived tiers alone.

CFS broke the SLO at density 2 (0.335 ms) and reached 1.951 ms at density 8; Temper held 0.183 ms. No custom labels or CRDs — tiers derive from Kubernetes-native PriorityClasses. source: docs/training-artifacts/OVERNIGHT-REPORT.md · GKE c3-standard-8, 2026-06-12

02 OLTP databases

The quota-throttle tail, eliminated.

Databases that run with CPU limits get frozen mid-transaction by CFS quota enforcement. This failure mode appears even on an idle node; no noisy neighbor is needed.

PostgreSQL under CPU limits: throttle tails eliminated

pgbench p99, quota-limited postgres (requests=limits=1500m), background ladder.

Mechanism verified with kernel counters, not inferred: in a 20 s window CFS logged nr_throttled +199 / 16.48 s throttled; Temper logged zero of both. At doubled client pressure (-c16) the CFS cliff deepens to 68 ms. source: docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md (+ ../gke-c2-pgbench-c16) · 2026-07-02

A CPU limit means CFS bandwidth control: exhaust the quota and the kernel freezes every thread in the cgroup until the next period. For a database, that freeze lands in the middle of a transaction holding locks — which is how quota-limited PostgreSQL and MySQL produce 33–68 ms p99 tails even on an otherwise idle node. Under Temper the same workloads ran 4–17 ms.
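As a rough sketch of how throttling can be read from kernel counters, the snippet below diffs two snapshots of a cgroup v2 `cpu.stat` file. It assumes cgroup v2 and the kubelet's default 100 ms CFS period. The helper names are hypothetical, the snapshot values are made up so that their deltas match the window reported above (+199 throttled periods, 16.48 s throttled), and the report's actual collection method is not described here.

```python
def cpu_max_for_limit(millicores: int, period_us: int = 100_000) -> str:
    """Render a CPU limit as a cgroup v2 cpu.max value ("quota period")."""
    quota_us = millicores * period_us // 1000
    return f"{quota_us} {period_us}"


def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the "key value" lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_delta(before: str, after: str) -> tuple[int, float]:
    """Return (throttled periods, throttled seconds) between two snapshots."""
    b, a = parse_cpu_stat(before), parse_cpu_stat(after)
    periods = a["nr_throttled"] - b["nr_throttled"]
    seconds = (a["throttled_usec"] - b["throttled_usec"]) / 1_000_000
    return periods, seconds


# Illustrative snapshots taken 20 s apart; absolute values are invented.
BEFORE = "usage_usec 5000000\nnr_periods 1000\nnr_throttled 400\nthrottled_usec 30000000\n"
AFTER = "usage_usec 35000000\nnr_periods 1200\nnr_throttled 599\nthrottled_usec 46480000\n"

print("cpu.max for a 1500m limit:", cpu_max_for_limit(1500))
print("throttle delta:", throttle_delta(BEFORE, AFTER))
```

A 20 s window at a 100 ms period contains about 200 enforcement periods, so a delta of 199 means the cgroup was throttled in nearly every one of them.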

The honest caveat, stated plainly: while Temper’s scheduler is attached, cgroup cpu.max quotas are not kernel-enforced (a sched_ext property). Containment comes from Temper’s layer ceilings instead, so the two arms are not limit-identical. Quota-derived ceilings are on the roadmap, and the kill switch restores CFS with quotas instantly.

03 JVM workloads

Cassandra: better at every density.

GC threads, compaction, and request handlers are exactly the kind of mixed thread population that benefits from layer separation.

−78%: Cassandra p99 vs. CFS on an idle node, before any noisy neighbor arrives. source: docs/training-artifacts/ · Cassandra report

Every step of the density ladder measured better under Temper than CFS, and the gap widens as the node fills. source: docs/training-artifacts/ · Cassandra report

04 Microservices

Tail amplification is multiplicative. So is fixing it.

A request that crosses 19 services samples 19 scheduling delays; the end-to-end tail is the compounding of every hop’s tail.

DeathStarBench, 19 services: 9.4× → ~1.5×

End-to-end tail amplification (p99 / p50) under co-located load, CFS vs. Temper attached.

−75% / −83% on the two measured operating points while attached. Caveat: one scheduler residual under sustained saturation was found in this run and fixed in scheduler v14. source: docs/training-artifacts/ · DeathStarBench report

Microservice architectures are where CFS jitter hurts most: each hop’s scheduling delay compounds into the end-to-end tail, so a modest per-service p99 becomes a 9.4× amplification across a 19-service call graph. Enforcing the latency-critical tier at every node cut that to ~1.5× — without touching a line of application code.
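The compounding can be made concrete with a back-of-the-envelope model. The sketch below assumes every request touches all hops and that hops are statistically independent, neither of which is stated in the report. It computes how often a request meets at least one hop running at its own p99 or worse. This is not the p99/p50 amplification metric charted above, only an illustration of why per-hop tails add up.

```python
def p_hits_tail(hops: int, per_hop_tail: float = 0.01) -> float:
    """Probability that at least one of `hops` independent hops lands in its tail."""
    return 1.0 - (1.0 - per_hop_tail) ** hops


for hops in (1, 5, 19):
    share = p_hits_tail(hops)
    print(f"{hops:>2} hops: {share:.1%} of requests see a p99-or-worse hop")
```

Under these assumptions, roughly one request in six on a 19-hop path experiences some service's worst 1% of scheduling delay, so a tail that looks rare per service becomes routine end to end.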

This is also the honest illustration of our caveat culture: the run surfaced a scheduler residual under sustained saturation. It stayed in the report, was root-caused, and the fix shipped in v14 — that loop is the product working as designed.

05 GPU workloads

The most expensive CFS failure is a starved GPU.

The GPU is reserved and billed; the CPU-side DataLoader threads that feed it are not protected. Pack the node and the accelerator idles.

ResNet-18 on NVIDIA L4, batch neighbors: Temper flat · CFS −25%

Burstable trainer (2-CPU request, ~7 vCPU demand, 6 DataLoader workers) vs. batch-spinner ladder on one g2-standard-8.

At 16 neighbors CFS lost 25% of trainer throughput (629→471 samples/s) and the L4’s utilization collapsed into a 0–81% band (mean ~40%); under Temper both held steady. Honest control: a Guaranteed trainer whose demand fits its request is defended by kubelet alone — both arms flat (v1 run, kept in the report). source: docs/training-artifacts/gpu-wedge/REPORT.md · GKE g2-standard-8 + 1× NVIDIA L4, 2026-07-01

PyTorch training at density 8: +67% samples/s

Guaranteed 3-CPU trainer next to 8 noisy neighbors. Kubernetes’ own remedies measure worse than doing nothing.

Static CPU pinning (kubelet cpuset) capped the trainer at 14.6–14.8 samples/s even on an idle node — the pin was SMT-blind. Whole-core SMT-aware placement is part of the L0 win. source: docs/training-artifacts/OVERNIGHT-REPORT.md · GKE c3-standard-8, 2026-06-12

Honest negative: GPU serving with a small model (vLLM) measured parity — the workload is GPU-bound, so there is no CPU-side contention for Temper to remove. We publish that result rather than hide it; the wedge is CPU-fed training and preprocessing, not GPU-bound inference. source: docs/training-artifacts/gpu-wedge/REPORT.md

06 Density & consolidation

The capacity story, end to end.

Enforcement makes packing safe; the L1 placement layer then converts the safety into fewer nodes. These runs measure the whole chain.

31 vs 18 pods placed at equal SLO on the same 3 nodes with the overcommit webhook (+72%), p99 ≤1.87 ms. source: docs/training-artifacts/binpack/REPORT.md

3→2 nodes for a 16-pod fleet (33% shrink), p99 spot-check 1.56 ms; the pods genuinely do not fit at stock requests. source: docs/training-artifacts/binpack/SAVINGS-REPORT.md

−40% provisioned vCPU with Karpenter+Temper vs. Karpenter alone at equal load and SLO (12 vs 20 vCPU). source: docs/training-artifacts/karpenter/REPORT.md

1.89→5.65 background cores reclaimed under a bursty primary, with node utilization rising 0.40→0.85 at −7% primary cost. source: docs/training-artifacts/OVERNIGHT-REPORT.md
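The headline percentages in this section follow directly from the raw counts quoted with them. A small check, using only figures from this page (the helper name `pct_change` is ours):

```python
def pct_change(new: float, old: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


print(f"pods placed:      {pct_change(31, 18):+.0f}%")   # 31 vs 18 pods
print(f"node count:       {pct_change(2, 3):+.0f}%")     # 3 -> 2 nodes
print(f"provisioned vCPU: {pct_change(12, 20):+.0f}%")   # 12 vs 20 vCPU
print(f"background cores: {5.65 / 1.89:.2f}x")           # 1.89 -> 5.65 cores
```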

07 Reliability

We benchmarked the failure modes too.

A capacity platform you cannot kill safely is a liability. These are the numbers for when things go wrong.

0.61/0.64/0.61 ms p99 before / during / after force-killing the agent under load: kernel-native revert, no blackout. source: docs/training-artifacts/binpack/SAVINGS-REPORT.md

8 h soak run clean: no ejects, no leaks, no drift while attached. source: docs/training-artifacts/ · soak report

~52 ms CFS gap per scheduler reconfig (pod churn cost): node-local, bounded, and measured rather than assumed. source: docs/training-artifacts/ · churn report

Reconfiguration happens when QoS assignments change on a node (pod added/removed from a tier). During the ~52 ms window the node runs stock CFS — the same failure direction as every other event on this page: absence of benefit, not harm.
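To put the ~52 ms window in proportion, the sketch below estimates what share of wall-clock time a node would spend on stock CFS at a given churn rate. The churn rates are hypothetical, and the model assumes reconfig windows never overlap and each costs the full 52 ms.

```python
RECONFIG_GAP_S = 0.052  # ~52 ms per reconfig, from the churn figure on this page


def cfs_fraction(reconfigs_per_hour: float) -> float:
    """Fraction of wall-clock time a node spends in the CFS gap."""
    return reconfigs_per_hour * RECONFIG_GAP_S / 3600.0


for rate in (1, 60, 600):  # hypothetical tier changes per node per hour
    print(f"{rate:>3} reconfigs/h -> {cfs_fraction(rate):.4%} of time on CFS")
```

Even at one tier change per minute, the node is on CFS for under a tenth of a percent of the time in this model.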

Measured on live clusters. Caveats included.

Reproduce every number on your own cluster.