Benchmarks
Measured on live clusters. Caveats included.
Every number below comes from a committed benchmark report in the repository. The harness is open, every comparison pairs a Temper arm with a genuine CFS baseline on identical hardware, and the open anomalies are published alongside the wins. We lead with the caveats because the deltas are far outside them.
00 Methodology
How to read this page.
A benchmark you cannot reproduce is marketing. Every chart carries its source report path and its honest caveat; single-run arms are labeled as such.
Each comparison runs the same workload, same nodes, same load generator in two arms: stock Kubernetes on CFS, then Temper attached. Density tests step a ladder of background workloads and record where the primary’s SLO breaks in each arm. The CFS baseline is real — obtained by putting Temper’s nodes in safe mode, not by a separate cluster — so hardware, images, and noise are held constant.
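Reading a break point off a density ladder can be sketched as below. This is a minimal illustration, not the repository's harness: the function name `first_slo_break`, the 0.30 ms SLO threshold, and every p99 reading in the two sample ladders are hypothetical, chosen only to show how each arm's break point would be located.

```python
from typing import Optional


def first_slo_break(ladder: dict[int, float], slo_ms: float) -> Optional[int]:
    """Return the lowest density whose p99 exceeds the SLO, or None if it never breaks."""
    for density in sorted(ladder):
        if ladder[density] > slo_ms:
            return density
    return None


# Hypothetical p99 readings (ms) keyed by background-workload density.
# These are illustrative values, not figures from any report.
baseline_arm = {0: 0.18, 2: 0.34, 4: 0.80, 8: 1.90}
attached_arm = {0: 0.18, 2: 0.18, 4: 0.19, 8: 0.18}
SLO_MS = 0.30  # assumed threshold for illustration only

print(first_slo_break(baseline_arm, SLO_MS))  # 2
print(first_slo_break(attached_arm, SLO_MS))  # None
```

Returning `None` for an arm that never breaks keeps "held across the whole ladder" distinct from "broke at the last rung".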
Where a run has an anomaly, it stays in the report and on this page. Where an arm was measured once, the note says so. Where the mechanism can be verified with kernel counters instead of inferred from latency, we verify it with kernel counters. Source paths refer to reports committed in the product repository; design partners get the full artifact tree.
01 memcached
Flat p99 at 0.92 node utilization.
The headline result: latency-critical caching holds its tail while the same nodes run nearly full of batch work.
memcached p99 vs. batch-filler density, multi-node: Temper flat · CFS 3.1×
3-node GKE cluster, two memcached primaries under memtier load, batch-filler ladder 0→18 plus 6 background spinners. Node utilization 0.81–0.92.
Live GKE, 3× e2-standard-4, single run per arm (±0.2 ms is noise; the flat-vs-3.1× delta is far outside it). One open anomaly: the second memcached instance tracked CFS in this run (did not reproduce on EKS). source: docs/training-artifacts/binpack/REPORT.md · reproduce: hack/demo-gke.sh step 3
memcached, heavy operating point: −88% p99
Saturating memcached primary (4t×8c) vs. background-spinner ladder. GKE c2 dedicated cores.
−88% p99 at 4 spinners (0.415 vs 3.231 ms), −71% at idle; CFS +202% across the ladder, Temper flat. The same pattern replicated cross-cloud: the EKS heavy point measured −70%. source: docs/training-artifacts/headroom/gke-c2-mc-heavy/REPORT.md · 2× c2-standard-4, 2026-07-02
memcached vs. density, tier QoS only: CFS degrades 12.3× · Temper flat
Single node, density ladder 0→8. No workload profile — PriorityClass-derived tiers alone.
CFS broke the SLO at density 2 (0.335 ms) and reached 1.951 ms at density 8; Temper held 0.183 ms. No custom labels or CRDs — tiers derive from Kubernetes-native PriorityClasses. source: docs/training-artifacts/OVERNIGHT-REPORT.md · GKE c3-standard-8, 2026-06-12
02 OLTP databases
The quota-throttle tail, eliminated.
Databases run with CPU limits get frozen mid-transaction by CFS quota enforcement. This failure mode applies even on an idle node; no noisy neighbor is needed.
PostgreSQL under CPU limits: throttle tails eliminated
pgbench p99, quota-limited postgres (requests=limits=1500m), background ladder.
Mechanism verified with kernel counters, not inferred: in a 20 s window CFS logged nr_throttled +199 / 16.48 s throttled; Temper logged zero of both. At doubled client pressure (-c16) the CFS cliff deepens to 68 ms. source: docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md (+ ../gke-c2-pgbench-c16) · 2026-07-02
A CPU limit means CFS bandwidth control: exhaust the quota and the kernel freezes every thread in the cgroup until the next period. For a database, that freeze lands in the middle of a transaction holding locks, which is how quota-limited PostgreSQL and MySQL produce 33–68 ms p99 tails even on an otherwise idle node. Under Temper the same workloads ran at 4–17 ms.
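The throttle counters can be read by diffing two snapshots of a cgroup's `cpu.stat` file. The sketch below assumes the cgroup v2 format, where `nr_throttled` and `throttled_usec` are cumulative counters; the reports may collect them differently. The helper names are hypothetical, and the snapshot values are made up so that their deltas match the figures quoted above (+199 events, 16.48 s throttled).

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_delta(before: str, after: str) -> tuple[int, float]:
    """Return (new throttle events, seconds throttled) between two snapshots."""
    a, b = parse_cpu_stat(before), parse_cpu_stat(after)
    events = b["nr_throttled"] - a["nr_throttled"]
    seconds = (b["throttled_usec"] - a["throttled_usec"]) / 1_000_000
    return events, seconds


# Illustrative snapshots taken 20 s apart. The absolute values are invented;
# only the deltas are chosen to mirror the quoted +199 events and 16.48 s.
before = "usage_usec 5000000\nnr_periods 1000\nnr_throttled 40\nthrottled_usec 2000000\n"
after = "usage_usec 35000000\nnr_periods 1200\nnr_throttled 239\nthrottled_usec 18480000\n"

events, seconds = throttle_delta(before, after)
print(events, seconds)  # 199 16.48
```

On a live node the two strings would come from reading the container's `cpu.stat` under `/sys/fs/cgroup` at the start and end of the window; the exact cgroup path depends on the runtime and is not specified here.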
The honest caveat, stated plainly: while Temper’s scheduler is attached, cgroup cpu.max quotas are not kernel-enforced (a sched_ext property). Containment comes from Temper’s layer ceilings instead, so the two arms are not limit-identical. Quota-derived ceilings are on the roadmap, and the kill switch restores CFS with quotas instantly.
03 JVM workloads
Cassandra: better at every density.
GC threads, compaction, and request handlers are exactly the kind of mixed thread population that benefits from layer separation.
04 Microservices
Tail amplification is multiplicative. So is fixing it.
A request that crosses 19 services samples 19 scheduling delays; the end-to-end tail is the compounding of every hop’s tail.
DeathStarBench, 19 services: 9.4× → ~1.5×
End-to-end tail amplification (p99 / p50) under co-located load, CFS vs. Temper attached.
−75% / −83% on the two measured operating points while attached. Caveat: one scheduler residual under sustained saturation was found in this run and fixed in scheduler v14. source: docs/training-artifacts/ · DeathStarBench report
Microservice architectures are where CFS jitter hurts most: each hop’s scheduling delay compounds into the end-to-end tail, so a modest per-service p99 becomes a 9.4× amplification across a 19-service call graph. Enforcing the latency-critical tier on every node cut that to ~1.5×, without touching a line of application code.
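A back-of-the-envelope calculation shows why the hop count matters so much. The sketch below assumes each hop is independent and has a 1% chance of landing in its own slow tail (its p99). Both are simplifying assumptions, and the function name is hypothetical. It illustrates the compounding effect only; it does not reproduce the measured 9.4× figure.

```python
def p_any_slow_hop(hops: int, per_hop_tail: float) -> float:
    """Probability that at least one of `hops` independent hops lands in its slow tail."""
    return 1.0 - (1.0 - per_hop_tail) ** hops


# 19 sequential hops, each with a 1% chance of exceeding its own p99.
share = p_any_slow_hop(19, 0.01)
print(f"{share:.1%}")  # 17.4%
```

Under these assumptions roughly one request in six touches at least one slow hop, so a per-service p99 event becomes an ordinary occurrence at the end-to-end level.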
This is also an honest illustration of our caveat culture: the run surfaced a scheduler residual under sustained saturation. It stayed in the report, was root-caused, and the fix shipped in v14. That loop is the product working as designed.
05 GPU workloads
The most expensive CFS failure is a starved GPU.
The GPU is reserved and billed; the CPU-side DataLoader threads that feed it are not protected. Pack the node and the accelerator idles.
ResNet-18 on NVIDIA L4, batch neighbors: Temper flat · CFS −25%
Burstable trainer (2-CPU request, ~7 vCPU demand, 6 DataLoader workers) vs. batch-spinner ladder on one g2-standard-8.
At 16 neighbors CFS lost 25% of trainer throughput (629→471 samples/s) and the L4’s utilization collapsed into a 0–81% band (mean ~40%); under Temper both held steady. Honest control: a Guaranteed trainer whose demand fits its request is defended by kubelet alone — both arms flat (v1 run, kept in the report). source: docs/training-artifacts/gpu-wedge/REPORT.md · GKE g2-standard-8 + 1× NVIDIA L4, 2026-07-01
PyTorch training at density 8: +67% samples/s
Guaranteed 3-CPU trainer next to 8 noisy neighbors. Kubernetes’ own remedies measure worse than doing nothing.
Static CPU pinning (kubelet cpuset) capped the trainer at 14.6–14.8 samples/s even on an idle node — the pin was SMT-blind. Whole-core SMT-aware placement is part of the L0 win. source: docs/training-artifacts/OVERNIGHT-REPORT.md · GKE c3-standard-8, 2026-06-12
Honest negative: GPU serving with a small model (vLLM) measured parity — the workload is GPU-bound, so there is no CPU-side contention for Temper to remove. We publish that result rather than hide it; the wedge is CPU-fed training and preprocessing, not GPU-bound inference. source: docs/training-artifacts/gpu-wedge/REPORT.md
06 Density & consolidation
The capacity story, end to end.
Enforcement makes packing safe; the L1 placement layer then converts the safety into fewer nodes. These runs measure the whole chain.
07 Reliability
We benchmarked the failure modes too.
A capacity platform you cannot kill safely is a liability. These are the numbers for when things go wrong.
Reconfiguration happens when QoS assignments change on a node (a pod is added to or removed from a tier). During the ~52 ms window the node runs stock CFS, which is the same failure direction as every other event on this page: absence of benefit, not harm.