02 Deep dive

CPU limits without the freeze-cliff

Teams delete CPU limits because CFS quota enforcement freezes their databases mid-transaction. We measured the freeze with kernel counters on PostgreSQL, MySQL, and Cassandra. This article also discloses, in full, the one behavioral difference you must understand before running Temper: cpu.max is not kernel-enforced while a sched_ext scheduler is attached.

What a CPU limit actually does

A Kubernetes CPU limit becomes cgroup cpu.max: a quota of CPU-microseconds per enforcement period (100 ms by default). CFS bandwidth control charges every fair-class thread's runtime against the quota; when the quota is exhausted, the kernel freezes every thread in the cgroup until the next period refill. The freeze is indiscriminate: for a database it lands mid-transaction, while locks are held, and every queued client behind that transaction inherits the stall. This is why quota-limited databases show 30–70 ms p99 cliffs on nodes that are otherwise idle. No noisy neighbor is required; the workload's own burst walks off the quota cliff.
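The limit-to-quota conversion described above can be sketched as follows. This is a minimal illustration, assuming the default 100 ms period and a limit expressed in millicores; the function name is hypothetical and not part of Kubernetes or Temper.

```python
DEFAULT_PERIOD_US = 100_000  # 100 ms enforcement period, in microseconds


def cpu_max_for_limit(limit_millicores: int, period_us: int = DEFAULT_PERIOD_US) -> str:
    """Return the cgroup v2 cpu.max string ("<quota> <period>") for a CPU limit.

    A non-positive limit is treated as "no limit", which cgroup v2 writes as "max".
    """
    if limit_millicores <= 0:
        return f"max {period_us}"
    # 1000 millicores == one full period of CPU time per period.
    quota_us = limit_millicores * period_us // 1000
    return f"{quota_us} {period_us}"


print(cpu_max_for_limit(1500))  # 150000 100000
```

A 1500m limit therefore allows 150 ms of CPU time per 100 ms period, summed across all threads in the cgroup; a burst that spends it early waits out the rest of the period.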

The industry's standard answer is to delete limits, which trades the cliff for unbounded contention, and unbounded contention is exactly the fear that keeps requests padded (see the sideloading article). The interesting question is whether you can keep containment without the freeze.

The freeze, verified with kernel counters

Figure: PostgreSQL under CPU limits, throttle tails eliminated. pgbench (-c 8) p99 in ms against background spinners (ladder: 0, 1, 2, 4, 8), CFS versus Temper. Quota-limited postgres (requests=limits=1500m, Critical tier). 2× c2-standard-4, 2026-07-02. Labeled points: CFS 33.6 ms, Temper 4.3 ms.

5–7× lower p99 at every step, including the idle node (33.6 vs 4.3 ms at bg=0). Mechanism verified with kernel counters: in a 20 s window the CFS arm's cgroup logged nr_throttled +199 and 16.48 s of throttled time; the Temper arm logged zero of both — see the disclosure below for what that zero does and does not mean. Single run per arm. source: docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
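The counter check above amounts to diffing two snapshots of the cgroup's cpu.stat file. A minimal sketch, assuming cgroup v2 (where cpu.stat exposes nr_throttled and throttled_usec as "key value" lines); the snapshot contents below are synthetic, chosen only so the delta matches the published CFS-arm figures, and the function names are illustrative.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse cgroup v2 cpu.stat content ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_delta(before: str, after: str) -> tuple:
    """Return (throttled periods, throttled seconds) between two cpu.stat snapshots."""
    b, a = parse_cpu_stat(before), parse_cpu_stat(after)
    periods = a["nr_throttled"] - b["nr_throttled"]
    seconds = (a["throttled_usec"] - b["throttled_usec"]) / 1_000_000
    return periods, seconds


# Synthetic snapshots (NOT raw data from the benchmark record).
before = "usage_usec 1000000\nnr_periods 500\nnr_throttled 1000\nthrottled_usec 50000000\n"
after = "usage_usec 31000000\nnr_periods 700\nnr_throttled 1199\nthrottled_usec 66480000\n"

print(throttle_delta(before, after))  # (199, 16.48)
```

On a live node the two snapshots would be read from the pod's cpu.stat file at the start and end of the measurement window.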

Doubling the client pressure (-c 16 -j 4) shows the cliff's defining property: it is self-inflicted and scales with load, not with neighbors. CFS p99 sat at 67–68 ms at every background step — flat, because the throttling is the workload's own quota, not contention. Temper started at 13.4 ms and degraded gracefully to 47.2 ms as sixteen clients genuinely exceeded what a 1500m pod can serve under background pressure — worse than its own light-load numbers, still under CFS at every step. We publish the degradation because it is the honest shape: removing the freeze-cliff does not repeal queueing theory.

The same pattern held on two more engines:

The disclosure: cpu.max is not kernel-enforced under sched_ext

Now the part that a skeptical reader should press on. The Temper arm's “zero throttles” is a frozen counter, not a pacing result. CFS bandwidth control is fair-class machinery: quota is charged only on the fair scheduling class's accounting path, sched_ext tasks never run on a fair runqueue, and so they never charge the quota at all — verified against the kernel source on 6.12, and the 6.17 ops.cgroup_set_bandwidth interface is notification-only. While scx_layered is attached, the kernel does not enforce cpu.max and the cgroup's throttle counters stop moving. This is a property of the kernel feature, not a Temper choice, and it means the two benchmark arms above were not limit-identical: the CFS arm was quota-throttled, the Temper arm was bounded by its layer placement instead.
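Because enforcement depends on whether a sched_ext scheduler is attached, it helps to check that state before interpreting throttle counters. A minimal sketch, assuming the sysfs layout described in the kernel's sched_ext documentation (a state file under /sys/kernel/sched_ext and the attached scheduler's name under root/ops); the function names are illustrative and not part of Temper.

```python
from pathlib import Path

SCX_SYSFS = Path("/sys/kernel/sched_ext")  # assumed sysfs location


def scx_attached(sysfs_root: Path = SCX_SYSFS) -> bool:
    """True if the sched_ext state file reports an enabled scheduler."""
    try:
        return (Path(sysfs_root) / "state").read_text().strip() == "enabled"
    except OSError:
        return False  # kernel without sched_ext, or sysfs not readable


def scx_scheduler_name(sysfs_root: Path = SCX_SYSFS):
    """Name of the attached scheduler, or None if it cannot be read."""
    try:
        return (Path(sysfs_root) / "root" / "ops").read_text().strip() or None
    except OSError:
        return None


def describe_cpu_max(sysfs_root: Path = SCX_SYSFS) -> str:
    """One-line summary of how to read cpu.max and throttle counters on this node."""
    if scx_attached(sysfs_root):
        name = scx_scheduler_name(sysfs_root) or "unknown"
        return f"sched_ext attached ({name}): cpu.max is not kernel-enforced"
    return "no sched_ext scheduler attached: cpu.max is enforced by CFS bandwidth control"


print(describe_cpu_max())
```

A monitoring job could attach this line to any dashboard that shows nr_throttled, so a flat counter is not mistaken for a pod that never bursts.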

What bounds a pod under Temper is the layer ceiling: a Confined layer's cpus_range and utilization band cap where and how much the pod's threads run. On the MySQL run that ceiling (a whole-core [2,2] allocation) worked out to roughly 1.9 effective cores against a 1.5-core quota — so that record cannot apportion how much of its 4× win came from removing the freeze versus the extra fraction of a core, and the report says so.

So we measured the question directly. The quota-parity measurement (scheduler v15, 2026-07-04): the same 1.5-CPU Guaranteed MySQL pod on a c2-standard-8 with four BestEffort spinners, 120 s per arm, consumption read from the pod's own cpu.stat usage_usec delta:

Arm            Cores consumed   Quota                                 tps   p95
CFS            1.486            1.5 (kernel-enforced)                 675   56.8 ms
Temper (v15)   1.353            1.5 (not kernel-enforced under scx)   592   30.8 ms

The Temper arm consumed less than its quota: on this shape the whole-core layer ceiling binds below cpu.max, so there is no quota free-lunch in the latency win. The honest trade: −12% throughput for −46% p95 — confinement removes the refill-stall cliff and also the burst headroom. source: docs/training-artifacts/mysql-oltp/REPORT.md (quota-parity addendum) · raw: mysql-oltp/quota-parity-v15/qp.txt
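The arithmetic behind those figures is short enough to check by hand. A minimal sketch: the tps and p95 inputs are the published values from the quota-parity table, while the usage_usec values in the first call are synthetic and only illustrate the formula.

```python
def cores_consumed(usage_usec_before: int, usage_usec_after: int, window_s: float) -> float:
    """Average cores used over a window, from a cpu.stat usage_usec delta."""
    return (usage_usec_after - usage_usec_before) / (window_s * 1_000_000)


def pct_change(baseline: float, value: float) -> float:
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100


# Synthetic example: 180 CPU-seconds over a 120 s arm is 1.5 cores.
print(cores_consumed(0, 180_000_000, 120))  # 1.5

# Published quota-parity values, CFS as the baseline.
print(round(pct_change(675, 592)))    # -12  (throughput, tps)
print(round(pct_change(56.8, 30.8)))  # -46  (p95 latency, ms)
```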

This upgrades the claim from an estimate to a measurement, and narrows it honestly: “measured consumption at-or-below quota on the tested shapes,” not “limits enforced.” The layer ceiling is derived from requests today, not from cpu.max; a pod whose limit sits far below the whole-core granularity of its layer could consume above its limit. Three things are unconditional: memory limits are untouched (only CPU quota semantics change), usage accounting keeps working, and the safe-mode kill switch restores CFS with quotas instantly. On the roadmap: deriving layer ceilings from cpu.max so limits are honored equivalently, and implementing the ≥6.17 bandwidth callback. If strict CPU quota enforcement is a compliance requirement on a node, do not attach Temper to that node — mixed fleets are fully supported.
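The granularity caveat can be made concrete with a rough sketch. It ASSUMES, purely for illustration, that a layer's ceiling is the pod's request rounded up to whole cores; the source states only that ceilings are whole-core and derived from requests, so the rounding rule and function names here are hypothetical. The result is a nominal upper bound, not a measurement: on the tested MySQL shape, measured consumption stayed below quota.

```python
import math


def whole_core_ceiling(request_millicores: int) -> int:
    """HYPOTHETICAL rule: request rounded up to whole cores, at least one core."""
    return max(1, math.ceil(request_millicores / 1000))


def nominal_exposure_cores(request_millicores: int, limit_millicores: int) -> float:
    """Cores by which the assumed whole-core ceiling exceeds the cpu.max limit."""
    ceiling = whole_core_ceiling(request_millicores)
    return max(0.0, ceiling - limit_millicores / 1000)


print(nominal_exposure_cores(1500, 1500))  # 0.5
print(nominal_exposure_cores(250, 250))    # 0.75
print(nominal_exposure_cores(2000, 2000))  # 0.0
```

Under this assumption, small pods carry the largest gap between limit and ceiling relative to their size, which is the case the caveat warns about.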

Caveats that travel with these numbers

Raw records

  • docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
  • docs/training-artifacts/headroom/gke-c2-pgbench-c16/REPORT.md
  • docs/training-artifacts/mysql-oltp/REPORT.md (incl. quota-parity addendum)
  • docs/training-artifacts/mysql-oltp/quota-parity-v15/qp.txt
  • docs/training-artifacts/cassandra/REPORT.md
  • docs/security/WHITEPAPER.md §8.0 (the canonical cpu.max disclosure)

Committed benchmark records in the product repository; design partners get the full artifact tree.