02 Deep dive

CPU limits without the freeze-cliff

Teams delete CPU limits because CFS quota enforcement freezes their databases mid-transaction. We measured the freeze with kernel counters on PostgreSQL, MySQL, and Cassandra. This article is also the full account of the one behavioral difference you must understand before running Temper: cpu.max is not kernel-enforced while a sched_ext scheduler is attached.

workloads: PostgreSQL · MySQL · Cassandra | clusters: GKE (c2) | records: 5 reports

What a CPU limit actually does

A Kubernetes CPU limit becomes cgroup cpu.max: a quota of CPU-microseconds per enforcement period (100 ms by default). A 1500m limit, for example, is 150 ms of CPU time per 100 ms period. CFS bandwidth control charges every fair-class thread's runtime against the quota; when it is exhausted, the kernel freezes every thread in the cgroup until the next period refill. The freeze is indiscriminate: for a database it lands mid-transaction, while locks are held, and every queued client behind that transaction inherits the stall. This is why quota-limited databases show 30–70 ms p99 cliffs on nodes that are otherwise idle. No noisy neighbor is required; the workload's own burst walks off the quota cliff.
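The limit-to-quota mapping can be sketched in a few lines of Python. This is a minimal illustration, assuming the default 100 ms period and a limit expressed in millicores; the function name cpu_max_for_limit is hypothetical and not part of Kubernetes or Temper. The output format ("quota period", or "max period" when there is no limit) is the one cgroup v2 uses for cpu.max.

```python
from typing import Optional


def cpu_max_for_limit(limit_millicores: Optional[int], period_us: int = 100_000) -> str:
    """Return the cgroup v2 cpu.max string for a CPU limit in millicores.

    Hypothetical helper for illustration only. None means "no limit",
    which cgroup v2 writes as the literal word "max".
    """
    if limit_millicores is None:
        return f"max {period_us}"
    # 1000 millicores = one full CPU = one whole period of runtime per period.
    quota_us = limit_millicores * period_us // 1000
    return f"{quota_us} {period_us}"


print(cpu_max_for_limit(1500))  # 150000 100000
print(cpu_max_for_limit(None))  # max 100000
```

Deleting the limit corresponds to the second call: the quota field becomes max and bandwidth control has nothing to enforce.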

The industry's standard answer is to delete limits, which trades the cliff for unbounded contention — and is exactly the fear that keeps requests padded (see the sideloading article). The interesting question is whether you can keep containment without the freeze.

The freeze, verified with kernel counters

[Figure: PostgreSQL under CPU limits, throttle tails eliminated. pgbench (-c 8) p99, quota-limited postgres (requests=limits=1500m, Critical tier), background ladder. 2× c2-standard-4, 2026-07-02. Series: CFS, Temper.]

5–7× lower p99 at every step, including the idle node (33.6 vs 4.3 ms at bg=0). Mechanism verified with kernel counters: in a 20 s window the CFS arm's cgroup logged nr_throttled +199 and 16.48 s of throttled time; the Temper arm logged zero of both — see the disclosure below for what that zero does and does not mean. Single run per arm. source: docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
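The counter deltas quoted above come from the cgroup's cpu.stat file, which on cgroup v2 exposes nr_throttled and throttled_usec as "key value" lines. A minimal Python sketch of the windowed read follows. The helper names are hypothetical, and the absolute counter values in the sample snapshots are placeholders chosen so that the deltas match the reported +199 and 16.48 s; only the deltas come from the record.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse cgroup v2 cpu.stat content ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_delta(before: str, after: str) -> tuple:
    """Return (throttle events, throttled seconds) between two cpu.stat snapshots."""
    b, a = parse_cpu_stat(before), parse_cpu_stat(after)
    events = a["nr_throttled"] - b["nr_throttled"]
    seconds = (a["throttled_usec"] - b["throttled_usec"]) / 1_000_000
    return events, seconds


# Placeholder snapshots: absolute values are illustrative, deltas match the report.
snapshot_start = "nr_periods 5000\nnr_throttled 1000\nthrottled_usec 50000000\n"
snapshot_end = "nr_periods 5200\nnr_throttled 1199\nthrottled_usec 66480000\n"

print(throttle_delta(snapshot_start, snapshot_end))  # (199, 16.48)
```

In practice the two snapshots would be read from the pod's cpu.stat at the start and end of the window; the path depends on the node's cgroup layout, so it is left out here.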

Doubling the client pressure (-c 16 -j 4) shows the cliff's defining property: it is self-inflicted and scales with load, not with neighbors. CFS p99 sat at 67–68 ms at every background step — flat, because the throttling is the workload's own quota, not contention. Temper started at 13.4 ms and degraded gracefully to 47.2 ms as sixteen clients genuinely exceeded what a 1500m pod can serve under background pressure — worse than its own light-load numbers, still under CFS at every step. We publish the degradation because it is the honest shape: removing the freeze-cliff does not repeal queueing theory.

The same pattern held on two more engines:

MySQL (sysbench oltp_read_write, 8 threads, 1500m requests=limits): CFS p99 60.0–65.7 ms at every step; Temper 15.6–16.7 ms, dead flat through bg=8 while the node ran at 0.99 utilization. Kernel counters in a 20 s window: CFS nr_throttled +200, 17.93 s throttled.
Cassandra (JVM, 3-core quota, tier-only, no profile): the CFS arm hit nr_throttled +228 (7.55 s throttled) in 20 s on an idle node. A JVM that sizes its pools from availableProcessors()=4 while living under a 3-core quota is throttle bait by construction — this pod ran 84 threads. Idle-node p99: 11.8 vs 2.6 ms (a manual re-check got 10.9 ms, so the honest idle delta is −69…−78%). The JVM thread story continues in the workload-profiles article.
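The "throttle bait" arithmetic can be made concrete with an idealized model. The sketch below assumes the cgroup keeps a fixed number of CPUs fully busy for the whole period and ignores the kernel's slice-level accounting, so it is a back-of-envelope bound and not a prediction of the measured numbers. The function name is hypothetical.

```python
def stall_per_period_ms(busy_cpus: float, quota_cores: float, period_ms: float = 100.0) -> float:
    """Idealized stall per enforcement period when busy_cpus run flat out under a quota.

    Assumption (illustrative only): the cgroup's threads occupy busy_cpus CPUs
    continuously, so the quota drains busy_cpus times faster than wall time.
    """
    if busy_cpus <= quota_cores:
        return 0.0  # the quota is never exhausted within the period
    quota_ms = quota_cores * period_ms   # CPU time available per period
    run_wall_ms = quota_ms / busy_cpus   # wall time until the quota is gone
    return period_ms - run_wall_ms       # frozen until the next refill


# Pools sized for 4 processors under a 3-core quota:
print(stall_per_period_ms(4, 3))  # 25.0
# Parallelism at or below the quota never stalls in this model:
print(stall_per_period_ms(3, 3))  # 0.0
```

Under these assumptions a pool that fills all four CPUs burns the 300 ms quota in 75 ms of wall time and sits frozen for the remaining 25 ms of each period, which is why the mismatch between availableProcessors() and the quota matters even on an idle node.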

The disclosure: cpu.max is not kernel-enforced under sched_ext

Now the part that a skeptical reader should press on. The Temper arm's “zero throttles” is a frozen counter, not a pacing result. CFS bandwidth control is fair-class machinery: quota is charged only on the fair scheduling class's accounting path, sched_ext tasks never run on a fair runqueue, and so they never charge the quota at all — verified against the kernel source on 6.12, and the 6.17 ops.cgroup_set_bandwidth interface is notification-only. While scx_layered is attached, the kernel does not enforce cpu.max and the cgroup's throttle counters stop moving. This is a property of the kernel feature, not a Temper choice, and it means the two benchmark arms above were not limit-identical: the CFS arm was quota-throttled, the Temper arm was bounded by its layer placement instead.

What bounds a pod under Temper is the layer ceiling: a Confined layer's cpus_range and utilization band cap where and how much the pod's threads run. On the MySQL run that ceiling (a whole-core [2,2] allocation) worked out to roughly 1.9 effective cores against a 1.5-core quota — so that record cannot apportion how much of its 4× win came from removing the freeze versus the extra fraction of a core, and the report says so.

So we measured the question directly. The quota-parity measurement (scheduler v15, 2026-07-04): the same 1.5-CPU Guaranteed MySQL pod on a c2-standard-8 with four BestEffort spinners, 120 s per arm, consumption read from the pod's own cpu.stat usage_usec delta:

Arm           Cores consumed  Quota (cores)                        Throughput (tps)  p95 (ms)
CFS           1.486           1.5 (kernel-enforced)                675               56.8
Temper (v15)  1.353           1.5 (not kernel-enforced under scx)  592               30.8

The Temper arm consumed less than its quota: on this shape the whole-core layer ceiling binds below cpu.max, so there is no quota free-lunch in the latency win. The honest trade: −12% throughput for −46% p95 — confinement removes the refill-stall cliff and also the burst headroom. source: docs/training-artifacts/mysql-oltp/REPORT.md (quota-parity addendum) · raw: mysql-oltp/quota-parity-v15/qp.txt
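The arithmetic behind the table can be reproduced with a short Python sketch. The helper names are hypothetical, and the usage_usec delta in the example is back-computed from the reported 1.486 cores over 120 s for illustration; it is not a value taken from the raw record.

```python
def cores_consumed(usage_usec_start: int, usage_usec_end: int, window_s: float) -> float:
    """Average cores used over a window, from a cpu.stat usage_usec delta."""
    return (usage_usec_end - usage_usec_start) / (window_s * 1_000_000)


def pct_change(baseline: float, value: float) -> float:
    """Signed percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


# Illustrative delta: 178.32 s of CPU time in a 120 s window (back-computed).
print(round(cores_consumed(0, 178_320_000, 120), 3))  # 1.486

# Reported trade, CFS arm as the baseline:
print(round(pct_change(675, 592)))    # -12  (throughput, tps)
print(round(pct_change(56.8, 30.8)))  # -46  (p95, ms)
```

Reading consumption from the pod's own usage_usec is what makes the comparison meaningful under sched_ext: usage accounting keeps working even though the throttle counters do not move.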

This upgrades the claim from an estimate to a measurement, and narrows it honestly: “measured consumption at-or-below quota on the tested shapes,” not “limits enforced.” The layer ceiling is derived from requests today, not from cpu.max; a pod whose limit sits far below the whole-core granularity of its layer could consume above its limit. Three things are unconditional: memory limits are untouched (only CPU quota semantics change), usage accounting keeps working, and the safe-mode kill switch restores CFS with quotas instantly. On the roadmap: deriving layer ceilings from cpu.max so limits are honored equivalently, and implementing the ≥6.17 bandwidth callback. If strict CPU quota enforcement is a compliance requirement on a node, do not attach Temper to that node — mixed fleets are fully supported.

Caveats that travel with these numbers

Single run per arm throughout; 20–120 s windows. The throttle-counter deltas are ~10× above noise; the p99 deltas are 4–7×.
p99/p95 figures come from the load tools' own client-side percentiles; closed-loop clients are subject to coordinated omission — identically in both arms.
Cassandra: single-node ring, RF=1, fsync-light, 4-CPU node — a contained lab shape, not a production ring. Its first CFS window (11.8 ms) was likely elevated by post-seed compaction; the re-checked idle point is 10.9 ms.
The MySQL Temper arm ran with the mysql-innodb workload profile active; a tier-only comparison arm was not run, so the profile's marginal contribution is not isolated in that record.

Raw records

docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
docs/training-artifacts/headroom/gke-c2-pgbench-c16/REPORT.md
docs/training-artifacts/mysql-oltp/REPORT.md (incl. quota-parity addendum)
docs/training-artifacts/mysql-oltp/quota-parity-v15/qp.txt
docs/training-artifacts/cassandra/REPORT.md
docs/security/WHITEPAPER.md §8.0 (the canonical cpu.max disclosure)

Committed benchmark records in the product repository; design partners get the full artifact tree.