02 Deep dive
CPU limits without the freeze-cliff
Teams delete CPU limits because CFS quota enforcement freezes their databases
mid-transaction. We measured the freeze with kernel counters on PostgreSQL, MySQL, and
Cassandra. This article also documents, in full, the one behavioral difference you
must understand before running Temper: cpu.max is not kernel-enforced while a
sched_ext scheduler is attached.
What a CPU limit actually does
A Kubernetes CPU limit becomes cgroup cpu.max: a quota of CPU-microseconds per
enforcement period (100 ms by default). CFS bandwidth control charges every fair-class
thread's runtime against the quota; when it is exhausted, the kernel freezes every thread
in the cgroup until the next period refill. The freeze is indiscriminate — for a
database it lands mid-transaction, while locks are held, and every queued client behind that
transaction inherits the stall. This is why quota-limited databases show 30–70 ms
p99 cliffs on nodes that are otherwise idle: no noisy neighbor is required, the
workload's own burst walks off the quota cliff.
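To make the arithmetic behind the cliff concrete, here is a minimal sketch of how a millicore limit maps to a cpu.max value and how long a burst would sit stalled per period once the quota is gone. The function names are hypothetical, and the stall model is an idealization: it assumes the threads run flat out on `busy_cpus` CPUs from the start of the period and ignores the kernel's slice-based quota distribution.

```python
def cpu_max_from_millicores(limit_millicores: int, period_us: int = 100_000) -> str:
    """Render a CPU limit as the 'quota period' pair written to cpu.max."""
    quota_us = limit_millicores * period_us // 1000
    return f"{quota_us} {period_us}"


def stall_per_period_ms(limit_millicores: int, busy_cpus: int,
                        period_us: int = 100_000) -> float:
    """Idealized stall per period when busy_cpus CPUs burn quota in parallel.

    Assumption (not from the article): the burst starts at the period
    boundary and every busy CPU consumes quota at wall-clock rate.
    """
    quota_us = limit_millicores * period_us // 1000
    # Quota drains busy_cpus times faster than wall-clock time.
    exhausted_after_us = quota_us / busy_cpus
    # Whatever remains of the period is spent throttled.
    return max(0.0, period_us - exhausted_after_us) / 1000


print(cpu_max_from_millicores(1500))   # 150000 100000
print(stall_per_period_ms(1500, 4))    # 62.5
print(stall_per_period_ms(1500, 1))    # 0.0
```

Under these assumptions a 1500m pod bursting across four CPUs exhausts its 150 ms of quota after 37.5 ms and waits out the remaining 62.5 ms, which is the same order of magnitude as the cliffs described above. A single busy thread never reaches the quota at all.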
The industry's standard answer is to delete limits, which trades the cliff for unbounded contention — and is exactly the fear that keeps requests padded (see the sideloading article). The interesting question is whether you can keep containment without the freeze.
The freeze, verified with kernel counters
PostgreSQL under CPU limits: throttle tails eliminated
pgbench (-c 8) p99, quota-limited postgres (requests=limits=1500m, Critical tier), background ladder. 2× c2-standard-4, 2026-07-02.
5–7× lower p99 at every step, including the idle node (33.6 vs 4.3 ms at bg=0).
Mechanism verified with kernel counters: in a 20 s window the CFS arm's cgroup logged
nr_throttled +199 and 16.48 s of throttled time; the Temper arm logged zero of both —
see the disclosure below for what that zero does and does not mean. Single run per arm.
source: docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
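For readers who want to reproduce the counter check, the sketch below diffs two snapshots of a cgroup v2 cpu.stat file. The snapshot values are synthetic, chosen only so the deltas match the +199 and 16.48 s figures quoted above; the nr_periods delta of 200 assumes the default 100 ms period over a 20 s window. The helper names are hypothetical.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_delta(before: str, after: str) -> dict[str, float]:
    """Return throttle-counter deltas between two cpu.stat snapshots."""
    b, a = parse_cpu_stat(before), parse_cpu_stat(after)
    return {
        "nr_periods": a["nr_periods"] - b["nr_periods"],
        "nr_throttled": a["nr_throttled"] - b["nr_throttled"],
        "throttled_s": (a["throttled_usec"] - b["throttled_usec"]) / 1e6,
    }


# Synthetic snapshots (illustrative only, not taken from the report).
BEFORE = """usage_usec 50000000
nr_periods 1000
nr_throttled 40
throttled_usec 3000000"""

AFTER = """usage_usec 79000000
nr_periods 1200
nr_throttled 239
throttled_usec 19480000"""

delta = throttle_delta(BEFORE, AFTER)
print(delta["nr_throttled"], delta["throttled_s"])
```

In practice the two snapshots would be read from the pod's cgroup directory at the start and end of the window. A zero delta only proves the counters did not move; as the disclosure section explains, that is not the same as proving the workload stayed within its quota.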
Doubling the client pressure (-c 16 -j 4) shows the cliff's defining property:
it is self-inflicted and scales with load, not with neighbors. CFS p99 sat at
67–68 ms at every background step — flat, because the throttling is the
workload's own quota, not contention. Temper started at 13.4 ms and degraded
gracefully to 47.2 ms as sixteen clients genuinely exceeded what a 1500m pod can
serve under background pressure — worse than its own light-load numbers, still under
CFS at every step. We publish the degradation because it is the honest shape: removing the
freeze-cliff does not repeal queueing theory.
The same pattern held on two more engines:
- MySQL (sysbench oltp_read_write, 8 threads, 1500m requests=limits): CFS p99 60.0–65.7 ms at every step; Temper 15.6–16.7 ms, dead flat through bg=8 while the node ran at 0.99 utilization. Kernel counters in a 20 s window: CFS nr_throttled +200, 17.93 s throttled.
- Cassandra (JVM, 3-core quota, tier-only, no profile): the CFS arm hit nr_throttled +228 (7.55 s throttled) in 20 s on an idle node. A JVM that sizes its pools from availableProcessors()=4 while living under a 3-core quota is throttle bait by construction; this pod ran 84 threads. Idle-node p99: 11.8 vs 2.6 ms (a manual re-check got 10.9 ms, so the honest idle delta is −69…−78%). The JVM thread story continues in the workload-profiles article.
The disclosure: cpu.max is not kernel-enforced under sched_ext
Now the part that a skeptical reader should press on. The Temper arm's
“zero throttles” is a frozen counter, not a pacing result. CFS bandwidth
control is fair-class machinery: quota is charged only on the fair scheduling class's
accounting path, sched_ext tasks never run on a fair runqueue, and so they never charge the
quota at all — verified against the kernel source on 6.12, and the 6.17
ops.cgroup_set_bandwidth interface is notification-only. While
scx_layered is attached, the kernel does not enforce cpu.max
and the cgroup's throttle counters stop moving. This is a property of the kernel feature,
not a Temper choice, and it means the two benchmark arms above were not limit-identical:
the CFS arm was quota-throttled, the Temper arm was bounded by its layer placement instead.
What bounds a pod under Temper is the layer ceiling: a Confined layer's
cpus_range and utilization band cap where and how much the pod's threads run.
On the MySQL run that ceiling (a whole-core [2,2] allocation) worked out to roughly 1.9
effective cores against a 1.5-core quota — so that record cannot apportion how much of
its 4× win came from removing the freeze versus the extra fraction of a core, and the
report says so.
So we measured the question directly. The quota-parity measurement (scheduler v15,
2026-07-04): the same 1.5-CPU Guaranteed MySQL pod on a c2-standard-8 with four BestEffort
spinners, 120 s per arm, consumption read from the pod's own
cpu.stat usage_usec delta:
| Arm | Cores consumed | Quota | tps | p95 |
|---|---|---|---|---|
| CFS | 1.486 | 1.5 (kernel-enforced) | 675 | 56.8 ms |
| Temper (v15) | 1.353 | 1.5 (not kernel-enforced under scx) | 592 | 30.8 ms |
The Temper arm consumed less than its quota: on this shape the whole-core layer ceiling binds below
cpu.max, so there is no quota free-lunch in the latency win. The honest trade: −12% throughput
for −46% p95 — confinement removes the refill-stall cliff and also the burst headroom.
source: docs/training-artifacts/mysql-oltp/REPORT.md (quota-parity addendum) · raw: mysql-oltp/quota-parity-v15/qp.txt
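The table's derived figures can be rechecked with a few lines of arithmetic. This is a sketch with hypothetical helper names; the usage_usec delta used for the CFS arm is back-computed from the reported 1.486 cores over 120 s, not copied from qp.txt.

```python
def cores_consumed(usage_usec_start: int, usage_usec_end: int, window_s: float) -> float:
    """Average cores used: CPU-microseconds consumed per wall-clock microsecond."""
    return (usage_usec_end - usage_usec_start) / (window_s * 1_000_000)


def pct_change(baseline: float, candidate: float) -> float:
    """Signed percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100


# Back-computed illustration: 1.486 cores * 120 s = 178.32 CPU-seconds.
print(round(cores_consumed(0, 178_320_000, 120), 3))  # 1.486

# Temper vs CFS, using the tps and p95 columns of the table.
print(round(pct_change(675, 592)))    # -12
print(round(pct_change(56.8, 30.8)))  # -46
```

Reading consumption from the pod's own usage_usec keeps the comparison independent of the throttle counters, which matters here because those counters do not move while a sched_ext scheduler is attached.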
This upgrades the claim from an estimate to a measurement, and narrows it honestly:
“measured consumption at-or-below quota on the tested shapes,” not
“limits enforced.” The layer ceiling is derived from requests today, not from
cpu.max; a pod whose limit sits far below the whole-core granularity of its
layer could consume above its limit. Three things are unconditional: memory limits are
untouched (only CPU quota semantics change), usage accounting keeps working, and the
safe-mode kill switch restores CFS with quotas
instantly. On the roadmap: deriving layer ceilings from cpu.max so limits are
honored equivalently, and implementing the ≥6.17 bandwidth callback. If strict CPU quota
enforcement is a compliance requirement on a node, do not attach Temper to that node —
mixed fleets are fully supported.
Caveats that travel with these numbers
- Single run per arm throughout; 20–120 s windows. The throttle-counter deltas are ~10× above noise; the p99 deltas are 4–7×.
- p99/p95 figures come from the load tools' own client-side percentiles; closed-loop clients are subject to coordinated omission — identically in both arms.
- Cassandra: single-node ring, RF=1, fsync-light, 4-CPU node — a contained lab shape, not a production ring. Its first CFS window (11.8 ms) was likely elevated by post-seed compaction; the re-checked idle point is 10.9 ms.
- The MySQL Temper arm ran with the mysql-innodb workload profile active; a tier-only comparison arm was not run, so the profile's marginal contribution is not isolated in that record.
Raw records
- docs/training-artifacts/headroom/gke-c2-pgbench/REPORT.md
- docs/training-artifacts/headroom/gke-c2-pgbench-c16/REPORT.md
- docs/training-artifacts/mysql-oltp/REPORT.md (incl. quota-parity addendum)
- docs/training-artifacts/mysql-oltp/quota-parity-v15/qp.txt
- docs/training-artifacts/cassandra/REPORT.md
- docs/security/WHITEPAPER.md §8.0 (the canonical cpu.max disclosure)
Committed benchmark records in the product repository; design partners get the full artifact tree.