01 Deep dive

Sideloading: harvesting idle capacity without paying for it in tail latency

Most clusters run their latency-critical nodes at a fraction of capacity on purpose — the padding is fear of the scheduler. This article is about running batch work in those idle cycles and taking it back the microsecond the protected service wakes, with the measured ladders that show the tail does not move.

workloads: memcached · redis · batch fillers
clusters: GKE (e2, c2, c3) · EKS (m5)
records: 8 reports

Why clusters idle: requests are not usage

Kubernetes packs nodes by declared CPU requests, and requests are padded — routinely two to four times above real usage — because a request is the only protection a pod has. Once two pods share a node, the request stops mattering and the Linux scheduler arbitrates every microsecond. Nobody trusts CFS with a tight node, so everybody buys slack: latency-critical services get whole nodes' worth of headroom that sits idle almost all of the time.
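To make the gap between requests and usage concrete, here is a minimal arithmetic sketch. The pod size, usage, and node size are hypothetical numbers chosen for illustration, not measurements from the reports.

```python
def pack_by_request(node_cpu_m, request_m, usage_m):
    """Place identical pods on one node by declared request.

    Returns (pods placed, real utilization of the node).
    All quantities are in millicores.
    """
    pods = node_cpu_m // request_m          # the scheduler only sees requests
    utilization = pods * usage_m / node_cpu_m
    return pods, utilization

# Hypothetical pod: requests 1000m but really uses 300m (about 3.3x padding),
# placed on a 4000m node.
pods, utilization = pack_by_request(node_cpu_m=4000, request_m=1000, usage_m=300)
print(pods, utilization)
```

With these assumed numbers the node is "full" at four pods while running at 30% of its real capacity; the remaining 70% is the padding the rest of the article tries to recover.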

Sideloading is the obvious-sounding fix: put Background-tier batch work — builds, encoding, analytics, spinner-shaped anything — into the idle cycles of the protected nodes, under the condition that it is throttled the instant the protected tier has demand. The condition is the whole problem. If the batch work can delay even one wakeup of the protected service, the padding was cheaper.

Why CFS cannot give you that condition

CFS arbitrates with proportional weights (cpu.weight, the successor of cpu.shares): a Guaranteed pod outweighs a BestEffort spinner by a large ratio, and over a scheduling period each side receives CPU in that ratio. Proportional is not preemptive-priority. Three consequences matter:

Weights divide time; they do not answer wakeups. When the protected service's thread wakes, the batch thread currently on the CPU can be allowed to finish its current time slice before it is preempted. On a loaded node, that wait is the p99.
Wakeup placement degrades under load. The idle-CPU search that makes CFS fast on an empty node has less and less to find; woken threads increasingly queue behind running ones instead of starting immediately.
BestEffort still contends. A weight of 1 is not a weight of 0 — the spinners and the service time-slice the same runqueues, so the primary's tail tracks how full the node is. That is exactly the shape our CFS baselines show below.
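The point that a weight of 1 is not a weight of 0 can be sketched numerically. The snippet below assumes the linear cpu.shares to cpu.weight mapping long used by container runtimes (newer runtime versions may convert differently) and flattens the cgroup hierarchy to two competing entities, which is a simplification; the entity names are illustrative.

```python
def shares_to_weight(cpu_shares):
    """Map cgroup v1 cpu.shares [2, 262144] to cgroup v2 cpu.weight [1, 10000].

    Assumption: the linear conversion historically used by runtimes.
    """
    return 1 + ((cpu_shares - 2) * 9999) // 262142

def contended_share(weights):
    """Fraction of CPU each entity gets when all of them are runnable."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

weights = {
    "primary": shares_to_weight(2048),   # a 2-CPU request maps to 2048 shares
    "spinner": shares_to_weight(2),      # BestEffort gets the minimum, 2 shares
}
shares = contended_share(weights)
print(weights, shares)
```

Under these assumptions the primary gets weight 79 and the spinner weight 1, so a fully contended CPU still hands the spinner 1/80 of the time. The ratio bounds long-run share; it says nothing about how quickly a woken primary thread gets on the CPU.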

Temper's node engine replaces this arbitration with explicit layers in the kernel's scheduling path (scx_layered on sched_ext): the Critical tier gets a protected, fenced layer; Background gets an Open layer that runs only on CPUs the fence is not using; and a carry-patch we call protected-while-busy makes the loan honest — a protected layer's CPUs are loaned out only when that layer is idle everywhere, and loaned CPUs are preempt-kicked back the moment the layer's idle→demand transition fires. (An earlier version used a 20 ms loan slice; it showed up as the woken owner's p99 — 21.4 ms, reproduced three times — and was replaced by the preempt-kick. That bug and its fix are in the committed reports.)
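The protected-while-busy rule described above can be modeled as a small state machine. This is a toy user-space sketch of the stated policy only (loan when idle everywhere, kick on the idle-to-demand transition); it is not the scx_layered code, and the class and method names are hypothetical.

```python
class ProtectedLayer:
    """Toy model of a fenced layer that lends its CPUs only while fully idle."""

    def __init__(self, cpus):
        self.cpus = set(cpus)
        self.busy = set()     # CPUs currently running protected threads
        self.loaned = set()   # CPUs currently lent to the Open layer

    def idle_everywhere(self):
        return not self.busy

    def try_loan(self):
        """Lend every CPU, but only if the layer is idle on all of them."""
        if self.idle_everywhere():
            self.loaned = set(self.cpus)
        return set(self.loaned)

    def on_wakeup(self, cpu):
        """A protected thread wakes on `cpu`.

        On the idle -> demand transition, all loaned CPUs are reclaimed at
        once; the returned set is what would be preempt-kicked.
        """
        was_idle = self.idle_everywhere()
        self.busy.add(cpu)
        kicked = set()
        if was_idle and self.loaned:
            kicked, self.loaned = self.loaned, set()
        return kicked

    def on_sleep(self, cpu):
        self.busy.discard(cpu)


layer = ProtectedLayer(cpus=[0, 1])
print(layer.try_loan())     # both CPUs lent while the layer is idle
print(layer.on_wakeup(0))   # demand returns: both loans are kicked
print(layer.try_loan())     # nothing is lent while the layer is busy
```

The design choice the sketch highlights is that reclaim is event-driven: the loan ends at the wakeup itself rather than when a loan slice expires, which is the difference between the preempt-kick and the earlier 20 ms slice.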

The multi-node ladder: flat p99 at 0.81–0.92 utilization

Figure: memcached p99 vs. batch-filler density, multi-node. Headline: Temper flat · CFS 3.1×.

3-node GKE cluster (e2-standard-4), two memcached primaries under memtier load, batch-filler ladder 0→18 plus 6 background spinners. Node utilization 0.81–0.92.

Series: CFS (stock kube-scheduler) · Temper (scx_layered).

Single run per arm; ±0.2 ms is noise, the flat-vs-3.1× delta is far outside it. Open anomaly, kept in the report: the second memcached instance tracked CFS (~1.6 ms) in this GKE run and is not root-caused; the asymmetry did not reproduce on EKS, where both instances held near-flat (1.2–1.3× vs CFS ~3×).
source: docs/training-artifacts/binpack/REPORT.md · EKS replication: docs/training-artifacts/binpack/records/eks/

The same ladder was replicated on EKS with two instrument fixes worth knowing about because they are the kind of thing that invalidates benchmarks: the memtier client was CFS-throttled by its own 500m limit (masking the signal), and the first nodegroup spanned availability zones (a ~1 ms cross-AZ RTT floor swamped a sub-millisecond effect). With both fixed, the EKS run shows the flat line directly: p99 held 0.343–0.399 ms from node utilization 0.36 to 0.95 while background delivered ~3.9 cores of a 4-vCPU node; CFS paid +33% tail for the same reclaim. An earlier EKS run with the broken instrument is committed with a “do not cite” banner — catching our own bad runs is part of the method.
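A throttled load generator like the one described above can be caught before a run is trusted by reading the client's cgroup v2 cpu.stat file. The sketch below parses that file's text and reports the fraction of enforcement periods that were throttled; the sample values are made up for illustration, and where the file lives for a given pod depends on the node's cgroup layout.

```python
def throttle_ratio(cpu_stat_text):
    """Fraction of CFS bandwidth periods in which the cgroup was throttled.

    Expects the contents of a cgroup v2 cpu.stat file ("key value" per line).
    Returns 0.0 when no periods were recorded (no CPU limit set).
    """
    fields = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            fields[parts[0]] = int(parts[1])
    periods = fields.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return fields.get("nr_throttled", 0) / periods

# Illustrative cpu.stat contents for a client held back by its CPU limit.
sample = """usage_usec 5000000
user_usec 4000000
system_usec 1000000
nr_periods 1000
nr_throttled 420
throttled_usec 21000000
"""
print(throttle_ratio(sample))
```

A ratio well above zero on the client means the instrument, not the scheduler under test, is shaping the measured latency.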

The heavy operating point, and where the win is small

How much sideloading costs under CFS depends on how hard the primary itself is driven. A sensitivity sweep on EKS made that explicit: with a nearly idle primary (1 thread × 4 connections), both schedulers are flat — parity, an honest null; there is no tail to protect. At the default point the gap is −23% p99 at the top of the ladder. At the heavy point (4 threads × 8 connections, the primary saturating its allocation), CFS degrades +256% across the ladder while Temper holds within +70% of its own baseline — 3.3× lower p99 at the top, −70%. We sell with the heavy number and disclose the sweep.

Figure: memcached, heavy operating point, GKE dedicated cores. Headline: −88% p99 at the knee.

Saturating memcached primary (4t×8c) vs. background-spinner ladder. 2× c2-standard-4, 2026-07-02.

Series: CFS · Temper.

−88% p99 at 4 spinners (0.415 vs 3.231 ms), −71% at idle; CFS +202% across the ladder, Temper flat. Cross-cloud: EKS heavy point −70%. Known attribution glitch in this run: one Temper step reported background_cores=0.0 (p99 unaffected); per-tier core attribution is an open instrumentation item.
source: docs/training-artifacts/headroom/gke-c2-mc-heavy/REPORT.md · docs/training-artifacts/headroom/eks-sensitivity/REPORT.md

Redis behaves the same way, with an honest wrinkle: at the default operating point a single hot redis thread is the easiest possible case for CFS's weight arbitration to defend, so the gap is modest (−17% to −24%; Temper stays flat, with its worst step +13% over its own baseline while background reclaims ~3.8 cores at 0.98 node utilization). At the heavy point the gap opens to −55% at the knee and −40% on the idle node. The idle-node gap exists because Confined placement isolates the primary from system-pod noise before any spinner exists.

Figure: memcached vs. density, tier QoS only. Headline: CFS degrades 12.3× · Temper flat.

Single node (c3-standard-8), density ladder 0→8. No workload profile; PriorityClass-derived tiers alone.

Series: CFS · Temper.

CFS broke the SLO at density 2 (0.335 ms) and reached 1.951 ms at density 8; Temper held 0.183 ms through an up-and-down ladder. Honest cost, kept in the record: at zero load, fenced placement idles ~20% above CFS idle (0.191 vs 0.159 ms), and background was squeezed to 1.84 cores vs CFS's 6.22. The fence trades reclaim for protection, which is what the protected-while-busy work below recovers.
source: docs/training-artifacts/OVERNIGHT-REPORT.md · docs/training-artifacts/memcached/SUMMARY.md
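The article quotes results both as multipliers (12.3×, 3.3× lower) and as percentages (~20%, −70%). A small helper for moving between the two, checked against the millisecond values quoted for this run, is sketched below; the function names are ours.

```python
def times_worse(degraded_ms, baseline_ms):
    """How many times larger the degraded latency is than the baseline."""
    return degraded_ms / baseline_ms

def pct_change(new_ms, old_ms):
    """Signed percent change from old to new."""
    return (new_ms / old_ms - 1) * 100

def pct_from_times_lower(factor):
    """Convert 'N times lower' into a signed percent change."""
    return (1 / factor - 1) * 100

cfs_degradation = round(times_worse(1.951, 0.159), 1)   # CFS density 8 vs CFS idle
fence_idle_cost = round(pct_change(0.191, 0.159))       # Temper idle vs CFS idle
heavy_point_gap = round(pct_from_times_lower(3.3))      # "3.3x lower" as a percent
print(cfs_degradation, fence_idle_cost, heavy_point_gap)
```

These three checks reproduce the 12.3× degradation, the roughly 20% idle cost of the fence, and the equivalence of "3.3× lower" with −70% at the EKS heavy point.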

Turning protection into capacity

Holding the tail is only half of sideloading; the other half is actually reclaiming the idle cycles. Stock protected layers strand capacity — under a bursty primary the fence held its CPUs even while the primary idled 60% of the time, leaving background at 1.89 cores and node utilization at 0.40. The protected-while-busy demand signal (loans only when the layer is idle everywhere, preempt-kick on demand return) took the same node to 5.65 background cores and 0.85 utilization at −7% primary throughput (the report's figure; raw samples/s read 25.5→23.3). Steady-state was unaffected.

With enforcement underneath, the padding itself becomes recoverable. On the same 3-node cluster, halving the declared requests of non-Critical pods (the opt-in overcommit webhook; Critical untouched, originals kept in annotations) moved the request-packing wall from 18 to 31 fillers: +72% pods placed with p99 still under 1.9 ms. A 16-pod fleet that genuinely does not fit on two nodes at stock requests ran entirely on two after overcommit: 3→2 nodes (−33%), with a p99 spot-check of 1.56 ms even with the load client co-located, a placement that works against us. And with Karpenter running placement in both arms, the same load and SLO provisioned −40% vCPU (12 vs 20) with Temper underneath.

+72% is a requests-packing number at overcommit factor 0.5 on a workload whose true usage fits. A fleet whose real usage exceeds capacity sees Open-tier throughput degrade first — by design — but Critical p99 held in every measured state.
The 3→2 consolidation was manual (cordon + delete); autoscaler-driven consolidation is designed, not built.
All packing runs are single-run-per-arm; the 18-vs-31 delta is far outside the noise band.
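The overcommit step (halve the CPU requests of non-Critical pods, leave Critical alone, keep the originals in annotations) can be sketched as a pure function over a pod manifest. This is an illustration of the described behavior, not the product's webhook: the annotation key prefix and the PriorityClass name used to recognize the Critical tier are hypothetical.

```python
import copy

CRITICAL_CLASSES = {"critical"}                              # hypothetical name
ANNOTATION_PREFIX = "example.invalid/original-cpu-request."  # hypothetical key

def parse_cpu_millis(quantity):
    """Parse CPU quantities such as '500m', '2' or '0.5' into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)

def overcommit(pod, factor=0.5):
    """Return a copy of `pod` with CPU requests scaled by `factor`.

    Critical pods are returned unchanged. The original request is stored in
    an annotation, and a container that already has one is skipped, so
    applying the function twice does not halve twice.
    """
    pod = copy.deepcopy(pod)
    if pod["spec"].get("priorityClassName") in CRITICAL_CLASSES:
        return pod
    annotations = pod["metadata"].setdefault("annotations", {})
    for container in pod["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        key = ANNOTATION_PREFIX + container["name"]
        if "cpu" not in requests or key in annotations:
            continue
        original = requests["cpu"]
        scaled = max(1, round(parse_cpu_millis(original) * factor))
        annotations[key] = original
        requests["cpu"] = f"{scaled}m"
    return pod

filler = {
    "metadata": {"name": "filler-0"},
    "spec": {
        "priorityClassName": "background",
        "containers": [
            {"name": "work", "resources": {"requests": {"cpu": "500m"}}}
        ],
    },
}
mutated = overcommit(filler)
print(mutated["spec"]["containers"][0]["resources"]["requests"]["cpu"])
print(mutated["metadata"]["annotations"])
```

Only the declared request changes; the pod's real usage is untouched, which is why the +72% figure holds only for workloads whose true usage fits.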

Raw records

docs/training-artifacts/binpack/REPORT.md
docs/training-artifacts/binpack/SAVINGS-REPORT.md
docs/training-artifacts/binpack/records/eks/
docs/training-artifacts/headroom/gke-c2/REPORT.md
docs/training-artifacts/headroom/gke-c2-mc-heavy/REPORT.md
docs/training-artifacts/headroom/gke-c2-redis/REPORT.md
docs/training-artifacts/headroom/gke-c2-redis-heavy/REPORT.md
docs/training-artifacts/headroom/eks-valid/REPORT.md
docs/training-artifacts/headroom/eks-sensitivity/REPORT.md
docs/training-artifacts/headroom/eks-inconclusive/FINDINGS.md (invalidated run, kept)
docs/training-artifacts/karpenter/REPORT.md
docs/training-artifacts/memcached/SUMMARY.md
docs/training-artifacts/OVERNIGHT-REPORT.md

Committed benchmark records in the product repository; design partners get the full artifact tree. Single-run arms are labeled in each report; anomalies are published, not pruned.