04 Deep dive

Keeping accelerators fed

The GPU is reserved, scheduled, and billed by the hour. The CPU threads that feed it are not protected by any of that. Pack the node with batch work and the most expensive device in the cluster starts idling — we measured the collapse, the fix, and the shape of workload where none of this applies.

workloads: PyTorch training (ResNet-18) · vLLM serving
hardware: GKE g2-standard-8 + NVIDIA L4 · c3-standard-8
records: 4 reports

The asymmetry that wastes accelerators

Kubernetes treats a GPU as an indivisible, exclusively assigned resource: one pod owns nvidia.com/gpu: 1 and nothing else can touch it. The CPU side of the same training pod (the main loop, the DataLoader workers doing decode and augmentation, the copy threads) gets no such treatment. It competes under ordinary CFS weights with every other pod on the node. When batch neighbors outweigh the trainer, the feeders stall, the input pipeline drains, and the GPU computes on nothing. You keep paying for the device; the meter that stops is throughput.

The precondition matters, so we state it first: this wedge exists when the trainer's CPU demand exceeds its request. That is what real DataLoader-heavy training usually looks like — teams underprovision CPU requests next to expensive GPUs — but it is a condition, not a universal.
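The weight arithmetic behind that precondition can be sketched in a few lines. This is a deliberately flat model, assuming every pod competes at one cgroup level and every neighbor is always runnable; the real kubelet hierarchy nests QoS classes, and on cgroup v2 the shares value is further converted to a cpu.weight. The neighbor request of 250m is a hypothetical value chosen for illustration, since the report's neighbor requests are not quoted here.

```python
def cpu_shares(request_millicores: int) -> int:
    """cgroup v1 cpu.shares derived from a CPU request (floor of 2)."""
    return max(2, request_millicores * 1024 // 1000)


def contended_cpus(own_request_m: int, neighbor_requests_m: list, node_cpus: int) -> float:
    """CPUs the pod receives when every pod on the node is runnable.

    Simplified flat model: CPU time is split in proportion to shares.
    """
    own = cpu_shares(own_request_m)
    total = own + sum(cpu_shares(r) for r in neighbor_requests_m)
    return node_cpus * own / total


def wedge_possible(demand_cpus: float, request_cpus: float) -> bool:
    """The precondition from the text: CPU demand exceeds the CPU request."""
    return demand_cpus > request_cpus


# Trainer from the L4 run: 2-CPU request, ~7 vCPU demand, 8-vCPU node.
# Neighbor requests (250m each) are HYPOTHETICAL, not taken from the report.
share = contended_cpus(2000, [250] * 16, node_cpus=8)
print(f"trainer gets ~{share:.2f} CPUs under full contention, wants ~7")
print("wedge possible:", wedge_possible(demand_cpus=7, request_cpus=2))
```

Under these assumed numbers the trainer is entitled to well under half of what it wants once all neighbors are busy, which is the shape of shortfall the section describes.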

The L4 wedge, measured

Figure: ResNet-18 on NVIDIA L4, batch neighbors. Temper flat · CFS −25%. Series: CFS, Temper.

Burstable trainer (2-CPU request, no CPU limit, ~7 vCPU demand, 6 DataLoader workers) vs. batch-spinner ladder on one g2-standard-8. 2026-07-01.

At 16 neighbors CFS lost 25% of trainer throughput (629→471 samples/s) and the L4’s utilization collapsed into a 0–81% band (mean ~40%); under Temper throughput held at 636–642 samples/s and GPU utilization at ~85% at every step. Isolated nvidia-smi zero-samples are sampling artifacts (present at idle too); the starvation signal is the sustained 12–70% band, not single zeros. Single run per arm.

source: docs/training-artifacts/gpu-wedge/REPORT.md · GKE g2-standard-8 + 1× NVIDIA L4
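The distinction between isolated zero-samples and a sustained low band can be made mechanical. The sketch below is one possible way to do it, not the method used in the report: the threshold, the minimum run length, and the sample series are all hypothetical illustration values. It also reproduces the headline percentage from the two throughput figures quoted above.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from before to after (negative means a loss)."""
    return (after - before) / before


def sustained_low_runs(samples, threshold=70, min_run=3):
    """Return (start_index, length) for each run of at least min_run
    consecutive utilization samples below threshold.

    Shorter dips, such as a single zero reading, are ignored.
    """
    runs = []
    start = None
    for i, value in enumerate(samples):
        if value < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - start))
            start = None
    # Close a run that extends to the end of the series.
    if start is not None and len(samples) - start >= min_run:
        runs.append((start, len(samples) - start))
    return runs


# Throughput figures quoted in the text: 629 -> 471 samples/s.
print(f"CFS throughput change: {relative_change(629, 471):.1%}")

# HYPOTHETICAL utilization series: two isolated zeros and one sustained dip.
util = [85, 0, 86, 84, 40, 12, 55, 30, 85, 0]
print(sustained_low_runs(util))
```

Only the four-sample dip is reported; the single zeros at either end are treated as sampling noise.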

The control arm is as important as the headline. In the first configuration (v1) the trainer was Guaranteed, with demand that fit inside its 4-CPU request — and both arms were flat. kubelet's QoS weighting fully defends a Guaranteed pod whose demand fits its request; Temper added +3–4% at high density, real but not headline. We kept that run in the report because it defines the boundary: the wedge appears exactly when demand exceeds request (v2: 2-CPU request, ~7 vCPU demand), which is when weights become the batch neighbors' weapon rather than the trainer's shield. In money terms the report puts it plainly: at 16 neighbors CFS wastes about a quarter of the accelerator; with enforcement the same node safely takes the batch overflow.
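The difference between the two configurations comes down to the QoS class Kubernetes derives from the pod's resources. A minimal sketch of that classification rule follows, written against plain dictionaries rather than the Kubernetes API. Quantities are compared as literal strings (so "4" and "4000m" would not match), and the memory values are hypothetical because the report excerpt quotes only the CPU figures.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    Each container is a dict with optional "requests" and "limits"
    dicts keyed by "cpu" and "memory". Simplification: quantities are
    compared as given strings, without unit normalization.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            req = requests.get(resource)
            lim = limits.get(resource)
            if req is not None or lim is not None:
                any_set = True
            # Guaranteed needs a limit on every resource, with the request
            # equal to it (an omitted request defaults to the limit).
            if lim is None or (req is not None and req != lim):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# v1: 4-CPU request with a matching limit. Memory values are HYPOTHETICAL.
v1 = [{"requests": {"cpu": "4", "memory": "8Gi"},
       "limits": {"cpu": "4", "memory": "8Gi"}}]
# v2: 2-CPU request, no CPU limit.
v2 = [{"requests": {"cpu": "2", "memory": "8Gi"},
       "limits": {"memory": "8Gi"}}]

print("v1:", qos_class(v1))
print("v2:", qos_class(v2))
```

Dropping the CPU limit is enough to move the trainer from Guaranteed to Burstable, which is the v2 shape in which the wedge appeared.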

What Kubernetes' own remedies measure

Figure: PyTorch training at density 8, CPU-only node. +67% samples/s.

Guaranteed 3-CPU trainer next to 8 noisy neighbors, c3-standard-8 (SMT-2). Kubernetes’ own remedies measure worse than doing nothing. 2026-06-12.

Quota limits partition manually and land mid-pack (16.5 samples/s). Static CPU pinning, kubelet’s own cpuset manager, was the worst primary arm: the pin was SMT-blind (three logical CPUs sharing physical cores) and capped the trainer at 14.6–14.8 samples/s even on an idle node. Whole-core, SMT-aware placement is part of the enforcement win. Honest cost in the same table: Temper’s fence squeezed background to ~1.9 cores (CFS delivered 5.0); the reclaim side of that trade is measured in the sideloading article.

source: docs/training-artifacts/OVERNIGHT-REPORT.md · docs/training-artifacts/arms/FOUR-ARM-SUMMARY.md
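To make "SMT-blind" concrete, here is a rough sketch of how one might check whether a pinned CPU set covers whole physical cores. It assumes the Linux sysfs topology files are available; the sibling layout and the pinned set in the example are hypothetical, not the values from the measured node, and the function names are illustrative.

```python
import glob
import re


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,4" or "0-1,4-5" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_siblings():
    """Map each logical CPU to its SMT sibling set from Linux sysfs."""
    siblings = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as handle:
            siblings[cpu] = parse_cpu_list(handle.read())
    return siblings


def split_cores(pinned, siblings_by_cpu):
    """Return (whole, partial): physical cores fully inside the pinned set,
    and cores the pinned set shares with CPUs outside it."""
    pinned = set(pinned)
    cores = {frozenset(siblings_by_cpu[cpu]) for cpu in pinned}
    whole = sorted(sorted(core) for core in cores if core <= pinned)
    partial = sorted(sorted(core) for core in cores if not core <= pinned)
    return whole, partial


# HYPOTHETICAL SMT-2 layout for an 8-vCPU node: CPU n pairs with CPU n+4.
layout = {cpu: {cpu % 4, cpu % 4 + 4} for cpu in range(8)}
whole, partial = split_cores({1, 2, 5}, layout)
print("whole cores:", whole)
print("shared cores:", partial)
```

Three logical CPUs on an SMT-2 machine can never be all whole cores: at best one core is owned outright and the third CPU shares its core with whatever runs on the sibling.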

The honest negative: GPU-bound serving

We ran the obvious follow-up and it did not go our way, so here it is. vLLM serving a small model (Qwen2.5-0.5B) on the same L4, 8 concurrent request loops, both a right-sized Guaranteed configuration and a deliberately wedge-shaped one (demand over request, weighted neighbors): parity in both. Right-sized, both arms were flat — p99 ~393–411 ms at idle and ~398–408 ms at bg=8, throughput within a few percent. In the wedge configuration, CFS showed only mild degradation at bg=4 (+16% vs Temper's +7%) with throughput parity, single runs, noisy; the bg=0 rows include post-rollout warm-up effects and the record marks them unsettled.

The mechanism is not mysterious: tokenization and scheduling for a 0.5B model at this concurrency costs a fraction of one core. The workload is GPU-bound, so there is no CPU-side contention for a CPU scheduler to remove — the training result does not generalize to this serving shape, and we publish that rather than let the wedge quietly overclaim. Where the wedge does apply: DataLoader-heavy training and preprocessing-heavy pipelines (long prompts, large tokenizers, multimodal encode) — anywhere real CPU work sits between storage and the accelerator. Where it does not: workloads whose CPU side is already negligible.
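One rough way to tell which side of that line a workload falls on is to measure how much of each loop iteration is spent waiting for input. The helper below is a generic sketch, not taken from the report: it works on any iterable and step function, and the name and the injectable clock are illustrative choices. With a real CUDA step the kernels launch asynchronously, so the step function would need to call torch.cuda.synchronize() before returning for the split to be meaningful.

```python
import time


def input_wait_fraction(batches, step, clock=time.perf_counter):
    """Fraction of wall time spent blocked on the input iterator.

    batches: any iterable of inputs (for example a DataLoader).
    step:    callable that consumes one batch (the compute side).
    A high fraction suggests CPU-side feeding is the bottleneck; a value
    near zero suggests the loop is bound by the step itself.
    """
    waited = 0.0
    started = clock()
    iterator = iter(batches)
    while True:
        before = clock()
        try:
            batch = next(iterator)
        except StopIteration:
            break
        waited += clock() - before
        step(batch)
    elapsed = clock() - started
    return waited / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    def slow_loader(n=5):
        for i in range(n):
            time.sleep(0.02)  # stand-in for decode and augmentation
            yield i

    fraction = input_wait_fraction(slow_loader(), lambda batch: time.sleep(0.005))
    print(f"input wait fraction: {fraction:.2f}")
```

A serving loop like the vLLM case above would be expected to show a small fraction, while a starved DataLoader-heavy trainer would show a large one; the measurement itself is an assumption about how to probe this, not something the records report.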

Caveats that travel with these numbers

Single run per arm in all GPU experiments; 60 s measurement windows.
The +67% arm comparison was run on a CPU-only node (c3-standard-8, Intel); absolute samples/s are not comparable across machine shapes. A t2d (AMD, no SMT) run of the same trainer idles at 17.2 samples/s vs c3’s 25.5.
The L4 run also live-validated agent auto-start on the GPU node (scheduler attached before any user pod) — noted because boot-order bugs are a classic way this class of product fails quietly.

Raw records

docs/training-artifacts/gpu-wedge/REPORT.md
docs/training-artifacts/vllm-l4/REPORT.md
docs/training-artifacts/OVERNIGHT-REPORT.md
docs/training-artifacts/arms/FOUR-ARM-SUMMARY.md
docs/training-artifacts/arms/STAGE1-SUMMARY.md
docs/training-artifacts/shapes/SUMMARY.md (partial run, shape-comparison caveat)

Committed benchmark records in the product repository; design partners get the full artifact tree.