Operations: running it like you mean it
Upgrade patterns, the rollback runbook, what to monitor, and how to size protected tiers so the enforcement math works in your favor.
infera); the commands below are what works today. A rename migration is planned.
Upgrades: canary first
The chart has first-class support for canarying a new agent build on a subset of nodes before fleet rollout: enabling canary mode renders a second DaemonSet with the candidate image tag, targeted by a node selector, while the main DaemonSet excludes those nodes.
# label the canary nodes, then:
helm upgrade infera deploy/helm/infera --reuse-values \
--set agent.canary.enabled=true \
--set-string agent.canary.image.tag=$NEW_TAG \
--set agent.canary.nodeSelector.infera-canary=true
Watch the canary nodes’ linter metrics and your SLOs; if the candidate misbehaves, the blast radius is the labeled nodes, and each of them fails toward stock CFS. Promote by moving the main image tag and disabling the canary.
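The labeling and promotion steps around the canary upgrade could be scripted roughly as follows. This is a sketch, not the chart's documented procedure: `agent.image.tag` is an assumed value path (chosen to mirror `agent.canary.image.tag`), and the function names and the `HELM`/`KUBECTL` override variables are hypothetical conveniences for previewing commands. Check the chart's values file before relying on it.

```shell
# Sketch only. HELM and KUBECTL can be overridden (for example HELM=echo)
# to print the arguments instead of executing anything.

# Label each named node so the canary DaemonSet's node selector matches it.
label_canary_nodes() {
  for node in "$@"; do
    "${KUBECTL:-kubectl}" label node "$node" infera-canary=true || return 1
  done
}

# Promote: move the main image tag to the candidate and turn the canary off.
# ASSUMPTION: agent.image.tag is the main DaemonSet's tag value in this chart.
promote_canary() {
  new_tag="$1"
  if [ -z "$new_tag" ]; then
    echo "usage: promote_canary <image-tag>" >&2
    return 2
  fi
  "${HELM:-helm}" upgrade infera deploy/helm/infera --reuse-values \
    --set-string "agent.image.tag=${new_tag}" \
    --set agent.canary.enabled=false
}
```

Running with `HELM=echo` and `KUBECTL=echo` first shows exactly what would be sent to the cluster, which is a cheap review step before a fleet-wide change.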
Rollback runbook
- Stop enforcement first, everywhere it hurts. Safe mode is instant and does not require helm:
kubectl annotate node --all infera.io/safe-mode-requested=true
- Then roll back the release at leisure:
helm rollback infera
- Re-engage by removing the annotation once the fleet is on the version you trust:
kubectl annotate node --all infera.io/safe-mode-requested-
The ordering matters and is the point of the design: enforcement rollback (milliseconds, kernel-native) is decoupled from software rollback (minutes, helm). You never wait on an image pull to get back to stock scheduling. Details: safety & rollback.
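One way to keep that ordering from being skipped under pressure is to wrap the runbook in small shell functions. The sketch below uses only the commands from the runbook; the function names and the `KUBECTL`/`HELM` override variables are hypothetical, and the `--overwrite` flag is an addition so that the kill switch still succeeds if the annotation already exists with another value.

```shell
# Sketch only. KUBECTL and HELM can be overridden (for example KUBECTL=echo)
# to preview the sequence without touching the cluster.

enter_safe_mode() {
  # --overwrite keeps the command idempotent if the annotation is already set.
  "${KUBECTL:-kubectl}" annotate node --all --overwrite infera.io/safe-mode-requested=true
}

rollback_release() {
  # With no revision argument, helm rolls back to the previous revision.
  "${HELM:-helm}" rollback infera
}

leave_safe_mode() {
  # A trailing dash on the key removes the annotation.
  "${KUBECTL:-kubectl}" annotate node --all infera.io/safe-mode-requested-
}

emergency_rollback() {
  # Enforcement first, software second: stop here if the kill switch fails.
  enter_safe_mode || return 1
  rollback_release
}
```

`leave_safe_mode` is deliberately not part of `emergency_rollback`: re-engaging is a separate decision, made once the fleet is on a version you trust.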
Monitoring the agent
| Signal | Where | Alert when |
|---|---|---|
| Agent status | infera.io/agent-status node annotation | Not ready on a node that should be enforcing |
| Agent pods | kubectl -n infera get pods / kube-state-metrics | CrashLoop or not Running on schedulable nodes |
| Linter violations | infera_lint_violation on /metrics | Persistently nonzero — config and reality have drifted |
| Safe-mode state | Metrics + node annotations | Unexpected safe-mode entries (someone pulled the kill switch) |
| Config generation | infera.io/config-generation annotation | Rapid churn — tier assignments are thrashing |
Remember the failure direction when triaging: an agent that is down means the node runs stock CFS, so your workloads are unprotected, not broken.
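For a quick manual check of the linter signal, the `/metrics` output can be filtered on the command line. The sketch below assumes the endpoint serves the standard Prometheus text format and that `infera_lint_violation` may carry labels; the label names in the usage example are invented for illustration, and the function name is hypothetical. It is a point-in-time count, so "persistently nonzero" still needs a real alerting rule.

```shell
# Sketch only. Reads Prometheus text-format metrics on stdin and prints how
# many infera_lint_violation samples currently have a nonzero value.
# Limitation: label values containing a "}" character are not handled.
count_lint_violations() {
  awk '
    /^infera_lint_violation[{ \t]/ {
      line = $0
      # Drop the metric name and optional {label} block, leaving the value.
      sub(/^infera_lint_violation([{][^}]*[}])?[ \t]+/, "", line)
      # f[1] is the sample value; f[2], if present, is a timestamp.
      split(line, f, /[ \t]+/)
      if (f[1] + 0 != 0) n++
    }
    END { print n + 0 }
  '
}
```

Feed it whatever fetches the agent's metrics in your environment, for example `curl -s` against the agent's `/metrics` address piped into `count_lint_violations`. A result of `0` means no series is currently reporting a violation.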
Sizing guidance for protected tiers
- Leave real headroom outside the fence. Keep the aggregate Critical-tier CPU requests on a node at or below total cores minus one — the system, kubelet, and Temper’s own threads live in the open reserve, and a fence that swallows the whole node starves what feeds it.
- Expect graceful demotion, don’t rely on it. If a Confined (Critical) layer demands more whole cores than the node can fence after the open reserve, Temper demotes that layer to Grouped rather than caging it into a too-small allocation. You keep protection but lose exclusivity — fix the requests or the node shape.
- Set honest requests. Layer weights and CPU ranges are computed from your requests; wildly padded requests buy fence you don’t use, and starved requests under-weight tiers that matter. The thread-aware rightsizer tells you which is happening.
- Don’t make everything Critical. The tier system is an economy; if every pod is Critical, no pod is. Most services belong in High or Normal — reserve Critical for the workloads whose p99 is revenue.
- Batch tier changes where possible. Each QoS change costs a ~52 ms reconfiguration window (churn cost); the agent debounces, but a deployment pattern that flaps PriorityClasses works against itself.
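The "total cores minus one" guideline is simple arithmetic, and it can be checked before a deployment lands. The sketch below assumes requests are expressed in Kubernetes millicores and that you have already collected the Critical-tier requests for a node by some other means; the function name is hypothetical and this is not how Temper itself computes fences.

```shell
# Sketch only. Usage: critical_headroom <node_cores> <request_millicores>...
# Prints the remaining headroom in millicores against a budget of
# (node_cores - 1) cores, and returns nonzero if the requests exceed it.
critical_headroom() {
  node_cores="$1"
  shift
  budget=$(( (node_cores - 1) * 1000 ))
  total=0
  for req in "$@"; do
    total=$(( total + req ))
  done
  echo $(( budget - total ))
  [ "$total" -le "$budget" ]
}
```

For a 16-core node with Critical requests of 4000m, 4000m, and 6000m, the budget is 15000m and the function prints `1000` and succeeds. On an 8-core node with two 4000m requests it prints `-1000` and fails, which is the situation where the fence would swallow the open reserve.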