Operations: running it like you mean it

Upgrade patterns, the rollback runbook, what to monitor, and how to size protected tiers so the enforcement math works in your favor.

Naming note. Binaries, the helm chart, and annotation keys currently ship under the project’s former name (infera); the commands below are what works today. A rename migration is planned.

Upgrades: canary first

The chart has first-class support for canarying a new agent build on a subset of nodes before fleet rollout: enabling canary mode renders a second DaemonSet with the candidate image tag, targeted by a node selector, while the main DaemonSet excludes those nodes.

```
# label the canary nodes, then:
helm upgrade infera deploy/helm/infera --reuse-values \
  --set agent.canary.enabled=true \
  --set-string agent.canary.image.tag="$NEW_TAG" \
  --set-string agent.canary.nodeSelector.infera-canary=true
```

Watch the canary nodes’ linter metrics and your SLOs; if the candidate misbehaves, the blast radius is the labeled nodes, and each of them fails toward stock CFS. Promote by moving the main image tag and disabling the canary.
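
The labeling step that precedes the helm command can be sketched as a pair of shell helpers. This is a minimal sketch, assuming the node label must match the `infera-canary=true` selector passed to helm above; the function names `label_canaries` and `unlabel_canaries` are hypothetical and not part of the chart.

```shell
# Hypothetical helpers; the label key/value mirrors the nodeSelector
# given to helm above (infera-canary=true).

# Mark each named node as a canary target.
label_canaries() {
  for node in "$@"; do
    kubectl label node "$node" infera-canary=true --overwrite
  done
}

# Remove the canary label again (the trailing '-' deletes a label).
unlabel_canaries() {
  for node in "$@"; do
    kubectl label node "$node" infera-canary-
  done
}
```

For example, `label_canaries node-a node-b` before the upgrade, and `unlabel_canaries node-a node-b` once the canary has been promoted or abandoned.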

Rollback runbook

Stop enforcement first, everywhere it hurts — safe mode is instant and does not require helm:
```
kubectl annotate node --all --overwrite infera.io/safe-mode-requested=true
```
Then roll back the release at leisure:
```
helm rollback infera
```
Re-engage by removing the annotation once the fleet is on the version you trust:
```
kubectl annotate node --all infera.io/safe-mode-requested-
```

The ordering matters and is the point of the design: enforcement rollback (milliseconds, kernel-native) is decoupled from software rollback (minutes, helm). You never wait on an image pull to get back to stock scheduling. Details: safety & rollback.
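
Before starting the helm rollback, it can help to confirm that the safe-mode request actually landed on every node. The helpers below are a sketch under stated assumptions: the function names are hypothetical, and a present annotation only shows that safe mode was requested, not that the agent has acted on it, so cross-check against the agent's own status signals.

```shell
# Hypothetical helpers built on standard kubectl output options.

# Print each node with the value of its safe-mode annotation.
# kubectl prints "<none>" where the annotation is absent.
safe_mode_report() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,SAFE_MODE:.metadata.annotations.infera\.io/safe-mode-requested'
}

# Print the names of nodes where safe mode has NOT been requested.
# Skips the header row, then keeps rows whose second column is not "true".
nodes_missing_safe_mode() {
  safe_mode_report | awk 'NR > 1 && $2 != "true" { print $1 }'
}
```

An empty result from `nodes_missing_safe_mode` means the request is recorded fleet-wide and the software rollback can proceed at its own pace.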

Monitoring the agent

| Signal | Where | Alert when |
| --- | --- | --- |
| Agent status | `infera.io/agent-status` node annotation | Not `ready` on a node that should be enforcing |
| Agent pods | `kubectl -n infera get pods` / kube-state-metrics | CrashLoopBackOff or not Running on schedulable nodes |
| Linter violations | `infera_lint_violation` on `/metrics` | Persistently nonzero: config and reality have drifted |
| Safe-mode state | Metrics + node annotations | Unexpected safe-mode entries (someone pulled the kill switch) |
| Config generation | `infera.io/config-generation` annotation | Rapid churn: tier assignments are thrashing |

Remember the failure direction when triaging: an agent that is down means the node runs stock CFS, so your workloads are unprotected, not broken.
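
For a quick manual check of the linter signal, the scraped metrics text can be summed with a small filter. This is a sketch, assuming the endpoint serves the standard Prometheus text exposition format with no trailing timestamps on sample lines; the function name `lint_violation_total` is hypothetical, and the metrics address is left out because it is not given here.

```shell
# Hypothetical helper: sum every infera_lint_violation sample read from
# stdin in Prometheus text format. Comment lines (# HELP / # TYPE) are
# skipped; the sample value is taken to be the last field on the line.
lint_violation_total() {
  awk '
    /^#/ { next }
    $1 ~ /^infera_lint_violation([{]|$)/ { sum += $NF }
    END { print sum + 0 }
  '
}
```

Pipe a scrape of the agent's `/metrics` output into `lint_violation_total`; a result that stays above zero across several scrapes corresponds to the "persistently nonzero" condition.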

Sizing guidance for protected tiers

- Leave real headroom outside the fence. Keep the aggregate Critical-tier CPU requests on a node at or below total cores minus one: the system, kubelet, and Temper’s own threads live in the open reserve, and a fence that swallows the whole node starves what feeds it.
- Expect graceful demotion, don’t rely on it. If a Confined (Critical) layer demands more whole cores than the node can fence after the open reserve, Temper demotes that layer to Grouped rather than caging it into a too-small allocation. You keep protection but lose exclusivity, so fix the requests or the node shape.
- Set honest requests. Layer weights and CPU ranges are computed from your requests; wildly padded requests buy fence you don’t use, and starved requests under-weight tiers that matter. The thread-aware rightsizer tells you which is happening.
- Don’t make everything Critical. The tier system is an economy; if every pod is Critical, no pod is. Most services belong in High or Normal; reserve Critical for the workloads whose p99 is revenue.
- Batch tier changes where possible. Each QoS change costs a ~52 ms reconfiguration window (churn cost); the agent debounces, but a deployment pattern that flaps PriorityClasses works against itself.
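
The arithmetic behind the headroom and churn guidance can be sketched in a few lines of shell. These helpers are hypothetical and only restate the rules of thumb above: a budget of total cores minus one for aggregate Critical requests, and roughly 52 ms per QoS change as an upper bound that ignores the agent's debouncing.

```shell
# Hypothetical helpers restating the sizing rules of thumb above.

# Largest aggregate Critical-tier request, in millicores, that still
# leaves one full core in the open reserve.
critical_budget_millicores() {
  echo $(( ($1 - 1) * 1000 ))
}

# Exit 0 when the summed Critical requests (millicores, second argument)
# fit within the budget for a node with the given core count.
critical_fits() {
  [ "$2" -le "$(critical_budget_millicores "$1")" ]
}

# Upper bound on reconfiguration time, in ms, for a number of QoS
# changes at ~52 ms each, assuming no change is debounced away.
churn_upper_bound_ms() {
  echo $(( $1 * 52 ))
}
```

On a 16-core node, `critical_budget_millicores 16` prints `15000`, so `critical_fits 16 15000` succeeds while `critical_fits 16 15001` fails. Ten un-debounced tier changes give `churn_upper_bound_ms 10`, which prints `520`.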