Operations: running it like you mean it

Upgrade patterns, the rollback runbook, what to monitor, and how to size protected tiers so the enforcement math works in your favor.

Naming note. Binaries, the helm chart, and annotation keys currently ship under the project’s former name (infera); the commands below are what works today. A rename migration is planned.

Upgrades: canary first

The chart has first-class support for canarying a new agent build on a subset of nodes before fleet rollout: enabling canary mode renders a second DaemonSet with the candidate image tag, targeted by a node selector, while the main DaemonSet excludes those nodes.

# label the canary nodes, then:
helm upgrade infera deploy/helm/infera --reuse-values \
  --set agent.canary.enabled=true \
  --set-string agent.canary.image.tag=$NEW_TAG \
  --set-string agent.canary.nodeSelector.infera-canary=true

Watch the canary nodes’ linter metrics and your SLOs; if the candidate misbehaves, the blast radius is the labeled nodes, and each of them fails toward stock CFS. Promote by moving the main image tag and disabling the canary.
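The canary lifecycle around that command can be wrapped in a few shell helpers. This is a minimal sketch, not part of the chart: the function names are made up for illustration, the infera-canary=true label comes from the node selector in the upgrade command above, and agent.image.tag is an assumed values key for the main image tag (by analogy with agent.canary.image.tag), so check the chart's values before relying on it.

```shell
# Sketch only; helper names are hypothetical.

# Label the nodes that should run the candidate build.
canary_label_nodes() {
  for node in "$@"; do
    kubectl label node "$node" infera-canary=true --overwrite
  done
}

# Show which nodes currently carry the canary label.
canary_list_nodes() {
  kubectl get nodes -l infera-canary=true
}

# Promote: point the main DaemonSet at the candidate tag and turn the canary off.
# ASSUMPTION: the main tag lives at agent.image.tag; verify against the chart.
canary_promote() {
  new_tag=$1
  helm upgrade infera deploy/helm/infera --reuse-values \
    --set-string agent.image.tag="$new_tag" \
    --set agent.canary.enabled=false
}
```

Typical use would be canary_label_nodes followed by the helm upgrade shown earlier, then canary_promote "$NEW_TAG" once the canary nodes look healthy.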

Rollback runbook

  1. Stop enforcement first, everywhere it hurts — safe mode is instant and does not require helm:
    kubectl annotate node --all infera.io/safe-mode-requested=true
  2. Then roll back the release at leisure:
    helm rollback infera
  3. Re-engage by removing the annotation once the fleet is on the version you trust:
    kubectl annotate node --all infera.io/safe-mode-requested-

The ordering matters and is the point of the design: enforcement rollback (milliseconds, kernel-native) is decoupled from software rollback (minutes, helm). You never wait on an image pull to get back to stock scheduling. Details: safety & rollback.
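Before starting step 2 it is worth confirming that the annotation from step 1 actually landed on every node. The helper below is a sketch with a hypothetical name; it only checks that the request annotation is recorded as "true", and says nothing about whether each agent has acted on it.

```shell
# Sketch only; the function name is hypothetical.
# Prints the names of nodes whose safe-mode request annotation is not "true".
# Empty output means the request is recorded on every node.
nodes_without_safe_mode_request() {
  kubectl get nodes --no-headers \
    -o 'custom-columns=NODE:.metadata.name,REQ:.metadata.annotations.infera\.io/safe-mode-requested' \
  | awk '$2 != "true" { print $1 }'
}
```

Nodes without the annotation show up because kubectl prints a placeholder for missing values in custom columns, which the awk filter treats as "not true".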

Monitoring the agent

Signal | Where | Alert when
Agent status | infera.io/agent-status node annotation | Not ready on a node that should be enforcing
Agent pods | kubectl -n infera get pods / kube-state-metrics | CrashLoop or not Running on schedulable nodes
Linter violations | infera_lint_violation on /metrics | Persistently nonzero: config and reality have drifted
Safe-mode state | Metrics + node annotations | Unexpected safe-mode entries (someone pulled the kill switch)
Config generation | infera.io/config-generation annotation | Rapid churn: tier assignments are thrashing

Remember the failure direction when triaging: an agent that is down means the node runs stock CFS, so your workloads are unprotected, not broken.
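For ad hoc triage, the annotation and metric signals from the table can be pulled with two small helpers. This is a sketch under assumptions: the function names are hypothetical, the annotation values are printed as-is without interpretation, and the metrics filter assumes Prometheus text format with the sample value as the last field on the line (no trailing timestamps).

```shell
# Sketch only; helper names are hypothetical.

# One row per node with the agent's annotations from the table above.
agent_annotation_snapshot() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,STATUS:.metadata.annotations.infera\.io/agent-status,GENERATION:.metadata.annotations.infera\.io/config-generation,SAFE_MODE_REQ:.metadata.annotations.infera\.io/safe-mode-requested'
}

# Reads metrics text on stdin and prints infera_lint_violation samples
# whose value is nonzero. ASSUMPTION: the value is the last field.
nonzero_lint_violations() {
  awk '/^infera_lint_violation[{ ]/ && $NF + 0 != 0 { print }'
}
```

Feed the second helper whatever scrapes the agent's /metrics endpoint in your setup, for example the output of curl against the agent's address; the host and port are deployment-specific and not given here.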

Sizing guidance for protected tiers