BRO SRE

Reliability practices, infrastructure, automation


Boring Kubernetes Upgrades: A Quarterly Cadence That Actually Works

2026-05-08 · Kubernetes, Operations, Upgrades

Two years ago our Kubernetes upgrades were events. We blocked off a week, drafted a runbook the size of a thesis, declared a feature freeze, and emerged on the other side with a bunch of new tickets and a slightly traumatized platform team. Today an upgrade takes one engineer two afternoons and does not even rate a Slack announcement. This article is about how we got there.

The Two Cadences That Don't Work

Teams tend to settle on one of two anti-patterns. The first is "upgrade when forced": the cluster runs an old version until cloud-provider support ends, then everyone scrambles to skip three minors at once. The second is "upgrade immediately on release": every new minor lands within a week, including the ones that ship with broken admission webhooks.

The first cadence compounds risk. Each skipped version adds another set of deprecations and behavioral changes, and you eventually have to reconcile all of them at once. The second cadence trades stability for currency: you become an unpaid beta tester for the entire ecosystem.

The cadence that actually works for us is quarterly, with one version of lag.

The Quarterly Rhythm

Upstream Kubernetes releases roughly three minors per year. We track N-1: when 1.34 is released, we upgrade from 1.32 to 1.33. The lag is deliberate. It gives:

The actual quarterly schedule is fixed in advance:

The cycle is intentionally generous. Compressing it does not save much engineering time but does compress the recovery window when something goes wrong.
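The N-1 rule is simple enough to encode, which keeps the target of each cycle from becoming a debate. The following is a minimal sketch, not our actual tooling: the function names are hypothetical, and it assumes a cluster that has fallen behind should still move one minor at a time rather than skip.

```python
from typing import Optional, Tuple


def parse_minor(version: str) -> Tuple[int, int]:
    """Parse 'v1.33.2' or '1.33' into (major, minor)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def upgrade_target(current: str, latest_upstream: str) -> Optional[str]:
    """Return the next minor to upgrade to under the N-1 rule, or None if none is due."""
    cur_major, cur_minor = parse_minor(current)
    up_major, up_minor = parse_minor(latest_upstream)
    if cur_major != up_major:
        raise ValueError("major version changes are out of scope for this sketch")
    policy_minor = up_minor - 1  # N-1: stay one minor behind upstream
    if cur_minor >= policy_minor:
        return None  # already at N-1, nothing to do this cycle
    # Never skip a minor, even when the cluster is more than one behind.
    return f"{cur_major}.{cur_minor + 1}"
```

With 1.34 as the latest upstream release, a cluster on 1.32 gets 1.33 as its target, and a cluster already on 1.33 gets nothing to do.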

The Pre-Flight Check Is the Whole Game

The actual kubeadm upgrade apply call is anticlimactic. The work is in finding out, before you upgrade, what is going to break.

We run three pre-flight scans against every cluster:

# Detect deprecated API versions in live manifests
kubectl deprecations --k8s-version=v1.33.0

# Detect deprecated API versions in source-controlled manifests
pluto detect-files -d ./infra --target-versions k8s=v1.33.0

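# Detect clients still calling deprecated APIs via audit logs. The API server
# annotates those requests with k8s.io/deprecated=true. The log path below is
# an assumption; point it at wherever your audit backend writes.
jq -r 'select(.annotations["k8s.io/deprecated"] == "true") | [.user.username, .userAgent, .requestURI] | @tsv' /var/log/kubernetes/audit.log | sort | uniq -c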

The third scan, replaying audit logs, is what catches operators and CRDs that the static scanners miss. We discovered one home-grown controller still calling the batch/v1beta1 CronJob API only because it showed up in the audit logs: the source repo had been updated months earlier, but the deployed binary came from an older build.
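For readers who want to see what an audit-log scan can look like, here is a minimal sketch. It relies on one fact about the API server: requests to deprecated APIs are recorded with the audit annotation k8s.io/deprecated set to "true". Everything else, including the function name and the assumption that events arrive as one JSON object per line, is illustrative rather than a description of our tooling.

```python
import json
from collections import Counter
from typing import Iterable


def deprecated_api_callers(lines: Iterable[str]) -> Counter:
    """Count (username, userAgent, request path) for audit events flagged as deprecated."""
    hits: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        annotations = event.get("annotations") or {}
        if annotations.get("k8s.io/deprecated") != "true":
            continue
        user = (event.get("user") or {}).get("username", "unknown")
        agent = event.get("userAgent", "unknown")
        path = event.get("requestURI", "").split("?", 1)[0]  # drop query string
        hits[(user, agent, path)] += 1
    return hits
```

Grouping by user agent as well as username is what exposes the stale-binary case: the service account looks right, but the user agent shows an old build still making the call.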

The Bits That Used to Hurt

Webhook Compatibility

Admission webhooks frequently break across upgrades because their CA bundle expires, their TLS settings reject newer cipher suites, or their handler logic depends on field defaulting that changed. Our pattern: every cluster runs a synthetic ValidatingWebhookConfiguration health check that fires a known-good request through every registered webhook every five minutes. If any webhook starts failing, we get a page before the next deployment hits it.
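The shape of such a probe is worth spelling out. The sketch below builds a dry-run AdmissionReview (admission.k8s.io/v1) for a trivial ConfigMap and checks the webhook's answer. The function names, the webhook-probe namespace, and the choice of a ConfigMap as the known-good object are all assumptions for illustration; a real probe needs one known-good object per resource type your webhooks actually match, plus the HTTPS call itself.

```python
import json
import uuid


def build_probe(namespace: str = "webhook-probe") -> dict:
    """Build a dry-run AdmissionReview for creating a trivial ConfigMap."""
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "request": {
            "uid": str(uuid.uuid4()),
            "kind": {"group": "", "version": "v1", "kind": "ConfigMap"},
            "resource": {"group": "", "version": "v1", "resource": "configmaps"},
            "name": "probe",
            "namespace": namespace,
            "operation": "CREATE",
            "dryRun": True,  # tells the webhook not to cause side effects
            "object": {
                "apiVersion": "v1",
                "kind": "ConfigMap",
                "metadata": {"name": "probe", "namespace": namespace},
                "data": {"probe": "ok"},
            },
        },
    }


def probe_passed(request: dict, response_body: str) -> bool:
    """True if the webhook echoed the request uid and allowed the object."""
    try:
        review = json.loads(response_body)
    except json.JSONDecodeError:
        return False
    response = review.get("response") or {}
    return (
        response.get("uid") == request["request"]["uid"]
        and response.get("allowed") is True
    )
```

Checking the echoed uid as well as the allowed flag catches webhooks that return a well-formed but stale or mismatched response.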

Kube-proxy and CNI

We standardized on Cilium with kube-proxy replacement. Our single most painful upgrade, 1.27 to 1.28, broke because the Cilium version pinned in our Helm chart did not yet support the new kube-proxy IPVS hashing change. The fix is rigid: upgrade Cilium one cycle ahead of Kubernetes, so the CNI running on a cluster always already supports the next Kubernetes minor, not just the current one.
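A rule this rigid can be enforced as a gate in the upgrade pipeline. The sketch below assumes you maintain a compatibility matrix by hand from the Cilium release notes; the function names are hypothetical and the matrix passed in the usage example holds placeholder entries, not real compatibility data.

```python
from typing import Dict, Set


def next_minor(k8s_minor: str) -> str:
    """'1.27' -> '1.28'."""
    major, minor = k8s_minor.split(".")
    return f"{major}.{int(minor) + 1}"


def cni_is_one_cycle_ahead(
    cilium_version: str,
    running_k8s: str,
    supported: Dict[str, Set[str]],
) -> bool:
    """True if the deployed CNI supports both the running and the next Kubernetes minor."""
    validated = supported.get(cilium_version, set())
    return running_k8s in validated and next_minor(running_k8s) in validated


# Placeholder matrix for illustration only; fill in from the release notes.
example_matrix = {
    "cilium-old": {"1.27"},
    "cilium-new": {"1.27", "1.28"},
}
print(cni_is_one_cycle_ahead("cilium-old", "1.27", example_matrix))
print(cni_is_one_cycle_ahead("cilium-new", "1.27", example_matrix))
```

Treating an unknown Cilium version as unsupported is deliberate: a missing matrix entry should block the upgrade, not wave it through.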

Container Runtime

The cgroups v2 migration on RHEL-family nodes caused us six months of confusion. Some workloads (specifically those reading memory.limit_in_bytes from inside the container) silently broke under v2 because that file no longer exists; the limit is exposed as memory.max instead. We caught it with a one-time audit against every running pod, but it could easily have been a slow-burn outage. Lesson: when you change anything below the kubelet, run the synthetic workload that exercises the lowest layers, not just kubectl get nodes.
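For workloads that genuinely need to read their own memory limit, the defensive pattern is to check both layouts. This is a minimal sketch assuming the conventional mount point /sys/fs/cgroup inside the container; the function name and the threshold used to recognize the v1 "no limit" sentinel are illustrative choices.

```python
from pathlib import Path
from typing import Optional

# cgroup v1 reports "no limit" as a huge number close to 2**63.
V1_UNLIMITED_THRESHOLD = 1 << 60


def memory_limit_bytes(cgroup_root: Path = Path("/sys/fs/cgroup")) -> Optional[int]:
    """Return the container memory limit in bytes, or None if unlimited or unknown."""
    v2_file = cgroup_root / "memory.max"
    v1_file = cgroup_root / "memory" / "memory.limit_in_bytes"
    if v2_file.exists():
        raw = v2_file.read_text().strip()
        return None if raw == "max" else int(raw)  # v2 writes the literal "max"
    if v1_file.exists():
        value = int(v1_file.read_text().strip())
        return None if value >= V1_UNLIMITED_THRESHOLD else value
    return None  # neither layout found; caller decides how to react
```

Returning None instead of raising when neither file exists is a judgment call: it avoids crashing the workload, but callers should log it, because a silent fallback is exactly how this class of bug hid from us.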

The Rollback Question

Kubernetes does not really support rollback. Once the upgraded API server has started writing objects to etcd in newer storage versions, downgrading the control plane is unsupported and the older API server may refuse to start.

Our answer is to not need rollback. The lab → staging → low-traffic prod → broader rollout sequence is designed so that any incompatibility shows up before we hit critical traffic. If staging stays healthy under normal load for a week, we have very high confidence the upgrade is safe. If it does not, we identify and fix the issue in staging before touching production.

For control plane rollback in catastrophic scenarios, the only real option is restoring from an etcd snapshot taken immediately before the upgrade. We take this snapshot on every upgrade and retain it for 14 days. We have never used one. We still take them.
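The snapshot itself comes from etcdctl snapshot save; the part that tends to get forgotten is retention. Here is a minimal sketch of a 14-day pruning step, assuming snapshots land in a single directory with a .db suffix. The directory layout, the suffix, and the function name are assumptions for illustration.

```python
import time
from pathlib import Path
from typing import List, Optional

RETENTION_DAYS = 14


def prune_snapshots(
    directory: Path,
    now: Optional[float] = None,
    retention_days: int = RETENTION_DAYS,
) -> List[Path]:
    """Delete *.db snapshots older than the retention window; return what was removed."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    removed = []
    for snapshot in sorted(directory.glob("*.db")):
        if snapshot.stat().st_mtime < cutoff:
            snapshot.unlink()
            removed.append(snapshot)
    return removed
```

The now parameter exists only so the function can be tested with a fixed clock. In practice it would run right after a successful new snapshot, never before one.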

What Made Upgrades Boring

The single biggest shift was treating upgrades as a continuous obligation rather than a project. When the schedule is fixed and the pre-flight tooling is mature, the actual upgrade becomes a sequence of small, well-known steps. The interesting work — finding deprecated API usage, validating compatibility — is spread across weeks, not crammed into one week of panic.

The second shift was discipline about lag. N-1 is a rule, not a guideline. The temptation to skip versions because "we're already behind" or to jump on the latest because "the new feature is nice" must be resisted. The cost of falling out of sync with the cadence is much higher than the cost of being slightly behind the cutting edge.

Kubernetes upgrades will never be exciting. That is the goal.