BRO SRE

Reliability practices, infrastructure, automation

Redis Cluster in Production: Five Pitfalls We Hit So You Don't Have To

2026-05-15 · Redis, Databases, Operations

Redis Cluster looks deceptively simple. Three masters, three replicas, slots auto-distributed, clients handle the redirects. The README implies you can deploy it in an afternoon. We did, and then spent the next two years discovering that the gap between "running" and "production-ready" is enormous.

This article is a tour of five specific failure modes we hit operating a 24-node Redis Cluster handling 480k operations per second across two regions. Each one cost us either availability or a long debugging session, and most of them are not mentioned in the official documentation.

Pitfall 1: The Resharding Window Kills Throughput

Adding a node to a Redis Cluster requires moving slots, which means copying keys between masters. The documentation describes this as an online operation. It is online in the sense that the cluster continues to accept reads and writes. It is not online in the sense that performance is unaffected.

During slot migration, the source master serializes each key, sends it to the destination, and waits for acknowledgement before deleting it locally. Command execution is single-threaded on both ends, so for keys with large values (a Sorted Set with 50k members, a Hash with 100k fields) every other client on those two nodes queues behind the transfer. We measured p99 command latency jumping from 1.2ms to 340ms during a single redis-cli --cluster rebalance run on a cluster that was nominally well-provisioned.

What works: run redis-cli --cluster rebalance with --cluster-pipeline 1 --cluster-timeout 60000 during a planned window, monitor the number of slots in migrating and importing state, and pause rebalancing if the latency added by MIGRATE exceeds your SLO budget. Better: avoid resharding by sizing the cluster correctly from the start.
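Monitoring open slots can be sketched as a small parser over CLUSTER NODES output. This is an illustration rather than our production tooling: the function name and the sample node IDs are made up, and the parser assumes the text has already been fetched from each master (open-slot markers only appear on the line of the node that reports itself as "myself").

```python
import re

# Open-slot markers as printed by CLUSTER NODES on the "myself" line:
#   [<slot>->-<node-id>]  the slot is migrating away to <node-id>
#   [<slot>-<-<node-id>]  the slot is being imported from <node-id>
_OPEN_SLOT = re.compile(r"^\[(\d+)-([<>])-([0-9a-f]+)\]$")


def count_open_slots(cluster_nodes_output: str) -> dict:
    """Count slots in migrating and importing state in CLUSTER NODES text."""
    counts = {"migrating": 0, "importing": 0}
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        # The first 8 fields are id, address, flags, master id, ping-sent,
        # pong-recv, config-epoch and link-state; slot tokens follow.
        for token in fields[8:]:
            match = _OPEN_SLOT.match(token)
            if match is None:
                continue  # a plain slot or range such as "0-5460"
            direction = match.group(2)
            counts["migrating" if direction == ">" else "importing"] += 1
    return counts


# Hypothetical sample: one replica line and one master with two open slots.
SAMPLE = (
    "07c37dfeb235213a872192d90877d0cd55635b91 127.0.0.1:30004@31004 slave "
    "e7d1eecce10fd6bb5eb35b9f99a514335d9ba9ca 0 1426238317239 4 connected\n"
    "e7d1eecce10fd6bb5eb35b9f99a514335d9ba9ca 127.0.0.1:30001@31001 "
    "myself,master - 0 0 1 connected 0-5460 "
    "[93->-292f8b365bb7edb5e285caf0b7e6ddc7265d2f4f] "
    "[77-<-67ed2db8d677e59ec4a4cefb06858cf2a1a89fa1]\n"
)

print(count_open_slots(SAMPLE))
```

A count that stays above zero after the rebalance has finished usually means a migration was interrupted and a slot was left open.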

Pitfall 2: PFCOUNT Across Slots Is a Lie

HyperLogLog operations work fine on a single key. The moment you try PFCOUNT key1 key2 key3 with keys on different slots, Redis Cluster rejects the command with CROSSSLOT. The straightforward fix — hash-tagging all HLL keys so they land on the same node — defeats the entire point of clustering.
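To see in advance which multi-key commands will be rejected, the slot calculation can be reproduced client-side. The sketch below follows the rule in the Redis Cluster specification (CRC16/XMODEM of the key, or of its hash tag if one is present, modulo 16384); the helper names are our own and the key names are illustrative.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0, as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Return the cluster slot for a key, honoring {hash tags}."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag between the first "{" and the next "}" counts.
        if end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


def would_crossslot(keys) -> bool:
    """True if the keys span more than one slot, so a multi-key command fails."""
    return len({key_slot(k) for k in keys}) > 1


print(would_crossslot(["hll:2026-05-14", "hll:2026-05-15"]))
print(would_crossslot(["{hll}:2026-05-14", "{hll}:2026-05-15"]))
```

The second call shows the hash-tag workaround: both keys hash only the string "hll", so they share one slot and therefore one node.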

What worked for us: maintain per-shard HLL aggregates and merge them application-side using a HyperLogLog library that understands the wire format. We use the Stream-Lib Java implementation; the merged cardinality has the same accuracy guarantees as a single-key PFCOUNT, but the merge happens in the client and stays clear of Redis Cluster's slot constraint.
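The reason a client-side merge loses no accuracy is that merging two HyperLogLog sketches is a register-wise maximum, which yields exactly the sketch you would have built from the union of the inputs. The toy implementation below demonstrates that property only. It is not Redis's encoding and not the Stream-Lib API; the register count, hash function, and key names are assumptions chosen for illustration.

```python
import hashlib
import math

P = 12        # illustrative precision: 2**12 = 4096 registers
M = 1 << P


def new_sketch() -> list:
    return [0] * M


def add(sketch: list, item: str) -> None:
    """Record one item: the top P hash bits pick a register, the rest give the rank."""
    h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
    index = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1 bit
    if rank > sketch[index]:
        sketch[index] = rank


def merge(*sketches: list) -> list:
    """Merge sketches by taking the maximum of each register."""
    return [max(registers) for registers in zip(*sketches)]


def estimate(sketch: list) -> float:
    """Standard HyperLogLog estimate with the small-range correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


# Two per-shard sketches with overlapping users, plus a reference sketch
# built directly from the union.
shard_a, shard_b, union = new_sketch(), new_sketch(), new_sketch()
for i in range(0, 12000):
    add(shard_a, f"user-{i}")
    add(union, f"user-{i}")
for i in range(8000, 20000):
    add(shard_b, f"user-{i}")
    add(union, f"user-{i}")

merged = merge(shard_a, shard_b)
print(merged == union, round(estimate(merged)))
```

Because the merged registers are identical to the union's registers, the overlap between shards is not double-counted.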

This pattern generalizes. Any "aggregate across keys" workflow — SUNIONSTORE, ZUNIONSTORE — needs to be redesigned for cluster mode. The naive port from a single Redis instance will either fail with CROSSSLOT or become an accidental hot key when all data is force-routed to one node via hash tags.

Pitfall 3: AOF Rewrites Can Stall Replication

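With the default auto-aof-rewrite-percentage of 100, an AOF rewrite is triggered automatically once the file has doubled in size since the last rewrite (and is larger than auto-aof-rewrite-min-size). The rewrite runs in a forked child process, so the parent continues serving traffic. So far so good.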

The trap: the rewrite consumes disk I/O bandwidth that the parent process also needs for its own AOF appends. On nodes with cheap local SSD, we observed AOF writes from the parent blocking for hundreds of milliseconds during the rewrite. Blocked AOF writes block replication acknowledgements. Blocked replication acknowledgements increase replica lag. If a replica's data becomes older than (cluster-node-timeout * cluster-replica-validity-factor) + repl-ping-replica-period, the replica is marked ineligible for failover, and you have effectively lost redundancy until the rewrite completes.
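It helps to know how large the eligibility window actually is for a given configuration. The sketch below works through the arithmetic as redis.conf documents it, where the full threshold also adds repl-ping-replica-period to the product of the timeout and the validity factor. The function names are ours; the default values are the stock Redis defaults (15000 ms, factor 10, ping period 10 s).

```python
def failover_validity_window_s(node_timeout_ms: int,
                               validity_factor: int,
                               repl_ping_period_s: int) -> float:
    """Maximum data age, in seconds, at which a replica may still fail over."""
    return node_timeout_ms / 1000 * validity_factor + repl_ping_period_s


def replica_can_fail_over(data_age_s: float,
                          node_timeout_ms: int = 15000,
                          validity_factor: int = 10,
                          repl_ping_period_s: int = 10) -> bool:
    """Apply the validity check; a factor of 0 disables it entirely."""
    if validity_factor == 0:
        return True
    window = failover_validity_window_s(
        node_timeout_ms, validity_factor, repl_ping_period_s)
    return data_age_s <= window


# Stock defaults, then a 5-second node timeout with the same factor.
print(failover_validity_window_s(15000, 10, 10))
print(failover_validity_window_s(5000, 10, 10))
```

Lowering cluster-node-timeout shrinks this window proportionally, so an aggressive timeout paired with a small validity factor leaves little slack for a stalled replication link.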

The fix is mostly about disk. Use provisioned-IOPS storage, not burstable volumes. Set aof-use-rdb-preamble yes so the rewrite produces a compact RDB header followed by an AOF tail, reducing total bytes written. Monitor loading, aof_rewrite_in_progress, aof_pending_rewrite, and aof_delayed_fsync from INFO persistence as first-class metrics, not afterthoughts.
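As a rough sketch of the monitoring side, the AOF-related fields can be pulled out of INFO persistence text and compared between two scrapes. The function names and the shape of the returned dictionary are assumptions for illustration; aof_delayed_fsync is a counter, so its delta between scrapes is the useful signal.

```python
def parse_info(info_text: str) -> dict:
    """Parse 'key:value' lines from INFO output, skipping section headers."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def aof_pressure(prev: dict, curr: dict) -> dict:
    """Summarize AOF rewrite state and fsync delays between two scrapes."""
    return {
        "rewrite_in_progress": curr.get("aof_rewrite_in_progress") == "1",
        "rewrite_pending": curr.get("aof_pending_rewrite") == "1",
        "delayed_fsync_delta": int(curr.get("aof_delayed_fsync", 0))
        - int(prev.get("aof_delayed_fsync", 0)),
    }


# Hypothetical scrapes taken before and during a rewrite.
BEFORE = "# Persistence\r\nloading:0\r\naof_enabled:1\r\naof_rewrite_in_progress:0\r\naof_pending_rewrite:0\r\naof_delayed_fsync:4\r\n"
DURING = "# Persistence\r\nloading:0\r\naof_enabled:1\r\naof_rewrite_in_progress:1\r\naof_pending_rewrite:0\r\naof_delayed_fsync:9\r\n"

print(aof_pressure(parse_info(BEFORE), parse_info(DURING)))
```

A rising delayed-fsync delta while a rewrite is in progress is the pattern that matched our replication stalls.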

Pitfall 4: Client Library Topology Caches Go Stale

Cluster-aware clients fetch the slot-to-node map at connect time, then refresh it on receiving MOVED redirects. This works during normal operation. It fails badly during a partial network partition where some nodes can talk to each other but the client cannot reach the current owner of a slot.

Symptom: the client's cached map points to node A as the owner of slot X. Node A is in a partition that excludes the client. The client retries, gets a connection error, retries again, gets the same error. The client never receives a MOVED response that would refresh its map, because the failed-over master is in a different partition.

Mitigation: configure clients to periodically refresh their topology even without a MOVED redirect. In Lettuce we enable periodic topology refresh (enablePeriodicRefresh on ClusterTopologyRefreshOptions) with a 30-second interval. In go-redis we explicitly call ReloadState after consecutive connection failures. Without this, partial partitions produce extended unavailability that looks like a client bug but is actually a cache coherence problem.
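The refresh policy itself is independent of any client library and can be sketched in a few lines. Everything below is hypothetical: TopologyCache is not a class from Lettuce or go-redis, and fetch_slot_map stands in for whatever call asks a reachable seed node for the current slot-to-node map.

```python
import time


class TopologyCache:
    """Slot map cache that refreshes on age or on repeated connection failures."""

    def __init__(self, fetch_slot_map, max_age_s=30.0, failure_threshold=3,
                 clock=time.monotonic):
        self._fetch = fetch_slot_map          # hypothetical: queries a seed node
        self._max_age_s = max_age_s
        self._failure_threshold = failure_threshold
        self._clock = clock
        self._failures = 0
        self._slot_map = fetch_slot_map()
        self._fetched_at = clock()

    def owner(self, slot):
        """Return the cached owner, refreshing first if the map is too old."""
        if self._clock() - self._fetched_at >= self._max_age_s:
            self.refresh()
        return self._slot_map[slot]

    def record_success(self):
        self._failures = 0

    def record_connection_failure(self):
        """Refresh without waiting for MOVED once failures reach the threshold."""
        self._failures += 1
        if self._failures >= self._failure_threshold:
            self.refresh()

    def refresh(self):
        self._slot_map = self._fetch()
        self._fetched_at = self._clock()
        self._failures = 0
```

The important design point is that the refresh must go to a node the client can still reach, not to the cached owner of the slot, otherwise it fails for the same reason the original request did.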

Pitfall 5: Failover Triggers Are Coarser Than You Think

A Redis Cluster master is failed over when a quorum of other masters agree it is unreachable. The quorum is computed using gossip messages, which are also responsible for cluster state propagation. Under load, gossip messages can be delayed, and the cluster will be slow to recognize a degraded master.

The defaults are forgiving: cluster-node-timeout defaults to 15 seconds, meaning a half-dead master can serve degraded traffic for that long before failover. Worse, "half-dead" is the dangerous case. A fully dead master fails fast. A master that responds slowly to gossip but still accepts writes accumulates writes its replicas have not yet received, and because replication is asynchronous those writes may be lost on failover.

We tuned cluster-node-timeout to 5 seconds and added an external health check that calls CLUSTER FAILOVER FORCE on a replica if the master fails our liveness probe (a write-and-read round trip with a 2-second budget) for three consecutive checks. The external probe is more aggressive than gossip and catches the slow-write scenario the cluster's own consensus misses.
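The control loop of such a probe is simple enough to sketch. The class below is an illustration, not our actual checker: FailoverWatchdog, probe, and trigger_failover are hypothetical names, with probe standing in for the write-and-read round trip under its 2-second budget and trigger_failover for sending CLUSTER FAILOVER FORCE to a chosen replica.

```python
class FailoverWatchdog:
    """Trigger a failover after N consecutive failed liveness probes."""

    def __init__(self, probe, trigger_failover, threshold=3):
        self._probe = probe                        # returns True when healthy
        self._trigger_failover = trigger_failover  # hypothetical side effect
        self._threshold = threshold
        self._consecutive_failures = 0

    def tick(self) -> bool:
        """Run one check; return True if this check triggered a failover."""
        try:
            healthy = bool(self._probe())
        except Exception:
            # A probe that raises (timeout, connection error) counts as failed.
            healthy = False

        if healthy:
            self._consecutive_failures = 0
            return False

        self._consecutive_failures += 1
        if self._consecutive_failures >= self._threshold:
            self._trigger_failover()
            self._consecutive_failures = 0
            return True
        return False
```

Requiring consecutive failures, and resetting the counter on any success, keeps a single slow round trip from promoting a replica.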

What We Would Do Differently

If we started over today, the decision tree would be:

Redis Cluster works. It is also less of a finished product than the README suggests. The five pitfalls above are not exotic edge cases — they show up in any cluster that runs long enough under real load. Knowing about them in advance turns "weird Redis incident" into "expected operational task."