Surviving the Split-Brain: A Practical Guide to etcd Clustering in Production

Most sysadmins are still deploying configuration files using rsync loops or, God forbid, manual edits. It works fine for two servers. It becomes a nightmare at ten. By the time you hit fifty nodes, you have what I call "Configuration Drift Hell." One web server has an old Nginx config, another has the wrong database connection string, and suddenly your load balancer is throwing 502 errors that are impossible to debug.

The industry is shifting. With the release of Kubernetes 1.0 earlier this summer and the stability of CoreOS, we are moving toward immutable infrastructure. At the heart of this shift is etcd.

It’s not just a key-value store. It’s the source of truth for your cluster. But here is the hard reality: etcd is ruthless about network quality. If your hosting provider has jittery packet loss, your cluster will fail. I’ve seen it happen.

Why etcd? (And why not Redis?)

I get asked this constantly. "Why can't I just use Redis for config management?"

Redis is designed for speed. etcd is designed for correctness. It uses the Raft consensus algorithm. In a distributed system, you are fighting the CAP theorem, and etcd picks Consistency and Partition Tolerance. If the network splits, the minority side stops accepting writes rather than letting you write bad data. Redis (with its standard asynchronous replication) will happily accept writes on the wrong side of a partition, and those writes can be silently discarded once the split heals.

When you are storing the IP addresses of your database masters or API keys, you don't want "eventual" consistency. You want actual consistency.
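
To make that concrete, here is a minimal smoke test with etcdctl (the key name is just an illustration). The write is only acknowledged once a quorum of members has committed it, and any member will then hand back the committed value:

# Write a value; the command returns only after a quorum has committed it.
./etcdctl set /config/db/master 10.0.0.5
# Read it back from any member.
./etcdctl get /config/db/master
# Block until the key changes, e.g. during a failover.
./etcdctl watch /config/db/master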

The Architecture of a Fail-Safe Cluster

To run etcd in production, you need an odd number of nodes (3, 5, or 7) to maintain a quorum. Quorum is a strict majority, floor(N/2) + 1, so a 3-node cluster survives the loss of 1 node and a 5-node cluster survives 2. A fourth node buys you nothing: the majority requirement jumps from 2 to 3 while the failure tolerance stays at 1.

The Hardware Reality

Raft relies on two things:

  1. Disk I/O: Every write must be fsync'd to disk. Slow rotating rust (HDDs) will kill your write throughput.
  2. Network Latency: The leader node sends heartbeats to followers. If the round-trip time (RTT) exceeds the election timeout, followers revolt and start a new election. This stops the world.

Pro Tip: Do not run an etcd cluster across the Atlantic. The latency variation will trigger constant leader elections. Keep your nodes in the same region. For us operating out of Scandinavia, that means keeping data within local data centers to satisfy the upcoming data protection regulations.
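
The disk side of this is easy to quantify before you trust a box with etcd's write-ahead log. A quick synchronous-write run with fio gives a rough feel for fsync latency; the job parameters below are a sketch, not an official etcd requirement:

# Small sequential writes with fdatasync after every write, roughly
# mimicking etcd's log pattern. The target directory must already exist.
fio --name=etcd-disk-check --directory=/var/lib/etcd-bench \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=64m --bs=2k

If the reported sync latencies sit in the tens of milliseconds, no amount of tuning will make consensus traffic happy.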

Setting Up a 3-Node Cluster (The CoreOS Way)

Assuming you are running a modern Linux distro (CoreOS, CentOS 7, or Debian Jessie), here is how we bootstrap a static cluster. We will use the standard ports 2379 for client comms and 2380 for peer comms.

On Node 1 (IP: 10.0.0.1):

./etcd -name infra0 \
  -initial-advertise-peer-urls http://10.0.0.1:2380 \
  -listen-peer-urls http://10.0.0.1:2380 \
  -listen-client-urls http://10.0.0.1:2379,http://127.0.0.1:2379 \
  -advertise-client-urls http://10.0.0.1:2379 \
  -initial-cluster-token etcd-cluster-1 \
  -initial-cluster infra0=http://10.0.0.1:2380,infra1=http://10.0.0.2:2380,infra2=http://10.0.0.3:2380 \
  -initial-cluster-state new
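
For reference, infra1 (10.0.0.2) is identical apart from the node name and its own listen/advertise URLs; the -initial-cluster line must be the same on every member:

./etcd -name infra1 \
  -initial-advertise-peer-urls http://10.0.0.2:2380 \
  -listen-peer-urls http://10.0.0.2:2380 \
  -listen-client-urls http://10.0.0.2:2379,http://127.0.0.1:2379 \
  -advertise-client-urls http://10.0.0.2:2379 \
  -initial-cluster-token etcd-cluster-1 \
  -initial-cluster infra0=http://10.0.0.1:2380,infra1=http://10.0.0.2:2380,infra2=http://10.0.0.3:2380 \
  -initial-cluster-state new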

infra2 (10.0.0.3) follows the same pattern. Once all three are up, verify the member list:

./etcdctl member list

If your network is stable, you will see all three members listed.
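
etcdctl also ships a health probe that queries every member and tells you which one, if any, is unreachable or lagging:

./etcdctl cluster-health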

The "Cheap VPS" Trap

I recently debugged a setup for a client trying to run a Docker cluster on budget VPS instances hosted in Germany. They were seeing random timeouts in their API. The logs pointed to etcd leader elections happening every 20 minutes.

The culprit? CPU Steal and Network Jitter.

On oversold shared hosting, your "dedicated" CPU core is actually fighting with 20 other neighbors. When a neighbor compiles a kernel, your etcd process stalls for a couple of hundred milliseconds at a time. With the default 100 ms heartbeat, a few of those stalls back to back mean followers stop hearing from the leader for longer than the election timeout. The cluster declares the leader dead, forces an election, and your API locks up.
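
If you are stuck on hardware like that, you can buy a little headroom by relaxing the Raft timers. etcd exposes them as flags (the defaults are a 100 ms heartbeat and a 1000 ms election timeout); the values below are illustrative, are in milliseconds, and must match on every member:

# The same bootstrap flags as before, with relaxed Raft timers appended.
# Keep the election timeout at least 5-10x the heartbeat interval.
./etcd -name infra0 \
  -initial-advertise-peer-urls http://10.0.0.1:2380 \
  -listen-peer-urls http://10.0.0.1:2380 \
  -listen-client-urls http://10.0.0.1:2379,http://127.0.0.1:2379 \
  -advertise-client-urls http://10.0.0.1:2379 \
  -initial-cluster-token etcd-cluster-1 \
  -initial-cluster infra0=http://10.0.0.1:2380,infra1=http://10.0.0.2:2380,infra2=http://10.0.0.3:2380 \
  -initial-cluster-state new \
  -heartbeat-interval 250 \
  -election-timeout 2500

Longer timers also mean real failures take longer to detect, so treat this as a band-aid, not a substitute for decent hosting.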

Why Infrastructure Matters

This is why we standardized our internal stacks on CoolVDS. We aren't just looking for space; we are looking for guaranteed cycles.

Feature             Budget Host               CoolVDS (KVM)
Virtualization      OpenVZ (Shared Kernel)    KVM (Hardware Isolation)
Storage             Shared HDD/SATA SSD       Dedicated SSD Arrays
Network Priority    Best Effort               Low Latency to NIX

CoolVDS uses KVM. This means our RAM and CPU operations are hardware isolated. No noisy neighbors stealing our cycles. Furthermore, for those of us in Norway, having the data sit physically in Oslo reduces latency to the bare minimum, ensuring those Raft heartbeats are practically instantaneous.

Security: Don't Ignore It

By default, etcd talks over HTTP. If you are running this over a public network (which you shouldn't be, but sometimes you have to), anyone can read your configs.

With the Data Protection Directive and the stricter enforcement we are seeing from Datatilsynet, leaving config data unencrypted is negligence. You must generate TLS certificates for your cluster. It’s a pain to set up, but it is mandatory for production.
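
Here is a minimal sketch of what the TLS-related flags look like on each member, assuming you have already generated a CA plus server and peer certificates (the file paths are placeholders, and the cluster bootstrap flags from earlier are omitted for brevity):

# Client-facing TLS plus mutual TLS between peers; cert paths are placeholders.
./etcd -name infra0 \
  -listen-client-urls https://10.0.0.1:2379 \
  -advertise-client-urls https://10.0.0.1:2379 \
  -cert-file /etc/etcd/ssl/server.pem \
  -key-file /etc/etcd/ssl/server-key.pem \
  -trusted-ca-file /etc/etcd/ssl/ca.pem \
  -client-cert-auth=true \
  -listen-peer-urls https://10.0.0.1:2380 \
  -initial-advertise-peer-urls https://10.0.0.1:2380 \
  -peer-cert-file /etc/etcd/ssl/peer.pem \
  -peer-key-file /etc/etcd/ssl/peer-key.pem \
  -peer-trusted-ca-file /etc/etcd/ssl/ca.pem \
  -peer-client-cert-auth=true

Note that once peer TLS is on, the peer URLs in -initial-cluster have to switch to https as well.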

Final Thoughts

Distributed systems are hard. They amplify the weaknesses in your underlying infrastructure. You can have the best etcd configuration in the world, but if your disk write latency spikes or your network packets drop, the system fails.

Stop fighting with hardware that wasn't built for distributed consensus. If you are building the future of infrastructure, build it on ground that doesn't shake.

Need stable ground? Deploy your etcd cluster on a CoolVDS SSD instance today and see what 1ms latency does for your stability.