Kubernetes 1.1 Networking: Surviving the Overlay Chaos in Production
Let's be honest. We all love the idea of Kubernetes. Since version 1.0 dropped last summer, it's been the shiny new toy every CTO wants to play with. But if you have actually tried to deploy a multi-node cluster manually, without a script wrapping it all up, you know the truth. Kubernetes networking is where dreams go to die.
I spent the last week debugging a packet loss issue on a three-node cluster that defied logic. The pods were running, the services were discovered, but random TCP connections between the frontend and the Redis backend were dropping. The culprit? An MTU mismatch inside a VXLAN tunnel overlaid on a network that was already struggling with jitter.
If you are building a platform in 2016, you can't ignore this complexity. Docker's default bridge networking doesn't scale across hosts, and the Kubernetes requirement of flat pod-to-pod communication without NAT forces us to get our hands dirty with overlay networks and iptables.
The Architecture: How K8s 1.1 Moves Packets
In the current v1.1 release, we are seeing a shift. The old userspace kube-proxy is reliable but slow. It involves a context switch for every packet. The new hotness is the iptables mode. It offers much higher throughput because the kernel handles the routing without passing packets up to a userspace daemon.
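Switching over is a kube-proxy flag in 1.1; the iptables proxier is still opt-in. A minimal sketch, with a placeholder API server address, and worth verifying against your own kube-proxy build:

# Run kube-proxy with the iptables proxier instead of the userspace one
# (https://10.0.0.10:6443 is a placeholder for your API server)
kube-proxy --master=https://10.0.0.10:6443 --proxy-mode=iptables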
Here is what happens when a packet hits a Service IP (ClusterIP):
- The packet arrives at the node's interface.
- iptables rules (managed by kube-proxy) trap any packet destined for a virtual Service IP.
- The rules DNAT (Destination Network Address Translation) the packet to a specific Pod IP, selected by a probability match (effectively random load balancing across the endpoints).
If you are running high-traffic workloads, you need to enable iptables mode. But debugging it looks like this:
$ iptables-save | grep KUBE-SVC
-A KUBE-SVC-6N4SJQ535N7 -m comment --comment "default/my-nginx:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-CJ3H429
-A KUBE-SVC-6N4SJQ535N7 -m comment --comment "default/my-nginx:" -j KUBE-SEP-XYZ123
If your underlying VPS has unstable CPU stealing (noisy neighbors), the latency in processing these chains spikes. This is where "cheap" hosting gets expensive. You save 50 NOK on the instance, but you lose hours debugging latency.
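A quick way to tell whether steal time is the culprit before you blame Kubernetes: watch the st column.

# "st" is the percentage of CPU time stolen by the hypervisor;
# anything consistently above a few percent will show up as network latency
vmstat 1 5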
The Overlay Wars: Flannel vs. Weave
Right now, the community is split. We have Flannel (from CoreOS) and Weave. For a standard setup, I prefer Flannel for its simplicity, specifically using the VXLAN backend. It encapsulates packets in UDP.
However, VXLAN adds overhead. You lose 50 bytes per packet. If your physical network interface has an MTU of 1500, your Flannel interface (flannel.1) must have an MTU of 1450. If you don't configure this explicitly, packets get fragmented or dropped.
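You can verify the tunnel MTU end-to-end with a don't-fragment ping from one node to a pod on another node. With a 1450-byte tunnel, subtracting 20 bytes of IP header and 8 bytes of ICMP header leaves 1422 bytes of payload; the pod IP below is just an example from the 10.244.0.0/16 range.

# Confirm the MTU flanneld configured on the VXLAN interface
ip -d link show flannel.1
# This should succeed (1422 + 28 = 1450)
ping -M do -s 1422 -c 3 10.244.16.2
# This should fail with a fragmentation error if the MTU is set correctly
ping -M do -s 1423 -c 3 10.244.16.2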
Configuring Flannel in ETCD
Don't rely on defaults. Store your config in etcd before starting the flanneld service. Here is the configuration I use for production clusters targeting Norwegian infrastructure:
# Populate the config in etcd
etcdctl set /coreos.com/network/config '{
"Network": "10.244.0.0/16",
"SubnetLen": 24,
"Backend": {
"Type": "vxlan",
"VNI": 1
}
}'
Once flanneld starts, it reads this, allocates a /24 subnet to the host (e.g., 10.244.15.0/24), and writes the subnet keys back to /run/flannel/subnet.env. You must then pass these environment variables to the Docker daemon start options (--bip=${FLANNEL_SUBNET} --mtu=${FLANNEL_MTU}).
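The generated file is plain shell variables; on one of my nodes it looks roughly like this (values here are illustrative):

# /run/flannel/subnet.env, written by flanneld at startup
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.15.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false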
If you forget the MTU flag on the Docker daemon, Docker defaults to 1500, Flannel tries to push 1500 bytes into a 1450-byte tunnel, and your database connections hang indefinitely. Ask me how I know.
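On a systemd host running Docker 1.9, the least fragile way I've found to wire this up is a drop-in that sources the Flannel file before the daemon starts. A sketch only; the unit name and binary path may differ on your distro:

# /etc/systemd/system/docker.service.d/40-flannel.conf
[Service]
EnvironmentFile=/run/flannel/subnet.env
ExecStart=
ExecStart=/usr/bin/docker daemon --bip=${FLANNEL_SUBNET} --mtu=${FLANNEL_MTU}

Then run systemctl daemon-reload && systemctl restart docker and confirm that docker0 picked up the Flannel subnet and MTU with `ip addr show docker0`.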
The "CoolVDS" Reference Architecture
We test our Kubernetes deployments strictly on KVM-based virtualization. Containers on top of Containers (like OpenVZ) are a nightmare for overlay networks because you often lack the kernel modules (bridge, nf_conntrack) required to manipulate the network stack effectively.
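Before you install anything, check that the modules the overlay depends on are actually loadable in your guest; on an OpenVZ container this is typically where things fall apart:

# These should all load (or already be built into the kernel) on a proper KVM guest
lsmod | egrep 'vxlan|bridge|nf_conntrack'
modprobe vxlan && echo "vxlan available"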
At CoolVDS, we expose the full CPU instruction set and necessary kernel modules to the guest. This is critical for tools like Weave or Flannel to function without kernel panics. Furthermore, with the invalidation of Safe Harbor last October (thanks, Schrems), keeping data inside Norway is no longer optional for many of our clients; it's a compliance requirement. Latency to the NIX (Norwegian Internet Exchange) in Oslo is under 3ms from our datacenter, which keeps `etcd` cluster convergence fast and happy.
Pro Tip: Monitor your `conntrack` table usage. Kubernetes creates a massive number of entries. If you see "table full, dropping packet" in `dmesg`, you need to bump the sysctl setting:
sysctl -w net.netfilter.nf_conntrack_max=131072
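Check how close you are to the ceiling before packets start dropping, and persist the value so a reboot doesn't quietly revert it:

# Current entries vs. the limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# Survive reboots (path assumes a sysctl.d-aware distro)
echo "net.netfilter.nf_conntrack_max = 131072" > /etc/sysctl.d/90-conntrack.conf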
Validating the Setup
Don't assume it works just because the pods are green. Run a bandwidth test between pods on different nodes.
# Start a listener on Node A (the stock alpine image ships without iperf, so install it first)
kubectl run iperf-server -i --tty --image=alpine --restart=Never -- sh -c "apk update && apk add iperf && iperf -s"
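# Look up the server pod's IP to substitute for [POD_IP_FROM_ABOVE] below
kubectl describe pod iperf-server | grep IP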
# Run the client on Node B
kubectl run iperf-client -i --tty --image=alpine --restart=Never -- sh -c "apk update && apk add iperf && iperf -c [POD_IP_FROM_ABOVE]"
On a standard CoolVDS NVMe instance, we typically see near line-rate performance even with VXLAN encapsulation, because the I/O wait is virtually zero.
Final Thoughts
Kubernetes 1.1 is powerful, but it assumes your underlying network is robust. Overlay networks like Flannel are essentially software routers. If your CPU is busy fighting for cycles on a crowded host, your network throughput suffers. Treat your infrastructure with the same respect you treat your code.
If you are planning a Kubernetes rollout in 2016, stop fighting the "Noisy Neighbor" effect. Spin up a KVM instance on CoolVDS and focus on your yaml files, not packet fragmentation.