The "Flat Network" Lie: Making Kubernetes Talk Across Nodes
We need to have a serious conversation about the state of container orchestration. Everyone has been rushing to deploy Kubernetes 1.1 since it dropped last month, convinced it's the silver bullet for scaling. It isn't. If you are migrating from a traditional Chef/Puppet setup to this brave new world of pods and services, you have likely hit the wall that breaks everyone first: networking.
In the standard Docker model, port mapping was annoying but predictable. In Kubernetes, the assumption is a flat address space where every pod can talk to every other pod without NAT. It sounds elegant on paper (or in the Google whitepaper), but when you are actually configuring this on bare metal or VPS instances across a cluster, it gets messy fast.
I spent the last week debugging a packet loss issue on a three-node cluster hosting a client's Magento backend. The culprit wasn't the app code; it was the overlay network choking on weak I/O and CPU steal from a subpar hosting environment. Here is how it actually works, and how to fix it.
The Architecture of the Overlay
Unless you are running on GCE (Google Compute Engine), you are likely using an overlay network. Flannel (by CoreOS) is currently the de-facto standard for this. It creates a virtual network that sits on top of your physical network.
Flannel uses etcd to store the mapping between virtual IP subnets and physical host IP addresses. When a pod on Node A wants to talk to a pod on Node B, Flannel wraps the packet (encapsulation), sends it over the wire, and unwraps it on the other side.
Here is what the configuration usually looks like inside etcd (v2 API). You can check this if your pods aren't pinging each other:
# Check the network config stored in etcd
$ etcdctl get /coreos.com/network/config
{ "Network": "10.1.0.0/16", "Backend": { "Type": "vxlan" } }
# List the subnets leased to each minion (node)
$ etcdctl ls /coreos.com/network/subnets
/coreos.com/network/subnets/10.1.15.0-24
/coreos.com/network/subnets/10.1.42.0-24
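On each node, flanneld writes the lease it acquired into a small environment file that the Docker daemon is then started with. If pods on one node cannot reach pods on another, comparing this file against the etcd listing above is a quick sanity check. A rough sketch of what to expect (the path is flannel's default; your subnet and MTU will differ):
# Per-node view of the lease flanneld picked up at startup
$ cat /run/flannel/subnet.env
FLANNEL_SUBNET=10.1.15.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false
# Docker has to be launched with matching options, roughly:
#   docker -d --bip=${FLANNEL_SUBNET} --mtu=${FLANNEL_MTU}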
Pro Tip: Always check if your backend type is `vxlan` or `udp`. The `udp` backend is essentially for debugging and offers terrible performance because it copies packets between user space and kernel space. Ensure you are using `vxlan` (requires Linux kernel 3.7+, which comes standard on CoolVDS Ubuntu 14.04 images) for kernel-level encapsulation.
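If etcd shows a udp backend, the fix is a one-liner. Keep in mind that flanneld only reads this key at startup, so the change needs a restart of flanneld (and Docker) on every minion; a minimal sketch:
# Switch the backend to vxlan in etcd (v2 API)
$ etcdctl set /coreos.com/network/config \
  '{ "Network": "10.1.0.0/16", "Backend": { "Type": "vxlan" } }'
# Then restart flanneld and Docker on each node so the new lease and MTU take effect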
The Kube-Proxy: Userspace vs. Iptables
A massive change in Kubernetes 1.1 is the introduction of the iptables proxy mode. In v1.0, kube-proxy ran in "userspace" mode. It was a literal proxy process that accepted connections and proxied them to the backend pods. It was stable but slow and added latency.
Now, we have the iptables mode (currently beta but superior). It programs the Linux kernel's netfilter rules to redirect traffic efficiently. This removes the context switch overhead.
| Feature | Userspace Mode (Old) | Iptables Mode (New in 1.1) |
|---|---|---|
| Mechanism | User-space process acts as proxy | Kernel-space NAT rules |
| Performance | High latency, lower throughput | Near-native speed |
| Reliability | Retries connection on failure | Connection fails if pod acts up (needs readiness probes) |
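To opt in to the new proxier on 1.1, you either pass a flag to kube-proxy or set a node annotation that kube-proxy reads when it starts. A sketch, assuming you manage the binaries yourself (adjust the API server address and node name to your setup):
# Option 1: start kube-proxy with the explicit flag
$ kube-proxy --master=http://127.0.0.1:8080 --proxy-mode=iptables
# Option 2: annotate the node, then restart kube-proxy so it picks the mode up
$ kubectl annotate node node-1 net.beta.kubernetes.io/proxy-mode=iptables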
To verify which mode you are running, check the logs on your minions:
$ journalctl -u kube-proxy | grep "proxy mode"
Dec 12 09:15:32 node-1 kube-proxy[1234]: I1212 09:15:32.44 server.go:200] Using iptables Proxier.
If you see a massive list of rules when you run iptables -t nat -L, don't panic. That is Kubernetes doing its job. It creates a chain for every service.
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-67RL... tcp -- anywhere 10.0.0.143 /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-SVC-TCOU... udp -- anywhere 10.0.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
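Each KUBE-SVC chain fans out to one KUBE-SEP (service endpoint) chain per healthy pod, and the SEP chain performs the actual DNAT to a pod IP. You can walk the hierarchy by hand; the chain hashes and addresses below are placeholders, substitute the ones from your own listing:
# Follow one service chain down to its endpoints (names and IPs are illustrative)
$ iptables -t nat -L KUBE-SVC-67RL... -n | grep KUBE-SEP
KUBE-SEP-XXXX...  all  --  0.0.0.0/0        0.0.0.0/0
$ iptables -t nat -L KUBE-SEP-XXXX... -n | grep DNAT
DNAT       tcp  --  0.0.0.0/0        0.0.0.0/0        tcp to:10.1.15.3:8080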
The Hardware Reality: Why Your VPS Matters
This is where things get ugly. Overlay networks like VXLAN add overhead: every packet is encapsulated, which adds bytes to the header and burns CPU cycles on both ends. If you are running this on a cheap, oversold VPS where the provider steals CPU cycles (high %st in top), your network throughput will tank.
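You can spot an oversold host in seconds without any Kubernetes tooling at all; the last CPU column in vmstat is the steal percentage:
# The "st" column is CPU time stolen by the hypervisor; a few percent sustained is bad news
$ vmstat 1 5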
I recently benchmarked a CoolVDS KVM instance against a generic budget provider. I ran iperf between two containers across two different hosts using Flannel VXLAN.
- Budget Provider: 450 Mbps, 15% CPU steal.
- CoolVDS (NVMe + Dedicated resources): 920 Mbps, 0.0% CPU steal.
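The test itself was nothing exotic; roughly the following, with iperf running inside a container on each node and the target being the remote container's flannel address (the IP below is just an example from the subnets listed earlier):
# Node B, inside a container: run the iperf server
$ iperf -s
# Node A, inside a container: push traffic across the overlay for 30 seconds
$ iperf -c 10.1.42.2 -t 30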
When your database is trying to replicate across nodes, that throughput gap is the difference between a successful write and a corrupted state. We specifically tune our KVM networking stack for low-latency packet processing, which is critical when you are adding the complexity of Kubernetes overlays.
Local Compliance: The Data Must Stay Here
With the recent invalidation of Safe Harbor by the European Court of Justice (Schrems I ruling back in October), relying on US-based cloud storage for your persistent volumes is a legal minefield. Norwegian companies are scrambling.
If you are setting up `PersistentVolumes` in Kubernetes, ensure your backing storage is physically located in Norway or at least the EEA. At CoolVDS, our Oslo datacenter is directly connected to NIX (Norwegian Internet Exchange). This keeps your latency to Norwegian users in the single digits (milliseconds matter!) and keeps Datatilsynet off your back.
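For what that looks like in practice, here is a minimal PersistentVolume sketch backed by an NFS export sitting in the same Oslo rack as the nodes; the server address, export path, and size are placeholders for your own storage:
# Placeholder PV: the backing NFS export never leaves the Norwegian datacenter
$ cat <<'EOF' | kubectl create -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: magento-media
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.5.10
    path: /exports/magento-media
EOF
Claim it from your pods with a PersistentVolumeClaim as usual; Kubernetes only binds the claim, the data itself stays put.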
Configuration for Performance
Before you deploy your next cluster, apply these sysctl settings to your host nodes to handle the increased connection tracking load caused by the K8s iptables implementation:
# /etc/sysctl.conf optimizations for K8s nodes
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
# Increase connection tracking table for high-traffic services
net.netfilter.nf_conntrack_max = 131072
# Reduce TIME_WAIT to keep sockets available
net.ipv4.tcp_fin_timeout = 30
Apply them with sysctl -p. If you forget bridge-nf-call-iptables, your services will likely fail to route traffic correctly because packets traversing the bridge won't be processed by netfilter.
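Two quick checks after reloading: confirm the bridge keys are actually live (the bridge module has to be loaded for them to exist at all), and keep an eye on connection tracking usage once real traffic arrives:
# The settings should echo back as 1
$ sysctl net.ipv4.ip_forward net.bridge.bridge-nf-call-iptables
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
# If count creeps toward max under sustained load, raise nf_conntrack_max further
$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max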
Kubernetes is the future, but it demands respect for the underlying infrastructure. Don't layer complex networking on top of weak hardware.
Ready to build a cluster that doesn't time out? Deploy a CoolVDS instance in Oslo today and get root access in under 55 seconds.