Demystifying Kubernetes 1.1 Networking: Flannel, Iptables, and Metal
If you have spent the last few months migrating from Chef/Puppet scripts to Kubernetes 1.1, you have likely hit the wall. Not the container wall—the networking wall. The promise of Kubernetes is elegant: a flat network space where every Pod can talk to every other Pod without Network Address Translation (NAT). It sounds perfect on paper.
In practice, achieving this on standard infrastructure is a battlefield. I have seen production clusters stall because of misconfigured MTU settings in overlay networks, and I have watched kube-proxy consume 40% of a CPU in userspace mode just to shuffle packets. If you are building a cluster in 2016, you cannot treat the network as an abstraction. You need to understand the pipes.
The "Flat Network" Lie (And How to Make It True)
Kubernetes imposes a specific networking model: IP-per-Pod. Unlike the standard Docker model, where containers get an IP on a private bridge and map ports to the host, Kubernetes demands that Pod IPs are routable across the cluster.
On a bare-metal switch where you control BGP, this is manageable. But on virtualized infrastructure—like a standard VPS—you cannot just assign arbitrary subnets to your VMs and expect the datacenter router to know where to send the traffic. This is where Overlay Networks come in, and where performance usually goes to die.
The Flannel Approach
Most of us are using CoreOS's Flannel right now. It is the pragmatic choice. It creates an overlay network (usually VXLAN) that wraps the Pods' Ethernet frames inside UDP packets so they can be shipped between nodes over the existing network.
Here is the reality of that encapsulation: Overhead. Every packet has headers added and removed. If your underlying VPS has "noisy neighbors" or high CPU steal, that encapsulation process introduces latency. In a microservices architecture where a user request hits five different backend services, that latency compounds.
Here is a typical Flannel configuration stored in etcd that I use for deployments requiring a 10.244.0.0/16 network:
{
  "Network": "10.244.0.0/16",
  "SubnetLen": 24,
  "Backend": {
    "Type": "vxlan",
    "VNI": 1
  }
}
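Flannel will not start handing out subnets until this config exists in etcd. Assuming a v2 etcdctl and flanneld's default prefix of /coreos.com/network (the endpoint IP below is a placeholder for your own etcd cluster), loading it looks roughly like this:
# Push the Flannel config to etcd under the default prefix
$ etcdctl --peers http://10.0.0.10:2379 set /coreos.com/network/config \
  '{"Network":"10.244.0.0/16","SubnetLen":24,"Backend":{"Type":"vxlan","VNI":1}}'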
If you are seeing dropped packets, check your MTU. The physical interface usually has an MTU of 1500. VXLAN adds 50 bytes of overhead. If your inner Flannel interface (flannel.1) is also set to 1500, packets will fragment or drop. You must ensure the inner MTU is 1450.
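Flanneld writes the MTU it calculated into /run/flannel/subnet.env, and the Docker daemon is supposed to be started with that value. A quick sanity check on a node that is dropping packets, assuming the default flannel.1 and docker0 interface names:
# What flanneld decided
$ grep FLANNEL_MTU /run/flannel/subnet.env
# What the interfaces are actually using (both should be 1450, not 1500)
$ ip link show flannel.1 | grep -o 'mtu [0-9]*'
$ ip link show docker0 | grep -o 'mtu [0-9]*'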
Pro Tip: Do not rely on default MTU detection in hybrid clouds. Force the MTU in your Flannel options if your hosting provider uses Jumbo Frames or QinQ encapsulation.
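To verify the path end to end, send a packet with the Don't Fragment bit set at exactly the size you expect to fit. With a 1450-byte inner MTU, that is 1422 bytes of ICMP payload once the 20-byte IP and 8-byte ICMP headers are accounted for (the target address below is just an example Pod IP on another node):
$ ping -M do -s 1422 -c 3 10.244.2.14
# If this fails while smaller payloads go through, something in the path still assumes 1500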
The Shift to Iptables Proxy
Prior to Kubernetes 1.1, the kube-proxy component ran in "userspace" mode. It was a literal proxy: traffic came into a port, went up to the userspace process, and was copied to the backend Pod socket. It was stable, but slow and resource-heavy.
With Kubernetes 1.1 (and the upcoming 1.2), the iptables mode is becoming the standard. In this mode, kube-proxy simply manages Linux iptables rules, and traffic is handled entirely by the kernel's netfilter subsystem. Packets never cross into a userspace process, so it is significantly faster and cheaper on CPU, even though rule matching for new connections still scales with the number of services.
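Note that in 1.1 the iptables proxier is still opt-in; you have to pass the flag to kube-proxy yourself. A typical systemd excerpt (the master address and kubeconfig path are placeholders for your own setup):
# /etc/systemd/system/kube-proxy.service (excerpt)
ExecStart=/usr/local/bin/kube-proxy \
  --master=https://10.0.0.10:6443 \
  --kubeconfig=/var/lib/kube-proxy/kubeconfig \
  --proxy-mode=iptables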
However, debugging it is harder. Instead of reading a log file, you are grepping netfilter rules. Here is what a service VIP routing looks like in the NAT table on a healthy node:
$ sudo iptables -t nat -L KUBE-SERVICES -n | head
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-X7Q3V4X5X5X5X5X5 tcp -- 0.0.0.0/0 10.0.0.144 /* default/my-nginx-service:http cluster IP */ tcp dpt:80
KUBE-SVC-Y8R4W5Y6Y6Y6Y6Y6 tcp -- 0.0.0.0/0 10.0.0.215 /* kube-system/kube-ui: cluster IP */ tcp dpt:80
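Each KUBE-SVC chain fans out to one KUBE-SEP chain per endpoint, and each KUBE-SEP chain DNATs to a real Pod IP and port. To follow a service all the way down (the chain name here is taken from the listing above):
$ sudo iptables -t nat -L KUBE-SVC-X7Q3V4X5X5X5X5X5 -n
# Expect one KUBE-SEP-* target per ready endpoint.
# Zero KUBE-SEP entries means the service has no ready Pods behind it.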
If you see the rules but traffic times out, check if IP forwarding is enabled in your kernel. This catches me every time I provision a new node:
$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0 # This breaks Kubernetes
# Fix it permanently
$ echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
$ sysctl -p
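While you are in sysctl, check the bridge netfilter toggle too. If it is 0, traffic crossing the Docker bridge skips iptables entirely and the kube-proxy rules above never see it (on newer kernels the br_netfilter module has to be loaded before this key even exists):
$ sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1   # 0 breaks service VIPs for same-node traffic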
Why Underlying Hardware Dictates Stability
This is where the "Pragmatic CTO" mindset kicks in. You can have the best iptables rules and the cleanest Flannel config, but if your etcd cluster is slow, your Kubernetes cluster will fall apart.
etcd uses the Raft consensus algorithm. It is extremely sensitive to disk write latency. If a follower takes too long to write a transaction to disk (fsync), it triggers a leader election. Leader elections pause the cluster. Paused clusters drop API requests.
We recently migrated a client's cluster from a generic budget VPS provider to CoolVDS specifically for this reason. The budget provider used spinning rust (HDD) or shared SATA SSDs where I/O wait times spiked unpredictably. CoolVDS offers NVMe storage, which in 2016 is still a luxury in many datacenters, but essential for distributed databases.
Benchmarking Etcd Storage
Before you add a node to your cluster, run fio to ensure the disk can handle the write load. Here is the benchmark we run on new CoolVDS instances:
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
If your 99th percentile fsync latency is above 10ms, do not run etcd there. On our CoolVDS NVMe instances, we consistently see sub-1ms latencies, which keeps the Raft heartbeats stable.
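fio tells you what the disk can do on a quiet day; watch it under real load as well. iostat from the sysstat package is enough to spot the I/O wait spikes that trigger leader elections:
# 1-second samples, extended device stats; find the row for the device
# backing your etcd data directory and watch the await and %util columns
$ iostat -x 1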
Data Sovereignty and The "Safe Harbor" Fallout
We are operating in a post-Safe Harbor world (invalidated Oct 2015). For Norwegian companies, sending customer data to US-controlled clouds is legally risky right now. The Datatilsynet (Norwegian Data Protection Authority) is clear about the responsibilities of data controllers.
Running Kubernetes on a sovereign Norwegian provider like CoolVDS isn't just about latency to the NIX (Norwegian Internet Exchange)—though seeing 2ms pings from Oslo is nice. It is about compliance. You know exactly where the physical drive sits.
Summary: Build on Rock, Not Sand
Kubernetes 1.1 is powerful, but it assumes a robust network and fast storage. Don't layer complex overlay networks on top of unstable, oversold virtual machines.
- Use KVM: Ensure your VPS provider uses KVM (like CoolVDS). You need your own kernel for overlay modules; container-based VPS (OpenVZ) shares the host kernel, so you often cannot load the vxlan module or flip the sysctls Docker and Flannel need (see the quick check after this list).
- Check MTU: Account for VXLAN overhead.
- Monitor I/O: Use NVMe for your etcd nodes.
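A thirty-second sanity check to run on any fresh instance before you commit to it (the module and sysctl below are the ones the VXLAN backend and kube-proxy depend on):
# Can we load the VXLAN module? This fails inside OpenVZ-style containers.
$ sudo modprobe vxlan && lsmod | grep vxlan
# Can we turn on forwarding?
$ sudo sysctl -w net.ipv4.ip_forward=1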
If you are tired of debugging random network timeouts and fighting with noisy neighbors, it is time to upgrade the foundation. Deploy a KVM-based, NVMe-powered instance on CoolVDS today and give your packets the highway they deserve.