Kubernetes Networking in Production: Surviving the iptables Maze
Let's be honest. The first time you deployed a pod and it actually pinged another pod on a different node, you felt like a wizard. Then you looked at the routing table. Then you looked at iptables. The magic faded, and the panic set in.
With the release of Kubernetes 1.4 yesterday, the ecosystem is maturing fast. kubeadm is finally making cluster bootstrapping less painful, but networking remains the dark art of container orchestration. I've spent the last week debugging a latency issue that turned out to be a conflict between Docker 1.12's bridge mode and a CNI plugin. It wasn't pretty.
If you are running Kubernetes in production—or planning to move your legacy monoliths over—you need to understand what happens below the API server. In this post, we are ripping out the abstraction layer. We are going to look at packets.
The "Flat Network" Lie
Kubernetes mandates a flat network structure. Every pod gets an IP. Every pod can talk to every other pod without NAT. Simple, right? In theory, yes. In practice, achieving this on top of a standard provider's network requires an Overlay Network (like VXLAN or UDP encapsulation) or complex BGP routing.
This is where performance goes to die.
Pro Tip: If you are running on a provider that blocks standard L2 traffic between nodes (common in cheap VPS hosting), you are forced into using encapsulation. This adds CPU overhead for every packet processed. On CoolVDS, we provide full KVM isolation and allow configurable L2 access on private networks, meaning you can run routing protocols like BGP if you're brave enough.
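Not sure whether your pod traffic is being encapsulated right now? A packet capture on the node answers that in seconds. A quick sketch, assuming Flannel's usual Linux VXLAN port (UDP 8472) and a node interface named eth0 — adjust both for your setup:
# If pod-to-pod traffic shows up wrapped in these UDP packets, you are paying
# the encapsulation tax on every single packet.
sudo tcpdump -n -i eth0 -c 20 udp port 8472

# The tunnel device itself reveals the VXLAN settings (ID, destination port).
ip -d link show flannel.1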
Kube-Proxy: Userspace is Dead, Long Live Iptables
Until recently, kube-proxy defaulted to "userspace" mode. It was stable but slow: it acted as a literal proxy, shuffling every byte back and forth between kernel and user space. Since Kubernetes 1.2, the default is iptables mode. This is a massive performance win, but it makes debugging a nightmare.
In iptables mode, kube-proxy watches the API server and programs Linux netfilter rules to redirect traffic. There is no actual proxy process touching the packets in the data path. It is pure kernel routing.
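Before you go further, confirm which mode your kube-proxy is actually running in. A minimal check, assuming kube-proxy runs as a plain process on the node:
# Look at the flags kube-proxy was started with; --proxy-mode=iptables (or unset
# on v1.2+) means netfilter rules, --proxy-mode=userspace means the old slow path.
ps aux | grep [k]ube-proxy

# In iptables mode, the KUBE-SERVICES chain is hooked into PREROUTING and OUTPUT.
sudo iptables -t nat -L PREROUTING -n | grep KUBE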
Here is what your NAT table looks like on a healthy node serving traffic to a Service VIP:
$ sudo iptables -t nat -L KUBE-SERVICES -n
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-X7Q7P7... tcp -- 0.0.0.0/0 10.0.0.144 /* default/my-nginx:http cluster IP */ tcp dpt:80
KUBE-NODEPORTS all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports */
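To see where that traffic actually lands, follow a service chain down to its endpoints. The chain names below are placeholders (yours are hashes derived from the service name), so substitute the full names from your own KUBE-SERVICES output:
# Each KUBE-SVC-* chain load-balances across per-endpoint KUBE-SEP-* chains via
# the statistic module; each KUBE-SEP-* chain ends in a DNAT to a pod IP:port.
sudo iptables -t nat -L KUBE-SVC-XXXXXXXXXXXXXXXX -n   # hypothetical chain name
sudo iptables -t nat -L KUBE-SEP-XXXXXXXXXXXXXXXX -n   # one chain per endpoint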
If your KUBE-SERVICES chain holds thousands of rules, expect pain during updates: kube-proxy rewrites the rule set wholesale, and lock contention on large tables slows every sync. We have seen clusters with 5,000+ services take 2-3 minutes to propagate rule changes. For high-frequency trading or real-time bidding platforms in Oslo, that latency is unacceptable.
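A couple of one-liners give you a rough sense of how much rule churn kube-proxy is managing on a node (the thresholds above are from our own clusters, not hard limits):
# Count the NAT rules kube-proxy maintains on this node.
sudo iptables-save -t nat | grep -c '^-A KUBE'

# Time a full dump; if this alone takes seconds, rule propagation during
# service updates is going to hurt.
time sudo iptables-save -t nat > /dev/null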
CNI Wars: Flannel vs. Calico vs. Weave
The Container Network Interface (CNI) allows you to swap out networking providers. In 2016, you have three main contenders. Your choice dictates your latency and debugging capability.
| CNI Plugin | Mechanism | Pros | Cons |
|---|---|---|---|
| Flannel (VXLAN) | Encapsulation | Easy setup. Works everywhere. | Overhead. Difficult to debug packet headers. |
| Calico | BGP (Layer 3) | Native speeds. Security policies. | Complex. Requires underlying network support. |
| Weave Net | Mesh / Encap | Encryption out-of-the-box. | Can be slower than raw routing. |
We recently migrated a client from Flannel to Calico. They were hosting a high-traffic media site targeting the Nordic market. Their "Time to First Byte" (TTFB) dropped by 15ms simply by removing the VXLAN encapsulation overhead.
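The quickest way to see the difference is the routing table. With Calico (or Flannel's host-gw backend), per-node pod subnets appear as ordinary routes via the peer node's address; with VXLAN they vanish into a tunnel device. A sketch of what we look for, assuming the common 10.244.0.0/16 pod CIDR:
# Pure L3 (Calico / host-gw): pod subnets routed straight to node IPs, e.g.
#   10.244.1.0/24 via 10.0.10.12 dev eth0
# VXLAN (Flannel default): the same subnets point at the overlay device, e.g.
#   10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
ip route | grep 10.244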
Here is a snippet of a standard Flannel config. If your nodes share a Layer 2 segment (host-gw needs direct L2 adjacency between nodes), switch the backend to host-gw immediately for a free performance boost:
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan"   <-- CHANGE THIS TO "host-gw" IF POSSIBLE
  }
}
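In a typical 2016-era deployment Flannel reads this config from etcd. Assuming the default key and a flanneld systemd unit (both of which may differ in your environment), the change looks roughly like this:
# Assuming Flannel's default etcd prefix; adjust if you pass --etcd-prefix.
etcdctl set /coreos.com/network/config \
  '{ "Network": "10.244.0.0/16", "Backend": { "Type": "host-gw" } }'

# Restart flanneld on each node so it re-reads the backend type.
sudo systemctl restart flanneld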
The Hardware Reality: Why Virtualization Matters
You can optimize your iptables and tune your CNI all day, but if your underlying hypervisor is stealing CPU cycles, it doesn't matter. Kubernetes is noisy. It generates a lot of system calls.
Many "Cloud" providers use OpenVZ or heavily overcommitted Xen setups. When a neighbor on the host node gets hit with a DDoS, your network stack suffers because the host kernel is locking up. This is the "Steal Time" metric in top.
This is why we built CoolVDS on KVM with NVMe storage. KVM gives us strict hardware isolation. The Linux kernel inside your VM is your kernel. We don't mess with it. When you define network buffers, they are yours.
Tuning for Norwegian Latency
If your user base is in Norway, physics is your friend, but configuration is your enemy. Connecting to NIX (Norwegian Internet Exchange) in Oslo requires clean routing.
We see developers leave the MTU at the default 1500, even when running inside a VXLAN tunnel (which adds headers). This causes packet fragmentation. The result? Retransmits and jitter.
The Fix: Calculate your overhead. If you are using VXLAN, drop your interface MTU inside the container to 1450.
# Inside your Docker daemon config or CNI config
"mtu": 1450
Data Sovereignty and The "Post-Safe Harbor" World
It has been a year since the Safe Harbor agreement was invalidated, and while the "Privacy Shield" was adopted this July, skepticism remains high. The Datatilsynet (Norwegian Data Protection Authority) is watching closely.
For Norwegian businesses, the safest technical architecture is keeping data on Norwegian soil. Routing traffic through external load balancers in Frankfurt or London adds latency and legal ambiguity. By running your Kubernetes Ingress controllers directly on CoolVDS instances in Oslo, you ensure the TLS termination happens within the jurisdiction. Low latency. High compliance.
Summary
- Check your mode: Ensure kube-proxy is running in iptables mode (v1.2+).
- Choose CNI wisely: Use host-gw backends if your infrastructure allows it. Avoid encapsulation if you crave raw speed.
- Watch the MTU: Fragmentation kills TCP performance.
- Own the infrastructure: Shared kernels (containers-on-containers) are bad for reliable networking. Use KVM.
Kubernetes is the future, but it requires old-school Linux networking knowledge to run efficiently. Don't let default settings throttle your application.
Ready to test your cluster's true performance? Spin up a CoolVDS KVM instance with NVMe in Oslo. You bring the config; we bring the raw power.