Kubernetes v1.2 Networking Deep Dive: Packet Flow, Iptables, and Why Latency Kills Clusters

Surviving the Kubernetes Networking Maze: A Packet's Journey

Let’s be honest. You didn’t deploy Kubernetes because it was easy. You deployed it because managing 50 monolithic VMs manually was driving your sysadmin team to the brink. But now, instead of managing servers, you are debugging VXLAN headers and wondering why your service discovery timeouts are spiking.

With the release of Kubernetes 1.2 last month, we have seen massive improvements, most notably kube-proxy now defaulting to iptables mode. This is huge for performance, but it makes debugging a nightmare if you don't understand the underlying Linux networking stack.

I’ve spent the last week debugging a split-brain etcd cluster for a client in Stavanger. The culprit wasn't configuration; it was packet loss on their budget hosting provider. Here is the unvarnished truth about how K8s networking actually works, and how to keep it from collapsing.

The Fundamental Rule: Flat Networking

Kubernetes imposes a strict requirement: Every Pod must be able to communicate with every other Pod without NAT.

In the Docker world we came from, we were used to port mapping (-p 8080:80). Forget that. In K8s, the Pod gets its own IP. This creates a massive routing challenge. If Node A (10.10.0.5) has a Pod (10.244.1.2), and it tries to talk to a Pod (10.244.2.2) on Node B, the underlying network must know how to route that packet.
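A quick way to see that flat network with your own eyes: grab a Pod's IP and hit it directly from a Pod on another node. The Pod names below are made up for illustration, but the workflow is exactly this:

$ kubectl describe pod my-nginx-abc12 | grep ^IP
IP:             10.244.2.2

# reach it straight from another Pod: no NAT, no port mapping
$ kubectl exec busybox-test -- wget -qO- http://10.244.2.2:80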

The Overlay Solution: Flannel

Unless you have a sophisticated L3 datacenter setup with BGP, you are likely using an overlay network. Flannel (by CoreOS) is currently the standard for this. It uses etcd to store the network configuration and distributes subnet leases to each host.
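Each host ends up with its own /24 carved out of that /16. You can see the lease flanneld grabbed (and the MTU it calculated) in the small env file it drops on the host. The path and values here are the usual defaults; adjust for your install:

$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.1.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false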

When you run flanneld, it creates a virtual interface (flannel0 with the old UDP backend, flannel.1 when you use VXLAN). Here is what the config in etcd looks like for a typical setup:

{
  "Network": "10.244.0.0/16",
  "SubnetLen": 24,
  "Backend": {
    "Type": "vxlan",
    "VNI": 1
  }
}
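You seed that JSON into etcd before starting flanneld. With the etcd v2 API and flannel's default key prefix, it is one command (change the key if you start flanneld with --etcd-prefix):

$ etcdctl set /coreos.com/network/config \
  '{"Network": "10.244.0.0/16", "SubnetLen": 24, "Backend": {"Type": "vxlan", "VNI": 1}}'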

When a packet leaves your Pod, it hits the docker0 bridge (or cbr0, if kubelet set up the bridge for you), gets encapsulated in a VXLAN packet by the kernel, and is shipped over UDP port 8472 to the destination node. The destination kernel decapsulates it and passes it to the target Pod.
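Don't take my word for it; watch the encapsulation happen. Capture on the node's physical interface and filter on the VXLAN port (eth0 is an assumption about your NIC name, and 8472 is the Linux kernel / flannel default rather than the IANA-assigned 4789):

root@node-1:~# tcpdump -ni eth0 udp port 8472
# every frame captured here wraps a Pod-to-Pod packet:
# outer IP (node-1 -> node-2) / UDP 8472 / VXLAN, VNI 1 / inner IP (10.244.1.2 -> 10.244.2.2)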

Pro Tip: Check your MTU. The default Ethernet MTU is 1500, and VXLAN adds a 50-byte header. Flannel sets flannel.1 to 1450 for you, but if your containers (docker0 and the Pod’s eth0) are still sending 1500-byte frames, you will see random packet drops on large payloads (like MySQL syncs) while small pings work fine. It’s a silent killer.
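The fastest way to prove an MTU problem is a ping with the don't-fragment bit set. 1472 bytes of ICMP payload plus 28 bytes of headers is a full 1500-byte frame; if the small ping works and the big one doesn't, you have found your silent killer (the Pod IP is illustrative):

# small ping sails through
$ ping -c 3 10.244.2.2

# full-size frame, fragmentation forbidden (1472 + 28 bytes of headers = 1500)
$ ping -c 3 -M do -s 1472 10.244.2.2
# timeouts or "message too long" here, while the small ping is fine, means the overlay is eating big frames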

The Shift: Userspace vs. Iptables Proxy

In Kubernetes v1.0 and v1.1, kube-proxy ran in "userspace" mode. It was a literal proxy process. Packet -> Kernel -> Kube-Proxy (User Space) -> Kernel -> Destination. This context switching was expensive.
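Before you blame the network, check which mode a node is actually running; the --proxy-mode flag lets you pick explicitly. The binary path and master URL below are just examples from my setup:

# see what the running process was started with
root@node-1:~# ps aux | grep [k]ube-proxy
root       912  ...  /usr/local/bin/kube-proxy --master=https://10.10.0.1 --proxy-mode=iptables

#   --proxy-mode=userspace   the old v1.0/v1.1 behaviour
#   --proxy-mode=iptables    the new v1.2 default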

As of v1.2, the default is iptables mode. Now kube-proxy simply watches the API server and writes huge lists of iptables rules, and the traffic is handled entirely inside the Linux kernel. Rule matching is still linear in the number of Services, but with no userspace hop it is much faster. It also looks terrifying when you inspect it.

Here is what happens when you create a Service. Let's look at the NAT table on a CoolVDS node running a test cluster:

root@node-1:~# iptables -t nat -L KUBE-SERVICES -n
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
KUBE-SVC-X7Q3...  tcp  --  0.0.0.0/0            10.100.23.155        /* default/my-nginx: cluster IP */ tcp dpt:80

And if we follow that chain KUBE-SVC-X7Q3...:

Chain KUBE-SVC-X7Q3... (1 references)
target     prot opt source               destination         
KUBE-SEP-I5...  all  --  0.0.0.0/0            0.0.0.0/0            /* default/my-nginx: */ statistic mode random probability 0.50000000000
KUBE-SEP-O8...  all  --  0.0.0.0/0            0.0.0.0/0            /* default/my-nginx: */

See that statistic mode random? That is the load balancer. It’s pure kernel magic. No HAProxy required for internal traffic.
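Keep following the chain and you land on the actual DNAT to a Pod IP and port. The chain names are shortened and the Pod IP is from my test cluster, but the shape is always the same: a masquerade-mark rule for hairpin traffic, then the DNAT:

root@node-1:~# iptables -t nat -L KUBE-SEP-I5... -n
Chain KUBE-SEP-I5... (1 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  10.244.1.2           0.0.0.0/0            /* default/my-nginx: */
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            /* default/my-nginx: */ tcp to:10.244.1.2:80

That is the whole story of a ClusterIP: pick an endpoint at random, rewrite the destination, and let conntrack handle the return path.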

Why Infrastructure Matters (The