Kubernetes 1.5 Networking Deep Dive: Surviving the Overlay Chaos
It is 3:00 AM. Your pager is screaming because API latency just spiked to 5 seconds, yet the CPU is sitting idle. You check the logs. Nothing. You check the disk I/O. Flat. Then you run a simple ping between pods on different nodes and see it: 20% packet loss.
Welcome to the brutal reality of Kubernetes networking. While Google makes it sound like magic with Borg, for the rest of us running bare metal or VPS clusters in 2017, it is a minefield of encapsulation overhead, MTU mismatches, and `iptables` race conditions.
I have spent the last six months migrating a high-traffic payment gateway from a monolithic LAMP stack to Kubernetes v1.5. I have seen clusters implode not because the code was bad, but because the underlying network fabric couldn't handle the VXLAN chatter. Today, we are going to cut through the marketing fluff and look at what actually happens when a packet leaves a Pod.
The Fundamental Lie: "Flat Network"
Kubernetes imposes a strict requirement: Every Pod must be able to communicate with every other Pod without NAT.
On a local laptop with Minikube, this is easy. On a distributed cluster across multiple VPS instances, this requires an overlay network (unless you have direct control over the datacenter routers for BGP). If you are hosting on standard cloud instances, you are likely wrapping packets inside packets. This is where performance goes to die if you aren't careful.
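Assuming your controller manager is allocating per-node pod CIDRs (the usual setup with Flannel; `.spec.podCIDR` stays empty otherwise), you can see exactly how that "flat" network is carved into per-node subnets:
# List each node and the pod subnet it was assigned
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'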
The CNI Battleground: Flannel vs. Calico
Right now, in early 2017, the Container Network Interface (CNI) landscape is fragmented. We mostly see two contenders in production environments:
- Flannel: The "old reliable." It usually uses VXLAN. It's simple to set up but introduces encapsulation overhead. Every packet includes an extra header, reducing the effective MSS (Maximum Segment Size).
- Calico: The performance choice. It uses Layer 3 routing and BGP. No encapsulation if you are within the same L2 segment.
Pro Tip: If your VPS provider blocks BGP or unknown MAC addresses (which many do to prevent spoofing), Calico in "IP-in-IP" mode is your fallback. But be warned: IP-in-IP tunnels can behave unpredictably under nested virtualization. This is why we deploy on CoolVDS—their KVM instances expose the raw network interface cleanly, allowing Calico to peer correctly without the "noisy neighbor" packet drops common on OpenVZ containers.
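Whichever CNI you pick, verify what it actually negotiated before you blame it. With Flannel, the daemon writes its chosen subnet and MTU to a file on every node (the values below are illustrative):
# Flannel records its negotiated settings here on each node
cat /run/flannel/subnet.env
# FLANNEL_NETWORK=10.244.0.0/16
# FLANNEL_SUBNET=10.244.1.1/24
# FLANNEL_MTU=1450
# FLANNEL_IPMASQ=true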
The `iptables` Nightmare
In Kubernetes 1.5, `kube-proxy` defaults to `iptables` mode. This is a massive improvement over the old userspace mode, but it has a scaling limit. Every Service you create generates a set of iptables rules. When a packet hits your node, the kernel has to traverse these rules sequentially to find the destination.
Here is what it looks like on one of our production nodes handling just 50 services:
# iptables -t nat -S | grep KUBE-SERVICES
-N KUBE-SERVICES
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46P4PT2N71Z36
-A KUBE-SERVICES -d 10.103.22.14/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
... [Truncated 200 lines] ...
If you have 5,000 services, that list becomes massive. Every packet checks every rule until a match is found. We noticed that on budget VPS providers with "shared CPU" slices, the system time (kernel processing) spikes during high network throughput because the CPU is busy traversing these linked lists.
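A quick way to gauge the rule bloat on a node, and to watch it grow as you add Services:
# Rules generated by kube-proxy in the NAT table
iptables -t nat -S | grep -c KUBE
# Total NAT rules the kernel walks when classifying a new connection
iptables -t nat -S | wc -l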
Optimizing the Kernel for Overlays
You cannot just run the default sysctl settings. The overlay network relies on Linux bridges and efficient ARP resolution. If you are seeing unexplained inter-host latency, check your neighbor (ARP) table first.
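Before touching sysctl, confirm the symptom: count the neighbor entries the node is holding and look for overflow complaints from the kernel.
# How many IPv4 neighbor entries does this node track right now?
ip -4 neigh show | wc -l
# The kernel logs this when the table is thrashing
dmesg | grep -i "table overflow"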
Here is the exact `sysctl.conf` block we inject into our CoolVDS nodes via Cloud-Init/Ansible to prevent ARP table thrashing:
# Raise the ARP/neighbor cache limits; the kernel defaults (128/512/1024)
# are too small once an overlay fills the table with pod and VTEP entries
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
# Required so the node can forward traffic between pods and the outside world
net.ipv4.ip_forward = 1
# Critical for high-connection workloads: pin the conntrack ceiling and
# expire idle established connections after 24h instead of the 5-day default
net.netfilter.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
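After applying the block (`sysctl -p` or a reboot), keep an eye on conntrack usage; when the count approaches the ceiling, the kernel starts dropping new connections with "nf_conntrack: table full" errors.
# Current tracked connections vs. the configured maximum
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max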
The MTU Trap
This is the most common reason for "it works for small requests but hangs on large JSON payloads."
Standard Ethernet MTU is 1500 bytes. VXLAN adds a 50-byte header. If your inner container tries to send a 1500-byte packet, it gets encapsulated to 1550 bytes. The physical host interface (eth0) drops it because it exceeds the MTU. The result? TCP retransmissions and hung connections.
You must configure your CNI to use a smaller MTU (e.g., 1450) or—if your provider supports it—enable Jumbo Frames on the host.
{
  "name": "flannel-conf",
  "type": "flannel",
  "delegate": {
    "isDefaultGateway": true,
    "mtu": 1450
  }
}
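To prove the path really carries what you configured, ping between pods with the don't-fragment flag. With an inner MTU of 1450, the largest ICMP payload that fits is 1422 bytes (1450 minus 20 bytes of IP and 8 bytes of ICMP header); the pod IP below is just an example.
# Should succeed: 1422 bytes of payload + 28 bytes of headers = 1450
ping -M do -s 1422 -c 3 10.244.2.5
# Should fail fast with "message too long" instead of silently hanging
ping -M do -s 1423 -c 3 10.244.2.5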
We specifically benchmarked this on CoolVDS NVMe instances. Because the internal datacenter network in Oslo supports high throughput, we could push the packet rate significantly higher, but only after correcting the MTU. Stability is not an accident; it is configuration.
Local Context: Why Latency Matters in Norway
If you are building services for the Nordic market, physics is your enemy. Routing traffic through Frankfurt or London adds 20-30ms of round-trip time. In a microservices architecture where a single user request might spawn 10 internal RPC calls (Pod-to-Pod), that latency compounds.
Hosting in Norway, close to the NIX (Norwegian Internet Exchange), is mandatory for real-time applications. Furthermore, with the uncertainty surrounding the Privacy Shield framework and the looming GDPR (General Data Protection Regulation) set for 2018, keeping data within Norwegian borders satisfies the Datatilsynet requirements for data sovereignty. It’s not just about speed; it’s about compliance risk mitigation.
A Warning on Etcd Sensitivity
Kubernetes stores all cluster state in `etcd`, and etcd uses the Raft consensus algorithm. If the network latency between your controller nodes varies too much (jitter), Raft triggers a leader election. During an election, etcd cannot commit writes, so your cluster API is effectively read-only or down.
We saw this happen on a competitor's "cheap" VPS. Their disk I/O was shared, causing the etcd write-ahead log (WAL) fsync to stall. The network heartbeat timed out, and the cluster collapsed. Since moving to CoolVDS, where the NVMe storage guarantees consistent IOPS, our etcd cluster hasn't flinched.
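Two quick checks before you trust a node with etcd: raw fsync latency on the data disk, and etcd's own view of the cluster. `/var/lib/etcd` is the default data directory; adjust the path if yours differs.
# 1,000 synchronous 512-byte writes; divide elapsed time by 1,000 for average fsync latency
dd if=/dev/zero of=/var/lib/etcd/fsync-test bs=512 count=1000 oflag=dsync
rm /var/lib/etcd/fsync-test
# Ask etcd how its members are doing (etcd v2 API)
etcdctl cluster-health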
Validating Your Cluster Network
Before you deploy production workloads, run this sanity check. Use `iperf` between two pods on different nodes.
# On Server Pod
kubectl exec -it net-test-1 -- iperf -s
# On Client Pod (different node)
kubectl exec -it net-test-2 -- iperf -c 10.244.2.5
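The 10.244.2.5 above is just our server pod's IP; look up your own and confirm the two test pods actually landed on different nodes:
# Shows each pod's IP and the node it is scheduled on
kubectl get pods -o wide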
If you are getting less than 80% of the host's raw line speed, your overlay configuration is broken, or the hypervisor is stealing CPU cycles from you during encapsulation.
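To see whether stolen CPU is the culprit, watch the st column while the iperf test runs; anything consistently above a few percent means the hypervisor is handing your cycles to a neighbor.
# 'st' (steal) is the last column: CPU time taken back by the hypervisor
vmstat 1 10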
Conclusion
Kubernetes networking in 2017 is not "set and forget." It requires a deep understanding of Linux kernel networking, careful MTU calculations, and hardware that doesn't steal CPU cycles when you need them most.
Do not build a skyscraper on a swamp. Start with a solid foundation. If you need low-latency access to the Nordic market and KVM instances that respect your `iptables` rules, spin up a node on CoolVDS.
Next Step: Stop guessing. Run the `iperf` test on your current setup. If the results scare you, deploy a high-performance NVMe instance on CoolVDS and see the difference a clean network path makes.