Kubernetes Networking Deep Dive: Surviving the Overlay Chaos in Production
Let's be honest: Kubernetes networking is where the abstraction leaks like a sieve. You deploy a Service, everything looks green in the dashboard, but curl times out and you're suddenly staring at three thousand lines of iptables rules wondering where your packet went to die.
I’ve spent the last month debugging a microservices cluster for a fintech client in Oslo. The symptoms? Intermittent 502 errors and latency spikes that didn't show up in application APM tools. The culprit wasn't code; it was the network overlay choking on packet encapsulation because the underlying virtual machines were fighting for CPU cycles.
In this deep dive, we aren't looking at "Hello World." We are looking at how packets actually move in K8s v1.15+, why your CNI plugin choice matters more than you think, and why running this on subpar infrastructure is a death sentence for performance.
The Flat Network Lie
Kubernetes promises a flat network structure: every pod gets an IP, and every pod can talk to every other pod without NAT. It sounds elegant. Under the hood, it is a complex beast of routing tables, bridges, and encapsulation.
When you run kubectl get pods -o wide, you see IPs like 10.244.1.5. Those IPs don't exist on your physical router. They exist inside the virtual network space created by your CNI (Container Network Interface) plugin.
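You can see the split for yourself from any worker node: the pod subnet is routed through the CNI's virtual device, not your physical gateway. A minimal sketch, assuming Flannel's default 10.244.0.0/16 pod CIDR and its flannel.1 VXLAN device:
# Pod IPs come from the CNI's address space, invisible to your upstream router
kubectl get pods -o wide
# On the node: pod subnets are routed via the overlay device, not eth0
ip route | grep 10.244
# Output looks roughly like:
# 10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
# 10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink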
The CNI War: Flannel vs. Calico
In 2019, if you aren't using a managed cloud CNI, you are likely deciding between Flannel and Calico. This isn't just a preference; it's an architectural decision.
- Flannel (VXLAN): The default for many. It encapsulates Layer 2 frames inside UDP packets (Layer 4). It is simple but adds overhead. Every packet is wrapped and unwrapped. If your VPS has weak CPU performance or "noisy neighbors" stealing cycles, this encapsulation/decapsulation process (encap/decap) introduces jitter.
- Calico (BGP): Uses the Border Gateway Protocol to distribute routing information. No encapsulation (in pure Layer 3 mode). It is faster, but requires an underlying network that allows BGP peering or at least doesn't block the traffic.
Pro Tip: If you are running on CoolVDS, I recommend testing Calico with IPIP encapsulation disabled if your network permits it, or sticking to a high-performance VXLAN backend. Our KVM instances provide the raw CPU performance needed to handle encap/decap without the latency spikes seen on OpenVZ containers.
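If you do disable IPIP, the setting lives on Calico's IPPool resource. A minimal sketch, assuming Calico v3.x managed with calicoctl; the CIDR is a placeholder, so match it to your cluster's pod CIDR:
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16   # placeholder: use your cluster's pod CIDR
  ipipMode: Never        # pure Layer 3 routing via BGP, no encapsulation overhead
  natOutgoing: true
Apply it with calicoctl apply -f and confirm the mode actually changed with calicoctl get ippool -o wide.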
Service Discovery: IPVS is the New King
Until recently, kube-proxy used iptables to handle Service routing. When traffic hit a Service IP, the kernel ran through a list of rules to forward packets to a Pod. This works for 50 services. It fails hard at 5,000 services.
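You can gauge the scale of the problem on your own nodes by counting the chains kube-proxy programs; on large clusters this runs into tens of thousands of rules. A quick check, assuming kube-proxy is still in iptables mode:
# Count the Service/endpoint NAT rules kube-proxy has programmed
iptables-save -t nat | grep -c 'KUBE-SVC\|KUBE-SEP'
# Even dumping the ruleset gets slow once the count explodes
time iptables-save -t nat > /dev/null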
As of Kubernetes 1.11, IPVS (IP Virtual Server) mode for kube-proxy is generally available, and in late 2019 you should be using it. IPVS uses kernel hash tables instead of linear rule lists, so lookup time stays effectively constant, O(1), regardless of cluster size.
To enable IPVS mode in kube-proxy, you need to edit its configuration (on kubeadm clusters, the kube-proxy ConfigMap in kube-system). Here is how we enforce it on our clusters:
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  excludeCIDRs: null
  minSyncPeriod: 0s
  scheduler: "rr"  # Round Robin is usually fine; try 'lc' (Least Connection) for long-lived sockets
  strictARP: false
  syncPeriod: 30s
Before applying this, ensure your nodes have the kernel modules loaded:
# Check for IPVS modules
lsmod | grep ip_vs
# If missing, load them now; add them to /etc/modules-load.d/ to persist across reboots (path varies by distro)
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
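After restarting kube-proxy, verify the mode actually switched; if the modules are missing, kube-proxy quietly falls back to iptables. A quick check, assuming the default metrics port 10249 and ipvsadm installed on the node:
# kube-proxy reports its active mode on the metrics endpoint
curl -s localhost:10249/proxyMode
# expected output: ipvs
# List the virtual servers and real servers IPVS is balancing
ipvsadm -Ln | head -20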
Ingress: The Gatekeeper
Exposing services via NodePort or LoadBalancer is fine for testing, but in production you need an Ingress Controller. NGINX remains the battle-tested standard here. It terminates SSL and routes traffic based on Host headers.
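For reference, a host-based route with TLS termination looks roughly like this. A sketch using the networking.k8s.io/v1beta1 API available since 1.14; the hostname, namespace, and service name are placeholders:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: shop-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  tls:
  - hosts:
    - shop.example.no
    secretName: shop-tls           # created beforehand from your certificate
  rules:
  - host: shop.example.no
    http:
      paths:
      - path: /
        backend:
          serviceName: storefront   # placeholder Service name
          servicePort: 80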
However, a common mistake is neglecting the keep-alive settings and buffer sizes, which leads to dropped connections under load. Here is a production-ready snippet for the nginx-configuration ConfigMap that keeps throughput high for a Norwegian e-commerce site we host:
kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
data:
  keep-alive: "75"
  keep-alive-requests: "1000"
  upstream-keepalive-connections: "64"
  worker-processes: "auto"
  # Crucial for maximizing I/O on CoolVDS NVMe instances
  log-format-upstream: '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_length $request_time [$proxy_upstream_name] $upstream_addr $upstream_response_length $upstream_response_time $upstream_status $req_id'
Troubleshooting: When It Breaks
When a pod can't reach another pod, don't guess. Use nsenter to step into the pod's network namespace directly from the node. This bypasses the container runtime abstraction and lets you use the node's tools.
- Find the Process ID (PID) of the container:
docker inspect --format '{{.State.Pid}}' <container-id>
- Enter the namespace:
nsenter -t <PID> -n ip addr show
If you see the interface but no traffic flows, check the MTU (Maximum Transmission Unit). A common issue with overlay networks is that the VXLAN header adds 50 bytes. If your physical interface is 1500 MTU and your CNI tries to push 1500 MTU packets inside the tunnel, they get dropped or fragmented. Set your CNI MTU to 1450 to be safe.
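Verifying the MTU takes thirty seconds with ping's don't-fragment flag. A quick probe, assuming Flannel's default device name; 1422 bytes of payload plus 28 bytes of ICMP/IP headers equals 1450 on the wire:
# Check what MTU the overlay device was actually configured with
ip link show flannel.1 | grep mtu
# From inside the pod's namespace (reuse the PID from nsenter above):
# probe with the don't-fragment bit set; the target is a placeholder peer pod IP
nsenter -t <PID> -n ping -M do -s 1422 -c 3 10.244.2.7
If the 1422-byte probe succeeds but a 1472-byte one does not, the 50-byte VXLAN overhead is eating your headroom exactly as described.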
The Physical Layer: Why "Where" Matters
You can optimize iptables and tune NGINX all day, but you cannot configure your way out of physics. Network latency is a killer for microservices. If Service A calls Service B, which calls Service C, a 20ms latency between nodes compounds rapidly.
This is particularly relevant for Norwegian businesses targeting local customers. Routing traffic through Frankfurt or Amsterdam adds unnecessary milliseconds on every hop. You need data residency within Norway, not just for GDPR compliance, but because the speed of light puts a hard floor on round-trip times.
Latency and Etcd
Kubernetes stores its state in etcd. Etcd uses the Raft consensus algorithm, which is extremely sensitive to disk write latency (fsync). If your disk is slow, the leader election times out, and your cluster goes down. I have seen entire clusters fail because they were running on standard SATA SSDs or, heaven forbid, spinning rust.
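Before blaming Kubernetes, benchmark the fsync path the way the etcd maintainers suggest: the 99th percentile fdatasync latency should stay under roughly 10ms. A sketch using fio (install it first, and point --directory at the disk that backs /var/lib/etcd):
# Approximate etcd's WAL pattern: small sequential writes with fdatasync after each
fio --name=etcd-fsync-test --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-test --size=22m --bs=2300
# Read the fsync/fdatasync percentiles in the output; a p99 above ~10ms spells trouble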
This is why we standardized on NVMe storage at CoolVDS. The I/O wait is negligible.
| Feature | Standard VPS | CoolVDS NVMe |
|---|---|---|
| Storage Latency | 2-10ms | <0.5ms |
| Network Drivers | VirtIO (often unoptimized) | VirtIO-Net (Tuned) |
| Virtualization | Container/OpenVZ | KVM (Kernel Isolation) |
Conclusion: Own Your Traffic
Kubernetes networking in 2019 is powerful, but it assumes you have the underlying hardware to support it. Don't layer complex overlays on top of congested, oversold infrastructure.
Whether you are adhering to strict data privacy regulations or just want your API to respond in under 50ms, the foundation is everything. Stop fighting the "noisy neighbors" on cheap shared hosting.
Ready to build a cluster that actually performs? Deploy a high-performance KVM instance in Oslo with CoolVDS today and see the difference NVMe makes to your etcd convergence times.