Kubernetes Networking Deep Dive: Surviving the Packet Jungle in Production
Let’s be honest: Kubernetes (K8s) makes deploying applications feel like magic, but debugging its networking layer can feel like an exorcism. I’ve spent the last three nights troubleshooting a microservices mesh that was dropping 1% of packets for no apparent reason. The culprit? A conntrack race condition in the kernel.
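If you suspect the same race is eating your packets, the insert_failed counter is the telltale sign. A quick check on the node (this assumes the conntrack CLI from conntrack-tools is installed):
# A steadily climbing insert_failed count points at the conntrack race
# that drops packets under parallel UDP (usually DNS) connections
conntrack -S | grep -o 'insert_failed=[0-9]*'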
If you are running K8s in production in 2019, you know that "it works on Minikube" means absolutely nothing. When you scale past 50 nodes, the abstraction leaks. Your overlay network adds latency. iptables rules pile up into the thousands. Latency spikes in Oslo affect your database consistency in Bergen.
In this post, we are going deep. We are going to look at CNI choices, why you should probably switch kube-proxy to IPVS mode, and why your hosting provider's hardware (specifically NVMe and network stability) is the silent killer of Kubernetes clusters.
The CNI Battlefield: Calico vs. Flannel
The first decision you make—often blindly—is your Container Network Interface (CNI). It determines how pods talk to each other. In the Nordic market, where data sovereignty and latency are critical, picking the default can be a mistake.
Flannel is simple. It creates a VXLAN overlay. It encapsulates packets in UDP. It’s fine for testing. But in high-throughput environments, that encapsulation overhead eats CPU cycles. I’ve seen softirq spike to 100% on cheap VPS instances just trying to unwrap VXLAN packets.
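If you want to confirm that diagnosis on your own nodes, a minimal check (assuming sysstat is installed) is to watch per-CPU softirq time while traffic flows:
# %soft pinned near 100% on a single core while throughput flatlines
# is the classic single-queue VXLAN decapsulation bottleneck
mpstat -P ALL 2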
Calico, on the other hand, gives you options. You can run it in VXLAN mode, but the real power is BGP (Border Gateway Protocol) mode. No encapsulation. Just pure IP routing. Your pods become first-class citizens on the network.
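Once Calico is peering over BGP, you can sanity-check the mesh from any node (this assumes calicoctl is installed on the host):
# Shows each BGP peer and whether the session is Established
sudo calicoctl node status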
Configuring Calico for Performance
If you are running on a provider that supports L2 adjacency (like CoolVDS private networking), disable IPIP encapsulation for raw speed.
# calico.yaml snippet (env vars on the calico-node container)
- name: CALICO_IPV4POOL_IPIP
  value: "Off"   # Disable encapsulation if on a flat L2 network
- name: FELIX_IPINIPMTU
  value: "1440"
Pro Tip: If you are crossing availability zones or data centers (e.g., Oslo to Trondheim), keep IPIP on specifically for cross-subnet traffic using CrossSubnet mode. It gives you the best of both worlds: native performance locally, and encapsulated routing across boundaries.
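In practice that is a one-word change to the same env var shown above:
# calico.yaml snippet: encapsulate only when traffic crosses a subnet boundary
- name: CALICO_IPV4POOL_IPIP
  value: "CrossSubnet"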
The Bottleneck: iptables vs. IPVS
This is where most clusters choke. By default, Kubernetes uses iptables to implement Services. When a packet hits a Service IP, the kernel runs through a list of rules to DNAT that packet to a Pod.
Here is the math problem: iptables is a sequential list. It wasn't designed for dynamic updates. If you have 5,000 services, the kernel has to traverse a linked list of 25,000+ rules for every packet. O(n) complexity kills.
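You can see how bad it is on your own nodes. A rough sketch, but the numbers speak for themselves:
# Count the NAT rules kube-proxy has programmed on this node
iptables-save -t nat | grep -c '^-A KUBE'
# On large clusters even listing the table takes noticeable time
time iptables -t nat -L -n > /dev/null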
Since Kubernetes 1.11, IPVS (IP Virtual Server) has been Generally Available. It uses hash tables. O(1) complexity. It doesn't care if you have 10 services or 10,000.
Enabling IPVS mode in kube-proxy
Don't wait for your cluster to hang. Switch to IPVS now. You need to ensure the ip_vs kernel modules are loaded on your nodes first.
# Load required modules on the host
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack_ipv4   # required by kube-proxy in IPVS mode (nf_conntrack on 4.19+ kernels)
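modprobe does not survive a reboot. One way to make it stick on a systemd-based distro is a modules-load.d drop-in:
# /etc/modules-load.d/ipvs.conf -- one module name per line
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack_ipv4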
Then, update your kube-proxy ConfigMap:
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # Round Robin is usually sufficient
  strictARP: true
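After applying the ConfigMap, kube-proxy needs a restart to pick up the new mode. A rough sequence, assuming a kubeadm-style cluster where kube-proxy runs as a DaemonSet labelled k8s-app=kube-proxy:
# Restart kube-proxy so it reprograms the node with IPVS
kubectl -n kube-system delete pods -l k8s-app=kube-proxy
# On a node: every Service IP should now show up as an IPVS virtual server
ipvsadm -Ln | head -n 20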
I recently migrated a client's cluster handling financial data in Oslo from iptables to IPVS. The median service response time dropped from 12ms to 3ms. In the world of high-frequency trading or real-time bidding, those nine milliseconds are an eternity.
Ingress: The Gatekeeper
Exposing services to the internet requires an Ingress Controller. In 2019, the NGINX Ingress Controller is still the undisputed king, though Traefik is gaining ground.
A common mistake is leaving the default NGINX configuration. The defaults are too polite for the open internet. You need to tune the buffers and timeouts to prevent slowloris attacks and handle large payloads.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  proxy-body-size: "20m"
  proxy-connect-timeout: "10"
  proxy-read-timeout: "120"
  proxy-send-timeout: "120"
  worker-processes: "auto"
  # Essential for keeping connections alive to upstream pods
  upstream-keepalive-connections: "32"
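To confirm the settings actually landed, you can grep the rendered config inside the controller pod (the pod name here is a placeholder):
# client_max_body_size maps to proxy-body-size, keepalive to upstream-keepalive-connections
kubectl -n ingress-nginx exec <ingress-controller-pod> -- \
  grep -E 'client_max_body_size|keepalive' /etc/nginx/nginx.conf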
The Hardware Reality: Why Latency Matters
You can tune software all day, but you cannot tune physics. The Kubernetes control plane relies heavily on etcd, and etcd is sensitive to disk write latency. If fsync takes too long, heartbeats get missed, leader elections churn, and the API server starts timing out. It's a disaster.
This is why running K8s on spinning rust (HDD) or shared, oversold SSDs is suicide. You need NVMe.
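Don't take disk specs on faith; measure fsync latency yourself. The sketch below uses fio's fdatasync mode against the etcd data directory (path and sizes are illustrative; etcd wants the 99th percentile comfortably under ~10ms):
# Sequential 2300-byte writes with an fdatasync after each one,
# roughly mimicking etcd's WAL write pattern
fio --name=etcd-fsync-test --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd --size=22m --bs=2300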
Furthermore, network jitter kills overlay networks. If your VPS provider oversubscribes their uplinks, your UDP packets (VXLAN) get dropped first during congestion. This leads to TCP retransmits inside the tunnel, causing massive latency spikes.
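A blunt but effective way to measure this is a UDP iperf3 run between two nodes; the jitter and loss columns in the report tell you what your overlay is living on (the node address is a placeholder, and iperf3 must be installed on both ends):
# On node B first: iperf3 -s
# Then from node A: push UDP at 1 Gbit/s for 30 seconds and read jitter/loss
iperf3 -c <node-b-ip> -u -b 1G -t 30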
The CoolVDS Factor
We built CoolVDS because we were tired of seeing "Managed Kubernetes" services that hide the messy details but throttle your I/O. For a robust cluster, we recommend a 3-node control plane using our NVMe VPS instances.
- Disk I/O: Our local NVMe storage provides the low latency etcd demands (sub-1ms write latency).
- Network: We peer directly at NIX (Norwegian Internet Exchange). If your users are in Norway, their traffic stays in Norway. This isn't just about speed; it's about GDPR compliance and data sovereignty (Schrems II implications are looming, keep your data close).
- Isolation: KVM virtualization ensures no noisy neighbors steal your CPU cycles during encryption tasks.
Debugging Network Latency
Before you blame the code, blame the network. Run a pod with netshoot or just simple tools to verify connectivity.
# Quick and dirty latency check between pods
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
# Inside the pod
ping <other-pod-ip>
curl -v http://<service-name>.<namespace>.svc.cluster.local:<port>
For a deeper look, use tcpdump to see if packets are leaving the pod veth pair but not arriving at the destination node.
tcpdump -i eth0 -nn -vvv host 10.244.1.5
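Finding the right interface on the destination node is half the battle. With Calico in BGP mode, each pod gets a /32 route via its own cali* interface, so the host routing table tells you where to capture (the pod IP is just an example):
# Which host interface owns this pod IP?
ip route get 10.244.1.5
# Capture on that interface; no traffic here means the packet died before the veth
tcpdump -i <cali-interface-from-above> -nn icmp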
Conclusion
Kubernetes networking is complex, but it's manageable if you peel back the layers. Stop using defaults. Evaluate your CNI, switch to IPVS, and ensure your underlying infrastructure isn't sabotaging you.
If you need a reliable foundation for your next cluster, don't gamble with latency. Deploy a CoolVDS NVMe instance and see what consistent I/O performance does for your etcd stability.