Kubernetes Networking Deep Dive: IPVS, CNI, and Surviving MTU Hell
Let’s be honest for a second. Everyone loves drawing boxes on a whiteboard and connecting "Service A" to "Service B" with neat lines. It looks clean. It looks architectural. But when you actually deploy Kubernetes (k8s) v1.16 into production, those lines aren't abstractions anymore; they are packets traversing a hostile landscape of encapsulation, NAT tables, and context switches.
I have spent the last six months migrating a high-traffic fintech platform from a monolithic legacy setup to a microservices architecture on Kubernetes. The application logic was fine. The database schemas were solid. But the network? The network tried to kill us. If you think kubectl apply -f is the end of your job, you are going to wake up at 3 AM to a PagerDuty alert screaming about 502 Bad Gateways, only to find out it’s a conntrack exhaustion issue that no amount of RAM can fix.
This post is not a "Getting Started" guide. This is a look under the hood at how Kubernetes networking actually works in late 2019, why iptables is becoming a bottleneck, and how to configure your cluster on high-performance infrastructure like CoolVDS to avoid the latency penalties that plague generic cloud providers.
The Lie of the "Flat Network"
Kubernetes mandates a flat network structure: every pod must be able to talk to every other pod without NAT. This is a beautiful lie. To achieve this on top of existing infrastructure (especially if you aren't running raw BGP with top-of-rack switches), we rely on CNI (Container Network Interface) plugins to build an overlay network.
In 2019, you are likely choosing between Flannel (VXLAN) and Calico (IPIP or BGP). If you pick Flannel in its default VXLAN mode, every pod-to-pod packet gets wrapped in a UDP datagram, which adds CPU overhead for every single byte sent. On a generic VPS with stolen CPU cycles, this encapsulation/decapsulation (encap/decap) work introduces jitter. I've seen database latency spike from 2ms to 15ms just because a neighbor on the host node decided to mine crypto, starving the hypervisor of the cycles needed to process VXLAN headers.
This is why we strictly use KVM-based virtualization at CoolVDS. Containers are great for deployment, but for isolation, you need a hypervisor that respects your CPU time. When you are pushing gigabits of traffic through an overlay network, "CPU Steal" is your enemy.
Moving Beyond iptables: The IPVS Era
For years, Kubernetes used iptables to handle Service routing (kube-proxy). When a request hits a Service IP, iptables rules rewrite the destination to a specific Pod IP. This works fine for 50 services. It works okay for 500.
But recently, in a load test involving 5,000 services, we saw the kernel struggling. iptables is a linear list. To find the rule for a specific packet, the kernel has to traverse the list. O(n) complexity in networking is a death sentence.
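You can see the scale of the problem on any busy worker node: kube-proxy turns every Service and endpoint into NAT rules, and the whole set gets rewritten on churn. A quick, harmless way to size it up (the grep pattern matches the standard KUBE- chains kube-proxy programs):
# Count the NAT rules kube-proxy has programmed on this node
iptables-save -t nat | grep -c '^-A KUBE-'
# Time a full dump; a rough proxy for how painful rule updates get at scale
time iptables-save -t nat > /dev/null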
The solution, GA since Kubernetes 1.11 and thoroughly battle-tested by 1.16, is IPVS (IP Virtual Server) mode in kube-proxy. IPVS is built on top of netfilter but uses hash tables for lookups, giving you O(1) complexity. It doesn't matter whether you have 10 services or 10,000; the routing lookup time stays constant.
Configuring kube-proxy for IPVS
If you are still running iptables mode in 2019, stop. Here is how you force IPVS mode in your cluster configuration. You need to ensure the IPVS kernel modules are loaded on your worker nodes first.
# Load required modules on the host (CoolVDS instances support this out of the box)
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack_ipv4
# Note: on kernels 4.19 and newer this module is merged into nf_conntrack
# Check if they are loaded
lsmod | grep -e ip_vs -e nf_conntrack_ipv4
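These modprobe calls do not survive a reboot. A small sketch to make them persistent via systemd-modules-load (the filename is just a convention):
# Persist the modules across reboots (picked up by systemd-modules-load at boot)
cat <<EOF > /etc/modules-load.d/ipvs.conf
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack_ipv4
EOF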
Once the host is ready, you update your kube-proxy ConfigMap. If you are using kubeadm, you can edit the config directly:
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  excludeCIDRs: null
  minSyncPeriod: 0s
  scheduler: "rr"   # Round Robin is usually sufficient; 'lc' (Least Connection) is better for long-lived connections
  strictARP: false
  syncPeriod: 30s
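To make the change bite, bounce the kube-proxy pods and confirm the new mode took effect. A quick sketch, assuming the standard kubeadm label (k8s-app=kube-proxy) and the default metrics port:
# Recreate the kube-proxy pods so they pick up the new ConfigMap
kubectl -n kube-system delete pods -l k8s-app=kube-proxy
# On a node: kube-proxy reports its active mode on the metrics port
curl -s http://localhost:10249/proxyMode   # should print: ipvs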
With kube-proxy back up, you can inspect the live IPVS table on the node using ipvsadm. If you don't have the tool, install it; it is the only effective way to see what the kernel is actually doing with your Service traffic.
# ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.96.0.1:443 rr
-> 192.168.1.10:6443 Masq 1 2 0
TCP 10.96.0.10:53 rr
-> 10.244.1.2:53 Masq 1 0 0
-> 10.244.2.3:53 Masq 1 0 0
If you see this output, congratulations. You have just future-proofed your cluster's scalability.
The MTU Trap: Why Packets Get Dropped
Here is a war story from last month. We deployed a cluster for a client in Oslo. Internal pod-to-pod communication was fast. But when a pod tried to connect to an external legacy API via a VPN tunnel, connections would hang or time out. curl would work for small JSON payloads but fail for large ones.
This screams MTU (Maximum Transmission Unit) mismatch.
The standard Ethernet MTU is 1500 bytes. However, your overlay network (like Flannel/VXLAN) needs space for its own headers (usually 50 bytes). So the Pod interface should have an MTU of 1450. If your Pod tries to send a 1500-byte packet, it gets fragmented or dropped if the DF (Don't Fragment) bit is set.
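A quick way to find the ceiling is a ping with the DF bit set: 1472 bytes of ICMP payload plus 28 bytes of headers is a full 1500-byte packet (the hostname below is a placeholder for your external endpoint):
# From inside a pod (or from the node): 1472 + 28 bytes of ICMP/IP headers = 1500
ping -c 3 -M do -s 1472 legacy-api.example.com   # fails if the path MTU is below 1500
ping -c 3 -M do -s 1422 legacy-api.example.com   # 1450-byte packets; should survive a VXLAN overlay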
Pro Tip: Never assume the default CNI config detects the host MTU correctly. If you are running on a specialized network or a VPS with QinQ (802.1ad), you must manually clamp the MSS.
To fix this in Calico, set the tunnel MTU explicitly in the Felix configuration rather than relying on auto-detection, especially in hybrid cloud environments. (Field names below match the standard calico.yaml manifest install; the per-pod veth MTU is controlled separately via the veth_mtu key in the calico-config ConfigMap.)
kubectl patch felixconfiguration default --type='merge' -p '{"spec":{"ipipEnabled":true,"ipipMTU":1440}}'
Setting the MTU slightly lower (e.g., 1440) provides a safety buffer for additional encapsulation headers that might occur upstream.
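If there is a tunnel in the path you don't control (the VPN in our case), MSS clamping on the nodes is a pragmatic belt-and-braces measure. A minimal sketch, run on each worker node; chain placement and interface matching depend on your topology:
# Clamp TCP MSS to the discovered path MTU for forwarded traffic
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu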
Ingress Performance: NGINX Tuning
The standard NGINX Ingress Controller is the workhorse of Kubernetes. But out of the box, it is configured for compatibility, not speed. In a recent deployment handling high-frequency trading data, we noticed NGINX was choking on SSL handshakes.
You must tune the nginx-configuration ConfigMap. Specifically, you need to optimize the worker processes and keepalive connections. On CoolVDS NVMe instances, where I/O is not a bottleneck, you can push these numbers higher than on standard SATA VPS hosting.
data:
  worker-processes: "auto"
  max-worker-connections: "65536"
  keep-alive: "60"
  upstream-keepalive-connections: "100"
  ssl-ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384"
  ssl-protocols: "TLSv1.2 TLSv1.3" # Yes, enable 1.3 if your clients support it in 2019!
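You can patch the ConfigMap in place and watch the controller reload on the fly. A sketch, assuming the namespace, ConfigMap name, and labels from the stock nginx-ingress manifests:
kubectl -n ingress-nginx patch configmap nginx-configuration --type='merge' \
  -p '{"data":{"worker-processes":"auto","max-worker-connections":"65536","keep-alive":"60","upstream-keepalive-connections":"100"}}'
# The controller reloads NGINX without dropping connections; tail the logs to confirm it picked up the change
kubectl -n ingress-nginx logs -l app.kubernetes.io/name=ingress-nginx --tail=20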
Furthermore, if you are serving traffic to Norway and Northern Europe, latency is your primary metric. Terminating TLS at the ingress controller requires CPU. If your "vCPU" is a thread on an overcommitted host, your handshake times will fluctuate. We ensure dedicated resource availability, meaning your NGINX process gets the cycles it needs immediately.
Storage IOPS and etcd Latency
You cannot talk about Kubernetes networking without talking about etcd. Every networking change, every pod IP assignment, is written to etcd. If etcd is slow, your network updates are slow.
etcd is extremely sensitive to disk write latency: every transaction is fsynced to the write-ahead log before it is acknowledged. The upstream guidance is to keep the 99th percentile of WAL fsync under 10ms. If you run your control plane on cheap storage, you get heartbeat timeouts and leader elections, the control plane starts treating healthy nodes as dead, and you trigger a storm of pod evictions and network reprogramming.
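Measure before you trust a disk with etcd. The fio recipe below approximates etcd's small, fdatasync-heavy WAL writes (a sketch along the lines of the upstream etcd hardware guidance; the directory is a placeholder and must sit on the disk you intend to use):
# Approximate etcd WAL behaviour: small sequential writes, fdatasync after each
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd-bench \
    --size=22m --bs=2300 --name=etcd-wal-test
# Check the fsync/fdatasync latency percentiles in the output; the 99th should be well under 10ms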
This is where hardware choice becomes an architectural decision. Using local NVMe storage (which we standardize on at CoolVDS) essentially eliminates etcd IO wait times. In our benchmarks, a 3-node etcd cluster on NVMe sustains 5x the write throughput compared to standard SSDs over a network block device.
Why Location Matters: The Norwegian Context
Finally, physics still applies. If your users are in Oslo, Bergen, or Trondheim, hosting your Kubernetes cluster in Frankfurt or London adds 20-30ms of round-trip time (RTT). For a single request, that's negligible; for a microservices architecture where one frontend request fans out into 10 backend calls, that latency compounds.
By keeping the cluster infrastructure in Norway, you slash that latency. Additionally, with the Datatilsynet (Norwegian Data Protection Authority) strictly enforcing GDPR, keeping data resident within Norwegian borders simplifies your compliance posture significantly.
Conclusion
Kubernetes networking is brittle if you treat it as a black box. By switching to IPVS, manually tuning your MTU/MSS settings, and ensuring your underlying hardware (CPU and Disk) can keep up with the encapsulation overhead, you can build a platform that is actually stable.
Do not let your infrastructure be the bottleneck. If you need a sandbox to test ipvsadm configurations or verify MTU settings without the noise of shared hosting, spin up a high-performance instance today.
Ready to optimize your packet path? Deploy a CoolVDS NVMe instance in Oslo and see the difference raw performance makes.