Kubernetes Networking Deep Dive: Escaping the Overlay Tax
I recently spent 48 hours debugging a Kafka cluster on Kubernetes that was dropping packets faster than I drop poorly written pull requests. The culprit wasn't the JVM, and it wasn't the disk. It was the network overlay. When you stack a virtual network (the CNI overlay) on top of a virtual machine (your VPS), which itself sits on a physical network, you pay a "packet tax" on every single byte.
If you are running Kubernetes in production in 2019, you cannot treat the network as a black box. Understanding the interaction between your CNI (Container Network Interface) and the underlying Linux kernel is what separates a stable cluster from one that times out every time traffic spikes.
The CNI Battlefield: Flannel vs. Calico
When you initialize a cluster with kubeadm init, you have to pick a CNI. For years, Flannel was the default choice because it's simple. It creates a VXLAN overlay, wrapping Layer 2 Ethernet frames inside UDP packets so they can cross the Layer 3 network underneath. It works everywhere, but it eats CPU cycles for breakfast due to the encapsulation/decapsulation overhead.
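You can see that tax directly on a Flannel node by inspecting the VXLAN device it creates (flannel.1 is the default interface name for the VXLAN backend; adjust if your deployment renames it):

# Show the VXLAN device details on a worker node running Flannel
ip -d link show flannel.1
# Note the MTU: typically 1450, i.e. 1500 minus roughly 50 bytes of
# VXLAN + UDP + IP + Ethernet headers added to every packet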
For high-performance workloads, specifically here in the Nordic region where latency to the NIX (Norwegian Internet Exchange) matters, I almost exclusively recommend Calico. Calico can run in pure Layer 3 mode using BGP (Border Gateway Protocol), routing packets without the heavy encapsulation overhead, provided your underlying network supports it.
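If you go the BGP route, verify that peering is actually established rather than silently falling back to an overlay. Assuming calicoctl is installed on the node and pointed at your datastore, a quick status check looks like this:

# Run on the node itself; needs root to read the local BIRD state
sudo calicoctl node status
# A healthy mesh shows every peer with "Established" in the State column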
Here is how we typically deploy Calico 3.7 (the current stable release) on a fresh cluster:
kubectl apply -f https://docs.projectcalico.org/v3.7/manifests/calico.yaml
However, simply applying the YAML isn't enough. You need to verify that the IP pools are configured to match your Pod CIDR. If you are running on a cloud provider or a VPS where you don't control the physical routers, Calico defaults to IPIP (IP-in-IP) mode. This is still an overlay, but often lighter than VXLAN.
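You can inspect the pool with calicoctl to see exactly what you got. The pool name below (default-ipv4-ippool) is what the stock 3.7 manifest creates; yours may differ:

calicoctl get ippool default-ipv4-ippool -o yaml
# spec.ipipMode tells the story: "Always" encapsulates everything,
# "CrossSubnet" only encapsulates traffic leaving the local subnet,
# "Never" means pure BGP routing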
Pro Tip: If you are seeing high latency, check the MTU (Maximum Transmission Unit). The default Ethernet MTU is 1500. If you wrap a packet (overlay), the inner packet must be smaller. If your CNI tries to push 1500 bytes through a tunnel that adds headers, you trigger packet fragmentation. That kills performance.
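A quick way to test this is a ping with the don't-fragment bit set and a payload sized to fill the full 1500 bytes. The target address below is a placeholder; use the IP of a pod on another node:

# 1472 bytes of payload + 8 bytes ICMP + 20 bytes IP = 1500 on the wire
ping -M do -s 1472 10.244.1.15
# If this fails but a smaller payload (say -s 1422) gets through,
# something in the path is eating your MTU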
Tuning the MTU for Calico
Inside your calico-config ConfigMap, make sure the MTU accounts for the encapsulation header. IPIP adds a 20-byte header, so against a standard 1500-byte Ethernet MTU we usually set this to 1480.
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  veth_mtu: "1480"
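Note that veth_mtu is only read when calico-node starts, and existing pod interfaces keep whatever MTU they were created with. After changing the ConfigMap, bounce the DaemonSet pods (the k8s-app=calico-node label matches the stock manifest):

kubectl -n kube-system delete pod -l k8s-app=calico-node
# The DaemonSet recreates them; veths created from now on get the 1480 MTU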
Kernel Tuning: The Forgotten Optimization
Kubernetes relies heavily on iptables (or IPVS if you are living on the edge with K8s 1.11+). When you have thousands of Services and Pods, the iptables ruleset grows massive, and every packet that hits a Service has to walk those chains. That lookup is O(n) in the number of rules. It hurts.
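Two quick sanity checks are worth running before you blame anything else: how big the ruleset actually is on a node, and which mode kube-proxy is in. The kube-proxy ConfigMap name below matches a kubeadm install; other distributions may store the config elsewhere:

# Count the iptables rules on a worker node; a few thousand is normal,
# tens of thousands is when the O(n) walk starts to hurt
sudo iptables-save | wc -l

# Check the proxy mode on a kubeadm cluster; an empty mode means iptables
kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"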
Regardless of your CNI, you need to tune the underlying Linux kernel on your nodes. The stock settings on Ubuntu 18.04 or CentOS 7 are conservative, general-purpose defaults, not values chosen for high-throughput packet forwarding.
Here is the sysctl.conf baseline I apply to every Worker Node before it joins the cluster:
# /etc/sysctl.d/k8s-net.conf
# Increase the connection tracking table.
# If this fills up, packets get dropped silently.
net.netfilter.nf_conntrack_max = 131072
# Reduce the time we hold onto closed connections
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
# Allow IP forwarding (Required for K8s)
net.ipv4.ip_forward = 1
# Maximize the backlog for incoming packets
net.core.netdev_max_backlog = 5000
Apply these with sysctl -p /etc/sysctl.d/k8s-net.conf. If you skip this, your Ingress controller will choke under load, and you will blame the software when it is actually the OS limits.
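To know whether you are actually anywhere near the conntrack ceiling, compare the live count against the limit, and keep an eye on the kernel log for drops:

# Current tracked connections vs. the configured maximum
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# "nf_conntrack: table full, dropping packet" in dmesg means you were too late
dmesg | grep conntrack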
Debugging in the Trenches
When a service isn't reachable, "it's DNS" is usually the answer. But when it's not DNS, you need to see the traffic. The challenge in 2019 is that many container images are stripped down (Alpine, Distroless) and don't have tools.
I recommend keeping a "netshoot"-style pod manifest handy: a throwaway pod with a full network toolbox that you can schedule onto the same node as the troubled pod to inspect traffic from inside the cluster network.
apiVersion: v1
kind: Pod
metadata:
  name: net-debug
  namespace: default
spec:
  containers:
  - name: net-debug
    image: nicolaka/netshoot
    command: ["/bin/bash"]
    stdin: true
    tty: true
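Apply it and drop into a shell. The filename is just whatever you saved the manifest as; attach works because the spec requests stdin and a TTY:

kubectl apply -f net-debug.yaml
kubectl attach -it net-debug
# or equivalently: kubectl exec -it net-debug -- bash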
Once inside, use tcpdump to verify if the SYN packets are actually arriving:
tcpdump -i eth0 port 80 -n -vv
The Hardware Reality: Why CoolVDS Matters
You can tune sysctl all day, but software-defined networking (SDN) is CPU intensive. Every time a packet is encapsulated, routed, and decapsulated, the CPU has to do work. In a shared hosting environment or on a cheap VPS, you are often fighting for that CPU time with "noisy neighbors."
If your CPU "steal" time (%st in top) is high, your network latency will jitter. This is why for production Kubernetes, we only use CoolVDS instances. They utilize KVM virtualization which provides strict resource isolation. Unlike container-based VPS solutions (like OpenVZ) where the kernel is shared, KVM gives your node its own kernel to tune.
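Measuring steal time takes ten seconds and tells you a lot about your provider. vmstat ships with practically every distro:

# Sample CPU stats once per second, five times; watch the "st" column
vmstat 1 5
# A steal value that sits above a few percent means the hypervisor is
# handing your CPU time to someone else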
Furthermore, CoolVDS infrastructure is built on NVMe storage. While we are talking about networking, remember that etcd (the K8s brain) is extremely sensitive to disk write latency. If etcd is slow because of cheap spinning disks, the API server lags, and network updates (like new Endpoints) get delayed.
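If you want to put a number on disk latency, a fio run that mimics etcd's write pattern (small sequential writes with an fdatasync after each one) is a reasonable probe. The directory and sizes below are illustrative; create the test directory first and point it at the disk that will hold /var/lib/etcd:

# Requires fio; run against the volume that will back etcd
fio --name=etcd-probe --directory=/var/lib/etcd-test \
    --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300
# Check the fdatasync percentiles in the output: etcd wants the 99th
# percentile comfortably under 10ms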
Data Sovereignty and Latency
For those of us operating in Norway, the legal landscape is tightening. With the GDPR now fully enforceable, keeping data within national borders is a significant compliance advantage. Hosting on CoolVDS ensures your data stays in local data centers, subject to Norwegian law and Datatilsynet regulations, not hidden away in a generic "EU-West" zone.
Plus, physics is physics. Pinging a server in Oslo from Oslo takes around 2 ms; pinging Frankfurt takes 25 ms or more. For chatty microservices and database calls, that round-trip time adds up.
Final Thoughts
Kubernetes networking is complex, but it is manageable if you respect the overhead. Choose the right CNI, tune your kernel connection tracking, and never underestimate the value of high-performance underlying infrastructure.
Don't let I/O wait times or CPU steal kill your cluster's performance. Spin up a KVM-based, NVMe-backed instance on CoolVDS and give your packets the lane they deserve.