Kubernetes Networking Deep Dive: Solving Packet Loss & Latency in Production
Let’s be honest: Kubernetes networking is black magic to 90% of the engineers using it. You deploy a Service, the traffic flows, and everyone is happy. Until they aren't. Until the marketing team launches a campaign, your concurrent connections spike, and suddenly your 502 Bad Gateway errors are trending on Twitter.
I’ve spent the last month debugging a high-traffic fintech cluster hosted in Oslo. The symptoms? Random latency spikes and intermittent packet drops. The culprit wasn't the application code; it was a default Kubernetes network configuration that choked under load. This isn't a tutorial on how to install kubectl. This is a guide on how to stop your network from failing when it actually matters.
The CNI Jungle: Why Defaults Are Dangerous
Most managed Kubernetes providers give you a default CNI (Container Network Interface), usually Flannel or a basic Calico setup running VXLAN. In April 2022, sticking to VXLAN without hardware offloading is a performance tax you shouldn't pay.
VXLAN wraps pod traffic inside UDP packets. That adds CPU overhead for every packet processed. If you are running high-throughput workloads—like real-time data ingestion or heavy API traffic—that overhead accumulates. We saw CPU steal metrics rise on the nodes simply because the kernel was busy wrapping and unwrapping packets.
Pro Tip: If your infrastructure supports it, switch your CNI to use BGP (Border Gateway Protocol) instead of encapsulation. This routes pod traffic natively without the VXLAN header overhead. On CoolVDS KVM instances, we allow BGP peering, which lets you achieve near-bare-metal network performance.
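For Calico users, the switch usually comes down to turning off encapsulation on the IPPool so BGP-learned routes carry pod traffic natively. A minimal sketch, assuming calicoctl is installed; the pool name and CIDR below are examples and must match your own cluster:
# Disable IPIP and VXLAN encapsulation on the pool (BGP routes the traffic instead)
cat <<EOF | calicoctl apply -f -
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16
  ipipMode: Never
  vxlanMode: Never
  natOutgoing: true
EOF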
Code: Checking Your Encapsulation Overhead
You can verify whether you are suffering from MTU (Maximum Transmission Unit) fragmentation due to encapsulation. Standard Ethernet MTU is 1500 bytes. The VXLAN header takes 50 bytes, so your Pod MTU must be 1450. If it's left at 1500, packets get fragmented or dropped.
# Check the MTU on your host interface
ip link show eth0 | grep mtu
# Check the MTU inside a pod
kubectl exec -it my-pod -- ip link show eth0
If the Pod MTU equals the Host MTU while using an overlay network, you have a configuration problem.
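A quick way to confirm fragmentation trouble from a node is a ping with the Don't Fragment bit set (replace <pod-ip> with a real pod address from your cluster):
# 1472 bytes of payload + 28 bytes of ICMP/IP headers = a full 1500-byte frame.
# If the first command fails while the second succeeds, encapsulation overhead
# is eating into your MTU.
ping -M do -s 1472 -c 3 <pod-ip>
ping -M do -s 1422 -c 3 <pod-ip>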
Iptables vs. IPVS: The O(n) Nightmare
By default, Kubernetes uses iptables to implement Services. When a packet hits a Service VIP, the kernel walks a sequential chain of rules to decide where to send it. This is an O(n) operation. If you have 5,000 Services, the kernel traverses thousands of rules for every new connection.
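You can get a rough sense of the scale on any worker node still running in iptables mode:
# Number of NAT rules kube-proxy has programmed for Services
iptables-save -t nat | grep -c 'KUBE-'
# Total rule count across all tables
iptables-save | wc -l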
We migrated our clusters to IPVS (IP Virtual Server) mode. IPVS uses hash tables, making lookups O(1). It doesn't matter if you have 10 services or 10,000; the performance impact is negligible.
To enable IPVS in kube-proxy, you need to edit the config map. But first, ensure the kernel modules are loaded on your worker nodes:
# Load necessary IPVS modules
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack
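modprobe only lasts until the next reboot. On systemd-based distros, persist the modules so they load at boot; the file name below is an arbitrary convention:
# systemd reads /etc/modules-load.d/*.conf at boot
cat <<EOF > /etc/modules-load.d/ipvs.conf
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
EOF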
Here is the configuration block for kube-proxy to force IPVS mode. This is often the single biggest performance win for large clusters.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  excludeCIDRs: null
  minSyncPeriod: 0s
  scheduler: "rr" # Round Robin
  strictARP: true
  syncPeriod: 30s
  tcpFinTimeout: 0s
  tcpTimeout: 0s
  udpTimeout: 0s
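After updating the ConfigMap, recycle the kube-proxy pods and confirm the switch actually happened. The commands below assume a kubeadm-style cluster where kube-proxy runs as a DaemonSet labelled k8s-app=kube-proxy, and that ipvsadm is installed on the node:
# Restart kube-proxy so it picks up the new mode
kubectl -n kube-system rollout restart daemonset kube-proxy
# Services should now appear as IPVS virtual servers
ipvsadm -Ln | head -n 20
# The kube-proxy logs should mention the IPVS proxier
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50 | grep -i ipvs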
Kernel Tuning: The Conntrack Trap
In the "War Story" I mentioned earlier, the Norwegian fintech app was hitting a hard ceiling. It wasn't CPU or RAM. It was nf_conntrack. Linux tracks every connection statefully. When the table fills up, the kernel drops new packets silently. dmesg will show nf_conntrack: table full, dropping packet.
For a high-traffic node, the default values are laughably low. You must tune sysctl.conf. Do not rely on the OS defaults.
Critical Sysctl Tuning for K8s Nodes
Apply this configuration to /etc/sysctl.conf on all worker nodes. It increases the connection tracking table size and lets the kernel reuse sockets stuck in the TIME_WAIT state instead of holding them until they expire.
# Increase connection tracking table size
net.netfilter.nf_conntrack_max = 524288
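# Expire idle established connections after 24 hours instead of the 5-day default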
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
# Reuse closed sockets faster
net.ipv4.tcp_tw_reuse = 1
# Increase the range of ephemeral ports
net.ipv4.ip_local_port_range = 32768 60999
# Max backlog of connection requests
net.core.somaxconn = 32768
net.ipv4.tcp_max_syn_backlog = 16384
# BBR Congestion Control (Available since kernel 4.9)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
Run sysctl -p to apply. Note the inclusion of BBR congestion control. In our tests between Oslo and Amsterdam, BBR improved throughput by nearly 30% over standard Cubic on lossy links.
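To confirm BBR actually took effect (tcp_bbr ships as a loadable module on most distribution kernels from 4.9 onward):
# Load the module if it isn't built in, then verify the setting stuck
modprobe tcp_bbr
sysctl net.ipv4.tcp_available_congestion_control
sysctl net.ipv4.tcp_congestion_control   # should print: bbr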
Geography & GDPR: Why Location Latency Matters
You can optimize your kernel all day, but you cannot beat the speed of light. If your target market is Norway, hosting in a US-East region or even Frankfurt adds unavoidable milliseconds. For a database transaction requiring multiple round-trips, 30ms of added RTT compounds fast: ten round-trips and you are already at 300ms.
Furthermore, since the Schrems II ruling in 2020, data sovereignty is a legal minefield. Datatilsynet (The Norwegian Data Protection Authority) is increasingly strict about personal data leaving the EEA. Hosting on US-owned cloud providers, even in their EU regions, carries legal risk regarding the CLOUD Act.
This is where infrastructure strategy becomes a compliance strategy. Running your Kubernetes nodes on local infrastructure, like CoolVDS, ensures data stays within Norwegian jurisdiction and latency to the NIX (Norwegian Internet Exchange) is sub-millisecond.
Debugging Tools: Beyond Ping
When things go wrong, ping tells you nothing. You need to see where the packet dies. Is it the Service? The Ingress Controller? The Pod itself?
I use nsenter to debug networking from the perspective of the container, without needing curl installed inside the minimal container image.
# Find the Process ID of the container
PID=$(docker inspect -f '{{.State.Pid}}' <container_id>)
# Enter the container's network namespace
nsenter -t $PID -n ip addr show
nsenter -t $PID -n netstat -tunlp
Note: With Kubernetes 1.24 removing the dockershim, you might be using containerd. In that case, use crictl inspect to find the PID.
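A rough containerd equivalent, assuming a reasonably recent crictl that supports go-template output:
# Resolve the PID via the CRI instead of the Docker API
PID=$(crictl inspect --output go-template --template '{{.info.pid}}' <container_id>)
nsenter -t $PID -n ip addr show
nsenter -t $PID -n ss -tunlp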
The Hardware Reality
Kubernetes is software, but it runs on metal. Virtualization overhead is the silent killer of network performance. Many providers oversubscribe their hypervisors, leading to "noisy neighbor" issues where another customer's DDoS attack saturates the physical uplink, killing your latency.
We built CoolVDS on KVM with strict isolation specifically to avoid this. When you are pushing 10Gbps of traffic or managing a high-frequency trading bot, you need guaranteed I/O. We don't use container-based virtualization (like OpenVZ/LXC) for our VPS instances because the shared kernel prevents the deep tuning (like the sysctl examples above) that serious engineers require.
Final Thoughts
Optimizing Kubernetes networking is about removing bottlenecks one by one. First, strip away the overlay network overhead. Second, replace linear iptables lookups with IPVS. Third, tune the kernel to handle high concurrency. And finally, place your workload physically close to your users.
If you are fighting network latency in the Nordics, stop guessing. Check your nf_conntrack, enable BBR, and verify your physical path to NIX.
Need a sandbox to test these kernel tweaks without breaking production? Deploy a high-performance KVM instance on CoolVDS in Oslo. We give you full root access to the kernel, NVMe storage, and the local latency advantage your users demand.