Kubernetes Networking Deep Dive: Escaping the Iptables Hell

Let’s be honest. When you first ran kubectl apply -f on your laptop, it felt like magic. But now you’re in production, traffic is scaling, and that magic has turned into a nightmare of random 502 errors and latency spikes that don't show up in your application logs. Networking in Kubernetes is not magic; it is a complex stack of Linux kernel primitives, often held together by duct tape and hope.

I have spent the last three weeks debugging a cluster that was dropping 1% of packets randomly. It wasn't the app code. It wasn't the database. It was the conntrack table overflowing on the worker nodes because the underlying VPS provider was throttling CPU, causing soft interrupts to queue up and die. If you are running Kubernetes on budget shared hosting, you are building a skyscraper on a swamp.
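If you suspect the same failure mode, compare the live conntrack count against its ceiling; once the table is full, the kernel silently drops new connections (and logs "nf_conntrack: table full, dropping packet" to dmesg):

sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max
dmesg | grep -i conntrack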

This guide dives into the gritty reality of K8s networking in early 2020—specifically focusing on kube-proxy modes, CNI selection, and why your infrastructure choice determines whether your cluster survives a traffic spike.

The Bottleneck: Why Iptables is Killing You

By default, Kubernetes uses iptables for service discovery. In a small cluster, this is fine. iptables is a robust firewall tool. But it was never designed to be a load balancer. Every time you add a Service or a Pod, the list of rules grows linearly. When you have 5,000 services, the kernel has to traverse a massive sequential list of rules to figure out where to send a single packet.
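You can see the scale of the problem directly on a node; KUBE- is the chain prefix kube-proxy uses for its rules:

# Rules kube-proxy has programmed into the nat table
iptables-save -t nat | grep -c '^-A KUBE-'

# Total rules the kernel may have to walk
iptables-save | wc -l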

The Symptoms:

  • High CPU usage on nodes just processing network traffic.
  • Unexplained latency when establishing new connections.
  • Service updates (scaling a deployment) take seconds or minutes to propagate.

In 2020, if you are serious about performance, you need to switch kube-proxy to IPVS (IP Virtual Server) mode. IPVS is built for load balancing and uses hash tables, meaning performance remains constant regardless of how many services you have.

Pro Tip: Before enabling IPVS, ensure your kernel modules are loaded. On a CoolVDS KVM instance, you have full kernel control, so this is trivial. On restricted container environments, you might be out of luck.

Enabling IPVS on Kubernetes v1.17

First, load the necessary kernel modules on all your worker nodes:

# Load IPVS modules
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack
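To keep those modules loaded across reboots, drop them into a modules-load.d file (the path assumes a systemd-based distro; adjust for your init system):

cat <<EOF > /etc/modules-load.d/ipvs.conf
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
EOF

The ipvsadm and ipset userspace utilities are also handy for inspecting IPVS state from the node itself (package names vary by distro).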

Next, edit your kube-proxy ConfigMap. You can find it in the kube-system namespace.

kubectl edit configmap kube-proxy -n kube-system

Look for the mode setting and change it from "" (which defaults to iptables) to "ipvs".

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr" # Round Robin
  strictARP: true

After saving, kill the kube-proxy pods to force a restart:

kubectl -n kube-system delete pod -l k8s-app=kube-proxy

You should see an immediate drop in CPU usage on your nodes if you have a high service count.
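To confirm the switch actually took, query kube-proxy's mode endpoint on its default metrics port and inspect the virtual server table with ipvsadm (install it if your distro does not ship it):

# Should print "ipvs" on a converted node
curl -s http://localhost:10249/proxyMode

# One virtual server per Service ClusterIP, with a real server per endpoint
ipvsadm -Ln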

CNI Wars: Flannel vs. Calico

Choosing a Container Network Interface (CNI) is like choosing a database; migrating later is painful. In the Nordic hosting landscape, where latency to the NIX (Norwegian Internet Exchange) is a competitive advantage, you don't want your internal cluster network adding overhead.

Feature         | Flannel (VXLAN)          | Calico (BGP)
----------------|--------------------------|---------------------------
Mechanism       | Encapsulation (overlay)  | Direct routing (underlay)
Overhead        | High (packet-in-packet)  | Near native
Network policy  | No                       | Yes (security)
Complexity      | Low                      | Medium

Flannel is great for "it just works." But it encapsulates packets in VXLAN headers. This reduces the Maximum Transmission Unit (MTU) available for your payload and requires CPU cycles to encap/decap every packet.

For high-performance clusters on CoolVDS, I recommend Calico in BGP mode. Because CoolVDS provides true KVM virtualization with a clean network stack, you can run BGP between your nodes. This allows pods to communicate without the overhead of encapsulation.
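As a sketch of what that looks like with the projectcalico.org/v3 API (the AS number and pool CIDR below are placeholders; substitute your own values), you enable the node-to-node BGP mesh and turn off IP-in-IP encapsulation on the pool, then apply both with calicoctl:

apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: true
  asNumber: 64512            # private ASN, placeholder
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16       # match your pod CIDR
  ipipMode: Never            # no encapsulation; rely on BGP routes
  natOutgoing: true

calicoctl apply -f calico-bgp.yaml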

Tuning the MTU for Performance

A common mistake I see in Norway-based deployments is ignoring MTU. The standard internet MTU is 1500 bytes. If you use an overlay network, the header takes 50 bytes. If your inner packet is 1500 bytes, it gets fragmented. Fragmentation kills performance.

Check your interface MTU:

ip addr | grep mtu

If you must use VXLAN, ensure your inner MTU is set to 1450 (or lower, depending on the header). But on CoolVDS, we support Jumbo Frames on the private network backend, allowing you to push 9000 MTU for internal cluster traffic if configured correctly.
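A quick way to verify that a path is not fragmenting is a do-not-fragment ping: 1472 bytes of payload plus 28 bytes of IP/ICMP headers makes a full 1500-byte packet, so lower the size to match your overlay or jumbo-frame MTU (the target IP below is a placeholder for another node or pod):

# DF bit set; fails loudly instead of fragmenting if the path MTU is smaller than expected
ping -M do -s 1472 10.0.0.2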

The Infrastructure Factor: Why "Shared" Kills Networking

You can tune sysctl.conf all day, but if your neighbor on the physical host is mining crypto or compiling GCC, your network latency will suffer. This is the "Noisy Neighbor" effect.

Network packet processing is CPU-bound. When a packet arrives, the network card generates an interrupt. The CPU must stop what it's doing to handle that packet. If your VPS provider oversubscribes CPU (which most budget providers do), your "Steal Time" goes up. High Steal Time means the hypervisor isn't giving your VM the CPU cycles it needs to process packets.

The Result? Packet drops at the kernel level. Retransmissions. Latency.
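Both symptoms are visible from inside the VM. The st column in vmstat is steal time, and the second column of /proc/net/softnet_stat counts packets dropped because the softirq backlog was full:

# The "st" column should sit near zero on a healthy host
vmstat 1 5

# Per-CPU counters; a growing second column means the kernel is dropping packets
cat /proc/net/softnet_stat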

This is why we architect CoolVDS differently. We prioritize consistent CPU scheduling and use NVMe storage to prevent I/O wait from blocking the CPU. When you are hosting critical applications—perhaps subject to GDPR requirements where data must stay in Norway and remain accessible—stability is not optional.

Kernel Tuning for High Load

Don't let default Linux settings throttle your cluster. Apply these settings to /etc/sysctl.conf on your worker nodes to handle high connection rates (typical for Ingress controllers):

# Increase the range of ephemeral ports
net.ipv4.ip_local_port_range = 1024 65535

# Allow reusing sockets in TIME_WAIT state for new outbound connections
net.ipv4.tcp_tw_reuse = 1

# Increase max open files for high concurrency
fs.file-max = 2097152

# Max backlog of connection requests
net.core.somaxconn = 65535

# Increase the max number of memory map areas (critical for Elasticsearch/Java)
vm.max_map_count = 262144

Apply them with:

sysctl -p
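Given the conntrack exhaustion story from the intro, it is also worth raising the connection-tracking ceiling in the same file; the value below is a starting point rather than a universal answer, and the key only exists once the nf_conntrack module is loaded:

# Raise the connection tracking table limit (defaults are far too low for busy ingress nodes)
net.netfilter.nf_conntrack_max = 1048576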

Conclusion: Build on Solid Ground

Kubernetes is powerful, but it amplifies infrastructure weaknesses. Using iptables instead of IPVS limits your scalability. Using overlay networks on slow CPUs limits your throughput. And running on oversubscribed hardware guarantees inconsistent latency.

If you are building a production cluster in 2020, stop fighting the "noisy neighbors." Get a clean, high-performance KVM environment where you control the kernel, the modules, and the resources.

Ready to fix your latency? Spin up a CoolVDS instance in Oslo. With our NVMe-backed storage and uncompromised CPU allocation, your kube-proxy will finally breathe easy.