Kubernetes Networking in 2020: Surviving the iptables Hell on High-Traffic Clusters

Let’s cut the marketing noise. Everyone loves talking about Kubernetes orchestration, rolling updates, and self-healing pods. But almost nobody talks about the messy, chaotic plumbing underneath until it explodes at 3:00 AM. I’m talking about networking. Specifically, the moment your iptables ruleset grows so large that your CPU spends more time traversing chains than serving requests.

I recently audited a cluster for a fintech startup in Oslo. They were running a standard kubeadm setup on generic cloud instances. The symptom? Random 502 errors and latency spikes hitting 400ms on internal API calls. They blamed the code. I blamed the network.

It turned out their kube-proxy was struggling to manage 15,000 services in standard iptables mode. The kernel was choking. In this post, we will dissect how to fix this with IPVS, how to choose the right CNI for 2020, and why your hosting provider's physical location (hello, NIX) matters more than your YAML configuration.

The CNI Battlefield: Flannel vs. Calico

In mid-2020, your choice of Container Network Interface (CNI) defines your cluster's performance ceiling. If you are still using Flannel with the VXLAN backend, stop. It is simple, yes, but the encapsulation overhead is a performance killer for high-throughput applications.

For production workloads where every millisecond of latency counts, Calico is the CNI we rely on. It gives you pure Layer 3 routing with no encapsulation overhead if your underlying network supports BGP, or highly optimized IP-in-IP (IPIP) encapsulation if it does not.

Here is a snippet of a high-performance Calico configuration for Kubernetes 1.18, specifically tuning the MTU to avoid fragmentation—a silent killer of network performance.

kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  # Typha offloads datastore load from Felix; only needed when scaling beyond ~50 nodes
  typha_service_name: "none"
  # Set the MTU below the host interface MTU to leave room for encapsulation headers
  veth_mtu: "1440"
  # Use BIRD for BGP route distribution between nodes
  calico_backend: "bird"

Pro Tip: Always check the MTU of your host interface. If your VPS provider uses an overlay network with an MTU of 1450 and you set Docker to 1500, packets will fragment. This causes CPU spikes during reassembly. On CoolVDS NVMe instances, we provide standard frames, but you must align your CNI config to match the host exactly.
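
Before you set veth_mtu, verify what the host interface actually uses. A quick check (the interface name eth0 is just an example; yours may differ):

# Show the MTU of the primary interface (interface name may vary)
ip link show eth0 | grep -o 'mtu [0-9]*'

# Or list the MTU of every interface at once
ip -o link | awk '{print $2, $5}'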

Escaping iptables: The Shift to IPVS

By default, kube-proxy uses iptables to handle Service discovery and load balancing. That works fine for small clusters, but iptables evaluates rules as a sequential list: with 5,000 services, the kernel may have to walk thousands of rules for every new connection. Lookup cost is O(n).

IPVS (IP Virtual Server) uses hash tables. It is O(1). Whether you have 10 services or 10,000, the lookup time is virtually identical.
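
Before switching, it is worth seeing how large the ruleset has actually become on one of your nodes. A rough gauge (counts in the tens of thousands are a red flag):

# Count the NAT rules kube-proxy has programmed on this node
iptables-save -t nat | grep -c 'KUBE-'

# Total rule count across all tables, for a sense of chain traversal cost
iptables-save | wc -l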

To enable IPVS in Kubernetes 1.18, you first need to ensure the kernel modules are loaded on your worker nodes. This is often missed in basic tutorials.

# Load IPVS modules on the host
modprobe -- ip_vs
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- ip_vs_sh
modprobe -- nf_conntrack
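
Note that modprobe does not survive a reboot. One way to make the modules persistent is a modules-load.d entry (the file name ipvs.conf here is arbitrary); installing ipvsadm is also worthwhile so you can inspect the IPVS table later:

# Persist the modules across reboots (file name is arbitrary)
cat <<EOF > /etc/modules-load.d/ipvs.conf
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
EOF

# Userspace tool for inspecting the IPVS table (Debian/Ubuntu package name)
apt-get install -y ipvsadm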

Once the modules are loaded, you must update the kube-proxy ConfigMap. If you are using kubeadm, you can edit it directly:

kubectl edit configmap kube-proxy -n kube-system

Look for the mode setting and change it from "" (which defaults to iptables) to "ipvs".

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  excludeCIDRs: null
  minSyncPeriod: 0s
  scheduler: "rr" # Round Robin is usually sufficient
  strictARP: false
  syncPeriod: 30s

After saving, kill the kube-proxy pods to trigger a restart. You should see a massive drop in CPU usage on your nodes if you run high service counts.
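
A minimal way to do that restart and confirm the switch, assuming the standard kubeadm label k8s-app=kube-proxy:

# Delete the kube-proxy pods; the DaemonSet recreates them with the new config
kubectl -n kube-system delete pods -l k8s-app=kube-proxy

# The virtual server table should now be populated
ipvsadm -Ln | head -20

# kube-proxy logs which proxier it selected at startup
kubectl -n kube-system logs -l k8s-app=kube-proxy | grep -i ipvs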

Kernel Tuning for High Concurrency

A default Linux install is optimized for a desktop, not a high-traffic container host. When running hundreds of pods, you will hit the nf_conntrack limit. This results in the dreaded "table full, dropping packet" error in dmesg.

On a CoolVDS instance, we give you full root access to tune these parameters. Do not be shy with them.

# /etc/sysctl.conf tuning for K8s nodes

# Increase the connection tracking table size
net.netfilter.nf_conntrack_max = 1000000
# Expire idle established connections after 1 day instead of the 5-day default
net.netfilter.nf_conntrack_tcp_timeout_established = 86400

# Allow more pending connections
net.core.somaxconn = 32768

# Expand the range of ephemeral ports
net.ipv4.ip_local_port_range = 1024 65535

# Reuse TIME_WAIT sockets for new outbound connections (use with caution, but often necessary)
net.ipv4.tcp_tw_reuse = 1

Apply these with sysctl -p. If you do not tune somaxconn, your fancy NGINX Ingress controller will bottleneck regardless of how much CPU you throw at it.
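
After running sysctl -p, keep an eye on how close you actually get to the conntrack ceiling under real traffic; the live counters are exposed under /proc:

# Current tracked connections versus the configured maximum
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Any "table full, dropping packet" events will show up in the kernel log
dmesg | grep -i conntrack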

The Hardware Reality: Why Latency to NIX Matters

You can optimize software until you are blue in the face, but you cannot code your way out of bad physics. If your servers are physically located in a massive datacenter in Frankfurt or Amsterdam, but your customers are in Oslo or Bergen, you are adding 20-30ms of round-trip time (RTT) to every packet.

In the world of microservices, where a single frontend request might spawn 10 internal backend calls, that latency compounds. 20ms becomes 200ms.

This is why we engineered CoolVDS infrastructure with direct peering at the Norwegian Internet Exchange (NIX). We keep the packets local. Furthermore, Kubernetes relies heavily on etcd for state, and etcd is incredibly sensitive to disk write latency (fsync). If your VPS provider is running on standard SSDs (or worse, spinning rust) with noisy neighbors, etcd will miss heartbeats and trigger leader elections, causing cluster instability.

We strictly use enterprise NVMe storage. In our benchmarks, fsync latency on CoolVDS NVMe instances consistently stays below 2ms, whereas standard cloud SSDs often spike to 10-50ms under load. For a database like etcd, that is the difference between a healthy cluster and a split-brain disaster.
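
You can reproduce this measurement yourself with fio, using a fdatasync-heavy write pattern that approximates etcd's WAL behavior (the target directory below is an example; point it at the disk that will host /var/lib/etcd):

# Small sequential writes with an fdatasync after each, similar to etcd's WAL
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-bench --size=22m --bs=2300 --name=etcd-fsync
# Check the fsync/fdatasync latency percentiles in the output;
# the 99th percentile should stay well below 10ms for a healthy etcd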

Ingress: The Gatekeeper

Finally, let's talk about getting traffic into the cluster. The NGINX Ingress Controller is the standard workhorse. However, the default configuration is conservative. To handle DDoS attempts or massive traffic spikes, you need to modify the ConfigMap for the controller.

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  # optimize for throughput
  worker-processes: "auto"
  keep-alive: "65"
  # Security: hide version
  server-tokens: "false"
  # Buffer sizes for large headers (common with OIDC/OAuth)
  proxy-buffer-size: "16k"
  client-header-buffer-size: "2k"
  # Timeouts to prevent slowloris attacks
  client-body-timeout: "10"
  client-header-timeout: "10"
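
The controller picks up ConfigMap changes and reloads NGINX on its own. To confirm the rendered configuration actually contains your values, dump it from a controller pod (the label selector and file name below assume the standard ingress-nginx manifests):

# Apply the ConfigMap
kubectl apply -f nginx-configuration.yaml

# Grab a controller pod and check that a tuned value made it into nginx.conf
POD=$(kubectl -n ingress-nginx get pods -l app.kubernetes.io/name=ingress-nginx -o name | head -1)
kubectl -n ingress-nginx exec "$POD" -- nginx -T | grep proxy_buffer_size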

Conclusion

Kubernetes in 2020 is powerful, but it is not magic. It requires a deep understanding of the Linux kernel networking stack. By switching to IPVS, tuning your sysctls, and ensuring your underlying infrastructure provides low-latency NVMe storage and local peering, you turn a fragile cluster into a fortress.

Do not let high latency or IO wait kill your application's reputation. If you need a battle-tested environment that respects data sovereignty and delivers raw speed in Norway, it is time to upgrade.

Ready to drop your latency? Deploy a high-performance NVMe KVM instance on CoolVDS in under 60 seconds.