Kubernetes Networking Deep Dive: Optimizing CNI and Overlay Latency on KVM

Let’s be honest: Kubernetes networking is where 90% of your production incidents will eventually live. The abstraction is beautiful until a packet drops, and you find yourself staring at `iptables` rules generated by a script you didn't write, trying to figure out why a pod in Oslo can't talk to a database in Bergen.

I've spent the last decade debugging distributed systems, and the move from bare metal to virtualized containers introduced a layer of obfuscation that many Ops teams underestimate. If you are running K8s on standard VPS providers, you are likely suffering from the "double encapsulation" tax—VXLAN inside your cluster wrapped in the provider's own network virtualization.

This guide cuts through the vendor noise. We are going to look at how to architect K8s networking for performance, specifically within the context of the Norwegian infrastructure landscape (NIX) and strict compliance requirements (GDPR/Schrems II).

The CNI Battlefield: IPVS vs. eBPF

For years, `iptables` was the standard. It works. It’s also a linear list of rules. When your cluster scales to 5,000 services, `iptables` becomes a CPU bottleneck because the kernel has to traverse that list for every packet. I’ve seen clusters choke not on bandwidth, but on rule evaluation latency.

By May 2024, if you are still defaulting to `iptables` proxy mode, you are doing it wrong. Your two viable paths are IPVS (IP Virtual Server) or eBPF (Extended Berkeley Packet Filter).
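
If you are not sure which mode you are running today, two commands will tell you. A quick sketch, assuming a kubeadm-style cluster where kube-proxy keeps its configuration in a ConfigMap named kube-proxy in kube-system:

# Show the proxy mode kube-proxy is configured with (empty or "iptables" means the legacy path)
kubectl -n kube-system get configmap kube-proxy -o jsonpath='{.data.config\.conf}' | grep mode

# Count the service rules the kernel may have to traverse on this node
iptables-save | grep -c 'KUBE-SVC'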

Why eBPF (Cilium) Wins

We rely heavily on Cilium at this point. It bypasses `iptables` entirely, loading BPF programs directly into the kernel for packet processing. It’s faster, more observable, and handles high churn rates significantly better.

Here is a production-ready Helm configuration for deploying Cilium on a CoolVDS KVM instance. Note `kubeProxyReplacement=true`: it removes `kube-proxy` from the datapath entirely and lets Cilium handle service load balancing itself:

helm install cilium cilium/cilium --version 1.15.4 \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=API_SERVER_IP \
  --set k8sServicePort=6443 \
  --set ipam.mode=kubernetes \
  --set bpf.masquerade=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true

Pro Tip: On CoolVDS, because we provide KVM virtualization with VirtIO drivers, you can enable `bpf.masquerade=true` safely. On legacy hypervisors (like older Xen implementations), this often breaks due to lack of kernel support for BPF maps. Always check your kernel version: `uname -r` needs to report 5.10 or newer for full feature support.
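
Once the Helm install settles, verify that the BPF datapath actually took over rather than assuming it did. A quick check, assuming the chart defaults (agent DaemonSet named cilium in kube-system):

# Confirm the agent is healthy and kube-proxy replacement is active
kubectl -n kube-system exec ds/cilium -- cilium status | grep -E 'KubeProxyReplacement|Masquerading'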

The MTU Trap: Fragmentation Kills Latency

This is the most common misconfiguration I see in Norway. The standard Ethernet MTU is 1500 bytes. If your CNI uses an overlay network (like VXLAN or Geneve), it adds headers (usually 50 bytes). If your Pod tries to send a 1500-byte packet, the CNI wraps it, resulting in a 1550-byte packet.

The physical interface on the host drops this. The kernel attempts to fragment it. Performance tanks. CPU usage spikes.
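
You can catch this on the host before users do. A rough sketch, assuming the public interface is eth0 (adjust to your setup):

# Watch for ICMP "fragmentation needed" messages, the classic symptom of an MTU mismatch
tcpdump -ni eth0 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'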

On CoolVDS NVMe instances, Jumbo Frames are supported on the internal backend network, but your public interface is almost certainly 1500. You must set your CNI MTU low enough that the encapsulated packet still fits under the host MTU.

Check your host interface first:

ip link show eth0 | grep mtu

If it says 1500, set your CNI MTU to 1450 for VXLAN (50 bytes of encapsulation overhead); Geneve needs at least as much headroom, and more if options are in use. Here is how you patch it in a manifest-based Calico install if you aren't using Cilium. The 1440 below leaves a little extra margin:

kubectl patch configmap/calico-config -n kube-system --type merge \
  -p '{"data":{"veth_mtu": "1440"}}'

Ingress and Load Balancing without Cloud Magic

When you use AWS or GKE, you request a `LoadBalancer` Service and get an IP. On a VPS or bare-metal setup, that Service sits in `Pending` forever. You need MetalLB.

MetalLB allows your CoolVDS node to announce IPs via ARP (Layer 2) or BGP (Layer 3). For a small cluster (under 50 nodes), Layer 2 is sufficient and robust.

Step 1: Install MetalLB

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.5/config/manifests/metallb-native.yaml

Step 2: Configure the IP Address Pool

You will need a secondary IP range attached to your CoolVDS instance. Do not try to use your primary SSH IP for this.

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.10.0/24  # Replace with your assigned public subnet
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example
  namespace: metallb-system
spec:
  ipAddressPools:
  - first-pool  # Announce only the pool defined above
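
Once both objects are applied, a throwaway Deployment is the easiest way to confirm MetalLB is answering for the pool. A minimal smoke test using a generic nginx image:

# Expose a test deployment and watch the EXTERNAL-IP move from <pending> to a pool address
kubectl create deployment lb-test --image=nginx
kubectl expose deployment lb-test --port=80 --type=LoadBalancer
kubectl get svc lb-test -w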

Kernel Tuning for High-Throughput

Default Linux distributions are tuned for general-purpose computing, not for routing gigabits of container traffic. If you are running a high-traffic e-commerce site targeting the Nordic market, default settings will result in `nf_conntrack` table exhaustion during sales events.

We recommend applying the following `sysctl` configurations on your CoolVDS nodes before bootstrapping the cluster. These settings optimize the TCP stack and increase the limits for connection tracking.

# /etc/sysctl.d/k8s.conf

# Increase connection tracking limits (Critical for NAT)
net.netfilter.nf_conntrack_max = 1000000
net.netfilter.nf_conntrack_tcp_timeout_established = 86400

# Allow more pending connections
net.core.somaxconn = 32768
net.ipv4.tcp_max_syn_backlog = 8192

# Enable IP forwarding (Required for K8s)
net.ipv4.ip_forward = 1

# Optimize neighbor table for large clusters
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh3 = 16384

Apply them with `sysctl --system`. If you don't do this, you will see random connection resets under load, which developers usually blame on the application code. It's almost always the kernel dropping packets because the conntrack table is full.
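
It is also worth watching the table during load tests instead of waiting for the resets. A quick check (the last command assumes the conntrack-tools package is installed):

# Current entries vs. the configured ceiling; alert well before they converge
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
conntrack -C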

Data Sovereignty and Network Policies

In Norway, compliance is not optional. The Datatilsynet (Norwegian Data Protection Authority) is strict. If you have a multi-tenant cluster, you cannot rely on namespace separation alone. By default, K8s allows all Pods to talk to all other Pods. This is a security nightmare.

You must implement a Default Deny policy. This ensures that only explicitly allowed traffic flows. It is the network equivalent of "Zero Trust."

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: default
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Once applied, nothing works. This forces you to whitelist traffic. It is painful to set up initially, but it is the only way to ensure that a compromised frontend container cannot scan your internal database or reach out to a C2 server.
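
The first thing every namespace needs after the deny-all is DNS, or service discovery breaks silently. A minimal sketch that allows egress to CoreDNS in kube-system, assuming the standard kubernetes.io/metadata.name namespace label:

kubectl apply -f - <<'EOF'
# Allow all pods in "default" to reach DNS in kube-system; everything else stays blocked
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: default
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF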

Why Infrastructure Matters

You can have the best CNI config in the world, but if the underlying hypervisor steals CPU cycles or throttles I/O, your latency will suffer. Kubernetes control-plane traffic is chatty, and etcd requires extremely low-latency write syncing.

This is why we built CoolVDS on pure NVMe storage with KVM isolation. We don't oversubscribe CPU to the point of contention. When your CNI needs to process a packet, the CPU cycles are there. When etcd needs to write to disk, the NVMe IOPS are there. In the Norwegian market, where milliseconds to Oslo or Stockholm matter for user experience, the infrastructure layer is your foundation.
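
If you want to put a number on that last point, benchmark fdatasync latency on the etcd data directory before blaming the application. A sketch using fio, assuming it is installed and etcd lives in /var/lib/etcd; the 99th-percentile fdatasync latency should stay under roughly 10 ms:

# Mimic etcd's small sequential writes with an fdatasync after each one
fio --name=etcd-fsync-test --directory=/var/lib/etcd \
    --rw=write --ioengine=sync --fdatasync=1 \
    --bs=2300 --size=22m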

Don't let a bad overlay network or a fragmented packet ruin your uptime. Audit your MTU, switch to eBPF, and ensure your host nodes can handle the heat.