Kubernetes Networking in 2025: Bypassing the Overlay Tax for Sub-Millisecond Latency

There is a specific kind of silence that falls over a DevOps room when a microservice starts timing out only during peak traffic. It’s not the code. The application logs are clean. The database is sleeping. Yet, the 99th percentile latency just spiked from 20ms to 400ms. If you are running Kubernetes, the culprit is almost always the network layer you decided to ignore during the initial setup.

In 2025, deploying a cluster is easy. Making it performant is an entirely different beast. Most tutorials still default to simple overlay networks using VXLAN or Geneve encapsulation. While convenient, this introduces the "overlay tax": CPU cycles wasted on encapsulating and decapsulating packets between nodes. On a high-traffic cluster, that tax bankrupts your latency budget.

I've spent the last six months migrating a fintech workload from a generic cloud provider to KVM instances with bare-metal-class performance. The goal was simple: get a packet from the ingress controller to a pod in Oslo in under 1 ms. Here is how we stripped down the Kubernetes networking stack, ditched iptables, and why underlying infrastructure like CoolVDS makes the difference between an architecture that looks right on paper and one that is actually fast.

1. The Death of iptables and the Rise of eBPF

If you are still using kube-proxy in iptables mode, you are running technology that was already obsolete for high-scale environments five years ago. iptables is, at its core, a linear list of rules: with 5,000 services, the kernel has to traverse that list sequentially to find the rule matching each packet. That is O(n) complexity, and it hurts.
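
To see the scale of the problem on an existing node, count the NAT rules kube-proxy maintains; a rough sketch (run on a node still in iptables mode):

# Every Service and endpoint adds rules here; this number is what the kernel walks per packet
sudo iptables-save -t nat | grep -c '^-A KUBE'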

By August 2025, eBPF (Extended Berkeley Packet Filter) has firmly established itself as the standard for serious networking. Unlike iptables, eBPF lets us run sandboxed programs inside the kernel without changing kernel source code or loading modules. It turns that O(n) traversal into an O(1) hash table lookup.

We use Cilium as our CNI (Container Network Interface) of choice. It bypasses large parts of the host networking stack's iptables path and allows for direct routing between nodes.

Configuring Cilium for Native Routing

To eliminate the encapsulation overhead, we must use native routing. This requires your underlying network (the VPS or metal) to know how to route Pod CIDRs. On CoolVDS, where you have full control over the KVM network stack, this is straightforward.

Here is the Helm configuration we use to deploy Cilium with kube-proxy replacement enabled:

helm install cilium cilium/cilium --version 1.16.1 \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT} \
  --set routingMode=native \
  --set autoDirectNodeRoutes=true \
  --set ipv4NativeRoutingCIDR=10.0.0.0/8 \
  --set bpf.masquerade=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true

Note the flags: routingMode=native disables VXLAN/Geneve encapsulation (it supersedes the older tunnel=disabled option), and autoDirectNodeRoutes=true tells Cilium to inject routes to other nodes' Pod CIDRs directly into the Linux kernel routing table. This is the difference between "it works" and "it flies."
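
After the install, we verify two things: that the eBPF kube-proxy replacement is actually active, and that routes to the other nodes' Pod CIDRs landed in the kernel routing table. A quick sketch (the in-agent CLI is invoked as cilium here; newer images also ship it as cilium-dbg, and the route pattern must match your cluster's pod ranges):

# Confirm kube-proxy replacement and native routing are active
kubectl -n kube-system exec ds/cilium -- cilium status | grep -E 'KubeProxyReplacement|Routing'

# On a node: each peer node's Pod CIDR should show up as a plain kernel route
ip route | grep -E '^10\.'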

2. Kernel Tuning: The Forgotten Layer

Your CNI can only be as fast as the kernel allows it to be. Default Linux distributions are tuned for general-purpose workloads, not for pushing 50k packets per second (pps) on a Kubernetes node.

I frequently see developers throw more vCPUs at a problem when the bottleneck is actually netfilter's connection-tracking limits or the backlog queue. When hosting in Norway, especially if you are leveraging the low latency to NIX (the Norwegian Internet Exchange), you need to ensure the kernel isn't dropping packets before your application ever sees them.
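
Before touching sysctls, confirm the kernel is actually dropping anything. The second column of /proc/net/softnet_stat counts packets dropped because a CPU's backlog queue was full; a minimal check (values are hexadecimal):

# Column 1 = packets processed, column 2 = dropped due to backlog overflow (hex)
awk '{print "cpu" NR-1 ": processed=0x" $1 "  dropped=0x" $2}' /proc/net/softnet_stat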

Apply these settings via a DaemonSet or Cloud-Init on your CoolVDS nodes:

# /etc/sysctl.d/99-k8s-network.conf

# Increase the maximum backlog of pending connections on a listening socket
net.core.somaxconn = 32768

# Increase the range of ephemeral ports
net.ipv4.ip_local_port_range = 1024 65000

# Allow reusing sockets in TIME_WAIT state for new outgoing connections
net.ipv4.tcp_tw_reuse = 1

# Increase the read/write buffer sizes for TCP
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# BPF JIT compiler (Performance boost for eBPF)
net.core.bpf_jit_enable = 1

Pro Tip: Always check net.netfilter.nf_conntrack_max. If your pods make a massive number of external API calls, you will hit the connection-tracking limit, resulting in silent packet drops that look like random timeouts.
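
A minimal sketch of that check, plus a bump that must be sized to your node's RAM (each conntrack entry costs a few hundred bytes, so don't copy the number blindly):

# How close are we to the ceiling right now?
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Raise the limit and persist it alongside the file above
echo 'net.netfilter.nf_conntrack_max = 1048576' | sudo tee /etc/sysctl.d/99-conntrack.conf
sudo sysctl --system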

3. The Infrastructure Reality: Why "Managed" Often Fails

Here is the uncomfortable truth: You can optimize your Kubernetes config to perfection, but if your neighbor on the physical host is running a crypto miner or a poorly optimized video transcoder, your network stability will suffer. This is the "noisy neighbor" effect.

This is why, for mission-critical K8s clusters, we moved away from shared container-based VPS solutions to KVM-based instances like CoolVDS. KVM provides hardware-level virtualization.

When we provision a CoolVDS NVMe instance in Oslo, we get properly isolated CPU time and negligible I/O wait. This matters for networking because high I/O wait can keep the CPU from processing soft interrupts (softirq), which are responsible for handling incoming network packets. If your CPU is stuck waiting on a slow disk, your network latency climbs.
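
To see whether softirq processing is being starved, watch %soft and %iowait together. A sketch using sysstat's mpstat (1-second interval, 5 samples) plus the raw per-CPU softirq counters:

# High %iowait alongside a climbing %soft means packets are queuing behind disk waits
mpstat -P ALL 1 5

# NET_RX should grow evenly across cores; a single hot core points at IRQ affinity problems
grep -E 'NET_RX|NET_TX' /proc/softirqs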

Benchmarking Network Throughput

Don't trust the marketing. Verify the pipe. We use iperf3 to measure the real throughput between two pods on different nodes.

Step 1: Launch a server pod.

kubectl run iperf-server --image=networkstatic/iperf3 -- -s

Step 2: Launch a client pod on a different node (pin it with nodeName as in the override below, or use a nodeSelector), targeting the server pod's IP. Get that IP with kubectl get pod iperf-server -o wide; 10.244.1.5 below is just an example.

kubectl run iperf-client --image=networkstatic/iperf3 --restart=Never \
  --overrides='{"apiVersion": "v1", "spec": {"nodeName": "worker-node-2"}}' \
  -- -c 10.244.1.5 -t 30

If you see retransmits (a non-zero Retr column), you have packet loss, most likely caused by a noisy neighbor or a bad MTU configuration.
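
Reading the results and isolating loss, assuming the same pod names and server IP as above (TCP hides loss inside retransmits, so a UDP pass reports jitter and lost datagrams explicitly):

# TCP results, once the 30-second run has finished
kubectl logs iperf-client

# UDP pass at a fixed bitrate; the summary lists lost/total datagrams and jitter
kubectl run iperf-udp --image=networkstatic/iperf3 --restart=Never \
  --overrides='{"apiVersion": "v1", "spec": {"nodeName": "worker-node-2"}}' \
  -- -c 10.244.1.5 -u -b 1G -t 30
kubectl logs iperf-udp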

4. MTU and Jumbo Frames

Another common source of trouble is MTU (Maximum Transmission Unit) mismatch and the fragmentation it causes. The standard Ethernet MTU is 1500 bytes. If you use an overlay network (VXLAN), the encapsulation header adds roughly 50 bytes, so if your physical interface is capped at 1500, the inner packet must be smaller (about 1450 bytes). When a pod tries to send a full 1500-byte packet, it gets fragmented or dropped.

The solution is typically to enable Jumbo Frames (MTU 9000) on the physical network interface, allowing the overlay ample room.

To check your current MTU on a node:

ip link show | grep mtu
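
To confirm the path actually carries the frame size you expect, send a do-not-fragment ping between two nodes right at the limit. A sketch (1472 bytes of ICMP payload + 28 bytes of headers = 1500; 8972 + 28 = 9000 for jumbo frames; 10.0.0.12 is a placeholder for a peer node's IP):

# Fails with "message too long" if anything on the path is below MTU 1500
ping -M do -s 1472 -c 3 10.0.0.12

# Jumbo-frame variant, only meaningful if MTU 9000 is configured end to end
ping -M do -s 8972 -c 3 10.0.0.12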

If you cannot change the physical MTU (common in public clouds), you must configure your CNI to account for the overhead.

# Example Cilium MTU config map patch
kubectl -n kube-system patch configmap cilium-config --type merge --patch '{"data":{"mtu": "1450"}}'
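
One caveat from our migration: the agents appear to read this value only at startup, so the patch takes effect only after the Cilium DaemonSet is rolled (setting the MTU via Helm values at install time avoids the dance entirely):

# Restart the agents so they pick up the new MTU
kubectl -n kube-system rollout restart ds/cilium
kubectl -n kube-system rollout status ds/cilium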

5. Norway Specifics: GDPR and NIX

Operating in 2025 means strict adherence to data sovereignty requirements. With the tightening of Schrems II interpretations, many Norwegian entities are moving workloads back from US-owned clouds to local infrastructure. Latency is the bonus here.

Hosting on CoolVDS in Oslo connects you directly to the local peering ecosystem. A request from a user in Bergen to your K8s cluster should not route through Frankfurt. We verified this using mtr (My Traceroute).

Command to verify routing path:

mtr -rwc 10 google.no

If you see hops outside of Norway (e.g., telia.net routing via Sweden/Denmark unnecessarily), your BGP routing needs adjustment, or your provider is cutting corners on transit costs.

Summary: The Perfect Stack

Building a high-performance Kubernetes cluster is an exercise in removing layers. We remove the overlay network using native routing. We remove iptables using eBPF. And we remove resource contention using dedicated KVM instances.

Feature           | Standard VPS             | CoolVDS Implementation
Virtualization    | Container/Shared         | KVM (Kernel-based Virtual Machine)
CNI Compatibility | Limited (often Flannel)  | Full support (Cilium/Calico with eBPF)
Disk I/O          | SATA/SAS HDD             | NVMe (low wait for softirq)
Location          | Central Europe           | Oslo (low latency)

The network is the computer. In Kubernetes, this is doubly true. Don't let your infrastructure be the reason your architecture fails.

Ready to drop your latency? Deploy a KVM-based node on CoolVDS today and test the iperf3 benchmarks yourself. The raw speed speaks for itself.