Kubernetes Networking on Bare Metal: Fixing Latency & CNI Nightmares in 2024

The Packet Never Lies: A Deep Dive into K8s Networking Performance

Kubernetes networking is magic until it stops working. Then, it becomes a crime scene where the primary suspect is always DNS, but the actual killer is usually latency or a misconfigured CNI. I have spent more Friday nights than I care to admit staring at tcpdump output, trying to figure out why a microservice in Namespace A treats Namespace B like it's on a different planet. Most tutorials give you the happy path: install Flannel, expose a Service, and go home. In production, especially here in the Nordic region where strict data sovereignty meets high-performance demands, the happy path is a myth.

If you are running K8s on bare metal or high-performance KVM instances (which you should be, if you care about IOPS), the default networking stack is likely choking your throughput. We are going to look at how to strip away the overhead, implement eBPF to bypass iptables hell, and ensure your traffic stays within Norway's borders to satisfy the Datatilsynet auditors.

The Overlay Tax: Why Your Network is Slow

By default, Kubernetes uses an overlay network. It encapsulates packets (VXLAN or IPIP), sends them across the wire, and decapsulates them on the other side. This adds CPU overhead. On a standard cloud instance with noisy neighbors, this is negligible because the noise masks the inefficiency. On dedicated resources or high-performance VPS Norway solutions like CoolVDS, this overhead is the bottleneck.
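Before optimizing anything, confirm what your CNI is actually doing on the wire. A quick sanity check on a worker node (the interface names below are common defaults such as flannel.1, cilium_vxlan, or tunl0, not guaranteed to match your environment):

# VXLAN or IPIP devices mean every inter-node packet is being encapsulated
ip -d link show type vxlan
ip -d link show type ipip

# Compare the tunnel MTU against the physical NIC (eth0 is an example name)
ip link show eth0 | grep mtu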

We recently migrated a high-traffic fintech workload from a managed cloud to CoolVDS NVMe instances. The goal was to reduce latency for transactions hitting the NIX (Norwegian Internet Exchange). We found that by switching the CNI (Container Network Interface) mode, we dropped p99 latency by 4ms. That sounds small, but in HFT (High Frequency Trading) or real-time bidding, it is an eternity.

Ditching Kube-Proxy for eBPF

In early 2024, if you are still running the default kube-proxy in iptables mode on a cluster larger than 50 nodes, you are doing it wrong. Iptables rules are evaluated linearly, and every Service or endpoint change forces a rewrite of the whole ruleset. With thousands of Services, the kernel spends more time traversing rule chains than routing packets. The solution is Cilium with eBPF.

eBPF allows us to run sandboxed programs in the operating system kernel. It bypasses the iptables bottleneck entirely. Here is a production-ready Helm configuration for Cilium that we use to enable direct routing (no tunneling) on CoolVDS instances. This requires your underlying network to support L2 routing, which is standard on our infrastructure.

helm install cilium cilium/cilium --version 1.15.1 \
  --namespace kube-system \
  --set routingMode=native \
  --set autoDirectNodeRoutes=true \
  --set ipv4NativeRoutingCIDR=10.0.0.0/8 \
  --set kubeProxyReplacement=true \
  --set loadBalancer.mode=dsr \
  --set k8sServiceHost=API_SERVER_IP \
  --set k8sServicePort=6443

Note the loadBalancer.mode=dsr (Direct Server Return). This lets the backend pod reply directly to the client instead of hopping back through the load-balancing node, removing an entire hop from the return path. This is critical for bandwidth-heavy applications.
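After the install, verify that the settings actually landed. A minimal check, assuming you have the cilium CLI on your workstation (exact config key names can shift between releases):

# Wait for the agent and operator to report healthy
cilium status --wait

# Dump the resolved agent configuration and check the datapath settings
cilium config view | grep -E 'routing-mode|kube-proxy-replacement|auto-direct-node-routes'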

Preserving the Source IP (The `externalTrafficPolicy` Trap)

A common pain point I see in my consulting gigs: The application logs show the internal IP of the ingress controller instead of the actual client IP. This breaks geo-blocking logic and fraud detection systems. You might think adding X-Forwarded-For headers at the ingress layer solves this. It doesn't if the packet has already been SNAT'd (Source Network Address Translation) before it hits the ingress.

To fix this on a bare-metal setup, you must configure your LoadBalancer service correctly. You need to force traffic to land only on nodes that actually run the pod.

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ports:
  - name: http
    port: 80
    targetPort: 80
  - name: https
    port: 443
    targetPort: 443
  selector:
    app.kubernetes.io/name: ingress-nginx

Setting externalTrafficPolicy: Local drops packets if the node receiving the traffic doesn't host the pod. This forces your external Load Balancer (or BGP router) to be health-check aware, but the payoff is visibility and slightly better performance since you eliminate an internal hop.
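With externalTrafficPolicy: Local, Kubernetes also allocates a dedicated health-check NodePort for the Service, which is what your external load balancer or BGP speaker should probe. A quick way to find and test it (node IP and port are placeholders):

# Print the health-check NodePort allocated for the Local policy
kubectl -n ingress-nginx get svc ingress-nginx-controller \
  -o jsonpath='{.spec.healthCheckNodePort}{"\n"}'

# Probe it per node from the load balancer's point of view; it returns
# HTTP 200 only on nodes that currently host a ready ingress pod
curl -si http://<node-ip>:<health-check-node-port>/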

Storage Latency is Networking Latency

This sounds contradictory, but in Kubernetes, they are inextricably linked via etcd. Etcd is the brain of the cluster. It requires fsync latency to be under 10ms. If your disk I/O is slow, etcd slows down. If etcd slows down, the API server delays updates. If the API server lags, endpoint updates for Services don't propagate. Suddenly, your networking is pointing traffic to dead pods.

I once debugged a cluster where network timeouts were rampant every day at 02:00. It wasn't a DDoS. It was a backup job saturating the slow SATA SSDs on the control plane nodes. The NVMe storage provided by CoolVDS eliminates this class of problem entirely. We use KVM virtualization, which provides near-native disk access.

Before you blame the network, benchmark your etcd disk performance:

# Run this on your control plane node
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=test-data --size=22m --bs=2300 \
    --name=mytest

If the 99th percentile of your fdatasync durations is consistently above 10ms, your network issues are actually storage issues.
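fio tells you what the disk can do; etcd's own metrics tell you what it is experiencing in production. A quick sketch, assuming kubeadm-style certificate paths (adjust to your control plane layout):

# WAL fsync duration histogram as seen by etcd itself
curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
     --cert /etc/kubernetes/pki/etcd/server.crt \
     --key /etc/kubernetes/pki/etcd/server.key \
     https://127.0.0.1:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds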

GDPR and The "Schrems II" Reality

In Norway, compliance is not optional. The "Schrems II" ruling effectively killed the legality of blindly shipping personal data to US-controlled clouds without massive legal gymnastics. Technical sovereignty is the answer. You need to know exactly where your packets are going.

Using NetworkPolicies is the standard way to enforce this firewalling inside the cluster. A "default deny" policy is mandatory for any serious production environment. It ensures that if a pod is compromised, the attacker cannot easily scan your internal network.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: sensitive-data
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Once you apply this, nothing talks to anything. You then whitelist specific paths. It is tedious, but it is the only way to pass a rigorous security audit. Running this on CoolVDS adds a layer of physical security: your data resides on servers in Oslo, governed by Norwegian law, not in a nebulous "availability zone" that might legally fall under foreign jurisdiction.
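To make that whitelisting step concrete, here is a hypothetical allow policy layered on top of the default deny. The labels, ports, and namespace names are illustrative only; adjust them to your workloads:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-and-dns
  namespace: sensitive-data
spec:
  podSelector:
    matchLabels:
      app: payment-api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53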

Pro Tip: Use `crictl` to debug networking on the node level when `kubectl` is lying to you.
`crictl inspectp <pod-id>` will show you the pod sandbox's raw network details (assigned IP, network namespace) that Kubernetes abstracts away.
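A typical sequence looks like this, run on the node hosting the pod (the pod name is an example):

# Find the sandbox ID of the pod whose networking looks wrong
crictl pods --name payment-api -q

# Dump the sandbox status: network namespace, assigned IP
crictl inspectp <pod-id>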

The CoolVDS Architecture Difference

We built CoolVDS because we were tired of "noisy neighbor" syndrome on oversold cloud platforms. When you deploy a Kubernetes node with us, you aren't fighting for CPU cycles to process packets. You get dedicated resources. Our infrastructure is designed for the low latency demands of modern microservices.

We don't offer a managed Kubernetes service that hides the control plane from you. We give you the raw, high-performance KVM instances so you can build the cluster exactly how you need it—whether that's a sprawling Cilium mesh or a tight, security-focused RKE2 deployment. You control the kernel, the CNI, and the data.

Final Optimization Checklist

  • MTU Sizes: With native routing, the pod MTU can match the host NIC (usually 1500, or 9000 with Jumbo Frames if supported). With an overlay, subtract the encapsulation overhead (roughly 50 bytes for VXLAN) or packets will fragment.
  • Conntrack: Increase `net.netfilter.nf_conntrack_max`. The default is often too low for high-traffic ingress nodes; see the sketch after this list.
  • DNS: Switch to NodeLocal DNSCache to prevent conntrack races on UDP packets.
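A minimal conntrack tuning sketch for a busy ingress node; the limit below is an example value, so size it against your traffic and available memory:

# Current usage vs. limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Persist a higher limit (example value)
echo 'net.netfilter.nf_conntrack_max = 1048576' | sudo tee /etc/sysctl.d/90-conntrack.conf
sudo sysctl --system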

Networking in Kubernetes is complex, but on the right hardware, it is deterministic. Don't let virtualization overhead kill your architecture. If you are ready to stop debugging timeouts and start pushing traffic, spin up a CoolVDS instance. The pings to NIX don't lie.