Packet Loss, Latency, and the CNI Jungle: A Kubernetes Networking Deep Dive for 2025

Most developers treat Kubernetes networking as magic. You define a Service, throw in an Ingress, and assume packets will flow. Then, usually at 2 AM on a Saturday, the magic dies. DNS lookups start timing out, conntrack tables overflow, and you realize that your overlay network is adding 30% latency to your microservices.

I have spent the last decade debugging distributed systems, from bare-metal racks in basement datacenters to massive multi-region cloud deployments. The truth is simple: Kubernetes doesn't solve networking; it abstracts it. And leaky abstractions cause outages.

In 2025, with the maturity of eBPF and the Gateway API, we have tools to fix this. But tools are useless if your underlying infrastructure is garbage. This guide dissects the packet flow, tears down the CNI (Container Network Interface) choices available today, and explains why hosting your control plane on high-performance infrastructure—like we offer at CoolVDS—is not a luxury, it's a requirement for stability.

The CNI Battlefield: iptables vs. eBPF

For years, kube-proxy using iptables was the standard. It works, but it scales linearly with the number of services. If you have 5,000 services, iptables becomes a linked-list nightmare, causing massive CPU consumption just to route a packet. If you are still running a default kubeadm setup without tuning, you are likely hitting this bottleneck.
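
If you want to see how deep that rule list already is, a quick check on a node (assuming kube-proxy in its default iptables mode and root access) looks like this:

# Count the NAT rules kube-proxy has programmed on this node
sudo iptables-save -t nat | grep -c '^-A KUBE-'

# A full listing that takes seconds instead of milliseconds is your warning sign
time sudo iptables -t nat -L -n > /dev/null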

In 2025, the industry standard for high-performance clusters is eBPF (Extended Berkeley Packet Filter). Tools like Cilium bypass iptables entirely, handling routing directly in the kernel space. This isn't just about speed; it's about visibility.

When we deploy high-load clusters for clients in Norway, specifically those requiring low latency to the NIX (Norwegian Internet Exchange), we mandate eBPF. Why? Because every millisecond of overhead counts when you are serving real-time data.

Configuring Cilium for Direct Routing

Don't just install the defaults. Tunneling (VXLAN/Geneve) adds overhead (MTU reduction and encapsulation CPU cost). If your underlying network supports it—and on CoolVDS KVM instances, it does—you should use Direct Routing.

helm install cilium cilium/cilium --version 1.15.5 \
  --namespace kube-system \
  --set routingMode=native \
  --set autoDirectNodeRoutes=true \
  --set ipv4NativeRoutingCIDR=10.0.0.0/8 \
  --set loadBalancer.mode=dsr \
  --set kubeProxyReplacement=true

This configuration disables encapsulation. Packets travel with their pod IP directly on the wire. This requires your underlying VPC or VLAN to be aware of pod CIDRs, but the performance gain is massive—often a 15-20% reduction in latency compared to VXLAN.
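
Before trusting it, confirm the agent actually came up in native routing mode instead of silently falling back. One way to check, assuming the default cilium DaemonSet in kube-system:

# Routing should report "Native" and kube-proxy replacement should be enabled
kubectl -n kube-system exec ds/cilium -- cilium status | grep -E 'KubeProxyReplacement|Routing'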

The Hidden Killer: MTU Fragmentation

One of the most common issues I see in ticket queues is intermittent connection resets. This is almost always an MTU (Maximum Transmission Unit) mismatch. If your host interface is 1500 bytes, and your CNI adds 50 bytes of VXLAN headers, your Pod interface must be 1450. If it's set to 1500, packets get dropped or fragmented.

Check your actual link MTU inside a node:

ip -d link show eth0 | grep mtu

If you are on CoolVDS, our network fabric supports jumbo frames in specific availability zones, but standardizing on 1500 is safe. Either way, make sure your CNI's MTU setting accounts for whatever encapsulation overhead sits on top of the host interface.
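
A quick sanity check is to compare the node value with what a pod actually sees; my-app below is a placeholder for any workload whose image ships basic coreutils:

# The pod-side MTU must be <= node MTU minus any encapsulation overhead
kubectl exec deploy/my-app -- cat /sys/class/net/eth0/mtu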

Pro Tip: Never assume the default MTU is correct. Use tracepath to determine the actual Path MTU to your destination. A mismatched MTU is the silent killer of TLS handshakes.
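
For example, probing the path to an external dependency (replace the hostname with one you actually call):

# The final "pmtu" value is the largest packet that survives the whole path
tracepath -n api.example.com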

Service Mesh and the Gateway API

The Ingress resource is legacy. In 2025, the Gateway API is the mature standard we use for complex traffic splitting and blue/green deployments. It separates the Gateway (infrastructure) from the Route (application logic).

Here is how you define a clean HTTPRoute that splits traffic between two versions of an app, a common requirement for DevOps teams in Oslo adhering to strict deployment pipelines:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route
  namespace: production
spec:
  parentRefs:
  - name: external-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /checkout
    backendRefs:
    - name: checkout-v1
      port: 8080
      weight: 90
    - name: checkout-v2
      port: 8080
      weight: 10
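
The parentRefs entry assumes a Gateway named external-gateway already exists in the same namespace. A minimal sketch of that resource could look like this; the gatewayClassName and certificate Secret are placeholders that depend on your controller (Envoy Gateway, Istio, etc.):

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-gateway
  namespace: production
spec:
  gatewayClassName: envoy            # set to whatever your controller registers
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: store-tls-cert         # assumes a TLS Secret in this namespace
    allowedRoutes:
      namespaces:
        from: Same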

This declarative approach eliminates the "spaghetti configuration" of Nginx annotations. However, the Gateway implementation (like Envoy or Istio) requires significant memory. If you run this on a cheap, oversold VPS where "2GB RAM" actually means "2GB until the neighbor wakes up," your Envoy proxy will be OOM-killed (Out of Memory) during traffic spikes.

This is where the CoolVDS architecture differs. We use strict KVM isolation. Memory is reserved. If you pay for 4GB, you get 4GB dedicated to your control plane, ensuring your networking components don't crash when you need them most.

Storage Networking: etcd Needs IOPS

You cannot talk about Kubernetes networking without mentioning etcd. Every network change, every IP assignment, every service update is a write to etcd. If etcd is slow, your network convergence is slow.

etcd requires low-latency storage. If fsync regularly takes longer than 10ms, the cluster degrades. On spinning rust or shared SATA SSDs, this is common. We run our Norwegian clusters exclusively on NVMe arrays.

Verify your etcd storage latency with fio before installing K8s:

fio --rw=write --ioengine=sync --fdatasync=1 \
  --directory=/var/lib/etcd --size=100m --bs=2300 \
  --name=etcd_benchmark

If the 99th percentile latency (fsync) is above 10ms, do not build a cluster there. You will suffer from "Leader Election Lost" errors and phantom network outages.
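
On a cluster that is already running, the same signal shows up in etcd's own metrics. A rough check from a control plane node, assuming the default kubeadm certificate layout:

# The p99 of etcd_disk_wal_fsync_duration_seconds should stay well under 10ms
curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
  https://127.0.0.1:2379/metrics | grep wal_fsync_duration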

Data Sovereignty and Latency in Norway

For our clients operating under strict GDPR requirements or reporting to Datatilsynet, data location is not optional. But beyond compliance, there is physics. If your users are in Oslo or Bergen, routing traffic through Frankfurt adds 20-30ms of round-trip time. In a microservices architecture where one user request triggers 50 internal service calls, every call that has to cross that distance sequentially stacks another 20-30ms onto the response time.

Hosting locally in Norway reduces that physical distance. However, local hosting often means expensive, legacy providers. CoolVDS bridges this gap by offering cloud-native speeds (NVMe, 10Gbps uplinks) with local presence.

Debugging the Black Box

When things break, kubectl logs isn't enough. You need to see the wire. One of the most useful techniques is launching a transient debug pod with network tools pre-installed, attached to the host network namespace.

kubectl run tmp-shell --rm -i --tty \
  --image nicolaka/netshoot \
  --overrides='{"spec": {"hostNetwork": true}}' \
  -- bash

Inside this shell, you can use tcpdump, mtr, conntrack, and the rest of the bundled toolkit to inspect traffic exactly as the node sees it.
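
For the DNS timeouts from the opening paragraph, for example, you can watch queries leave the node and check whether answers ever come back:

# Capture DNS traffic on every interface the node can see
tcpdump -i any -nn udp port 53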

For example, checking whether a specific Service IP is being translated correctly by IPVS (this only applies when kube-proxy runs in IPVS mode):

ipvsadm -L -n | grep 10.96.0.1
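
If you replaced kube-proxy with Cilium as configured earlier, the equivalent lookup lives in the agent instead:

# Cilium keeps its own service translation table
kubectl -n kube-system exec ds/cilium -- cilium service list | grep 10.96.0.1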

Conclusion

Kubernetes networking is deterministic. It only feels chaotic when you lack observability or when the underlying hardware fluctuates. By 2025, the tools have matured—eBPF and Gateway API are production-ready. The variable that remains is your infrastructure provider.

Don't build a Ferrari engine (K8s) and put it inside a rusted chassis (oversold hosting). For your next cluster, prioritize NVMe storage for etcd stability and dedicated CPU for packet processing.

Ready to stabilize your production workloads? Deploy a CoolVDS instance in our Oslo zone today and experience the difference raw IOPS makes for your Kubernetes control plane.