
Kubernetes Networking Deep Dive: Surviving the Packet Jungle in 2024

Let's be honest. Kubernetes networking is a black box for 90% of the developers using it. You run kubectl apply -f service.yaml, you see an IP address appear, and you assume traffic flows magically from point A to point B. It works great until you hit 5,000 requests per second, or until you try to debug why your gRPC calls to a pod on a different node are timing out intermittently.

I've spent the last month debugging a cluster for a fintech client in Oslo. Their previous provider claimed the issue was "application latency." It wasn't. It was a saturated conntrack table on the host node combined with a noisy neighbor stealing CPU cycles needed for packet switching. This is the reality of K8s in 2024: the abstraction leaks.
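
If you suspect the same failure mode on your own nodes, a quick check looks something like this (the nf_conntrack sysctls should exist on any node running kube-proxy, since it depends on connection tracking):

# How close is the node to its conntrack ceiling?
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# When count reaches max, the kernel drops new connections and logs
# "nf_conntrack: table full, dropping packet"
dmesg | grep -i conntrack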

Today, we aren't talking about basic Service definitions. We are ripping open the hood to look at CNI (Container Network Interface) choices, eBPF acceleration, and why the underlying hardware of your VPS determines whether your cluster flies or dies.

The CNI Decision: Move Beyond Flannel

If you spun up your cluster using default settings (kubeadm defaults or some managed click-and-deploy tool), you are likely running Flannel with VXLAN. VXLAN wraps your packets in UDP packets to tunnel them across nodes. This encapsulation adds overhead. CPU overhead. MTU headaches.
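
A quick way to spot the MTU problem on a node (the interface names below are examples; yours will depend on the CNI and the NIC):

# The overlay interface must be at least 50 bytes smaller than the node NIC
# to leave room for the VXLAN header; otherwise large packets get fragmented
# or silently dropped
ip -o link show | awk '{print $2, $4, $5}' | grep -E 'eth0|ens|flannel|cni|vxlan'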

In 2024, if you are running a serious production workload, you should be looking at Cilium. Cilium uses eBPF (Extended Berkeley Packet Filter) to bypass much of the kernel's iptables, which becomes a massive bottleneck as your service count grows.

Pro Tip: Traditional kube-proxy relies on iptables. With 5,000 Services, the kernel walks a huge, flat list of rules for every new connection: O(n) in the number of rules. Cilium's eBPF service lookup is a hash-map hit, effectively O(1). The math doesn't lie.
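
Don't take my word for it. Count the rules kube-proxy has programmed on one of your own nodes (the KUBE-* chain names are the kube-proxy defaults):

# Rough count of NAT rules generated for Services and endpoints.
# Four- or five-digit numbers mean every new connection walks a long chain.
iptables-save -t nat | grep -c -E 'KUBE-(SERVICES|SVC|SEP)'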

Deploying Cilium for Performance

Don't just install it. Configure it to replace kube-proxy entirely. This is how we set it up on CoolVDS instances to maximize the throughput of our NVMe-backed nodes:

helm install cilium cilium/cilium --version 1.15.0 \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT} \
  --set bpf.masquerade=true \
  --set bandwidthManager.enabled=true

The bandwidthManager.enabled=true flag is critical here. It uses EDT (Earliest Departure Time) rate limiting, which is far more efficient than the old TBF (Token Bucket Filter) approach. If your VPS provider restricts kernel headers or puts you on a shared kernel, this won't work. (Note: CoolVDS KVM instances support this natively because we don't lock you out of your own kernel).
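
Once the agents are up, verify the replacement actually took effect. A check along these lines works on recent Cilium releases, though the exact output format varies between versions:

# Ask any agent pod from the DaemonSet directly
kubectl -n kube-system exec ds/cilium -- cilium status | \
  grep -E 'KubeProxyReplacement|BandwidthManager'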

Ingress Tuning: The Silent Killer

So your CNI is fast. Now let's talk about how traffic gets in. The NGINX Ingress Controller is the standard, but out of the box, it is configured for compatibility, not speed.

A common issue I see in Norway, especially with strict GDPR requirements pushing everything onto TLS, is SSL handshake failures under load. Out of the box, NGINX Ingress recycles client connections after a relatively small number of requests and uses conservative buffer sizes, so clients keep paying the handshake cost over and over (the session-reuse tweak after the list below helps with this too).

Here is a battle-tested ConfigMap for high-traffic scenarios:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  keep-alive: "75"
  keep-alive-requests: "10000"
  upstream-keepalive-connections: "100"
  worker-processes: "auto"
  max-worker-connections: "65535"
  compute-full-forwarded-for: "true"
  use-forwarded-headers: "true"

Why these values?

  • keep-alive-requests: The default is often too low (100). Bumping this to 10,000 reduces the TCP handshake overhead significantly for persistent clients.
  • upstream-keepalive-connections: This keeps the connection between NGINX and your pods open. Without this, NGINX opens a new connection to your backend pod for every single request. That is a waste of ephemeral ports and CPU.
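
And since the original complaint was SSL handshakes, it is worth reusing TLS sessions as well, so returning clients can skip the full handshake. The keys below are standard ingress-nginx ConfigMap options; the sizes are illustrative starting points, not benchmarked values:

# Merge TLS session-reuse settings into the same ConfigMap
kubectl -n ingress-nginx patch configmap ingress-nginx-controller --type merge \
  -p '{"data":{"ssl-session-cache-size":"20m","ssl-session-timeout":"30m"}}'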

Debugging Network Drops: tcpdump inside Containers

When a developer says "the network is flaky," you need proof. You can't just run tcpdump on the host and hope to match the packets to the container IP easily, especially with complex NAT rules.

You need to enter the container's network namespace. Use nsenter. It allows you to run host-level tools inside the context of a container.

  1. Find the Container ID:
    crictl ps | grep my-app
  2. Find the Process ID (PID) of that container:
    crictl inspect --output go-template --template '{{.info.pid}}' <CONTAINER_ID>
  3. Enter the net namespace and capture traffic:
    nsenter -t <PID> -n tcpdump -i eth0 -w /tmp/capture.pcap

Now you have a pure capture file you can pull to your local machine and analyze in Wireshark. Look for TCP Retransmissions. If you see them, check your underlying hypervisor metrics.
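
If you prefer staying in the terminal, a rough retransmission count is one command away once you've pulled the file down (tshark ships with Wireshark):

# A non-zero count that grows across captures means the problem sits below
# the application layer
tshark -r capture.pcap -Y 'tcp.analysis.retransmission' | wc -l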

The Infrastructure Layer: Where CoolVDS Fits In

You can tune Kubernetes all day, but if your underlying VPS has "noisy neighbors" or uses cheap network storage, your latency will spike. In networking, Consistency > Raw Speed.

This is where the choice of hosting provider becomes an architectural decision, not just a billing one. In the Nordic market, routing matters. If your users are in Oslo or Bergen, you want your packets hitting NIX (Norwegian Internet Exchange) directly, not bouncing through a datacenter in Frankfurt or Amsterdam first.
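
You can eyeball the routing yourself from a node. The target below is just an example; pick any endpoint close to your actual users:

# Hops through Frankfurt or Amsterdam on the way to a Norwegian endpoint
# mean round-trip time you will never tune away inside Kubernetes
mtr --report --report-cycles 20 vg.no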

Furthermore, virtualization technology plays a massive role.

Feature          Container VPS (LXC/OpenVZ)                     CoolVDS (KVM)
Kernel Access    Shared (cannot load your own eBPF programs)    Dedicated (full eBPF support)
Network Stack    Virtual bridge (slower)                        Virtio-net (near-metal speed)
Isolation        Process level                                  Hardware level

Check Your Hardware

Run these on your current node. If you see high "steal" time in top, non-zero drop counters in softnet_stat, or slow fsync numbers from fio, it's time to move.

# Check packet handling: the second column of each row in softnet_stat is the
# number of packets dropped because the CPU could not keep up
cat /proc/net/softnet_stat

# Test disk I/O latency (crucial for etcd); the runtime caps the test at 30s
fio --name=fsync-test --filename=testfile --size=256m --bs=4k --iodepth=1 \
    --rw=write --fsync=1 --ioengine=libaio --runtime=30

For Kubernetes, specifically etcd, disk latency is network latency. If etcd cannot write to disk fast enough (fsync latency), it delays the API server response, which causes timeouts in your controllers. CoolVDS utilizes local NVMe storage which typically yields fsync latencies under 0.5ms, ensuring etcd remains stable even during cluster storms.
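
You can watch the same thing from inside the cluster via etcd's own metrics. The certificate paths below assume a kubeadm-style control plane; adjust them to your layout:

# p99 of the WAL fsync histogram should stay well under 10ms
curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
     --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
     --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
     https://127.0.0.1:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds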

Final Thoughts

Kubernetes networking is deterministic. If it fails, there is a reason. Usually, it's a conflict between default configurations and high-load reality, or a limitation of the virtual infrastructure underneath.

Stop accepting random timeouts as "normal behavior." Switch to an eBPF-based CNI, tune your ingress keep-alives, and ensure your hosting provider gives you the dedicated resources and kernel access required to run a modern stack.

Ready to stop fighting your infrastructure? Deploy a KVM-based instance on CoolVDS today and see what sub-millisecond latency does for your cluster stability.