Kubernetes Networking Deep Dive: Stop Trusting Defaults and Fix Your Latency
I still remember the 3 AM page I got last winter. A client's fintech cluster in Oslo had ground to a halt. The pods were running, the nodes were healthy, but traffic was disappearing into the void. The culprit? A default iptables-based CNI implementation choking on 5,000 services. The latency wasn't coming from the application logic; it was coming from the kernel trying to traverse a massive ruleset for every single packet.
Kubernetes networking is effectively a distributed system problem wrapped in a Linux kernel configuration puzzle. If you are running K8s in production in 2024 and still using the default network settings provided by kubeadm or your cloud provider, you are leaving performance on the table. Worse, you are introducing jitter that no amount of code optimization can fix.
This isn't a "Hello World" tutorial. We are going to rip open the overlay network, look at eBPF, and discuss why your choice of infrastructure—specifically the VPS or VDS underneath—makes or breaks your networking throughput.
The Lie of the "Flat Network"
Kubernetes promises a flat network where every pod can talk to every other pod. It's a beautiful abstraction. Under the hood, however, it is a chaotic mess of encapsulation (VXLAN/Geneve), NAT, and conntrack table lookups. When you run this on a standard VPS in Norway that overcommits CPU, you run into what I call the "Steal Time Latency Trap."
Packet processing requires CPU cycles. If your neighbor on a shared host decides to mine crypto, your CPU steal time goes up. Your kernel delays processing the interrupt for that incoming packet. Suddenly, your microservice latency spikes from 2ms to 200ms.
Pro Tip: Always check your underlying node's steal time. If %st in top is above 0.0 on an idle cluster, migrate immediately. This is why at CoolVDS we strictly isolate CPU cores via KVM. We don't believe in "burstable" performance for networking nodes; you need consistent packet processing power.
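Checking takes seconds. A minimal sketch, assuming the sysstat package is installed for mpstat (vmstat and top work out of the box on most distros):
# Sample CPU usage every 2 seconds, 5 times; watch the %steal column
mpstat 2 5
# Alternative without sysstat: the last column ("st") in vmstat
vmstat 2 5
# One-shot snapshot: the "st" value in the %Cpu(s) line
top -bn1 | grep 'Cpu(s)'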
CNI Wars 2024: Just Use eBPF
In the early days (2018-2020), we debated Flannel vs. Calico. In late 2024, the debate is largely settled for high-performance clusters: Cilium (eBPF) is the gold standard.
Legacy iptables-based networking (like standard kube-proxy) essentially manages a long list of firewall rules. As your cluster grows, this list becomes a sequential lookup bottleneck. eBPF (Extended Berkeley Packet Filter) allows us to run sandboxed programs in the kernel to route packets without the overhead of iptables.
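You can see the scale of the problem on any node of an existing kube-proxy cluster. A rough sketch (the chain names are the standard KUBE-* prefixes kube-proxy programs):
# Count the NAT rules kube-proxy has installed on this node.
# A few thousand services easily turns into tens of thousands of rules,
# many of which are evaluated sequentially per connection.
sudo iptables-save -t nat | grep -c '^-A KUBE-'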
Deploying Cilium with Strict Performance Tuning
Don't just run a bare helm install with the default values. For maximum efficiency, you need to replace kube-proxy entirely.
helm install cilium cilium/cilium --version 1.16.1 \
--namespace kube-system \
--set kubeProxyReplacement=true \
--set bpf.masquerade=true \
--set routingMode=native \
--set autoDirectNodeRoutes=true \
--set ipv4.enabled=true \
--set loadBalancer.mode=dsr
Breakdown of these flags:
- kubeProxyReplacement=true: Removes the iptables bottleneck entirely.
- routingMode=native: If your underlying network supports it (like CoolVDS private networking), this avoids encapsulation (VXLAN) entirely, improving throughput by removing the packet header overhead.
- loadBalancer.mode=dsr: Direct Server Return. The response packet goes directly to the client rather than traversing back through the load balancer node. Massive bandwidth saver.
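Two follow-ups. First, in native routing mode you may also need to set ipv4NativeRoutingCIDR to your node network's CIDR; the agent logs will tell you if your setup requires it. Second, verify that the eBPF datapath really took over before you celebrate. A quick check, assuming the cilium CLI (cilium-cli) is installed on your workstation:
# Wait for the agent, operator and Hubble components to report healthy
cilium status --wait

# Run the built-in end-to-end checks (pod-to-pod, pod-to-service,
# network policy enforcement)
cilium connectivity test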
Kernel Tuning for High-Throughput Nodes
Regardless of your CNI, the Linux kernel defaults are tuned for 1990s desktop usage, not 2024 container orchestration. You must tune the kernel's network sysctls on your worker nodes; a drop-in file under /etc/sysctl.d/ keeps the settings persistent across reboots. We include these tunings in our CoolVDS "High-Performance K8s" templates by default.
# /etc/sysctl.d/k8s-net.conf
# Increase the range of ephemeral ports
net.ipv4.ip_local_port_range = 1024 65535
# Maximize the backlog for high connection rates (crucial for ingress)
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 5000
# Allow reuse of TIME_WAIT sockets for new outgoing connections (use with caution, but often necessary)
net.ipv4.tcp_tw_reuse = 1
# BBR Congestion Control - Essential for WAN latency optimization
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
Verify the congestion control is active. BBR is significantly better at handling packet loss on trans-Atlantic links or unstable mobile connections.
sysctl net.ipv4.tcp_congestion_control
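That should print net.ipv4.tcp_congestion_control = bbr. If it doesn't, apply the drop-in file and confirm your kernel (4.9+) actually ships the BBR module; a minimal sketch:
# Load everything under /etc/sysctl.d/ without a reboot
sudo sysctl --system

# bbr must appear in this list; if it doesn't, load the module first
sysctl net.ipv4.tcp_available_congestion_control
sudo modprobe tcp_bbr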
Network Policies: The GDPR Firewalls
Operating in Norway or the EU implies strict adherence to data minimization. You cannot have a "flat network" where the frontend can talk to the database directly. That is a security violation waiting to happen.
If you are using Cilium, use CiliumNetworkPolicy for Layer 7 filtering. Standard K8s NetworkPolicy is Layer 3/4 only.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: secure-backend-ingress
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend-api
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend-proxy
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "POST"
          path: "/v1/transaction"
This policy ensures that only the frontend proxy can POST to the transaction endpoint. If an attacker compromises a logging sidecar, they can't pivot to the backend API, let alone the database behind it.
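To prove the policy does what you think, apply it and replay an allowed and a disallowed request from inside the cluster. The deployment and service names below are hypothetical and match the labels in the policy above; with Cilium's L7 proxy in the path, a blocked request is answered with an HTTP 403 instead of a silently dropped connection.
kubectl apply -f secure-backend-ingress.yaml

# Allowed: POST to /v1/transaction from a pod labelled app=frontend-proxy
# (assumes curl is available in the frontend image)
kubectl -n production exec deploy/frontend-proxy -- \
  curl -s -o /dev/null -w '%{http_code}\n' -X POST http://backend-api:8080/v1/transaction

# Denied by the L7 rule: same pod, different path -> expect 403
kubectl -n production exec deploy/frontend-proxy -- \
  curl -s -o /dev/null -w '%{http_code}\n' -X POST http://backend-api:8080/v1/admin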
Debugging: When Pings Fail
Standard tools like ping often don't exist in distroless containers. Learn to use ephemeral debug containers.
kubectl debug -it pod/backend-api-xyz --image=nicolaka/netshoot --target=backend-api
Once inside, use tcpdump to verify whether packets are actually hitting the interface. If you see SYN packets but no SYN-ACK, check your MTU settings. A common issue on virtualized infrastructure is an MTU mismatch (1500 vs. 1450 due to VXLAN).
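From inside the netshoot container, something like this shows both sides of the handshake and whether full-size packets make it through (the interface, port, and target IP are placeholders for your environment):
# Watch the TCP handshake on the pod interface; -n skips DNS lookups
tcpdump -ni eth0 'tcp port 8080 and (tcp[tcpflags] & (tcp-syn|tcp-ack) != 0)'

# Probe the path MTU: 1472 bytes of ICMP payload + 28 bytes of headers = 1500.
# If this fails but a smaller size succeeds, you have an MTU mismatch.
ping -M do -s 1472 10.0.1.5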
| Scenario | Recommended MTU | Reason |
|---|---|---|
| Native / Direct Routing | 1500 (or 9000 Jumbo) | No overhead. Best performance. |
| VXLAN Overlay | 1450 | 50 bytes reserved for VXLAN headers. |
| WireGuard Encryption | 1420 | Standard WireGuard overhead deduction. |
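To compare what the node NIC offers with what the CNI actually programmed on the pod side, run a quick check on the worker node (interface names assume Cilium's defaults, cilium_host and the lxc* veth pairs):
# MTU on the node's primary interface
ip link show eth0 | grep -o 'mtu [0-9]*'

# MTU on Cilium-managed devices and pod veth pairs
ip -o link | awk '/cilium_host|lxc/ {print $2, $5}'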
The Infrastructure Factor: Why CoolVDS?
We need to talk about where these packets actually live. In a virtualized environment, every network call hits the hypervisor. If your provider uses cheap, shared networking gear or standard HDD storage for logs, your I/O wait times will bleed into your network latency.
At CoolVDS, we optimized our stack for exactly this use case:
- NVMe Storage: High-speed logging (Fluentd/Prometheus) doesn't block kernel I/O, ensuring network interrupts are processed instantly.
- 10Gbps Uplinks to NIX: We peer directly at the Norwegian Internet Exchange. If your customers are in Oslo, their traffic shouldn't route through Frankfurt.
- KVM Isolation: No noisy neighbors stealing CPU cycles during packet encapsulation.
While massive clouds offer "infinite" scale, they often throttle IOPS and network burst limits on smaller instances. We provide raw, unthrottled performance because we know that in a microservices architecture, the network is the computer.
Conclusion
Kubernetes networking in 2024 is about shedding the legacy weight of iptables and optimizing for the data plane. By switching to eBPF with Cilium, tuning your kernel BBR settings, and hosting on infrastructure that respects data locality and CPU isolation, you turn your network from a bottleneck into a competitive advantage.
Don't let latency kill your conversion rates. Spin up a CoolVDS NVMe instance today, apply these sysctl tunings, and watch your p99 latency drop.