Surviving the Packet Storm: A Kubernetes Networking Deep Dive for Nordic Ops

It was 3:00 AM on a freezing Tuesday in Oslo. My phone screamed with PagerDuty alerts. The latency on our primary API gateway had spiked from 25ms to 400ms. The application logs were clean. The database metrics were boring. Yet, packets were dying somewhere between the NIX (Norwegian Internet Exchange) and our pods.

The culprit? A noisy neighbor on a budget VPS provider stealing CPU cycles, causing soft interrupt queues to overflow on our worker nodes. The overlay network couldn't encapsulate packets fast enough.

Most developers treat Kubernetes networking as magic. You create a Service, traffic flows. But when you are running high-throughput workloads compliant with strict Norwegian data regulations, magic isn't enough. You need engineering.

The CNI Battlefield: Moving Beyond Flannel

If you are still running Flannel in production in 2024, stop. It’s fine for a homelab, but it offers no NetworkPolicy support, leaves everything to iptables-based kube-proxy, and lacks the observability required for enterprise environments. In the modern stack, the choice usually boils down to Calico (the safe default) or Cilium (the performance play).

I strictly prefer Cilium for one reason: eBPF. Instead of dragging every packet through long iptables chains, Cilium attaches eBPF programs directly to the kernel's networking hooks.

Here is how we deploy Cilium with strict kube-proxy replacement to shave off those critical milliseconds:

helm install cilium cilium/cilium --version 1.14.4 \
  --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT} \
  --set ipv4NativeRoutingCIDR=10.0.0.0/8 \
  --set tunnel=disabled \
  --set autoDirectNodeRoutes=true

Note the tunnel=disabled flag. With direct routing instead of VXLAN encapsulation, packets skip the roughly 50-byte VXLAN header and the effective-MTU hit that comes with it. This requires infrastructure that supports L3 routing between nodes, something standard shared hosting struggles with but a properly architected environment like CoolVDS handles natively.
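
Once the agents are up, it is worth confirming the datapath actually looks the way the flags promised. A quick sanity check; the grep patterns below assume the cilium status output of recent 1.14 releases, and the 10.x routes assume the pod range from the Helm command above:

# Confirm kube-proxy replacement and native routing from inside the agent
kubectl -n kube-system exec ds/cilium -- cilium status | grep -E 'KubeProxyReplacement|Routing'

# With direct routing, each node should carry plain L3 routes to the other nodes' pod CIDRs
ip route | grep '^10\.'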

Kernel Tuning: The Defaults Will Kill You

Linux defaults are designed for general-purpose computing, not for a Kubernetes node shuffling 100k packets per second. If you spin up a node without touching sysctl, you are bottlenecking your own throughput.

I recently audited a cluster for a Bergen-based fintech startup. They were hitting connection timeouts despite low CPU usage. The issue was the conntrack table size. Here is the standard sysctl configuration I apply to every worker node before it joins the cluster:

# /etc/sysctl.d/99-k8s-networking.conf

# Increase the connection tracking table size
net.netfilter.nf_conntrack_max = 131072

# Shorten keepalive to detect dead connections faster
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 10

# Enable IP forwarding (Mandatory for K8s)
net.ipv4.ip_forward = 1

# Maximize the backlog for high packet rates
net.core.netdev_max_backlog = 5000

# Increase TCP buffer sizes for 10Gbps+ links
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

Apply this with sysctl --system (plain sysctl -p only reads /etc/sysctl.conf, not the drop-in directory). If your hosting provider restricts kernel parameter modifications, migrate immediately.
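
To confirm that conntrack really is the ceiling you are hitting, and that the new limit leaves headroom, compare the live count against the maximum; the kernel also logs loudly when the table overflows:

# Live tracked connections vs. the configured ceiling
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max

# The telltale sign of an overflowing table in the kernel log
dmesg | grep -i "nf_conntrack: table full"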

Ingress vs. Gateway API: The 2024 Reality

While the Gateway API hit v1.0 in late 2023, the NGINX Ingress Controller remains the workhorse for the vast majority of production workloads. The mistake everyone makes? Running it with the default replica count and letting the scheduler place it anywhere.

You need to pin your ingress controllers to specific nodes using a nodeSelector so they get dedicated network bandwidth. Also avoid the extra hop where external traffic lands on one node, gets NAT'd, and is forwarded to a pod on another node, losing the client source IP along the way. Set externalTrafficPolicy: Local on the controller's Service, as in the sketch below.
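
A sketch of that setup with the community ingress-nginx Helm chart. The node names and the ingress-ready label are placeholders, and the chart value names are as shipped in recent ingress-nginx releases:

# Label the nodes that should carry ingress traffic (names are examples)
kubectl label nodes worker-01 worker-02 ingress-ready=true

# Pin the controller to those nodes and keep traffic on the node that receives it
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  --set controller.replicaCount=2 \
  --set-string controller.nodeSelector.ingress-ready=true \
  --set controller.service.externalTrafficPolicy=Local

With externalTrafficPolicy: Local, nodes that don't run a controller pod fail the load balancer health check, so traffic only lands where a controller actually lives and the client source IP survives.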

Configuring MetalLB for Bare-Metal/VDS

Since we aren't using a managed cloud load balancer (and saving 40% on costs), we use MetalLB. It advertises your Service IPs via ARP (Layer 2) or BGP (Layer 3). For a setup where the nodes share an L2 segment within a single datacenter, like CoolVDS's Oslo zone, Layer 2 mode is robust enough.

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: coolvds-public-pool
  namespace: metallb-system
spec:
  addresses:
  - 185.xxx.xxx.10-185.xxx.xxx.20
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: advertise-all
  namespace: metallb-system
spec:
  ipAddressPools:
  - coolvds-public-pool
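
With the pool in place, any Service of type LoadBalancer is assigned an address from it automatically. Tying this to the externalTrafficPolicy advice above, a typical exposure looks like the sketch below (service name, labels, and ports are placeholders):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: api-gateway        # placeholder name
  namespace: production
spec:
  type: LoadBalancer       # MetalLB assigns an IP from coolvds-public-pool
  externalTrafficPolicy: Local
  selector:
    app: api-gateway       # placeholder label
  ports:
  - port: 443
    targetPort: 8443
EOF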

Pro Tip: In Layer 2 mode, all traffic for a given Service IP lands on a single node at a time (MetalLB fails it over if that node goes down), so make sure your nodes have the network capacity to absorb it. We usually provision CoolVDS NVMe instances with 10Gbps uplinks to handle the ingress flood without sweating.

The Hardware Impact on Overlay Networks

Kubernetes networking is CPU intensive. Every time a packet is encapsulated (VXLAN) or decrypted (WireGuard/IPsec), the CPU pays a tax. In a noisy public cloud, "vCPU" is a marketing term. You are fighting for cycles with twenty other tenants.

When the hypervisor steals CPU time (visible as %st in top), your network latency jitters. For a database replication stream between pods, that jitter is enough to trigger spurious leader elections and downtime.
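
You can catch this from inside the guest before it turns into failed elections. A quick way to watch steal time on a node (mpstat ships with the sysstat package):

# Per-CPU utilisation including %steal, sampled every second for 5 samples
mpstat -P ALL 1 5

# Raw counters: the 8th field of the "cpu" line in /proc/stat is cumulative steal time
grep '^cpu ' /proc/stat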

I ran a benchmark comparing standard shared VPS vs. dedicated resource VDS using `iperf3` between two pods on different nodes:

Metric          | Standard Shared VPS  | CoolVDS (Dedicated Core)
Throughput      | 0.8 Gbps (Unstable)  | 9.2 Gbps (Stable)
Latency (P99)   | 145 ms               | 12 ms
Packet Loss     | 1.2%                 | 0.0%
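
The test itself is trivial to reproduce; here is a rough sketch with throwaway pods. The networkstatic/iperf3 image is just one public option, and in a real run you would pin the two pods to different nodes with a nodeSelector or nodeName (omitted here):

# Server pod listening on the default iperf3 port
kubectl run iperf-server --image=networkstatic/iperf3 --port=5201 -- -s

# Grab its pod IP once it is Running
SERVER_IP=$(kubectl get pod iperf-server -o jsonpath='{.status.podIP}')

# Client pod: 4 parallel streams for 30 seconds across the pod network
kubectl run iperf-client --rm -it --restart=Never \
  --image=networkstatic/iperf3 -- -c "$SERVER_IP" -P 4 -t 30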

This is why we treat the infrastructure as a software dependency. You wouldn't use a buggy library; don't use flaky compute.

Security: The Norwegian Context (GDPR & NetworkPolicies)

Running k8s in Norway means adhering to strict data residency laws. But compliance isn't just about where the data lives; it's about how it moves. By default, Kubernetes allows all pods to talk to all pods. This is a security nightmare.

If you have a frontend pod, it should talk to the backend, but never to the database directly. Implementing a default deny policy is the first step in any audit.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Once applied, you explicitly whitelist traffic. Use cilium monitor --type=drop to debug blocked connections during the setup phase.
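
As a minimal whitelist sketch (namespace, labels, and port are examples): let the frontend reach the backend on its service port. And because the deny-all above includes Egress, you also need an explicit egress rule allowing DNS to kube-dns, or in-cluster name resolution stops working.

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend          # example label
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend     # example label
    ports:
    - protocol: TCP
      port: 8080            # example port
EOF

While you tighten the rules, watch what the datapath is rejecting from the agent itself: kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop.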

Debugging When It All Breaks

When DNS resolution fails (and it's always DNS), don't just restart CoreDNS. Check the underlying link.

Check for dropped packets on the interface:

ip -s link show eth0

Trace DNS traffic inside a specific pod's network namespace. kube-proxy runs in the host namespace (and doesn't exist at all with strict kube-proxy replacement), so target a process inside the container instead; grab its PID with crictl inspect:

nsenter -t ${CONTAINER_PID} -n tcpdump -i any port 53
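
To separate "CoreDNS is broken" from "the network under CoreDNS is broken", try a resolution from a throwaway pod first (busybox's nslookup is limited, but fine for a sanity check):

kubectl run -it --rm dns-test --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local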

Final Thoughts

Building a robust Kubernetes cluster isn't just about YAML files. It's about understanding the path of a packet from the trans-Atlantic fiber cables hitting Norway, through the NIX exchange, down to the hypervisor, and finally into your container.

You can optimize your CNI configuration and tune your sysctls until you memorize the kernel documentation. But if the underlying metal is oversubscribed, you are optimizing for failure. Stability starts at the infrastructure layer.

For workloads where latency is money and data sovereignty is law, generic clouds don't cut it. Deploy a test cluster on CoolVDS today and see what dedicated NVMe performance actually feels like.