
Kubernetes Networking on Bare Metal: Stop Packet Drops Before They Kill Your Pods (2019 Edition)

The Packet Never Lies: A Deep Dive into K8s Networking

Let’s be honest. Kubernetes networking is magic until it breaks. Then it’s a nightmare of iptables rules, virtual interfaces, and cryptic CNI errors. If you are deploying Kubernetes 1.14 or the fresh 1.15 on bare metal or VPS instances—rather than succumbing to the "click-and-pray" managed services of the big US cloud giants—you have likely hit the wall. The pod can ping the node, but the node can't ping the pod. Or worse, random TCP resets occur only under high load.

I spent last week debugging a cluster for a fintech client in Oslo. They were seeing intermittent 502 Bad Gateways. The culprit wasn't their Java app; it was a mismatch in VXLAN packet sizes. In this post, we are cutting through the marketing fluff. We are looking at the raw plumbing of the Container Network Interface (CNI), `kube-proxy` modes, and how to handle Ingress when you don't have an Elastic Load Balancer to hide behind.

1. The CNI Battlefield: Calico vs. Flannel

In 2019, you generally have two main choices for self-hosted clusters: Flannel or Calico. Flannel is the "it just works" option, creating a simple VXLAN overlay. But "simple" stops being enough when you need segmentation for GDPR compliance: Flannel does not enforce NetworkPolicy at all.

If you are serious about security, you are running Calico. Why? BGP. Calico allows your nodes to exchange routing information directly, often avoiding the encapsulation overhead of VXLAN if your underlying network supports it. However, on most VPS providers, Layer 2 adjacency is blocked for security. You are forced into IPIP (IP-in-IP) mode.
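
Whether Calico encapsulates at all is controlled by an environment variable on the calico-node DaemonSet. The names below match the stock Calico v3.x manifests; double-check them against the version you actually deploy.

# Excerpt from the calico-node DaemonSet container env.
# "Always" wraps all pod traffic in IPIP; "CrossSubnet" only encapsulates
# when crossing subnets; "Never" runs pure BGP with no overlay.
- name: CALICO_IPV4POOL_IPIP
  value: "Always"

On a VPS provider that blocks direct routing between instances, "Always" is the pragmatic choice. If you ever do get real L2 adjacency, "CrossSubnet" cuts the encapsulation overhead for traffic that stays inside a subnet.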

Here is the critical mistake most make: ignoring the MTU (Maximum Transmission Unit).

The MTU Trap

Your VPS interface (let's say `eth0`) likely has an MTU of 1500 bytes. Wrapping packets in IPIP or VXLAN adds overhead: roughly 20 bytes for IPIP and 50 for VXLAN. If your inner pod interface is also set to 1500, the outer packet exceeds 1500 bytes, and the physical network drops it. Packet loss.
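
You can verify the path MTU yourself before blaming the CNI. With the Don't Fragment bit set, oversized pings fail loudly instead of silently fragmenting; 1472 bytes of ICMP payload plus 28 bytes of headers is exactly 1500. The target below is a placeholder for another node in your cluster.

# Does a full 1500-byte frame make it from this node to a peer node?
ping -c 3 -M do -s 1472 <peer-node-ip>
# If that fails but a smaller payload succeeds, something in the path
# has a lower MTU than your interfaces claim.
ping -c 3 -M do -s 1400 <peer-node-ip>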

The Fix: Explicitly define the MTU in your Calico config map to be lower than the physical interface.

kubectl edit configmap -n kube-system calico-config

Look for `veth_mtu` and set it to 1440 to be safe. That alone can save you weeks of chasing "ghost" latency.
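
For reference, here is roughly what the relevant part of that ConfigMap looks like once edited (field name as in the stock Calico v3.x manifest; verify against your own install):

apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-config
  namespace: kube-system
data:
  veth_mtu: "1440"   # 1500 on the physical NIC minus IPIP/VXLAN overhead, with margin

Only newly created pod interfaces pick up the change, so restart the calico-node pods and recreate your workloads to apply it everywhere.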

Pro Tip: On CoolVDS NVMe instances, our network stack is optimized for high throughput. We support standard 1500 MTU with extremely low jitter to the NIX (Norwegian Internet Exchange). When you run overlay networks here, the CPU overhead of encapsulation is negligible thanks to the raw power of the KVM virtualization we use—unlike older Xen setups.

2. `kube-proxy`: IPTables is Bottlenecking You

By default, Kubernetes uses `iptables` mode for `kube-proxy`. This works fine for 50 services. But if you are running a microservices architecture with 500+ services, the kernel has to walk a massive chain of rules sequentially for every new connection.

In K8s 1.11, IPVS (IP Virtual Server) mode went GA. In 2019, if you aren't using IPVS, you are wrong. IPVS uses hash tables instead of linear rule lists: O(1) lookups versus O(n). That means lower latency and higher throughput as your service count grows.

To enable IPVS, you need to ensure kernel modules are loaded on your nodes before Kubelet starts:

# Load IPVS modules on the host
modprobe -- ip_vs
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- ip_vs_sh
modprobe -- nf_conntrack_ipv4   # on kernels 4.19+, the module is just nf_conntrack
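
A bare `modprobe` doesn't survive a reboot. On a systemd-based distro, one way to persist it on every node is a `modules-load.d` drop-in (a sketch; the file name is arbitrary):

# /etc/modules-load.d/ipvs.conf -- read by systemd-modules-load at boot
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack_ipv4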

Then, update your `kube-proxy` config map mode to `ipvs`. The difference in response time, especially when routing internal traffic between database shards, is night and day.
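
On a kubeadm-built cluster the setting lives in the `kube-proxy` ConfigMap in `kube-system`, under the `config.conf` key. A minimal sketch of the relevant part, assuming kubeadm defaults everywhere else:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; matches the ip_vs_rr module loaded above

After editing, restart the kube-proxy pods (`kubectl -n kube-system delete pod -l k8s-app=kube-proxy`) and confirm with `ipvsadm -Ln` that virtual servers are actually being programmed.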

3. Ingress: The "Real IP" Problem

Hosting in Norway means strict adherence to data privacy laws. Datatilsynet (the Norwegian Data Protection Authority) doesn't look kindly on logs that show the load balancer's IP instead of the customer's IP. When you run the NGINX Ingress Controller on a VPS without an external cloud load balancer, you usually expose it via a `NodePort` or `hostNetwork`.

If you use a standard Service of type `NodePort`, the packet hits Node A, gets SNAT'd (Source Network Address Translation), and is forwarded to the pod on Node B. The pod sees Node A's IP, not the client's. This breaks your geo-fencing and your logs.

The Solution: `externalTrafficPolicy: Local`.

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  type: NodePort
  externalTrafficPolicy: Local
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  selector:
    app.kubernetes.io/name: ingress-nginx

This setting forces the traffic to stay on the node that received it. If the node doesn't have an ingress pod, the packet is dropped. This means you must run an Ingress DaemonSet (one pod per node) to ensure high availability. It effectively turns your CoolVDS cluster into a distributed edge router.
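
A stripped-down sketch of that DaemonSet is below. The image tag is illustrative, and the RBAC objects and controller flags should come from the official ingress-nginx manifests for your version.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ingress-nginx   # must match the Service selector above
    spec:
      containers:
      - name: controller
        image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1   # illustrative tag
        ports:
        - name: http
          containerPort: 80

The two things that matter here are the pod label matching the Service selector and one controller pod per node: with `externalTrafficPolicy: Local`, any node can then answer on its own NodePort and still hand the pod the real client IP.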

4. Debugging Like a 2019 Sysadmin

You can't just `kubectl exec` into a container and expect `tcpdump` to be there. Distroless images are trendy, and they have no shell. So how do you inspect traffic?

We use `nsenter` to piggyback on the container's network namespace from the host node.

  1. Find the container ID: `docker ps | grep my-pod`
  2. Get the PID: `docker inspect --format '{{ .State.Pid }}' <container-id>`
  3. Enter the namespace: `nsenter -t <PID> -n tcpdump -i eth0 -nn port 80`

This command allows you to see exactly what the pod sees, using the host's tools. It’s a lifesaver when debugging communication between your PHP frontend and your MySQL backend.
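
If you do this often, a tiny helper on the node saves the copy-pasting. This is a sketch that assumes the Docker runtime and that `tcpdump` exists on the host; the script name is made up.

#!/bin/sh
# pod-tcpdump.sh <container-name-fragment> [tcpdump args...]
# Finds the first matching container, resolves its PID, and runs tcpdump
# inside that container's network namespace using the host's binary.
FRAGMENT="$1"; shift
CID=$(docker ps --filter "name=${FRAGMENT}" --format '{{.ID}}' | head -n 1)
PID=$(docker inspect --format '{{.State.Pid}}' "${CID}")
exec nsenter -t "${PID}" -n tcpdump "$@"

Usage: `./pod-tcpdump.sh my-pod -i eth0 -nn port 80`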

Why Infrastructure Matters

You can tune sysctls until you are blue in the face, but if your underlying hypervisor has "noisy neighbors" stealing CPU cycles, your network latency will spike. Kubernetes networking is CPU intensive: packet encapsulation, iptables rule evaluation, and encryption all take a toll.

This is where CoolVDS differs. We don't oversell our cores. When you spin up a VPS in our Oslo datacenter, you get the dedicated slice of compute you paid for. For Kubernetes, this stability is not a luxury; it is a requirement.

Summary Checklist for your K8s Cluster

  • CNI: Use Calico, but check your MTU (set to 1440 for safety).
  • Proxy: Switch `kube-proxy` to IPVS mode immediately.
  • Ingress: Use `externalTrafficPolicy: Local` to preserve client IPs for GDPR compliance.
  • Storage: Don't forget that etcd needs fast disk I/O. Our NVMe storage guarantees low write latency, keeping your cluster state consistent.
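
That last storage point is easy to verify. etcd's documentation suggests benchmarking the fsync latency of the disk behind `/var/lib/etcd` with `fio`, with the 99th percentile fdatasync latency ideally staying under roughly 10 ms. A sketch; the test directory is an assumption, so point it at the disk that will actually hold etcd's data:

# Write small sequential blocks and fsync each one, the way etcd's WAL does.
fio --name=etcd-fsync-check --directory=/var/lib/etcd-test --size=22m \
    --bs=2300 --rw=write --ioengine=sync --fdatasync=1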

Stop fighting the network. Build on a foundation that respects the packet.

Ready to build a cluster that actually stays up? Deploy a high-performance NVMe VPS on CoolVDS today and experience single-digit latency to the Nordics.