Kubernetes Networking Deep Dive: Escaping the Iptables Hell (September 2016 Edition)

It is 2016. If you are reading this, you are likely tired of managing Docker links manually or writing hacked-together shell scripts to update Nginx upstreams. You have moved to Kubernetes. Good choice. But now you are facing a new beast: the network model.

Most developers treat Kubernetes networking as magic. They create a Service, traffic flows, and they go home. But when you are running a high-load cluster targeting Norwegian users, relying on "magic" is negligence. I have spent the last week debugging a cluster where packet latency between pods spiked to 200ms inside the same datacenter. The culprit? Poorly configured overlay networks and a choked kube-proxy.

We are going to tear apart the Kubernetes 1.3 networking stack, look at CNI plugins, and explain why your choice of hosting infrastructure—specifically the underlying virtualization—dictates whether your cluster flies or crawls.

The "Flat Network" Mandate

Kubernetes imposes a strict requirement: Every Pod must be able to communicate with every other Pod without Network Address Translation (NAT).

This sounds simple. It is not. In a traditional VPS environment, you get one public IP. How do you route traffic to 50 containers running on that single node, let alone across a cluster of 10 nodes? In the old Docker days, we mapped ports: 8080 on the host forwarded to 80 in the container. Kubernetes hates this approach. It creates port conflicts and makes service discovery a nightmare.

Instead, we assign an IP address to the Pod, not the container. This requires a Container Network Interface (CNI) plugin to manage a virtual bridge on the host.
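
To make this concrete, here is a minimal configuration for the reference CNI bridge plugin, conventionally dropped into /etc/cni/net.d/ where the kubelet (started in CNI mode) picks it up. The network name, bridge name, and subnet below are placeholders; use your cluster's actual Pod CIDR:

```json
{
  "name": "k8s-pod-network",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": false,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.1.0/24",
    "routes": [
      { "dst": "0.0.0.0/0" }
    ]
  }
}
```

The host-local IPAM plugin hands each Pod an address from this node's slice of the cluster CIDR, which is exactly what the flat-network mandate requires.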

The CNI War: Flannel vs. Calico

Right now, you have two main choices for your cluster networking: Flannel or Calico.

Flannel is the easy button. It creates an overlay network (usually VXLAN). It wraps each packet in a UDP/VXLAN envelope, ships it to the destination host, and unwraps it there. Because the underlay only ever sees UDP traffic, it works on practically any network.

Pro Tip: VXLAN adds overhead. Every packet requires CPU cycles for encapsulation. On shared hosting with "noisy neighbors" stealing CPU time, your network throughput will tank. This is why we insist on KVM-based virtualization at CoolVDS—you need guaranteed CPU cycles for packet processing.

If you are using Flannel, your configuration in /etc/systemd/system/docker.service.d/flannel.conf might look like this to ensure Docker picks up the right subnet:

[Service]
EnvironmentFile=-/run/flannel/subnet.env
ExecStart=
ExecStart=/usr/bin/docker daemon \
  --bip=${FLANNEL_SUBNET} \
  --mtu=${FLANNEL_MTU} \
  --ip-masq=false \
  --iptables=false
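
Those ${FLANNEL_SUBNET} and ${FLANNEL_MTU} variables come from the environment file flanneld writes at startup (/run/flannel/subnet.env). On a typical node it looks something like this; values differ per host, and note the MTU is reduced from 1500 to leave room for the VXLAN header:

```
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.1.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false
```

If your Pods mysteriously drop large packets, an MTU mismatch between Docker and Flannel is the first thing to check.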

Calico, on the other hand, is for the grown-ups. It uses BGP (Border Gateway Protocol)—the same protocol that powers the internet—to route packets between nodes. No encapsulation. Just pure routing. It is faster, but it requires your underlying network to allow BGP traffic. Many budget VPS providers block this. We don't.
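
The difference shows up directly in the node's routing table. With Flannel, routes to remote Pod subnets point at the flannel.1 VXLAN device; with Calico, they are plain next-hop routes learned over BGP. The addresses below are purely illustrative:

```
# Flannel (VXLAN overlay)
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink

# Calico (native routing, learned via BGP/BIRD)
10.244.2.0/24 via 192.168.0.12 dev eth0 proto bird
```

No encapsulation device in the Calico case means no per-packet CPU tax.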

Kube-Proxy: The Silent Bottleneck

Until recently, Kubernetes used "userspace" mode for proxying traffic. It was terrible. A packet arrives in the kernel, gets copied up to the userspace kube-proxy process, then copied back down into the kernel to reach its destination. Two context switches for every packet. It was slow and fragile.

In Kubernetes 1.2 and now 1.3, the iptables mode is stable. You must use this. It handles everything in the kernel using netfilter rules. It scales significantly better.

Check your kube-proxy startup flags immediately:

/usr/local/bin/kube-proxy \
  --master=http://10.0.0.1:8080 \
  --proxy-mode=iptables \
  --kubeconfig=/var/lib/kube-proxy/kubeconfig \
  --v=2

When running in this mode, if you inspect your node's iptables, you will see a massive chain of rules. This is how K8s performs load balancing without an external load balancer. It uses the statistic module in iptables to randomly distribute traffic.

Here is what the NAT table looks like for a Service with two endpoints (50% probability split):

-A KUBE-SVC-SOMESERVICE -m comment --comment "default/my-service:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-ENDPOINT1
-A KUBE-SVC-SOMESERVICE -m comment --comment "default/my-service:" -j KUBE-SEP-ENDPOINT2

If you have thousands of services, iptables can get slow to update. But for 99% of deployments in 2016, this is vastly superior to the userspace proxy.
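
The math behind that rule cascade is worth internalizing: for n endpoints, kube-proxy emits n-1 probability rules plus one unconditional fallback, and rule i matches 1/(n-i+1) of whatever traffic reaches it, so every endpoint ends up with exactly 1/n of the total. A quick awk sketch (n=3 assumed) confirms the even split:

```shell
# Each rule i matches 1/(n-i+1) of the traffic that falls through
# to it; multiplying out the "misses" shows every endpoint gets 1/n.
awk 'BEGIN {
  n = 3; remaining = 1.0
  for (i = 1; i < n; i++) {
    p = 1 / (n - i + 1)                        # per-rule probability
    printf "endpoint %d share: %.4f\n", i, remaining * p
    remaining *= (1 - p)                       # traffic that fell through
  }
  printf "endpoint %d share: %.4f\n", n, remaining   # fallback rule
}'
```

This is also why the 0.5 probability appears only on the first of two rules: the second rule is the catch-all for the remaining half.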

Ingress: The Beta Feature You Need

Exposing services via NodePort is clumsy. You end up with ports like 32045 open on every node. The cleaner solution in Kubernetes 1.3 is the Ingress resource (currently extensions/v1beta1).

You deploy an Nginx Ingress Controller inside the cluster. It watches the Kubernetes API for new Ingress rules and hot-reloads its nginx.conf automatically. No more manual config updates.

Here is a standard Ingress definition:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-web-ingress
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: myapp.no
    http:
      paths:
      - path: /
        backend:
          serviceName: my-frontend
          servicePort: 80
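
That serviceName must match an existing Service. For completeness, a minimal my-frontend Service backing the Ingress above could look like this (the label selector and container port are assumptions; match them to your Deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-frontend
spec:
  selector:
    app: my-frontend
  ports:
  - port: 80
    targetPort: 8080
```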

This allows you to terminate SSL and route traffic based on host headers, all using a single public IP. Crucial for keeping costs down.

Infrastructure Matters: The CoolVDS Advantage

You can run Kubernetes on a laptop. But running it in production requires I/O and network stability. If you are targeting users in Oslo, latency is your enemy. Routing traffic through a budget host in the US adds 150ms of lag. You want your nodes sitting on the NIX (Norwegian Internet Exchange).

Furthermore, many "Cloud" providers today still use OpenVZ or LXC containers to sell you "VPS" hosting. Do not run Kubernetes on OpenVZ. With a shared kernel, you cannot load the kernel modules (vxlan, for instance) that overlay networks need, nor manipulate iptables deeply enough for CNI plugins to function correctly.

At CoolVDS, we use KVM. Each instance has its own kernel. You want to enable `ip_forward`? Go ahead. You want to run Calico and mess with BGP routes? We won't stop you. Plus, our NVMe storage ensures that when etcd writes to disk (which it does, constantly), your cluster doesn't lock up waiting for I/O.
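
As a concrete example of that freedom: a Kubernetes node needs IP forwarding turned on, and with a bridge-based CNI plugin, bridged traffic must also traverse iptables so kube-proxy's rules apply. On a full KVM kernel this is two sysctl lines (the file path is a suggestion):

```
# /etc/sysctl.d/99-kubernetes.conf
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
```

On a container-based "VPS", the second knob often simply is not there.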

Debugging Network Latency

If your users complain about timeouts, don't guess. Check the physical link.

# Check for dropped packets on the interface
netstat -i

# Trace the path inside the overlay
traceroute -n 10.244.1.5

Also, keep an eye on conntrack tables. If you have high traffic, your node might drop packets simply because the table is full. Increase it in sysctl:

sysctl -w net.netfilter.nf_conntrack_max=131072
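
That setting does not survive a reboot, so persist it, and watch the live count against the limit so you see trouble before packets start dropping. The limit here is a starting point, not gospel:

```
# /etc/sysctl.d/99-conntrack.conf -- persists across reboots
net.netfilter.nf_conntrack_max = 131072

# Current usage vs. the limit:
#   cat /proc/sys/net/netfilter/nf_conntrack_count
#   cat /proc/sys/net/netfilter/nf_conntrack_max
```

When the count approaches the max, the kernel logs "nf_conntrack: table full, dropping packet" and your users see random timeouts.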

Conclusion

Kubernetes 1.3 is powerful, but it exposes the raw complexity of Linux networking. Whether you choose Flannel for simplicity or Calico for performance, the underlying hardware determines your stability. Don't let a slow hypervisor or a distant datacenter undermine your architecture.

We are seeing more Norwegian companies moving data home to comply with Datatilsynet recommendations. If you are building the future of infrastructure, build it on ground you can trust.

Need a sandbox to test your CNI config? Deploy a KVM instance on CoolVDS today. We give you full root access and the low latency your cluster demands.