Demystifying Kubernetes 1.1 Networking: A Deep Dive into Overlays and iptables

Let’s be honest. Getting Kubernetes up and running is one thing. Understanding how a packet actually travels from a frontend Pod on Node A to a backend Service on Node B is an entirely different circle of hell. If you are running Kubernetes 1.1 in production right now, you know exactly what I mean.

The promise of Google's infrastructure for the masses is seductive. But abstraction has a cost. In the case of Kubernetes, that cost is often network complexity. I recently spent three sleepless nights debugging a sporadic timeout issue for a client here in Oslo. The culprit wasn't the application code; it was a misconfigured MTU inside a Flannel VXLAN overlay. Packet fragmentation is a silent killer.

In this post, we are going to pop the hood on the Kubernetes networking model. We will look at how the "flat network" assumption actually works, why your iptables rules turn into spaghetti, and how to host this stack without sacrificing performance.

The "Flat Network" Lie

Kubernetes imposes a fundamental requirement on any networking implementation: all containers can communicate with all other containers without NAT. Ideally, the IP that a container sees itself as is the same IP that others see it as.

On a cloud provider like GCE, this is handled by the underlying SDN. But if you are building on bare metal or VPS infrastructure—which you should be if you care about cost and data sovereignty in Norway—you have to build this network yourself. This is usually done via an overlay network.

Enter Flannel

Right now, CoreOS's Flannel is the de facto standard for simple overlays. With the vxlan backend it creates a flannel.1 VXLAN device (the older udp backend uses a flannel0 TUN device) that encapsulates packets and sends them over the host network to the destination node.
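
You can sanity-check what flanneld actually built with iproute2. With the vxlan backend the device is normally called flannel.1; adjust the name if your setup differs:

# Show the VXLAN parameters (VNI, UDP port, local VTEP) of the overlay device
$ ip -d link show flannel.1
# Confirm the overlay routes point at it
$ ip route | grep flannel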

Here is what the configuration inside etcd typically looks like for a Flannel setup. If you aren't checking your etcd keys, you are flying blind.

$ etcdctl get /coreos.com/network/config
{
    "Network": "10.1.0.0/16",
    "Backend": {
        "Type": "vxlan"
    }
}
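
If that key doesn't exist yet, flanneld has nothing to allocate from and won't start. Seed it once before the daemons come up; a minimal sketch using the etcd v2 API:

$ etcdctl set /coreos.com/network/config '{
    "Network": "10.1.0.0/16",
    "Backend": {
        "Type": "vxlan"
    }
}'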

When flanneld starts, it allocates a subnet (e.g., 10.1.15.0/24) to the host and writes it to a subnet file. This file is critical because Docker needs it to configure its bridge.

$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.1.0.0/16
FLANNEL_SUBNET=10.1.15.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

Pro Tip: Notice the FLANNEL_MTU=1450? The standard Ethernet MTU is 1500, and VXLAN encapsulation adds 50 bytes of headers. If your underlying VPS network doesn't support jumbo frames (most don't) and you force Docker to use 1500, oversized packets get dropped silently. Always sync your Docker daemon flags with this file.
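
The usual pattern is to feed subnet.env straight into the Docker daemon so the bridge and MTU always match what flanneld handed out. A minimal sketch for a systemd-managed host running Docker 1.8+ (the drop-in path is just an example; adjust for your distro):

# /etc/systemd/system/docker.service.d/40-flannel.conf (example path)
[Service]
EnvironmentFile=/run/flannel/subnet.env
ExecStart=
ExecStart=/usr/bin/docker daemon --bip=${FLANNEL_SUBNET} --mtu=${FLANNEL_MTU} --ip-masq=false

With FLANNEL_IPMASQ=true, flanneld installs the masquerade rule itself, which is why --ip-masq is turned off on the Docker side.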

The Service Proxy: Iptables Spaghetti

In Kubernetes 1.0, kube-proxy operated purely in userspace. It was stable but slow: every connection bounced through a proxy process on the node. In 1.1, the new iptables mode is gaining traction, offering higher throughput by keeping the forwarding path entirely in kernel space.
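
Switching a node over means restarting kube-proxy with the new proxier selected. A rough sketch, assuming your 1.1 build exposes the --proxy-mode flag (check kube-proxy --help first; the master URL and kubeconfig path below are placeholders):

$ sudo kube-proxy \
    --master=https://192.168.1.10:6443 \
    --kubeconfig=/var/lib/kube-proxy/kubeconfig \
    --proxy-mode=iptables \
    --v=2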

However, this makes debugging a nightmare. When you create a Service, kube-proxy writes a chain of rules. Let's look at what actually happens when you expose a service.

First, verify your service IP:

$ kubectl get svc frontend
NAME       CLUSTER_IP      EXTERNAL_IP   PORT(S)   SELECTOR       AGE
frontend   10.0.0.142      <none>        80/TCP    app=frontend   4d
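
Before touching iptables at all, it's worth confirming that the Service actually has endpoints behind it. An empty list means the label selector matches no running Pods, which is an API-side problem, not a networking one:

$ kubectl get endpoints frontend
$ kubectl describe svc frontend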

Now, if you try to curl 10.0.0.142 and it hangs, don't blame the application yet. Check the node rules. A cluster IP isn't bound to any interface; traffic to it gets trapped by the nat table's PREROUTING chain (or OUTPUT, for traffic generated on the node itself).

$ sudo iptables-save | grep 10.0.0.142
-A KUBE-PORTALS-CONTAINER -d 10.0.0.142/32 -p tcp -m comment --comment "default/frontend:" -m tcp --dport 80 -j REDIRECT --to-ports 44321
-A KUBE-PORTALS-HOST -d 10.0.0.142/32 -p tcp -m comment --comment "default/frontend:" -m tcp --dport 80 -j DNAT --to-destination 192.168.1.50:44321

If these rules are missing, kube-proxy is likely out of sync with the API server. Restarting the proxy is the classic "turn it off and on again" fix, but persistent failures usually point to a problem with the hyperkube binary or a network partition between the node and the master.
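
One more detail: the KUBE-PORTALS-* chains above belong to the userspace proxier. A node that has been switched to the iptables proxier programs KUBE-SERVICES, KUBE-SVC-* and KUBE-SEP-* chains with DNAT rules pointing straight at Pod IPs. A quick way to see which one a node is actually running:

# List the KUBE-* chains in the nat table; the names give the proxy mode away
$ sudo iptables-save -t nat | grep '^:KUBE'
# KUBE-PORTALS-* / KUBE-NODEPORT-*        -> userspace proxier
# KUBE-SERVICES / KUBE-SVC-* / KUBE-SEP-* -> iptables proxier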

Performance Penalties and Local Reality

Overlays like VXLAN induce CPU overhead. Every packet is encapsulated and decapsulated. On a shared, oversold VPS, CPU steal will kill your latency. You might see ping times inside the overlay spike from 2ms to 200ms when a noisy neighbor on the same host gets busy.
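
You can measure this from inside the guest; the steal column is the tell (mpstat comes from the sysstat package):

# 'st' is the share of time the hypervisor gave your vCPU to someone else
$ vmstat 1 5
# Per-CPU view, check the %steal column
$ mpstat -P ALL 1 5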

This is why hardware selection matters. For our internal clusters, and what we recommend for CoolVDS implementations, we rely on KVM virtualization rather than OpenVZ. KVM lets us load the kernel modules an overlay needs (OpenVZ containers share the host kernel, so you can't) and ensures that the CPU cycles you pay for are actually yours.

The Norwegian Context: Latency and Law

With the recent invalidation of the Safe Harbor agreement (Schrems I), relying on US-based cloud providers has become legally risky for Norwegian businesses handling personal data. The Datatilsynet (Norwegian Data Protection Authority) is watching closely.

Hosting locally in Oslo or nearby isn't just about compliance; it's about physics. Routing traffic from Oslo to a data center in Frankfurt and back adds roughly 20-30ms per round trip. If you are running a microservices architecture where one user request triggers 50 internal RPC calls, and any of them have to cross that link, the latency compounds fast.

Configuration Checklist for Production

Before you declare your cluster "production ready," verify these settings. I've seen these misconfigurations take down entire environments.

  1. Docker Bridge IP: Ensure --bip on the Docker daemon matches the subnet flanneld allocated to that host (FLANNEL_SUBNET above) rather than Docker's default 172.17.0.0/16, and that it doesn't collide with the node's physical network.
  2. Netfilter on Bridges: kube-proxy's rules only apply to bridged container traffic if the kernel passes it through iptables, so enable bridge-nf-call-iptables:
sysctl -w net.bridge.bridge-nf-call-iptables=1
sysctl -w net.bridge.bridge-nf-call-ip6tables=1

Add these to your /etc/sysctl.conf as well, or the settings will vanish on reboot.
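
The persistent version is just two lines (a file under /etc/sysctl.d/ works equally well on most distros):

# /etc/sysctl.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1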

The Infrastructure Foundation

Kubernetes is brilliant, but it is heavy. It assumes you have resources to burn. Running a master node, an etcd cluster, and worker nodes requires robust IOPS, especially for etcd, which is incredibly sensitive to disk latency.

This is where standard HDD VPS hosting fails. If etcd cannot fsync its writes to disk fast enough, heartbeats time out, leader elections churn, and your cluster falls apart. We built the CoolVDS NVMe platform specifically to solve this I/O bottleneck. When you are stacking file systems (OverlayFS on top of ext4 on top of virtual block devices), you need the raw throughput of NVMe to keep the system responsive.
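
A crude but revealing smoke test for the synchronous write latency etcd will experience: force every write to hit the disk with dd (the path and sizes here are arbitrary; run it on the volume etcd actually uses):

# 1000 small writes, each flushed to disk before the next one starts
$ dd if=/dev/zero of=/var/lib/etcd/dd-test bs=512 count=1000 oflag=dsync
$ rm /var/lib/etcd/dd-test

On NVMe this finishes in a blink; on a contended HDD-backed volume the throughput figure will show you exactly why leader elections start flapping.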

Don't let your infrastructure be the bottleneck for your orchestration. If you are experimenting with Kubernetes 1.1, spin up a high-performance instance that can actually handle the overlay overhead.

Ready to build a cluster that doesn't time out? Deploy a CoolVDS NVMe instance in Oslo today.