Kubernetes Networking: Beyond Docker Links
It has been two months since Docker hit version 1.0. The hype is deafening. Every developer I know is packaging apps into containers, and for good reason—dependency hell is arguably over. But if you are managing infrastructure like I do, the party stops the moment you need a container on Host A to talk to a container on Host B.
Until recently, we were stuck hacking --link or setting up fragile HAProxy bridges. Then Google dropped Kubernetes (currently in v0.x). It promises to manage this chaos, but its networking model is radically different from what we are used to. If you are still thinking in terms of port mapping, you need to unlearn that immediately.
In this post, we are going to look under the hood of Kubernetes networking, see how kube-proxy manipulates iptables, and explain why this new architecture forces us to rethink our underlying VPS choices.
The "Flat IP" Promise
The fundamental requirement of Kubernetes is simple but painful to implement: all containers can communicate with all other containers without NAT.
In a standard Docker setup, you get a bridge (docker0) and private IPs that are unreachable from the outside. Kubernetes demands that every "Pod" (a group of containers sharing a network namespace and an IP) gets its own IP address that is routable across the entire cluster.
This means that if you have three Minions (the new term for worker nodes), you cannot rely on Docker's default bridge configuration. You have to allocate a distinct subnet to each Minion and make sure the routing table on every other Minion knows about it.
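As a concrete sketch of what that allocation can look like (the 10.244.x.0/24 ranges and the /etc/default/docker path are my conventions here, not something Kubernetes hands you), the simplest variant just re-addresses docker0 per Minion via the daemon's --bip flag; a dedicated cbr0 bridge, covered further down, is the tidier long-term option.
# On minion-0, which owns the 10.244.0.0/24 slice of the Pod network
echo 'DOCKER_OPTS="--bip=10.244.0.1/24"' >> /etc/default/docker
service docker restart
# minion-1 gets --bip=10.244.1.1/24, minion-2 gets --bip=10.244.2.1/24, and so on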
The Manual Routing Approach (The Hard Way)
Right now, early adopters are scripting this with SaltStack or Chef. Here is what the routing table actually looks like on a Minion (let's say 192.168.1.10) that needs to reach Pods on a neighboring Minion (192.168.1.11, which hosts the subnet 10.244.1.0/24).
root@minion-1:~# ip route add 10.244.1.0/24 via 192.168.1.11 dev eth1
root@minion-1:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.1.1     0.0.0.0         UG    0      0        0 eth0
10.244.1.0      192.168.1.11    255.255.255.0   UG    0      0        0 eth1
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
You have to manage these routes for every single node. If you add a node, you have to update the route tables on every existing node. It is tedious. It is prone to error. One typo and your database can't talk to your web frontend.
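A small script keeps the typos out. This is only a sketch: it assumes the same addressing as above, a cluster interface called eth1, and that you regenerate the node list from your config management of choice.
#!/bin/bash
# Install a route to every peer Minion's Pod subnet; our own subnet stays local.
MY_IP=$(ip -4 addr show dev eth1 | grep -oP '(?<=inet )[0-9.]+' | head -n1)

# "minion IP  pod subnet" pairs -- regenerate this from your inventory
NODES="
192.168.1.10 10.244.0.0/24
192.168.1.11 10.244.1.0/24
192.168.1.12 10.244.2.0/24
"

echo "$NODES" | while read -r gw subnet; do
    [ -z "$gw" ] && continue              # skip blank lines
    [ "$gw" = "$MY_IP" ] && continue      # local subnet, no route needed
    ip route replace "$subnet" via "$gw" dev eth1
done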
Service Discovery & Kube-Proxy
The routing gets packets to the right node, but how do they hit the right container? Enter kube-proxy. This service runs on every Minion and acts as a traffic director. In the current 0.x releases it leans heavily on userspace proxying (iptables rules redirect Service traffic into a proxy process that shuffles the bytes), though it is moving toward pure iptables manipulation.
When you define a Service in JSON, kube-proxy opens a port on the host and proxies traffic to the backend Pods. It looks something like this in your config:
{
  "id": "redis-master",
  "kind": "Service",
  "apiVersion": "v1beta1",
  "port": 6379,
  "containerPort": 6379,
  "selector": {
    "name": "redis",
    "role": "master"
  }
}
Behind the scenes, Kubernetes generates a tangle of iptables chains to handle this redirection. It adds latency. The overhead is millisecond-level, but in a high-frequency trading environment or on a high-load Norwegian news site, it adds up.
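If you want to see exactly what kube-proxy has installed, dump the NAT table. I am not pasting output here because the chain names shift between releases, but the pattern to look for is a per-Service rule redirecting the Service port into a local proxy port.
# Show the NAT rules kube-proxy maintains for Services
root@minion-1:~# iptables-save -t nat | grep -i kube
root@minion-1:~# iptables -t nat -L -n -v --line-numbers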
Why Your Virtualization Platform Matters
This is where theory collides with infrastructure reality. To run Kubernetes effectively in 2014, you need control.
- Kernel Requirements: Docker needs a modern kernel (3.8+) for AUFS and cgroups. Many "Managed VPS" providers are still stuck on OpenVZ with ancient 2.6 kernels shared across hundreds of users. The Docker daemon will crash. Kubernetes will fail to start.
- Bridge Manipulation: You need to install packages like bridge-utils and modify /etc/network/interfaces to set up your own bridges (like cbr0) for the Pod network; a quick sketch follows this list.
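Here is that sketch, on a Debian-flavored Minion that owns 10.244.0.0/24. Treat it as a starting point to adapt, and persist the bridge in /etc/network/interfaces once you are happy with it.
# Create the Pod bridge by hand and point Docker at it instead of docker0
apt-get install -y bridge-utils
brctl addbr cbr0
ip addr add 10.244.0.1/24 dev cbr0
ip link set dev cbr0 up

# We manage our own forwarding rules, so keep Docker's hands off iptables
echo 'DOCKER_OPTS="--bridge=cbr0 --iptables=false"' >> /etc/default/docker
service docker restart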
This is why at CoolVDS, we strictly use KVM (Kernel-based Virtual Machine). You get your own kernel. You can load custom modules. You can modify your iptables without hitting a "Permission Denied" error from a hypervisor restriction.
Pro Tip: If you are seeing random packet drops between Pods, check the MTU size on your overlay interface. Flannel (CoreOS's new overlay tool) adds a header that reduces the payload size. If your physical interface is 1500, set the Docker MTU to 1450.
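On a Debian or Ubuntu Minion that boils down to one daemon flag plus a verification step; the paths here assume the stock init script.
# Add --mtu=1450 to DOCKER_OPTS in /etc/default/docker, then bounce the daemon
service docker restart

# Verify from inside a container: eth0 should now report mtu 1450
docker run --rm busybox ip link show eth0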
Data Sovereignty in the Age of Snowden
We are all a bit paranoid right now. With the recent leaks about NSA surveillance, relying on US-based cloud giants for your cluster infrastructure is a risk many European CTOs are no longer willing to take. The Safe Harbor framework is under scrutiny.
Hosting your Kubernetes cluster on CoolVDS in our Oslo datacenter ensures your data remains under Norwegian jurisdiction (Personopplysningsloven). Plus, if your target market is Norway, the latency benefits are undeniable.
Latency Benchmark: Oslo vs. Frankfurt
| Source | Target: AWS Frankfurt | Target: CoolVDS Oslo |
|---|---|---|
| User in Bergen | ~35ms | ~9ms |
| User in Trondheim | ~42ms | ~12ms |
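If you want to check numbers like these from your own location, mtr gives you latency and per-hop loss in one report. The hostname below is a placeholder for your own instance.
# 50 probes, summarized as a report -- run from a machine near your users
mtr --report --report-cycles 50 your-instance.coolvds.example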
Setting Up Etcd (The Brain)
Kubernetes stores all of its state in etcd, a distributed key-value store, and etcd is very sensitive to disk I/O latency. If your disk writes are slow, heartbeats and leader elections start missing their deadlines and the API server starts timing out.
We recently tested an etcd cluster (v0.4.6) on standard SATA VPS hosting versus CoolVDS NVMe instances. The difference is night and day during leader election.
# Starting etcd on CoolVDS
./etcd -name infra0 -initial-advertise-peer-urls http://10.0.1.10:2380 \
-listen-peer-urls http://10.0.1.10:2380 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster infra0=http://10.0.1.10:2380 \
-initial-cluster-state new
On standard spinning rust storage, we saw heartbeat timeouts. On NVMe, it was rock solid.
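Before blaming etcd, it is worth checking whether the disk can keep up at all. A synchronous-write test with dd is crude (sequential writes, not etcd's exact log pattern), but it separates spinning rust from flash very quickly; adjust the path to wherever etcd keeps its data.
# 1,000 small writes, each forced to disk -- etcd has to sync its log on every write, so this matters
dd if=/dev/zero of=/var/lib/etcd/disktest bs=8k count=1000 oflag=dsync
rm /var/lib/etcd/disktest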
Conclusion
Kubernetes is still in its infancy (v0.x). It is rough, the documentation is sparse, and the networking requires manual intervention. But it is the future. The days of manually SSH-ing into servers to git pull are ending.
If you are brave enough to test this new architecture, do not handicap yourself with inferior virtualization. You need KVM, you need raw I/O for etcd, and you need low latency.
Ready to build your first cluster? Deploy a high-performance KVM instance on CoolVDS today and get full root access in under 55 seconds.