Kubernetes Networking on Bare Metal: Escaping Iptables Hell in 2018

Let’s be honest: kubectl apply -f is the easy part. The moment you move from a local Minikube setup to a production cluster spanning multiple nodes, networking becomes your worst nightmare. I’ve spent the last three weeks debugging a microservices architecture that worked perfectly in staging but fell apart under load. The culprit? It wasn't the application code. It was conntrack table exhaustion on the worker nodes.
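If you suspect the same failure mode, the check is quick. A minimal sketch for a Linux worker node follows; the limit value is an example, not a recommendation, so size it to your RAM and workload:

# How many connections the kernel is tracking right now vs. the hard limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# If count keeps brushing against max, raise the limit (example value) and persist it
sysctl -w net.netfilter.nf_conntrack_max=1048576
echo "net.netfilter.nf_conntrack_max = 1048576" > /etc/sysctl.d/99-conntrack.conf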

As of late 2018, Kubernetes (k8s) has won the orchestration war, but the networking layer remains a black box for many. If you are deploying in Norway, juggling GDPR compliance and the need for low latency to NIX (the Norwegian Internet Exchange), you cannot afford to treat the network as an abstraction. You need to know what happens to a packet when it hits eth0.

The Flat Network Lie

Kubernetes promises a flat network where every pod can talk to every other pod. This is great for developers but a headache for us infrastructure architects. To achieve this on a standard VPS or bare metal setup without specialized hardware support, we rely on CNI (Container Network Interface) plugins.
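You can test that promise directly. A quick sanity check (the pod name and IP below are placeholders from a hypothetical cluster):

# List pods with their IPs and the nodes they landed on
kubectl get pods -o wide

# From inside one pod, a pod IP on a different node should answer directly, no NAT involved
kubectl exec -it frontend-7d4b9 -- ping -c 3 10.244.2.17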

Most tutorials tell you to just install Flannel and move on. Don't. If you care about packet overhead, you need to understand the trade-offs.

1. Flannel (The Easy Way, The Slow Way)

Flannel typically uses VXLAN. It encapsulates Layer 2 Ethernet frames inside UDP packets, which adds roughly 50 bytes of outer headers to every packet. On a standard 1500-byte MTU network, your effective payload shrinks (Flannel usually drops the overlay interface to MTU 1450 to compensate). If your application pushes massive throughput (video streaming, heavy database replication), the CPU cost of encapsulation/decapsulation adds up.
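You can see the tax on any Flannel node. The interface name flannel.1 is the default, so adjust if yours differs:

# The VXLAN device Flannel creates runs at a reduced MTU (typically 1450)
ip -d link show flannel.1

# Compare with the physical NIC it rides on
ip link show eth0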

2. Calico (The BGP Way)

I prefer Calico. Instead of encapsulating traffic, it routes it. It runs a BGP agent (BIRD) on each node, programs the Linux routing table, and uses BGP (Border Gateway Protocol) to propagate pod routes between nodes. This is how the actual internet works. No encapsulation overhead, as long as your nodes sit on a network that can carry those routes (a shared L2 segment, or BGP peering with your top-of-rack switches).
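Verifying that the mesh has converged is straightforward, assuming calicoctl is installed on the node:

# BGP sessions to the other nodes should all show "Established"
sudo calicoctl node status

# Pod routes land in the plain kernel routing table, learned from BIRD
ip route | grep bird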

Pro Tip: If you are running on CoolVDS, our KVM instances provide a clean L2 network environment. This allows BGP peering between nodes to work seamlessly without the "noisy neighbor" interference you get on shared container hosting.

The Bottleneck: Kube-proxy and Iptables

The default mode for kube-proxy is still iptables. In this mode, k8s writes a chain of iptables rules for every Service, plus rules for every endpoint behind it.

If you have 50 services, it’s fine. If you have 5,000 services, the kernel has to traverse a massive sequential list of rules for every packet. O(n) complexity kills performance. I've seen latency spikes of 50ms+ just from rule traversal on loaded clusters.
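You can gauge how deep the hole is on your own cluster with two quick counts, run on any node:

# Rules kube-proxy has programmed for Services and endpoints
iptables-save -t nat | grep -c '^-A KUBE'

# Total NAT rules the kernel may have to walk
iptables-save -t nat | wc -l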

The Fix: IPVS (IP Virtual Server)

As of Kubernetes 1.11 (and stable in the current 1.12/1.13 release cycles), IPVS is the production-ready alternative. It uses hash tables instead of linear lists. The complexity is O(1). It doesn't matter if you have ten services or ten thousand; the lookup time is constant.

Here is how you enable IPVS mode when bootstrapping a cluster with kubeadm in 2018: append a KubeProxyConfiguration document to your kubeadm config file (on an existing cluster, edit the kube-proxy ConfigMap in kube-system and restart the kube-proxy pods):

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  excludeCIDRs: null
  minSyncPeriod: 0s
  scheduler: "rr" # Round Robin
  syncPeriod: 30s

Before applying this, make sure the required kernel modules are loaded on every node (kube-proxy needs them wherever it runs):

# Load required modules for IPVS
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack_ipv4

# Check if they are loaded
lsmod | grep -e ip_vs -e nf_conntrack_ipv4
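Once kube-proxy restarts in IPVS mode, each ClusterIP should show up as a virtual server. ipvsadm is the standard userspace tool for inspecting this (Debian/Ubuntu package shown; adjust for your distro):

# Install the inspection tool
apt-get install -y ipvsadm

# Every Service ClusterIP appears as a virtual server with its pod endpoints behind it
ipvsadm -Ln

# kube-proxy also states which proxier it selected in its logs
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50 | grep -i ipvs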

Ingress: Terminating SSL in Oslo

For external traffic, NodePort is messy. You end up with ports like 32045 open to the world. The standard solution in 2018 is the NGINX Ingress Controller. It acts as a smart router, sitting at the edge of your cluster.

Performance matters here. You are terminating TLS. If you are hosting for Norwegian clients, you want that handshake to happen in Oslo, not in a datacenter in Virginia. The speed of light is constant; physics doesn't negotiate.
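For reference, this is the shape of an Ingress object the controller consumes in the current (2018) API. The hostname, secret, and service names below are placeholders, and the TLS secret is assumed to already exist (created by hand or by something like cert-manager):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: web-ingress
  namespace: default
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  tls:
  - hosts:
    - app.example.no
    secretName: app-example-tls      # placeholder: pre-created TLS secret
  rules:
  - host: app.example.no
    http:
      paths:
      - path: /
        backend:
          serviceName: web-frontend  # placeholder backend Service
          servicePort: 80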

Here is a tuned nginx-configuration ConfigMap to optimize for high concurrency:

kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
data:
  keep-alive: "75"
  keep-alive-requests: "1000"
  worker-processes: "auto"
  # Optimize SSL for security and speed (modern ciphers for late 2018)
  ssl-ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384"
  ssl-protocols: "TLSv1.2"
  # Increase buffer size for large headers
  large-client-header-buffers: "4 16k"

Why Infrastructure Still Rules

You can tune your YAML files all day, but if the underlying disk I/O is choking, your etcd cluster will fail. etcd is the brain of Kubernetes, and it is extremely sensitive to disk write latency: every write must be fsynced to the Raft log before it is acknowledged. If fsync takes too long, heartbeats get delayed, leaders get re-elected, and the whole control plane becomes unstable.
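Before you trust a disk with etcd, measure its sync latency. A common approach is a small fio job that issues fdatasync after every write, mirroring etcd's WAL pattern; as a rule of thumb, you want the 99th percentile of the sync latency well under 10 ms:

# The target directory must exist and sit on the disk etcd will use
mkdir -p /var/lib/etcd-bench

# Small sequential writes, fdatasync after each one; fio reports sync latency percentiles
fio --name=etcd-fsync --directory=/var/lib/etcd-bench \
    --rw=write --ioengine=sync --fdatasync=1 --bs=2300 --size=22m

# Clean up afterwards
rm -rf /var/lib/etcd-bench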

This is where the "cloud" abstraction leaks. On generic public clouds, you often share disk IOPS with other tenants. If your neighbor decides to mine crypto, your Kubernetes API server slows down.

At CoolVDS, we use local NVMe storage for our KVM instances. We don't use network-attached block storage for the root filesystem. The difference in etcd performance is night and day.

Comparison: CNI Plugins in 2018

Feature            | Flannel (VXLAN)         | Calico (BGP)        | Weave Net
Network Model      | Overlay (Encapsulation) | Layer 3 Routing     | Mesh Overlay
Performance        | Medium (UDP Overhead)   | High (Native Speed) | Medium
Configuration      | Simple                  | Complex             | Moderate
Network Policies   | No (needs external)     | Yes (Native)        | Yes
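That last row is the one people underestimate. A standard Kubernetes NetworkPolicy is enforced natively by Calico (and Weave Net) but silently ignored on plain Flannel. A minimal example that restricts a namespace to traffic from its own pods (the namespace name is illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: production        # illustrative namespace
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}          # only pods in this same namespace may connect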

Datatilsynet and The Law

A quick note on compliance. If you are processing personal data of Norwegian citizens, GDPR (Article 44) restricts transfers outside the EEA. While US clouds rely on Privacy Shield, the legal ground is shaky. Hosting on CoolVDS servers physically located in Norway/Europe removes this headache entirely. Your data stays under Norwegian jurisdiction.

Final Thoughts

Kubernetes is powerful, but it is not magic. It is just Linux processes, iptables (or IPVS) rules, and routing tables. To run it successfully in production, you need to understand the layers beneath it.

Don't let high latency or noisy neighbors kill your cluster's performance. You need dedicated resources and fast I/O.

Ready to build a cluster that actually performs? Deploy a high-performance NVMe KVM instance on CoolVDS today and see the difference raw power makes for your pods.