
Kubernetes Networking Deep Dive: Escaping iptables Hell in 2018

The Packet Never Lies: A Real-World Look at Kubernetes Networking

It is 3:00 AM. Your pager goes off. The latency on your microservices architecture just spiked from 25ms to 400ms. The logs are clean, the CPU load is normal, but the network throughput is choking. If you are running Kubernetes in production today, you know this nightmare. It usually isn't the application code; it's the overlay network demanding more packet encapsulation work than the underlying hardware can keep up with.

Kubernetes networking is often treated as "magic" by developers. You define a Service, the traffic flows. But for those of us managing the infrastructure, there is no magic. There is only encapsulation, routing tables, and the terrifying complexity of iptables rules. With Kubernetes v1.10 and the upcoming v1.11, the landscape is shifting from purely iptables-based routing to IPVS (IP Virtual Server), but the fundamentals of the Container Network Interface (CNI) remain the bottleneck.

The Flat Network Lie

The Kubernetes model assumes that every pod can talk to every other pod without NAT. Achieving this on a VPS environment usually means building an overlay network using VXLAN or IP-in-IP. This adds overhead. Every packet leaving a container is encapsulated, sent over the wire, decapsulated, and delivered.
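
You can see that cost directly on the host. With Flannel's default VXLAN backend, for instance, the encapsulation headers eat roughly 50 bytes per packet, which is why the overlay MTU typically drops to 1450 while the physical NIC sits at 1500 (flannel.1 and eth0 are the usual default device names; substitute your own):

# Compare the overlay device MTU with the physical NIC
ip -d link show flannel.1
ip link show eth0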

If your hosting provider sells you "vCPUs" that are actually heavily contended threads on an old Xeon, that encapsulation cost will kill your throughput. This is why we built CoolVDS on KVM with strict resource isolation. When your CNI plugin demands CPU cycles to wrap a packet, they need to be there instantly.

Choosing Your Weapon: Flannel vs. Calico

In 2018, the debate is settled for most use cases.

  • Flannel: Simple. Great for demos. Uses VXLAN. It works, but it lacks Network Policies.
  • Calico: The production standard. It can run in Layer 3 mode using BGP (Border Gateway Protocol) without encapsulation if your network supports it, or IPIP mode if it doesn't.

If you are serious about security—especially with the GDPR enforcement that hit us last month—you need Calico. Why? Because it supports NetworkPolicy resources. You can't just let every pod talk to the database anymore.

Here is how you check your current Calico node status to ensure BGP peering is established:

sudo calicoctl node status
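
While you are at it, confirm which encapsulation mode your IP pool is actually running; BGP peering buys you little if the pool still has IPIP enabled everywhere. With calicoctl v3.x, something like this shows it (column names vary slightly between releases):

sudo calicoctl get ippool -o wide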

The Bottleneck: iptables vs. IPVS

This is the technical meat. Historically, kube-proxy implemented service load balancing by writing iptables rules. This works fine for 50 services. But we have clients running 5,000 services. When you have 5,000 services, iptables becomes a sequential list of rules that the kernel has to traverse for every packet. It is O(n). It is slow. It burns CPU.
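
A quick way to gauge how deep that chain has grown on one of your nodes is to count the NAT rules kube-proxy has written (KUBE-SVC and KUBE-SEP are the stock chain prefixes; the exact number will differ per cluster):

# Thousands of hits here means every packet pays the sequential traversal cost
sudo iptables-save -t nat | grep -c 'KUBE-SVC\|KUBE-SEP'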

Enter IPVS (IP Virtual Server). IPVS is a kernel-space load balancer that uses hash tables. It is O(1). It doesn't care if you have 10 services or 10,000.

To enable IPVS mode in your cluster (assuming you are on K8s v1.10+), you need to modify the kube-proxy ConfigMap. Don't just toggle the flag; you need to ensure the kernel modules are loaded on the host first.

# Load required kernel modules on the host
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack_ipv4
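
modprobe only lasts until the next reboot. On a systemd host you can persist the list through /etc/modules-load.d (a sketch; adjust the path to your distribution's convention) and then double-check that everything actually loaded:

# Persist the IPVS modules across reboots
cat <<EOF | sudo tee /etc/modules-load.d/ipvs.conf
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack_ipv4
EOF

# Verify the modules are present
lsmod | grep -e ip_vs -e nf_conntrack_ipv4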

Once the modules are loaded, edit the config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
  labels:
    app: kube-proxy
data:
  config.conf: |-
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    bindAddress: 0.0.0.0
    mode: "ipvs" # <--- THE CRITICAL CHANGE
    ipvs:
      excludeCIDRs: null
      minSyncPeriod: 0s
      scheduler: "rr"
      syncPeriod: 30s
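
On a kubeadm-provisioned cluster, kube-proxy runs as a DaemonSet in kube-system, so rolling the change out looks roughly like this (the k8s-app=kube-proxy label is kubeadm's default; adjust the selector if your installer names things differently):

# Edit the ConfigMap in place, then recycle the kube-proxy pods
kubectl -n kube-system edit configmap kube-proxy
kubectl -n kube-system delete pod -l k8s-app=kube-proxy

# The proxier mode is announced at startup; it should now mention ipvs
kubectl -n kube-system logs -l k8s-app=kube-proxy | grep -i ipvs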

After applying this and restarting the kube-proxy pods, you can verify the mapping tables directly on the node. This is much cleaner than grepping through a 20,000-line iptables dump.

sudo ipvsadm -Ln

GDPR, Data Sovereignty, and Ingress

Since May 25th, the world has changed. Data sovereignty is no longer just a "nice to have"; it is a legal requirement. If you are serving Norwegian customers, routing your traffic through a load balancer in Frankfurt might be a grey area depending on your DPA (Data Processing Agreement). Hosting directly in Oslo reduces this risk.

For Ingress, the Nginx Ingress Controller remains the robust choice in mid-2018. It handles SSL termination efficiently, which is critical because decrypting HTTPS traffic is CPU-intensive. On CoolVDS NVMe instances, we see SSL handshakes complete significantly faster due to the high clock speed of our cores compared to budget VPS providers.
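
If you want a number rather than a feeling, curl's timing variables are a cheap way to watch handshake cost from the client side (api.example.no is a placeholder; time_appconnect covers TCP connect plus the TLS negotiation):

# time_connect = TCP handshake done, time_appconnect = TLS handshake done
curl -o /dev/null -s -w 'connect: %{time_connect}s  tls: %{time_appconnect}s\n' https://api.example.no/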

Here is a standard Ingress definition ensuring TLS termination. Note the extensions/v1beta1 API version—Kubernetes is evolving fast, so always check your API deprecation logs.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: secure-gateway
  annotations:
    kubernetes.io/ingress.class: nginx
    # Force SSL redirection for compliance
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    # Increase buffer size for large headers
    nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
spec:
  tls:
  - hosts:
    - api.example.no
    secretName: tls-secret
  rules:
  - host: api.example.no
    http:
      paths:
      - path: /
        backend:
          serviceName: backend-service
          servicePort: 80
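
The tls-secret referenced above has to exist before the controller can terminate anything. Assuming you already have the certificate and key on disk (the file names below are placeholders), creating it is a one-liner:

# Create the TLS secret in the same namespace as the Ingress
kubectl create secret tls tls-secret --cert=api.example.no.crt --key=api.example.no.key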

Pro Tip: If you are seeing 502 Bad Gateway errors on Nginx Ingress during high load, check your sysctl settings on the node. The default connection tracking table often fills up.

You can tweak this on the host node to allow more concurrent connections:

sysctl -w net.netfilter.nf_conntrack_max=131072
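
The -w flag only survives until reboot, and it helps to know how close to the ceiling you actually are. Something along these lines (the sysctl.d file name is arbitrary) makes the setting permanent and lets you watch the live count:

# Current tracked connections vs. the limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Persist across reboots
echo 'net.netfilter.nf_conntrack_max = 131072' | sudo tee /etc/sysctl.d/90-conntrack.conf
sudo sysctl --system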

Securing East-West Traffic

By default, Kubernetes allows all traffic. If a hacker compromises your frontend, they have a direct line to your Redis backend. We use NetworkPolicy to lock this down. This is the software-defined firewall of 2018.

Below is a "Default Deny" policy. Deploying this will break everything immediately, which is how you know it is working. You then whitelist only necessary traffic.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: production
spec:
  podSelector: {} # Selects all pods in namespace
  policyTypes:
  - Ingress
  - Egress

To allow traffic, you layer an additive policy on top. For example, allowing the frontend to reach the backend on port 6379:
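
Here is a minimal sketch of such a policy; the app: frontend and app: backend labels are assumptions, so match the selectors to the labels on your own deployments:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 6379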

After applying both policies, confirm they are registered in the namespace:

kubectl get networkpolicy -n production

The Hardware Reality

Software-defined networking (SDN) is heavy. It requires constant context switching. When we benchmarked CoolVDS against standard cloud offerings, the difference wasn't in the successful requests—it was in the outliers. The 99th percentile latency on standard HDD-based VPS solutions was atrocious because the I/O wait caused the CNI plugin to hang momentarily.

You cannot fix physical latency with YAML configuration. If you are building a Kubernetes cluster to serve the Nordic market, you need low latency to NIX (Norwegian Internet Exchange) and hardware that doesn't steal CPU cycles from your packet processing.

Kubernetes is powerful, but it is heavy. Don't make it carry the weight of slow infrastructure too. Deploy a high-performance node today and see the difference in your kube-system metrics.