Kubernetes Networking Deep Dive: Packet Flow, CNI Wars, and Why Your Overlay Network is Slow
Let’s cut the marketing noise. Kubernetes networking isn’t magic. It is a complex layer of iptables rules, routing tables, and encapsulation protocols held together by hope and bash scripts. I’ve spent the last three weeks debugging a cluster that kept dropping packets between microservices only during peak hours. The culprit wasn't code—it was a default MTU setting colliding with an overlay network on a budget VPS provider.
If you are running Kubernetes in production in 2022, you cannot afford to treat the network as a black box. Whether you are serving high-traffic APIs in Oslo or managing data pipelines across Europe, understanding the path a packet takes from an Ingress Controller to a Pod is mandatory.
The CNI Jungle: Calico vs. Cilium (2022 Edition)
The Container Network Interface (CNI) is where the rubber meets the road. In the Nordic hosting market, we see two dominant players right now: Calico and Cilium.
Calico is the industry workhorse. It uses BGP for routing and acts like a traditional router. It is stable, predictable, and we see it on 80% of clusters migrating to CoolVDS. However, as of late 2022, Cilium is eating its lunch by leveraging eBPF (Extended Berkeley Packet Filter) to bypass iptables entirely. Iptables was never designed for the churn of dynamic container scheduling. When you have 5,000 services, iptables becomes a linear bottleneck.
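A quick way to gauge how much rule churn kube-proxy is generating (assuming it runs in the default iptables mode) is to count the Service and endpoint chains in the NAT table on a worker node, as root:
# Each Service and endpoint gets its own KUBE-SVC/KUBE-SEP chain; tens of thousands
# of matches here means every new connection walks a long, linear rule list.
iptables-save -t nat | grep -cE 'KUBE-(SVC|SEP)'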
The Encapsulation Tax
Unless you are running BGP directly with your top-of-rack switches (rare in virtualized environments), you are likely using an overlay network like VXLAN or IPIP. This encapsulates your packet inside another packet. This process consumes CPU cycles.
Pro Tip: If your hosting provider over-provisions CPU (stealing cycles from you), your network throughput drops because the kernel can't encapsulate packets fast enough. We configured CoolVDS KVM slices with dedicated CPU pinning options specifically to prevent this "noisy neighbor" network lag.
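You can check for stolen cycles yourself: the "st" column in vmstat (or %st in top) should sit at or near zero on a node with dedicated CPU:
# Sample CPU counters once per second, five times; a persistent non-zero "st" column
# means the hypervisor is taking cycles away from your node.
vmstat 1 5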
The Hidden Killer: MTU Fragmentation
This is the most common configuration error I see. The standard internet MTU is 1500 bytes. VXLAN adds a 50-byte header. If your physical host interface is 1500, and your Pod interface is 1500, the encapsulated packet becomes 1550 bytes. The physical switch drops it, or fragmentation occurs, killing performance.
You must configure your CNI to account for the overhead. Here is how we verify the interface MTU on the host node before deploying:
ip -d link show eth0 | grep mtu
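As an extra sanity check, a don't-fragment ping sized to exactly fill a 1500-byte frame (1472 bytes of payload plus 28 bytes of IP and ICMP headers) fails loudly if anything on the path is clamping the MTU; the target below is a placeholder for a peer node in your cluster:
# DF-flagged ping: errors out instead of silently fragmenting
ping -M do -s 1472 -c 3 <peer-node-ip>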
If your host is 1500, your Calico configuration needs to look like this:
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: calico-node
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: calico-node
          env:
            - name: FELIX_IPINIPMTU
              value: "1480" # Leave room for the 20-byte IPIP header
            - name: FELIX_VXLANMTU
              value: "1450" # Leave room for the 50-byte VXLAN overhead
Setting this incorrectly results in sporadic connection resets that are incredibly difficult to debug.
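When you suspect this is happening, watching the host uplink for ICMP "fragmentation needed" messages (type 3, code 4) is usually faster than staring at application logs; this assumes the uplink is eth0:
# Oversized encapsulated packets hitting a 1500-byte link show up here
tcpdump -ni eth0 'icmp[0] == 3 and icmp[1] == 4'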
Service Discovery and DNS Latency
In Kubernetes, DNS is not just for finding Google.com; it’s how your frontend finds your backend. By default, K8s uses CoreDNS. I've seen latency spikes in clusters simply because ndots:5 (the default search configuration) forces the resolver to query multiple search domains before finding the actual service.
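One mitigation, sketched below with placeholder names and images, is to lower ndots per workload via dnsConfig so that lookups for fully qualified names stop cycling through the search list:
apiVersion: v1
kind: Pod
metadata:
  name: api-client            # illustrative name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"            # default is 5; fewer wasted search-domain queries
  containers:
    - name: app
      image: nginx:1.23       # placeholder image
Appending a trailing dot to the hostname in your application config (backend.default.svc.cluster.local.) achieves the same effect without touching the pod spec.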
If you are running a high-load setup, standard CoreDNS settings are insufficient. You need to tune the Corefile and the upstream behavior. Below is a production-grade CoreDNS config map optimized for high throughput:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
Notice the max_concurrent and cache settings. On a generic VPS with slow I/O, CoreDNS can choke on logging or caching operations. This is why CoolVDS ships NVMe storage as standard, even for system logs, so that I/O wait times never impact name resolution.
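To see whether your resolution path is actually fast, a throwaway pod running a crude latency loop is often all you need; the image below is just a convenient placeholder:
# Five sequential lookups against the cluster DNS; each should come back in a few milliseconds
kubectl run dns-probe --rm -it --restart=Never --image=busybox:1.35 -- \
  sh -c 'for i in 1 2 3 4 5; do time nslookup kubernetes.default.svc.cluster.local; done'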
Optimizing Kernel Parameters for K8s
Linux defaults are tuned for a modest web server from 2010, not a 2022 Kubernetes node routing gigabits of traffic. You need to touch `sysctl`. Be careful here; changing these on a live system can disrupt connectivity.
We recommend applying the following tuning via a DaemonSet or Cloud-Init script on your worker nodes. These settings increase the connection tracking table (crucial for NAT) and allow for faster TCP recycling.
# /etc/sysctl.d/k8s-net.conf
# Increase the connection tracking table size
net.netfilter.nf_conntrack_max = 1000000
# Expire idle established flows after 1 day (kernel default is 5 days)
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
# Allow more pending connections
net.core.somaxconn = 32768
# Expand the range of ephemeral ports
net.ipv4.ip_local_port_range = 1024 65000
# Reuse sockets in TIME_WAIT state for new outbound connections
# (tcp_tw_reuse, unlike the long-removed tcp_tw_recycle, is safe for most setups)
net.ipv4.tcp_tw_reuse = 1
# Increase TCP buffer sizes for high-speed local networks (like NIX peering)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
To apply this immediately:
sysctl -p /etc/sysctl.d/k8s-net.conf
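If you go the DaemonSet route, a minimal sketch could look like the following; the name, namespace, and image are illustrative, and the container simply replays the same values with sysctl -w on each node:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-sysctl-tuner        # illustrative name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-sysctl-tuner
  template:
    metadata:
      labels:
        app: node-sysctl-tuner
    spec:
      hostNetwork: true          # so net.* writes land in the host's network namespace
      containers:
        - name: tuner
          image: busybox:1.35    # placeholder image
          securityContext:
            privileged: true     # required to write host sysctls under /proc/sys
          command: ["sh", "-c"]
          args:
            - |
              sysctl -w net.netfilter.nf_conntrack_max=1000000
              sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
              sysctl -w net.core.somaxconn=32768
              sysctl -w net.ipv4.ip_local_port_range="1024 65000"
              sysctl -w net.ipv4.tcp_tw_reuse=1
              sysctl -w net.core.rmem_max=16777216
              sysctl -w net.core.wmem_max=16777216
              while true; do sleep 3600; done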
The Infrastructure Layer: Why CoolVDS Wins
You can tune software all day, but you cannot tune physics. Kubernetes control plane components, specifically etcd, are extremely sensitive to disk write latency. If etcd writes take longer than a few milliseconds, heartbeats get missed, leader elections churn, and the cluster stalls until a stable leader is re-established. I have seen this happen repeatedly on shared "cloud" platforms that throttle disk IOPS.
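A common way to sanity-check this, borrowed from the etcd community, is to benchmark fdatasync latency with fio; an invocation along these lines (the directory is a placeholder on the etcd volume) gives a quick read:
# Writes small blocks with an fdatasync after each, mimicking etcd's WAL pattern
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-bench --size=22m --bs=2300 --name=etcd-fsync-probe
# Check the reported fsync/fdatasync percentiles: the 99th percentile should sit
# in the low single-digit milliseconds on a disk fit for etcd.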
This is where the choice of hosting provider becomes a technical decision, not just a financial one. At CoolVDS, we don't use spinning rust or shared SATA SSDs for our high-performance tiers. We use NVMe directly attached to the PCI bus. For a K8s cluster node, this means:
- Etcd Stability: Write latencies consistently under 2ms.
- Faster Image Pulls: Docker images extract faster, improving pod startup time.
- Compliance: For our Norwegian clients, data residency is critical. Ensuring your data sits on servers physically located in Oslo or nearby ensures compliance with strict interpretations of GDPR and Schrems II.
Ingress and Local Peering
Finally, how does traffic get into the cluster? In 2022, the Gateway API is the future, but the Ingress Controller (specifically Nginx or Traefik) is the present. If your target audience is in Norway, latency matters.
Hosting outside of the region adds 20-30ms of latency. Hosting on CoolVDS, which peers directly at NIX (Norwegian Internet Exchange), keeps latency to domestic users often below 5ms. When configuring your Ingress, ensure you are preserving the client source IP to enable proper rate limiting and geographic filtering.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local # Preserves Client IP
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https
Setting externalTrafficPolicy: Local drops packets if the pod isn't on the node receiving traffic, so ensure you have a robust external LoadBalancer or run Ingress as a DaemonSet.
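A quick way to confirm the policy took effect is to tail the controller logs and check that real client addresses appear instead of node-internal SNAT addresses (this assumes the stock ingress-nginx deployment name):
kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=20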
Conclusion
Kubernetes networking is unforgiving of weak infrastructure. A dropped packet in an overlay network looks like an application timeout to your users. By combining precise CNI configuration, kernel tuning, and the raw IO performance of CoolVDS NVMe instances, you build a foundation that doesn't just survive peak load—it ignores it.
Stop fighting CrashLoopBackOff caused by slow I/O. Spin up a rock-solid K8s node on CoolVDS today and see the difference dedicated resources make.