Unmasking Kubernetes Networking Performance: A 2023 Survival Guide for Nordic DevOps
Let’s be honest. Kubernetes networking is where good clusters go to die. On your laptop, everything is fine. But push that YAML to a staging environment, and suddenly DNS lookups take 300ms, and your microservices are timing out. I have spent the last decade debugging packet drops at 3 AM, and the culprit is rarely the application code. It is almost always the plumbing underneath.
In the Nordic region, we have a specific set of constraints. We have excellent connectivity via NIX (Norwegian Internet Exchange), but we also have strict data residency requirements (GDPR, Schrems II). You cannot just throw a CDN at every problem if the data cannot leave Oslo. You need raw network throughput and an infrastructure that doesn't steal your CPU cycles for software interrupts.
This guide isn't about how a Service works. You know that. This is about why your Service is slow, and how to fix it using the tools available to us in late 2023.
The CNI Battlefield: IPTables vs. eBPF
If you are still running a default Flannel installation with VXLAN in 2023, you are leaving performance on the table. The industry standard has shifted. For years, we relied on `iptables` to route traffic between pods and Services. It works, but kube-proxy appends rules for every Service and endpoint, and packet matching walks those chains linearly. At 5,000 services, `iptables` becomes a bottleneck, burning kernel CPU just to decide where a packet goes.
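If you want to see the problem on your own nodes before switching, a quick rule count tells the story. Run this on a worker node that still uses kube-proxy in iptables mode; the commands are standard, and the output is just a diagnostic.
# Count the NAT rules kube-proxy maintains; this grows with Services and endpoints
sudo iptables-save -t nat | grep -c 'KUBE-'
# Total rules across all tables, for a feel of how long the chains get
sudo iptables-save | wc -l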
Enter eBPF (extended Berkeley Packet Filter). By October 2023, when Cilium graduated from the CNCF, it had matured into the default choice for serious production clusters. It bypasses the tangled mess of iptables chains entirely.
Why eBPF Matters for VPS Hosting
When you run Kubernetes on a VPS, you are already dealing with a layer of virtualization. Adding a heavy overlay network on top creates a "double encapsulation" tax: VXLAN headers wrapped inside the hypervisor's own virtual networking. An eBPF datapath shortens the per-packet path through the kernel, so less of that tax lands on your workloads.
Here is how you actually deploy Cilium with strict kube-proxy replacement enabled (the performant way) using Helm. Note the versioning—Cilium 1.14.x is the stable target right now.
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.14.2 \
  --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=API_SERVER_IP \
  --set k8sServicePort=6443
By replacing kube-proxy, Service load balancing moves out of iptables DNAT and netfilter conntrack and into eBPF maps inside the kernel. On a high-traffic node, this saves roughly 15-20% CPU overhead. That is CPU you are paying for but not using for your app.
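Once the install settles, confirm the eBPF datapath actually took over. These checks go through the Cilium agent and assume the chart's default DaemonSet name (`cilium`) in `kube-system`:
# Verify kube-proxy replacement is active on this node
kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement
# List the Services Cilium is load balancing via eBPF maps
kubectl -n kube-system exec ds/cilium -- cilium service list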
The Infrastructure Reality: Noisy Neighbors and Steal Time
You can tune your CNI until you are blue in the face, but if the underlying hypervisor is oversubscribed, your network latency will jitter. This is the "noisy neighbor" effect.
Pro Tip: Check your `st` (steal time) in `top`. If it is consistently above 0.5%, your provider is overselling their CPU cores. Network packet processing is CPU-bound. High steal time = dropped packets.
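A quick way to watch steal over a window, rather than eyeballing `top` once, is the standard procps tooling:
# Sample CPU counters every 5 seconds, 12 times; "st" is the last column
vmstat 5 12
# One-shot summary; the st value sits at the end of the %Cpu(s) line
top -bn1 | grep '%Cpu'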
This is why we architect CoolVDS differently. We use KVM (Kernel-based Virtual Machine) with strict resource isolation. When you spin up a node for your cluster, the network interface provided to your VM uses virtio-net drivers that map directly to our backbone. We don't overprovision the uplinks.
For a Kubernetes node, standard HDD storage is a death sentence for etcd. If etcd latency spikes because of slow I/O, the API server can't update the endpoint slices, and your networking configuration becomes stale. NVMe storage is not a luxury here; it is a requirement for cluster stability. CoolVDS provides local NVMe by default, keeping `etcd` fsync latency comfortably under 2ms.
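If you want to verify that number yourself, etcd exposes WAL fsync latency as a Prometheus histogram. A rough check from a control-plane node, assuming kubeadm-style certificate paths and the default client port:
# Pull etcd's fsync latency histogram straight from its metrics endpoint
sudo curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  https://127.0.0.1:2379/metrics \
  | grep etcd_disk_wal_fsync_duration_seconds_bucket | tail -n 5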
Kernel Tuning for High Throughput
Out of the box, most Linux distros (Ubuntu 22.04, Debian 11) ship general-purpose defaults, not settings for high-throughput packet processing. You need to drop a tuning file into `/etc/sysctl.d/` on your worker nodes to handle the packet storms typical of microservices.
Apply these settings to handle higher connection counts and faster TCP recycling:
# /etc/sysctl.d/99-k8s-network.conf
# Increase the maximum number of open file descriptors
fs.file-max = 2097152
# Increase the connection tracking table (crucial for NAT)
net.netfilter.nf_conntrack_max = 262144
# Enable IP forwarding (mandatory for K8s routers)
net.ipv4.ip_forward = 1
# Optimize TCP buffer sizes for 10Gbps+ links
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
Reload with `sysctl --system`. Without these, a sudden spike in traffic (like a marketing campaign or a DDoS attack) will cause the kernel to drop SYN packets before your Ingress controller even sees them.
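After reloading, verify the values actually landed and keep an eye on conntrack pressure; once the live count approaches `nf_conntrack_max`, the kernel starts dropping new connections:
# Confirm the new values are live
sysctl net.netfilter.nf_conntrack_max net.core.rmem_max net.core.wmem_max
# Current conntrack table usage versus the ceiling
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max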
Ingress: Nginx vs. The World
In 2023, everyone is talking about Gateway API, but Ingress is what is running in production. While Istio is powerful, it is often overkill for teams smaller than 50 engineers. The complexity tax is high. For 95% of use cases in Norway, ingress-nginx is still the king of throughput per CPU cycle.
However, default Nginx settings are too conservative. If you are serving an API with large payloads (common in B2B tech), you need to adjust the buffer sizes in your ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  proxy-body-size: "50m"
  proxy-connect-timeout: "10"
  proxy-read-timeout: "120"
  proxy-send-timeout: "120"
  worker-processes: "4"  # Match your VDS core count
Matching `worker-processes` to your CoolVDS vCPU count is critical. Nginx is event-driven; context switching between cores kills efficiency.
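A quick sanity check that the controller actually picked up the ConfigMap; the namespace and deployment name below are the defaults from the official ingress-nginx Helm chart, so adjust them if your install differs:
# Inspect the rendered nginx.conf inside the controller pod
kubectl -n ingress-nginx exec deploy/ingress-nginx-controller -- \
  grep -E 'worker_processes|proxy_read_timeout' /etc/nginx/nginx.conf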
Security: The Default Deny
We see this constantly: a cluster gets deployed with wide-open pod networking. If one pod is compromised, the attacker can port-scan the entire internal network. Zero Trust isn't just a buzzword; it's a `NetworkPolicy`.
Start with a default deny policy. It breaks things initially, but it forces you to understand your traffic flow.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
When you apply this, silence follows. You must then explicitly allow DNS (UDP 53) and communication between your frontend and backend. It’s tedious, but necessary for compliance with strict Norwegian security standards.
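A minimal sketch of the DNS exception, assuming CoreDNS lives in `kube-system` behind the standard `k8s-app: kube-dns` label; TCP 53 is included because large responses fall back from UDP to TCP:
# Allow every pod in "production" to reach CoreDNS, and nothing else yet
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53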
The Latency Factor: Oslo vs. The Continent
Physics is the ultimate limit. The speed of light in fiber is roughly 200,000 km/s. If your VPS is in Frankfurt but your users are in Bergen, you are adding 20-30 ms of round-trip time (RTT) from distance and fiber routing alone. Add the TCP and TLS 1.2 handshakes (roughly three round trips before the first byte), and your user waits around 100 ms before receiving anything.
| Origin | Destination | Approx. Latency (RTT) |
|---|---|---|
| Oslo User | Frankfurt DC | 25-35 ms |
| Oslo User | Amsterdam DC | 20-30 ms |
| Oslo User | CoolVDS Oslo DC | < 2 ms |
For high-frequency trading, real-time gaming, or complex microservices chattering to each other, that difference defines the user experience.
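To see where your own milliseconds go, curl can split a single request into DNS lookup, TCP connect, TLS handshake, and time to first byte; the URL below is a placeholder for your own endpoint:
# Break one HTTPS request into its latency components
curl -o /dev/null -s -w 'dns: %{time_namelookup}s  tcp: %{time_connect}s  tls: %{time_appconnect}s  ttfb: %{time_starttransfer}s\n' \
  https://your-app.example.com/healthz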
Conclusion
Kubernetes networking is a beast, but it is a tameable one. It requires a shift from "it works" to "it performs." You need the right CNI (Cilium), the right kernel tuning, and most importantly, the right underlying hardware.
You cannot fix a noisy neighbor problem with YAML configuration. You need dedicated resources.
If you are tired of debugging latency ghosts and want a predictable, high-performance foundation for your cluster, stop fighting the infrastructure. Deploy a test instance on CoolVDS today and see what < 2ms latency does for your application response times.