Crushing Latency: API Gateway Tuning for High-Throughput Microservices
Your API isn't slow because your code is bad. It's slow because your infrastructure is choking on default settings from 2010. I see this constantly: a brilliant dev team builds a microservices architecture using the latest Go or Rust libraries, deploys it to a standard VPS, and then watches in horror as p99 latency spikes to 500ms under load. Why? Because the Linux kernel and your reverse proxy aren't configured for modern high-concurrency workloads out of the box.
In the Norwegian market, where fiber penetration is high and users in Oslo expect instant interaction, a 200ms delay is noticeable. It feels broken. If you are routing traffic through NIX (Norwegian Internet Exchange), you should be aiming for single-digit millisecond overhead. Here is how we tune API Gateways for raw speed, based on the reference architecture we use at CoolVDS.
1. The OS Layer: Stop Accepting Defaults
Most Linux distributions are tuned for general-purpose computing, not for handling 50,000 simultaneous TCP connections. Before you even touch NGINX or HAProxy, you need to fix the networking stack. The biggest bottlenecks in 2021 are usually ephemeral port exhaustion and an undersized TCP accept backlog.
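You can check where you stand before changing anything. Defaults vary by distribution and kernel version, so treat whatever you get back as your baseline, not as universal numbers:
# Kernel's current accept backlog ceiling (often as low as 128 on older kernels)
$ cat /proc/sys/net/core/somaxconn
# Current ephemeral port range
$ cat /proc/sys/net/ipv4/ip_local_port_range
# Socket summary, including how many connections sit in TIME_WAIT right now
$ ss -s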
We need to modify /etc/sysctl.conf to allow the kernel to recycle connections faster and handle bursts of traffic without dropping packets. If you are running on a CoolVDS KVM instance, you have full control over these kernel flags (unlike some container-based hosting where you are locked out).
Apply these settings to handle high throughput:
# /etc/sysctl.conf
# Maximize the backlog of incoming connections
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
# Allow reusing sockets in TIME_WAIT state for new outbound connections (the gateway-to-backend side)
net.ipv4.tcp_tw_reuse = 1
# Increase ephemeral port range
net.ipv4.ip_local_port_range = 1024 65535
# Increase TCP buffer sizes for 10Gbps+ links
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Enable BBR congestion control (Requires Kernel 4.9+)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
Pro Tip: Google's BBR (Bottleneck Bandwidth and Round-trip propagation time) algorithm significantly improves throughput on networks with some packet loss. We enable this by default on our internal setups. Verify it's active with sysctl net.ipv4.tcp_congestion_control.
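Applying the file and running that check looks like this. Note that sysctl -p only reloads /etc/sysctl.conf; if you split settings into /etc/sysctl.d/, use sysctl --system instead, and if the tcp_bbr module is not available on your kernel, sysctl will report an error for that key:
$ sudo sysctl -p
# Both values must stick for BBR to be effective
$ sysctl net.core.default_qdisc
net.core.default_qdisc = fq
$ sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = bbr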
2. NGINX: The Gateway Config
Whether you are using raw NGINX, OpenResty, or Kong, the underlying mechanics are the same. A common mistake is failing to configure upstream keepalives. Without this, NGINX opens a new TCP connection to your backend service for every single request. This adds the full TCP handshake overhead to every call. For SSL backends, it's even worse.
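You can put a number on that overhead from the gateway host with curl's timing variables. The /healthz path below is just a placeholder for one of your own backend endpoints; this is roughly the per-request cost you pay when upstream keepalive is missing:
# time_connect = TCP handshake; time_appconnect adds the TLS handshake (stays 0 for plain-HTTP backends)
$ curl -so /dev/null -w 'tcp: %{time_connect}s  tls: %{time_appconnect}s  total: %{time_total}s\n' \
      http://10.0.0.5:8080/healthz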
Here is the correct way to configure an upstream block for high performance:
# /etc/nginx/nginx.conf
# Worker settings live in the main context, outside the http block.
# They should match your CPU cores; on CoolVDS dedicated core instances, 'auto' does the right thing.
worker_processes auto;

# Increase the open file descriptor limit for the workers
worker_rlimit_nofile 65535;

events {
    # Essential for high concurrency
    worker_connections 16384;
    use epoll;
    multi_accept on;
}

http {
    upstream backend_microservice {
        server 10.0.0.5:8080;
        # CRITICAL: Keep connections open to the backend
        keepalive 64;
    }

    server {
        listen 443 ssl http2;
        server_name api.coolvds.com;
        # ssl_certificate / ssl_certificate_key omitted for brevity

        location / {
            proxy_pass http://backend_microservice;

            # HTTP/1.1 is required for upstream keepalive
            proxy_http_version 1.1;
            proxy_set_header Connection "";

            # Buffer tuning for small API responses
            proxy_buffers 16 4k;
            proxy_buffer_size 2k;
        }
    }
}
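With the config in place, validate it, reload, and confirm that connections to the upstream really stay open between requests. This assumes the 10.0.0.5:8080 backend from the example above:
$ sudo nginx -t && sudo systemctl reload nginx
# Established connections to the backend should persist after a burst of requests
$ ss -tn state established '( dport = :8080 )'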
3. SSL Termination and CPU Steal
In 2021, TLS 1.3 is the standard. It reduces the handshake latency by one full round-trip. However, encryption is CPU intensive. If you are hosting on a crowded VPS provider (noisy neighbors), your "vCPU" might be waiting for physical CPU time. This is called "CPU Steal," and it destroys SSL performance.
You can check for steal time using top (look for the st value). If it's consistently above 1-2%, move your workload.
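Two quick ways to read it; mpstat comes from the sysstat package, and on a quiet dedicated-core instance %steal should sit at 0.0:
# One-shot snapshot: 'st' is the last field of the %Cpu(s) line
$ top -bn1 | grep '%Cpu'
# Or average over five seconds
$ mpstat 1 5 | tail -n 1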
OpenSSL Optimization Benchmark:
Check your instance's raw speed. On our NVMe KVM plans, we ensure the AES-NI instruction set is passed through to the VM.
$ openssl speed -evp aes-256-gcm
If the larger block sizes aren't reaching gigabytes per second, your gateway will become a bottleneck during traffic spikes.
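If the throughput looks poor, check that the aes CPU flag is actually visible inside the VM before blaming OpenSSL (on our KVM plans it should be):
# Prints 'aes' once if the instruction set is exposed to the guest
$ grep -m1 -o 'aes' /proc/cpuinfo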
4. The I/O Reality Check
API Gateways log a lot: access logs, error logs, audit trails. If you are writing to a standard SATA SSD (or heaven forbid, a spinning-disk network mount), your I/O wait (iowait) will skyrocket and block NGINX worker processes. That surfaces as timeouts and 5xx errors even when CPU usage looks low.
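You can spot this condition from the shell; iostat is also part of the sysstat package:
# 'wa' is the share of CPU time spent waiting on I/O
$ vmstat 1 5
# Per-device write latency (w_await) and utilisation (%util)
$ iostat -x 1 5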
For high-load gateways, we strictly recommend local NVMe storage. Network-attached block storage (common in big public clouds) adds latency to every write operation. When we built the CoolVDS infrastructure in Oslo, we mandated local NVMe arrays specifically to eliminate this I/O blocking.
Comparison: Disk Latency Impact on API Requests
| Storage Type | Write Latency | Impact on Logging |
|---|---|---|
| Standard SATA SSD | ~0.5 - 2 ms | Acceptable for low traffic |
| Ceph / Network Storage | ~2 - 10 ms | High risk of blocking under load |
| CoolVDS Local NVMe | ~0.03 ms | Zero blocking |
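Rather than taking the table at face value, you can measure write latency on your own instance with fio (a separate package; this run writes a temporary 256 MB test file in the current directory):
$ fio --name=logwrite --rw=randwrite --bs=4k --size=256M \
      --ioengine=libaio --iodepth=1 --direct=1 --runtime=30 --time_based
# Compare the 'clat' (completion latency) percentiles in the output against the table above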
5. Local Compliance & Data Sovereignty
Technical tuning is useless if you are legally non-compliant. With the Schrems II ruling shaking up the industry last year, relying on US-owned gateways presents a compliance risk for Norwegian companies handling sensitive user data.
By hosting your API Gateway in Norway, you solve two problems:
- Latency: Physics applies. Oslo to Oslo is faster than Oslo to Frankfurt.
- GDPR: Keeping data within Norwegian jurisdiction satisfies Datatilsynet requirements more easily than explaining Standard Contractual Clauses (SCCs).
Final Thoughts
Performance isn't magic; it's configuration and hardware. If you tune your kernel for BBR, configure NGINX upstream keepalives, and ensure you aren't fighting for CPU cycles on a crowded host, you will see a massive drop in latency.
Don't let your infrastructure be the reason your microservices fail. Deploy a test gateway on a CoolVDS NVMe instance today and run your own wrk benchmarks. The difference usually shows up in the very first run.
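A baseline run against a health endpoint is enough to compare before/after p99 numbers. wrk is a separate install, and the /healthz path is a placeholder; adjust threads, connections, and the URL to your own setup:
$ wrk -t4 -c200 -d30s --latency https://api.coolvds.com/healthz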