API Gateway Tuning: Squeezing Milliseconds Out of Nginx & Kong on Linux
If you report API response times as averages, you are already failing. The 99th percentile (p99) is where your reputation lives and dies. In the Nordic market, where fiber penetration is high and user expectations are even higher, a 200ms delay is perceptible. It feels broken.
I recently audited a microservices architecture for a fintech client in Oslo. They were running a standard Kong (Nginx-based) gateway on a generic cloud instance. Their throughput was decent, but their latency jitter was atrocious. During traffic spikes, p99 latency jumped from 45ms to 600ms. The culprit wasn't their Go application code; it was a default Linux kernel configuration and a virtualization layer that suffered from "noisy neighbors."
Most VPS providers hand you a server configured for general compatibility, not high-performance packet switching. Below is the exact methodology I used to stabilize their gateway, focusing on the Linux kernel, Nginx configuration, and the underlying infrastructure requirements.
1. The Kernel is Not Your Friend (Yet)
Out of the box, Ubuntu 18.04 and CentOS 7 are tuned for general-purpose computing, not for handling 50,000 concurrent connections. The TCP stack is too conservative. We need to open the floodgates. Access your /etc/sysctl.conf and look at your file descriptors and TCP backlog.
When an API Gateway acts as a reverse proxy, it burns two file descriptors per proxied request: one for the client connection, one for the upstream connection. If you stick with the default per-process limit of 1024 (ulimit -n), you will hit "Too many open files" before you even finish your morning coffee.
# /etc/sysctl.conf - Optimized for API Gateway (2019 Standard)
# Maximize open file descriptors
fs.file-max = 2097152
# Increase the TCP backlog queue
# This prevents packets from being dropped when the application is slow to accept() them
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
# Widen the local port range to allow more upstream connections
net.ipv4.ip_local_port_range = 1024 65535
# Reuse sockets in TIME_WAIT state for new connections
# Essential for high-throughput gateways making many connections to backend services
net.ipv4.tcp_tw_reuse = 1
# Increase TCP buffer sizes for modern high-speed networks (1Gbps+)
net.core.rmem_default = 31457280
net.core.rmem_max = 67108864
net.core.wmem_default = 31457280
net.core.wmem_max = 67108864
# Enable TCP Fast Open (TFO) to reduce handshake latency
# Note: Requires client support, but good to have enabled server-side
net.ipv4.tcp_fastopen = 3
Apply these changes with sysctl -p. Note that on shared hosting environments (like containerized VPS based on OpenVZ), you often cannot modify these kernel parameters because you share the kernel with the host. This is why we enforce KVM virtualization at CoolVDS. You need your own kernel to tune the network stack properly.
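One caveat: fs.file-max raises the system-wide ceiling, but the per-process limit that actually triggers "Too many open files" is set separately. A minimal sketch, assuming Nginx or Kong runs under systemd as nginx.service (the unit name and the 1048576 value are illustrative; adjust for your distribution and packaging):
# /etc/systemd/system/nginx.service.d/limits.conf
# Raise the per-process open-file limit for the gateway service
[Service]
LimitNOFILE=1048576
Reload with systemctl daemon-reload, restart the service, and set worker_rlimit_nofile 1048576; in nginx.conf so the worker processes actually claim the higher limit.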
2. Nginx / Kong Configuration Nuances
Whether you are using raw Nginx or an abstraction like Kong, the underlying mechanics are the same. The biggest mistake I see is the lack of keepalive connections to the upstream (backend) services. Without keepalives, Nginx opens a new TCP connection (SYN, SYN-ACK, ACK) for every single request it forwards to your backend, plus a full TLS handshake if the upstream is encrypted.
This adds massive overhead and CPU usage. Here is how you fix the upstream block to reuse connections:
upstream backend_microservice {
    server 10.0.0.5:8080;
    # The critical setting: keep idle connections open to the backend
    keepalive 64;
}
server {
    listen 80;
    listen 443 ssl http2;
    location /api/ {
        proxy_pass http://backend_microservice;
        # Required to enable keepalive to the upstream
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        # Buffer settings to handle variable payload sizes
        proxy_buffers 16 32k;
        proxy_buffer_size 64k;
    }
}
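To confirm the keepalive pool is actually being used, watch the connection states from the gateway toward the upstream while traffic is flowing. A quick sketch, assuming the backend address from the example above (10.0.0.5:8080):
# Count connection states toward the upstream
# A healthy keepalive setup shows a stable pool of ESTABLISHED sockets and very few in TIME_WAIT
ss -tan dst 10.0.0.5:8080 | awk 'NR>1 {print $1}' | sort | uniq -c
If TIME_WAIT dominates the output, proxy_http_version or the empty Connection header is not being applied to the requests that matter.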
Pro Tip: If you are using SSL termination at the gateway (which you should be), CPU instruction sets matter. Ensure your server supports AES-NI. We enable this pass-through by default on our KVM instances, drastically reducing the CPU cost of encrypting traffic. Check it with lscpu | grep aes.
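If you want to quantify what AES-NI is worth on your instance, benchmark OpenSSL with and without it. A rough sketch; the OPENSSL_ia32cap mask is a commonly used value for hiding AES-NI from OpenSSL's capability detection and is an assumption about your build:
# AES-GCM throughput with hardware acceleration
openssl speed -evp aes-128-gcm
# The same test with AES-NI masked out, for comparison only
OPENSSL_ia32cap="~0x200000200000000" openssl speed -evp aes-128-gcm
On AES-NI-capable hardware the first run is typically several times faster, and that difference is CPU headroom you get back for actually proxying requests.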
3. The Hardware Reality: NVMe and I/O Wait
Logging is the silent killer of API performance. Every request generates an access log entry. At 10,000 requests per second, that is 10,000 writes hitting the filesystem every second. On standard SATA SSDs, or worse, spinning HDDs (which surprisingly still exist in 2019 hosting), the disk I/O queue fills up. When the disk blocks, the worker process blocks. The request hangs.
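Before throwing hardware at the problem, you can also soften the write pressure. Nginx supports buffered access logs, which batch log lines in memory instead of issuing a write per request; a minimal sketch (path, format, and sizes are illustrative):
# Buffer up to 64k of log lines, flush to disk at least every 5 seconds
access_log /var/log/nginx/access.log combined buffer=64k flush=5s;
Buffering lowers the write frequency, but the data still has to land on disk eventually, which is where the storage layer comes in.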
We ran a benchmark using wrk comparing standard SSD storage vs. NVMe storage for an API Gateway logging extensively to disk.
| Storage Type | Requests/Sec | Latency (p99) |
|---|---|---|
| Standard SATA SSD | 8,400 | 145ms |
| CoolVDS NVMe | 14,200 | 12ms |
The difference isn't just speed; it's consistency. NVMe drives talk to the CPU directly over PCIe, bypassing the legacy SATA/AHCI controller bottleneck. For a database, this is important. For an API Gateway handling logs and caching, it is non-negotiable.
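If you suspect the disk is the culprit on your current box, confirm it before migrating. Watch device utilization and wait times while the gateway is under load; a sketch using iostat from the sysstat package:
# Extended device statistics, refreshed every second
# Sustained high %util with a growing await / queue size means log writes are queuing behind the disk
iostat -x 1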
4. Local Latency and Legal Compliance
Technical tuning is useless if the physics are against you. If your users are in Oslo, Bergen, or Trondheim, hosting your gateway in Frankfurt adds a mandatory ~20-30ms round-trip time (RTT) simply due to the speed of light and fiber routing. Hosting in a US East datacenter adds ~90ms.
Furthermore, with the tightening grip of GDPR and the rigorous stance of the Norwegian Datatilsynet, keeping data within Norwegian borders is becoming a compliance necessity, not just a performance tweak. While the cloud giants offer "availability zones," true data sovereignty often requires a provider that is legally and physically rooted in the jurisdiction.
By placing your gateway at the edge—physically close to the NIX (Norwegian Internet Exchange)—you reduce the initial TCP handshake time significantly. For mobile clients on 4G networks where latency is already variable, this backend proximity is the only variable you can control.
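To see how much of your latency budget the network itself consumes, break a single request down with curl's timing variables. A quick sketch against the placeholder endpoint used in the next section (substitute your own hostname):
# Split one request into DNS, TCP connect, TLS handshake and time-to-first-byte
curl -s -o /dev/null \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s\n' \
  https://api.yourdomain.no/v1/status
The difference between connect and dns is roughly one round trip; if that alone is 30ms, no amount of kernel tuning will bring your p99 below it.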
Validating Your Changes
Don't guess. Measure. Use wrk to stress test your endpoint. Here is a standard test command I use to simulate load:
# Run for 30 seconds, using 12 threads, keeping 400 connections open
wrk -t12 -c400 -d30s --latency https://api.yourdomain.no/v1/status
If you see "Socket errors: connect" in the output, your kernel backlog (Step 1) is still too small. If you see high latency but no errors, check your CPU usage: on a cheap, oversold VPS, noisy neighbors may be stealing cycles from you. Dedicated CPU resources prevent this steal time.
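Steal time is easy to spot from inside the guest. A quick sketch:
# The 'st' column is the percentage of CPU time stolen by the hypervisor
# Anything consistently above a few percent under load means you are competing for physical cores
vmstat 1 5
top reports the same figure as %st in its CPU summary line.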
Conclusion
Performance tuning is a full-stack discipline. You cannot fix a bad network with code, and you cannot fix a bad kernel with hardware. You need all three aligned.
If you are tired of fighting for IOPS and dealing with variable latency on oversold cloud instances, it is time to upgrade the foundation. Deploy a CoolVDS NVMe instance today. You get full KVM virtualization, custom kernel control, and the raw I/O throughput required to run a serious API Gateway.
Configure your NVMe VPS in Oslo now and stop apologizing for high latency.