API Gateway Bottlenecks: NGINX Tuning for Sub-10ms Latency in High-Traffic Microservices

If you are routing API traffic through a default NGINX configuration in 2018, you are essentially driving a Ferrari in first gear. I recently audited a payment processing cluster for a fintech client in Oslo. Their architecture was sound—microservices in Docker, REST APIs everywhere—but their latency was hovering around 150ms. For a local service. That is unacceptable.

The culprit wasn't the application logic. It was the gateway. The kernel was dropping packets, and the SSL handshake overhead was murdering the CPU. In the Nordic market, where internet infrastructure is top-tier, users expect near-instant responses. If your Norwegian VPS is adding 50ms of overhead just to terminate TLS, you are failing.

Let's fix this. We are going to look at kernel-level tuning, NGINX upstream keepalives, and why underlying hardware virtualization (KVM vs. Containers) makes or breaks your p99 latency.

1. The OS Layer: Open File Descriptors and Backlog

Before touching NGINX, look at your Linux kernel. By default, most distributions (Ubuntu 16.04, CentOS 7) ship with conservative limits designed for desktop usage, not high-concurrency API gateways.

When a sudden spike of traffic hits—say, a marketing push at 9:00 AM—the kernel's TCP backlog fills up. New connections are silently dropped. Your logs won't show it, but your clients will time out. You need to increase somaxconn and the file descriptor limits.
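
You can confirm whether this is already happening before you change anything. A quick check on most distributions (the exact counter wording varies slightly between kernel versions):

# How many times the kernel has overflowed or dropped from a listen queue since boot
netstat -s | grep -i listen

# For listening sockets, Send-Q shows the configured backlog and Recv-Q the current queue depth
ss -lnt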

Edit /etc/sysctl.conf:

# Maximize the backlog of incoming connections
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535

# Increase the range of ephemeral ports
net.ipv4.ip_local_port_range = 1024 65535

# Reuse sockets in TIME_WAIT state (use with caution, but essential for heavy API traffic)
net.ipv4.tcp_tw_reuse = 1

Apply these changes with sysctl -p. Next, ensure NGINX can actually open enough files. Set the limits in /etc/security/limits.conf:

* soft nofile 65535
* hard nofile 65535
root soft nofile 65535
root hard nofile 65535
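
Raising the OS limits only helps if NGINX is told to use them. A minimal sketch of the matching nginx.conf directives, with illustrative values you should size to your own worker count and traffic:

# Main context
worker_processes auto;

# Let each worker open as many file descriptors as the OS now allows
worker_rlimit_nofile 65535;

events {
    # Per-worker connection ceiling; bounded by worker_rlimit_nofile
    worker_connections 16384;
}

# Inside the http {} block: ask the kernel for a listen backlog that matches net.core.somaxconn
server {
    listen 443 ssl backlog=65535;
    # ... rest of the server block ...
}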

2. NGINX Upstream Keepalives: The Silent Killer

This is the most common mistake I see. By default, NGINX acts as a reverse proxy that opens a new connection to your backend service (Node.js, Go, Python) for every single request, then closes it. This causes port exhaustion and wastes CPU cycles on TCP handshakes.

For an API gateway, you must use keepalive connections to the upstream.

upstream backend_api {
    server 10.0.0.5:8080;
    server 10.0.0.6:8080;

    # Keep 100 idle connections open per worker process
    keepalive 100;
}

server {
    location /api/ {
        proxy_pass http://backend_api;

        # REQUIRED: HTTP/1.1 is needed for keepalive
        proxy_http_version 1.1;
        
        # Clear the connection header to prevent the backend from closing it
        proxy_set_header Connection "";
        
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Pro Tip: If you are using CoolVDS, our internal networks support Jumbo Frames. If your backend servers are on the same private LAN, ensure MTU is optimized. But even without that, the keepalive directive alone can drop internal latency by 20-30ms per request.
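
To confirm the keepalives are actually being reused, watch the gateway's connection table. A rough check, assuming the backend at 10.0.0.5:8080 from the upstream block above:

# A healthy gateway shows a small, stable pool of established connections per backend
ss -tn state established dst 10.0.0.5

# Without keepalives you instead see thousands of short-lived sockets stuck in TIME_WAIT
ss -tn state time-wait dst 10.0.0.5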

3. TLS Optimization & The "Schrems" Factor

With GDPR going into full enforcement next month (May 2018), encryption is non-negotiable. However, the handshake is expensive. You cannot afford to negotiate a full handshake for every API call.

You need to enable SSL Session Caching. This allows clients to reuse the SSL parameters from a previous connection, drastically reducing CPU usage on your gateway.

ssl_session_cache shared:SSL:20m;
ssl_session_timeout 180m;

# Only allow TLS 1.2; SSLv3 and TLS 1.0/1.1 are disabled
ssl_protocols TLSv1.2;

# prioritize server ciphers
ssl_prefer_server_ciphers on;
ssl_ciphers 'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384...';

Note: While TLS 1.3 was approved by the IETF last month, OpenSSL support is still bleeding edge. Stick to tuned TLS 1.2 for production environments in 2018.
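
You can verify that session reuse is working from any machine with the OpenSSL client installed; api.example.com below is a placeholder for your gateway's hostname:

# Performs one full handshake, then reconnects several times with the cached session.
# After the first "New" line, the subsequent connections should report "Reused".
openssl s_client -connect api.example.com:443 -tls1_2 -reconnect < /dev/null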

4. The Hardware Reality: Why Virtualization Type Matters

Here is the uncomfortable truth: You can tune NGINX until you are blue in the face, but if your host node is oversubscribed, your API will stutter.

Many budget providers use container-based virtualization (like OpenVZ). In those environments, you are sharing the kernel with hundreds of other users. If a neighbor gets DDoS'd, your sysctl settings effectively mean nothing because the host kernel is choked.

This is why specific architectural choices matter:

  • KVM (Kernel-based Virtual Machine): Gives you a dedicated kernel. Your TCP stack is yours. CoolVDS uses KVM exclusively for this reason.
  • NVMe Storage: API Gateways generate massive logs (access logs, error logs). Writing these to a spinning HDD or a SATA SSD blocks the I/O thread. NVMe queues are deep enough to handle logging without blocking the request processing.

5. Logging and GDPR Compliance (Datatilsynet)

The Norwegian Data Protection Authority (Datatilsynet) is clear: minimize PII. If you are logging IP addresses in your API gateway access logs, you are storing personal data.

To prepare for the May 25th deadline, you should mask IP addresses in your NGINX logs immediately. Truncating the last octet preserves your ability to debug geo-location issues without violating privacy statutes.

map $remote_addr $ip_anonymized {
    default 0.0.0.0;
    "~(?P<ip>(\d+)\.(\d+)\.(\d+))\.\d+" $ip;
    "~(?P<ip>[^:]+:[^:]+):" $ip;
}

log_format private '[$time_local] $ip_anonymized "$request" $status $body_bytes_sent';

access_log /var/log/nginx/access.log private buffer=32k flush=5s;

Notice the buffer=32k flush=5s? That prevents NGINX from writing to the disk for every single request. It buffers logs in RAM and flushes them in batches. This is critical for low latency operations.
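
One more logging optimization worth considering: if a load balancer hits a health-check endpoint every few seconds, those requests can dominate your log volume. A small sketch using the if= parameter of access_log, with /healthz as a hypothetical health-check path:

map $request_uri $loggable {
    default   1;
    /healthz  0;   # adjust to your own health-check path
}

# The earlier access_log line, extended with the condition
access_log /var/log/nginx/access.log private buffer=32k flush=5s if=$loggable;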

Conclusion

High-performance API gateways require a holistic approach. You need the kernel to accept connections, NGINX to maintain persistence with backends, and a file system that doesn't block on writes.

But fundamentally, software cannot fix bad hardware. If you are running mission-critical APIs, you need dedicated resources. Don't let a "noisy neighbor" on a cheap container platform ruin your SLA.

Ready to test your tuned NGINX config? Deploy a KVM-based, NVMe-powered instance on CoolVDS today. Spool up time is under 55 seconds.