API Gateway Performance Tuning: Why Your 502 Errors Are a Config Problem, Not a Code Problem

It is 2017, and the monolith is dying. Everyone is rushing to break their applications into microservices. It sounds great in the boardroom until you realize you’ve just replaced function calls (nanoseconds) with network calls (milliseconds). Suddenly, your frontend is waiting on an aggregation of five different internal APIs, and your latency metrics are bleeding red.

I recently audited a setup for a client in Oslo. They were running a perfectly decent Node.js stack, but their API Gateway (Nginx reverse proxy) was choking at 400 requests per second. The hardware wasn't the issue—they had plenty of RAM. The issue was the default configuration of their Linux kernel and Nginx, which assumes it's still 1999.

If you are serving traffic to the Norwegian market, you cannot afford the overhead of bad configuration. Latency to NIX (Norwegian Internet Exchange) should be under 5ms. If your gateway adds 50ms of processing time, you are wasting the advantage of local hosting.

1. The "File Descriptor" Trap

Every connection to your API gateway is a file. In Linux, everything is a file. By default, many distributions ship with a limit of 1024 open files per user. If you have 2000 concurrent users trying to hit your API, half of them are going to hit a wall. You won't see this in your application logs; you'll see it in dmesg or just vague connection timeouts.
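
A quick way to confirm whether the limit is actually biting (a sketch; the PID file path assumes a stock package install of Nginx):

# Check the limits applied to the running Nginx master process
cat /proc/$(cat /var/run/nginx.pid)/limits | grep "open files"

# Nginx reports the failure as "Too many open files" in its error log
grep -i "too many open files" /var/log/nginx/error.log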

Before you touch Nginx, you must raise the system limits. This isn't optional.

# Check current limits
ulimit -n

# Edit /etc/security/limits.conf
# Add the following lines (the * wildcard does not cover root, hence the explicit entries):

*       soft    nofile  65535
*       hard    nofile  65535
root    soft    nofile  65535
root    hard    nofile  65535

Pro Tip: On systemd-based systems (like CentOS 7 or Ubuntu 16.04), limits.conf can be ignored for services that systemd starts. If you run Nginx as a systemd unit, override the limit in the unit file as well.
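
A minimal sketch of such an override, assuming Nginx runs as the standard nginx.service unit:

# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=65535

Run systemctl daemon-reload and restart Nginx so the new limit takes effect.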

2. Tuning the TCP Stack in sysctl.conf

The default TCP stack is conservative. It is designed to save memory on 512MB RAM machines, not to handle high-throughput API traffic. When acting as a gateway, your server is opening thousands of ephemeral ports to talk to upstream services.
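
Before you tune anything, it is worth checking how many sockets are already parked in TIME_WAIT on a busy gateway (a quick sketch using ss from iproute2; netstat works on older boxes):

# Count sockets stuck in TIME_WAIT (subtract one for the header line)
ss -tan state time-wait | wc -l

# Equivalent check with netstat
netstat -tan | grep -c TIME_WAIT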

You need to allow the kernel to reuse these ports quickly. Otherwise sockets pile up in TIME_WAIT, the ephemeral port range is exhausted, and the gateway can no longer open new connections to its upstreams. Here is the /etc/sysctl.conf configuration I deploy on every production CoolVDS instance:

# /etc/sysctl.conf

# Maximize the backlog of incoming connections
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535

# Allow reusing sockets in TIME_WAIT state for new connections
net.ipv4.tcp_tw_reuse = 1

# Increase the range of ephemeral ports
net.ipv4.ip_local_port_range = 1024 65535

# Increase TCP buffer sizes for 10Gbps networks (common in modern datacenters)
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Enable TCP Fast Open (if your kernel > 3.7 supports it)
net.ipv4.tcp_fastopen = 3

Apply these with sysctl -p. If you are on a shared hosting provider using OpenVZ, some of these will fail: the container shares the host's kernel with your noisy neighbors, and you cannot modify it. This is exactly why we use KVM at CoolVDS—you get your own kernel, so you can tune the stack exactly how you need it.
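
Once applied, a quick spot check confirms the kernel accepted the values (query whichever keys you care about):

# Reload /etc/sysctl.conf and print each value as it is applied
sysctl -p

# Verify individual keys afterwards
sysctl net.core.somaxconn
sysctl net.ipv4.tcp_tw_reuse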

3. Nginx: The Upstream Keepalive

This is the most common mistake I see in 2017. Nginx, by default, speaks HTTP/1.0 to upstream servers and closes the connection after every request. If your API gateway talks to a backend microservice, it performs a full TCP handshake (and SSL handshake if internal HTTPS is used) for every single request.

This burns CPU and adds significant latency. You must enable keepalive connections to your upstreams.

http {
    upstream backend_api {
        server 10.0.0.5:8080;
        server 10.0.0.6:8080;
        
        # Keep 64 idle connections open to the backend
        keepalive 64;
    }

    server {
        location /api/ {
            proxy_pass http://backend_api;
            
            # REQUIRED for keepalive to work
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            
            # Buffering tweaks
            proxy_buffers 16 16k;
            proxy_buffer_size 32k;
        }
    }
}

Setting proxy_set_header Connection ""; is critical. If you don't clear it, Nginx sends a "Connection: close" header to the backend by default, and the backend tears the connection down after every request anyway.

4. SSL/TLS: Performance vs. Security

With the impending GDPR regulations causing headaches across Europe, everyone is moving to HTTPS everywhere. However, SSL handshakes are expensive. To maintain performance without sacrificing security, you need to implement Session Resumption and OCSP Stapling.

Ensure your Nginx build (check with nginx -V) is linked against OpenSSL 1.0.2 or higher to support ALPN (required for HTTP/2).
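
A quick way to check, assuming a reasonably standard build:

# nginx -V prints to stderr, hence the redirect
nginx -V 2>&1 | grep -oE "OpenSSL [0-9.]+[a-z]?"

# Confirm HTTP/2 support was compiled in
nginx -V 2>&1 | grep -o "with-http_v2_module"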

server {
    listen 443 ssl http2;
    server_name api.yourdomain.no;

    ssl_certificate /etc/letsencrypt/live/api.yourdomain.no/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.no/privkey.pem;

    # Cache SSL Sessions to avoid full handshakes
    ssl_session_cache shared:SSL:10m; # 10MB can hold ~40,000 sessions
    ssl_session_timeout 24h;

    # OCSP Stapling
    ssl_stapling on;
    ssl_stapling_verify on;
    resolver 8.8.8.8 8.8.4.4 valid=300s;
    resolver_timeout 5s;
    
    # Modern Cipher Suite (2017 Best Practice)
    ssl_ciphers 'ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384';
}

5. The Infrastructure Factor: Why IOPS Matter

You can have the most optimized Nginx config in the world, but if your logs are writing to a slow disk, your request thread blocks. In an API Gateway, logging is often the hidden bottleneck. High-traffic APIs generate massive access logs.
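
One mitigation that helps regardless of disk speed is to buffer log writes instead of hitting the disk on every request (a sketch using the standard access_log buffer and flush parameters; adjust the path and log format to your setup):

# Collect up to 64k of log lines in memory, flush at least every 5 seconds
access_log /var/log/nginx/api_access.log combined buffer=64k flush=5s;

Buffering softens the blow, but it does not fix slow storage underneath it.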

We see this constantly with developers migrating from shared hosting to VPS. They assume "SSD" is enough. It isn't. Standard SATA SSDs queue up under heavy random write loads (like logging thousands of requests per second).

This is where the hardware underlying the virtualization matters. At CoolVDS, we use NVMe drives exclusively for our storage backend. NVMe talks directly to the CPU via the PCIe bus, bypassing the SATA controller bottleneck. For a high-throughput API gateway logging to disk (or buffering to a local agent like Logstash), the difference is not subtle.
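
If you want to measure the difference yourself, a rough sketch with fio (parameters are illustrative, not a rigorous benchmark) mimics the small random writes a busy access log produces:

# 4k random writes, direct I/O, 4 parallel jobs, 60 seconds
fio --name=logwrite --rw=randwrite --bs=4k --size=1g --numjobs=4 \
    --iodepth=32 --ioengine=libaio --direct=1 --runtime=60 \
    --time_based --group_reporting

Compare the IOPS and completion latencies on a SATA SSD and on an NVMe-backed volume.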

Benchmarking Your Config

Don't take my word for it. Use wrk to stress test your endpoint. If you are on a Mac, brew install wrk. On Linux, build it from source.

# Run a test with 12 threads and 400 connections for 30 seconds
wrk -t12 -c400 -d30s https://api.yourdomain.no/healthcheck

If you see a high standard deviation in latency, the hypervisor is likely stealing CPU time from your guest (check the st column in top) or your I/O is blocked. High steal time is a symptom of overcrowded host nodes—a practice we strictly ban at CoolVDS to ensure your latency remains flat.
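
A quick way to watch for steal while the load test runs (vmstat ships with procps on practically every distro):

# Sample CPU stats every second for ten seconds; "st" is steal time
vmstat 1 10

# Or grab the CPU summary line from top in batch mode
top -b -n 1 | grep "Cpu(s)"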

Summary

Tuning an API Gateway in 2017 requires looking beyond the application code. It requires a holistic view of the Linux kernel, the Nginx configuration, and the physical hardware limitations.

  1. Increase file descriptor limits.
  2. Tune the TCP stack to recycle connections.
  3. Enable upstream keepalive to stop handshake overhead.
  4. Use HTTP/2 and SSL Session caching.
  5. Host on NVMe.

Data privacy laws like the upcoming GDPR are forcing data to stay local. But local shouldn't mean slow. If you need low-latency infrastructure in Norway that respects your need for raw performance, spin up a CoolVDS instance. We provide the raw power; you provide the code.