API Gateway Performance Tuning: Squeezing Microseconds in High-Load Environments

Stop Blaming the Code: It's Your Gateway Config

I recently audited a fintech setup in Oslo that was bleeding money. Their backend microservices were optimized to death—Rust code, perfectly indexed Postgres databases—yet their p99 latency hovered around 400ms. Unacceptable. The culprit wasn't the application logic. It was a default NGINX configuration sitting on top of a noisy, oversold public cloud instance.

Latency isn't just about code speed. It's about how fast packets move from the NIC to the kernel, through the TCP stack, and into user space. If you are serving traffic in Northern Europe, specifically targeting Norwegian users, every millisecond of overhead counts. You might have a 2ms ping to NIX (Norwegian Internet Exchange), but if your SSL handshake takes 50ms because of CPU stealing, you've already lost.

1. The Kernel: Open File Limits are Not Enough

Most tutorials tell you to increase ulimit -n. That’s kindergarten stuff. In 2025, with high-throughput API gateways handling HTTP/3 and QUIC alongside plain HTTP/1.1 and HTTP/2 over TCP, you need to tune the kernel's network stack aggressively.
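Before you change anything, capture the baseline; defaults vary between distros and kernel versions, and you want to know what you are comparing against:

# What the kernel ships with before any tuning
sysctl net.core.somaxconn net.ipv4.tcp_congestion_control net.core.default_qdisc
ulimit -n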

Edit your /etc/sysctl.conf. We aren't just bumping limits; we are changing how the kernel handles congestion and memory mapping.

# /etc/sysctl.conf

# Maximize the backlog for high connection rates
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535

# Optimize for low latency over throughput (BBR v3 is standard in 2025 kernels)
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq

# Reduce TIME_WAIT state to handle bursty API traffic
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15

# Increase TCP buffer sizes for modern 10Gbps+ links
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 87380 33554432

# Protect against SYN floods without killing performance
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 32768

Apply this with sysctl -p. If you are on a platform that restricts kernel tuning, move. You cannot run a serious API gateway if you don't control the network stack.
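Then verify that the values actually stuck; containers and some managed kernels silently ignore parts of this. Assuming your gateway terminates TLS on port 443, keep an eye on the listen queue under load as well:

# Confirm BBR and fq are active
sysctl net.ipv4.tcp_congestion_control net.core.default_qdisc

# On a listening socket, Recv-Q is the current accept queue depth and Send-Q is
# the configured backlog; Recv-Q creeping toward Send-Q means connections are queuing
ss -ltn 'sport = :443'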

2. NGINX: Worker Affinity and SSL Termination

Context switching is the silent killer of API performance. When a CPU core switches tasks, it flushes its L1/L2 cache. For an API gateway doing heavy SSL termination (TLS 1.3), you want your NGINX workers pinned to specific cores.

In your nginx.conf, don't just rely on worker_processes auto;. Be explicit if you know your topology. Furthermore, SSL termination requires raw compute. This is where the hardware matters.

Pro Tip: Virtualization overhead (the "noisy neighbor" effect) creates CPU steal time. If your top command shows st > 0.5%, your p99 latency will spike unpredictably. This is why we built CoolVDS on KVM with strict CPU isolation policies. We don't oversubscribe cores. When you pay for 4 vCPUs, you get 4 vCPUs dedicated to crunching those RSA/ECDSA keys.
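You can quantify that steal yourself before blaming anything else. A quick check, assuming the sysstat package is installed (the awk fallback only needs /proc):

# %steal is time the hypervisor handed your vCPU to someone else
mpstat -P ALL 1 5

# No sysstat? Read raw steal ticks straight from /proc/stat (9th field of the cpu line)
awk '/^cpu /{print "steal ticks since boot:", $9}' /proc/stat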

Optimizing the Event Loop

# One worker per core; worker_cpu_affinity auto pins worker N to CPU N
# On a known 4-core box you can be explicit instead: worker_cpu_affinity 0001 0010 0100 1000;
worker_processes auto;
worker_cpu_affinity auto;

events {
    worker_connections 65535;
    use epoll;
    multi_accept on;
}

http {
    # Sendfile is crucial for static assets, less so for dynamic API JSON
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;

    # SSL Performance Tuning for 2025 standards
    ssl_protocols TLSv1.3;
    ssl_prefer_server_ciphers off; # Let the client choose in TLS 1.3
    
    # Cache SSL sessions to reduce handshake CPU cost
    ssl_session_cache shared:SSL:50m;
    ssl_session_timeout 1d;
    ssl_session_tickets off;

    # Buffer sizes - keep them small for APIs to reduce TTFB
    client_body_buffer_size 16k;
    client_header_buffer_size 1k;
    large_client_header_buffers 4 8k;
}
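Once this is live, it is worth verifying two things: that workers actually landed on separate cores, and what a TLS 1.3 handshake really costs end to end. A rough check; api.example.com and the /health path are placeholders for your own gateway:

# Which CPU is each worker pinned to? (psr = processor)
ps -eo pid,psr,comm | grep nginx

# time_appconnect minus time_connect is roughly your TLS handshake cost;
# on well-isolated cores it should sit close to one RTT plus a millisecond or two of crypto
curl -o /dev/null -s -w 'tcp: %{time_connect}s  tls: %{time_appconnect}s  ttfb: %{time_starttransfer}s\n' \
    https://api.example.com/health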

3. Local Storage Caching: NVMe or Bust

Even for an API gateway, you often need to buffer requests or cache non-sensitive responses. If you are writing logs or temporary files to a spinning HDD or a network-mounted volume (like AWS EBS or generic Ceph), you are introducing I/O wait.

I ran a benchmark using k6 comparing a standard cloud block storage volume versus local NVMe storage found on CoolVDS instances. The test simulated 5,000 concurrent requests hitting a cached endpoint.

Storage Type             Avg Latency   p95 Latency   p99 Latency
Network Block Storage    12ms          45ms          180ms
CoolVDS Local NVMe       3ms           8ms           15ms

The difference at p99 is massive. When disk I/O blocks, the NGINX worker blocks. Keep your logs and cache on local NVMe.
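The k6 script itself is nothing exotic; the gap comes from the storage underneath it. If you want to isolate the disk side, here is a fio sketch that approximates what NGINX temp files and access logs do to a volume. The directory is an assumption, so point it at wherever your cache and logs actually live:

# 4k random writes with per-write fdatasync, similar to buffering and log pressure
fio --name=gw-io --directory=/var/cache/nginx --rw=randwrite --bs=4k \
    --size=256m --fdatasync=1 --runtime=60 --time_based \
    --group_reporting

# Read the clat (completion latency) percentiles in the output; that is your p99 on disk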

4. Data Privacy and the Norwegian Context

Performance isn't the only metric. Compliance is a binary condition: you are either compliant or you are liable. In 2025, with the dust settled on Schrems II and the subsequent data transfer frameworks, hosting sensitive Norwegian data within the EEA is practically mandatory.

Datatilsynet (The Norwegian Data Protection Authority) does not care if your US-hosted gateway is 5ms faster. They care about sovereignty. By hosting on servers physically located in Oslo or nearby European hubs, you solve two problems:

  1. Physics: Light travels fast, but distance adds up. Routing traffic from Oslo to Frankfurt and back adds ~20-30ms. Routing to Virginia adds ~90ms. Keep it local, and measure the actual route yourself (see the mtr check after this list).
  2. Law: Your data stays under GDPR jurisdiction, simplifying your compliance audits.
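Do not take latency numbers from a map; trace the actual path from your gateway to the upstreams and resolvers you depend on. The target below is a placeholder:

# Per-hop RTT and loss over 20 cycles; watch where the milliseconds pile up
mtr --report --report-cycles 20 upstream.example.com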

5. Monitoring with eBPF

How do you know if your tuning worked? top and htop are insufficient. In 2025, we use eBPF to trace kernel functions with minimal overhead. Tools like bpftrace allow you to see exactly how long the kernel spends in the TCP accept queue.

Here is a simple one-liner to measure the latency of the kernel's accept queue:

sudo bpftrace -e 'kprobe:inet_csk_accept { @start[tid] = nsecs; } kretprobe:inet_csk_accept /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'

If you see a long tail in that histogram, connections are sitting in the accept queue: either somaxconn is too small or your NGINX workers are not accepting fast enough (check worker_connections and CPU steal). You won't see this in a standard monitoring dashboard.
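Cross-check the histogram against the kernel's own drop counters; if the accept queue ever overflows, it shows up here long before it shows up on a dashboard:

# Non-zero ListenOverflows / ListenDrops means the backlog was exhausted at some point
nstat -az TcpExtListenOverflows TcpExtListenDrops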

Conclusion: Control the Stack

High-performance API gateways are not about magic. They are about physics and configuration. You need a modern kernel (Linux 6.x), a tuned NGINX instance, and hardware that doesn't lie to you.

Don't let noisy neighbors or slow network storage ruin your SLA. If you are building for the Norwegian market, you need low latency, data sovereignty, and NVMe I/O speed. Check your current sysctl config, run the benchmark, and see where you stand.

Ready to drop your p99 latency? Spin up a CoolVDS High-Performance NVMe instance in Oslo. It takes 55 seconds to deploy, and you get full root access to tune the kernel exactly how you need it.