Stop Blaming Your Backend: It's Your Gateway Choking on Handshakes
I recently audited a deployment for a fintech based here in Oslo. They were bleeding money, not because of bad investment algorithms, but because their payment API had a p99 latency of 400ms. In the world of high-frequency trading and instant settlements, that is an eternity. They blamed their Java backend. They blamed the database.
They were wrong.
The bottleneck was their API Gateway. It was a standard Nginx reverse proxy running on a default Linux kernel, choking on TCP connection establishment. We dropped that latency to 45ms without touching a single line of Java code.
If you are running microservices in 2020 without tuning your gateway layer, you are effectively throwing compute power into a black hole. Here is how we fix it, focusing on the Linux kernel, Nginx/Kong configuration, and why the underlying hardware (specifically the virtualization type) makes or breaks your metrics.
1. The Linux Kernel is Not Tuned for Web Traffic
Most Linux distributions, even the trusty Ubuntu 18.04 LTS, ship with network settings designed for general-purpose computing, not for handling 10,000 concurrent connections per second. Before you even touch Nginx, you need to fix the OS.
Open your /etc/sysctl.conf. We need to widen the TCP ephemeral port range and allow sockets stuck in TIME_WAIT to be reused. If you skip this, your gateway will exhaust its ephemeral ports during traffic spikes as connections pile up in TIME_WAIT.
# /etc/sysctl.conf
# Increase system file descriptor limit
fs.file-max = 2097152
# Widen the port range
net.ipv4.ip_local_port_range = 1024 65535
# Allow sockets in TIME_WAIT to be reused for new outbound connections
net.ipv4.tcp_tw_reuse = 1
# Increase the maximum number of connections in the backlog
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Increase the read/write buffer sizes for handling larger payloads
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
Apply these with sysctl -p. This is the difference between your server riding out a DDoS attack and falling over during a marketing campaign.
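To confirm the new values actually took effect, a quick sanity check from the shell is enough. This is a minimal sketch using standard tools (sysctl and ss); the parameters queried are simply the ones set above.
# Reload kernel parameters from /etc/sysctl.conf
sudo sysctl -p
# Confirm the values the kernel is now using
sysctl net.core.somaxconn net.ipv4.ip_local_port_range net.ipv4.tcp_tw_reuse
# Rough health check: how many sockets are currently stuck in TIME_WAIT?
ss -s | grep -i timewait
Watching the TIME_WAIT count under load tells you whether the port range and reuse settings are keeping pace with your traffic.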
2. The "Keepalive" Mistake in Nginx
Whether you are using raw Nginx or Kong (which is built on OpenResty/Nginx), this is the most common error I see. By default, Nginx talks to your upstream services over HTTP/1.0 and treats every request as a one-off: it opens a connection, sends the request, reads the response, and closes the connection.
This means for every single API call, your gateway is performing a full TCP handshake (SYN, SYN-ACK, ACK) with your backend service. If you have SSL enabled internally, add a TLS handshake to that. This destroys performance.
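If you want to watch this happening before you change anything, a packet capture on the gateway makes the churn obvious. A rough sketch, assuming the upstream listens on port 8080 and traffic leaves via eth0 (substitute your own interface and port):
# Show every fresh SYN the gateway sends toward the upstream.
# With default proxy settings you will see one new handshake per proxied request.
sudo tcpdump -ni eth0 'tcp[tcpflags] & tcp-syn != 0 and dst port 8080'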
You must enable upstream keepalives. This keeps the pipe open between the Gateway and your Microservices.
# nginx.conf (or your site config)
upstream backend_api {
    server 10.0.0.5:8080;

    # The critical setting: cache up to 64 idle connections per worker to this upstream
    keepalive 64;
}

server {
    location /api/ {
        proxy_pass http://backend_api;

        # Both directives are required for upstream keepalive to work
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
Pro Tip: If you are using Kong, check your upstream_keepalive_pool_size in kong.conf. The default is often too low for high-throughput environments. Bump it up to at least 60.
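For reference, the Kong side looks roughly like this. A minimal sketch, assuming Kong 2.x (where the property is named upstream_keepalive_pool_size); the pool size is illustrative, not a recommendation:
# kong.conf
# Idle upstream connections Kong keeps cached per worker, per upstream pool
upstream_keepalive_pool_size = 128
Either way, you can verify on the gateway host that connections to the upstream stay ESTABLISHED between requests instead of being torn down (10.0.0.5:8080 is the example upstream from the Nginx config above):
ss -tan state established '( dport = :8080 )'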
3. SSL/TLS Optimization
Encryption is expensive. In 2020, we are seeing a massive shift toward "Zero Trust" networks, meaning even internal traffic is encrypted. If your gateway is terminating SSL, you need to optimize the buffer size.
Nginx defaults to a 16k buffer. This is fine for large file transfers, but for APIs returning small JSON objects, it adds latency because Nginx waits to fill the buffer before sending.
# Optimize for Time To First Byte (TTFB)
ssl_buffer_size 4k;
Reducing this ensures that the client starts receiving data sooner. It saves milliseconds, and in the API world, milliseconds accumulate.
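It is worth measuring rather than assuming. A quick before-and-after check using curl's standard timing variables (the URL is a placeholder for one of your own endpoints):
# Connection, TLS handshake, and time-to-first-byte for a representative API call
curl -o /dev/null -s -w "connect: %{time_connect}s  tls: %{time_appconnect}s  ttfb: %{time_starttransfer}s\n" \
    https://api.example.com/v1/health
Run it a handful of times and compare the time_starttransfer figures with the old and new ssl_buffer_size.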
4. The Hardware Reality: Why Your "Cloud" Might Be Lying
You can apply all the configs above, but if your underlying disk I/O is garbage, your API gateway will still lag. Why? Logging.
Every request generates access logs and error logs. High-throughput gateways write to disk constantly. On standard cloud providers using shared HDD or even standard SSDs with