
Surviving Microservices: Implementing High-Availability Smart Proxies with HAProxy & Synapse

Everyone wants to be Netflix these days. Management reads a blog post about decoupling the monolith, and suddenly, my team is tasked with splitting a perfectly functional LAMP stack into twelve different services. The theory is sound: separation of concerns, independent scaling. But the reality? It's a networking nightmare.

When you replace in-process function calls with network calls, you give up reliability and pick up latency; every hop is a new place for things to fail. I've seen it happen: a single API endpoint hangs, the connection pool saturates, and the entire platform goes dark because of a timeout setting in a PHP script.

Traditional hardware load balancers are too rigid. DNS is too slow (TTL caching is the devil). In this guide, we are implementing what the folks at Airbnb are calling the "Smart Proxy" pattern using HAProxy and Synapse. It's the only way to manage service discovery dynamically without waking up at 3 AM.

The Architecture of Resilience

In a standard setup, App Server A talks to API Server B via a hardcoded IP or a VIP (Virtual IP). If Server B dies, you have to manually update the config or wait for a heartbeat mechanism to flip the VIP. Too slow.

We are going to run a local HAProxy instance on every single server. This local proxy handles all outbound traffic to other services. It effectively creates a "mesh" of connectivity (though we don't really have a name for this yet).
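
From the application's point of view, almost nothing changes except the address it dials. Instead of a remote IP baked into a config file, every outbound call goes to localhost, and the local proxy decides which healthy instance actually serves it. A rough illustration (the port and endpoint here are just placeholders):

# Before: the app calls a specific remote API server directly
curl http://10.0.0.5/users/42

# After: the app always calls its local proxy; HAProxy picks a live backend
curl http://127.0.0.1:8080/users/42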

The Stack

  • Zookeeper: The source of truth. It keeps track of which backends are alive (there is a quick registration sketch right after this list).
  • Synapse: A Ruby utility (released by Airbnb last year) that watches Zookeeper and dynamically reconfigures HAProxy.
  • HAProxy 1.5 (dev/stable): The workhorse. It routes traffic and handles retries.
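
Before wiring these together, it helps to see what "alive" actually means. Each running instance registers itself as an ephemeral znode under its service path; if the instance dies or loses its session, the znode vanishes and discovery updates itself. Here is a rough sketch using the stock zkCli.sh shell (in a real setup a registration agent does this for you, and the host/port JSON payload below is only my assumption of what the watcher expects):

# From the zkCli.sh shell, connected to one ensemble member:
#   zkCli.sh -server 10.0.0.2:2181

# Create the persistent service path once
create /services ""
create /services/user-api ""
create /services/user-api/instances ""

# Register this instance as an ephemeral node; it disappears automatically
# when the instance's ZK session dies, so dead servers drop out on their own
create -e /services/user-api/instances/10.0.0.5_80 {"host":"10.0.0.5","port":80}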

Step 1: The Foundation (Hardware Matters)

Before we touch config files, let's talk about the metal. Zookeeper is incredibly sensitive to disk latency. If your disk wait times spike, ZK nodes lose quorum, and your entire service discovery layer falls apart.
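
Two quick checks I run before blaming anything else: iostat for disk wait, and ZooKeeper's built-in four-letter-word commands for request latency. Roughly:

# Per-device utilisation and await, refreshed every 5 seconds
iostat -x 5

# Liveness check against the local ZK node -- should answer "imok"
echo ruok | nc localhost 2181

# Request latency and outstanding-request stats from the same node
echo mntr | nc localhost 2181 | grep -E 'zk_(avg|max)_latency|zk_outstanding_requests'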

Pro Tip: Never run Zookeeper on budget shared hosting (OpenVZ). The "noisy neighbor" effect will kill your leader elections. We use CoolVDS for this because they offer KVM virtualization with dedicated resource guarantees. You need consistent I/O performance, not "burstable" nonsense.

Step 2: Tuning the Kernel

If you are proxying thousands of connections locally, you will hit the ephemeral port limit. I learned this the hard way during a load test for a Norwegian e-commerce client last Black Friday. The logs showed EADDRNOTAVAIL and the server just gave up.
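
If you suspect you are hitting that ceiling, count the sockets parked in TIME_WAIT and compare the number against your ephemeral port range before touching anything else:

# Sockets currently stuck in TIME_WAIT
ss -tan state time-wait | wc -l

# The ephemeral port range available for outbound connections
cat /proc/sys/net/ipv4/ip_local_port_range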

Edit your /etc/sysctl.conf to let the kernel reuse sockets stuck in TIME_WAIT and to widen the ephemeral port range:

# /etc/sysctl.conf
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65023
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_fin_timeout = 15

Run sysctl -p. If you don't do this, HAProxy will choke before your CPU even hits 10%.

Step 3: Configuring HAProxy for the "Smart Proxy" Role

We want HAProxy to be aggressive. If a backend is slow, cut it loose. We aren't waiting 30 seconds for a timeout. In a microservices environment, fail fast is the only rule.

Here is a snippet of a robust haproxy.cfg designed for local sidecar usage:

global
    log /dev/log    local0
    log /dev/log    local1 notice
    chroot /var/lib/haproxy
    user haproxy
    group haproxy
    daemon
    # High limit for massive throughput
    maxconn 50000

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    # Aggressive timeouts
    timeout connect 500ms
    timeout client  5000ms
    timeout server  5000ms
    retries 3
    option  redispatch

# The frontend that your application connects to (localhost)
frontend local_service_front
    bind 127.0.0.1:8080
    default_backend dynamic_backend_service_a

# The backend is populated dynamically, but here is the template
backend dynamic_backend_service_a
    balance roundrobin
    option httpchk GET /health
    # These lines will be managed by Synapse
    server service_a_1 10.0.0.5:80 check inter 2000 rise 2 fall 3
    server service_a_2 10.0.0.6:80 check inter 2000 rise 2 fall 3

Notice the timeout connect 500ms. If we can't establish a TCP handshake in half a second, that server is effectively dead to us. Move on.
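
One habit worth keeping whether you edit the config by hand or let Synapse rewrite it: validate first, then reload gracefully. HAProxy can hand its listening sockets to a fresh process so in-flight requests are not dropped. Roughly (paths assume the usual Debian/Ubuntu layout):

# Syntax-check the config without touching the running process
haproxy -c -f /etc/haproxy/haproxy.cfg

# Start a new process and tell the old one to finish its current
# connections and exit -- a near-seamless reload
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)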

Step 4: Automating with Synapse

Manually editing HAProxy is fine for three servers. It is suicide for thirty. Synapse runs as a daemon, watches a Zookeeper path (e.g., /services/user-api/instances), and rewrites the HAProxy config whenever a node joins or leaves.

A minimal synapse.conf.json looks like this. The services section tells Synapse what to watch, and the top-level haproxy section tells it where to write the generated config and how to reload the proxy:

{
  "services": {
    "user-api": {
      "discovery": {
        "method": "zookeeper",
        "path": "/services/user-api/instances",
        "hosts": [
          "10.0.0.2:2181",
          "10.0.0.3:2181"
        ]
      },
      "haproxy": {
        "port": 8080,
        "server_options": "check inter 2000 rise 2 fall 3",
        "listen": [
          "mode http",
          "option httpchk GET /health"
        ]
      }
    }
  },
  "haproxy": {
    "config_file_path": "/etc/haproxy/haproxy.cfg",
    "reload_command": "service haproxy reload",
    "do_writes": true,
    "do_reloads": true
  }
}
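
Synapse itself ships as a Ruby gem. Getting it running is roughly this (the config path is illustrative, and in production you want it under a proper init script or supervisor rather than a bare shell):

# Install the gem and start the watcher against our config
gem install synapse
synapse --config /etc/synapse/synapse.conf.json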

The Data Sovereignty Factor

Technical architecture doesn't exist in a vacuum. Working with Norwegian clients, I constantly deal with the Personopplysningsloven (Personal Data Act). You cannot just throw customer data onto a random AWS instance in Virginia and hope for the best.

Latency is another factor. If your users are in Oslo or Bergen, routing traffic through Frankfurt adds 30-40ms of unnecessary round-trip time. In a microservice chain where one request triggers five internal calls, that latency compounds.

This is where CoolVDS has become my go-to recommendation. Their data center is located locally, peering directly at NIX (Norwegian Internet Exchange). We get sub-5ms latency to most ISPs in the region. Plus, strict adherence to Norwegian privacy laws keeps the legal team off my back.

Database Considerations

While we are optimizing the proxy layer, don't ignore the database. All the HAProxy tuning in the world won't save you if your MySQL instance is thrashing.

On a recent deployment, we saw high I/O wait despite low traffic. The culprit? Default InnoDB settings. Size your buffer pool to the RAM that is actually available on the node:

[mysqld]
# Set to 70-80% of available RAM on a dedicated DB node
innodb_buffer_pool_size = 4G
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 2 # 1 is safer, 2 is faster. Pick your poison.
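
A quick way to tell whether the buffer pool is actually big enough: compare how often InnoDB had to hit disk against how often it could serve a page from memory. Something like:

# Innodb_buffer_pool_reads (disk) vs Innodb_buffer_pool_read_requests (total);
# if the first keeps climbing relative to the second, the pool is too small
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"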

Conclusion

Microservices are powerful, but they shift complexity from the code to the network. By using a local HAProxy instance controlled by Zookeeper and Synapse, you gain resilience. If a node fails, your application doesn't crash; it just reroutes.

But software resilience requires hardware stability. Don't build a skyscraper on a swamp. Ensure your underlying infrastructure—specifically your virtualization and disk I/O—is up to the task.

Ready to build a cluster that doesn't wake you up at night? Spin up a KVM instance on CoolVDS today and test the difference raw NVMe performance makes for Zookeeper.