Envoy Proxy vs. The World: Rethinking Edge Routing in 2018

The Age of Static Load Balancers is Over

If you are still hard-coding IP addresses into an upstream block in NGINX or reloading HAProxy configurations via shell scripts every time a container spins up, you are fighting a losing battle. It is March 2018. The monolith is decomposing, and the debris—hundreds of microservices—is clogging your network.

I have spent the last six months migrating a high-traffic e-commerce platform from a standard LAMP stack to a Kubernetes-managed microservices architecture. The biggest bottleneck wasn't PHP; it was the network. We initially tried to shoehorn our legacy load balancers into this dynamic environment. The result? 502 Bad Gateways during scaling events and zero visibility into which service was actually failing.

Enter Envoy Proxy. Originally built by Lyft, this C++ L7 proxy and communication bus is designed specifically for large service-oriented architectures. It is not just a load balancer; it is a universal data plane.

Why Envoy? (And Why Now?)

Traditional proxies treat configuration as a static file. Envoy treats configuration as an API. Through its xDS (Discovery Service) APIs, Envoy can dynamically update listeners, routes, and clusters without dropping a single connection or requiring a process reload. For a DevOps engineer managing a cluster that autoscales based on load, this is non-negotiable.
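To make that concrete, here is a minimal sketch of a dynamic bootstrap (the name xds_cluster is a placeholder, and the field layout follows the v2 bootstrap as shipped around Envoy 1.6; check the data-plane-api protos for your exact build). Listener and cluster definitions are streamed in over gRPC instead of being baked into the file:

dynamic_resources:
  # Listeners and clusters arrive from the management server at runtime
  lds_config:
    api_config_source:
      api_type: GRPC
      cluster_names: [xds_cluster]   # xds_cluster must be defined under static_resources
  cds_config:
    api_config_source:
      api_type: GRPC
      cluster_names: [xds_cluster]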

The Observability Gap

The second reason we switched is observability. In a distributed system, "it's slow" is not a valid bug report. You need to know where it is slow. Envoy generates robust statistics and distributed tracing out of the box.
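As a rough sketch of the tracing side (assuming you run a Zipkin collector and define a cluster named zipkin pointing at it; the driver name and endpoint follow the v2 API of this era), the tracer is declared once in the bootstrap and then enabled on the HTTP connection manager:

tracing:
  http:
    name: envoy.zipkin
    config:
      collector_cluster: zipkin            # cluster for your Zipkin collector (assumed)
      collector_endpoint: "/api/v1/spans"  # Zipkin v1 span ingestion path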

Pro Tip: Don't just dump Envoy stats into a log file. Configure the DogStatsD sink to push metrics to your time-series database. If you are operating in Norway under the looming GDPR deadline (May 2018), having granular tracing helps prove exactly where user data flows within your internal network.
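Here is a minimal sketch of that sink in the bootstrap, assuming a local DogStatsD agent listening on the default UDP port 8125 (envoy.dog_statsd is the well-known sink name in recent releases; fall back to envoy.statsd if your build lacks it):

stats_sinks:
- name: envoy.dog_statsd
  config:
    # Ship counters, gauges and histograms to the local agent over UDP
    address:
      socket_address: { address: 127.0.0.1, port_value: 8125, protocol: UDP }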

Configuration: The v2 API

Let's look at a concrete example. Below is a basic envoy.yaml configuration using the v2 API (the current standard). This setup configures Envoy as an edge proxy terminating HTTP/1.1 and HTTP/2 traffic.

admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: service_google }
          http_filters:
          - name: envoy.router
  clusters:
  - name: service_google
    connect_timeout: 0.25s
    type: LOGICAL_DNS
    lb_policy: ROUND_ROBIN
    hosts: [{ socket_address: { address: google.com, port_value: 443 }}]
    tls_context: { sni: google.com }

To run this, we use the official Docker image. Note that we are mapping both the service port (10000) and the admin interface (9901).

docker run --name=envoy -d \
  -p 9901:9901 \
  -p 10000:10000 \
  -v $(pwd)/envoy.yaml:/etc/envoy/envoy.yaml \
  envoyproxy/envoy:v1.6.0
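Once the container is running, the admin interface gives you a quick sanity check (output will vary with your config and traffic):

# Is the proxy up, and which version is it running?
curl -s http://localhost:9901/server_info

# Are the upstream hosts for service_google resolved and healthy?
curl -s http://localhost:9901/clusters | grep service_google

# Live counters for the listener we just configured
curl -s http://localhost:9901/stats | grep ingress_http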

The Infrastructure Reality Check

Here is the uncomfortable truth about running a service mesh or advanced proxies like Envoy: It increases your infrastructure footprint.

If you deploy Envoy as a sidecar (one proxy per application container), you effectively double the number of running processes. While Envoy is efficient, it still consumes CPU cycles for context switching and memory for buffering. If you run this on a cheap, oversold VPS where "vCPU" basically means "you get CPU when the neighbors aren't mining crypto," your latency will spike. I have seen p99 latency jump from 10ms to 400ms simply because the underlying host was suffering from CPU steal.

Why I Use CoolVDS for Proxy Layers

When we built our Norwegian peering nodes, we standardized on CoolVDS for three specific reasons:

  1. KVM Isolation: Unlike OpenVZ or LXC containers often sold as "VPS", CoolVDS uses KVM. This means our kernel memory is ours. No noisy neighbors impacting our packet processing capabilities.
  2. NVMe Storage: Envoy's access logs and tracing spans can generate massive I/O throughput. Writing these to a standard spinning HDD or even a cheap SATA SSD introduces blocking I/O that stalls the proxy. CoolVDS NVMe drives handle the IOPS required for high-resolution logging without breaking a sweat.
  3. Local Latency: For our Norwegian clients, routing traffic through Frankfurt or London adds 20-30ms. By hosting in Oslo on CoolVDS, we keep the round-trip time (RTT) negligible.

Advanced Configuration: Rate Limiting

One of the most powerful features to implement before you get hit by a DDoS attack is global rate limiting. NGINX keeps its limit_req counters in a shared memory zone on each individual instance, so every edge node enforces its own budget; Envoy can instead consult a global Rate Limit Service (RLS), giving the whole fleet a single view of request rates.

Here is a snippet to enable the Rate Limit filter in your connection manager:

          http_filters:
          - name: envoy.rate_limit
            config:
              domain: edge_proxy
              stage: 0
              timeout: 0.1s
              failure_mode_deny: false
          - name: envoy.router

You then define descriptors to limit traffic based on headers, IPs, or destination clusters. This protects your backend database from being hammered by a rogue script.
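As an illustration (the descriptor limits themselves live in your rate limit service's configuration, such as Lyft's ratelimit service, not in Envoy; this sketch uses the v2 route API), attaching a remote_address action to the virtual host makes Envoy send the client IP to the RLS on every request:

          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              rate_limits:
              - stage: 0
                actions:
                # Emits the descriptor ("remote_address", <client IP>) for each request
                - remote_address: {}
              routes:
              - match: { prefix: "/" }
                route: { cluster: service_google }

Note that Envoy still needs to know where the RLS lives; in this generation of the v2 API that is a bootstrap-level setting pointing at a cluster for the rate limit gRPC service, so check the bootstrap proto for your version.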

Final Thoughts: Prepare for GDPR

With Datatilsynet ramping up for the May 2018 GDPR enforcement, you need to know exactly what data you hold and where it goes. Envoy's ability to create a transparent mesh allows you to audit data flows in real time. But remember: software is only as reliable as the hardware it runs on.

Do not let storage latency be the reason your microservices architecture fails. Test your Envoy configuration on a platform built for IOPS.

Ready to lower your p99 latency? Deploy a high-performance KVM instance on CoolVDS in under 60 seconds.