Surviving the Microservices Hangover: A Practical Service Mesh Guide for 2017

We all bought the lie. We smashed our monoliths. We containerized everything. We pushed it all to Kubernetes 1.5. And now? Now we have fifty services screaming at each other over the network, and debugging a single 502 Bad Gateway takes three engineers and a bottle of Aquavit.

Microservices fix development bottlenecks, but they introduce a massive operational tax: Network Complexity. When function calls become network requests, latency isn't just a metric; it's the killer of user experience.

If you are running a distributed system in production today, you don't just need a load balancer; you need a Service Mesh. In this guide, I’m going to show you how to implement the "sidecar" proxy pattern using Linkerd (the current robust choice over the raw Envoy proxy) to regain sanity. We will look at why running this on high-performance KVM infrastructure, like CoolVDS, is the only way to keep the JVM overhead from crushing your nodes.

The Problem: The Network is Not Reliable

In a monolithic architecture, components talk via memory. Fast. Reliable. In microservices, they talk via TCP. Variable. Flaky.

I recently audited a setup for a Norwegian fintech startup. They had sporadic failures in their payment gateway. The logs showed nothing but timeouts. The root cause? A "noisy neighbor" on their public cloud instance was stealing CPU cycles, causing the SSL handshake between checkout-service and payment-service to lag just enough to trigger a timeout.

We need to decouple the application logic from the network logic. Your Python code shouldn't care about retries, circuit breaking, or service discovery. That belongs in the infrastructure layer.

The Solution: Linkerd and the Sidecar Pattern

As of early 2017, Linkerd is the heavyweight champion here. It joined the Cloud Native Computing Foundation (CNCF) earlier this year for a reason. It acts as a transparent proxy for your services.

Pro Tip: Linkerd runs on the JVM. This is critical to understand. It is feature-rich but resource-hungry. Do not try to run this on shared, burstable RAM hosting. You need dedicated RAM allocations found in KVM-based providers like CoolVDS to prevent Garbage Collection pauses from adding latency.

Step 1: The Architecture

Instead of Service A talking directly to Service B, Service A talks to its local Linkerd instance, which routes the request to Service B's Linkerd instance. This gives us:

  • Latency Aware Load Balancing: Send traffic to the fastest node, not just the next one in round-robin.
  • Circuit Breaking: Stop sending traffic to a dying instance before it crashes the whole cluster.
  • Distributed Tracing: See exactly where the time is going (Zipkin integration).
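
In practice, "talks to its local Linkerd instance" just means the application sends outbound HTTP to the proxy port instead of resolving the destination itself. A minimal sketch, assuming the proxy listens on port 4140 as configured below (the service names here are placeholders):

# Today: the app hard-codes where payment-service lives
curl http://payment-service.internal:8080/api/charge

# Through the mesh: call the local proxy and name the logical service in the Host header;
# Linkerd resolves it, load-balances, retries, and records the latency
curl -H "Host: payment-service" http://localhost:4140/api/charge

# Or set http_proxy so unmodified HTTP clients route through Linkerd automatically
export http_proxy=http://localhost:4140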

Step 2: Configuring the Router

Here is a battle-tested linkerd.yaml configuration. We are defining a router that speaks HTTP and uses file-based service discovery for simplicity (though you'd likely plug this into Consul or Kubernetes DNS in production).

admin:
  port: 9990

routers:
- protocol: http
  label: outgoing
  dtab: |
    /svc => /#/io.l5d.fs;
  servers:
  - port: 4140
    ip: 0.0.0.0
  client:
    loadBalancer:
      kind: ewma # Exponentially Weighted Moving Average - crucial for latency tails!
    failureAccrual:
      kind: io.l5d.consecutiveFailures # eject a backend after 5 consecutive failures
      failures: 5
      backoff:
        kind: jittered
        minMs: 10000
        maxMs: 60000

namers:
- kind: io.l5d.fs
  rootDir: /var/discovery

Notice the ewma load balancer setting. Standard round-robin is garbage for microservices because it doesn't account for a node that is technically "up" but performing slowly. EWMA biases traffic toward the fastest responders.
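
The io.l5d.fs namer resolves a service name against a file of the same name under rootDir, with one "address port" entry per line, and it should pick up edits to those files without a restart. A minimal sketch of registering a hypothetical users service:

# /var/discovery/users: one backend per line, read by the io.l5d.fs namer
mkdir -p /var/discovery
cat > /var/discovery/users <<EOF
10.0.0.11 8080
10.0.0.12 8080
EOF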

Step 3: Deploying to Kubernetes (DaemonSet Strategy)

While the sidecar pattern (one proxy per pod) is the future, right now in 2017, running a JVM per pod is too expensive for most. The "DaemonSet" approach—one Linkerd instance per host node—is the pragmatic middle ground.

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: l5d
spec:
  template:
    metadata:
      labels:
        app: l5d
    spec:
      volumes:
      - name: l5d-config
        configMap:
          name: l5d-config
      containers:
      - name: l5d
        image: buoyantio/linkerd:1.0.0
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: outgoing
          containerPort: 4140
          hostPort: 4140 # expose the proxy on each node's IP so local pods can reach it
        - name: admin
          containerPort: 9990
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1
            memory: 1024Mi
        volumeMounts:
        - name: "l5d-config"
          mountPath: "/io.buoyant/linkerd/config"

Look at those resource limits. 1GB of RAM limit. If you run this on a cheap VPS with "burstable" RAM, the OOM Killer will murder your mesh during peak load. This is why I deploy my clusters on CoolVDS. Their NVMe storage ensures that if we do swap (god forbid), it happens fast, and the dedicated KVM slice guarantees the memory exists.
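
With the DaemonSet running, application pods still need to find the Linkerd on their own node. The common trick is the Kubernetes downward API: inject the node name and point http_proxy at it. A sketch of the relevant container spec (the checkout-service image is a placeholder, and it assumes node names resolve on your cluster network):

      containers:
      - name: checkout-service
        image: example/checkout-service:latest
        env:
        # Downward API: the name of the node this pod was scheduled onto
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        # Send all outbound HTTP through the node-local Linkerd exposed on hostPort 4140
        - name: http_proxy
          value: $(NODE_NAME):4140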

The Latency Trap: Local Geography Matters

Service Meshes add a hop. That adds latency. Usually sub-millisecond, but it adds up.

If your servers are in a datacenter in Germany, but your customers are in Oslo, you are already fighting physics. Add a mesh, and you might see noticeable lag. For Norwegian businesses, data sovereignty is becoming a massive topic with GDPR looming in 2018. Keeping traffic inside Norway isn't just about speed; it's about compliance with the Datatilsynet's tightening grip.

Hosting locally reduces the Round Trip Time (RTT) to the NIX (Norwegian Internet Exchange). If your microservices are chatty (e.g., 50 internal calls per user request), saving 2ms per call by being on a premium network backbone like CoolVDS translates to a 100ms faster page load. That is the difference between a conversion and a bounce.

Verifying the Mesh

Once deployed, don't trust; verify. Use curl to hit the admin interface and check the routing table.

# Check if Linkerd is routing correctly
curl -H "Host: web" http://localhost:4140/

# Delegate a name to see where it resolves
curl "http://localhost:9990/delegator?path=/svc/users&dtab=/svc=>/#/io.l5d.fs"

The first command sends a real request through the mesh; the second shows how a name resolves through your delegation tables (dtabs). If your dtabs are wrong, you'll see it here before production goes down.
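
The admin port also exposes raw runtime metrics, which is the fastest way to confirm traffic is actually flowing through the proxy. A quick sanity check (grep is just one way to slice the output):

# Dump per-router request counts and latency stats from the admin endpoint
curl -s http://localhost:9990/admin/metrics.json | python -m json.tool | grep -i requests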

Summary: Don't Build a Distributed Monolith

Microservices without a mesh are just a distributed monolith with added network latency. You need visibility. You need retries that don't require code changes. You need Linkerd.

But software cannot fix bad hardware. A service mesh adds computational overhead. It demands high I/O for logging traces and consistent CPU for routing logic.

Next Steps:
1. Audit your current inter-service latency (a quick curl timing sketch follows this list).
2. Deploy Linkerd v1 on a staging cluster.
3. Ensure your underlying infrastructure can handle the JVM's weight. Don't let slow I/O kill your SEO. Deploy a test instance on CoolVDS in 55 seconds and see the difference dedicated resources make.
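
For step 1, curl's timing variables already give you a usable baseline per hop; the endpoint below is a placeholder for one of your own internal services:

# Break one internal call into DNS, TCP connect, TLS, and total time
curl -o /dev/null -s -w "dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} total=%{time_total}\n" \
  https://payment-service.internal/healthz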