Taming Microservices Chaos: Implementing a Service Mesh with Linkerd on Kubernetes 1.5

We traded Monolithic Hell for Microservices Purgatory, and honestly, I'm tired of it. It's 2017, and if you are still hardcoding retry logic inside your Node.js or Java applications, you are doing it wrong.

In a recent deployment for a Norwegian logistics client, we split a massive legacy PHP application into twelve discrete services running in Docker containers. The deployment worked. The network didn't. Intermittent failures between the inventory service and the shipping service caused cascading 500 errors that took us three days to trace. Why? Because developers implemented different timeout strategies in every single service.

This is where the Service Mesh concept saves your sanity. Today, we are going to look at decoupling operational logic (retries, circuit breaking, routing) from application logic using Linkerd. And because every millisecond of latency you add with a proxy needs to be gained back elsewhere, we will discuss why underlying hardware performance—specifically the NVMe storage stacks we use at CoolVDS—is non-negotiable for this architecture.

The Problem: The "Fat" Client Library

Before containers, we had giant load balancers (F5, NetScaler) handling traffic at the edge. Inside the datacenter, services talked directly to each other. In 2017, with Kubernetes 1.5 gaining traction, services appear and disappear dynamically. IP addresses change every minute.

Your developers probably handle this by writing code like this:

import requests
import time

def get_inventory(item_id):
    retries = 3
    for i in range(retries):
        try:
            return requests.get(f"http://inventory-service/{item_id}", timeout=0.5)
        except requests.exceptions.RequestException:
            time.sleep(0.1 * (2 ** i)) # Exponential backoff
            continue
    raise Exception("Service unavailable")

This is technical debt. Now multiply this by 20 services and 4 languages. If you want to change the retry strategy, you have to redeploy the entire platform. This inconsistency is what wakes you up at 3 AM.

The Solution: The Sidecar Proxy (Linkerd)

The solution is to push this logic out of the code and into a dedicated infrastructure layer. We run a small proxy next to every single service instance (a "sidecar"). The application talks to the local proxy on localhost, and the proxy handles the scary network stuff: discovery, retries, timeouts, and load balancing.

As of early 2017, Linkerd (built on Twitter's Finagle) is the most robust option for this. It handles service discovery, intelligent load balancing, and circuit breaking.
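To make the payoff concrete, here is roughly what the earlier Python client collapses to once the sidecar owns retries, timeouts, and backoff. This is a sketch: it assumes the local Linkerd listens on port 4140 (the outgoing router port used in the config below) and that the mesh can resolve the inventory-service name.

import requests

# The local Linkerd instance owns retries, timeouts, and load balancing.
# We simply send outbound HTTP through it (port 4140 is an assumption,
# matching the "outgoing" router configured in Step 1).
LINKERD_PROXY = {"http": "http://localhost:4140"}

def get_inventory(item_id):
    # One plain request: no retry loop, no backoff math in application code.
    return requests.get(f"http://inventory-service/{item_id}", proxies=LINKERD_PROXY)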

Pro Tip: Linkerd runs on the JVM. It is feature-rich but memory-hungry. If you are deploying this, ensure your nodes have ample RAM. On CoolVDS, we recommend starting with our 8GB RAM instances to accommodate the JVM overhead without starving your actual application containers.
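If you set container limits on the DaemonSet in Step 2, leave headroom above the JVM heap for off-heap memory. A hedged starting point for the l5d container spec (the numbers are assumptions; tune them against your own traffic):

        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            memory: "1Gi"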

Step 1: The Configuration

Here is a battle-tested linkerd.yaml configuration we used to stabilize traffic between our Oslo and Frankfurt endpoints. It defines two HTTP routers: an outgoing router for requests leaving a node and an incoming router for requests arriving at it, both resolving service names through the Kubernetes API.

admin:
  port: 9990

# The io.l5d.k8s namer is what lets the /#/io.l5d.k8s path in the dtab
# resolve Service names through the Kubernetes API (assumed reachable
# via a local kubectl proxy on port 8001).
namers:
- kind: io.l5d.k8s
  experimental: true
  host: localhost
  port: 8001

routers:
- protocol: http
  label: outgoing
  dtab: |
    /svc => /#/io.l5d.k8s/default/http;
  interpreter:
    kind: default
    transformers:
    # Send outgoing calls to the Linkerd instance on the destination pod's node
    - kind: io.l5d.k8s.daemonset
      namespace: default
      port: incoming
      service: l5d
  servers:
  - port: 4140
    ip: 0.0.0.0

- protocol: http
  label: incoming
  dtab: |
    /svc => /#/io.l5d.k8s/default/http;
  interpreter:
    kind: default
    transformers:
    # Only deliver incoming calls to pods running on this node
    - kind: io.l5d.k8s.localnode
  servers:
  - port: 4141
    ip: 0.0.0.0
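To see how a name travels through that dtab: the HTTP router's default identifier takes the request's Host header (say, inventory), names it /svc/inventory, and the dtab rewrites that to /#/io.l5d.k8s/default/http/inventory, i.e. the http port of the inventory Service in the default namespace. You can exercise the outgoing router by hand (the service name and path are placeholders from earlier):

# Linkerd's HTTP router routes on the Host header by default
curl -H "Host: inventory" http://localhost:4140/items/55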

Step 2: Deploying via DaemonSet

Instead of injecting a sidecar into every pod definition (which creates massive YAML bloat), we use a Kubernetes DaemonSet to run one Linkerd instance per node. One JVM per node is far cheaper in memory than one JVM per pod.

# extensions/v1beta1 is where DaemonSet lives on Kubernetes 1.5
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: l5d
  name: l5d
spec:
  template:
    metadata:
      labels:
        app: l5d
    spec:
      volumes:
      # Must contain the linkerd.yaml from Step 1 under the key config.yaml
      - name: l5d-config
        configMap:
          name: "l5d-config"
      containers:
      - name: l5d
        image: buoyantio/linkerd:0.8.5
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        # hostPort publishes the routers on each node's own IP, so application
        # pods can always reach the Linkerd running on their node
        - name: outgoing
          containerPort: 4140
          hostPort: 4140
        - name: incoming
          containerPort: 4141
        volumeMounts:
        - name: "l5d-config"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true
      # Note: the io.l5d.k8s namer in Step 1 expects the Kubernetes API on
      # localhost:8001; the canonical Linkerd example runs a kubectl proxy
      # sidecar in this pod to provide it.
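The container args expect the linkerd.yaml from Step 1 to be mounted as config.yaml, so the ConfigMap has to be created with exactly that key. A minimal rollout sequence (file names are assumptions) looks like this:

# Package the Step 1 config under the key the container args expect
kubectl create configmap l5d-config --from-file=config.yaml=linkerd.yaml

# Run one Linkerd per node
kubectl apply -f l5d-daemonset.yaml

# Verify a pod landed on every node
kubectl get pods -l app=l5d -o wide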

Latency: The Hidden Cost of Service Meshes

Here is the reality check nobody tells you in the "Hello World" tutorials: a Service Mesh turns every direct service call into three hops.

  1. Service A talks to Local Linkerd (Hop 1)
  2. Local Linkerd talks to Remote Linkerd (Hop 2)
  3. Remote Linkerd talks to Service B (Hop 3)

If your underlying infrastructure is slow, you just killed your application's performance. You cannot run a Service Mesh on budget spinning disks or oversold VPS hosts where the CPU steal time is high. The JVM context switching alone will add 10-20ms of jitter if the CPU is contended.
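Before blaming the mesh for slow responses, measure the proxy tax on your own nodes. A rough sketch (the service name and path are placeholders from the earlier examples, and it assumes cluster DNS resolves the Service name for the direct call):

# Direct call to the service, bypassing the mesh
curl -s -o /dev/null -w "direct:      %{time_total}s\n" http://inventory/items/55

# Same call through the local Linkerd outgoing router
http_proxy=http://localhost:4140 curl -s -o /dev/null \
  -w "via linkerd: %{time_total}s\n" http://inventory/items/55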

The Hardware Prerequisite

This is where infrastructure choices matter. At CoolVDS, we specifically architected our KVM stack to mitigate this "proxy tax."

| Feature | Generic VPS | CoolVDS Implementation | Why It Matters for Linkerd |
| --- | --- | --- | --- |
| Storage | SATA SSD (shared) | NVMe (direct path) | Log aggregation and tracing writes must not block the I/O thread. |
| Virtualization | OpenVZ / LXC | KVM | The JVM requires strict memory isolation and true kernel access. |
| Network | Shared 1Gbps | 10Gbps uplink | Handles the doubled packet count of service-to-proxy traffic. |

Configuring Circuit Breaking

The killer feature of Linkerd is Circuit Breaking. If the `shipping` service starts failing, Linkerd stops sending it traffic before your users notice.

Add this inside a router definition in your linkerd.yaml (the `client` block sits alongside `dtab` and `servers`):

client:
  failureAccrual:
    kind: io.l5d.consecutiveFailures
    failures: 5
    backoff:
      kind: constant
      ms: 10000

This config says: "If the service fails 5 times in a row, stop talking to it for 10 seconds." This gives the failing service breathing room to recover (e.g., if the database lock clears) rather than hammering it to death with retries.
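You can watch the breaker trip from the admin port configured earlier (9990). Metric names shift between Linkerd versions, so treat this as a starting point rather than an exact recipe:

# Dump the admin metrics and look for failure accrual counters
curl -s http://localhost:9990/admin/metrics.json | python -m json.tool | grep -i accrual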

Testing the Mesh

Once deployed, verify the routing using curl with the http_proxy environment variable. Do not assume it works; prove it.

# Send a request through the local Linkerd instance
# (on a real cluster, substitute a node IP for $(minikube ip))
http_proxy=http://$(minikube ip):4140 curl -v http://inventory/items/55

# Expected Output:
# < HTTP/1.1 200 OK
# < l5d-ctx-trace: ...
# < Server: Jetty(9.2.14.v20151106)

If you see the l5d-ctx-trace header, congratulations. You have successfully intercepted the traffic. You now have visibility.

Data Sovereignty and The Norwegian Context

With the EU General Data Protection Regulation (GDPR) looming on the horizon for 2018, knowing exactly where your data flows is critical. A Service Mesh provides tracing logs that prove service A talked to service B.

However, logs are data. If you are hosting in Norway to comply with Datatilsynet requirements or simply to ensure low latency to the NIX (Norwegian Internet Exchange), ensure your persistent volumes for these logs are also local.

Conclusion

Implementing a Service Mesh in 2017 is bleeding edge. It requires patience, solid knowledge of YAML, and robust hardware. But the payoff is a system that heals itself when components fail.

Don't let the overhead of Java proxies slow down your production environment. You need raw compute power and zero-wait storage I/O.

Ready to build a resilient cluster? Deploy a high-performance KVM instance on CoolVDS today and get the NVMe speed your microservices demand.