Taming Microservices Chaos: Implementing a Service Mesh with Linkerd on Kubernetes 1.5
We traded Monolithic Hell for Microservices Purgatory, and honestly, I'm tired of it. It's 2017, and if you are still hardcoding retry logic inside your Node.js or Java applications, you are doing it wrong.
In a recent deployment for a Norwegian logistics client, we split a massive legacy PHP application into twelve discrete services running in Docker containers. The deployment worked. The network didn't. Intermittent failures between the inventory service and the shipping service caused cascading 500 errors that took us three days to trace. Why? Because developers implemented different timeout strategies in every single service.
This is where the Service Mesh concept saves your sanity. Today, we are going to look at decoupling operational logic (retries, circuit breaking, routing) from application logic using Linkerd. And because every millisecond of latency you add with a proxy needs to be gained back elsewhere, we will discuss why underlying hardware performance—specifically the NVMe storage stacks we use at CoolVDS—is non-negotiable for this architecture.
The Problem: The "Fat" Client Library
Before containers, we had giant load balancers (F5, NetScaler) handling traffic at the edge. Inside the datacenter, services talked directly to each other. In 2017, with Kubernetes 1.5 gaining traction, services appear and disappear dynamically. IP addresses change every minute.
Your developers probably handle this by writing code like this:
```python
import requests
import time

def get_inventory(item_id):
    retries = 3
    for i in range(retries):
        try:
            return requests.get(f"http://inventory-service/{item_id}", timeout=0.5)
        except requests.exceptions.RequestException:
            time.sleep(0.1 * (2 ** i))  # Exponential backoff
            continue
    raise Exception("Service unavailable")
```
This is technical debt. Now multiply this by 20 services and 4 languages. If you want to change the retry strategy, you have to redeploy the entire platform. This inconsistency is what wakes you up at 3 AM.
The Solution: The Sidecar Proxy (Linkerd)
The solution is to push this logic out of the code and into a dedicated infrastructure layer. We run a small proxy next to every single service instance (a "sidecar"). The application talks to the local proxy (localhost), and the proxy handles the scary internet stuff.
As of early 2017, Linkerd (built on Twitter's Finagle) is the most robust option for this. It handles service discovery, intelligent load balancing, and circuit breaking.
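Once the proxy owns retries and timeouts, the client code above collapses to a plain HTTP call. Here is a minimal sketch, assuming the outgoing Linkerd listener we configure in Step 1 is reachable on localhost:4140 (with the per-node DaemonSet from Step 2 you would point `http_proxy` at the node's address instead):

```python
import os

import requests

# All outbound HTTP goes through the local Linkerd; retries, timeouts and
# circuit breaking now live in the mesh, not in application code.
LINKERD_PROXY = os.environ.get("http_proxy", "http://localhost:4140")

def get_inventory(item_id):
    # "inventory" is a logical service name resolved by Linkerd's dtab, not DNS.
    return requests.get(
        f"http://inventory/{item_id}",
        proxies={"http": LINKERD_PROXY},
        timeout=2.0,  # coarse safety net only; fine-grained policy belongs to Linkerd
    )
```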
Pro Tip: Linkerd runs on the JVM. It is feature-rich but memory-hungry. If you are deploying this, ensure your nodes have ample RAM. On CoolVDS, we recommend starting with our 8GB RAM instances to accommodate the JVM overhead without starving your actual application containers.
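One way to keep the JVM from starving neighbouring containers is to give the Linkerd container explicit resource requests and limits once it is deployed (Step 2). The numbers below are illustrative starting points, not tuned values:

```yaml
# Fragment for the l5d container spec in the DaemonSet from Step 2;
# values are illustrative starting points for the JVM heap plus overhead
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "1"
```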
Step 1: The Configuration
Here is a battle-tested linkerd.yaml configuration we used to stabilize traffic between our Oslo and Frankfurt endpoints. This setup bridges the gap between old-school Consul discovery and Kubernetes.
```yaml
admin:
  port: 9990

# The /#/io.l5d.k8s path in the dtabs below requires the io.l5d.k8s namer,
# which expects the Kubernetes API on localhost:8001 (e.g. via a kubectl proxy sidecar).
namers:
- kind: io.l5d.k8s
  experimental: true
  host: localhost
  port: 8001

routers:
- protocol: http
  label: outgoing
  dtab: |
    /svc => /#/io.l5d.k8s/default/http;
  interpreter:
    kind: default
    transformers:
    - kind: io.l5d.k8s.daemonset
      namespace: default
      port: incoming
      service: l5d
  servers:
  - port: 4140
    ip: 0.0.0.0

- protocol: http
  label: incoming
  dtab: |
    /svc => /#/io.l5d.k8s/default/http;
  interpreter:
    kind: default
    transformers:
    - kind: io.l5d.k8s.localnode
  servers:
  - port: 4141
    ip: 0.0.0.0
```
Step 2: Deploying via DaemonSet
Instead of injecting a sidecar into every pod definition (which creates massive YAML bloat), we use a Kubernetes DaemonSet to run one Linkerd instance per node. With a JVM-based proxy, one instance per node is far more resource-efficient than one per pod.
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: l5d
  name: l5d
spec:
  template:
    metadata:
      labels:
        app: l5d
    spec:
      volumes:
      - name: l5d-config
        configMap:
          name: "l5d-config"
      containers:
      - name: l5d
        image: buoyantio/linkerd:0.8.5
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: outgoing
          containerPort: 4140
          hostPort: 4140
        - name: incoming
          containerPort: 4141
        volumeMounts:
        - name: "l5d-config"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true
      # kubectl proxy gives the io.l5d.k8s namer API access on localhost:8001
      # (image tag is illustrative; any kubectl able to run "kubectl proxy" works)
      - name: kubectl
        image: buoyantio/kubectl:v1.4.0
        args:
        - "proxy"
        - "-p"
        - "8001"
```
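To wire the two pieces together, the router config has to exist as the l5d-config ConfigMap before the DaemonSet starts, and the file key must match the path in args (config.yaml). A sketch of the rollout, assuming the files are saved locally as linkerd.yaml and linkerd-daemonset.yaml:

```bash
# Create the ConfigMap the DaemonSet mounts (key must be config.yaml to match args)
kubectl create configmap l5d-config --from-file=config.yaml=linkerd.yaml

# Launch one Linkerd per node
kubectl apply -f linkerd-daemonset.yaml

# Verify a linkerd pod is running on every node
kubectl get pods -l app=l5d -o wide
```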
Latency: The Hidden Cost of Service Meshes
Here is the reality check nobody tells you in the "Hello World" tutorials: a Service Mesh turns every direct call into three network hops.
- Service A talks to Local Linkerd (Hop 1)
- Local Linkerd talks to Remote Linkerd (Hop 2)
- Remote Linkerd talks to Service B (Hop 3)
If your underlying infrastructure is slow, you just killed your application's performance. You cannot run a Service Mesh on budget spinning disks or oversold VPS hosts where the CPU steal time is high. The JVM context switching alone will add 10-20ms of jitter if the CPU is contended.
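Before blaming the mesh (or the hardware), measure the proxy tax on your own nodes. A rough sketch, assuming the inventory service answers on /items/55 both directly and through the local Linkerd on port 4140; URLs and sample counts are illustrative:

```python
import time

import requests

DIRECT = "http://inventory-service/items/55"   # straight to the service
PROXIED = "http://inventory/items/55"          # logical name, resolved by Linkerd
LINKERD = {"http": "http://localhost:4140"}

def p99_ms(url, proxies=None, samples=200):
    """99th-percentile latency in milliseconds over `samples` GET requests."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        requests.get(url, proxies=proxies, timeout=2.0)
        timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    return timings[int(len(timings) * 0.99) - 1]

print("direct  p99: %.1f ms" % p99_ms(DIRECT))
print("proxied p99: %.1f ms" % p99_ms(PROXIED, proxies=LINKERD))
```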
The Hardware Prerequisite
This is where infrastructure choices matter. At CoolVDS, we specifically architected our KVM stack to mitigate this "proxy tax."
| Feature | Generic VPS | CoolVDS Implementation | Why It Matters for Linkerd |
|---|---|---|---|
| Storage | SATA SSD (Shared) | NVMe (Direct Path) | Log aggregation and tracing writes must not block the I/O thread. |
| Virtualization | OpenVZ / LXC | KVM | JVM requires strict memory isolation and true kernel access. |
| Network | Shared 1Gbps | 10Gbps Uplink | Handling the doubled packet count of service-to-proxy traffic. |
Configuring Circuit Breaking
The killer feature of Linkerd is Circuit Breaking. If the `shipping` service starts failing, Linkerd stops sending it traffic before your users notice.
Add this under the `outgoing` router in your linkerd.yaml (failure accrual is configured per router, in its `client` block):
```yaml
client:
  failureAccrual:
    kind: io.l5d.consecutiveFailures
    failures: 5
    backoff:
      kind: constant
      ms: 10000
```
This config says: "If the service fails 5 times in a row, stop talking to it for 10 seconds." This gives the failing service breathing room to recover (e.g., if the database lock clears) rather than hammering it to death with retries.
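You can watch failure accrual kick in from Linkerd's admin interface (port 9990 in the config above). A quick check, hedged because the exact metric names vary between Linkerd/Finagle versions:

```bash
# Dump the admin metrics and look for failure-accrual counters on the shipping client
curl -s http://localhost:9990/admin/metrics.json | python -m json.tool | grep -i failure
```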
Testing the Mesh
Once deployed, verify the routing using curl with the http_proxy environment variable. Do not assume it works; prove it.
```bash
# Send a request through the local Linkerd instance
http_proxy=http://$(minikube ip):4140 curl -v http://inventory/items/55

# Expected Output:
# < HTTP/1.1 200 OK
# < l5d-ctx-trace: ...
# < Server: Jetty(9.2.14.v20151106)
```
If you see the l5d-ctx-trace header, congratulations. You have successfully intercepted the traffic. You now have visibility.
Data Sovereignty and The Norwegian Context
With the EU General Data Protection Regulation (GDPR) looming on the horizon for 2018, knowing exactly where your data flows is critical. A Service Mesh provides tracing logs that prove service A talked to service B.
However, logs are data. If you are hosting in Norway to comply with Datatilsynet requirements or simply to ensure low latency to the NIX (Norwegian Internet Exchange), ensure your persistent volumes for these logs are also local.
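A minimal sketch of keeping those logs on a locally provisioned volume, assuming your cluster defines a storage class backed by local NVMe disks (the class name below is illustrative, and on Kubernetes 1.5 the class is still selected via the beta annotation):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mesh-trace-logs
  annotations:
    # Kubernetes 1.5-era selector; "nvme-local" is an illustrative class name
    volume.beta.kubernetes.io/storage-class: nvme-local
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
```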
Conclusion
Implementing a Service Mesh in 2017 is bleeding edge. It requires patience, solid knowledge of YAML, and robust hardware. But the payoff is a system that heals itself when components fail.
Don't let the overhead of Java proxies slow down your production environment. You need raw compute power and zero-wait storage I/O.
Ready to build a resilient cluster? Deploy a high-performance KVM instance on CoolVDS today and get the NVMe speed your microservices demand.