Surviving the Microservices Hype: A Pragmatic Service Mesh Guide for 2018

Let’s be honest: we all drank the microservices Kool-Aid. We broke up our monoliths, containerized everything with Docker, and orchestrated it with Kubernetes. But now, instead of a single application log file to debug, we have fifty decentralized services screaming into the void, and latency spikes that no one can trace. Troubleshooting a checkout failure has turned into a game of distributed detective work.

If you are running a distributed system in production right now, you aren't managing code anymore; you are managing the network. And the network is inherently unreliable.

This is where the Service Mesh comes in. It’s not just a buzzword for 2018; it’s a survival mechanism. In this guide, we are going to look at how to implement a mesh using Linkerd (the current stable standard) on Kubernetes 1.9, and why your underlying hardware—specifically the VPS Norway infrastructure—matters more than your YAML configuration.

The Problem: The Fallacies of Distributed Computing

When Service A calls Service B, a million things can go wrong. The network might be congested, Service B might be garbage collecting (JVM pause), or the instance might have died. Hardcoding retry logic into your application code is a technical debt trap. You end up with inconsistent timeout policies across Java, Go, and Node.js services.

A Service Mesh abstracts this communication layer. It moves reliability features out of your application and into a dedicated infrastructure layer.
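
To make that concrete, here is roughly what a single, shared timeout and retry policy looks like once it lives in the mesh instead of in three codebases. This is only a sketch based on the Linkerd 1.x router configuration reference (a service-level totalTimeoutMs plus a retry budget); treat the exact keys and numbers as illustrative and check them against the docs for the version you run:

routers:
- protocol: http
  service:
    totalTimeoutMs: 3000        # one deadline, enforced identically for every caller
    retries:
      budget:
        percentCanRetry: 0.2    # retries may add at most 20% extra load
        ttlSecs: 10             # accounting window for the retry budget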

The Architecture: Sidecars vs. Per-Host

Currently, we have two prevailing models. The "Sidecar" model (championed by the rising Envoy proxy and the alpha-stage Istio project) places a proxy next to every container. This offers granular control but doubles your container count.

For this guide, we will focus on the Per-Host model using Linkerd 1.x. It’s JVM-based, heavy on memory, but battle-tested. You run one Linkerd instance per Kubernetes node. It acts as a router for all traffic leaving that node.

Deploying Linkerd to Kubernetes

First, we need to deploy the Linkerd DaemonSet. This ensures that as you scale your cluster, every new node gets a router.
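
A minimal sketch of that DaemonSet, together with the l5d Service that the daemonset transformer further down expects, might look like the following. It is modeled on the linkerd-examples manifests; the image tags, the kubectl-proxy sidecar (which the io.l5d.k8s namer uses to reach the Kubernetes API on localhost:8001), and the mount path are assumptions to verify against the official servicemesh.yml for your Linkerd version.

apiVersion: extensions/v1beta1   # apps/v1 DaemonSets are also GA on Kubernetes 1.9
kind: DaemonSet
metadata:
  name: l5d
  labels:
    app: l5d
spec:
  template:
    metadata:
      labels:
        app: l5d
    spec:
      volumes:
      - name: l5d-config
        configMap:
          name: l5d-config              # the ConfigMap defined below
      containers:
      - name: l5d
        image: buoyantio/linkerd:1.3.6  # example tag; pin the release you have tested
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: outgoing
          containerPort: 4140
          hostPort: 4140                # apps on this node send outbound traffic here
        - name: incoming
          containerPort: 4141           # traffic arriving from other nodes' routers
        - name: admin
          containerPort: 9990
        volumeMounts:
        - name: l5d-config
          mountPath: /io.buoyant/linkerd/config
          readOnly: true
      # Local kubectl proxy so the io.l5d.k8s namer can watch the API server.
      - name: kubectl
        image: buoyantio/kubectl:v1.8.5
        args: ["proxy", "-p", "8001"]
---
apiVersion: v1
kind: Service
metadata:
  name: l5d                             # referenced by the io.l5d.k8s.daemonset transformer
spec:
  selector:
    app: l5d
  ports:
  - name: outgoing
    port: 4140
  - name: incoming
    port: 4141
  - name: admin
    port: 9990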

Here is a battle-tested configuration snippet for your linkerd.yaml. Note the transformer setup—this is where most people break their routing.

apiVersion: v1
kind: ConfigMap
metadata:
  name: l5d-config
data:
  config.yaml: |-
    # The io.l5d.k8s namer resolves /#/io.l5d.k8s/<namespace>/<port-name>/<service>
    # against the Kubernetes API; without it the dtab below cannot resolve.
    namers:
    - kind: io.l5d.k8s

    routers:
    - protocol: http
      label: outgoing
      dtab: |
        /srv        => /#/io.l5d.k8s/default/http;
        /host       => /srv;
        /svc        => /host;
      interpreter:
        kind: default
        transformers:
        # Send traffic to the Linkerd router on the destination pod's node
        # rather than to the pod directly.
        - kind: io.l5d.k8s.daemonset
          namespace: default
          port: 4141              # the incoming router (not shown here) listens on 4141
          service: l5d
      servers:
      - port: 4140
        ip: 0.0.0.0

The dtab (Delegation Table) is the brain of Linkerd. It rewrites logical names (like /svc/users) into concrete destination addresses. In the example above, we are mapping service names directly to Kubernetes endpoints via the io.l5d.k8s namer.
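
For example, assuming a Kubernetes Service named users with a port named http, a request addressed to /svc/users delegates like this:

/svc/users
  => /host/users                          (rule: /svc  => /host)
  => /srv/users                           (rule: /host => /srv)
  => /#/io.l5d.k8s/default/http/users     (rule: /srv  => /#/io.l5d.k8s/default/http)

The io.l5d.k8s namer then resolves that final name to the endpoints behind the users Service's http port in the default namespace, and the daemonset transformer swaps each pod address for the Linkerd router on that pod's node.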

Circuit Breaking: Failing Fast

The real value of a mesh isn't routing; it's resilience. If your 'Inventory Service' starts timing out, you don't want your 'Frontend' to hang for 30 seconds waiting for a response. You want to fail fast so you can serve a cached version or an error message.

We configure this in the client section of the router in the Linkerd config (it sits alongside the servers and interpreter blocks):

client:
  failureAccrual:
    kind: io.l5d.successRate
    successRate: 0.9      # mark an endpoint dead when success drops below 90%...
    requests: 20          # ...measured over the last 20 requests
    backoff:
      kind: constant
      ms: 10000           # keep it out of rotation for 10s before trying it again

This configuration tells the mesh: "If the success rate for this service drops below 90% over the last 20 requests, stop sending traffic to it for 10 seconds." This gives the failing service breathing room to recover—a concept known as backpressure.

Pro Tip: Do not set your backoff too low. If a database is overloaded, retrying every 100ms is effectively a self-inflicted DDoS attack. We have seen entire clusters in Oslo melt down because of aggressive retry policies on weak hardware.
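
One way to soften that effect is a jittered backoff instead of a constant one, so that clients do not all hammer a recovering endpoint at the same instant. A sketch, using the jittered backoff kind from the Linkerd 1.x failure-accrual configuration (values are illustrative):

client:
  failureAccrual:
    kind: io.l5d.successRate
    successRate: 0.9
    requests: 20
    backoff:
      kind: jittered
      minMs: 5000       # wait at least 5s before probing the endpoint again
      maxMs: 60000      # ...and at most 60s, randomized to avoid a thundering herd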

The Hardware Reality: JVM Overhead & Latency

Here is the uncomfortable truth: Service Meshes are expensive.

Linkerd 1.x runs on the JVM. It needs a significant heap size to function without garbage collection pauses adding latency to every request. If you are running this on a budget VPS with shared CPU (