Observability is Not Just "Monitoring on Steroids": A 2020 Survival Guide

It’s 03:14 AM. The pager screams. You stumble to your laptop, eyes adjusting to the blue light, and check your Grafana dashboard. Everything is green. CPU load is nominal at 0.4 on your Nginx gateways. Memory usage is flat. Disk I/O is barely registering.

Yet, Twitter is melting down because your checkout page is throwing 500 errors for 30% of users in Trondheim.

This is the failure of traditional monitoring. It answers the question: "Is the server healthy?" It fails to answer the question that actually matters: "Why is the system acting weird?"

In late 2020, with distributed systems becoming the default even for mid-sized Nordic shops, "monitoring" is no longer enough. We need observability. And no, that isn't just a marketing buzzword used to sell you expensive SaaS subscriptions. It’s a fundamental shift in how we architect infrastructure.

The "Unknown Unknowns"

I’ve managed infrastructure for nearly two decades, from bare metal racks in damp basements to modern KVM clusters. The old way was simple: check if the process is running. If httpd is up, we are good.

But today, a user request might hit a load balancer, traverse a Kubernetes ingress, bounce through three microservices, query a Redis cache, and finally hit a PostgreSQL replica. If one of those links has a 200ms latency spike due to a noisy neighbor or a garbage collection pause, your "uptime" monitor won't catch it. The service is up, but the experience is broken.

  • Monitoring tells you when you have exceeded a known threshold (e.g., Disk > 90%). It handles known unknowns; see the example alert rule after this list.
  • Observability allows you to ask arbitrary questions about your system to debug unknown unknowns.
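
To make that split concrete, here is what a classic "known unknown" looks like as a Prometheus alerting rule. This is a sketch that assumes node_exporter metrics are already being scraped; the group name and threshold are illustrative:

groups:
  - name: known-unknowns
    rules:
      - alert: DiskAlmostFull
        # Fires when any filesystem has had less than 10% free space for 10 minutes.
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }}"

Everything that rule cannot anticipate is where observability earns its keep.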

The Three Pillars in Practice (Not Theory)

To achieve this, we rely on Metrics, Logs, and Traces. But simply collecting them isn't enough; you need to correlate them. Here is how we implement this stack effectively, specifically for environments sensitive to latency and data sovereignty.

1. Metrics: The "What"

Prometheus is the undisputed king here in 2020. Forget Nagios. If you aren't scraping metrics, you are flying blind. However, a common mistake I see in DevOps teams across Europe is scraping too much high-cardinality data, which explodes the time-series database.

Here is a battle-tested prometheus.yml snippet with a sane 15-second scrape interval. Note the evaluation interval: don't set it any lower unless you have the IOPS to back it up.

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds-node'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    # Drop Go runtime internals we never chart to keep the TSDB lean
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop

Pro Tip: Do not run your Prometheus instance on the same disk as your application logs. Prometheus writes to disk constantly. On CoolVDS, we recommend attaching a separate NVMe volume for your TSDB (time-series database) so that write latency never blocks your application I/O.
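
As an illustration, assuming that extra volume is mounted at /var/lib/prometheus-tsdb, you would point Prometheus at it explicitly. The path and retention value below are examples, not defaults you must copy:

# Dedicated NVMe mount for the TSDB; size retention to your disk, 15 days here.
/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus-tsdb \
  --storage.tsdb.retention.time=15d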

2. Logs: The "Why" (Context)

The days of grepping /var/log/syslog are over. If you are running a high-traffic site, you need structured logging. Plain text logs are impossible to query at scale.

Configure Nginx to output JSON. This allows tools like the ELK Stack (Elasticsearch, Logstash, Kibana) or the rising star, Loki, to index fields instantly.

Here is the nginx.conf snippet I use on almost every deployment:

http {
    log_format json_analytics escape=json '{"time": "$time_iso8601", '
        '"remote_addr": "$remote_addr", '
        '"request_uri": "$request_uri", '
        '"status": "$status", '
        '"request_time": "$request_time", '
        '"upstream_response_time": "$upstream_response_time", '
        '"user_agent": "$http_user_agent"}';

    access_log /var/log/nginx/access_json.log json_analytics;
}

With this configuration, you can instantly visualize "Average Upstream Response Time" in Kibana. You aren't guessing if the database is slow; the logs prove it.
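
If you ship those logs with Filebeat (as in the implementation strategy below), a minimal input sketch looks like this; the Elasticsearch address is a placeholder for your own self-hosted node:

filebeat.inputs:
  - type: log
    paths:
      - /var/log/nginx/access_json.log
    # Lift the JSON fields to the top level so Kibana can filter on them directly.
    json.keys_under_root: true
    json.add_error_key: true

output.elasticsearch:
  # Placeholder address: point this at your self-hosted Elasticsearch instance.
  hosts: ["10.0.0.20:9200"]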

3. Tracing: The "Where"

This is where most teams fail. Tracing follows a single request through your entire stack. In 2020, Jaeger is the robust choice, though we are watching the OpenTelemetry project closely as it matures.

If you are running a Go microservice, adding tracing middleware is trivial but powerful:

import (
    "io"
    "log"

    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go"
    "github.com/uber/jaeger-client-go/config"
)

// InitJaeger builds a tracer that samples every request (const sampler, param 1).
// Dial the sampling rate down in production if span volume becomes a problem.
func InitJaeger(service string) (opentracing.Tracer, io.Closer) {
    cfg := &config.Configuration{
        ServiceName: service,
        Sampler: &config.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        Reporter: &config.ReporterConfig{
            LogSpans: true,
        },
    }
    tracer, closer, err := cfg.NewTracer(config.Logger(jaeger.StdLogger))
    if err != nil {
        log.Fatalf("cannot initialize Jaeger tracer: %v", err)
    }
    return tracer, closer
}
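
For context, here is a minimal sketch of wiring that helper into a plain net/http service. The "checkout-api" name and the /checkout route are illustrative, and it assumes the imports above plus net/http:

func main() {
    tracer, closer := InitJaeger("checkout-api") // hypothetical service name
    defer closer.Close()
    opentracing.SetGlobalTracer(tracer)

    http.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
        // One span per request; downstream calls should start child spans from its context.
        span := tracer.StartSpan("handle_checkout")
        defer span.Finish()

        // ... checkout logic: payment service, database, cache ...
        w.WriteHeader(http.StatusOK)
    })

    log.Fatal(http.ListenAndServe(":8080", nil))
}

Once spans are flowing into Jaeger, "where is the time going?" becomes a waterfall view instead of a guessing game.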

The Elephant in the Room: GDPR and Schrems II

Here is the local reality for us in Norway and the EEA. The CJEU's Schrems II ruling in July has made using US-based SaaS observability platforms a legal minefield. If your logs contain IP addresses (which are PII) and you ship them to a US cloud provider, you are likely non-compliant.

This is not FUD (fear, uncertainty, and doubt); this is the current legal landscape, and Datatilsynet is the authority enforcing it here in Norway.

The Solution: Self-Hosted Observability.

By hosting your own ELK or Prometheus/Grafana stack on CoolVDS servers in Oslo, you ensure data sovereignty. Your logs never leave Norway. You maintain full control. Plus, from a TCO perspective, a heavy Elasticsearch cluster is significantly cheaper on our high-performance VPS instances than paying for ingestion volume on a SaaS platform.

Infrastructure Matters: The "CoolVDS" Factor

Observability tools are resource hogs. Elasticsearch devours RAM. Prometheus chews through Disk I/O like a hungry wolf. If you try to run these on a budget VPS with shared spinning rust (HDD) or throttled CPU, your monitoring will crash before your application does.

This is why we architect CoolVDS with pure NVMe storage and KVM virtualization.

Feature        | Generic VPS       | CoolVDS Implementation | Why It Matters for Observability
Storage        | SATA SSD (Shared) | NVMe (Local RAID 10)   | Elasticsearch indexing speed depends entirely on IOPS.
Virtualization | OpenVZ / LXC      | KVM                    | No noisy neighbors stealing CPU cycles during log spikes.
Network        | Standard Routing  | Low Latency to NIX     | Faster metric scraping ensures real-time alerting.

Implementation Strategy

Don't try to boil the ocean. Start small.

  1. Centralize Logs first: Spin up a CoolVDS instance with 8GB RAM. Install the ELK stack. Point your web servers to ship logs there using Filebeat.
  2. Add Metrics: Install Prometheus on a separate small instance (or the same one if load permits). Set up node_exporter on all your servers (a minimal systemd unit sketch follows this list).
  3. Visualize: Connect Grafana. Import standard dashboards (ID 1860 is great for Node Exporter).
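
For step 2, a minimal systemd unit for node_exporter might look like the sketch below; the binary path and dedicated user are assumptions, so adjust them to your own layout:

[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
# Assumes the binary lives in /usr/local/bin and a 'node_exporter' user exists.
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable --now node_exporter, then add the host to the static_configs targets shown earlier.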

Observability allows you to sleep at night. It changes the conversation from "I think it's the database" to "I see a 400ms delay in the SELECT query on shard 3."

Don't let blind spots kill your uptime or your reputation. Take control of your data and your infrastructure today.

Ready to build a compliant, high-performance observability stack? Deploy a KVM NVMe instance on CoolVDS in Oslo in under 55 seconds and stop guessing.