Observability vs. Monitoring: Why Green Dashboards Can Still Mean a Broken System (And How to Fix It)

It is 3:00 AM on a Tuesday. Your PagerDuty didn't fire. Your Zabbix dashboard is a comforting sea of green. Yet, your support ticket queue is flooding with angry emails from customers in Trondheim claiming the checkout process is timing out. You check CPU usage: 20%. You check RAM: 40%. You ping the server: 12ms. Everything looks fine.

This is the failure of Monitoring.

If you are still relying solely on "is the server up?" checks in late 2021, you are flying blind. As systems move from monolithic LAMP stacks to distributed microservices (even just a split frontend/backend setup), the question changes from "Is it working?" to "Why is it behaving weirdly?" This is where Observability comes in. And no, it's not just a buzzword to sell you more expensive SaaS tools.

The Core Distinction: Known Unknowns vs. Unknown Unknowns

Let's cut through the marketing fluff. I've spent the last decade debugging distributed systems across Europe, and the distinction is practical, not academic.

  • Monitoring tracks known unknowns. You know disk space might run out, so you monitor node_filesystem_avail_bytes. You know the CPU might spike, so you set a threshold at 80%. It's a dashboard of things you predicted might break.
  • Observability allows you to ask questions about unknown unknowns. It is a property of a system that lets you understand its internal state based on its external outputs (logs, metrics, and traces). It helps you answer: "Why is latency high only for requests containing a specific HTTP header?"

You cannot buy Observability. You build it. And to build it, you need the "Three Pillars"—and the raw infrastructure horsepower to ingest them.

1. Metrics: The "What" (Prometheus)

In 2021, Prometheus is the undisputed king of metrics in the cloud-native space. Unlike old-school Nagios checks, Prometheus scrapes time-series data. It is cheap to store and fast to query.

But here is where people mess up: they look at averages. Averages lie. If 99 requests take 10ms and 1 request takes 10 seconds, your average is roughly 110ms. You think you're fine. Meanwhile, that one user is churning.

Pro Tip: Always alert on percentiles (p95, p99), not averages. If your p99 latency spikes, you have a problem, even if the average is stable.
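To make that concrete, here is what a percentile alert can look like as a Prometheus alerting rule. This is a sketch: it assumes your application exposes a latency histogram named http_request_duration_seconds (a common naming convention, not something from this setup) and that the rule file is referenced via rule_files in prometheus.yml.

groups:
  - name: latency
    rules:
      - alert: HighP99Latency
        # p99 over the last 5 minutes, aggregated across all instances
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency has been above 500ms for 10 minutes"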

Here is a standard prometheus.yml scrape config we use for our internal node exporters. Notice the interval—don't go below 15s unless you have the storage I/O to back it up.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds-node'
    static_configs:
      - targets: ['10.0.0.5:9100']
        # Vital for multi-environment filtering
        labels:
          region: 'no-oslo-1'
          env: 'production'

2. Logs: The "Context" (Structured Logging)

Grepping through /var/log/nginx/access.log is dead. If you are doing that, stop. You need structured logging (JSON). This allows tools like the ELK Stack (Elasticsearch, Logstash, Kibana) or the more lightweight Grafana Loki to parse and aggregate data instantly.

The challenge with centralized logging is Disk I/O. Elasticsearch is a beast. It indexes every field. If you try to run an ELK stack on a cheap, oversold VPS with spinning rust (HDD) or throttled SSDs, your logging pipeline will crash the moment you get a traffic spike. This is why we equip CoolVDS instances with NVMe storage by default. You need high IOPS to write logs effectively without blocking your application's disk access.

Configure Nginx to output JSON directly so you don't waste CPU cycles parsing strings later:

http {
    log_format json_combined escape=json
      '{ "time_local": "$time_local", '
      '"remote_addr": "$remote_addr", '
      '"remote_user": "$remote_user", '
      '"request": "$request", '
      '"status": "$status", '
      '"body_bytes_sent": "$body_bytes_sent", '
      '"request_time": "$request_time", '
      '"http_referrer": "$http_referer", '
      '"http_user_agent": "$http_user_agent" }';

    access_log /var/log/nginx/access_json.log json_combined;
}
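
To get those JSON lines into Loki, something has to ship them. A minimal Promtail config could look like the sketch below; it assumes Loki is reachable on localhost:3100 and that you kept the access_json.log path from the Nginx config above.

server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: nginx
    static_configs:
      - targets: ['localhost']
        labels:
          job: nginx
          __path__: /var/log/nginx/access_json.log
    pipeline_stages:
      # Parse the JSON produced by the log_format above
      - json:
          expressions:
            status: status
            request_time: request_time
      # Promote the status code to a label for cheap filtering in Grafana
      - labels:
          status: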

3. Tracing: The "Where" (OpenTelemetry/Jaeger)

This is the missing link for most teams. Distributed tracing follows a request as it jumps from your Load Balancer to your Nginx proxy, to your Python backend, to your PostgreSQL database.

In late 2021, the industry is coalescing around OpenTelemetry (OTel) as the standard for instrumentation, exporting data to Jaeger or Tempo. Tracing allows you to visualize the "waterfall" of a request.

If you see a 500ms gap in the waterfall between the application receiving the request and the database query starting, you know you have a CPU bottleneck (likely the app processing logic or garbage collection) rather than a slow query. Without tracing, you'd likely blame the database.
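
A low-friction way to wire this up is to point your OTel-instrumented services at an OpenTelemetry Collector and let it forward spans to Jaeger. The collector config below is a sketch: it assumes your apps export OTLP over gRPC (the default port 4317) and that Jaeger's collector is reachable at jaeger:14250; adjust the hostnames to your setup.

receivers:
  otlp:
    protocols:
      grpc:          # applications send spans here (:4317)

processors:
  batch:             # batch spans before export to keep overhead low

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true # acceptable inside a private network; use TLS across hosts

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]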

The Infrastructure Reality Check

Implementing this stack (Prometheus + Loki + Jaeger) adds a shadow workload of its own. The observability tools themselves consume CPU and RAM. I have seen clusters where the monitoring agents consumed 15% of the total compute.

This highlights the importance of bare-metal performance isolation. In a shared hosting environment or a noisy public cloud neighbor scenario, CPU Steal (st in top) will ruin your metrics. If the hypervisor steals cycles from your VM, your timestamps will be inaccurate, and your traces will show "network latency" that is actually just your VM being paused.
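
If you already run the node exporter, you can put a number on steal instead of eyeballing top. A query along these lines (assuming the standard node_cpu_seconds_total metric) shows the fraction of CPU time the hypervisor is taking from each instance:

# Average fraction of CPU time spent in "steal" per instance, over 5 minutes
avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m]))

Anything consistently above a couple of percent is worth escalating to your provider.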

Feature | Standard Cloud VPS | CoolVDS (KVM)
Disk I/O | Often throttled / SATA SSD | Unthrottled NVMe (essential for logging)
CPU Access | Shared / burstable | Dedicated / high-priority (essential for accurate tracing)
Data Sovereignty | Often routed via US/Frankfurt | Oslo, Norway (GDPR/Schrems II compliant)

The Legal Elephant in the Room: GDPR & Schrems II

Since the Schrems II ruling last year (2020), sending personal data to US-owned cloud providers has become a legal minefield. And observability data is full of personal data: IP addresses, user IDs in traces, and email addresses in logs are all protected under the GDPR.

If you pipe your logs to a US-based SaaS observability platform, you might be non-compliant. The safest architectural decision for Norwegian companies right now is to self-host your observability stack on infrastructure located physically in Norway. This keeps Datatilsynet (the Norwegian Data Protection Authority) happy and ensures your customer data never leaves the EEA.

Implementation Strategy

Don't try to boil the ocean. Start small:

  1. Upgrade your infrastructure. Move to a provider like CoolVDS where you have root access and NVMe storage. You cannot install Prometheus node exporters on shared web hosting.
  2. Enable JSON logging in Nginx/Apache.
  3. Deploy the "PLG" stack (Prometheus, Loki, Grafana) via Docker Compose.

Here is a quick docker-compose.yml snippet to get a local Grafana and Prometheus instance running to test your config:

version: '3.7'
services:
  prometheus:
    image: prom/prometheus:v2.30.3
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:8.2.5
    depends_on:
      - prometheus
    ports:
      - "3000:3000"

Observability turns the lights on in a dark room. It requires effort, configuration, and solid hardware, but the first time it saves you from a 4-hour outage because you spotted a memory leak early, it pays for itself a hundred times over.

Ready to build a stack that actually tells you what's going on? Spin up a CoolVDS NVMe instance in Oslo today and stop guessing why your server is slow.