Observability vs. Monitoring: Why Green Dashboards Lie
I have a rule: if your dashboard is all green but your support ticket queue is overflowing, you are monitoring the wrong things. We have all been there. It's 3 AM in Oslo. Nagios says the CPU is at 20%. RAM is fine. Disk space is plentiful. Yet the application is throwing 502 errors to half your users.
This is the fundamental disconnect between Monitoring and Observability. Monitoring answers the question: "Is the system healthy based on pre-defined metrics?" Observability answers: "Why is the system behaving weirdly in a way we never predicted?"
In this guide, we are going to move beyond basic uptime checks. We will configure a modern observability stack using OpenTelemetry and Prometheus, and discuss why the underlying infrastructure—specifically storage I/O—is usually the bottleneck you didn't account for.
The "Known Unknowns" vs. "Unknown Unknowns"
Monitoring is for known failure modes. You know disk space runs out, so you set an alert at 90%. You know high load kills response times, so you alert on Load Average.
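That kind of known-unknown is exactly what a classic alerting rule is for. A minimal sketch, assuming node_exporter filesystem metrics and a "less than 10% free" threshold (tune both to your fleet):

groups:
  - name: disk_capacity
    rules:
      - alert: DiskAlmostFull
        # node_filesystem_* come from node_exporter; fires when under 10% space remains
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk left on {{ $labels.instance }} ({{ $labels.mountpoint }})"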
Observability is for the things you cannot predict. It is a property of the system, not a tool. A system is observable if you can understand its internal state just by asking questions of its external outputs (logs, metrics, traces).
The Three Pillars (in 2023 terms)
- Metrics: Aggregated numbers over time. Cheap to store, great for spotting trends.
- Logs: Discrete events. Expensive to store, vital for context.
- Traces: The journey of a request through microservices. Essential for latency debugging.
Pro Tip: Don't try to log everything. In a high-traffic environment, logging every HTTP 200 OK will bankrupt your storage budget and choke your I/O. Use sampling strategies in your tracing configuration. Start with 1% sampling and scale up only when debugging.
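One way to do that, as a sketch: head sampling with the probabilistic_sampler processor in the OpenTelemetry Collector (it ships in the collector-contrib distribution; the Collector itself is covered later in this guide). The 1% figure mirrors the starting point above:

processors:
  probabilistic_sampler:
    # keep roughly 1 trace in 100; raise it temporarily while actively debugging
    sampling_percentage: 1

Add it to the traces pipeline ahead of the batch processor and the drop happens before the data ever touches your disk.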
Implementing the PLG Stack (Prometheus, Loki, Grafana)
For most European dev teams, the PLG stack has become the gold standard due to its open-source nature and data sovereignty control—critical for GDPR compliance when hosting in Norway.
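To see how the pieces fit together before committing to anything, here is a minimal Docker Compose sketch of the stack. Image tags, ports, and file paths are illustrative; pin the versions you actually test:

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro   # the scrape config from the next section
    ports:
      - "9090:9090"
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
      - loki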
1. Prometheus Configuration
Prometheus pulls metrics. It doesn't wait for them. Here is a battle-tested scrape_config for a typical Docker or small Kubernetes environment (static targets are used here for clarity; swap in kubernetes_sd_configs for real service discovery). Note the scrape_interval: the default is 1m, and if you are debugging micro-bursts you need 15s or even 10s resolution.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'app_backend'
    metrics_path: '/metrics'
    scheme: 'https'
    tls_config:
      insecure_skip_verify: false
    static_configs:
      - targets: ['api.internal.coolvds.net:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(:[0-9]+)?'
        replacement: '${1}'
2. The High-Cardinality Trap
This is where systems melt. Cardinality refers to the number of unique combinations of metric labels. If you add a `user_id` or `ip_address` as a label in Prometheus, you will explode your time-series database (TSDB).
Bad Idea:
http_requests_total{method="POST", status="200", user_id="849201"} // DO NOT DO THIS
Good Idea:
http_requests_total{method="POST", status="200", handler="/api/checkout"} // DO THIS
Use logs (Loki) for high-cardinality data like User IDs. Use metrics (Prometheus) for aggregates.
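On the Loki side, a Promtail pipeline stage lets you keep high-cardinality fields in the log body and promote only the cheap ones to labels. A sketch, assuming your app writes JSON logs to the path below (path and field names are assumptions about your setup):

scrape_configs:
  - job_name: app_logs
    static_configs:
      - targets: [localhost]
        labels:
          job: app_backend
          __path__: /var/log/app/*.json   # assumed log location
    pipeline_stages:
      - json:
          expressions:
            status: status
      # promote only the low-cardinality field to a label;
      # user_id stays in the log body and is filtered at query time, e.g.
      #   {job="app_backend"} | json | user_id="849201"
      - labels:
          status: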
OpenTelemetry: The Unifying Layer
By mid-2023, OpenTelemetry (OTel) had solidified its position as the de facto standard for generating telemetry data. Instead of locking yourself into a vendor's agent, you run the OTel Collector. It sits between your app and your backend.
Here is how you configure the OTel Collector to receive data via OTLP (gRPC or HTTP) and export it to Prometheus (for metrics) and Loki (for logs). This abstraction allows you to switch backends without rewriting application code.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  loki:
    # the loki exporter ships in the otel-collector-contrib distribution;
    # assumes Loki is listening on localhost:3100
    endpoint: "http://localhost:3100/loki/api/v1/push"
  logging:
    loglevel: debug

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging] # Or Jaeger/Tempo
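The prometheus exporter above only exposes metrics on port 8889; Prometheus still has to come and get them. A minimal addition to the scrape_configs from the first section, assuming the Collector runs next to Prometheus:

scrape_configs:
  - job_name: 'otel_collector'
    static_configs:
      - targets: ['localhost:8889']   # the prometheus exporter endpoint defined above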
The Infrastructure Bottleneck: IOPS
Here is the hard truth nobody puts in the brochure: Observability requires massive Write I/O.
When you enable detailed tracing and logging, you are effectively writing gigabytes of data to disk every hour. If you are on a budget VPS with shared HDD or throttled SSDs (standard SATA), your iowait will skyrocket. The CPU will sit idle waiting for the disk, and your application latency will increase—ironically caused by the tool you installed to monitor latency.
| Metric | Standard HDD VPS | SATA SSD VPS | CoolVDS NVMe |
|---|---|---|---|
| Random Write IOPS | ~80-120 | ~5,000 | ~50,000+ |
| Ingestion Latency | High (seconds) | Medium (ms) | Near Instant |
| Query Speed (Loki) | Timeout | Slow | Fast |
We built CoolVDS on pure NVMe arrays specifically for this reason. When you run a heavy Loki or Elasticsearch cluster, you aren't CPU-bound; you are I/O bound. Our KVM virtualization ensures that your neighbor's log spike doesn't steal your IOPS.
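Whichever storage tier you end up on, alert on iowait so you know the moment you outgrow it. A sketch of a Prometheus alerting rule, assuming node_exporter is already scraped; the 20% threshold is a starting point, not gospel:

groups:
  - name: storage_pressure
    rules:
      - alert: HighIOWait
        # fraction of CPU time spent waiting on disk, averaged over 5 minutes
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "iowait above 20% on {{ $labels.instance }} - storage is the bottleneck"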
Data Sovereignty and The "Datatilsynet" Factor
In Norway, we take privacy seriously. Post-Schrems II, sending log data containing PII (Personally Identifiable Information) to US-based cloud monitoring services is a legal minefield. IP addresses are considered PII.
Hosting your own observability stack on a VPS in Norway is not just a technical preference; often, it is a compliance requirement. By keeping the data on a CoolVDS instance in Oslo, you ensure that traces containing user data never leave the EEA/Norway legal jurisdiction.
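Even when the data never leaves Oslo, it is good hygiene to scrub obvious PII before it hits disk. A sketch using the OTel Collector's attributes processor (bundled in the collector-contrib distribution); the attribute keys are assumptions about how your SDK names things:

processors:
  attributes/scrub_pii:
    actions:
      - key: http.client_ip      # assumed attribute name from your instrumentation
        action: delete
      - key: enduser.id
        action: hash             # keeps correlation possible without storing the raw ID

Wire it into the traces and logs pipelines ahead of batch and the data is anonymized before any exporter sees it.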
War Story: The "Ghost" Latency
Last winter, we helped a client running a Magento cluster. Every day at 14:00, the site crawled. Monitoring showed CPU at 40%. No errors in Nginx.
We installed Jaeger for tracing. The traces revealed a 4-second span in a Redis `GET` call. Why? Because a cron job was running `KEYS *` (a blocking command) on the Redis instance at 14:00 to generate a report. Monitoring didn't catch it because the Redis service was "up." Observability caught it because the traces showed exactly where the time was going.
Next Steps
Stop relying on ping checks. Build a stack that tells you why things break, not just when.
- Deploy a CoolVDS instance (Ubuntu 22.04 LTS recommended).
- Install the OpenTelemetry Collector.
- Point your application's OTLP exporter at the Collector and keep the data on fast local storage.
Do not let slow storage throttle your insights. Deploy a high-performance NVMe instance on CoolVDS today and see what your code is actually doing.