Observability vs. Monitoring: Why Green Dashboards Lie
I have a rule: if your dashboard is all green but your support ticket queue is overflowing, you are monitoring the wrong things. We have all been there. It's 3 AM in Oslo. Nagios says the CPU is at 20%. RAM is fine. Disk space is plentiful. Yet the application is throwing 502 errors to half your users.
This is the fundamental disconnect between Monitoring and Observability. Monitoring answers the question: "Is the system healthy based on pre-defined metrics?" Observability answers: "Why is the system behaving weirdly in a way we never predicted?"
In this guide, we are going to move beyond basic uptime checks. We will configure a modern observability stack using OpenTelemetry and Prometheus, and discuss why the underlying infrastructure—specifically storage I/O—is usually the bottleneck you didn't account for.
The "Known Unknowns" vs. "Unknown Unknowns"
Monitoring is for known failure modes. You know disk space runs out, so you set an alert at 90%. You know high load kills response times, so you alert on Load Average.
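That kind of known-unknown is exactly what a classic alerting rule is for. A minimal sketch, assuming node_exporter filesystem metrics and a "less than 10% free" threshold (tune both to your fleet):

groups:
  - name: disk_capacity
    rules:
      - alert: DiskAlmostFull
        # node_filesystem_* come from node_exporter; fires when under 10% space remains
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk left on {{ $labels.instance }} ({{ $labels.mountpoint }})"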
Observability is for the things you cannot predict. It is a property of the system, not a tool. A system is observable if you can understand its internal state just by asking questions of its external outputs (logs, metrics, traces).
The Three Pillars (in 2023 terms)
- Metrics: Aggregated numbers over time. Cheap to store, great for spotting trends.
- Logs: Discrete events. Expensive to store, vital for context.
- Traces: The journey of a request through microservices. Essential for latency debugging.
Pro Tip: Don't try to log everything. In a high-traffic environment, logging every HTTP 200 OK will bankrupt your storage budget and choke your I/O. Use sampling strategies in your tracing configuration. Start with 1% sampling and scale up only when debugging.
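One way to do that, as a sketch: head sampling with the probabilistic_sampler processor in the OpenTelemetry Collector (it ships in the collector-contrib distribution; the Collector itself is covered later in this guide). The 1% figure mirrors the starting point above:

processors:
  probabilistic_sampler:
    # keep roughly 1 trace in 100; raise it temporarily while actively debugging
    sampling_percentage: 1

Add it to the traces pipeline ahead of the batch processor and the drop happens before the data ever touches your disk.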
Implementing the PLG Stack (Prometheus, Loki, Grafana)
For most European dev teams, the PLG stack has become the gold standard due to its open-source nature and data sovereignty control—critical for GDPR compliance when hosting in Norway.
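To see how the pieces fit together before committing to anything, here is a minimal Docker Compose sketch of the stack. Image tags, ports, and file paths are illustrative; pin the versions you actually test:

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro   # the scrape config from the next section
    ports:
      - "9090:9090"
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
      - loki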
1. Prometheus Configuration
Prometheus pulls metrics. It doesn't wait for them. Here is a battle-tested scrape_config for a typical Docker or small Kubernetes environment (static targets are used here for clarity; swap in kubernetes_sd_configs for real service discovery). Note the scrape_interval: the default is 1m, and if you are debugging micro-bursts you need 15s or even 10s resolution.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'app_backend'
    metrics_path: '/metrics'
    scheme: 'https'
    tls_config:
      insecure_skip_verify: false
    static_configs:
      - targets: ['api.internal.coolvds.net:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(:[0-9]+)?'
        replacement: '${1}'
2. The High-Cardinality Trap
This is where systems melt. Cardinality refers to the number of unique combinations of metric labels. If you add a `user_id` or `ip_address` as a label in Prometheus, you will explode your time-series database (TSDB).
Bad Idea:
http_requests_total{method="POST", status="200", user_id="849201"} // DO NOT DO THIS
Good Idea:
http_requests_total{method="POST", status="200", handler="/api/checkout"} // DO THIS
Use logs (Loki) for high-cardinality data like User IDs. Use metrics (Prometheus) for aggregates.
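On the Loki side, a Promtail pipeline stage lets you keep high-cardinality fields in the log body and promote only the cheap ones to labels. A sketch, assuming your app writes JSON logs to the path below (path and field names are assumptions about your setup):

scrape_configs:
  - job_name: app_logs
    static_configs:
      - targets: [localhost]
        labels:
          job: app_backend
          __path__: /var/log/app/*.json   # assumed log location
    pipeline_stages:
      - json:
          expressions:
            status: status
      # promote only the low-cardinality field to a label;
      # user_id stays in the log body and is filtered at query time, e.g.
      #   {job="app_backend"} | json | user_id="849201"
      - labels:
          status: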
OpenTelemetry: The Unifying Layer
By mid-2023, OpenTelemetry (OTel) had solidified its position as the de facto standard for generating telemetry data. Instead of locking yourself into a vendor's agent, you run the OTel Collector. It sits between your app and your backend.
Here is how you configure the OTel Collector to receive data via OTLP (gRPC or HTTP) and export it to Prometheus (for metrics) and Loki (for logs). This abstraction allows you to switch backends without rewriting application code.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  loki:
    # the loki exporter ships in the otel-collector-contrib distribution;
    # assumes Loki is listening on localhost:3100
    endpoint: "http://localhost:3100/loki/api/v1/push"
  logging:
    loglevel: debug

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging] # Or Jaeger/Tempo
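The prometheus exporter above only exposes metrics on port 8889; Prometheus still has to come and get them. A minimal addition to the scrape_configs from the first section, assuming the Collector runs next to Prometheus:

scrape_configs:
  - job_name: 'otel_collector'
    static_configs:
      - targets: ['localhost:8889']   # the prometheus exporter endpoint defined above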
The Infrastructure Bottleneck: IOPS
Here is the hard truth nobody puts in the brochure: Observability requires massive Write I/O.
When you enable detailed tracing and logging, you are effectively writing gigabytes of data to disk every hour. If you are on a budget VPS with shared HDD or throttled SSDs (standard SATA), your iowait will skyrocket. The CPU will sit idle waiting for the disk, and your application latency will increase—ironically caused by the tool you installed to monitor latency.
| Metric | Standard HDD VPS | SATA SSD VPS | CoolVDS NVMe |
|---|---|---|---|
| Random Write IOPS | ~80-120 | ~5,000 | ~50,000+ |
| Ingestion Latency | High (seconds) | Medium (ms) | Near Instant |
| Query Speed (Loki) | Timeout | Slow | Fast |
We built CoolVDS on pure NVMe arrays specifically for this reason. When you run a heavy Loki or Elasticsearch cluster, you aren't CPU-bound; you are I/O bound. Our KVM virtualization ensures that your neighbor's log spike doesn't steal your IOPS.
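Whichever storage tier you end up on, alert on iowait so you know the moment you outgrow it. A sketch of a Prometheus alerting rule, assuming node_exporter is already scraped; the 20% threshold is a starting point, not gospel:

groups:
  - name: storage_pressure
    rules:
      - alert: HighIOWait
        # fraction of CPU time spent waiting on disk, averaged over 5 minutes
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "iowait above 20% on {{ $labels.instance }} - storage is the bottleneck"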
Data Sovereignty and The "Datatilsynet" Factor
In Norway, we take privacy seriously. Post-Schrems II, sending log data containing PII (Personally Identifiable Information) to US-based cloud monitoring services is a legal minefield. IP addresses are considered PII.
Hosting your own observability stack on a VPS in Norway is not just a technical preference; often, it is a compliance requirement. By keeping the data on a CoolVDS instance in Oslo, you ensure that traces containing user data never leave the EEA/Norway legal jurisdiction.
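Even when the data never leaves Oslo, it is good hygiene to scrub obvious PII before it hits disk. A sketch using the OTel Collector's attributes processor (bundled in the collector-contrib distribution); the attribute keys are assumptions about how your SDK names things:

processors:
  attributes/scrub_pii:
    actions:
      - key: http.client_ip      # assumed attribute name from your instrumentation
        action: delete
      - key: enduser.id
        action: hash             # keeps correlation possible without storing the raw ID

Wire it into the traces and logs pipelines ahead of batch and the data is anonymized before any exporter sees it.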
War Story: The "Ghost" Latency
Last winter, we helped a client running a Magento cluster. Every day at 14:00, the site crawled. Monitoring showed CPU at 40%. No errors in Nginx.
We installed Jaeger for tracing. The traces revealed a 4-second span in a Redis `GET` call. Why? Because a cron job was running `KEYS *` (a blocking command) on the Redis instance at 14:00 to generate a report. Monitoring didn't catch it because the Redis service was "up." Observability caught it because the traces showed exactly where the time was going.
Next Steps
Stop relying on ping checks. Build a stack that tells you why things break, not just when.
- Deploy a CoolVDS instance (Ubuntu 22.04 LTS recommended).
- Install the OpenTelemetry Collector.
- Point your application's OTLP exporter at the Collector and keep the data on fast local storage.
Do not let slow storage throttle your insights. Deploy a high-performance NVMe instance on CoolVDS today and see what your code is actually doing.