Observability vs Monitoring: Why Your Green Dashboards Are Lying to You
It was 03:15 AM on a Tuesday. My phone was silent. The monitoring dashboard, a beautiful grid of green lights running on a dedicated screen in the office, claimed everything was perfect. CPU load was nominal. RAM usage was at 45%. Disk space was plentiful.
Yet, the support ticket queue was filling up with angry users from Trondheim to Oslo claiming the checkout API was timing out.
This is the classic failure of traditional monitoring. We were watching the health of the server, not the health of the request. We knew the server was alive; we had no idea it was hallucinating. In the post-2020 era of distributed systems, "is it up?" is the wrong question. The right question is: "Why is it behaving like that?"
This is the hard line between Monitoring and Observability. And if you are running critical infrastructure in Norway under the watchful eye of Datatilsynet, getting this wrong isn't just technical debt; it's a compliance risk.
The semantic difference that saves jobs
Let's cut through the marketing noise. I've managed infrastructure for over a decade, from bare-metal racks in the basement to massive Kubernetes clusters.
- Monitoring is for known unknowns. You know disk space can run out, so you set an alert for 90% usage. You know the database can lock up, so you check active connections. It is a predefined checklist.
- Observability is for unknown unknowns. It allows you to ask new questions of your system without shipping new code. "Why is latency spiking only for iOS users on the Telenor network attempting to buy socks?"
If you can't debug a high-latency issue without SSH-ing into the box and running htop, you don't have observability. You have a fragile house of cards.
The Three Pillars in 2022: It's not just ELK anymore
To achieve true observability, you need to correlate three signals: Metrics, Logs, and Traces. But simply dumping them into a bucket isn't enough. You need structured data.
1. Structured Logging (The Foundation)
Stop parsing free-text logs with regexes. If your Nginx logs are just text strings, you are fighting with one hand tied behind your back. Configure your ingress to speak JSON. Here is the exact nginx.conf snippet I deploy on CoolVDS instances to feed Logstash or Fluentd:
http {
    log_format json_analytics escape=json
        '{'
        '"msec": "$msec", ' # Timestamp in seconds with milliseconds resolution
        '"connection": "$connection", '
        '"connection_requests": "$connection_requests", '
        '"pid": "$pid", '
        '"request_id": "$request_id", ' # Critical for tracing!
        '"request_length": "$request_length", '
        '"remote_addr": "$remote_addr", '
        '"remote_user": "$remote_user", '
        '"remote_port": "$remote_port", '
        '"time_local": "$time_local", '
        '"time_iso8601": "$time_iso8601", '
        '"request": "$request", '
        '"request_uri": "$request_uri", '
        '"args": "$args", '
        '"status": "$status", '
        '"body_bytes_sent": "$body_bytes_sent", '
        '"bytes_sent": "$bytes_sent", '
        '"http_referer": "$http_referer", '
        '"http_user_agent": "$http_user_agent", '
        '"http_x_forwarded_for": "$http_x_forwarded_for", '
        '"http_host": "$http_host", '
        '"server_name": "$server_name", '
        '"request_time": "$request_time", '
        '"upstream": "$upstream_addr", '
        '"upstream_connect_time": "$upstream_connect_time", '
        '"upstream_header_time": "$upstream_header_time", '
        '"upstream_response_time": "$upstream_response_time", '
        '"upstream_response_length": "$upstream_response_length", '
        '"upstream_cache_status": "$upstream_cache_status", '
        '"ssl_protocol": "$ssl_protocol", '
        '"ssl_cipher": "$ssl_cipher", '
        '"scheme": "$scheme", '
        '"request_method": "$request_method"'
        '}';

    access_log /var/log/nginx/analytics.json json_analytics;
}
Notice the $request_id. This is non-negotiable. You pass this ID downstream to your PHP-FPM or Node.js application, and suddenly you can trace a request from the edge load balancer all the way to the database query.
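On the application side, here is a minimal sketch of picking that ID up and echoing it into your own structured logs. It assumes you also forward the ID to the backend, for example with proxy_set_header X-Request-Id $request_id; in your proxy block, and it uses Python with Flask purely as an illustration; the same pattern applies to any PHP-FPM or Node.js handler.

import json
import logging
import sys

from flask import Flask, request

app = Flask(__name__)
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

@app.route("/checkout", methods=["POST"])
def checkout():
    # The ID nginx generated ($request_id), forwarded as an HTTP header.
    request_id = request.headers.get("X-Request-Id", "unknown")

    # One JSON object per line, so Logstash/Fluentd can parse it and
    # correlate it with the edge log entry that carries the same ID.
    logger.info(json.dumps({
        "event": "checkout_started",
        "request_id": request_id,
        "path": request.path,
    }))

    # ... business logic ...
    return {"status": "ok", "request_id": request_id}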
2. Metrics (Prometheus is King)
In 2022, Prometheus is the de facto standard. If you aren't exposing a /metrics endpoint, you are wrong. However, a common mistake is high cardinality. If you tag your metrics with user_id or email, you will blow up your time-series database (TSDB).
Correct Prometheus Scrape Config:
scrape_configs:
  - job_name: 'coolvds_node_exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):.*'
        target_label: instance
        replacement: '${1}'
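The same cardinality rule applies when you define the metrics themselves: label by method, route, and status class, never by user. Here is a minimal sketch with the Python prometheus_client package; the metric names, port, and label set are illustrative assumptions, not a prescribed schema.

import time

from prometheus_client import Counter, Histogram, start_http_server

# Good labels: a small, bounded set of values (method, route, status class).
# Bad labels: user_id, email, session tokens -- unbounded values that bloat the TSDB index.
HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["method", "route", "status"],
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["route"],
)

def record_request(method: str, route: str, status: int, duration: float) -> None:
    # Collapse the status code to its class (2xx, 4xx, 5xx) to keep values bounded.
    HTTP_REQUESTS.labels(method=method, route=route, status=f"{status // 100}xx").inc()
    REQUEST_LATENCY.labels(route=route).observe(duration)

if __name__ == "__main__":
    start_http_server(8000)  # Serves /metrics; add this host:8000 as a scrape target.
    record_request("GET", "/checkout", 200, 0.042)
    time.sleep(300)  # Keep the process alive so Prometheus can scrape it.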
Pro Tip: Do not run Prometheus on the same disk as your application logs. The TSDB's heavy write load will choke your I/O. On CoolVDS, I always attach a secondary NVMe volume mounted at /var/lib/prometheus to avoid IOPS contention.
3. Distributed Tracing (OpenTelemetry)
This is where the industry has settled. OpenTelemetry (OTel) has effectively won the war against proprietary agents. It allows you to instrument your code once and send data to Jaeger, Zipkin, or Honeycomb.
Here is a basic OTel Collector configuration to run as a sidecar agent:
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  logging:
    loglevel: debug
  jaeger:
    endpoint: "jaeger-collector:14250"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
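The application side then only needs to be instrumented once. Below is a minimal Python sketch that sends spans to the collector above over OTLP/gRPC, using the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages; the service name, span names, and endpoint are assumptions you would adapt to your own stack.

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Point the SDK at the sidecar collector's OTLP/gRPC receiver (default port 4317).
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_checkout(order_id: str) -> None:
    # Each with-block becomes a span; together they form one trace in Jaeger.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("reserve_inventory"):
            pass  # call the inventory service here
        with tracer.start_as_current_span("charge_payment"):
            pass  # call the payment provider here

Run it once, open the trace in Jaeger, and you should see the parent and child spans grouped under the configured service name.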
The "Schrems II" Reality Check
Here is the part most US-based tutorials ignore. If you are operating in Norway or the EU, you have a legal problem called Schrems II. Sending your observability data, which often contains IP addresses, user agents, and potentially leaked PII in stack traces, to a US-based SaaS cloud is a GDPR violation waiting to happen.
The Datatilsynet (Norwegian Data Protection Authority) has been very clear: you must control where your data lives.
This is why hosting your observability stack (Grafana, Loki, Tempo) on CoolVDS within Norwegian or European datacenters is not just a performance choice; it's a legal safeguard. You keep the data sovereignty. You don't export trace data across the Atlantic.
The Hardware Reality of Observability
Observability is expensive. Not in license fees (if you use open source), but in compute resources. Ingesting terabytes of logs and indexing millions of metric points requires serious I/O throughput.
If you try to run an ELK stack (Elasticsearch) on standard SATA SSDs or, heaven forbid, spinning rust, your ingestion lag will be minutes long. By the time you see the error in Kibana, the customer has already churned.
We benchmarked Elasticsearch indexing speeds on various storage backends. The results explain why we standardized on NVMe for CoolVDS:
| Storage Type | Indexing Rate (Docs/sec) | Query Latency (p99) |
|---|---|---|
| HDD (7.2k RPM) | ~800 | 1,200ms |
| Standard SATA SSD | ~4,500 | 250ms |
| CoolVDS NVMe | ~22,000 | 45ms |
When you are querying 50GB of logs to find that one Nginx 502 error, 45ms vs 1200ms is the difference between fixing the issue instantly and waiting for the progress bar while your boss paces behind your chair.
Putting it together: A Debugging Workflow
So, how does this look in practice when the alert fires?
- Alert: Prometheus fires HighErrorRate via Alertmanager to Slack.
- Triage: You open the Grafana dashboard. You see the spike started at 14:00.
- Drill Down: You switch to the Loki panel. You query {app="checkout"} |= "error". You see a flood of "Connection Refused" errors.
- Trace: You grab a traceID from the log. You paste it into Jaeger.
- Root Cause: The trace shows the Checkout Service waiting 5 seconds for the Inventory Service, which is crashing because of an Out Of Memory (OOM) kill.
Without observability, you would still be checking the Checkout Service logs, seeing nothing wrong, and blaming the network.
Conclusion
Stop relying on green dashboards that only check if the port is open. Your users don't care if port 443 is accepting connections; they care if their payment goes through.
Building a proper observability stack takes effort. It requires configuring OpenTelemetry, tuning Prometheus retention, and managing storage for logs. But the alternative is flying blind.
If you are ready to build a stack that complies with Norwegian privacy standards and has the I/O horsepower to ingest heavy telemetry without choking, you need the right foundation.
Don't let slow I/O kill your insights. Deploy a high-performance NVMe instance on CoolVDS today and start seeing what is actually happening inside your systems.