Stop Monitoring, Start Observing: Why Your "Green" Dashboard is Lying to You
It is 3:00 AM. PagerDuty just fired. You open your Grafana dashboard. Everything is green. CPU load is nominal. Memory usage is flat. Disk space is at 40%. Yet, Twitter is on fire because no one can check out on your platform.
This is the failure of Monitoring. Monitoring is for known unknowns. You asked: "Is the CPU high?" The answer was no. But you didn't know to ask: "Is the payment gateway latency correlating with the micro-bursts in disk I/O wait times?"
That is Observability. It is not a buzzword; it is a property of your system. If you can answer new questions without deploying new code, you have observability. If you have to ssh in and grep logs, you don't.
In 2020, with distributed systems becoming the norm even for mid-sized Norwegian shops, the old Nagios "check_http" approach is dead. Here is how we build a stack that actually debugs itself, running on high-performance infrastructure.
The Triad: Metrics, Logs, and Traces
To move from monitoring to observability, we need to correlate three specific data streams. But be warned: storing this data requires serious I/O throughput. If you are running this on a budget spinning-disk VPS, you will bottleneck your production app just trying to measure it.
1. Metrics (Prometheus)
Metrics are cheap to store but lack context: they tell you when something happened, not why. In September 2020, Prometheus is the undisputed king here.
Don't just install node_exporter and call it a day. You need to expose application internals. If you are running a Go application, you should be instrumenting your own handlers.
# prometheus.yml snippet for a robust scrape config
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds-production-api'
    static_configs:
      - targets: ['10.0.0.5:9090', '10.0.0.6:9090']
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):9090"
        target_label: instance_ip
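On the application side, instrumenting your own handlers takes only a few lines with the official client_golang library. Here is a minimal sketch, assuming a plain net/http service; the metric name, route, and port are illustrative (the port simply matches the scrape targets above):

// main.go: expose per-handler latency histograms on /metrics
package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Histogram of request durations, labelled by handler name.
var requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "Time spent serving HTTP requests.",
    Buckets: prometheus.DefBuckets,
}, []string{"handler"})

// instrument wraps a handler and records how long each request took.
func instrument(name string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next(w, r)
        requestDuration.WithLabelValues(name).Observe(time.Since(start).Seconds())
    }
}

func main() {
    http.HandleFunc("/checkout", instrument("checkout", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    }))
    http.Handle("/metrics", promhttp.Handler()) // the endpoint Prometheus scrapes
    log.Fatal(http.ListenAndServe(":9090", nil))
}

Histograms are the right default here: once the data lands in Prometheus, histogram_quantile() gives you p99 latency per handler, which is exactly the question the 3:00 AM dashboard could not answer.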
2. Structured Logs (Loki)
Grepping text files is for amateurs. If your logs aren't JSON, they are trash. You need to be able to query logs like a database. We use Loki (the PLG stack: Promtail, Loki, Grafana) because it indexes only the label metadata, not the full log text, which makes ingestion vastly cheaper than Elasticsearch.
First, fix your Nginx configuration. The default access log format is useless for machine parsing.
# /etc/nginx/nginx.conf
http {
    log_format json_combined escape=json
        '{ "time_local": "$time_local", '
        '"remote_addr": "$remote_addr", '
        '"remote_user": "$remote_user", '
        '"request": "$request", '
        '"status": "$status", '
        '"body_bytes_sent": "$body_bytes_sent", '
        '"request_time": "$request_time", '
        '"upstream_response_time": "$upstream_response_time", '
        '"http_referer": "$http_referer", '
        '"http_user_agent": "$http_user_agent" }';

    access_log /var/log/nginx/access.log json_combined;
}
Now, when a 500 error spikes in Prometheus, you can immediately query Loki for logs where status="500" and request_time > 1.0.
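The same discipline applies to your application logs, not just Nginx. A minimal sketch using logrus, a common Go choice in 2020 (the field names and values here are purely illustrative; Promtail tails stdout or the log file and ships the lines to Loki):

// logging.go: structured JSON application logs that Loki can actually query
package main

import (
    "os"

    log "github.com/sirupsen/logrus"
)

func main() {
    // Emit JSON so fields like status and request_time become queryable,
    // instead of something you have to regex out of a text line.
    log.SetFormatter(&log.JSONFormatter{})
    log.SetOutput(os.Stdout)

    log.WithFields(log.Fields{
        "status":       500,
        "request_time": 1.42,
        "path":         "/api/checkout",
    }).Error("payment gateway timeout")
}

Keep the field names consistent with the Nginx format above so a single Grafana panel can overlay both streams.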
3. Distributed Tracing (Jaeger)
If you run microservices, tracing is non-negotiable. It visualizes the request lifecycle across containers. Seeing a request spend 400ms in a database lock and only 10ms in application logic saves weeks of debugging "slow code."
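In Go, as of late 2020, the practical way to wire this up is still the OpenTracing API with the Jaeger client rather than the not-yet-stable OpenTelemetry SDK. A rough sketch, assuming a Jaeger agent on the default UDP port 6831; the service and span names are illustrative:

// tracing.go: report spans from a Go service to a local Jaeger agent
package main

import (
    "log"

    "github.com/opentracing/opentracing-go"
    jaegercfg "github.com/uber/jaeger-client-go/config"
)

func main() {
    cfg := jaegercfg.Configuration{
        ServiceName: "checkout-api", // illustrative name
        // Sample everything while you are still tuning; dial Param down later.
        Sampler: &jaegercfg.SamplerConfig{Type: "const", Param: 1},
        Reporter: &jaegercfg.ReporterConfig{
            LocalAgentHostPort: "127.0.0.1:6831",
            LogSpans:           true,
        },
    }
    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        log.Fatalf("jaeger init: %v", err)
    }
    defer closer.Close()
    opentracing.SetGlobalTracer(tracer)

    // One span per request, one child span per expensive step: the child is
    // where a 400ms database lock wait becomes its own bar in the Jaeger UI.
    parent := opentracing.StartSpan("POST /checkout")
    dbSpan := opentracing.StartSpan("pg.query", opentracing.ChildOf(parent.Context()))
    dbSpan.SetTag("db.statement", "SELECT ... FOR UPDATE")
    dbSpan.Finish()
    parent.Finish()
}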
The Hardware Reality: TSDBs Eat IOPS
This is where most implementations fail. Time Series Databases (TSDBs) like Prometheus rely heavily on disk write performance. They generate massive amounts of small, random writes. If your underlying storage has high latency (common in shared "cloud" buckets), your monitoring system will lag behind reality, or worse, crash.
Pro Tip: Never host your observability stack on the same physical disk as your database. At CoolVDS, we use NVMe storage tiers specifically to handle the high-write churn of the TSDB write-ahead log (WAL) and block compaction. We see write speeds upwards of 2GB/s, ensuring your metrics never drop, even during a DDoS attack.
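If you want a sanity check before trusting a volume with a TSDB, measure the latency of small synced writes directly. A rough sketch; the file name and iteration count are arbitrary, and it should be run from the mount point you intend to hand to Prometheus:

// fsynctest.go: average latency of small fsync'd writes on the current volume
package main

import (
    "fmt"
    "os"
    "time"
)

func main() {
    f, err := os.Create("fsync-test.tmp")
    if err != nil {
        panic(err)
    }
    defer os.Remove(f.Name())
    defer f.Close()

    buf := make([]byte, 4096) // one 4 KiB page, roughly a small WAL append
    const iterations = 1000

    start := time.Now()
    for i := 0; i < iterations; i++ {
        if _, err := f.Write(buf); err != nil {
            panic(err)
        }
        if err := f.Sync(); err != nil { // force the write down to the device
            panic(err)
        }
    }
    elapsed := time.Since(start)
    fmt.Printf("avg synced 4 KiB write: %v (%d iterations in %v)\n",
        elapsed/iterations, iterations, elapsed)
}

The absolute number matters less than the comparison: run it on the NVMe tier and on the budget volume you were tempted to reuse, and the gap explains itself.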
The Legal Elephant: Schrems II and Data Sovereignty
We cannot ignore the legal landscape in late 2020. The CJEU's Schrems II ruling in July invalidated the EU-US Privacy Shield. If you are shipping your logs (which contain IP addresses, and IP addresses are personal data under the GDPR) to a US-owned SaaS monitoring platform, you are likely non-compliant.
This is the pragmatic argument for self-hosting your observability stack on Norwegian soil.
- Data Residency: Keep logs in Oslo.
- Latency: Sending metrics from a server in Oslo to a collector in Virginia adds 80ms+ of latency. Sending them to a local CoolVDS instance takes <2ms.
- Control: You own the retention policy. No surprise bills for "high cardinality" data.
Implementation: The "CoolVDS" Base Layer
When we provision a KVM instance for observability, we tweak the kernel for high-throughput networking. Default Linux settings are conservative. Open /etc/sysctl.conf and apply these changes to handle the influx of metrics:
# Optimization for high-throughput metric ingestion
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 8096
vm.swappiness = 10 # Don't swap out Prometheus chunks
fs.file-max = 100000 # OpenTelemetry collectors need file descriptors
Reload with sysctl -p.
Conclusion
Monitoring is asking "Is the system healthy?" Observability is asking "What is the system doing?" The difference allows you to sleep through the night while the system auto-remediates or alerts you with precise root causes rather than vague symptoms.
But software is only as good as the iron it runs on. High-cardinality metrics require high-performance storage. Don't let I/O wait times create a blind spot in your infrastructure.
Ready to build a stack that sees everything? Deploy a high-IOPS NVMe instance on CoolVDS today and keep your data strictly within Norwegian borders.