Observability vs Monitoring: Why Your Green Dashboards Are Lying to You
It’s 3:00 AM. PagerDuty wakes you up. Customers are screaming about 502 Bad Gateway errors on the checkout page.
You stumble to your laptop, open Grafana, and everything looks... green. CPU usage is low. RAM is fine. Disk space is plentiful. According to your expensive monitoring setup, the system is healthy. But the revenue graph has flatlined.
This is the failure of traditional monitoring. It focuses on the health of the infrastructure, not the behavior of the application. In early 2020, with microservices becoming the standard even in conservative Norwegian enterprise environments, merely checking whether a port is open borders on negligence. We need to move from Monitoring to Observability.
The Difference: Known Unknowns vs. Unknown Unknowns
I get into arguments with CTOs about this weekly. They think buying a SaaS monitoring tool solves the problem. It doesn't.
- Monitoring checks for "known unknowns." You know the disk might fill up, so you set an alert for 90% usage. You know the CPU might spike, so you watch load average. It’s a dashboard of traffic lights.
- Observability allows you to ask questions about "unknown unknowns." Why is latency high only for iOS users in Bergen? Why did that specific SQL query hang, even though the database load is zero?
Pro Tip: Observability isn't a tool you buy; it's a property of your system. If your code swallows exceptions and your logs are unstructured text, no amount of money paid to Datadog or Splunk will save you. You need structured data.
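To make the point concrete, here is a minimal sketch in Go, assuming the logrus library with its JSON formatter (any structured logger will do); the order ID and field names are invented for illustration:

```go
package main

import (
	"errors"

	log "github.com/sirupsen/logrus"
)

// chargeCard stands in for the real payment call; it always fails in this sketch.
func chargeCard(orderID string, amountNOK int) error {
	return errors.New("upstream gateway returned 502")
}

func main() {
	// Emit JSON instead of free-form text so Elasticsearch can index every field.
	log.SetFormatter(&log.JSONFormatter{})

	orderID := "A-1042" // hypothetical order, for illustration only
	if err := chargeCard(orderID, 499); err != nil {
		// Do not swallow the error: attach context as fields, not as string soup.
		log.WithFields(log.Fields{
			"order_id":   orderID,
			"amount_nok": 499,
			"component":  "payment_service",
		}).WithError(err).Error("card charge failed")
	}
}
```

Every field becomes a key you can filter on in Kibana, instead of a substring you have to grep for at 3:00 AM.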
The 2020 Observability Stack
To achieve observability right now, you need three pillars: Metrics, Logs, and Tracing. Here is how we architect this on Linux, tailored to the high-performance demands we see at CoolVDS.
1. Metrics (Prometheus)
Forget Nagios. Prometheus is the standard. It pulls (scrapes) metrics rather than waiting for pushes, which matters for high-load systems: if the monitoring side gets overloaded, it degrades on its own without dragging your application down with it.
Here is a standard scrape config for a Go application running on a VPS:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'payment_service'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'
    scheme: 'http'
But don't just measure CPU. Measure the Golden Signals: Latency, Traffic, Errors, and Saturation.
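To sketch what that means inside the Go payment service the scrape config above points at, here is one way to expose latency, traffic, and errors with the official client_golang library; the /checkout route and metric names are assumptions, and saturation is usually better read from node_exporter on the host:

```go
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Latency + traffic: one histogram gives you both request counts and duration buckets.
var requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "Request latency by path and status code.",
	Buckets: prometheus.DefBuckets,
}, []string{"path", "status"})

// Errors: count 5xx responses separately so alert rules stay simple.
var serverErrors = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "http_server_errors_total",
	Help: "Total 5xx responses by path.",
}, []string{"path"})

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler and records the golden signals for it.
func instrument(path string, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		h(rec, r)
		requestDuration.WithLabelValues(path, strconv.Itoa(rec.status)).
			Observe(time.Since(start).Seconds())
		if rec.status >= 500 {
			serverErrors.WithLabelValues(path).Inc()
		}
	}
}

func main() {
	http.HandleFunc("/checkout", instrument("/checkout", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok")) // placeholder business logic
	}))
	// Prometheus scrapes this endpoint, matching the scrape config above.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```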
2. Structured Logging (ELK Stack)
Grepping text logs is dead. If you are handling traffic from Oslo to Tromsø, you need to aggregate logs. We use the ELK stack (Elasticsearch, Logstash, Kibana).
Critical Warning: Elasticsearch is I/O hungry. I have seen clusters implode because they were hosted on cheap VPS providers using shared spinning rust (HDD) or throttled SSDs. For a proper ELK stack, you need raw NVMe throughput. This is why we enforce NVMe storage on all CoolVDS instances—indexing speed directly correlates to how fast you can search logs during an outage.
Configure Nginx to emit JSON logs so Logstash doesn't have to burn CPU on grok/regex parsing:
http {
    log_format json_combined escape=json
        '{ "time_local": "$time_local", '
        '"remote_addr": "$remote_addr", '
        '"remote_user": "$remote_user", '
        '"request": "$request", '
        '"status": "$status", '
        '"body_bytes_sent": "$body_bytes_sent", '
        '"request_time": "$request_time", '
        '"http_referer": "$http_referer", '
        '"http_user_agent": "$http_user_agent" }';

    access_log /var/log/nginx/access.json json_combined;
}
3. Distributed Tracing (Jaeger/OpenTracing)
This is the missing link for most teams. When a request hits your Load Balancer, touches the Auth Service, queries the Database, and returns 200 OK, how do you track that single journey? You need a Trace ID.
We use Jaeger. It lets you visualize the timeline of a single request. If the database itself is fast but time disappears between your app server and the DB (a saturated network link, or CPU steal from a noisy neighbour on an oversold VPS), Jaeger shows you exactly where that gap is.
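Here is a minimal sketch of how a Go service might emit those spans through the OpenTracing API with the jaeger-client-go library; the service name, agent address, and the db.query span are placeholders rather than a prescribed setup:

```go
package main

import (
	"io"
	"log"
	"time"

	"github.com/opentracing/opentracing-go"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

// initTracer wires up a Jaeger tracer that samples every request.
// The agent address assumes Jaeger runs on the same private network.
func initTracer(service string) (opentracing.Tracer, io.Closer, error) {
	cfg := jaegercfg.Configuration{
		ServiceName: service,
		Sampler: &jaegercfg.SamplerConfig{
			Type:  "const",
			Param: 1, // sample 100% of requests; tune this down in production
		},
		Reporter: &jaegercfg.ReporterConfig{
			LocalAgentHostPort: "127.0.0.1:6831",
			LogSpans:           true,
		},
	}
	return cfg.NewTracer()
}

func main() {
	tracer, closer, err := initTracer("payment_service")
	if err != nil {
		log.Fatalf("could not init tracer: %v", err)
	}
	defer closer.Close()
	opentracing.SetGlobalTracer(tracer)

	// Parent span covers the whole checkout request.
	span := tracer.StartSpan("checkout")

	// Child span isolates the database call, so the gap between
	// app time and DB time becomes visible in the Jaeger UI.
	dbSpan := tracer.StartSpan("db.query", opentracing.ChildOf(span.Context()))
	time.Sleep(20 * time.Millisecond) // stand-in for the real query
	dbSpan.Finish()

	span.Finish()
}
```

The Trace ID attached to the parent span is what gets propagated in HTTP headers between services, so every hop lands on the same timeline.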
The Infrastructure Reality Check
Observability comes at a cost: Data Gravity.
Generating traces and metrics for every request increases your I/O and network throughput requirements by 10-20%.
| Resource | Monitoring Impact | Observability Impact |
|---|---|---|
| Storage | Low (Time-series retention) | High (Log retention, Trace data) |
| CPU | Negligible (Agent based) | Moderate (Serialization of traces) |
| Disk I/O | Low | Very High (Elasticsearch Indexing) |
This is where infrastructure choice becomes a business risk. If you are running on a platform that oversells resources (OpenVZ and other cheap container wrappers), your observability tools will choke exactly when you need them most: during a high-traffic event.
At CoolVDS, we use KVM virtualization. This ensures that the RAM and CPU you allocate to your Elasticsearch node are actually yours. We don't steal cycles. When you are querying 50GB of logs to find why a VIP client in Stavanger got an error, you need the NVMe I/O to be consistent.
Data Sovereignty and The "Datatilsynet" Factor
Here in Norway, we have to talk about GDPR. Observability data often contains PII (IP addresses, User IDs, sometimes accidental email dumps in stack traces).
If you use a US-based SaaS for observability (like New Relic or Datadog), you are shipping this data out of the EEA. With the legal ground under Privacy Shield looking shaky, the safest architectural decision for 2020 is self-hosting.
By hosting your Prometheus and ELK stack on a Norwegian VPS:
- Compliance: Data never leaves Norwegian soil (if using our Oslo zone).
- Latency: Sending metric payloads from an Oslo server to a US-East collector adds 100ms+ overhead. Sending it to a local instance on the same private network adds <1ms.
- Cost: Ingress/Egress fees on major clouds are a trap. We offer generous bandwidth packages because we know observability data is heavy.
Start Small, but Start Now
You don't need to deploy the full suite today. Start by fixing your logs.
- Switch Nginx/Apache to JSON format.
- Spin up a CoolVDS instance with 4GB RAM and NVMe storage.
- Install the ELK stack (Docker makes this easy).
- Ship your logs there.
Stop relying on green lights. Turn on the floodlights. If your current host throttles your I/O when you try to index logs, it’s time to migrate.
Need a sandbox to test your ELK stack? Deploy a high-performance KVM instance on CoolVDS in under 60 seconds and keep your data inside Norway.