Observability is Not Just "Monitoring on Steroids"
It is 3:00 AM on a Tuesday. Your Zabbix dashboard is all green. The load balancers report healthy upstreams. Yet, your CEO is on the phone screaming that the checkout page takes 15 seconds to load for users in Trondheim. This is the classic failure of traditional monitoring.
In the distributed systems we build today—whether you are running monoliths on VMs or microservices on Kubernetes v1.18—binary "up/down" checks are useless. You need to be able to debug your infrastructure from the evidence it generates in real time. That is the difference between monitoring and observability.
I have spent the last decade debugging high-traffic clusters across Europe. I have seen "monitored" systems fail spectacularly because we were measuring the wrong things. Here is how to fix your stack using tools available right now, in mid-2020, and why the underlying hardware (specifically NVMe storage) dictates your success.
The Lie of "System Healthy"
Monitoring is for known unknowns. You know the disk might fill up, so you set an alert for 90% usage. You know the CPU might spike, so you watch load averages.
Observability is for unknown unknowns. It allows you to ask arbitrary questions about your system without shipping new code. "Why is latency high only for iOS users on the checkout endpoint when the database CPU is idle?" Monitoring cannot answer that. Observability can.
The Three Pillars (2020 Edition)
To achieve this, we rely on the standard triad: Metrics, Logs, and Tracing. Let's break down the configuration required for each, moving away from default settings that kill performance.
1. Metrics: Beyond Simple Counters
We are long past the days of Cacti graphs. Prometheus is the standard. With Grafana 7.0 released just last month, visualizing this data is easier than ever, but collection is where most setups go wrong.
Do not just scrape system metrics. Instrument your code. If you are running a Go service, you need to expose custom metrics from the application itself; a sketch of that follows the scrape config below. Here is a prometheus.yml scrape config pattern I use for dynamic service discovery, ensuring we don't miss targets:
scrape_configs:
  - job_name: 'coolvds-node-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
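On the application side, the prometheus/client_golang library is the standard way to expose custom metrics from a Go service. Here is a minimal sketch; the metric name, handler path, and port are illustrative, not gospel:

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Histogram of checkout latencies, labelled by HTTP status.
// Keep labels low-cardinality: status codes, not user IDs.
var checkoutDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "checkout_request_duration_seconds",
        Help:    "Time spent serving /checkout requests.",
        Buckets: prometheus.DefBuckets,
    },
    []string{"status"},
)

func handleCheckout(w http.ResponseWriter, r *http.Request) {
    // Time the handler and record the observation on the way out.
    timer := prometheus.NewTimer(checkoutDuration.WithLabelValues("200"))
    defer timer.ObserveDuration()
    w.Write([]byte("ok"))
}

func main() {
    http.HandleFunc("/checkout", handleCheckout)
    // Prometheus scrapes this endpoint; add it as a target in prometheus.yml.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":2112", nil))
}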
Pro Tip: High-cardinality metrics (like tracking HTTP latency per user ID) will balloon your series count and eventually crash your Prometheus instance. Store high-cardinality data in your logs, not your time-series database. And if your time-series DB starts swapping to disk on a standard HDD VPS, you are dead. This is why we default to NVMe storage at CoolVDS; time-series ingestion is I/O heavy.
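If you are not sure where your series count is coming from, a quick diagnostic is to ask Prometheus itself which metric names own the most series (a rough check, not an exhaustive cardinality audit):

# Top 10 metric names by number of series
topk(10, count by (__name__)({__name__=~".+"}))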
2. Logs: Structured or Nothing
Grepping text files in /var/log/nginx/ is barbaric. If you aren't logging in JSON, you aren't doing observability. You need to feed your logs into an aggregation system like the ELK Stack (Elasticsearch, Logstash, Kibana) or the lighter EFK variant (with Fluentd in place of Logstash).
Here is how to force Nginx to output structured data that machines can actually parse. Put this in your nginx.conf:
http {
    log_format json_combined escape=json
        '{ "timestamp": "$time_iso8601", '
        '"remote_addr": "$remote_addr", '
        '"request_time": $request_time, '
        '"upstream_response_time": "$upstream_response_time", '
        '"status": $status, '
        '"request_method": "$request_method", '
        '"request_uri": "$request_uri", '
        '"host": "$host", '
        '"http_user_agent": "$http_user_agent" }';

    access_log /var/log/nginx/access.json json_combined;
}
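If you go the EFK route mentioned above, a Fluentd input that tails this file and forwards it to Elasticsearch looks roughly like this; the paths, host, and tag are placeholders, and the output stage assumes the fluent-plugin-elasticsearch plugin is installed:

<source>
  @type tail
  path /var/log/nginx/access.json
  pos_file /var/lib/fluentd/nginx-access.pos
  tag nginx.access
  <parse>
    @type json
  </parse>
</source>

<match nginx.access>
  @type elasticsearch
  host 127.0.0.1
  port 9200
  logstash_format true
</match>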
Now, when you ship this to Elasticsearch, you can filter by request_time > 1.0. Suddenly, you aren't guessing. You know exactly which API calls are slow.
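For instance, assuming your logs land in an index pattern like nginx-* (the index name here is just an example), this range query in Kibana's Dev Tools returns every request slower than one second:

GET nginx-*/_search
{
  "query": {
    "range": {
      "request_time": { "gt": 1.0 }
    }
  }
}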
3. Distributed Tracing: The Jaeger Factor
This is where most setups in Norway fall short. If you have microservices, logs are disjointed. You need a trace ID that flows from the Load Balancer -> Frontend -> API -> Database.
In 2020, Jaeger is the robust choice here (OpenTelemetry is still in beta, so let's stick to what is production-proven today). Instrumenting your services against the OpenTracing API lets you visualize the full waterfall of a request.
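A minimal Go setup with jaeger-client-go and the OpenTracing API looks roughly like this; the service name, agent address, and tag are placeholders, and sampling every request is fine for testing but not for production traffic:

package main

import (
    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go/config"
)

func main() {
    cfg := config.Configuration{
        ServiceName: "checkout-api", // placeholder service name
        Sampler: &config.SamplerConfig{
            Type:  "const", // sample everything; tune this down in production
            Param: 1,
        },
        Reporter: &config.ReporterConfig{
            LocalAgentHostPort: "127.0.0.1:6831", // Jaeger agent, UDP
        },
    }

    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        panic(err)
    }
    defer closer.Close()
    opentracing.SetGlobalTracer(tracer)

    // One span per unit of work; downstream services join the same
    // trace by propagating the trace ID in request headers.
    span := tracer.StartSpan("checkout")
    span.SetTag("peer.service", "payment-gateway")
    span.Finish()
}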
The Infrastructure Cost of Observability
Here is the trade-off nobody talks about: Observability is expensive.
- Elasticsearch is a memory and I/O hog. It requires heavy Java heaps.
- Prometheus eats RAM as your series count grows.
- Tracing adds network overhead to every request.
If you run this stack on a budget VPS with "noisy neighbor" problems and spinning rust (HDD), your monitoring stack will fail exactly when you need it: during a high-load incident.
I recently audited a setup where the ELK stack was hosted on a shared platform. When traffic spiked, the "steal time" (CPU cycles the hypervisor hands to other tenants) on the logging server hit 20%. The logs fell behind by 40 minutes. We were flying blind.
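You can check this on any Linux box yourself; the st column in vmstat (or %steal in sar, if the sysstat package is installed) shows CPU time the hypervisor is giving away to other tenants:

# Watch the "st" column; sustained values above a few percent on a
# monitoring host mean your neighbours are eating your CPU.
vmstat 1 5
sar -u 1 5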
| Feature | Standard VPS | CoolVDS Performance Instance |
|---|---|---|
| Storage | Shared SATA/SSD (Low IOPS) | Dedicated NVMe (High IOPS for Elasticsearch) |
| CPU | Shared / Burstable | Dedicated Cores (Crucial for Prometheus ingestion) |
| Network | Best Effort | Low Latency to NIX (Norwegian Internet Exchange) |
Data Sovereignty and the "Schrems" Headache
We need to talk about Datatilsynet (the Norwegian Data Protection Authority). Logs contain IP addresses, and IP addresses are personal data (PII) under the GDPR.
If you are shipping your logs to a SaaS monitoring platform hosted in the US, you are walking a legal tightrope. The Privacy Shield framework is under massive scrutiny right now, and the Schrems II judgment is expected shortly. The safest architectural decision in 2020 is to keep your observability data local.
Hosting your own ELK or Prometheus stack on a CoolVDS instance in Oslo ensures that your user data never leaves Norwegian jurisdiction. It lowers latency for log shipping and keeps your Compliance Officer happy.
Configuration for Performance
If you are self-hosting Elasticsearch on our instances, do not use the defaults. Modify /etc/elasticsearch/jvm.options and /etc/security/limits.conf immediately.
# /etc/security/limits.conf
# Raise the open-file descriptor limit for the elasticsearch user
elasticsearch - nofile 65535
# Allow Elasticsearch to lock memory so it doesn't swap
elasticsearch - memlock unlimited
And ensure your JVM heap is set to 50% of available RAM, but no more than 31GB, to keep the compressed oops optimization enabled. Note that the memlock limit above only takes effect if bootstrap.memory_lock: true is also set in elasticsearch.yml.
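On a 32 GB instance, for example, that works out to something like the following in /etc/elasticsearch/jvm.options (the 16g figure is illustrative; size it to half of your own RAM):

# /etc/elasticsearch/jvm.options
-Xms16g
-Xmx16g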
Final Thoughts
Observability allows you to move fast because you aren't afraid of breaking things—you know you'll spot the break immediately. But it requires a foundation of serious hardware. You cannot build a skyscraper on a swamp, and you cannot build an observability stack on oversold hosting.
Don't let I/O wait times hide the root cause of your outage.
Ready to build a monitoring stack that actually works? Deploy a high-frequency NVMe instance on CoolVDS today and keep your data strictly in Norway.