Observability vs Monitoring: Why Your "Green" Dashboard is Lying to You
It’s 2 AM on a Tuesday. Your Zabbix dashboard is a sea of calming green. CPU load is at 40%, disk space is ample, and the ping checks to your Oslo data center are returning a crisp 12ms. Yet your support ticket queue is flooding with angry Norwegian customers claiming they can't check out.
This is the classic failure of Monitoring. It tells you the server is alive. It doesn't tell you the application is dying.
With the GDPR deadline hitting us next month (May 25th), the stakes have never been higher. You can't just dump logs into a US-based S3 bucket anymore without giving your legal team a heart attack. You need granular insight, hosted locally, with strict control. That is where Observability comes in.
The Difference: "Is it Up?" vs "Why is it Slow?"
Monitoring is for the known unknowns. You know disk space can run out, so you monitor it. You know MySQL can crash, so you check the process state.
Observability is for the unknown unknowns. It’s the ability to ask arbitrary questions about your system without shipping new code. It allows you to trace a request from a user in Bergen, through your Nginx load balancer, into your PHP-FPM worker, down to a slow MySQL query, and understand exactly where that 500ms latency spike came from.
The 2018 Observability Stack: Logs, Metrics, and Tracing
To achieve this, we need to move beyond simple check_http. We are looking at three pillars: Metrics (Prometheus), Logs (ELK Stack), and Tracing (starting to mature with tools like Jaeger, though complex to implement).
1. Structured Logging: The Foundation
The days of grepping text files in /var/log are over. If you aren't logging in JSON, you aren't doing observability: you need structured fields before you can aggregate, filter, or graph anything.
Here is how I configure Nginx on high-traffic CoolVDS instances to feed Logstash. Note the inclusion of request_time and upstream_response_time. These two variables are the difference between guessing and knowing.
http {
    log_format json_combined escape=json
        '{ "time_local": "$time_local", '
        '"remote_addr": "$remote_addr", '
        '"remote_user": "$remote_user", '
        '"request": "$request", '
        '"status": "$status", '
        '"body_bytes_sent": "$body_bytes_sent", '
        '"request_time": "$request_time", '
        '"upstream_response_time": "$upstream_response_time", '
        '"http_referrer": "$http_referer", '
        '"http_user_agent": "$http_user_agent" }';

    access_log /var/log/nginx/access.json json_combined;
}
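Getting those JSON lines into Elasticsearch takes a minimal Logstash pipeline. Treat the following as a sketch rather than a drop-in config: the file path matches the access_log above, but the Elasticsearch host, the index naming scheme, and the choice to promote time_local to @timestamp are assumptions you should adapt to your own setup.
input {
  file {
    # Tail the JSON access log defined in the Nginx config above
    path  => "/var/log/nginx/access.json"
    codec => "json"
  }
}

filter {
  # Use the request timestamp from Nginx instead of the ingestion time
  date {
    match  => ["time_local", "dd/MMM/yyyy:HH:mm:ss Z"]
    target => "@timestamp"
  }
  # Make the latency fields numeric so Kibana can aggregate on them;
  # requests served without an upstream leave upstream_response_time empty
  mutate {
    convert => {
      "request_time"           => "float"
      "upstream_response_time" => "float"
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]            # assumed local Elasticsearch node
    index => "nginx-access-%{+YYYY.MM.dd}" # daily indices keep retention simple
  }
}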
2. Metrics that Matter with Prometheus
StatsD is fine, but Prometheus (currently v2.2) has become the standard for modern infrastructure. It pulls metrics on a schedule you control rather than waiting for agents to push them, which means a fleet of overloaded hosts can't flood your monitoring backend at exactly the moment you need it most.
A typical prometheus.yml scrape config for a Linux node exporter looks like this:
scrape_configs:
  - job_name: 'node_exporter_oslo'
    scrape_interval: 15s
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):9100"
        target_label: instance
        replacement: "${1}"
Pro Tip: Don't just monitor CPU usage. Monitor CPU Steal. If you are on a cheap, oversold VPS provider, your CPU steal will spike even when your usage is low, causing random application stuttering. This is why we insist on KVM virtualization at CoolVDS—we don't play the "noisy neighbor" game.
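If you want Prometheus to page you on steal rather than relying on someone eyeballing a graph, a rule along these lines does the job. It is a sketch: the metric name is node_cpu with node_exporter 0.15.x (later releases rename it to node_cpu_seconds_total), and the 5% threshold over 10 minutes is a starting point, not gospel.
groups:
  - name: cpu_steal
    rules:
      - alert: HighCpuSteal
        # Percentage of CPU time stolen by the hypervisor, averaged per instance
        expr: avg by (instance) (rate(node_cpu{mode="steal"}[5m])) * 100 > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU steal above 5% on {{ $labels.instance }} for 10 minutes"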
3. The Database Bottleneck
Your application is usually only as fast as your database. Enabling the slow query log in MySQL 5.7 is mandatory. But logging alone won't save you: also verify that innodb_buffer_pool_size is actually sized for your RAM allocation, because an undersized buffer pool is where most of those slow queries come from.
[mysqld]
# Ensure you catch queries taking longer than 1 second
long_query_time = 1
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
log_queries_not_using_indexes = 1
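For the buffer pool itself, the usual rule of thumb on a dedicated database host is to hand InnoDB most of the machine's memory. The figure below is an assumption for a 16 GB instance running nothing but MySQL; scale it to your own allocation and leave headroom for connections and the OS.
[mysqld]
# Assumes a dedicated 16 GB database instance: roughly 70-75% of RAM,
# so the working set stays in memory instead of hitting disk.
innodb_buffer_pool_size = 12G
Once the slow log starts filling up, mysqldumpslow -s t /var/log/mysql/mysql-slow.log gives you a quick ranking of offenders by total execution time before you reach for heavier tooling.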
The Hardware Reality: ELK Needs IOPS
Here is the painful truth about observability: It is resource heavy.
Elasticsearch (the 'E' in ELK) is a beast. It indexes every single field in those JSON logs we created earlier. If you try to run an ELK stack on standard HDD or cheap SSD VPS hosting, your I/O wait times will skyrocket. The logging infrastructure itself will become the bottleneck.
This is a scenario we see constantly: a dev team spins up an ELK stack to debug performance, but the indexing I/O load crashes the server.
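Before blaming Elasticsearch heap settings, check whether the disk is the real problem. A quick look at extended I/O statistics while Logstash is indexing usually settles the argument; the interpretation below is a rule of thumb, not a hard threshold.
# Sample extended device statistics every 5 seconds during indexing.
# Sustained high %iowait plus %util near 100 on the Elasticsearch data
# volume means storage, not the JVM, is your bottleneck.
iostat -x 5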
To run this stack effectively in 2018, you need:
- NVMe Storage: The random read/write speeds of NVMe are non-negotiable for Elasticsearch indexing.
- Dedicated Resources: You cannot share CPU cycles with 500 other tenants.
- Network Throughput: Shipping gigabytes of logs requires a fat pipe.
This is where CoolVDS makes the difference. Our NVMe arrays are designed for high-throughput database and logging workloads. We don't throttle IOPS just because you're writing logs.
The GDPR Elephant in the Room
We are weeks away from GDPR enforcement. The IP addresses in your access logs count as personally identifiable information (PII). If you are shipping those logs to a US-based cloud provider for analysis, you are walking into a legal minefield; Schrems judgment or not, the uncertainty is high.
Keeping your observability stack (logs, metrics, user data) on servers physically located in Norway or the EEA is the safest path to compliance: it keeps you on the right side of Datatilsynet and keeps data sovereignty in your own hands.
Implementation Strategy
Don't try to boil the ocean. Start small:
- Switch Nginx to JSON logging today. It costs you nothing.
- Deploy Node Exporter. Get visibility into hardware interrupts and context switches; a minimal install sketch follows this list.
- Centralize. Spin up a dedicated CoolVDS instance for your monitoring stack. Keep it separate from production to ensure your observer doesn't go down when the ship sinks.
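For step 2, getting node_exporter onto a host takes minutes. This sketch assumes a 64-bit Linux box and node_exporter 0.15.2 (check the Prometheus releases page for whatever is current); in production you would run it under systemd rather than backgrounding it by hand.
# Download and unpack the node_exporter release tarball (version assumed)
wget https://github.com/prometheus/node_exporter/releases/download/v0.15.2/node_exporter-0.15.2.linux-amd64.tar.gz
tar xzf node_exporter-0.15.2.linux-amd64.tar.gz
sudo cp node_exporter-0.15.2.linux-amd64/node_exporter /usr/local/bin/

# Start it and confirm it serves metrics on the default port (9100)
/usr/local/bin/node_exporter &
curl -s http://localhost:9100/metrics | head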
Observability isn't about pretty charts. It's about survival. It's about knowing why the checkout failed before the customer tweets about it.
Ready to take control of your infrastructure? Deploy a high-performance NVMe instance on CoolVDS in Oslo. Low latency, high IOPS, and GDPR-ready.