Observability vs Monitoring: Why Your Green Dashboard is Lying to You

It is 03:42 in Oslo. Your phone buzzes. It's PagerDuty. The alert is generic: CPU_LOAD_HIGH on your primary API gateway.

You stumble to your laptop, SSH in, and run htop. The load is normal. The memory usage is fine. The dashboard is all green again. You go back to sleep, only to be woken up twenty minutes later. Rinse and repeat.

This is the failure of Monitoring. You are looking at aggregate health checks—binary states of "Up" or "Down." But in 2019, with microservices and container orchestration becoming the standard, "Up" doesn't mean "Working."

We need to stop just monitoring our servers and start building Observability. If monitoring tells you the system is broken, observability allows you to ask the system why.

The "Unknown Unknowns"

I recently consulted for a fintech startup in Stavanger. They were running a standard LEMP stack on a cheap, oversold VPS provider (not naming names, but you know the ones). They had Zabbix set up perfectly. Every disk, CPU core, and network interface was graphed.

Yet, every day at 14:00, their checkout process timed out for 5% of users. Zabbix showed nothing. No spikes.

Monitoring tracks the "known unknowns." You know disk space can run out, so you monitor it.
Observability helps you find the "unknown unknowns." You didn't know that a noisy neighbor on the host node was flushing their cache at 14:00, saturating the SATA bus and causing I/O wait times to skyrocket for 200 milliseconds—too short for a 1-minute Zabbix poll to catch, but long enough to kill a transaction.
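
If you suspect sub-minute stalls like that, catch them with a tool that samples faster than your poller. A rough sketch using iostat from the sysstat package (the output path is just a placeholder):

# Sample extended disk stats every second for 5 minutes around the suspect window.
# Watch the await (or r_await/w_await) and %util columns: a spike lasting a few
# hundred milliseconds shows up here but gets averaged away by a 1-minute poll.
iostat -x 1 300 >> /tmp/iostat-1400.log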

Structuring Logs for Machines, Not Humans

The first step to observability in 2019 is accepting that grep is not a strategy. Standard Apache/Nginx logs are useless for high-cardinality analysis. You need structured data.

Here is how we configure Nginx on high-performance CoolVDS instances to output JSON. This allows us to feed logs directly into an ELK stack (Elasticsearch, Logstash, Kibana) or a sidecar Fluentd process.

Nginx JSON Configuration

http {
    # Note: the escape=json parameter requires Nginx 1.11.8 or newer.
    log_format json_combined escape=json
      '{ "time_local": "$time_local", '
      '"remote_addr": "$remote_addr", '
      '"remote_user": "$remote_user", '
      '"request": "$request", '
      '"status": "$status", '
      '"body_bytes_sent": "$body_bytes_sent", '
      '"request_time": "$request_time", '
      '"upstream_response_time": "$upstream_response_time", '
      '"http_referrer": "$http_referer", '
      '"http_user_agent": "$http_user_agent" }';

    access_log /var/log/nginx/access.json json_combined;
}

Critical Detail: Notice the $upstream_response_time. It measures how long Nginx spent talking to your PHP-FPM or Node.js backend, from connecting to it until the last byte of its response arrived, while $request_time covers the whole exchange, from the first byte Nginx reads from the client to the last byte it writes back. If $request_time is high but $upstream_response_time is low, the latency is in the network (or a slow client), not in your app.
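
Once the logs are JSON, ad-hoc analysis gets easy even before the ELK stack is wired up. A quick sketch with jq (assuming it is installed), pulling out requests that took Nginx more than one second and showing how much of that time the backend was responsible for:

# Print slow requests with total time vs. backend time. upstream_response_time may
# contain several comma-separated values if Nginx retried, so it is printed as-is.
jq -r 'select((.request_time | tonumber) > 1)
  | "\(.time_local)  \(.request)  total=\(.request_time)s  upstream=\(.upstream_response_time)s"' \
  /var/log/nginx/access.json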

Metrics Collection with Prometheus

Old-school monitoring typically pushes data to a central server (think Nagios passive checks). Modern observability pulls it. Prometheus has become the de facto standard here, scraping metrics from HTTP endpoints you expose.

If you are running on a Linux VPS, you need the Node Exporter. But don't just run the binary. Systemd is your friend for resilience.

Systemd Unit for Node Exporter

[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
# Restart automatically if the exporter crashes; this is where the resilience comes from.
Restart=on-failure
RestartSec=5
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes

[Install]
WantedBy=multi-user.target
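
Reload systemd and enable the service (systemctl daemon-reload && systemctl enable --now node_exporter), then point Prometheus at it. A minimal scrape job sketch; the target address below is a placeholder for your instance's IP:

# prometheus.yml (fragment): scrape the exporter every 15 seconds.
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      # 9100 is node_exporter's default port; replace the address with your own.
      - targets: ['203.0.113.10:9100']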

Pro Tip: Be careful with cardinality. If you tag every metric with a unique user ID, your Prometheus time-series database will explode in size. Stick to labels with a finite set of values, like status_code, endpoint, or region.
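
To make that concrete, here is what the difference looks like in the text exposition format Prometheus scrapes (metric and label names are purely illustrative):

# Good: every label has a small, fixed set of possible values.
http_requests_total{endpoint="/checkout",status_code="502",region="no-east"} 17

# Bad: user_id is unbounded, so every user spawns a brand-new time series.
# http_requests_total{user_id="8f3c91d2",endpoint="/checkout"} 1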

The Infrastructure Reality Check

You can have the best Grafana dashboards in Europe, but if your underlying infrastructure is a black box, you are observing noise.

This is where virtualization technology matters. In container-based VPS hosting (OpenVZ, LXC), the kernel is shared with every other tenant on the node. You often cannot see true I/O wait or CPU steal figures, because the host virtualizes /proc and hides host-level contention from its containers.

This is why at CoolVDS, we exclusively use KVM (Kernel-based Virtual Machine). With KVM, your kernel is yours. If your observability tools report 40ms disk latency, that is a fact, not a noisy neighbor artifact.
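
A quick sanity check you can run from inside any guest: watch the wa (I/O wait) and st (steal) columns in vmstat. On a KVM instance those numbers come from your own kernel; on a shared-kernel container they may be missing or meaningless.

# Print CPU and memory stats every second, ten times.
# "wa" is time spent waiting on I/O, "st" is time stolen by the hypervisor.
vmstat 1 10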

Verifying I/O Performance

When you deploy a new instance, benchmark it immediately. You need a baseline to observe deviations from. Here is a standard fio command we use to test random write performance (the database killer):

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test --filename=test --bs=4k --iodepth=64 --size=1G \
    --readwrite=randwrite

On a CoolVDS NVMe instance, you should see IOPS in the tens of thousands. On a standard SATA VPS, you'll be lucky to hit 400. That difference is what causes the 14:00 timeout spikes.

Data Sovereignty and Datatilsynet

Observability means logging. Logging means storing IP addresses and potentially user identifiers. Under the GDPR, those count as personal data (what many teams still call PII).

If you are piping your logs to a SaaS observability platform hosted in the US, you are navigating a legal minefield. The Privacy Shield framework is shaky, and many Norwegian legal teams are becoming increasingly risk-averse regarding data transfer outside the EEA.

Hosting your ELK or Prometheus stack on a VPS in Norway (like CoolVDS) solves the data residency headache instantly. Your data stays under Norwegian jurisdiction, physically located in Oslo-area data centers, compliant with Datatilsynet requirements.

Conclusion: Stop Guessing

Building an observable system takes more effort than installing htop. It requires structured logging, time-series metrics, and infrastructure that doesn't lie to you.

But the payoff is sleeping through the night. When an alert fires, it won't just say "Problem." It will say "High Latency on Checkout Microservice due to DB Lock Wait."
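
With Prometheus in place, that kind of descriptive alert is just a rule file away. A sketch, assuming you instrument request durations as a histogram (the metric name, threshold, and labels below are illustrative):

# checkout_alerts.yml: load this via rule_files in prometheus.yml.
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutHighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{endpoint="/checkout"}[5m]))) > 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "p99 latency on /checkout above 1s for 5 minutes"
          description: "Check upstream_response_time in the Nginx logs and look for DB lock waits."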

Don't let opaque infrastructure blind you. Deploy a KVM-based, NVMe-powered instance on CoolVDS today and start seeing what's really happening inside your stack.