Observability vs. Monitoring: Why Your Green Dashboard Is Lying to You

It is 3:00 AM on a Tuesday. Your PagerDuty rotation just woke you up. You stumble to your workstation, open Grafana, and see... nothing. CPU usage is nominal at 40%. Memory is flat. Disk I/O on the database is within limits. According to your dashboard, the system is perfectly healthy.

Yet, your support ticket queue is flooding with angry Norwegian users claiming the checkout page is timing out. This is the nightmare scenario of traditional monitoring: The dashboard is green, but the system is broken.

In 2019, the complexity of distributed systems—microservices, containers, and orchestration—has rendered simple health checks obsolete. We need to stop asking "Is the server up?" and start asking "Why is this specific request failing?" This is the shift from Monitoring to Observability.

The "Known Unknowns" vs. "Unknown Unknowns"

Let’s cut through the marketing fluff. Monitoring is for known unknowns. You know the disk might fill up, so you set an alert for disk usage > 90%. You know the CPU might spike, so you monitor load averages.

Observability is for unknown unknowns. It’s for the problems you never imagined could happen. It allows you to ask arbitrary questions of your system without shipping new code. It requires high-cardinality data—granular details like User IDs, Request IDs, and specific IP addresses—rather than just aggregated averages.
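
To make "high cardinality" concrete, a single structured event carrying enough context to answer questions you have not thought of yet might look like the sketch below (field names and values are purely illustrative):

{ "event": "checkout_timeout", "user_id": "u-84712", "request_id": "f3a9c2",
  "client_ip": "203.0.113.45", "region": "no-oslo", "duration_ms": 9834, "upstream": "payments" }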

Pro Tip: Averages are the enemy of reliability. If 99 requests take 10ms and 1 request takes 10 seconds, your average is roughly 110ms. That looks fine. But that one user is furious. Always monitor percentiles (p95, p99), not averages.
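
In Prometheus terms (more on Prometheus below), that means querying a latency histogram rather than plotting the mean. A minimal PromQL sketch, assuming your application exposes a histogram named http_request_duration_seconds (adapt the metric name to your own instrumentation):

# p99 latency over the last 5 minutes, aggregated across all instances
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))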

The Holy Trinity: Metrics, Logs, and Traces

To achieve observability, you cannot rely on a single tool. You need to correlate three pillars.

1. Metrics (The "What")

We use Prometheus. It is the de facto standard for a reason. Unlike older push-based metric pipelines, Prometheus pulls (scrapes) metrics from your services over HTTP, which is far more reliable for dynamic environments like Docker Swarm or Kubernetes.

Here is a standard scrape config we use for our internal services. Note the scrape interval—don't go below 15s unless you have the storage throughput to handle it.

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
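
With node_exporter in the mix, the "known unknowns" from earlier become simple alerting rules. A minimal sketch, assuming node_exporter's filesystem metrics; the threshold, labels, and file path are illustrative:

# /etc/prometheus/alert_rules.yml -- referenced from rule_files in prometheus.yml
groups:
  - name: disk
    rules:
      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }}"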

2. Logs (The "Why")

Metrics tell you usage spiked. Logs tell you it was a brute-force attack on `wp-login.php`. However, plain text logs are useless at scale. You must use Structured Logging (JSON). If you are still carving fields out of free-text lines with regexes in Logstash, you are wasting CPU cycles.

Configure Nginx to output JSON directly. This makes ingestion into the ELK Stack (Elasticsearch, Logstash, Kibana) trivial.

http {
    log_format json_combined escape=json
      '{ "time_local": "$time_local", '
      '"remote_addr": "$remote_addr", '
      '"remote_user": "$remote_user", '
      '"request": "$request", '
      '"status": "$status", '
      '"body_bytes_sent": "$body_bytes_sent", '
      '"request_time": "$request_time", '
      '"http_referrer": "$http_referer", '
      '"http_user_agent": "$http_user_agent" }';

    access_log /var/log/nginx/access.json json_combined;
}
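
On the ingestion side, the JSON codec does all the parsing for you; no grok patterns required. A hedged Logstash pipeline sketch (the file paths and index name are assumptions, adjust to your layout):

# /etc/logstash/conf.d/nginx-json.conf
input {
  file {
    path  => "/var/log/nginx/access.json"
    codec => "json"
  }
}

filter {
  # Parse nginx's $time_local into @timestamp; no regex involved.
  date {
    match => ["time_local", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "nginx-access-%{+YYYY.MM.dd}"
  }
}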

3. Distributed Tracing (The "Where")

If you run microservices, a request might touch five different servers. If the database is slow, the frontend blames the API, the API blames the cache, and the cache blames the DB. Tracing (using tools like Jaeger) allows you to visualize the request lifecycle.
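
Instrumentation does not have to be heavyweight. Here is a minimal Python sketch using the OpenTracing-compatible jaeger-client library; the service name, span names, and tags are illustrative, not a prescribed convention:

# pip install jaeger-client
from jaeger_client import Config

# Sample every request -- fine for a demo, use probabilistic sampling in production.
tracer = Config(
    config={'sampler': {'type': 'const', 'param': 1}, 'logging': True},
    service_name='checkout-api',
).initialize_tracer()

with tracer.start_span('process-checkout') as span:
    span.set_tag('user.id', 'u-84712')  # high-cardinality context travels with the trace
    with tracer.start_span('charge-card', child_of=span) as child:
        child.set_tag('payment.provider', 'example')
        # ... call the payment service here ...

tracer.close()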

The Infrastructure Cost of Observability

Here is the brutal truth that cloud providers don't tell you: Observability is I/O heavy.

Running an ELK stack (Elasticsearch v7.0 released just last month) requires massive write throughput. Every log line is an index operation. Every metric scrape writes to the disk.

I recently audited a setup where a client tried to run their logging stack on a cheap, standard HDD VPS. The result? `iowait` hit 40%, the logging agent blocked the application, and the "monitoring" tool actually caused the outage.
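
You can spot this failure mode in minutes with iostat from the sysstat package (device names will differ on your box):

# Extended device stats, refreshed every second
iostat -x 1
# Watch %iowait in the CPU line and await/%util per device.
# If the volume holding your log indices sits near 100% util, the disk is the bottleneck.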

This is why we architect CoolVDS exclusively on NVMe storage with KVM virtualization. We don't oversell our I/O. When you are ingesting 5,000 log lines per second during a DDoS attack or a viral marketing campaign, you need the underlying storage to write as fast as the network card can receive.

Configuration for Performance

If you are deploying Elasticsearch on CoolVDS, tune your `jvm.options` and system limits immediately. Do not stick with defaults.

# /etc/security/limits.conf
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited

# /etc/sysctl.conf
vm.max_map_count=262144

And ensure your heap size is set correctly in `jvm.options`, usually 50% of your available RAM, but never crossing the 32GB compressed oops threshold:

-Xms4g
-Xmx4g
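
After restarting, verify the settings actually took effect; a silently ignored memlock is a classic footgun. A quick sanity check, assuming Elasticsearch listens on localhost:9200:

# Kernel limit applied?
sysctl vm.max_map_count

# Did Elasticsearch lock its heap and pick up the right sizes?
curl -s 'localhost:9200/_nodes?filter_path=**.mlockall'
curl -s 'localhost:9200/_cat/nodes?h=name,heap.max'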

Data Sovereignty and GDPR in Norway

Observability data contains PII. IP addresses, User IDs, and sometimes raw query parameters are stored in your logs. Under GDPR and the scrutiny of the Norwegian Data Protection Authority (Datatilsynet), you are the data controller.

Sending your raw logs to a SaaS provider hosted in the US creates a compliance headache. By hosting your Prometheus and ELK stack on CoolVDS instances located in our Oslo data center, you ensure that your customer data never leaves Norwegian jurisdiction. You get lower latency for your ingestion pipeline and peace of mind regarding compliance.
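
If you need to retain logs beyond a short window, consider pseudonymizing the PII at ingestion time. One hedged approach is Logstash's fingerprint filter, which replaces the raw IP with a keyed hash (the key below is a placeholder; treat the real one as a secret):

filter {
  fingerprint {
    source => "remote_addr"
    target => "remote_addr"
    method => "SHA256"
    key    => "CHANGE_ME"   # HMAC key -- rotate it and store it outside the pipeline config
  }
}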

Summary: Stop Guessing

If you are still SSH-ing into servers to run `tail -f /var/log/syslog`, you are operating blind. Building an observability pipeline takes effort—you need to configure Docker logging drivers, set up the ELK stack, and tune Prometheus.
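
For the Docker piece, at minimum cap and rotate the default json-file driver so container logs cannot fill the very disk they are supposed to help you watch. A minimal sketch:

# /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  }
}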

But the prerequisite for all of this is raw compute power and fast I/O. Don't let your monitoring infrastructure be the bottleneck.

Ready to build a stack that sees everything? Deploy a high-performance NVMe KVM instance on CoolVDS today and stop fearing the 3 AM pager.