
Observability vs. Monitoring: Why Your "Green" Dashboard Is Lying to You

Stop Looking at CPU Usage. It Tells You Nothing.

It’s 3:00 AM. PagerDuty screams. You open your Grafana dashboard. All the lights are green. CPU is at 40%, RAM is stable, and disk space is plentiful. Yet, Twitter is melting down because users in Trondheim can't process payments.

This is the failure of Monitoring. Monitoring is a predefined set of checks: "Is the disk full?" "Is the ping responding?" It answers known unknowns.

Observability is entirely different. It allows you to ask arbitrary questions about your system without shipping new code. It answers unknown unknowns. "Why is latency spiking to 500ms, but only for requests hitting the /api/v2/cart endpoint containing a specific promo code?"

In mid-2021, if you are still relying solely on Nagios checks or simple health endpoints, you are flying blind. Let’s architect a solution that actually works, keeping the data strictly within Norwegian borders to satisfy GDPR.

The Three Pillars: Logs, Metrics, Traces

To move from monitoring to observability, you need to correlate three data streams. Think of them as a three-legged stool: remove any one leg and the whole thing falls over.

1. Structured Logs (The Context)

Grep is dead. If you are ssh-ing into a server to read /var/log/nginx/access.log, you’ve already lost. You need structured JSON logs that can be ingested by Loki or Elasticsearch. Plain text logs are useless for aggregation.

Here is how we configure Nginx on our CoolVDS instances to output machine-readable data suitable for ingestion:

http {
    log_format json_analytics escape=json
    '{'
      '"msec": "$msec", ' # connection time
      '"connection": "$connection", ' # connection serial number
      '"connection_requests": "$connection_requests", ' # number of requests made in this connection
      '"pid": "$pid", ' # process pid
      '"request_id": "$request_id", ' # the unique request id
      '"request_length": "$request_length", ' # request length (including headers and body)
      '"remote_addr": "$remote_addr", ' # client IP
      '"remote_user": "$remote_user", ' # client HTTP username
      '"remote_port": "$remote_port", ' # client port
      '"time_local": "$time_local", '
      '"time_iso8601": "$time_iso8601", ' # local time in the ISO 8601 standard format
      '"request": "$request", ' # full path no arguments if the request is GET
      '"request_uri": "$request_uri", ' # full path and arguments if the request is GET
      '"args": "$args", ' # args
      '"status": "$status", ' # response status code
      '"body_bytes_sent": "$body_bytes_sent", ' # the number of body bytes exclude headers sent to a client
      '"bytes_sent": "$bytes_sent", ' # the number of bytes sent to a client
      '"http_referer": "$http_referer", ' # HTTP referer
      '"http_user_agent": "$http_user_agent", ' # user agent
      '"http_x_forwarded_for": "$http_x_forwarded_for", ' # http_x_forwarded_for
      '"http_host": "$http_host", ' # the request Host: header
      '"server_name": "$server_name", ' # the name of the vhost serving the request
      '"request_time": "$request_time", ' # request processing time in seconds with msec resolution
      '"upstream": "$upstream_addr", ' # upstream backend server for proxied requests
      '"upstream_connect_time": "$upstream_connect_time", ' # upstream handshake time incl. TLS
      '"upstream_header_time": "$upstream_header_time", ' # time spent receiving upstream headers
      '"upstream_response_time": "$upstream_response_time", ' # time spend receiving upstream body
      '"upstream_response_length": "$upstream_response_length", ' # upstream response length
      '"upstream_cache_status": "$upstream_cache_status", ' # cache HIT/MISS where applicable
      '"ssl_protocol": "$ssl_protocol", ' # TLS protocol
      '"ssl_cipher": "$ssl_cipher", ' # TLS cipher
      '"scheme": "$scheme", ' # http or https
      '"request_method": "$request_method", ' # request method
      '"server_protocol": "$server_protocol", ' # request protocol, like HTTP/1.1 or HTTP/2.0
      '"pipe": "$pipe", ' # "p" if request was pipelined, "." otherwise
      '"gzip_ratio": "$gzip_ratio", '
      '"http_cf_ray": "$http_cf_ray"'
    '}';

    access_log /var/log/nginx/json_access.log json_analytics;
}

By capturing $request_id and $upstream_response_time, we stop guessing whether the database is slow or PHP-FPM is hanging. The data is right there.
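Once Promtail (or your shipper of choice) pushes this file into Loki, LogQL can slice it on any of those fields. A minimal sketch, assuming Loki 2.x and a stream labelled job="nginx" — the label and field names simply mirror the format above, so adjust them to whatever your scrape config assigns:

{job="nginx"}
  | json
  | status >= 500 or upstream_response_time > 0.5
  | line_format "{{.request_id}} {{.request_uri}} upstream={{.upstream_response_time}}s"

One query surfaces every slow or failing request together with the request ID you need to chase it through the rest of the stack.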

2. Metrics (The Trends)

Metrics differ from logs because they are aggregations: cheap numeric time series rather than individual events. Prometheus is the standard here. Don't use it to store text; use it to store numbers.

A common mistake I see on client servers is alerting on node_load1. Load average is a relic of the 90s. On a modern multi-core system, high load doesn't necessarily mean degradation. Instead, alert on saturation.

Here is a useful PromQL query for detecting actual user-facing latency issues, not just "busy servers":

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)) > 0.5

This alerts you if the 99th percentile of requests on any route exceeds 500ms over a 5-minute window. This is what your users feel. They don't care if your CPU is at 90% as long as the page loads.
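To turn that query into a page rather than a dashboard panel, drop it into a Prometheus alerting rule. A sketch using the same metric names; the group, alert name, and label values are placeholders:

groups:
  - name: latency.rules
    rules:
      - alert: RouteP99LatencyHigh
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)) > 0.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "p99 latency on {{ $labels.route }} is above 500ms"

The for: 5m clause keeps a single bad scrape from waking anyone up at 3:00 AM.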

3. Tracing (The Glue)

This is where OpenTelemetry (OTel) comes in. It's 2021; if you aren't looking at OTel to replace OpenTracing, you're falling behind.

Tracing allows you to follow that Nginx $request_id all the way through your microservices. It visualizes the waterfall of the request.
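Even before full OTel instrumentation, you get most of the correlation value by forwarding one ID per request. A minimal nginx sketch (the header name and upstream group here are our own conventions, not a standard; your OTel SDKs will propagate the W3C traceparent header themselves):

location /api/ {
    proxy_set_header X-Request-ID $request_id;  # same ID that lands in the JSON access log
    proxy_pass http://app_backend;              # hypothetical upstream group
}

Your application logs and spans then carry the same ID the edge logged, so jumping from a trace to the exact Nginx log line is a copy-paste, not an archaeology dig.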

The Infrastructure Reality: IOPS Matter

Here is the hard truth nobody talks about: Observability is heavy.

Running a localized stack (Prometheus + Loki + Grafana) generates massive disk I/O. Loki indexes chunks of log streams. Prometheus writes Write-Ahead Logs (WAL) constantly. If you attempt this on a cheap VPS with standard SSDs (or worse, spinning rust) and "noisy neighbors," your monitoring stack will crash exactly when you need it most—during a high-traffic incident.
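You can at least cap Prometheus's appetite for disk with its storage flags. A sketch; tune retention to your own compliance window:

prometheus \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.wal-compression

WAL compression trades a little CPU for noticeably less write amplification, which is usually the right trade on a busy box.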

Pro Tip: Never run your observability stack on the same physical disk controller as your production database. If the DB spirals, it starves the very metrics you need to diagnose it.

This is why at CoolVDS, we enforce strict KVM isolation and use NVMe storage arrays. We’ve seen Docker containers running the ELK stack (Elasticsearch, Logstash, Kibana) saturate I/O on competitor platforms, causing iowait to spike to 40%. On our NVMe instances, the IOPS headroom absorbs these log streams without stealing cycles from your application.
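If you suspect you are the noisy neighbor's victim, measure it instead of guessing. Assuming node_exporter is scraped, this PromQL gives iowait as a percentage per instance:

avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100

Sustained double-digit values on an all-flash host point at a saturated or shared storage path, not at your application.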

The