Beyond Green Lights: Why Monitoring Fails and Observability Saves Production
It is 3:00 AM on a Tuesday. PagerDuty screams. You open your dashboard. CPU is at 40%. Memory is fine. Disk space is plentiful. All status lights are green. Yet, your biggest e-commerce client in Oslo is calling to say nobody can check out.
This is the failure of Monitoring. You monitored the infrastructure, but you failed to observe the system. In late 2022, if you are still relying solely on Nagios checks or simple Zabbix triggers, you are flying blind.
I’ve spent the last decade debugging distributed systems across Europe. I’ve seen robust setups crumble not because of load, but because of "unknown unknowns": complex interactions between microservices that no static threshold could ever catch. Today, we dissect the shift from Monitoring to Observability (O11y), how to build a compliant stack in Norway without violating Schrems II, and why your underlying platform (specifically the KVM isolation we mandate at CoolVDS) dictates your success.
The Lie of "99.9% Uptime"
Monitoring is for Known Unknowns. You know disk space can run out, so you set a threshold at 90%. You know Nginx can crash, so you check the process state. It answers the question: "Is the system healthy?"
Observability is for Unknown Unknowns. It answers the question: "Why is the system behaving weirdly?" It allows you to ask new questions of your system without shipping new code.
Pro Tip: If you have to SSH into a server to `grep` logs to find out why an error occurred, you do not have observability. You have a log archive. True observability means correlating a spike in latency with a specific database query and a specific Nginx error log in a single UI.
The Three Pillars in 2022: Implementation Guide
We are currently seeing a massive consolidation around the LGTM stack (Loki, Grafana, Tempo, Mimir) and OpenTelemetry. Here is how to implement this on a standard Linux node (Ubuntu 22.04 LTS).
1. Metrics (The Context)
Prometheus remains the king here, but raw node_exporter data isn't enough on its own: you need to catch resource saturation before it hits a hard limit. On a CoolVDS NVMe instance, we often see customers ignore I/O wait because the disks are so fast, yet a single bad query can still saturate the bus.
Configuration: prometheus.yml optimization
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds-node'
    static_configs:
      - targets: ['localhost:9100']
    # Vital: drop Go runtime metrics you will never query; they only bloat your TSDB
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
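Once those metrics are flowing, turn saturation into an alert rather than something you notice after the fact. Below is a minimal alerting-rule sketch, assuming the `coolvds-node` job above and node_exporter's standard `node_disk_io_time_seconds_total` counter; the file path, threshold, and duration are placeholders to tune for your workload (reference the file from `rule_files:` in prometheus.yml).
# /etc/prometheus/rules/saturation.yml
groups:
  - name: coolvds-saturation
    rules:
      - alert: DiskSaturated
        # Fraction of the last 5 minutes the device spent busy with I/O
        expr: rate(node_disk_io_time_seconds_total{job="coolvds-node"}[5m]) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: disk {{ $labels.device }} has been >90% busy for 10 minutes"
This catches a runaway query that keeps the device pinned, even while the CPU and memory dashboards still look green.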
2. Logs (The Evidence)
Grepping `/var/log/nginx/access.log` is slow. Modern DevOps teams use Loki. Unlike Elasticsearch (ELK), Loki doesn't index the full text of the logs, only the metadata labels, which makes it incredibly cheap to run on your own VPS Norway infrastructure.
To make logs useful, stop using standard Nginx formats. Use JSON. It allows tools like Loki or jq to parse fields instantly.
Snippet: /etc/nginx/nginx.conf
http {
    log_format json_analytics escape=json
      '{'
      '"msec": "$msec", ' # Time of log write, seconds with millisecond resolution
      '"connection": "$connection", '
      '"connection_requests": "$connection_requests", '
      '"pid": "$pid", '
      '"request_id": "$request_id", ' # Critical for tracing correlation
      '"request_length": "$request_length", '
      '"remote_addr": "$remote_addr", '
      '"remote_user": "$remote_user", '
      '"remote_port": "$remote_port", '
      '"time_local": "$time_local", '
      '"time_iso8601": "$time_iso8601", '
      '"request": "$request", '
      '"request_uri": "$request_uri", '
      '"args": "$args", '
      '"status": "$status", '
      '"body_bytes_sent": "$body_bytes_sent", '
      '"bytes_sent": "$bytes_sent", '
      '"http_referer": "$http_referer", '
      '"http_user_agent": "$http_user_agent", '
      '"http_x_forwarded_for": "$http_x_forwarded_for", '
      '"http_host": "$http_host", '
      '"server_name": "$server_name", '
      '"request_time": "$request_time", '
      '"upstream": "$upstream_addr", '
      '"upstream_connect_time": "$upstream_connect_time", '
      '"upstream_header_time": "$upstream_header_time", '
      '"upstream_response_time": "$upstream_response_time", '
      '"upstream_response_length": "$upstream_response_length", '
      '"upstream_cache_status": "$upstream_cache_status", '
      '"ssl_protocol": "$ssl_protocol", '
      '"ssl_cipher": "$ssl_cipher", '
      '"scheme": "$scheme", '
      '"request_method": "$request_method"'
      '}';

    access_log /var/log/nginx/json_access.log json_analytics;
}
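Loki still needs an agent to ship those JSON lines. Here is a minimal Promtail sketch, assuming Loki is reachable on localhost:3100 and the log path from the snippet above; the ports, positions file, and label values are placeholders.
# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://localhost:3100/loki/api/v1/push
scrape_configs:
  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          host: coolvds-web01
          __path__: /var/log/nginx/json_access.log
Note that only `job` and `host` become index labels; the JSON fields stay inside the log line, which is exactly what keeps Loki cheap. In Grafana Explore, a query like `{job="nginx"} | json | status >= 500` parses them on the fly.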
3. Tracing (The Causality)
This is the heavy lifter. Tracing follows a request from the load balancer, through the PHP-FPM worker, into the MySQL database, and back. In 2022, OpenTelemetry (OTel) is the standard SDK. It unifies metrics, logs, and traces.
Running a collector agent on your VPS costs CPU. This is where the "noisy neighbor" problem on cheap shared hosting kills you: if a neighbor spikes, your observability agent gets starved and times out, leaving gaps in your data during the exact incident you need it for.
OpenTelemetry Collector Config (otel-collector-config.yaml):
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

exporters:
  logging:
    loglevel: debug
  otlp:
    endpoint: "tempo-backend:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
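The quickest way to stand this up on a single CoolVDS node is a small compose file. A sketch, assuming you run Tempo alongside the collector and keep its config in ./tempo.yaml; the image tags are examples from late 2022, so pin the versions you have actually tested.
# docker-compose.yml
version: "3.8"
services:
  otel-collector:
    image: otel/opentelemetry-collector:0.62.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
    ports:
      - "4317:4317"   # OTLP gRPC from your instrumented app
      - "4318:4318"   # OTLP HTTP
    depends_on:
      - tempo-backend
  tempo-backend:
    image: grafana/tempo:1.5.0
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml:ro
The service name `tempo-backend` matches the exporter endpoint in the collector config above. Your application then only needs the standard `OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317` environment variable to start emitting spans.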