
Monitoring is Dead: Building True Observability Stacks in a Post-Schrems II World

It was 03:14 on a Tuesday. My phone buzzed. I woke up, grabbed the laptop, and checked the dashboard. All green. CPU load: 15%. RAM usage: 40%. Disk space: plenty. Yet Twitter was exploding with users claiming the checkout API was timing out.

This is the "Green Dashboard Fallacy."

Monitoring told me the server was alive. It failed completely to tell me what it was doing. We eventually found the root cause: a third-party payment gateway API was hanging, causing PHP-FPM workers to stack up, waiting for a timeout that was configured too high. The CPU wasn't working hard; it was just waiting. Traditional monitoring is about staring at a speedometer. Observability is lifting the hood to see why the engine is making that clicking sound.

In late 2020, if you are still relying solely on Nagios checks or simple CPU graphs, you are flying blind. Let's fix that.

The Shift: Known Unknowns vs. Unknown Unknowns

There is a fundamental difference between the two, and it is what separates junior admins from senior architects. Monitoring answers questions you already know to ask: "Is the disk full?" "Is the ping latency high?"

Observability answers questions you didn't know you had: "Why is latency spiking only for iOS users in Bergen between 18:00 and 19:00?"

To achieve this, we need the three pillars: Metrics, Logs, and Traces. And we need to host them on infrastructure that doesn't choke when we ingest 5,000 log lines per second.

1. Structured Logging: Stop Grepping Text Files

If you are still parsing standard Nginx `access.log` files with `awk` or `grep`, stop. You need structured data that can be ingested by Elasticsearch or Loki. Text logs are for humans; JSON logs are for machines.

Here is how we configure Nginx to output JSON. This allows us to index fields like `upstream_response_time`, which is critical for detecting the "lazy external API" problem I mentioned earlier.

http {
    log_format json_analytics escape=json
    '{'
        '"msec": "$msec", ' # Request time in seconds with milliseconds
        '"connection": "$connection", ' # Connection serial number
        '"connection_requests": "$connection_requests", ' # Number of requests made in this connection
        '"pid": "$pid", ' # Process ID
        '"request_id": "$request_id", ' # Unique request ID
        '"request_length": "$request_length", ' # Request length (including headers and body)
        '"remote_addr": "$remote_addr", ' # Client IP
        '"remote_user": "$remote_user", ' # Client HTTP username
        '"remote_port": "$remote_port", ' # Client port
        '"time_local": "$time_local", '
        '"time_iso8601": "$time_iso8601", ' # Local time in the ISO 8601 standard format
        '"request": "$request", ' # Full request line
        '"request_uri": "$request_uri", ' # Full request URI
        '"args": "$args", ' # Args
        '"status": "$status", ' # Response status code
        '"body_bytes_sent": "$body_bytes_sent", ' # Number of body bytes sent to the client
        '"bytes_sent": "$bytes_sent", ' # Number of bytes sent to the client
        '"http_referer": "$http_referer", ' # HTTP referer
        '"http_user_agent": "$http_user_agent", ' # User agent
        '"http_x_forwarded_for": "$http_x_forwarded_for", ' # HTTP X-Forwarded-For
        '"http_host": "$http_host", ' # the request Host: header
        '"server_name": "$server_name", ' # the name of the vhost serving the request
        '"request_time": "$request_time", ' # request processing time in seconds with msec resolution
        '"upstream": "$upstream_addr", ' # upstream backend server for proxied requests
        '"upstream_connect_time": "$upstream_connect_time", ' # upstream handshake time spent
        '"upstream_header_time": "$upstream_header_time", ' # upstream header response time spent
        '"upstream_response_time": "$upstream_response_time", ' # upstream total response time spent
        '"upstream_response_length": "$upstream_response_length", ' # upstream response length
        '"upstream_cache_status": "$upstream_cache_status", ' # upstream cache status
        '"ssl_protocol": "$ssl_protocol", ' # TLS protocol
        '"ssl_cipher": "$ssl_cipher", ' # TLS cipher
        '"scheme": "$scheme", ' # http or https
        '"request_method": "$request_method" ' # request method
    '}';

    access_log /var/log/nginx/analytics.json json_analytics;
}

By piping this into an ELK stack (Elasticsearch, Logstash, Kibana), you can visualize `upstream_response_time`. If your application feels slow but the CPU is idle, this graph will spike, revealing that the bottleneck is external.
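
Getting the file into Elasticsearch is the easy part. Here is a minimal Filebeat sketch (assuming Filebeat 7.x shipping straight to an Elasticsearch node on localhost:9200; swap in Logstash or your own hosts as your stack dictates):

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/analytics.json
    json.keys_under_root: true      # lift the JSON fields to the top level of the event
    json.add_error_key: true        # flag events whose JSON fails to parse

output.elasticsearch:
  hosts: ["localhost:9200"]         # assumption: Elasticsearch runs locally; point this at your cluster

One caveat: dynamic mapping will happily index the timing fields as strings, so add an index template that maps `request_time` and `upstream_response_time` as numbers if you want to graph percentiles on them.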

2. Metrics: The Prometheus Standard

In 2020, Prometheus is the undisputed king of metrics. Unlike push-based systems (like the old Graphite setups), Prometheus pulls data. That is a win for security: your backend servers never need outbound access to some metrics collector; the Prometheus server reaches in and scrapes the `/metrics` endpoint over the internal network.

To get started, you don't just need to monitor the application; you need to monitor the metal. Install the node_exporter on your Linux instances:

# Create a user for node_exporter
useradd --no-create-home --shell /bin/false node_exporter

# Download and install (Version 1.0.1 is stable as of late 2020)
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
tar xvf node_exporter-1.0.1.linux-amd64.tar.gz
cp node_exporter-1.0.1.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Systemd service file
cat <<EOF > /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now node_exporter
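
On the Prometheus side, pointing the server at the new exporter is a short scrape config. A minimal sketch, assuming node_exporter listens on its default port 9100 and that the hostnames below are placeholders you will replace with your own:

# /etc/prometheus/prometheus.yml (sketch: hostnames are placeholders)
global:
  scrape_interval: 15s              # how often Prometheus pulls each /metrics endpoint

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'app01.example.internal:9100'   # replace with your own hosts
          - 'db01.example.internal:9100'

After a reload, the new targets show up under Status -> Targets in the Prometheus UI.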

Pro Tip: Don't just alert on "high CPU." Alert on saturation: the Linux load average divided by the number of vCPUs. If `node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 1.2`, you are queueing tasks, and queueing means latency. (The `without (cpu, mode)` keeps the per-instance labels, so the division matches each host against its own core count.)
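
Wired into Prometheus as an alerting rule, that check looks roughly like this (a sketch: the file path, the 10-minute hold and the 1.2 threshold are assumptions to tune for your workload):

# /etc/prometheus/rules/saturation.yml (sketch: tune 'for' and the threshold)
groups:
  - name: node-saturation
    rules:
      - alert: CPUSaturation
        expr: 'node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 1.2'
        for: 10m                    # require sustained queueing before waking anyone at 03:14
        labels:
          severity: warning
        annotations:
          summary: "Run queue exceeds vCPU count on {{ $labels.instance }}"

Reference the file under `rule_files:` in prometheus.yml and hook up an Alertmanager, or the rule will fire into the void.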

The Elephant in the Server Room: Schrems II and GDPR

Here is why this matters specifically for us in Europe right now. In July 2020, the CJEU (Court of Justice of the European Union) invalidated the Privacy Shield framework in the Schrems II ruling.

What this means for you: Sending your server logs (which often contain IP addresses—considered Personal Data under GDPR) to a US-based SaaS monitoring platform (like Datadog, New Relic, or AWS CloudWatch) is now a legal minefield. Datatilsynet (The Norwegian Data Protection Authority) is taking a hard line on data transfers.

If you export your users' IP addresses to a US cloud for analysis, you are likely non-compliant. The solution? Self-hosted observability.

You need to run your own Prometheus and ELK stack within the EEA (European Economic Area). This keeps the data under your control and within the protection of European law.

The Hardware Reality: Why IOPS Matter

This is where things get heavy. Running an observability stack is resource-intensive. Elasticsearch is notorious for devouring I/O. If you try to run an ELK stack on a budget VPS with magnetic storage or shared spinning rust, your monitoring will crash before your application does.

I recently tried to deploy a Graylog cluster on a standard cloud instance with network-attached storage. The write latency for indexing logs pushed the iowait to 40%. The logging queue filled up, and the application started blocking because it couldn't write to stdout fast enough. A disaster.

Component       | Resource Hunger                  | CoolVDS Solution
Elasticsearch   | High random write I/O            | Local NVMe storage (10x faster than SATA SSD)
Prometheus      | High sequential write, high RAM  | Dedicated RAM allocation (no ballooning)
Kibana/Grafana  | CPU for rendering aggregations   | High-frequency vCPUs

At CoolVDS, we built our infrastructure on NVMe storage by default. We didn't do it just for marketing speeds; we did it because modern workloads like observability stacks demand it. When you self-host your monitoring to comply with GDPR, you need the raw power to ingest thousands of data points per second without latency.

Conclusion

The era of "is it pinging?" is over. In late 2020, you need to know why a request failed and which microservice caused the delay, and you need to store that data legally within Europe.

Don't let a legal ruling or a slow hard drive compromise your visibility. Take control of your data.

Ready to build a compliant, high-performance observability stack? Deploy a CoolVDS NVMe instance in Oslo today and get full root access in under 55 seconds.