Stop Monitoring, Start Observing: Why Your Green Dashboards Are Lying to You

It’s 3:00 AM on a Tuesday. PagerDuty just slapped you awake. Your Zabbix dashboard is a sea of comforting green. CPU load is nominal. RAM usage is at 40%. Disk space is fine. Yet, Twitter is blowing up because nobody in Oslo can process a payment on your platform.

Welcome to the limits of traditional monitoring. In the monolithic days, "is the process running?" was a sufficient question. But in 2019, with the explosion of microservices, Docker containers, and dynamic orchestration, knowing that a host looks healthy is useless if you can't see what the system is actually doing internally.

This is the shift from Monitoring to Observability. Monitoring is for known unknowns. Observability is for unknown unknowns. Let's break down how to architect a stack that actually helps you debug production, rather than just alerting you that the house is on fire.

The Three Pillars: Metrics, Logs, and Tracing

If you are still grepping plain-text logs on a production server, stop. You are wasting valuable time. A true observability stack in 2019 relies on three distinct data types.

1. Metrics (The "What")

Metrics are aggregatable counters and gauges. They are cheap to store and fast to query. We use Prometheus for this. It pulls (scrapes) data rather than waiting for pushes, which means a fleet of struggling services can't flood your monitoring backend at exactly the moment you need it most.

Here is a standard scrape config for a Go application exposing metrics:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'payment_service'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.0.1.5:9090']
    metrics_path: /metrics
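
If the service you want to scrape is Python rather than Go, the official prometheus_client library exposes a /metrics endpoint in a handful of lines. A minimal sketch; the metric names, the simulated payment work, and the choice of port 9090 (matching the target above) are illustrative:

from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# Illustrative metrics for a hypothetical payment service
PAYMENTS_TOTAL = Counter('payments_processed_total', 'Total payments processed')
PAYMENT_LATENCY = Histogram('payment_duration_seconds', 'Time spent processing a payment')

def process_payment():
    # Record how long the (simulated) payment takes
    with PAYMENT_LATENCY.time():
        time.sleep(random.uniform(0.01, 0.2))
    PAYMENTS_TOTAL.inc()

if __name__ == '__main__':
    # Serve /metrics on :9090 so the scrape config above can reach it
    start_http_server(9090)
    while True:
        process_payment()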

2. Structured Logs (The "Context")

Text logs are garbage for analysis. You need JSON. When you parse Nginx access logs, you want to filter by latency, upstream response time, or specific user agents instantly. If you are running the ELK stack (Elasticsearch, Logstash, Kibana) or the lighter EFK (using Fluentd), your logs need structure.

Change your nginx.conf to output JSON immediately. This saves your Logstash indexer from burning CPU cycles trying to use grok patterns to parse raw text.

http {
    log_format json_analytics escape=json '{'
        '"msec": "$msec", '
        '"connection": "$connection", '
        '"connection_requests": "$connection_requests", '
        '"pid": "$pid", '
        '"request_id": "$request_id", '
        '"request_length": "$request_length", '
        '"remote_addr": "$remote_addr", '
        '"remote_user": "$remote_user", '
        '"remote_port": "$remote_port", '
        '"time_local": "$time_local", '
        '"time_iso8601": "$time_iso8601", '
        '"request": "$request", '
        '"request_uri": "$request_uri", '
        '"args": "$args", '
        '"status": "$status", '
        '"body_bytes_sent": "$body_bytes_sent", '
        '"bytes_sent": "$bytes_sent", '
        '"http_referer": "$http_referer", '
        '"http_user_agent": "$http_user_agent", '
        '"http_x_forwarded_for": "$http_x_forwarded_for", '
        '"http_host": "$http_host", '
        '"server_name": "$server_name", '
        '"request_time": "$request_time", '
        '"upstream": "$upstream_addr", '
        '"upstream_connect_time": "$upstream_connect_time", '
        '"upstream_header_time": "$upstream_header_time", '
        '"upstream_response_time": "$upstream_response_time", '
        '"upstream_response_length": "$upstream_response_length", '
        '"upstream_cache_status": "$upstream_cache_status", '
        '"ssl_protocol": "$ssl_protocol", '
        '"ssl_cipher": "$ssl_cipher", '
        '"scheme": "$scheme", '
        '"request_method": "$request_method", '
        '"server_protocol": "$server_protocol", '
        '"pipe": "$pipe", '
        '"gzip_ratio": "$gzip_ratio", '
        '"http_cf_ray": "$http_cf_ray"'
    '}';

    access_log /var/log/nginx/access_json.log json_analytics;
}
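
The same rule applies to your application logs. You don't need a heavyweight framework for this; here is a minimal sketch using only the Python standard library, where the field names and the request_id example are my own choices rather than any fixed schema:

import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for the ELK/EFK pipeline."""
    def format(self, record):
        payload = {
            'ts': time.strftime('%Y-%m-%dT%H:%M:%S%z', time.localtime(record.created)),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
        }
        # Carry structured context passed via logging's `extra` argument
        if hasattr(record, 'request_id'):
            payload['request_id'] = record.request_id
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger('payment_service')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('payment accepted', extra={'request_id': 'abc-123'})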

3. Distributed Tracing (The "Where")

This is where most setups fail. If Service A calls Service B, and Service B calls the Database, and the request is slow, metrics won't tell you which hop caused the lag. Logging might show errors, but connecting them is a nightmare.

Enter Jaeger (or Zipkin). By implementing the OpenTracing standard, you propagate a trace context in the headers of every request, so you can visualize the entire waterfall of a single request across services.

Here is how you initialize a tracer in a Python Flask app using the jaeger-client and Flask-OpenTracing libraries:

import logging
from jaeger_client import Config
from flask_opentracing import FlaskTracing

def init_tracer(service):
    # Reset the root logger so the tracer's own log lines stay readable
    logging.getLogger('').handlers = []
    logging.basicConfig(format='%(message)s', level=logging.DEBUG)

    config = Config(
        config={
            'sampler': {
                'type': 'const',  # sample every request; tune this down in production
                'param': 1,
            },
            'logging': True,
            'local_agent': {
                # UDP endpoint of the Jaeger agent
                'reporting_host': '127.0.0.1',
                'reporting_port': 6831,
            }
        },
        service_name=service,
    )
    return config.initialize_tracer()

# Inside your Flask app setup
from flask import Flask

app = Flask(__name__)
jaeger_tracer = init_tracer('payment-service')
tracing = FlaskTracing(jaeger_tracer, True, app)
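
With that in place, Flask-OpenTracing traces incoming requests automatically; for outgoing calls you create a span yourself and inject its context into the HTTP headers so the next hop lands in the same waterfall. A hedged sketch, where the /pay route and the downstream ledger URL are hypothetical, and parenting the span to the incoming request's span is omitted for brevity:

import requests
from opentracing.propagation import Format

@app.route('/pay')
def pay():
    # Span covering the downstream call
    with jaeger_tracer.start_span('call-ledger-service') as span:
        headers = {}
        # Inject the trace context into the outgoing request headers
        jaeger_tracer.inject(span.context, Format.HTTP_HEADERS, headers)
        resp = requests.get('http://ledger.internal/balance', headers=headers)  # hypothetical URL
        span.set_tag('http.status_code', resp.status_code)
        # In a real app, also parent this span to the active request span
        return str(resp.status_code)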

The Infrastructure Cost of Observability

There is a catch. Storing millions of metric points and indexing gigabytes of JSON logs every hour is heavy. It kills I/O. If you try to run an ELK stack on a standard HDD-based VPS, your Elasticsearch queue will fill up, your Kibana dashboards will time out, and you will lose data exactly when you need it most—during an incident.

Pro Tip: Never host your monitoring stack on the same physical hardware as your production app if you can avoid it. If you must, ensure you have I/O isolation.

This is where infrastructure choice becomes an architectural decision, not just a billing one. We specifically build CoolVDS instances with local NVMe storage because database applications like Elasticsearch are I/O bound. A typical SATA SSD provides around 500 MB/s read/write. Our NVMe drives push 3000+ MB/s. When you are aggregating logs from the last 24 hours to find a security breach pattern, that speed difference is the gap between finding the root cause in 5 minutes vs. 5 hours.

Data Sovereignty in 2019

We are seeing tighter scrutiny from Datatilsynet regarding where Norwegian user data lives. While the Privacy Shield framework currently allows data transfer to the US, the legal ground is shaky and many Nordic CTOs are opting for "Data in Norway" policies to be safe.

Observability data is user data. Your Nginx logs contain IP addresses. Your tracing payloads might accidentally contain email addresses or user IDs. Storing this data on a US-controlled cloud adds a compliance layer you don't need.

Putting It All Together: A Docker Compose Example

Ready to test this locally or on a staging server? Here is a docker-compose.yml snippet to spin up a basic Grafana + Prometheus stack. This works perfectly on a 2GB CoolVDS instance.

version: '3'

services:
  prometheus:
    image: prom/prometheus:v2.10.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    ports:
      - 9090:9090

  grafana:
    image: grafana/grafana:6.2.4
    depends_on:
      - prometheus
    ports:
      - 3000:3000
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=secret
      - GF_USERS_ALLOW_SIGN_UP=false

volumes:
  prometheus_data: {}
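
Once the containers are up, confirm that Prometheus is actually scraping your targets before you trust any dashboard. A quick sanity check against Prometheus's HTTP API, assuming the port mapping from the compose file above; the built-in "up" metric reports 1 for a successful scrape and 0 for a failed one:

import json
import urllib.request

# Ask Prometheus which targets are up
url = 'http://localhost:9090/api/v1/query?query=up'
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for result in data['data']['result']:
    labels = result['metric']
    value = result['value'][1]
    print(labels.get('job'), labels.get('instance'), '->', value)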

Conclusion

Monitoring answers: "Is the system healthy?" Observability answers: "Is the system doing what the user expects?"

The transition requires a change in culture and a change in hardware. You cannot observe what you cannot catch. Don't let slow I/O be the reason you can't query your logs during an outage.

If you are building a stack for the Nordic market and need the low latency of a local datacenter combined with the raw throughput required for heavy logging workloads, give our NVMe instances a spin.

Stop guessing. Start measuring. Deploy your observability stack on CoolVDS today.