
Observability vs. Monitoring: Why Your Green Dashboard Is Lying to You

It is 03:00 CET. Your PagerDuty is silent. The Grafana dashboard on the wall is a sea of comforting green. The load balancer in Oslo reports a healthy status with 200 OK responses.

Yet angry users from Bergen to Trondheim are flooding your support queue, claiming the checkout page is hanging. This is the nightmare scenario where monitoring fails you. You know something is wrong, but your tools are only designed to tell you whether the system is dead or alive, not why it is sick.

In the Norwegian hosting market, where latency to the NIX (Norwegian Internet Exchange) is measured in single-digit milliseconds, standard uptime checks are no longer enough. We need to move from checking health to understanding behavior. This is the shift from Monitoring to Observability.

The Fundamental Difference

Let’s cut through the marketing noise. Monitoring is about the known knowns. You define a threshold, and if the metric crosses it, you get an alert. It is reactive.
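In its simplest form, that reactive model looks like the sketch below (hypothetical URL and threshold; in practice an alerting stack does this for you, but the logic is identical):

import time
import urllib.request

URL = "https://shop.example.no/health"  # hypothetical health endpoint
THRESHOLD_SECONDS = 2.0                 # the "known known": alert beyond this point

while True:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            healthy = resp.status == 200
    except Exception:
        healthy = False
    elapsed = time.monotonic() - start

    if not healthy or elapsed > THRESHOLD_SECONDS:
        print(f"ALERT: health check failed or took {elapsed:.2f}s")  # page someone

    time.sleep(15)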

The Battle-Hardened Rule: If you have to SSH into a server to read a log file to understand an error, you do not have observability. You just have logs.

Observability (o11y) is about the unknown unknowns. It is a property of a system that allows you to understand its internal state based purely on its external outputs (Logs, Metrics, and Traces). It allows you to ask arbitrary questions like: "Why do API requests involving the 'inventory-check' service spike in latency only when the user has more than 50 items in the cart?"

1. The Old Way: Monitoring (The "Is it on?" Check)

Traditional monitoring relies heavily on polling. You install an agent, it scrapes data, and stores it in a time-series database (TSDB).

Here is a classic Prometheus configuration snippet found in thousands of deployments:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 15s

This tells you your CPU usage every 15 seconds. But what happens if a micro-burst of traffic hits at second 2, spikes the CPU to 100%, causes packet loss, and resolves by second 10? Your monitoring sees an average. It sees "Green." Meanwhile, packets were dropped.
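To make that concrete, here is a toy illustration with made-up per-second CPU samples for a single 15-second scrape window:

# Hypothetical per-second CPU utilisation (%) during one 15-second window
samples = [12, 100, 100, 98, 14, 10, 11, 9, 10, 12, 11, 10, 9, 11, 10]

print(f"mean: {sum(samples) / len(samples):.0f}%   peak: {max(samples)}%")
# The mean (~28%) renders as a calm green line on the dashboard;
# the three seconds near 100% that dropped packets never show up.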

2. The New Way: Observability (The "Why is it slow?" Check)

Observability relies on instrumentation at the code level. In 2023, the industry standard is OpenTelemetry (OTel). Instead of just polling CPU, we inject context into every request.

Here is how we instrument a Python application to trace a specific operation. This doesn't just say "error occurred"; it captures the exact user context and database query duration.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

# We manually start a span to track this specific logic block.
# user_id, cart_items and db come from your application's request context.
with tracer.start_as_current_span("process_checkout") as span:
    span.set_attribute("user.id", user_id)
    span.set_attribute("cart.item_count", len(cart_items))

    try:
        # The database call we want to time and correlate with the attributes above
        result = db.query("SELECT * FROM inventory...")
    except Exception as e:
        span.record_exception(e)
        span.set_status(Status(StatusCode.ERROR, str(e)))
        raise

When this data hits your backend (Jaeger, Tempo, or Honeycomb), you can filter by cart.item_count. Suddenly, the correlation becomes visible. That is observability.
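One caveat: the span above only leaves the process if the SDK is wired to an exporter. Here is a minimal sketch of that plumbing, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed and an OTLP collector is listening on localhost:4317 (the default gRPC port, matching the collector configuration later in this article):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Name the service so you can filter traces per service in Jaeger/Tempo
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))

# Batch spans in memory and ship them over OTLP/gRPC to the local collector
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)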

Infrastructure: The Hidden Variable

Here is the uncomfortable truth that SaaS providers rarely mention: Observability is heavy.

Traces, structured logs, and high-cardinality metrics generate a massive amount of I/O. If you are running an ELK stack (Elasticsearch, Logstash, Kibana) or a high-volume Loki instance on cheap, spinning-rust storage, your observability platform will crash before your application does.

I have seen DevOps teams try to save money by hosting their monitoring stack on shared hosting or budget VPS providers with "SSD caching." It fails. Always.

The I/O Bottleneck

Elasticsearch is notoriously hungry for IOPS. If you are logging every HTTP request in a high-traffic Nginx environment, you need sustained write speeds.
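Before you point Elasticsearch or Loki at a volume, it is worth measuring what that volume can actually sustain. The rough probe below (a sketch; the file name is arbitrary, and a tool like fio gives far more detail) mimics the write-and-fsync pattern of a log buffer flush:

import os
import time

CHUNK = 1024 * 1024      # 1 MiB per write, roughly one buffered log flush
TOTAL = 256 * CHUNK      # 256 MiB in total
path = "io_probe.tmp"    # put this file on the volume that will hold your logs

start = time.monotonic()
with open(path, "wb") as f:
    for _ in range(TOTAL // CHUNK):
        f.write(os.urandom(CHUNK))
        f.flush()
        os.fsync(f.fileno())  # force the write to disk, like a durable log write
elapsed = time.monotonic() - start
os.remove(path)

print(f"Sustained fsync write throughput: {TOTAL / CHUNK / elapsed:.0f} MiB/s")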

Configuration for High-Volume Nginx JSON Logging:

http {
    log_format json_analytics escape=json
        '{'
        '"time_local": "$time_local",'
        '"remote_addr": "$remote_addr",'
        '"request_uri": "$request_uri",'
        '"status": "$status",'
        '"request_time": "$request_time",'
        '"upstream_response_time": "$upstream_response_time",'
        '"user_agent": "$http_user_agent"'
        '}';

    access_log /var/log/nginx/access_json.log json_analytics buffer=32k flush=5s;
}

Note the buffer=32k flush=5s. We use this to reduce I/O pressure. However, when that buffer flushes, it hits the disk hard. On a CoolVDS instance, we use pure NVMe storage with KVM isolation. This means when your log aggregator flushes to disk, you get the raw throughput of the drive, not a slice of time shared with 50 other noisy neighbors.

The "Noisy Neighbor" Effect on Metrics

This is critical for accuracy. If you are monitoring latency-sensitive applications, say a financial trading bot in Oslo, you need to trust your baseline.

In containerized environments (like OpenVZ or LXC), the host kernel schedules CPU for all guests. If another container on the host gets DDoSed, the kernel's context-switching overhead increases across the whole box. Your application might report high "steal time" (st), which at least gives you something to alert on.

But sometimes, it just manifests as micro-latency. Your observability tools show a database query took 50ms instead of 2ms, leading you to debug your SQL. But the SQL is fine. The physical CPU was just busy serving a neighbor.
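You can watch for this from inside the guest. Here is a minimal sketch using the psutil library (an assumption; any /proc/stat parser works just as well) that samples steal time over a five-second window:

import time
import psutil  # third-party: pip install psutil

# On Linux guests, "steal" is CPU time the hypervisor spent serving other
# tenants while this VM had runnable work. A rising value points at the host,
# not at your SQL.
before = psutil.cpu_times()
time.sleep(5)
after = psutil.cpu_times()

total = sum(after) - sum(before)
steal_pct = 100 * (after.steal - before.steal) / total if total else 0.0
print(f"CPU steal over the last 5 seconds: {steal_pct:.2f}%")

On a host that is not oversold, this number should sit at or near zero even under load.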

CoolVDS Architecture Note: We strictly use KVM (Kernel-based Virtual Machine), which provides hardware-assisted virtualization. Your RAM is yours. Your CPU cycles are allocated strictly. If your dashboard says the database is slow, it is actually the database, not our infrastructure.

Data Sovereignty and GDPR

Observability data often contains PII (Personally Identifiable Information). IP addresses, User IDs, and sometimes (if developers are careless) email addresses in query strings.

Under GDPR and the rulings of Datatilsynet (the Norwegian Data Protection Authority), you must know exactly where this data lives. Sending your traces to a SaaS platform hosted in the US, even when the transfer is wrapped in standard contractual clauses, is a risk many Norwegian CTOs are no longer willing to take post-Schrems II.

Hosting your own OpenTelemetry Collector and backend (Grafana/Loki/Tempo) on a server physically located in Oslo keeps your user data inside Norwegian jurisdiction and takes the transfer question off the table entirely.

OpenTelemetry Collector Configuration (Local Storage)

Here is how you configure the OTel Collector to batch telemetry before forwarding it to your local backend:

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  otlp:
    # OTLP gRPC endpoint of your local Jaeger/Tempo backend (adjust host/port).
    # Do not point this at the collector's own 4317 receiver, or the collector
    # will export traces back to itself.
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

Conclusion: Stop Guessing

Green dashboards are comforting, but they are often a mirage. To truly own your uptime, you need to implement observability. You need to trace requests across services and visualize the bottlenecks.

But remember: Observability requires resources. It requires fast disk I/O for logs, strict CPU isolation for accurate metrics, and data sovereignty for compliance.

Don't let slow I/O kill your insights. If you are ready to build a monitoring stack that actually works, deploy a KVM-based, NVMe-powered instance on CoolVDS today. You will see the difference in your first graph.