Observability vs. Monitoring: Why Your Green Dashboard is Lying to You

It’s 3:00 AM. Your pager goes off. You groggily open your laptop, expecting to see red alerts everywhere. But you don't. Your dashboard is a sea of calming green. CPU is at 40%, memory is stable, and disk space is plentiful. Yet, Twitter is blowing up because your checkout page in Oslo is timing out.

This is the failure of traditional monitoring in 2019. We have moved from monoliths to microservices, from bare metal to containers, and our "is it up?" checks are no longer sufficient. Welcome to the era of Observability.

The Semantics of Failure

Let's cut through the marketing fluff. There is a distinct technical difference between these two concepts, and understanding it determines whether you sleep through the night or spend it grepping through /var/log/syslog.

  • Monitoring tells you the system is broken. It tracks known unknowns. "Is the disk full?" "Is the API returning 500s?"
  • Observability tells you why the system is broken. It allows you to ask arbitrary questions about unknown unknowns. “Why is latency spiking only for iOS users in Bergen on the /cart endpoint?” (We will come back to that exact question in the sketch right after this list.)

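That last question is the acid test. Here is a rough sketch of what answering it can look like against the Prometheus HTTP API. Everything specific in it is an assumption for illustration: the server address, and a latency histogram called http_request_duration_seconds exported with platform, region and endpoint labels.

import requests

PROMETHEUS = "http://10.0.0.5:9090/api/v1/query"  # hypothetical Prometheus server

# "Why is latency spiking only for iOS users in Bergen on /cart?"
# p95 latency over the last 5 minutes for that exact label combination.
query = (
    'histogram_quantile(0.95, sum(rate('
    'http_request_duration_seconds_bucket'
    '{platform="ios",region="bergen",endpoint="/cart"}[5m])) by (le))'
)

response = requests.get(PROMETHEUS, params={"query": query})
response.raise_for_status()

for series in response.json()["data"]["result"]:
    timestamp, value = series["value"]
    print(f"p95 latency on /cart for iOS in Bergen: {float(value):.3f}s")

The specific query matters less than the point it makes: labels let you slice by dimensions nobody thought to alert on in advance.
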
If you are running a monolithic LAMP stack, monitoring is probably fine. If you are orchestrating Kubernetes clusters (and since version 1.15 dropped in June, who isn't?), you need observability.

The Three Pillars in Practice

To achieve observability, we need to correlate three data sources: Metrics, Logs, and Traces.

1. Metrics (The "What")

We are long past the days of MRTG graphs. In 2019, Prometheus is the undisputed king here. It doesn’t just check if a service is up; it scrapes multidimensional time-series data. However, Prometheus is resource-intensive. Running a scraper that handles high-cardinality metrics requires fast I/O.

Here is a standard prometheus.yml scrape config for a node exporter. Note the interval—if you are scraping every minute, you are missing micro-bursts.

global:
  scrape_interval: 15s 

scrape_configs:
  - job_name: 'coolvds_nodes'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    relabel_configs:
      # Strip the port from the scrape address so the instance label is just the host IP
      - source_labels: [__address__]
        regex: '(.*):.*'
        target_label: instance
        replacement: '${1}'

2. Structured Logs (The "Context")

Grepping text files is dead. If you aren't logging in JSON, you can't effectively index in Elasticsearch. When we deploy high-traffic Nginx instances on CoolVDS, we immediately override the default logging format to make it ingestible by Logstash or Fluentd.

Put this in your nginx.conf inside the http block:

log_format json_combined escape=json
  '{ "time_local": "$time_local", '
  '"remote_addr": "$remote_addr", '
  '"remote_user": "$remote_user", '
  '"request": "$request", '
  '"status": "$status", '
  '"body_bytes_sent": "$body_bytes_sent", '
  '"request_time": "$request_time", '
  '"http_referer": "$http_referer", '
  '"http_user_agent": "$http_user_agent" }';

access_log /var/log/nginx/access.json json_combined;

Now, when you visualize this in Kibana, you can filter on request_time > 1.0 to find the slow endpoints instantly (just make sure the field is mapped as a number in Elasticsearch, since Nginx emits it as a quoted string).
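
The same filter works outside of Kibana too. Below is a minimal sketch using the official elasticsearch Python client; the node address, the nginx-* index pattern and the @timestamp field are assumptions about a typical Logstash/Fluentd pipeline, so adjust them to your setup.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://10.0.0.5:9200"])  # hypothetical Elasticsearch node

# All requests slower than one second, newest first
results = es.search(
    index="nginx-*",
    body={
        "size": 20,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {"range": {"request_time": {"gt": 1.0}}},
    },
)

for hit in results["hits"]["hits"]:
    doc = hit["_source"]
    print(doc["request_time"], doc["status"], doc["request"])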

Pro Tip: The ELK stack (Elasticsearch, Logstash, Kibana) is a memory hog. Java heap sizes can easily eat 4GB+ of RAM. Don't try to run a production ELK stack on a budget VPS with shared CPU. You need dedicated resources. This is why we provision CoolVDS instances with dedicated NVMe—Elasticsearch indexing speed is directly tied to disk IOPS.

3. Distributed Tracing (The "Where")

This is the missing link for most organizations in 2019. When a request hits your load balancer, travels to an auth service, hits a database, and calls a payment gateway, where is the latency coming from? Jaeger (compliant with the OpenTracing standard) visualizes this path.

Here is how you initialize a tracer in Python using the jaeger-client library (version 4.0.0):

import logging
from jaeger_client import Config

def init_tracer(service):
    logging.getLogger('').handlers = []
    logging.basicConfig(format='%(message)s', level=logging.DEBUG)

    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
            'local_agent': {'reporting_host': 'jaeger_agent', 'reporting_port': 6831},
        },
        service_name=service,
    )

    # this call also sets opentracing.tracer
    return config.initialize_tracer()
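
Once the tracer is initialized, instrumenting a unit of work is just a context manager. The snippet below is a hypothetical usage sketch; the service name, operation name and tags are invented for illustration.

tracer = init_tracer('checkout-service')

with tracer.start_span('charge-card') as span:
    span.set_tag('order.id', 'ORD-1234')         # tags become searchable in the Jaeger UI
    span.log_kv({'event': 'payment_requested'})  # structured log attached to the span
    # ... call the payment gateway here ...

tracer.close()  # flush buffered spans to the agent before the process exits

One caveat: the const sampler with param set to 1 records every single request. That is fine while you are getting started, but in production you will usually want to switch to a probabilistic sampler before the collector drowns in spans.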

The Norwegian Context: Why Location Matters

You might ask, "Why not just host this monitoring stack on a US cloud provider?"

1. Latency: If your servers are in Oslo and your monitoring stack is in Virginia, you are introducing 80-100ms of latency just to ship the logs. When you are debugging a DDoS attack, that lag is unacceptable. You want your observability stack close to your workloads. With CoolVDS located directly in Norway, latency to NIX (Norwegian Internet Exchange) is often under 2ms.

2. GDPR & Datatilsynet: Logs contain PII (IP addresses, User IDs). Under GDPR (which has been fully enforceable for over a year now), transferring this data outside the EEA requires strict justification. By keeping your logs on Norwegian soil, you simplify compliance significantly.

Infrastructure Requirements for Observability

Observability generates data gravity. Storing weeks of high-fidelity traces and logs consumes massive amounts of storage and I/O.

Component           Bottleneck                CoolVDS Solution
Prometheus TSDB     Disk I/O (Write Heavy)    NVMe SSDs (High IOPS)
Elasticsearch       RAM & Disk Seek           DDR4 ECC RAM & Dedicated Resources
Jaeger Collector    Network Throughput        1Gbps Uplinks

We see it all the time: a developer sets up an ELK stack on a standard HDD VPS. The moment they turn on debug logging, the IO wait shoots to 90%, and the monitoring system itself crashes. Irony at its finest.

Conclusion

Building an observable system isn't just about installing tools; it's about ensuring those tools have the foundation to run. You can't debug a high-performance application with low-performance infrastructure.

Whether you are implementing Prometheus exporters or fine-tuning your Jaeger sampling rates, ensure your underlying host can handle the heat. Don't let slow I/O kill your insights.

Ready to build a monitoring stack that actually works? Deploy a high-performance NVMe instance on CoolVDS in Oslo today.