Observability vs Monitoring: Why Your Green Dashboards Are Lying to You
It was 03:14 on a Tuesday morning. My phone buzzed with a PagerDuty alert: 502 Bad Gateway on a major Norwegian e-commerce platform we manage. I opened Grafana. Everything was green. CPU usage? Nominal. Memory? 60%. Disk I/O? Flat. According to our "monitoring," the server was perfectly healthy. But according to the customers trying to buy hiking gear, the site was dead.
This is the fundamental failure of traditional monitoring. It focuses on known unknowns. You ask: "Is CPU high?" The answer is yes or no. But when a microservice starts failing because of a race condition in a Redis connection pool that only triggers under high concurrency, your CPU monitor won't save you.
That night, we didn't need monitoring. We needed observability. We needed to ask the system: "Why is the checkout service hanging for exactly 30 seconds before timing out?"
The Paradigm Shift: From Dashboarding to Debugging
In the DevOps community here in Europe, we often conflate these terms. Let's draw a line in the sand.
- Monitoring is about the health of the system. It gives you an overview. "The server is up."
- Observability is about the internal state of the system inferred from its outputs (logs, metrics, traces). It gives you context. "The server is up, but the payment gateway API is returning 403s, causing thread starvation in the application layer."
To achieve true observability in 2023, we rely on the three pillars: Metrics, Logs, and Traces. And increasingly, we are unifying these with OpenTelemetry (OTel).
The Cost of Visibility
Before we look at the config, a warning: Observability is expensive. Not just in setup time, but in compute resources. Ingesting, processing, and storing millions of spans and log lines requires serious I/O throughput.
Pro Tip: Do not attempt to run a production-grade TIG (Telegraf, InfluxDB, Grafana) or LGTM (Loki, Grafana, Tempo, Mimir) stack on shared hosting with "burstable" CPU. The moment you need to query your logs during an incident is exactly the moment you need max performance. We deploy our observability clusters on CoolVDS NVMe instances in Oslo because the consistent disk I/O prevents the monitoring stack itself from falling over when ingestion rates spike.
Step 1: The Metrics (Prometheus)
Metrics are cheap and fast. They tell you what is happening. In 2023, Prometheus remains the king. However, raw CPU metrics are boring. You should be instrumenting your application code.
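What does application-level instrumentation look like in practice? Below is a minimal sketch using the official prometheus_client library; the metric names, the status label, and port 8000 are illustrative placeholders, not part of our production setup.

# A minimal sketch of application-level instrumentation with prometheus_client.
# Metric names, labels, and the port are illustrative assumptions.
from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# Business-level metrics beat raw CPU graphs: count payments by outcome
# and measure how long each checkout actually takes.
PAYMENTS_TOTAL = Counter(
    "payments_total", "Number of processed payments", ["status"]
)
CHECKOUT_LATENCY = Histogram(
    "checkout_duration_seconds", "Time spent processing a checkout"
)

def process_checkout():
    with CHECKOUT_LATENCY.time():              # records the duration automatically
        time.sleep(random.uniform(0.05, 0.3))  # stand-in for real work
        PAYMENTS_TOTAL.labels(status="success").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        process_checkout()

Put that /metrics endpoint behind Nginx with basic auth and the scrape job below will pick it up.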
Here is a standard prometheus.yml scrape config. Notice how we are scraping an endpoint protected by basic auth—common for internal services running behind Nginx.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'payment_service'
    scheme: https
    tls_config:
      insecure_skip_verify: false
    basic_auth:
      username: 'metrics_user'
      # Prometheus does not expand environment variables in this file:
      # substitute this value at deploy time (e.g. with envsubst) or use password_file.
      password: '${METRICS_PASSWORD}'
    static_configs:
      - targets: ['10.0.0.5:9090', '10.0.0.6:9090']
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):.*'
        target_label: instance
        replacement: '${1}'
Step 2: The Traces (OpenTelemetry)
Tracing is where the magic happens. A distributed trace follows a request as it hops from your load balancer, to your frontend, to your backend, to your database. It connects the dots.
As of March 2023, OpenTelemetry has stabilized enough for production use in Python, Go, and Java. Gone are the days of vendor-locked agents. Here is how you instrument a Python application to export traces to an OTel Collector.
First, install the necessary libraries:
pip install opentelemetry-distro opentelemetry-exporter-otlp
Now, inject the instrumentation. This code snippet initializes the tracer and sends data to a collector running on localhost (sidecar pattern).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Define the resource (service name is critical for filtering in Jaeger/Tempo)
resource = Resource(attributes={
    "service.name": "checkout-service-norway",
    "deployment.environment": "production",
})

trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)

# Send traces to the local OTel collector via gRPC
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment.currency", "NOK")
    span.set_attribute("user.id", "u-12345")
    try:
        # Simulate work
        print("Processing payment...")
    except Exception as e:
        span.record_exception(e)
        span.set_status(trace.Status(trace.StatusCode.ERROR))
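The real payoff comes when the trace context crosses service boundaries. Here is a minimal sketch of propagating the current span to a downstream HTTP call via the W3C traceparent header; the inventory URL is a hypothetical internal endpoint, not part of the setup above.

# Propagate the active trace context so the downstream service's spans
# join the same trace. The URL below is a hypothetical internal API.
import urllib.request

from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def reserve_inventory(sku: str) -> None:
    with tracer.start_as_current_span("reserve_inventory") as span:
        span.set_attribute("inventory.sku", sku)
        headers = {}
        inject(headers)  # adds the W3C 'traceparent' header for the current span
        req = urllib.request.Request(
            "http://inventory.internal/reserve",  # hypothetical downstream service
            data=sku.encode(),
            headers=headers,
        )
        urllib.request.urlopen(req, timeout=5)

If you rely on the opentelemetry-distro auto-instrumentation, common HTTP client libraries get this propagation for free; the manual version above just shows what happens under the hood.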
Step 3: The Logs (Loki)
Logs are the source of truth, but grep is not a strategy. We use Grafana Loki because it indexes only the metadata labels, not the full text of each log line. That makes it far more storage-efficient than Elasticsearch, but it leans heavily on fast storage at query time (another reason to prioritize NVMe).
A typical LogQL query to find errors in the payment service for a specific Norwegian user ID might look like this:
{app="payment-service", env="prod"} |= "error" | json | user_id = "u-8812"
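That | json stage only works if the application actually emits structured JSON. Here is a stdlib-only sketch of a JSON log formatter; the field names (user_id, message, level) match the query above but are otherwise just our own convention, not anything Loki requires.

# Emit one JSON object per log line so Loki's `| json` parser can extract
# fields like user_id at query time. Stdlib only; field names are a convention.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Extra fields such as user_id arrive via logging's `extra=` argument
        if hasattr(record, "user_id"):
            payload["user_id"] = record.user_id
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment declined by issuer", extra={"user_id": "u-8812"})

Promtail (or the OTel Collector) then ships these lines to Loki with the app and env labels attached, and the LogQL query above does the rest.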
The Glue: OpenTelemetry Collector
Instead of sending data directly to backends (Jaeger, Prometheus, Loki), you should send everything to an OpenTelemetry Collector. This binary sits in the middle, processes data, scrubs PII (critical for GDPR compliance in Europe), and exports it to your backend of choice.
Here is a robust `otel-collector-config.yaml` that receives data via OTLP and exports metrics to Prometheus and traces to Jaeger.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: # Bundles data to reduce network calls
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
  attributes/gdpr: # Masking PII for compliance
    actions:
      - key: user.email
        action: hash

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp/jaeger:
    endpoint: "jaeger-collector:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes/gdpr, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
Data Sovereignty and Latency
If you are operating in Norway, you have two extra headaches: latency to end users and data sovereignty (enforced by Datatilsynet, the Norwegian Data Protection Authority).
Sending your observability data to a US-based SaaS cloud can be a violation of Schrems II if that data contains PII (and traces often do, despite your best efforts). Hosting your own observability stack on CoolVDS in our Oslo datacenter solves both problems:
- Compliance: Data never leaves Norwegian jurisdiction.
- Performance: With direct peering to NIX (Norwegian Internet Exchange), your latency for log shipping is negligible (often <2ms).
Conclusion
Observability is not a tool you buy; it's a culture you build. It requires shifting from "Is it up?" to "Is it working?". It requires instrumenting your code, managing your own collectors, and having the infrastructure to support heavy write loads.
Don't wait for the next silent failure to realize your monitoring is blind. Spin up a high-performance instance on CoolVDS, deploy the OTel collector, and start seeing what's actually happening inside your systems.
Ready to take control of your stack? Deploy a CoolVDS NVMe instance in Oslo today and get full root access in under 60 seconds.