Observability is Not Just "Monitoring on Steroids"
It was 3:00 AM on a Tuesday. My pager went off. The alert was simple: High Latency: API Gateway. I opened Grafana. Everything was green. CPU usage on the nodes? 15%. Memory? Stable. Disk I/O? Flat. Yet, customers in Stavanger were timing out trying to pay for their orders.
This is the classic failure of Monitoring. I knew that something was wrong, but I had absolutely no idea why. Monitoring tells you the state of known unknowns. It answers questions you predicted you'd need to ask, like "Is the disk full?"
Observability is different. It lets you ask questions about your system that you never anticipated: the unknown unknowns. It connects the dots between a spike in SQL query latency, a garbage-collection pause in a microservice, and network jitter at the ISP level.
In 2025, if you are still relying solely on static thresholds, you aren't managing infrastructure; you're just waiting for it to break.
The Data Hierarchy: Logs, Metrics, and Traces
To move from "green dashboard syndrome" to actual understanding, you need to implement the three pillars properly. And no, dumping logs into a text file doesn't count.
1. Metrics (The "What")
Metrics are cheap, aggregatable numbers. They are great for spotting trends but terrible for root cause analysis. On a standard VPS, you expose them with node_exporter and scrape them with Prometheus.
# Standard Prometheus config for a Linux node
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
But here is the catch often ignored by budget hosting providers: Steal Time. If your VPS is on a crowded node (common with cheap providers), your metrics might show 50% CPU usage, but your application is stalling because the hypervisor is throttling you.
Pro Tip: Always alert on the rate of node_cpu_seconds_total{mode="steal"}. If the resulting graph looks like the Oslo skyline, migrate your workload. At CoolVDS, we use KVM with strict resource guarantees, so 50% CPU actually means you have 50% left to use. A minimal alerting rule is sketched below.
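As a concrete sketch (the rule group name, the 10% threshold, and the time windows are assumptions you should tune to your workload), a Prometheus alerting rule for steal time could look like this:
groups:
  - name: vps-health
    rules:
      - alert: HighCPUSteal
        # Fires when the hypervisor steals more than 10% of a core for 10 minutes
        expr: rate(node_cpu_seconds_total{mode="steal"}[5m]) > 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 10% CPU steal on {{ $labels.instance }}: noisy neighbour or oversold host"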
2. Logs (The "Context")
Logs provide the narrative. But in 2025, with microservices sprawling across clusters, raw logs are noise. Structured logging (JSON) is mandatory. If you are still picking production logs apart with regexes, you are wasting cycles.
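Here is a minimal sketch of structured logging using only the Python standard library; the logger name and the field names (order_id, currency) are illustrative assumptions, not a prescribed schema:
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via the `extra` argument
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One event, one JSON object: trivially queryable in Loki, no regex required
logger.info("payment accepted", extra={"fields": {"order_id": "A-1042", "currency": "NOK"}})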
3. Traces (The "Why")
This is where the magic happens. Tracing follows a request from the Load Balancer, through the auth service, into the database, and back. It visualizes the bottleneck.
Implementing OpenTelemetry (OTel) is the standard now. Here is a basic Collector configuration to receive traces and export them to a backend like Jaeger or Tempo:
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  otlp:
    endpoint: "tempo-backend:4317"
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
The Infrastructure Trap: Why Observability Fails on Bad Hosting
You can have the most sophisticated Grafana dashboard in Europe, but it's useless if the underlying infrastructure lies to you. I've seen "DevOps Engineers" spend weeks debugging application code for latency issues, only to realize their hosting provider had oversold the SSD throughput.
Observability relies on the premise that the hardware behaves predictably. When you deploy on CoolVDS, you are getting NVMe storage that actually hits the advertised IOPS. We don't play games with "burstable" performance that disappears when you need it most. When you trace a request and see a 50ms disk write, you know it's the disk write, not a neighbor mining crypto on the same physical core.
Data Sovereignty and The "Schrems II" Headache
Here is a specific pain point for my Norwegian and European colleagues. If you send your traces and logs to a US-based SaaS observability platform (Datadog, New Relic, etc.), you are likely exporting PII (IP addresses, user IDs embedded in logs). The Norwegian Data Protection Authority (Datatilsynet) does not look kindly on this.
The pragmatic solution? Self-hosted Observability.
Spin up a dedicated instance on CoolVDS in Oslo. Deploy the LGTM stack (Loki, Grafana, Tempo, Mimir). Keep your data within Norwegian borders. Not only is this compliant with a strict interpretation of the GDPR, but the latency between your app servers and your monitoring stack is negligible thanks to local peering.
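For orientation, here is a minimal Docker Compose sketch of that stack. The image tags, exposed ports, and the mounted tempo.yaml and mimir.yaml config files (contents not shown) are assumptions you will need to adapt; treat it as a starting point, not a hardened deployment:
services:
  loki:
    image: grafana/loki:2.9.4          # log aggregation; the image ships a usable default config
    ports:
      - "3100:3100"
  tempo:
    image: grafana/tempo:2.4.1         # trace backend; the Collector's tempo-backend:4317 endpoint points here
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
    ports:
      - "4317:4317"
  mimir:
    image: grafana/mimir:2.11.0        # long-term metrics storage
    command: ["-config.file=/etc/mimir/mimir.yaml"]
    volumes:
      - ./mimir.yaml:/etc/mimir/mimir.yaml
  grafana:
    image: grafana/grafana:10.4.0      # the single pane of glass on top
    ports:
      - "3000:3000"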
Example: Instrumenting a Python App for OTel
Don't rely on auto-instrumentation magic agents alone. Define your spans manually for critical business logic.
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment.currency", "NOK")
    span.set_attribute("user.region", "Vestland")
    try:
        # Critical payment logic here
        process_transaction()
    except Exception as e:
        # Attach the stack trace to the span, flag it as failed, then let the error propagate
        span.record_exception(e)
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        raise
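For those spans to actually leave the process, the SDK needs a tracer provider wired to an OTLP exporter pointing at your Collector. A minimal sketch, assuming the Collector from earlier is reachable on localhost:4317 and using payments-api as an example service name:
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify this service in Tempo/Grafana; the name is illustrative
resource = Resource.create({"service.name": "payments-api"})

provider = TracerProvider(resource=resource)
# Batch spans in memory and ship them to the local OTel Collector over gRPC
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
Run this once at process start-up; after that, the start_as_current_span block above ships its spans through the Collector to your Oslo-hosted backend.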
Moving Forward
Stop treating monitoring as a checkbox. It is an engineering discipline. Start by instrumenting your most critical API endpoint with OpenTelemetry today. If you need a sandbox that won't throttle you while you compile the collector, spin up a CoolVDS instance.
We provide the raw, unadulterated compute power; you bring the code. Don't let slow I/O or noisy neighbors kill your metrics. Deploy a high-performance environment now and actually see what your code is doing.