Observability vs. Monitoring: Why Green Dashboards Can Still Mean Angry Users
It was 02:14 when the PagerDuty alert hit. The dashboard was a sea of reassuring green. CPU usage? Nominal. Memory? 60%. Disk I/O? Within limits. Yet our biggest client in Oslo was blowing up my phone because their checkout latency had spiked from 200ms to 4 seconds.
That is the fundamental failure of monitoring. It tells you the system is alive. It doesn't tell you if the system is happy.
As we navigate the infrastructure landscape of 2023, the shift from "monitoring" to "observability" isn't just a buzzword bingo game; it's a survival mechanism for distributed systems. If you are running services targeting the Nordic market, where users expect near-instant interactions and strict data compliance, you cannot rely on simple ICMP pings and CPU graphs anymore.
The Difference: Known Unknowns vs. Unknown Unknowns
Let's strip away the marketing fluff.
- Monitoring answers questions you already predicted: "Is the disk full?" "Is the CPU over 90%?"
- Observability allows you to ask questions you didn't know you needed to ask: "Why is latency high only for iOS users in Bergen hitting the /api/v1/search endpoint?"
To achieve the latter, you need three pillars: Metrics, Logs, and Traces. And you need infrastructure that doesn't choke when you start ingesting gigabytes of high-cardinality telemetry data.
Technical Implementation: From Nginx Stub to OpenTelemetry
In the old days (read: 2018), we'd just scrape Nginx. You probably have a config block like this somewhere:
server {
    listen 80;
    server_name localhost;

    location /stub_status {
        stub_status;
        allow 127.0.0.1;
        deny all;
    }
}
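Curl that endpoint and you get a handful of global counters back. The layout below is the standard stub_status output; the numbers are purely illustrative:

Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106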
This gives you "Active connections" and a few global request counters. Useful? Somewhat. But it tells you nothing about request context.
In 2023, we use OpenTelemetry (OTel) to instrument the application code itself. Instead of guessing why the database is slow, we inject trace IDs. Here is a practical example of how we instrument a Go service before deploying it to a CoolVDS instance:
package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer() *trace.TracerProvider {
    // Print spans to stdout; swap this for an OTLP exporter when you ship
    // spans to a real collector.
    exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
    if err != nil {
        log.Fatal(err)
    }
    return trace.NewTracerProvider(
        trace.WithBatcher(exporter),
    )
}

func main() {
    tp := initTracer()
    defer func() { _ = tp.Shutdown(context.Background()) }()
    otel.SetTracerProvider(tp)

    tracer := otel.Tracer("coolvds-checkout-service")
    ctx, span := tracer.Start(context.Background(), "process_payment")
    defer span.End()
    _ = ctx // pass ctx into downstream calls so their child spans nest under this one
    // Your business logic here
}
When this runs, every operation carries a "span". If the database query hangs, the span duration makes it obvious immediately.
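To see where the time actually goes inside a request, wrap the suspicious work in child spans. Here is a minimal sketch, assuming the tracer and ctx from the snippet above; queryOrders is a hypothetical stand-in for your real data-access code:

package main

import (
    "context"

    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

// queryOrders is a hypothetical placeholder for your real database call.
func queryOrders(ctx context.Context, orderID string) error { return nil }

func processPayment(ctx context.Context, tracer trace.Tracer, orderID string) error {
    // Child span: it inherits the trace ID from the parent "process_payment" span,
    // so a slow query shows up as one long bar inside the same trace.
    ctx, span := tracer.Start(ctx, "db.query_orders")
    defer span.End()

    if err := queryOrders(ctx, orderID); err != nil {
        span.RecordError(err)                              // attach the error as a span event
        span.SetStatus(codes.Error, "order lookup failed") // mark the span as failed
        return err
    }
    return nil
}

If the order lookup eats 3.8 of those 4 seconds, the trace shows it instantly, with no guessing.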
The Infrastructure Tax of Observability
Here is the trade-off nobody tells you about: observability is resource-hungry.
Running a Prometheus stack to scrape thousands of targets, plus Loki for logs, plus Jaeger or Tempo for traces, requires serious IOPS. I have seen "budget" VPS providers absolutely melt when a developer tries to persist high-resolution metrics to disk.
Pro Tip: Never run your TSDB (Time Series Database) on standard HDD or shared SATA SSD storage. The constant small writes of the Prometheus WAL (Write-Ahead Log), plus periodic compaction churn, will induce high iowait on slow disks, causing gaps in your graphs.
This is where the underlying hardware of CoolVDS becomes a technical necessity rather than a luxury. We utilize enterprise-grade NVMe storage arrays. When you are pushing 50,000 samples per second into your monitoring stack, you need the low latency of NVMe to ensure your observability tools don't become the bottleneck themselves.
Configuring Prometheus for High Load
If you are self-hosting Prometheus on a CoolVDS instance (which gives you full data sovereignty—crucial for Norwegian GDPR compliance), you need to tune your retention and block sizes in the startup flags:
# /etc/systemd/system/prometheus.service
[Service]
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/data \
    --storage.tsdb.retention.time=15d \
    --storage.tsdb.min-block-duration=2h \
    --storage.tsdb.max-block-duration=2h \
    --web.enable-lifecycle
Pinning the minimum and maximum block duration to the same 2h value keeps blocks small and uniform and effectively disables local compaction, so you avoid the periodic compaction I/O spikes that could affect production services running on the same node.
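A note on that last flag: --web.enable-lifecycle exposes management endpoints over HTTP (so keep the port firewalled), and in return it lets you reload prometheus.yml without restarting the server by POSTing to /-/reload. A quick sketch, assuming Prometheus listens on localhost:9090:

package main

import (
    "log"
    "net/http"
)

func main() {
    // With --web.enable-lifecycle set, a POST to /-/reload tells Prometheus to
    // re-read prometheus.yml without a restart and without touching the TSDB.
    resp, err := http.Post("http://localhost:9090/-/reload", "", nil)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        log.Fatalf("reload failed: %s", resp.Status)
    }
    log.Println("Prometheus configuration reloaded")
}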
Data Sovereignty and the "Schrems II" Reality
For Norwegian CTOs, observability data often contains PII (Personally Identifiable Information). IP addresses, user IDs, and email fragments can leak into logs.
Sending these logs to a US-based SaaS observability platform is a legal minefield in 2023 due to Schrems II rulings. The Datatilsynet (Norwegian Data Protection Authority) is increasingly strict about where data "lives."
Hosting your own observability stack (Grafana/Prometheus/Loki) on a CoolVDS server located physically in Oslo or nearby European hubs ensures:
- Compliance: Data never leaves the EEA.
- Latency: Your monitoring probes are close to your users (NIX peering points).
- Cost Control: You aren't paying per-gigabyte ingestion fees that SaaS vendors charge.
Building the Dashboard
Once you have the data, you need to visualize it. Don't just graph "CPU Usage." Graph Saturation.
Use the USE Method (Utilization, Saturation, Errors). Here is a PromQL query to detect disk saturation on a Linux node, which is a leading indicator of upcoming downtime:
rate(node_disk_io_time_seconds_total[1m])
If this value approaches 1.0 (or 100%), your processes are blocking on disk. On a standard VPS, you might hit this often. On CoolVDS NVMe instances, you have significantly more headroom.
Conclusion
Observability is not about buying a tool; it's about building a culture of "why." It requires instrumentation, discipline, and the right hardware to support the data ingestion.
Don't let slow I/O kill your insights. If you are ready to build a robust, compliant observability stack that gives you X-ray vision into your applications, you need the raw power to back it up.
Deploy a high-performance NVMe KVM instance on CoolVDS today and stop guessing why your app is slow.