Stop Staring at Green Dashboards While Your App Burns
It is 03:00 CET. PagerDuty just woke you up. You stumble to your workstation, open Grafana, and see a wall of green. CPU is at 40%. RAM is fine. Disk usage is stable. Yet, support tickets are flooding in: "Checkout is timing out for users in Bergen."
This is the failure of monitoring. You are watching for known unknowns: the things you predicted might break. You checked CPU because CPU has spiked before. You checked disk space because you ran out of it last year.
Observability is different. It is the property of a system that allows you to ask new questions without deploying new code. It answers the unknown unknowns. It tells you that the checkout is timing out because a third-party payment gateway API is adding 400ms of latency, causing a thread lock in your PHP-FPM pool, which isn't visible on your CPU graph because the processes are in a sleeping state waiting for I/O.
In the Nordic market, where reliability is expected and latency to Oslo exchanges (NIX) is scrutinized, the difference between "it's up" and "it's working" is the difference between retaining a client and losing them to a competitor.
The Three Pillars (And Why They Usually Fail)
In 2022, we talk about the three pillars: Metrics, Logs, and Traces. But having all three means nothing if they aren't correlated.
1. Metrics (The "What")
Metrics are cheap. They are aggregatable numbers: counts, gauges, histograms. You use Prometheus for this. It answers: "Is memory usage high?" (A minimal instrumentation sketch follows the three pillars below.)
2. Logs (The "Why")
Logs are expensive. They are high-fidelity text records. You use the ELK stack (Elasticsearch, Logstash, Kibana) or Loki. They answer: "What error did the database return?"
3. Traces (The "Where")
Traces are the glue. They track a request ID across microservices. They answer: "Which service caused the bottleneck?"
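To make the Metrics pillar concrete, here is the instrumentation sketch mentioned above, using the official prometheus_client Python library. The metric names, the port, and the simulated checkout handler are illustrative assumptions, not part of any real service.
# A minimal sketch; metric names and the fake workload are illustrative assumptions.
from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# A counter only goes up; a histogram buckets observed latencies so
# Prometheus can aggregate percentiles across instances.
CHECKOUT_REQUESTS = Counter("checkout_requests_total", "Total checkout requests", ["status"])
CHECKOUT_LATENCY = Histogram("checkout_request_seconds", "Checkout latency in seconds")

def handle_checkout():
    with CHECKOUT_LATENCY.time():              # records the duration of this block
        time.sleep(random.uniform(0.05, 0.4))  # stand-in for real checkout logic
    CHECKOUT_REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_checkout()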
Pro Tip: If you aren't using structured logging (JSON), you are wasting CPU cycles parsing text with Regex. Stop writing logs for humans; write them for machines.
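As a sketch of what machine-first logging looks like in Python (standard library only; the logger name, field names, and the example trace_id are assumptions), each record becomes a single JSON object that Logstash or Loki can ingest without regex, and the trace_id field is what ties a log line back to its trace:
# A minimal sketch using only the standard library; the logger name,
# field names, and the example trace_id are illustrative assumptions.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # carry a trace_id when one is supplied, so this log line can be
            # correlated with the distributed trace that produced it
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment gateway timeout", extra={"trace_id": "0af7651916cd43dd8448eb211c80319c"})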
Structuring Nginx for Observability
Most default Nginx configs are useless for observability. They dump unstructured text. Here is how I configure Nginx on high-traffic CoolVDS instances to feed directly into an ELK or Graylog pipeline without heavy parsing overhead.
http {
    log_format json_combined escape=json
        '{ "time_local": "$time_local", '
        '"remote_addr": "$remote_addr", '
        '"remote_user": "$remote_user", '
        '"request": "$request", '
        '"status": "$status", '
        '"body_bytes_sent": "$body_bytes_sent", '
        '"request_time": "$request_time", '
        '"upstream_response_time": "$upstream_response_time", '
        '"upstream_addr": "$upstream_addr", '
        '"http_referrer": "$http_referer", '
        '"http_user_agent": "$http_user_agent" }';

    access_log /var/log/nginx/access.json json_combined;
}
The $upstream_response_time variable is critical here. It isolates whether the slowness is Nginx or the backend application (Node, PHP, Python). If request_time is high but upstream_response_time is low, your bottleneck is the network or Nginx itself (perhaps SSL termination overhead).
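A quick way to check this in practice is a small Python sketch that reads the JSON access log defined above and flags requests where Nginx or the network, rather than the upstream, accounts for most of the time. The 50 ms threshold is an assumption; tune it to your own baseline.
# A hedged sketch: the 50 ms threshold is an assumption, not a recommendation.
import json

THRESHOLD = 0.050  # seconds of non-upstream time before we flag a request

with open("/var/log/nginx/access.json") as log_file:
    for line in log_file:
        entry = json.loads(line)
        request_time = float(entry["request_time"])
        upstream_raw = entry.get("upstream_response_time", "")
        # upstream_response_time is "-" when no upstream was contacted, and can
        # hold several values separated by commas/colons when requests are retried
        parts = [p.strip() for p in upstream_raw.replace(":", ",").split(",")]
        upstream_times = [float(p) for p in parts if p not in ("", "-")]
        if not upstream_times:
            continue
        overhead = request_time - sum(upstream_times)
        if overhead > THRESHOLD:
            print(f"{entry['request']}  nginx/network overhead: {overhead * 1000:.1f} ms")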
OpenTelemetry: The Standard You Must Adopt
Vendor lock-in is a plague. A few years ago, you had to choose between Jaeger and Zipkin client libraries. As of 2022, OpenTelemetry (OTel) has matured enough to be the default choice for instrumenting your code. It provides a single set of APIs to generate traces, metrics, and logs.
Here is a practical example of instrumenting a Python service to send traces to a collector. Note that we are using the OTel Python SDK.
# app.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Define the service name that identifies this app in the tracing backend
resource = Resource(attributes={
    "service.name": "payment-service-oslo-1"
})

trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)

# Configure the exporter to send spans to your local collector (OTLP over gRPC)
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment.currency", "NOK")
    span.set_attribute("customer.region", "Vestland")
    # ... logic here ...
    print("Payment processed")
By tagging the span with customer.region, we can later filter traces to see whether latency is specific to users in Vestland versus Oslo.
The Infrastructure Impact: Why "Cloud" Isn't Enough
You cannot observe what you do not trust. This is where the hardware reality hits.
Running an ELK stack or a heavy Prometheus instance requires significant I/O throughput. Elasticsearch is notoriously I/O-hungry during indexing. If you are running this on a budget VPS with shared spinning disks (or throttled SSDs), your observability tooling itself will become the bottleneck.
I have seen clusters crash because the logging volume spiked during a DDoS attack, and the disk I/O wait choked the CPU. The logs describing the attack caused the server to fail before the attack actually did.
The Hardware Requirement
| Component | Resource Criticality | CoolVDS Solution |
|---|---|---|
| Prometheus TSDB | Memory & Random Write IOPS | Dedicated RAM allocation (no ballooning) |
| Elasticsearch | High I/O Throughput | NVMe Storage (essential for indexing speeds) |
| Tracing Collectors | Network Latency | 1Gbps Uplink & Local Peering |
At CoolVDS, we enforce strict isolation using KVM. Unlike container-based VPS (OpenVZ/LXC), where a neighbor's heavy logging could spike your iowait, KVM provides dedicated resource mapping. When you are debugging a 5ms latency spike in your app, you need to be sure that 5ms is your code, not your host's noisy neighbor.
Data Sovereignty and GDPR (Schrems II)
In Norway, observability data is often personal data. IP addresses in access logs, User IDs in traces—these fall under GDPR. Since the Schrems II ruling, sending this data to US-hosted SaaS observability platforms (like New Relic or Datadog's US regions) is a compliance minefield.
Self-hosting your observability stack (Grafana/Prometheus/Loki) on a Norwegian server isn't just a technical preference; it is a legal safeguard. By keeping the data on physical hardware in Oslo, you satisfy data residency requirements. You are both the data controller and the processor.
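Even with the stack in Oslo, data minimization still applies. Here is a minimal sketch (not legal advice) of pseudonymizing client IPs before they reach Elasticsearch or Loki; the /24 and /48 truncation prefixes are assumptions to validate against your own compliance requirements.
# A minimal sketch, not legal advice: the truncation prefixes (/24, /48) are assumptions.
import ipaddress
import json

def mask_ip(value: str) -> str:
    ip = ipaddress.ip_address(value)
    if ip.version == 4:
        # keep the /24 network, zero out the host octet
        return str(ipaddress.ip_network(f"{value}/24", strict=False).network_address)
    # for IPv6, keep only the /48 prefix
    return str(ipaddress.ip_network(f"{value}/48", strict=False).network_address)

def scrub(line: str) -> str:
    entry = json.loads(line)
    if "remote_addr" in entry:
        entry["remote_addr"] = mask_ip(entry["remote_addr"])
    return json.dumps(entry)

print(scrub('{"remote_addr": "203.0.113.42", "request": "GET /checkout HTTP/1.1"}'))
# prints {"remote_addr": "203.0.113.0", "request": "GET /checkout HTTP/1.1"}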
Configuration for Prometheus Scraping
Don't just scrape everything. High cardinality kills Prometheus. Here is a disciplined scrape config that whitelists the metric families you actually use and drops high-cardinality series, keeping the memory footprint on your VPS manageable.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
    metric_relabel_configs:
      # Keep only the metric families we actually graph and alert on
      - source_labels: [__name__]
        regex: 'node_cpu_seconds_total|node_memory_.*|node_filesystem_.*|node_network_.*|node_disk_read_bytes_total|node_disk_written_bytes_total'
        action: keep
      # Drop high cardinality filesystem types we don't care about
      - source_labels: [fstype]
        regex: 'tmpfs|fuse.lxcfs'
        action: drop
Final Thoughts: Buy or Build?
Building an observability stack takes time. But in 2022, the tools are robust enough that the TCO (Total Cost of Ownership) often favors self-hosting if you have the skills. You avoid data egress fees, you ensure GDPR compliance by keeping data in Norway, and you gain total granular control.
However, this stack requires performant iron. Don't throw a heavy OTel collector and Elasticsearch cluster on a generic $5 VPS and expect it to survive Black Friday.
If you are ready to build a monitoring stack that actually tells you the truth, start with a foundation that doesn't lie about resources. Spin up a Performance NVMe instance on CoolVDS today and see what honest iowait numbers actually look like.