Stop Staring at Red Lights: Transforming Monitoring into True Observability
It was 3:00 AM on a Tuesday. The pager screamed. Our primary load balancer in Oslo had just dropped 40% of incoming traffic. The dashboard was a sea of red. CPU? Normal. RAM? 60%. Disk space? Plenty. The monitoring system, effectively, was shrugging its shoulders. It told us that we were failing, but it had zero clue why.
This is the failure of traditional monitoring. It checks specific pulse points—is port 80 open? Is the process running?—but fails to account for the complex interactions in a modern distributed system. If you are running serious workloads in 2025, knowing the server is "up" is meaningless if the application logic is deadlocked.
In this guide, we aren't just installing Zabbix and calling it a day. We are building an observability pipeline using OpenTelemetry, Prometheus, and eBPF that lets you trace a single request from a user in Bergen through your Nginx ingress, into your Go microservices, and down to the specific syscalls in the kernel.
The Lie of "99.9% Uptime"
Most VPS providers sell you "uptime" based on whether the hypervisor is powered on. That is a vanity metric. If your disk I/O latency spikes to 500ms because your neighbor is mining crypto, your database is effectively down, even if the ping works. This is why we built CoolVDS on strict KVM isolation with direct NVMe paths. We don't steal your CPU cycles, and we don't hide I/O wait times.
Step 1: Structured Logging (Stop Grepping Text)
If you are still parsing raw text logs with regex, you are wasting valuable minutes during an outage. In 2025, logs must be machine-readable data streams. We configure Nginx to output JSON immediately. This allows tools like Loki or Elasticsearch to aggregate based on `request_time` or `upstream_response_time` instantly.
Nginx JSON Configuration
Edit your /etc/nginx/nginx.conf to define a structured log format:
http {
    log_format json_analytics escape=json
        '{'
        '"msec": "$msec", ' # request unixtime in seconds with millisecond resolution
        '"connection": "$connection", ' # connection serial number
        '"connection_requests": "$connection_requests", ' # number of requests made through this connection
        '"pid": "$pid", ' # worker process PID
        '"request_id": "$request_id", ' # unique request id
        '"request_length": "$request_length", ' # request length (including headers and body)
        '"remote_addr": "$remote_addr", ' # client IP
        '"remote_user": "$remote_user", ' # client HTTP username
        '"remote_port": "$remote_port", ' # client port
        '"time_local": "$time_local", '
        '"time_iso8601": "$time_iso8601", ' # local time in ISO 8601 format
        '"request": "$request", ' # full request line, e.g. "GET /path HTTP/1.1"
        '"request_uri": "$request_uri", ' # full original URI, including arguments
        '"args": "$args", ' # query string arguments
        '"status": "$status", ' # response status code
        '"body_bytes_sent": "$body_bytes_sent", ' # bytes sent to the client, excluding response headers
        '"bytes_sent": "$bytes_sent", ' # total bytes sent to the client
        '"http_referer": "$http_referer", ' # HTTP referer
        '"http_user_agent": "$http_user_agent", ' # user agent
        '"http_x_forwarded_for": "$http_x_forwarded_for", ' # X-Forwarded-For header
        '"http_host": "$http_host", ' # the request Host: header
        '"server_name": "$server_name", ' # name of the vhost serving the request
        '"request_time": "$request_time", ' # request processing time in seconds with msec resolution
        '"upstream": "$upstream_addr", ' # upstream backend server for proxied requests
        '"upstream_connect_time": "$upstream_connect_time", ' # upstream connection time, incl. TLS handshake
        '"upstream_header_time": "$upstream_header_time", ' # time spent receiving upstream headers
        '"upstream_response_time": "$upstream_response_time", ' # time spent receiving the upstream body
        '"upstream_response_length": "$upstream_response_length", ' # upstream response length
        '"upstream_cache_status": "$upstream_cache_status", ' # cache HIT/MISS where applicable
        '"ssl_protocol": "$ssl_protocol", ' # TLS protocol
        '"ssl_cipher": "$ssl_cipher", ' # TLS cipher
        '"scheme": "$scheme", ' # http or https
        '"request_method": "$request_method", ' # request method
        '"server_protocol": "$server_protocol", ' # request protocol, e.g. HTTP/1.1 or HTTP/2.0
        '"pipe": "$pipe", ' # "p" if the request was pipelined, "." otherwise
        '"gzip_ratio": "$gzip_ratio", '
        '"http_cf_ray": "$http_cf_ray"' # Cloudflare Ray ID, if present
        '}';

    access_log /var/log/nginx/access_json.log json_analytics;
}
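Once the log is structured, incident triage becomes a query instead of a regex hunt. As a minimal sketch of that payoff (in Go, to match the services described later), the program below scans the access log written above and counts requests slower than one second per upstream backend. The log path matches the access_log directive; the one-second threshold is an illustrative assumption.

// slowlog.go: scan the JSON access log produced by the log_format above and
// count requests slower than one second per upstream backend.
package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "log"
    "os"
    "strconv"
)

// accessEntry maps the fields we care about; nginx logs numeric values as strings.
type accessEntry struct {
    RequestURI  string `json:"request_uri"`
    Status      string `json:"status"`
    RequestTime string `json:"request_time"` // seconds, e.g. "0.083"
    Upstream    string `json:"upstream"`
}

func main() {
    f, err := os.Open("/var/log/nginx/access_json.log")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    slowPerUpstream := map[string]int{}
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        var e accessEntry
        if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
            continue // skip malformed lines
        }
        t, err := strconv.ParseFloat(e.RequestTime, 64)
        if err != nil {
            continue
        }
        if t > 1.0 { // illustrative threshold: anything over one second is suspect
            if e.Upstream == "" {
                e.Upstream = "(served locally)"
            }
            slowPerUpstream[e.Upstream]++
        }
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
    for upstream, n := range slowPerUpstream {
        fmt.Printf("%-25s %d slow requests\n", upstream, n)
    }
}

In production, Loki or Elasticsearch does this aggregation for you; the point is that a structured log turns it into a ten-line job rather than an hour of grepping.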
Step 2: Metrics with Prometheus & Node Exporter
Metrics show trends. But high-cardinality metrics (like tracking latency per user ID) will explode the memory usage of your Time Series Database (TSDB). On cheap hosting, this causes the OOM killer to murder your monitoring process. Because CoolVDS allocates dedicated RAM, you can maintain deeper retention periods.
Pro Tip: When monitoring systems in Norway, always check the `node_time_zone_offset_seconds` metric. Misaligned timezones between servers in Oslo and UTC-based databases often lead to "missing" data chunks during aggregation.
Here is a robust prometheus.yml scrape config that combines static targets with file-based service discovery (essential for dynamic environments where targets come and go):
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):(.*)'
        target_label: instance
        replacement: '${1}'

  - job_name: 'coolvds_services'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
        refresh_interval: 5m
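The same cardinality discipline applies inside your application. The sketch below uses the prometheus/client_golang library to expose a latency histogram labelled by route and method, never by user ID, so the label set stays small; the metric name, the /api/orders route, and port 8080 are illustrative assumptions.

// metrics.go: expose low-cardinality application metrics for the scrape
// config above. Metric name, route, and port are illustrative.
package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Label by route and method, never by user ID or session: a handful of label
// values keeps the TSDB's memory usage predictable.
var httpDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "app_http_request_duration_seconds",
    Help:    "HTTP request latency by route and method.",
    Buckets: prometheus.DefBuckets,
}, []string{"route", "method"})

// instrument wraps a handler and records its latency under a fixed route label.
func instrument(route string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next(w, r)
        httpDuration.WithLabelValues(route, r.Method).Observe(time.Since(start).Seconds())
    }
}

func main() {
    http.HandleFunc("/api/orders", instrument("/api/orders", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    }))
    http.Handle("/metrics", promhttp.Handler()) // scraped by the Prometheus job defined above
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Add the service to a file under /etc/prometheus/targets/ (e.g. [{"targets": ["localhost:8080"]}]) and the file_sd job above will pick it up on its next refresh.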
Step 3: Distributed Tracing with OpenTelemetry (OTel)
By late 2025, OpenTelemetry is the de facto standard for instrumentation. Tracing lets you see the waterfall of a request: if your API call takes 200ms, the trace shows that 180ms of it was spent waiting on a lock in Redis.
To do this, you run the OTel Collector as a sidecar or agent. It receives traces from your app, batches them, and sends them to your backend (Tempo, Jaeger, etc.).
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  otlp:
    endpoint: "tempo-backend.internal:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
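On the application side, each Go service exports its spans to that collector. Here is a minimal sketch, assuming the collector's OTLP/gRPC receiver is reachable at localhost:4317 (matching the otlp receiver above); the service name, route, and span names are illustrative, and exact module versions track the upstream OpenTelemetry Go releases.

// tracing.go: export spans from a Go service to the collector defined above.
// Endpoint, service name, and span names are illustrative assumptions.
package main

import (
    "context"
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func main() {
    ctx := context.Background()

    // Ship spans over OTLP/gRPC to the local collector (the otlp receiver above).
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(), // plain gRPC on the host; use TLS across the network
    )
    if err != nil {
        log.Fatal(err)
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter), // batch in-process before the collector batches again
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("checkout-api"), // illustrative service name
        )),
    )
    otel.SetTracerProvider(tp)
    defer tp.Shutdown(ctx) // flush pending spans on exit

    handler := func(w http.ResponseWriter, r *http.Request) {
        // Child span around the slow part of the request; it nests under the
        // server span created by the otelhttp wrapper below.
        _, span := otel.Tracer("checkout").Start(r.Context(), "query-inventory")
        defer span.End()
        w.Write([]byte("ok"))
    }

    http.Handle("/checkout", otelhttp.NewHandler(http.HandlerFunc(handler), "checkout"))
    if err := http.ListenAndServe(":8080", nil); err != nil {
        log.Print(err)
    }
}

The otelhttp wrapper creates a server span per request and propagates incoming trace context, so the manual child span shows up in the same waterfall in Tempo or Jaeger.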
Step 4: Kernel Visibility with eBPF
Sometimes the application thinks it's fine, but the kernel is struggling. eBPF (Extended Berkeley Packet Filter) lets us run sandboxed programs in the Linux kernel without changing source code. This is how we debug "phantom" latency on CoolVDS instances when customers suspect network issues.
Use bpftrace to histogram the size of block I/O requests as they are issued; a flood of tiny writes is what kills I/O throughput:
bpftrace -e 'tracepoint:block:block_rq_issue { @ = hist(args->bytes); }'
Or track TCP retransmits (a sign of packet loss between the user and our NIX connection):
bpftrace -e 'kprobe:tcp_retransmit_skb { @[comm] = count(); }'
The Hardware Reality of Observability
Running this stack is expensive on resources. An OpenTelemetry collector processing thousands of spans per second requires CPU. A Prometheus instance ingesting millions of data points needs incredibly fast random write speeds to disk.
| Feature | Standard VPS | CoolVDS Architecture |
|---|---|---|
| Storage | Shared SSD (High I/O Wait) | Dedicated NVMe (Low Latency for TSDB) |
| CPU Steal | Common (Noisy Neighbors) | Zero (Dedicated Cores) |
| Compliance | Data often routed via US providers (CLOUD Act exposure) | 100% Norwegian Sovereignty (GDPR Safe) |
When you dump 50GB of log data daily into an ELK stack or Grafana Loki, a standard spinning disk or cheap SATA SSD will choke. Queries that should take milliseconds will take minutes. We designed CoolVDS specifically for these high-throughput workloads. Our NVMe arrays ensure that your write-heavy observability tools never bottleneck the very applications they are supposed to monitor.
Local Compliance: The Norwegian Context
Observability data often contains PII (Personally Identifiable Information). IP addresses, user IDs, and even URL parameters can identify individuals. Under GDPR and the strict interpretations by Datatilsynet here in Norway, sending this trace data to a US-based SaaS cloud can be legally risky.
Hosting your observability stack on CoolVDS ensures data residency. Your logs stay on servers physically located in Oslo, governed by Norwegian law, protected from foreign surveillance overreach. For CTOs and Systems Architects, this removes a massive compliance headache.
Conclusion
Monitoring is asking "Is the system happy?" Observability is asking "Why is the system acting weird?" You need the latter to survive in 2025.
Don't let your monitoring infrastructure be the weak link. You need the I/O throughput to ingest logs and the CPU stability to analyze traces in real-time. Deploy a CoolVDS NVMe instance today and start seeing what is actually happening inside your code.