Observability vs Monitoring: Why Green Dashboards Lie
It was 3:42 AM on a Tuesday. My phone buzzed on the nightstand. I groggily checked the alert: "CPU Load High on db-primary-01." By the time I SSH'd in, the load had dropped. The site was up. All status checks in Nagios were a comforting, lying green. Yet, the support ticket queue was filling up with angry Norwegian users claiming checkout failures.
This is the failure of traditional monitoring. We knew that the server was stressed, but we had absolutely no idea why. In 2019, with distributed microservices becoming the standard even here in the Nordics, knowing "what is broken" isn't enough. You need to know "why it's weird." That is Observability.
The Lie of "System Healthy"
Monitoring is for known unknowns. You know disk space can run out, so you set a threshold at 90%. You know the Nginx process might die, so you check for the PID.
Observability is for unknown unknowns. It’s the ability to ask questions about your system without knowing in advance what you wanted to ask. It requires high-cardinality data—User IDs, request IDs, shopping cart totals—contexts that simple "up/down" checks ignore.
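To make that concrete, here is what one "wide" structured event for a single checkout request could look like. All field names and values are purely illustrative, not from any real system:

{
  "request_id": "9f2c1a77",
  "user_id": 48211,
  "country": "NO",
  "cart_total_nok": 1299.00,
  "payment_provider": "vipps",
  "upstream": "payment-api",
  "duration_ms": 2470,
  "status": 502
}

With events like this stored, you can ask "show me slow checkouts from Norway over 1000 NOK that hit payment-api" after the fact, without having planned that question in advance.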
To achieve this, we rely on the three pillars: Metrics, Logs, and Traces.
1. Metrics: The "What"
We used to rely on Cacti or Munin graphs that averaged data over 5 minutes, which hides spikes. Today, we use Prometheus. It pulls metrics from exporters rather than waiting for agents to push them, so a single central server initiates every connection and you only open the exporter ports to that one address, which matters when your firewall rules are tight.
Here is a standard prometheus.yml configuration we deploy on our management nodes to scrape endpoints every 15 seconds. Note the intervals: don't go below 10s unless you have the storage I/O to absorb the extra samples.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds-nodes'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']

  - job_name: 'nginx-exporter'
    static_configs:
      - targets: ['10.0.0.5:4040']

  - job_name: 'mysql-exporter'
    params:
      auth_module: [client]
    static_configs:
      - targets: ['10.0.0.7:9104']
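The node targets above assume node_exporter is already listening on port 9100 on each VPS. A quick way to get it there (version 0.18.x shown; grab the tarball from the Prometheus releases page and adjust paths to taste):

# unpack the release tarball, start the exporter, and confirm it answers
tar xzf node_exporter-0.18.1.linux-amd64.tar.gz
cd node_exporter-0.18.1.linux-amd64
./node_exporter --web.listen-address=":9100" &
curl -s localhost:9100/metrics | head

In production you would wrap that in a systemd unit rather than a backgrounded shell job.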
When you run this, you aren't just checking if MySQL is up. You are scraping mysql_global_status_innodb_row_lock_waits. If that metric spikes, your database isn't down, but your checkout process is definitely frozen.
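To actually get woken up by that, attach an alerting rule. This is only a sketch: the threshold, rule name, and file path are placeholders, and the file must be listed under rule_files in prometheus.yml:

# /etc/prometheus/rules/mysql.yml (placeholder path)
groups:
  - name: mysql-locks
    rules:
      - alert: InnoDBRowLockWaitsClimbing
        expr: rate(mysql_global_status_innodb_row_lock_waits[5m]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Row lock waits rising on {{ $labels.instance }}"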
2. Logs: The "Why"
Grepping text files is fine for a single VPS. It fails when you have five web servers behind a load balancer. You need centralized logging. The ELK Stack (Elasticsearch, Logstash, Kibana) is the heavyweight champion here, though Graylog is a decent alternative for smaller shops.
The problem? Java. Elasticsearch is hungry. Running an ELK stack on a budget VPS with spinning rust (HDD) is suicide. The I/O wait will kill your indexing rate.
Pro Tip: Set your JVM heap to roughly 50% of available RAM, but never cross ~31GB, or the JVM loses compressed ordinary object pointers (oops) and every object reference doubles in size. On a CoolVDS NVMe instance with 16GB RAM, set ES_JAVA_OPTS="-Xms8g -Xmx8g" in /etc/default/elasticsearch.
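Before trusting a bigger heap, you can ask the JVM itself whether compressed oops are still in effect at a given size. Run it with the same JVM Elasticsearch uses:

# look for "UseCompressedOops = true" in the output
java -Xmx31g -XX:+PrintFlagsFinal -version | grep -i usecompressedoops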
Here is a robust Logstash pipeline configuration used to parse Nginx access logs into structured JSON. This allows you to filter by client_ip or response_code instantly in Kibana.
input {
  beats {
    port => 5044
  }
}

filter {
  if [type] == "nginx-access" {
    grok {
      match => { "message" => "%{IPORHOST:client_ip} - %{DATA:user_name} \[%{HTTPDATE:timestamp}\] \"%{WORD:method} %{DATA:request_path} HTTP/%{NUMBER:http_version}\" %{NUMBER:response_code} %{NUMBER:bytes} \"%{DATA:referrer}\" \"%{DATA:agent}\"" }
    }
    date {
      # lowercase yyyy: uppercase YYYY is week-year in Joda time and misparses dates around New Year
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    geoip {
      source => "client_ip"
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "nginx-%{+YYYY.MM.dd}"
  }
}
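For completeness, the pipeline above expects a Beats shipper on port 5044 with a type field set to nginx-access. A minimal Filebeat 6.x input along these lines would do it (the Logstash address is a placeholder):

# /etc/filebeat/filebeat.yml (trimmed)
filebeat.inputs:
  - type: log
    paths:
      - /var/log/nginx/access.log
    fields:
      type: nginx-access
    fields_under_root: true      # puts "type" at the top level, where the filter's conditional looks

output.logstash:
  hosts: ["10.0.0.8:5044"]       # placeholder: your Logstash host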
3. Tracing: The "Where"
If you have a monolith, stack traces are enough. If you are splitting code into microservices (even just separating your frontend from your API), you need distributed tracing. We are seeing a lot of traction with Jaeger in 2019.
Tracing allows you to visualize the lifespan of a request as it hops from your Load Balancer -> Nginx -> Python API -> PostgreSQL. You can see exactly which span took 400ms.
Here is how you initialize a Jaeger tracer in a Python application (using the jaeger-client library):
import logging
from jaeger_client import Config

def init_tracer(service):
    logging.getLogger('').handlers = []
    logging.basicConfig(format='%(message)s', level=logging.DEBUG)

    config = Config(
        config={
            # 'const' sampler with param=1 records every trace: fine for a
            # staging box, far too chatty for production traffic
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
            'reporter_batch_size': 1,
        },
        service_name=service,
        validate=True,
    )
    # this call also sets opentracing.tracer
    return config.initialize_tracer()
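Once the tracer exists, wrapping a suspect code path in a span is short. The service, span, and tag names below are placeholders for illustration only:

# placeholder service and span names
tracer = init_tracer('checkout-api')

with tracer.start_span('charge-card') as span:
    span.set_tag('order.id', '12345')   # high-cardinality context rides on the span
    # ... call the payment provider here ...

tracer.close()   # flush buffered spans before the process exits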
The Infrastructure Reality Check
Observability isn't free. Storing metrics, indexing logs, and sampling traces generates massive I/O. I recently audited a setup where the logging cluster consumed more resources than the application itself. They were trying to run Elasticsearch on shared hosting. It was a disaster.
Latency Matters. If your servers are in Frankfurt but your customers are in Oslo, network latency adds noise to your tracing data. You want your observability stack close to your metal.
| Feature | Monitoring (Old School) | Observability (2019 Standard) |
|---|---|---|
| Question | Is the server up? | Why is the server slow? |
| Data Source | Passive Checks (Ping, SNMP) | Instrumentation (Code, Events) |
| Granularity | Server Level | Request/User Level |
| Storage Impact | Low | Very High (Requires NVMe) |
Practical Diagnostic Commands
Before you build the dashboard of your dreams, master the terminal. These are the tools I use daily when the dashboard is ambiguous.
1. Check real-time I/O usage:
Don't guess if it's the database or the logs. Look.
iotop -oPa
2. Verify network connections:
Are you running out of file descriptors or ports? netstat is deprecated; use ss.
ss -tunlp | grep :80
3. Quick log parsing:
Sometimes you just need to count responses by status code right now ($9 is the status field in Nginx's default combined log format).
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn
4. Check Docker stats:
If you are containerized, this is your first stop.
docker stats --no-stream
5. Test endpoint latency manually:
This gives you the time to first byte (TTFB).
curl -w "Connect: %{time_connect} TTFB: %{time_starttransfer} Total: %{time_total}\n" -o /dev/null -s https://coolvds.com
The Legal & Physical Layer
We operate under strict GDPR mandates. Datatilsynet (The Norwegian Data Protection Authority) is not lenient. When you log user data—IP addresses, usernames, emails—into your observability stack, that log storage becomes a compliance target.
Hosting your ELK stack on US-controlled cloud infrastructure introduces complex liability regarding data processing agreements. Keeping your data within Norway or the EEA is the safest architectural decision you can make in 2019. It simplifies your Article 30 records significantly.
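One practical mitigation is to pseudonymise before indexing. This is a sketch only: the Logstash fingerprint filter below hashes the client IP (run it after the geoip lookup so you keep the country without the address), and remember that under GDPR hashing is pseudonymisation, not anonymisation:

filter {
  fingerprint {
    source => "client_ip"
    target => "client_ip_hash"
    method => "SHA256"
    key    => "rotate-this-secret"     # placeholder HMAC key
  }
  mutate {
    remove_field => ["client_ip"]      # keep the hash, drop the raw address
  }
}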
Why We Built CoolVDS for This
We didn't just buy servers and put a sticker on them. We architected CoolVDS because we were tired of "noisy neighbors" ruining our metric collection. When another user on a shared host hammers their disk, your database latency spikes, and your alerts wake you up. That’s false positive fatigue.
We use KVM virtualization to ensure strict resource isolation. Our local storage is pure NVMe, providing the IOPS necessary to run high-ingest Elasticsearch clusters without choking your production app. And we peer directly at NIX in Oslo, ensuring your monitoring data reflects internal network speed, not internet congestion.
Observability requires power. Don't starve your eyes to feed your application. If you need a sandbox to test a Prometheus + Grafana stack, spin up a high-performance instance with us.
Ready to stop guessing? Deploy a CoolVDS instance in 55 seconds and see what your code is actually doing.