The "Everything is OK" Fallacy
It is 03:00 CET. Your Nagios dashboard is a comforting sea of green. Every check—HTTP, SSH, MySQL—returns OK. Yet, your inbox is filling up with angry emails from customers in Trondheim and Oslo claiming the checkout process is timing out. You check the server load: 0.5. You check memory: 4GB free. What is happening?
This is the failure of monitoring in its traditional sense. We have spent the last decade perfecting the art of asking "Is the daemon running?" while ignoring the far more important question: "Is the daemon doing its job efficiently?"
As we move into 2014, the complexity of web applications (especially with the rise of Magento and heavy PHP frameworks) demands a shift. We must move from black-box monitoring to white-box introspection—collecting granular metrics that reveal the internal state of the system, not just its pulse.
The Limitation of OpenVZ and Shared Kernels
Before we touch a single configuration file, we must address the infrastructure. Many budget VPS providers in the Nordics still rely heavily on OpenVZ. While efficient, OpenVZ creates a "noisy neighbor" environment where you share the kernel with hundreds of other containers.
Pro Tip: If you cannot access /proc/sys/vm/swappiness or load custom kernel modules for advanced packet tracing, you are flying blind. This is why CoolVDS strictly utilizes KVM (Kernel-based Virtual Machine) virtualization. You need your own kernel to measure your own latency accurately.
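A quick sanity check from inside the guest (a rough heuristic, not a definitive test): OpenVZ containers expose the host's bean counters, while hardware-virtualized guests do not.
# OpenVZ exposes this file inside every container
if [ -f /proc/user_beancounters ]; then
    echo "OpenVZ container (shared kernel)"
else
    # Most KVM/Xen guests advertise the hypervisor flag in cpuinfo
    grep -qi hypervisor /proc/cpuinfo && echo "Hardware-virtualized guest"
fi
On an OpenVZ box, uname -r also reports the host's kernel version, not one you picked.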
Step 1: Exposing Application Internals
Stop relying on port 80 checks. You need to know how Nginx is handling the connection pool. If you are running a high-traffic site, you must enable the stub_status module. This is lightweight and provides real-time data on active connections.
In your /etc/nginx/sites-available/default (or specific vhost), add this inside your server block. Note the security restriction—metrics should never be public.
location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    # Allow your monitoring server IP
    allow 192.168.1.50;
    deny all;
}
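Not every Nginx build includes the module (most distribution packages do; a hand-compiled binary may not). A quick way to verify before reloading:
$ nginx -V 2>&1 | grep -o http_stub_status_module
If the module name prints back, you are good to go.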
Reload Nginx: service nginx reload. Now, a simple curl gives you raw data:
$ curl http://127.0.0.1/nginx_status
Active connections: 243
server accepts handled requests
11562 11562 34920
Reading: 0 Writing: 7 Waiting: 236
If "Waiting" is high but CPU usage is low, your application backend (PHP-FPM) is likely the bottleneck, not the web server. Nagios would show this as "OK". This metric shows you the truth.
Step 2: The Database Reality Check
MySQL is often the silent killer of performance. A "process running" check tells you nothing about lock contention. In 2014, we are seeing massive IO wait times on spinning rust (standard HDDs) caused by unoptimized queries.
On CoolVDS SSD instances, IOPS are plentiful, but bad queries can still stall the CPU. Configure your my.cnf to catch the queries that take longer than 1 second. This is the first step to introspection.
[mysqld]
# Enable the slow query log
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 1
# Log queries not using indexes
log_queries_not_using_indexes = 1
Analyze this log weekly with mysqldumpslow. It allows you to fix code, rather than just throwing more RAM at the problem.
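For instance, to see the ten worst offenders grouped by query pattern and sorted by total query time (adjust the path to match your my.cnf):
$ mysqldumpslow -s t -t 10 /var/log/mysql/mysql-slow.log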
Step 3: Visualizing the Pulse with Graphite
Text logs are hard to correlate. The current best practice is feeding metrics into Graphite. While RRDTool (used by Cacti/Munin) is great for trends, Graphite is superior for real-time granularity.
You can use a simple Python script to pipe system metrics to the Carbon daemon. Here is a raw example of how to send load average to Graphite using netcat (nc), valid for any CentOS 6 or Ubuntu 12.04 system:
#!/bin/bash
SERVER=graphite.yourdomain.internal
PORT=2003
TIMESTAMP=$(date +%s)
# Get 1 minute load average
LOAD=$(cat /proc/loadavg | awk '{print $1}')
echo "servers.coolvds-node-01.loadavg $LOAD $TIMESTAMP" | nc $SERVER $PORT
Put this in a crontab entry that runs every minute. Suddenly, you aren't just seeing "UP"; you are seeing the correlation between your 09:00 backup script and the spike in load average.
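The cron entry itself is one line. Assuming you saved the script as /usr/local/bin/graphite-load.sh (the path is an example) and made it executable:
* * * * * /usr/local/bin/graphite-load.sh >/dev/null 2>&1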
The Importance of I/O Latency
In Norway, where connectivity is excellent thanks to NIX (Norwegian Internet Exchange), network latency is rarely the issue. Disk latency is. If you see high %iowait in top, your disk subsystem is thrashing.
Use iostat (part of the sysstat package) to diagnose this:
$ iostat -x 1
avg-cpu: %user %nice %system %iowait %steal %idle
5.20 0.00 2.10 45.30 0.00 47.40
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
vda 0.00 12.00 0.00 145.00 0.00 1256.00 8.66 2.10 14.50 6.89 100.00
If %util is near 100% and await is high (over 20ms), your current hosting solution is choking. This is where hardware matters. We provision CoolVDS instances with high-performance RAID-10 SSD arrays specifically to keep await times under 5ms, even during heavy database writes.
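If you would rather trend await than eyeball it, the netcat pattern from Step 3 works here too. A sketch, assuming your device is vda and your sysstat version uses the column layout shown above (await is field 10):
#!/bin/bash
SERVER=graphite.yourdomain.internal
PORT=2003
# The second iostat report is the 1-second sample; awk keeps the last match
AWAIT=$(iostat -dx 1 2 | awk '/^vda/ {v=$10} END {print v}')
echo "servers.coolvds-node-01.disk.vda.await $AWAIT $(date +%s)" | nc $SERVER $PORT
Graph that next to your Nginx connection counts and the picture gets very clear, very fast.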
Data Sovereignty and Compliance
Finally, deep introspection means logging data. Be aware of the Personopplysningsloven (Personal Data Act). If you are logging IP addresses or user behavior in your ELK (Elasticsearch, Logstash, Kibana) stack, that data must be treated with care. The Norwegian Data Inspectorate (Datatilsynet) is strict.
Hosting outside of Norway, particularly with US providers relying on the Safe Harbor framework, adds legal complexity. By keeping your monitoring data and production data on CoolVDS servers in Oslo, you simplify compliance. Latency is lower, and legal standing is clearer.
Final Thoughts
Green lights on a dashboard are a vanity metric. True stability comes from understanding the resource usage of your application stack. Start logging your slow queries, graph your Nginx connections, and ensure your underlying virtualization technology (KVM) gives you the access you need to see the truth.
Ready to see what your application is actually doing? Deploy a KVM instance on CoolVDS today. We provide the raw performance; you bring the code.