Stop Just Monitoring Up-Time: The Shift to System Visibility
It is 3:00 AM. Your phone buzzes. It’s a Nagios alert: DISK CRITICAL - 92%. You SSH in, clear some rotated logs in /var/log, and go back to sleep. The server was "up" the whole time. The status check remained green.
But the next morning, you wake up to a dozen emails from angry Norwegian customers saying the checkout page on their Magento store was timing out all night. Your monitoring said everything was fine. Your customers said everything was broken. Who is right?
They are.
In the classic hosting world of 2010, we cared about binary states: Up or Down. But in 2013, with complex LAMP stacks, Varnish layers, and third-party APIs, "Up" is a meaningless metric if the latency is 5,000ms. We need to move from Monitoring (checking status) to Visibility (understanding behavior).
The Lie of "Load Average"
Most sysadmins trigger alerts on Linux Load Average. If it goes above the core count, panic. But what does load actually mean? On a Linux system, it includes processes waiting for CPU and processes waiting for Disk I/O (uninterruptible sleep).
I recently debugged a MySQL server in Oslo that had a Load Average of 40 on a 4-core box. The CPU usage was only 5%. The culprit? Slow I/O. The disks were thrashing. A standard CPU monitor would have told us nothing.
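Before reaching for anything fancier, you can confirm that diagnosis by listing the processes stuck in uninterruptible sleep, the "D" state that inflates load without using a single CPU cycle:
$ ps -eo state,pid,cmd | awk '$1 ~ /^D/'
If that list is full of mysqld and other processes blocked on disk, your problem is I/O, not compute.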
To quantify the I/O side, stop relying on top and learn to love iostat.
$ iostat -x 1
avg-cpu: %user %nice %system %iowait %steal %idle
4.50 0.00 1.20 45.30 0.00 49.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 12.00 55.00 40.00 4500.00 3200.00 81.05 4.50 45.20 8.00 98.50
Look at %util, the device utilization. If your disk is sitting at 98.50% utilization while your CPU is idle, your storage is the bottleneck, and the likely reason is that your VPS provider has oversold their storage array. The await column tells the same story: each request spends 45ms just waiting for the disk. This is common with budget hosts that cram hundreds of VMs onto spinning HDDs.
Graphing Over Alerting: The Graphite Revolution
Nagios tells you the state right now. It is terrible at telling you the state yesterday, or at spotting a trend before it turns into an outage. We are seeing a massive shift towards time-series metrics. Tools like Graphite and StatsD let us stream metrics in real time and render graphs that show correlation.
For example, instead of alerting when Apache fails, we graph the rate of HTTP 500 errors. If that rate spikes from 0.1/sec to 5.0/sec, something is wrong, even if the process is still running.
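A rough sketch of how you might feed that in, assuming a StatsD daemon listening on UDP port 8125 at statsd.coolvds.net (a hypothetical hostname) and the combined log format, where the status code is field 9:
#!/bin/bash
# Stream the access log and fire a StatsD counter for every HTTP 500.
# Assumes the combined log format (status code in field 9) and a StatsD
# daemon on UDP 8125 at statsd.coolvds.net (hypothetical hostname).
tail -F /var/log/nginx/access.log \
  | awk '$9 == 500 { print; fflush() }' \
  | while read -r line; do
      echo "nginx.errors.500:1|c" | nc -u -w 1 statsd.coolvds.net 8125
    done
StatsD aggregates those increments and flushes a per-interval rate to Graphite, which is exactly the number you want on a graph.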
StatsD is optional, though. Here is how simple it is to send a metric straight to Carbon (Graphite's backend) from a shell script:
#!/bin/bash
# Send the current load average to Graphite
LOAD=$(awk '{print $1}' /proc/loadavg)
DATE=$(date +%s)
echo "servers.oslo-node-1.load $LOAD $DATE" | nc -w 1 graphite.coolvds.net 2003
If you plot this against your Nginx active connections, you might see that load spikes exactly when a specific crawl bot hits your site. That is visibility.
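Capturing that second series is the same one-liner pattern. A sketch, assuming Nginx's stub_status module is enabled and exposed at http://127.0.0.1/nginx_status (adjust the URL and metric name to your setup):
#!/bin/bash
# Scrape Nginx stub_status and push the active connection count to Carbon.
# Assumes stub_status is exposed at http://127.0.0.1/nginx_status.
ACTIVE=$(curl -s http://127.0.0.1/nginx_status | awk '/^Active connections/ {print $3}')
DATE=$(date +%s)
echo "servers.oslo-node-1.nginx.active $ACTIVE $DATE" | nc -w 1 graphite.coolvds.net 2003
Run both scripts from cron every minute and Graphite does the rest.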
Logs Are Data, Not Just Text
Grepping through /var/log/syslog across 10 different servers is a nightmare. The emerging best practice in 2013 is centralized logging. We are seeing great results with the ELK stack (Elasticsearch, Logstash, Kibana). Logstash parses the logs, and Elasticsearch lets you search them instantly.
This is critical for compliance in Norway. Under the Personopplysningsloven (Personal Data Act), you must ensure security of processing. Being able to audit exactly who accessed what resource via centralized logs is a massive advantage.
A basic logstash.conf to parse Nginx logs looks like this:
input {
  file {
    path => "/var/log/nginx/access.log"
    type => "nginx-access"
  }
}

filter {
  grok {
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
}

output {
  elasticsearch {
    host => "127.0.0.1"
  }
}
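In 2013 the simplest way to run this is Logstash's monolithic "flatjar". Assuming you have downloaded it next to the config (the version number is illustrative):
$ java -jar logstash-1.2.2-flatjar.jar agent -f logstash.conf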
Once this is running, you can use Kibana to visualize "Top 404 URLs" or "Slowest Response Times" in real-time. Try doing that with grep.
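Even before you build a dashboard, a quick curl against Elasticsearch confirms that parsed events are arriving. The response field comes from the COMBINEDAPACHELOG pattern above, and Logstash writes daily logstash-YYYY.MM.dd indices, hence the wildcard:
$ curl -s 'http://127.0.0.1:9200/logstash-*/_count?q=response:404&pretty'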
Pro Tip: Do not run Elasticsearch on OpenVZ containers. Java heaps and OpenVZ memory accounting (beancounters) do not play well together. You will see random OOM (Out Of Memory) kills. Always use KVM virtualization for Java-heavy workloads.
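If you are not sure what your "VPS" really is under the hood, OpenVZ gives itself away: the beancounters file only exists on OpenVZ kernels, and a non-zero failcnt column means the kernel has already refused resources to your container.
$ cat /proc/user_beancounters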
Database Visibility: The Slow Query Log
Your application is likely bound by the database. If you aren't logging slow queries, you are flying blind. In your my.cnf (for MySQL 5.1/5.5), ensure this is set:
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 1
log_queries_not_using_indexes = 1
A long_query_time of 1 second is a good start. If you are tuning for high performance, drop it to 0.5 (MySQL 5.1 and later accept fractional values). Just be warned: on systems with slow I/O, writing the log itself can degrade performance.
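Once the log has some data in it, do not read it raw. mysqldumpslow ships with the MySQL server and summarizes it; this prints the ten worst offenders sorted by total query time:
$ mysqldumpslow -s t -t 10 /var/log/mysql/mysql-slow.log
If you have the Percona Toolkit installed, pt-query-digest produces an even more detailed report from the same file.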
The Infrastructure Factor
You cannot monitor performance if your underlying infrastructure is the bottleneck. Running Logstash and Elasticsearch requires significant Disk I/O. If you try to run this stack on a cheap VPS with shared HDDs, the logging system itself will cause the server to hang.
This is where hardware choice matters. For our internal monitoring nodes at CoolVDS, we strictly use Solid State Drives (SSDs). For random reads and writes, the gap between an SSD and a SAS HDD is roughly 100x in IOPS. When you are indexing thousands of log lines per second, that capacity is non-negotiable.
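You do not have to take anyone's word for it, either. ioping measures disk latency the way ping measures network latency; point it, with direct I/O to bypass the page cache, at the directory where Elasticsearch keeps its data (the path below assumes the stock package layout):
# -D bypasses the page cache so you measure the disk, not RAM
$ ioping -D -c 10 /var/lib/elasticsearch
On a decent SSD the requests come back in well under a millisecond; on oversold spinning disks you will often see tens of milliseconds per request.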
Furthermore, network latency plays a role in remote monitoring. If your Graphite server is in the US and your servers are in Oslo, the UDP packets StatsD relies on can be silently dropped along the way, and delayed delivery distorts your view of what is happening right now. Keeping your monitoring infrastructure local, peered at NIX (Norwegian Internet Exchange), keeps your data accurate and your logs on Norwegian soil for compliance.
Conclusion
Stop waiting for the red alert. By the time Nagios pages you, the customer has already left. Start graphing your metrics and centralizing your logs. The tools are here—Logstash, Graphite, and robust KVM virtualization.
Visibility requires power. Don't let IOwait kill your insights. Deploy a high-performance SSD instance on CoolVDS today and finally see what your servers are actually doing.