Stop Guessing: A Sysadmin's Guide to Application Performance Monitoring (APM) on Linux
"It works on my machine."
If I had a krone for every time a developer said that to me while the production server was melting down, I could retire to a cabin in Geilo. The reality of hosting in 2017 is that users have zero patience. Amazon found that every 100ms of latency costs them 1% in sales. If your application takes three seconds to load, you aren't just testing your users' patience; you are losing revenue.
Too many sysadmins fly blind. They wait for the phone to ring or a ticket to open. That is not monitoring; that is negligence. Today, we are going to look at how to actually see what is happening inside your stack, specifically focusing on Nginx, PHP-FPM, and why your "cloud" provider might be lying to you about performance.
The Silent Killer: Disk I/O and Wait Times
Before we install fancy dashboards, look at the terminal. When a server feels sluggish but CPU usage seems low, the culprit is almost always I/O Wait (wa in top).
I recently audited a Magento installation for a client in Oslo. They were paying for a "High Performance" VPS from a generic European host. Their site was crawling. A simple check with iostat revealed the truth.
$ iostat -x 1
avg-cpu: %user %nice %system %iowait %steal %idle
5.20 0.00 2.10 45.30 0.00 47.40
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 4.00 0.00 82.00 0.00 688.00 8.39 1.20 15.20 0.00 15.20 1.20 18.40
See that 45.30% iowait? The CPU is sitting idle, twiddling its thumbs, waiting for the hard disk to write data. This is the bottleneck of spinning rust (HDD) or oversold SATA SSDs. In a shared environment, this is often caused by "noisy neighbors"—other tenants on the same physical host hammering the disk.
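To see which process is actually generating that write load, the same sysstat package that gives you iostat also ships pidstat; iotop works too if it is installed. A quick sketch:
$ pidstat -d 1 5     # per-process kB read/written per second, five one-second samples
$ iotop -oPa         # interactive: only processes doing I/O, with accumulated totals (needs the iotop package)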
Pro Tip: Always check %steal in top. If it's above 0, your hypervisor is throttling your CPU cycles because the host node is overloaded. At CoolVDS, we strictly limit tenancy per node to ensure 0% steal time and dedicated resource allocation.
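Both wa and st are easy to keep an eye on from the shell with tools that ship on any standard install:
$ vmstat 1 5                   # the last column (st) is steal time; persistently non-zero means contention on the host
$ top -bn1 | grep "Cpu(s)"     # wa and st appear on the same line, handy for quick scripted checks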
Turning Nginx into a Data Source
Nginx is incredible, but its default logging configuration is useless for APM. It tells you who visited, but not how long it took. We need to define a custom log format that captures $request_time (total time) and $upstream_response_time (how long PHP/Python took to generate the page).
Open your /etc/nginx/nginx.conf and add this inside the http block (the escape=json parameter needs nginx 1.11.8 or newer; it stops stray quotes in URIs and user agents from breaking the JSON):
log_format apm_json escape=json '{"@timestamp": "$time_iso8601", '
'"remote_addr": "$remote_addr", '
'"request_method": "$request_method", '
'"request_uri": "$request_uri", '
'"status": $status, '
'"request_time": $request_time, '
'"upstream_response_time": "$upstream_response_time", '
'"user_agent": "$http_user_agent" }';
access_log /var/log/nginx/access_json.log apm_json;
Check the syntax with nginx -t, then reload Nginx: systemctl reload nginx.
You are now generating structured JSON logs. Why JSON? Because parsing raw text with regex is a nightmare we left behind in 2015. JSON is native to modern ingestion tools like the ELK Stack (Elasticsearch, Logstash, Kibana).
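Structured logs pay off even before the ELK stack is wired up. Assuming jq is installed, a quick sketch that pulls the slowest requests straight out of the new log:
$ jq -r 'select(.request_time > 1) | "\(.request_time)s \(.request_method) \(.request_uri)"' /var/log/nginx/access_json.log | sort -rn | head -20
Anything that shows up here repeatedly is your first optimization target.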
Visualizing the Pain: The ELK Stack (5.x)
In 2017, the ELK stack is the gold standard for open-source monitoring. Version 5.4 was just released this month, and it's significantly faster than the old 2.x days.
You can pipe your new JSON logs into Logstash. Here is a simple logstash.conf snippet to ingest those Nginx logs:
input {
  file {
    path => "/var/log/nginx/access_json.log"
    codec => "json"
  }
}

filter {
  mutate {
    # Force numeric types so Kibana can average and percentile them
    convert => {
      "request_time" => "float"
      "upstream_response_time" => "float"
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "nginx-apm-%{+YYYY.MM.dd}"
  }
}
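Start Logstash, let some traffic hit the site, and confirm documents are actually arriving before you open Kibana (this assumes Elasticsearch is listening on localhost:9200, as configured above):
$ curl -s 'localhost:9200/_cat/indices/nginx-apm-*?v'    # one index per day, with a growing docs.count
$ curl -s 'localhost:9200/nginx-apm-*/_count?pretty'     # total number of log events indexed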
Once this data is in Kibana, you can build a dashboard that answers the critical question: "Is it the network, or is it the database?"
- High upstream_response_time? Your PHP/MySQL is slow. Optimize your queries or check your code.
- High request_time but low upstream_response_time? The client has a slow connection, or you are sending back a massive payload (check your Gzip settings).
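You can get a first answer without building a single visualization. The sketch below assumes Elasticsearch's default dynamic mapping, which exposes the URI as request_uri.keyword, and asks for the ten URIs with the worst average upstream time:
$ curl -s 'localhost:9200/nginx-apm-*/_search?size=0&pretty' -d '{
  "aggs": {
    "slow_uris": {
      "terms": {
        "field": "request_uri.keyword",
        "size": 10,
        "order": { "avg_upstream": "desc" }
      },
      "aggs": {
        "avg_upstream": { "avg": { "field": "upstream_response_time" } }
      }
    }
  }
}'
The same aggregation, saved as a Kibana visualization, becomes the centerpiece of your dashboard.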
The Hardware Foundation
You can tune Nginx buffers and MySQL innodb_buffer_pool_size all day, but software cannot fix physics. If your underlying storage has high latency, your APM graphs will always look jagged.
Storage Technologies Comparison (2017)
| Technology | Avg Latency | IOPS (Random Read) | Verdict |
|---|---|---|---|
| HDD (SATA) | 10-15 ms | 80 - 120 | Backup only. |
| SSD (SATA) | 0.1 - 0.5 ms | 5,000 - 80,000 | Standard for Web. |
| NVMe (PCIe) | 0.02 ms | 200,000+ | Performance Critical. |
This is why at CoolVDS, we have transitioned our primary clusters to NVMe storage. When your database fits entirely in RAM, life is good. But the moment you hit swap or need to read from disk, NVMe is the difference between a 200ms load time and a 2-second timeout.
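Do not take our word for it, or anyone else's. fio gives you a repeatable latency benchmark; a minimal random-read sketch, assuming fio is installed and you run it from a scratch directory with a gigabyte to spare:
$ fio --name=randread-test --rw=randread --bs=4k --size=1g --ioengine=libaio --direct=1 --runtime=30 --time_based --group_reporting
Look at the clat (completion latency) figures in the output: milliseconds point to spinning disks or an oversold array, while tens of microseconds are what NVMe should deliver.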
Data Sovereignty and The "Datatilsynet" Factor
We are seeing a massive shift in regulatory requirements here in Norway and across Europe. With the GDPR enforcement date looming next year (May 2018), knowing exactly where your data lives is no longer optional—it is a legal necessity.
When you use hyperscale US clouds, you are often routing traffic through Frankfurt or London. For a Norwegian user base, this introduces unnecessary latency (usually 20-35ms round trip). Hosting locally in Oslo (connected via NIX) keeps that latency under 5ms.
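Measuring the difference takes a minute. From a machine on a Norwegian connection, compare the round trip to your current host and to an Oslo-hosted box (the hostname below is a placeholder):
$ ping -c 20 your-app.example.com                          # watch the avg rtt in the summary line
$ mtr --report --report-cycles 20 your-app.example.com     # per-hop latency shows exactly where the detour happens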
Furthermore, running your own APM stack on a Norwegian VPS ensures your user logs (which contain IP addresses—Personal Data under EU law) never leave the EEA. This simplifies your compliance posture significantly compared to sending logs to a US-based SaaS provider.
Summary
Monitoring isn't just about pretty graphs. It is about root cause analysis. To survive high traffic in 2017, you need:
- Visibility: Structured logs (JSON) piped into an analytics engine (ELK).
- Isolation: KVM virtualization to prevent neighbor noise.
- Speed: NVMe storage to eliminate I/O bottlenecks.
Don't let slow I/O kill your SEO rankings or your user experience. If you are tired of fighting with sluggish hardware, deploy a test instance on CoolVDS. Our NVMe instances are provisioned in under 55 seconds.