Stop Guessing: The 2016 Guide to Application Performance Monitoring & Linux Tuning
It’s 3:14 AM. Your phone buzzes. Nagios is screaming that response times on the production cluster just hit 4 seconds. You SSH in, run top, and stare at the numbers. CPU is idle. RAM is free. Yet the application is crawling.
If this scenario makes your stomach turn, you are in the right place. Too many systems administrators in Norway rely on "gut feeling" or restart scripts to solve performance issues. That stops today.
In the DevOps world of late 2016, we have moved beyond simple uptime checks. We need deep Application Performance Monitoring (APM). Whether you are running a Magento shop targeting Oslo consumers or a high-traffic API for the European market, you need visibility. Let's dissect how to extract metrics that actually matter, utilizing the latest stable tools like the ELK Stack 5.0 and Ubuntu 16.04.
1. The First Line of Defense: Nginx Timing
Most default Nginx configurations are useless for debugging. They tell you who visited, but not how long it took. If you are blindly optimizing PHP or Node.js without knowing whether the delay is inside the application or out in the network path to the client, you are wasting time.
Modify your nginx.conf to track request time and upstream response time. This is the single most valuable change you can make today.
http {
    log_format apm '"$time_local" client=$remote_addr '
                   'method=$request_method request="$request" '
                   'request_length=$request_length '
                   'status=$status bytes_sent=$bytes_sent '
                   'body_bytes_sent=$body_bytes_sent '
                   'referer="$http_referer" '
                   'user_agent="$http_user_agent" '
                   'upstream_addr=$upstream_addr '
                   'upstream_status=$upstream_status '
                   'request_time=$request_time '
                   'upstream_response_time=$upstream_response_time '
                   'connect_time=$upstream_connect_time '
                   'header_time=$upstream_header_time';

    access_log /var/log/nginx/access_apm.log apm;
}
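Before reloading, always run a syntax check; one typo here takes down every vhost on the box. On Ubuntu 16.04:

# Validate the config, then reload without dropping connections
nginx -t && systemctl reload nginx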
The Breakdown:
- $request_time: Total time spent processing the request (including writing to the client).
- $upstream_response_time: Time the upstream server (PHP-FPM, Python, etc.) took to generate the response.
If $request_time is high but $upstream_response_time is low, your client has a slow connection (or you have a network bottleneck). If both are high, your code or database is the problem.
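You do not need a full logging stack to act on this. As a rough triage sketch (assuming the apm format above and a single numeric upstream_response_time per line), this one-liner surfaces the ten slowest upstream responses:

# Print upstream time followed by the full log line, slowest first
awk -F'upstream_response_time=' 'NF>1 {split($2, a, " "); print a[1], $0}' \
    /var/log/nginx/access_apm.log | sort -rn | head -10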
2. Database Profiling: The Usual Suspect
90% of the time, the bottleneck is the database. With MySQL 5.7 becoming the standard on Ubuntu 16.04, we have better defaults, but the Slow Query Log is still disabled by default in most package installations.
Turn it on. But be smart about it. Don't just log everything; log queries that don't use indexes.
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 1
log_queries_not_using_indexes = 1
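If a restart is inconvenient, MySQL 5.7 accepts all three settings at runtime as well; the my.cnf entries above are still needed so the change survives the next restart:

mysql -e "SET GLOBAL slow_query_log = 1;"
mysql -e "SET GLOBAL long_query_time = 1;"
mysql -e "SET GLOBAL log_queries_not_using_indexes = 1;"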
Once you have data, don't read the raw text file like a novice. Use mysqldumpslow to aggregate the pain points:
# Show top 5 queries sorted by total time
mysqldumpslow -s t -t 5 /var/log/mysql/mysql-slow.log
Pro Tip: Ensure your innodb_buffer_pool_size is set correctly (typically 70-80% of available RAM on a dedicated DB server). On a shared environment, this is risky. This is why we recommend dedicated CoolVDS instances for databases—so you can allocate RAM without fighting other tenants.
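To make the arithmetic concrete: on a hypothetical dedicated database server with 8 GB of RAM, the 70-80% rule works out to roughly:

[mysqld]
# ~75% of 8 GB, leaving headroom for the OS and per-connection buffers
innodb_buffer_pool_size = 6G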
3. Centralizing with the ELK Stack (Elasticsearch, Logstash, Kibana)
Grepping logs on five different web servers is a nightmare. In 2016, the ELK stack is the gold standard for open-source log aggregation. With the release of Elastic Stack 5.0 just last month, performance has improved significantly.
However, ELK is resource-hungry. Elasticsearch loves IOPS. If you try to run this on a cheap VPS with spinning rust (HDD) or network-throttled storage, your logging infrastructure will crash before your application does.
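One sizing step before you index anything: give Elasticsearch a fixed heap. In 5.0 this moved into the jvm.options file (on Debian/Ubuntu packages, /etc/elasticsearch/jvm.options); the usual guidance is half the machine's RAM, so for a 4 GB node:

# /etc/elasticsearch/jvm.options
-Xms2g
-Xmx2g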
Here is a basic Logstash configuration to parse the Nginx APM format we created earlier:
input {
  file {
    path => "/var/log/nginx/access_apm.log"
    type => "nginx_access"
  }
}

filter {
  grok {
    match => { "message" => "\"%{HTTPDATE:timestamp}\" client=%{IP:client_ip} ... request_time=%{NUMBER:req_time:float} upstream_response_time=%{NUMBER:upstream_time:float}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "nginx-logs-%{+YYYY.MM.dd}"
  }
}
Note: The grok pattern above is abbreviated for readability. You will need to map the full string.
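Before building dashboards, confirm that documents are actually arriving. A quick sanity check against the index pattern used above:

# Count documents across all nginx-logs-* indices
curl -s 'localhost:9200/nginx-logs-*/_count?pretty'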
Once indexed, you can build visualizations in Kibana 5.0 to show "Average Response Time per Endpoint." If /checkout spikes, you know immediately.
4. The Hidden Killer: I/O Wait and CPU Steal
Code optimization is useless if the underlying infrastructure is gasping for air. When hosting in virtualized environments, you must watch for two metrics in top or vmstat:
- wa (I/O Wait): The percentage of time the CPU is idle because it is waiting for disk access. If this is above 10%, your storage is too slow for your database.
- st (Steal Time): The percentage of time your virtual CPU is waiting for the physical hypervisor to give it attention.
High steal time means your hosting provider has oversold the physical server. You are fighting other tenants for CPU cycles.
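Watching both is trivial from the shell; vmstat ships with Ubuntu, and iostat comes from the sysstat package:

# Refresh every 5 seconds; 'wa' and 'st' are the last two CPU columns
vmstat 5
# Extended per-device stats; %util near 100 means the disk is saturated
iostat -x 5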