Stop Guessing: A SysAdmin’s Guide to Application Performance Monitoring (APM) in 2016
It is 3:00 AM. Your pager is screaming. The monitoring dashboard shows a sea of red, and the CEO of your biggest client just emailed asking why the checkout page takes ten seconds to load. You check htop. CPU is at 10%. RAM is fine. So, what is the problem?
If you answer "I don't know," you are already dead. In the systems administration world, ignorance isn't bliss; it's downtime.
The landscape of 2016 is unforgiving. We are moving from monolithic LAMP stacks to fragmented microservices using Docker (now at version 1.10), and while this adds agility, it turns debugging into a forensic nightmare. If you are still relying on tail -f /var/log/syslog and hope, you are doing it wrong.
The Bottleneck Triad: CPU, RAM, and the Silent Killer (I/O)
Most developers blame the code. Most hosting providers blame the traffic. Usually, it is neither. It is Input/Output (I/O). In a virtualized environment, noisy neighbors can steal your disk throughput, causing your database to hang while waiting to write a transaction log. This is why we argue so heavily for KVM over OpenVZ at CoolVDS—you need guaranteed resources, not shared promises.
Diagnosing I/O Wait
Don't just look at load average. Look at %iowait. On Linux, load average also counts processes blocked in uninterruptible I/O, so if your CPU is idle but your load is high, your disk is probably too slow.
Here is the command you need to run immediately:
iostat -x 1
You are looking for the %util and await columns. If %util is near 100% and await is high (over 10-20ms), your storage subsystem is the bottleneck. This is common on legacy VPS providers still running spinning rust (HDDs) or cheap SATA SSDs. This is why we standardized on NVMe storage for all CoolVDS instances in Oslo.
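If you want the %iowait figure without installing sysstat, it can be derived from two readings of the aggregate cpu line in /proc/stat. The following is a minimal sketch, assuming a Linux kernel that exposes the usual column order (user, nice, system, idle, iowait, irq, softirq, steal). The function names are made up for illustration and are not part of any tool mentioned here.

```python
import time


def parse_cpu_line(line):
    """Return the jiffy counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent idle while waiting on I/O."""
    # Columns: user nice system idle iowait irq softirq steal [guest guest_nice]
    # Guest time is already counted inside user, so only the first eight are summed.
    first = parse_cpu_line(before)[:8]
    second = parse_cpu_line(after)[:8]
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Take two readings `interval` seconds apart and return %iowait."""
    with open(path) as handle:
        before = handle.readline()
    time.sleep(interval)
    with open(path) as handle:
        after = handle.readline()
    return iowait_percent(before, after)
```

Calling sample_iowait() on the affected host returns a single percentage for the sampling window. Treat it as a quick cross-check against iostat rather than a replacement, since it says nothing about which device is saturated.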
The 2016 APM Stack: Beyond Nagios
Nagios is great for telling you if a server is up. It is terrible at telling you why a server is slow. For that, you need deep introspection.
1. The Application Level: New Relic vs. Blackfire
If you are running PHP (and with the release of PHP 7.0 in December 2015, you should be planning your upgrade), you need to see function-level execution time. New Relic remains the gold standard here, though it can get expensive.
To install the PHP agent on a CentOS 7 system:
rpm -Uvh http://yum.newrelic.com/pub/newrelic/el5/x86_64/newrelic-repo-5-3.noarch.rpm
yum install newrelic-php5
newrelic-install install
# Add your license key when prompted
Once installed, check your /etc/php.d/newrelic.ini. A common mistake is leaving the transaction tracer threshold too high.
newrelic.transaction_tracer.enabled = true
newrelic.transaction_tracer.threshold = 200ms
newrelic.transaction_tracer.detail = 1
2. Log Aggregation: The ELK Stack
Grepping logs across five different web nodes is impossible. The ELK Stack (Elasticsearch, Logstash, Kibana) has matured significantly this year. With Elasticsearch 2.2 recently released, clustering is more stable.
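You should be shipping your Nginx logs to Logstash. First, define a JSON log format in your nginx.conf so Logstash doesn't have to guess with grok filters. Note that the escape=json parameter requires nginx 1.11.8 or later; on older builds you have to drop it, and a stray quote in a user agent can then break the JSON.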
http {
    log_format json_combined escape=json
        '{ "time_local": "$time_local", '
        '"remote_addr": "$remote_addr", '
        '"remote_user": "$remote_user", '
        '"request": "$request", '
        '"status": "$status", '
        '"body_bytes_sent": "$body_bytes_sent", '
        '"request_time": "$request_time", '
        '"upstream_response_time": "$upstream_response_time", '
        '"http_referrer": "$http_referer", '
        '"http_user_agent": "$http_user_agent" }';

    access_log /var/log/nginx/access.json json_combined;
}
Pro Tip: Pay attention to $upstream_response_time. If $request_time is high but $upstream_response_time is low, the latency is outside your PHP application, most likely in the network between the client and your server. This often means you need a CDN or a better localized host.
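The comparison in the tip above can be automated against the JSON access log. The following is a rough sketch, assuming one JSON object per line with the request_time and upstream_response_time fields from the log format shown earlier. The function names, the one-second slow_after cutoff, and the "at least half of the request time" rule are illustrative assumptions, not nginx or Logstash features.

```python
import json


def upstream_seconds(raw):
    """Sum the upstream times nginx logged; return None if no upstream was used."""
    total, seen = 0.0, False
    # nginx separates the times of multiple upstream attempts with commas and colons.
    for part in raw.replace(":", ",").split(","):
        part = part.strip()
        if part and part != "-":
            total += float(part)
            seen = True
    return total if seen else None


def classify(line, slow_after=1.0):
    """Label one JSON log line as 'ok', 'upstream' or 'outside-upstream'."""
    entry = json.loads(line)
    request_time = float(entry["request_time"])
    if request_time < slow_after:
        return "ok"
    upstream = upstream_seconds(entry.get("upstream_response_time", "-"))
    # Assumed rule of thumb: blame the backend when it accounts for
    # at least half of the total request time.
    if upstream is not None and upstream >= request_time / 2:
        return "upstream"
    return "outside-upstream"
```

Running classify over each line of /var/log/nginx/access.json and counting the labels gives a first split between slow backends and slow delivery. The "outside-upstream" bucket also covers slow clients and static files served by nginx itself, so it is a starting point for investigation, not a verdict.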
The Sovereignty Factor: Why Norway?
We cannot talk about architecture in early 2016 without addressing the elephant in the room: the invalidation of Safe Harbor last October. The legal ground for transferring user data to the US is shaky at best. The "Privacy Shield" is being discussed, but do you really want to bet your compliance strategy on political handshakes?
Hosting locally is no longer just about latency, though ping times of 2ms to the NIX (Norwegian Internet Exchange) are fantastic for user experience. It is about data sovereignty. Keeping your data on Norwegian soil, under the oversight of Datatilsynet (the Norwegian Data Protection Authority), is the safest move for any European business right now.
Optimizing MySQL 5.7 for Performance
MySQL 5.7 is now generally available and it brings massive improvements over 5.6. However, default settings are still conservative. If you have a CoolVDS instance with 16GB RAM, do not leave the defaults in place.
Adjust your my.cnf:
[mysqld]
# 70-80% of available RAM for dedicated DB servers
innodb_buffer_pool_size = 12G
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 2 # Faster, but riskier on crash. Use 1 for strict ACID.
innodb_flush_method = O_DIRECT
query_cache_type = 0 # Disable query cache, it is a bottleneck in high concurrency
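Note the innodb_flush_method = O_DIRECT. This bypasses the OS page cache for InnoDB data files, so the same pages are not buffered twice (once in the buffer pool and once in the kernel). Reads and writes of data pages go straight to the device, which is where NVMe storage shines. On standard SSDs this can still be fast, but on NVMe the latency is far lower.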
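The 70-80% rule from the my.cnf comment is easy to turn into a small helper. The following is a minimal sketch, assuming a dedicated database server on Linux and a default fraction of 75%. Both function names are hypothetical, and the result is rounded down to whole gigabytes to leave headroom.

```python
def buffer_pool_setting(total_ram_bytes, fraction=0.75):
    """Return an innodb_buffer_pool_size line sized to a fraction of RAM, in whole GiB."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    gib = 1024 ** 3
    size_gib = int(total_ram_bytes * fraction) // gib
    if size_gib < 1:
        raise ValueError("too little RAM for a whole-GiB buffer pool")
    return "innodb_buffer_pool_size = %dG" % size_gib


def total_ram_bytes(meminfo_path="/proc/meminfo"):
    """Read MemTotal (reported in kB) from /proc/meminfo on Linux."""
    with open(meminfo_path) as handle:
        for line in handle:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemTotal not found in %s" % meminfo_path)
```

For exactly 16 GiB of RAM, buffer_pool_setting(16 * 1024 ** 3) yields the 12G value used above. On a real instance, MemTotal is usually slightly below the nominal size because the kernel reserves some memory, so buffer_pool_setting(total_ram_bytes()) may come out one gigabyte lower.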
Conclusion
Performance monitoring isn't a product you buy; it's a discipline you practice. It requires visibility into every layer: the network (latency to Oslo), the hardware (I/O wait), the database (buffer pools), and the code (execution traces).
You can spend weeks tuning a Magento config, but if your underlying infrastructure suffers from I/O contention, CPU steal, or high latency, you are wasting your time. You need a foundation that respects your engineering efforts.
Don't let slow I/O kill your SEO rankings or your conversion rates. Deploy a test instance on CoolVDS today. We offer pure KVM virtualization and local NVMe storage in Norway, ensuring your metrics stay green and your data stays compliant.