Stop Guessing: A Sysadmin's Guide to Application Performance Monitoring

It starts with a ticket at 3:14 AM. Subject: "The site is slow."

Not down. Just slow. The worst kind of ticket. If the server were down, I’d know what to do: reboot, check the hardware, or yell at the upstream provider. But "slow" is a ghost. It could be a rogue MySQL query, a memory leak in your Python worker, or—the silent killer—noisy neighbors on your virtual host stealing your CPU cycles.

In the Norwegian market, where users expect latency to be practically non-existent thanks to the NIX (Norwegian Internet Exchange), a 500ms delay is an eternity. If you are still relying on top and a simple Pingdom check, you are flying blind.

Here is how we diagnose application performance in 2014, moving from system metrics to code-level profiling.

1. The System Level: Identifying the Bottleneck

Before you blame the code, check the metal. When a Linux server feels sluggish, your first instinct is to run top. But standard top doesn't tell the whole story, especially in a virtualized environment.

Look at the %st (steal time) value.

Cpu(s): 12.5%us,  4.2%sy,  0.0%ni, 81.0%id,  0.2%wa,  0.0%hi,  0.1%si,  2.0%st

If that st number is consistently above 0%, your hosting provider is overselling their physical cores. The hypervisor is making your VM wait while another customer uses the CPU. This is common with budget providers using OpenVZ. At CoolVDS, we strictly use KVM virtualization with dedicated resource allocation, so your steal time should effectively remain zero. If you see high steal time elsewhere, no amount of code optimization will fix it. Move hosts.
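top is fine for a quick glance, but steal is easier to judge over a window of time. As a minimal check, vmstat (part of procps, installed practically everywhere) prints steal in its rightmost column:

# One-minute sample: CPU stats every 5 seconds, 12 times; 'st' is the last column
vmstat 5 12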

Disk I/O: The Usual Suspect

Web servers are read-heavy. If you are still on spinning rust (HDDs) or cheap shared storage, iowait will destroy your concurrency. Use iostat (part of the sysstat package) to verify.

# Install on Debian/Ubuntu
apt-get install sysstat

# Run extended stats every 1 second
iostat -x 1

Watch the %util column. If it is hitting 100% while your traffic is moderate, your disk cannot keep up. This is why we deploy Enterprise SSDs on all CoolVDS instances. In 2014, running a database on rotational media is professional negligence.
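If %util is pegged, the next question is which process is hammering the disk. The same sysstat package ships pidstat, which breaks I/O down per process (this assumes your kernel has per-task I/O accounting, which any current distro kernel does):

# Per-process read/write throughput, refreshed every second
pidstat -d 1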

2. The Web Server: Nginx Timing

Most sysadmins leave Nginx logging at default. This is a mistake. You need to know exactly how long Nginx takes to process a request and, crucially, how long the upstream (PHP-FPM, Gunicorn, Tomcat) took to reply.

Modify your nginx.conf to include timing variables:

log_format performance '$remote_addr - $remote_user [$time_local] '
                       '"$request" $status $body_bytes_sent '
                       '"$http_referer" "$http_user_agent" '
                       'rt=$request_time urt="$upstream_response_time"';

server {
    access_log /var/log/nginx/access_perf.log performance;
    ...
}
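Validate and reload so the new format takes effect (command names assume the Debian/Ubuntu init scripts of the era):

nginx -t && service nginx reload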

Breakdown:

  • rt=$request_time: Total request time, measured from the moment Nginx read the first bytes from the client until it finished sending the response.
  • urt=$upstream_response_time: Time the backend (e.g., PHP) took to generate the page.

If rt is high but urt is low, the slowness is likely network latency or the client's connection. If urt is high, your application logic is the bottleneck. Parse this log with awk or send it to a centralized syslog server if you are fancy.
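A minimal awk sketch, assuming the 'performance' format above: list the ten slowest requests by upstream time. Field $7 is the request path for a normal "GET /path HTTP/1.1" request line, and urt logs as "-" when no upstream was involved.

awk '{
  rt = ""; urt = "";
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^rt=/)  rt = substr($i, 4);
    if ($i ~ /^urt=/) { urt = $i; gsub(/urt=|"/, "", urt) }
  }
  print urt, rt, $7
}' /var/log/nginx/access_perf.log | sort -rn | head -n 10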

3. The Application: PHP-FPM Slow Logs

If you are running the standard LEMP stack (Linux, Nginx, MySQL, PHP), PHP-FPM ships with a built-in slow-request log that is disabled by default. It dumps a stack trace whenever a script takes too long to execute.

Edit your pool config (usually in /etc/php5/fpm/pool.d/www.conf):

; The timeout for serving a single request after which a PHP backtrace will be dumped to the 'slowlog' file.
request_slowlog_timeout = 5s
 
; The log file for slow requests
slowlog = /var/log/php5-fpm.log.slow

Reload PHP-FPM. Now, whenever a script hangs for more than 5 seconds, you get a trace showing exactly which function is sleeping.
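On a Debian/Ubuntu box, the reload plus a live view of incoming traces looks like this:

# Pick up the new pool config
service php5-fpm reload

# Watch slow-request backtraces as they arrive
tail -f /var/log/php5-fpm.log.slow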

Pro Tip: Don't set the timeout too low in production, or you will fill your disk with logs during a traffic spike. Start with 5 seconds and tune down to 2 seconds as you optimize.

4. External Monitoring: Trust but Verify

Internal tools are great, but they don't tell you if the network path from Bergen to Oslo is congested. You need external verification.

Currently, New Relic is the gold standard for APM. It hooks directly into the PHP/Ruby/Python runtime and gives you a waterfall view of SQL queries. It is expensive, but the free tier usually covers a basic server. Alternatively, for a fully open-source route, setting up Graphite with StatsD is becoming the trend for gathering metrics, though the learning curve is steep compared to Munin or Nagios.
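If you go the Graphite/StatsD route, the wire protocol is pleasantly dumb: plain text over UDP. As a sketch, assuming a StatsD daemon on its default UDP port 8125 and a hypothetical metric name, you can push a timing straight from the shell:

# Record a 250 ms timing for the (hypothetical) app.checkout metric
echo "app.checkout.response_time:250|ms" | nc -u -w1 localhost 8125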

The Infrastructure Factor

You can tune your my.cnf buffers and optimize your Nginx worker processes all day, but software cannot overcome hardware deficiencies. We see this constantly: developers trying to optimize code when the real problem is high-latency storage or noisy neighbors.

This is why architecture matters. At CoolVDS, we don't oversell. We provide:

  • KVM Isolation: No "steal time." Your CPU cores are yours.
  • SSD Storage: Essential for high-transaction databases.
  • Data Sovereignty: Your data stays in Norway, compliant with the Personal Data Act (Personopplysningsloven) and safe from foreign snooping.

Performance monitoring is not about collecting pretty graphs. It is about finding the root cause quickly so you can go back to sleep. Start by enabling the logs mentioned above. Data is your only weapon against the "slow site" ghost.

Need a baseline environment that doesn't fluctuate? Deploy a high-performance SSD VPS on CoolVDS today and see what 0% steal time looks like.