Surviving the Traffic Spike: A Sysadmin’s Guide to Monitoring Infrastructure at Scale

It is 3:14 AM. Your phone buzzes. It’s not a text from a friend; it’s a PagerDuty alert. Your primary database server has locked up, the load average is sitting at 45.00, and your Norwegian e-commerce client is launching their fall sale in four hours.

If your monitoring strategy consists solely of checking if the server responds to ping, you have already failed. In the era of heavy virtualization and rapid deployment, "up" does not mean "functional."

I have spent the last decade debugging servers that were technically "online" but utterly useless due to I/O bottlenecks or noisy neighbors. Today, we are going to cut through the marketing fluff and look at how to monitor infrastructure correctly using tools available right now in late 2014, specifically focusing on the unique constraints of the Nordic market.

The "Steal Time" Trap: Why Your VPS Feels Slow

Most hosting providers oversell their nodes. It is an industry secret that isn't really a secret. If you are running on legacy OpenVZ containers, you are sharing a kernel with twenty other customers. If they decide to compile a kernel or run a backup script, your performance tanks.

The most critical metric you are probably ignoring is CPU Steal Time (st). It measures how much time your virtual CPU spends waiting while the hypervisor serves other guests instead of scheduling you on a physical core. On a high-quality KVM instance—like the ones we architect at CoolVDS—this should be near zero.

Open your terminal and run top. Look at the %Cpu(s) line:

top - 14:23:45 up 12 days,  4:12,  1 user,  load average: 0.89, 0.65, 0.45
Tasks: 123 total,   1 running, 122 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.5 us,  3.2 sy,  0.0 ni, 82.1 id,  0.5 wa,  0.0 hi,  0.1 si,  1.6 st

See that 1.6 st at the end? That means 1.6% of the time, your server wanted to work but the host wouldn't let it. If that number hits 10% or 20%, migrate immediately. You cannot optimize code to fix a noisy neighbor.
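top only gives you a point-in-time snapshot. To watch steal as a trend, vmstat reports it continuously; a quick check, using the stock procps vmstat that ships with CentOS 7:

# Refresh every 5 seconds; the last column (st) is steal time
vmstat 5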

The Stack: Zabbix 2.4 on CentOS 7

While Nagios has served us well for years, its configuration can be a nightmare of text files. With the release of Zabbix 2.4 this month, we have a robust, enterprise-grade solution that handles graphing natively—no need to hack Cacti into it.

Here is a battle-tested configuration for deploying a Zabbix Agent on a high-traffic Nginx web server. We assume you are running the new CentOS 7 (systemd is here to stay, might as well get used to it).
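First, install the agent from the official Zabbix repository. The release RPM below follows the repo.zabbix.com layout for the 2.4 branch; double-check the exact filename against the repository before pasting:

# Add the Zabbix 2.4 repo for RHEL/CentOS 7, then install and enable the agent
rpm -Uvh http://repo.zabbix.com/zabbix/2.4/rhel/7/x86_64/zabbix-release-2.4-1.el7.noarch.rpm
yum install -y zabbix-agent
systemctl enable zabbix-agent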

1. The Agent Configuration

Don't stick with defaults. Increase your timeout to avoid false positives during high load, and enable UnsafeUserParameters if your custom scripts need special characters passed in their arguments.

Edit /etc/zabbix/zabbix_agentd.conf:

# /etc/zabbix/zabbix_agentd.conf

# Your Zabbix server IP (keep comments on their own line)
Server=10.20.30.40
ServerActive=10.20.30.40
Hostname=web-node-oslo-01

# Critical for custom scripts taking > 3 seconds
Timeout=10

# Allow special characters in arguments
UnsafeUserParameters=1

# Buffer size for active checks
BufferSend=10
BufferSize=100
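CentOS 7 enables firewalld by default. Assuming that is what you are running (adjust accordingly if you manage iptables directly), open TCP 10050 so the Zabbix server can query the agent:

firewall-cmd --permanent --add-port=10050/tcp
firewall-cmd --reload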

2. Custom UserParameter for Nginx

Zabbix can't see inside Nginx by default. We need to expose the stub_status module in Nginx and then grep it. First, ensure your nginx.conf has this block inside a server context:

location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}
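Reload Nginx and hit the endpoint locally. The output looks like this (numbers are illustrative); the UserParameters below simply pick fields out of it with awk:

systemctl reload nginx
curl -s http://127.0.0.1/nginx_status

Active connections: 43
server accepts handled requests
 102340 102340 318572
Reading: 0 Writing: 5 Waiting: 38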

Now, add these UserParameters to your Zabbix agent config to expose the connection and request counters:

UserParameter=nginx.active[*],curl -s "http://127.0.0.1/nginx_status" | awk '/Active connections/ {print $3}'
UserParameter=nginx.accepts[*],curl -s "http://127.0.0.1/nginx_status" | awk 'NR==3 {print $1}'
UserParameter=nginx.handled[*],curl -s "http://127.0.0.1/nginx_status" | awk 'NR==3 {print $2}'
UserParameter=nginx.requests[*],curl -s "http://127.0.0.1/nginx_status" | awk 'NR==3 {print $3}'

Restart the agent: systemctl restart zabbix-agent.
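Before creating items in the Zabbix frontend, confirm the keys resolve. From the Zabbix server (the zabbix-get package provides the client; substitute your web node's actual IP):

# Run from the Zabbix server; should echo the current active connection count
zabbix_get -s <web-node-ip> -k nginx.active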

Disk I/O: The Silent Killer of E-Commerce

In a recent project for a client using Magento, the site would freeze randomly. CPU was low. RAM was free. The culprit? Disk I/O wait.

Standard hard drives (HDD) spin. They have physical latency. When MySQL tries to write to the binary log while Nginx is writing access logs and the system is swapping, the disk queue fills up. The CPU sits in iowait, idle but unable to do useful work until the disk catches up.

To diagnose this, use iostat (part of the sysstat package):

yum install sysstat -y
iostat -x 1

Watch the %util column:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     2.00    0.00  145.00     0.00  1250.00     8.62     0.85    5.40    0.00    5.40   0.60  87.20

If %util is consistently over 90%, your disk is the bottleneck. This is why at CoolVDS we have standardized on enterprise-grade SSDs in RAID 10 configurations for all our KVM instances. In 2014, spinning rust has no place in a production web server.
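To find out which process is generating the writes, pidstat (installed as part of the same sysstat package) breaks I/O down per process; a quick sketch:

# Per-process disk I/O, sampled every second
pidstat -d 1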

Pro Tip: Tune your MySQL InnoDB buffer pool. If your dataset fits in RAM, your disk I/O drops significantly. Set innodb_buffer_pool_size to 70% of your total RAM in /etc/my.cnf if it is a dedicated DB server.
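As a minimal sketch, assuming a dedicated database server with 8 GB of RAM (the O_DIRECT flush method is an extra, commonly paired setting, not something the tip above requires):

# /etc/my.cnf (excerpt)
[mysqld]
# Roughly 70% of 8 GB on a dedicated DB server
innodb_buffer_pool_size = 5G
# Write directly to disk instead of double-buffering through the OS page cache
innodb_flush_method = O_DIRECT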

The Norwegian Context: Latency and Law

Hosting physically in Norway isn't just about patriotism; it's about physics and compliance.

1. The Speed of Light

If your target audience is in Oslo, Bergen, or Trondheim, hosting in a datacenter in Texas or even Frankfurt adds unavoidable latency. A ping from Oslo to NIX (the Norwegian Internet Exchange) is 1-2 ms. To Frankfurt, it is 25-35 ms. For a dynamic application executing 50 sequential database queries, that latency compounds: at 30 ms per round trip you are waiting roughly 1.5 seconds on the network alone, versus about 100 ms at 2 ms per round trip.

2. Data Sovereignty (Personopplysningsloven)

Under the Norwegian Personal Data Act (Personopplysningsloven) and EU Directive 95/46/EC, you have strict obligations regarding where personal data is stored. While safe harbor agreements exist, the safest legal stance for Norwegian businesses is to keep data on Norwegian soil. This simplifies compliance audits significantly.

Why Infrastructure Choice Dictates Uptime

You can script Zabbix until your fingers bleed, but software cannot fix bad hardware.

Feature           Generic Shared Hosting    CoolVDS KVM
Virtualization    OpenVZ / Virtuozzo        KVM (Kernel-based Virtual Machine)
Resources         Burst (Oversold)          Dedicated RAM & CPU
Storage           SATA HDD                  SSD RAID 10
Kernel Access     Shared                    Custom Kernel Allowed

When we built the CoolVDS platform, we chose KVM because it provides true hardware virtualization. If another customer on the node crashes their kernel, your instance keeps running. Combined with our low-latency network connected directly to NIX in Oslo, it provides the stable foundation required for the monitoring strategies discussed above.

Final Thoughts

Monitoring is not a "set it and forget it" task. It is an active discipline. By moving to Zabbix 2.4, monitoring the right metrics (Steal Time, I/O Wait), and hosting on infrastructure that respects resource isolation, you can sleep through the night—even during the biggest sale of the year.

Ready to upgrade your infrastructure? Don't let slow I/O kill your SEO. Deploy a test KVM instance on CoolVDS in 55 seconds and see the difference real hardware isolation makes.