
Stop Trusting JavaScript: Hardcore Log Analysis with AWStats on CentOS 5

The Lie of Client-Side Analytics

If you are relying solely on Google Analytics to tell you what is happening on your server, you are flying blind. I said it. JavaScript-based tracking is a marketer's toy. It fails when users disable JS, it fails when corporate firewalls block tracking pixels, and most importantly, it fails to show you the real load on your infrastructure.

I recently audited a media site hosting heavy image galleries. Their analytics dashboard showed a steady stream of visitors. Their load average, however, was spiking to 15.00 on a dual-core box. Why? Hotlinking. Other sites were embedding their images directly, bypassing the HTML pages entirely. The tracking script never fired. The server melted. Logs don't lie.
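Here is a rough sketch of how you could spot that kind of hotlinking straight from the log, before AWStats is even installed. It assumes the log is already in combined format (covered in Step 2), and the function name top_image_referers is made up for this example. It counts image requests per external Referer host:

```shell
#!/bin/sh
# Hypothetical helper: list the external hosts that refer the most
# image requests. Assumes Apache combined log format.
# Usage: top_image_referers /var/log/httpd/access_log yourdomain.com
top_image_referers() {
    log=$1
    site=$2
    awk -F'"' -v site="$site" '
        # Split on double quotes: $2 is the request line, $4 the Referer.
        tolower($2) ~ /\.(jpe?g|png|gif)([? ]|$)/ && $4 != "-" && $4 != "" {
            split($4, parts, "/")        # http://host/path -> parts[3] is the host
            host = parts[3]
            # Crude substring match to skip our own site and its subdomains.
            if (host != "" && index(host, site) == 0) hits[host]++
        }
        END { for (h in hits) print hits[h], h }
    ' "$log" | sort -rn | head -n 10
}
```

A host that racks up thousands of image hits without ever requesting a page is your prime suspect.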

Today, we implement AWStats (Advanced Web Statistics) on CentOS 5. We are going to parse raw Apache logs to see exactly who is touching your metal, from the Googlebot to the script-kiddie probing for SQL injections.

Prerequisites and The Hardware Reality

Log parsing is I/O intensive. If you are trying to parse a 2GB access_log on a cheap shared hosting account, the process will likely be killed by the host's resource monitor before it finishes. You need dedicated resources.

Pro Tip: This is why we provision CoolVDS instances with dedicated RAM and high-speed RAID-10 SAS storage. When AWStats crunches a month of data, you need disk throughput that doesn't choke. Shared spindles are a bottleneck you can't afford in 2010.

Step 1: Installation on RHEL/CentOS 5

We assume you have the EPEL repository enabled. If not, get it. Then, hit the terminal:

yum install awstats

This installs the Perl scripts and the cron jobs. But the default config is useless. We need to map it to your Apache environment.

Step 2: Configuring Apache for Analysis

AWStats craves the combined log format. Standard Common Log Format (CLF) isn't enough; it misses the User-Agent and Referer, which are critical for identifying hotlinkers. Check your /etc/httpd/conf/httpd.conf:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog logs/access_log combined

If you change this, restart Apache with service httpd restart. Don't forget.
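To check that the log really is in combined format, a quick sanity test along these lines can help. This is only a sketch and count_non_combined is a hypothetical name. It relies on the fact that a combined line carries three quoted strings (request, Referer, User-Agent), while a CLF line carries only one:

```shell
#!/bin/sh
# Hypothetical helper: count log lines that do NOT look like combined format.
# Usage: count_non_combined /var/log/httpd/access_log
count_non_combined() {
    # Splitting on double quotes, three quoted strings yield at least
    # 7 fields; a Common Log Format line yields only 3.
    awk -F'"' 'NF < 7 { bad++ } END { print bad + 0 }' "$1"
}
```

A result of 0 means every line has the Referer and User-Agent fields. Old CLF lines written before the format change will show up in the count.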

Step 3: Tuning awstats.model.conf

Copy the model config to your domain config:

cp /etc/awstats/awstats.model.conf /etc/awstats/awstats.yourdomain.com.conf

Now, edit it. There are three critical variables you must get right to avoid garbage data:

  • LogFile: Point this to your actual log path. Usually /var/log/httpd/access_log.
  • LogFormat: Set this to 1 (Apache Combined).
  • SiteDomain: yourdomain.com
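If you would rather script those three settings than edit by hand, something like the following could work. It is a sketch that assumes each directive sits uncommented at the start of a line, as in the stock model file, and that GNU sed is available. set_awstats_directive is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: overwrite one "Key=value" directive in an AWStats
# config file. Commented-out lines (#Key=...) are left untouched.
set_awstats_directive() {
    conf=$1
    key=$2
    value=$3
    sed -i "s|^${key}=.*|${key}=${value}|" "$conf"
}

# Example (run as root, after the cp above):
#   conf=/etc/awstats/awstats.yourdomain.com.conf
#   set_awstats_directive "$conf" LogFile    '"/var/log/httpd/access_log"'
#   set_awstats_directive "$conf" LogFormat  1
#   set_awstats_directive "$conf" SiteDomain '"yourdomain.com"'
```

Check the result with grep -E '^(LogFile|LogFormat|SiteDomain)=' on the config file before the first update run.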

The Privacy Elephant in the Room: Datatilsynet

Here in Norway, the Data Inspectorate (Datatilsynet) is strict about IP addresses. They consider IPs as personal data under the Personal Data Act (Personopplysningsloven). Unlike US hosting where anything goes, operating a VDS in Oslo means you have a responsibility.

If you are storing raw logs for years, you are a liability. Configure log rotation. On CentOS 5, /etc/logrotate.conf sets the global defaults and /etc/logrotate.d/httpd covers the Apache logs; make sure you are rotating weekly and keeping only what is necessary for technical maintenance. AWStats aggregates data, so you don't necessarily need the raw logs forever once they are parsed.
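A quick way to audit what is actually lying around is to list the access logs older than a chosen cutoff. The sketch below uses a hypothetical helper, stale_logs, and the 30 days in the usage line is purely illustrative, not a legal threshold:

```shell
#!/bin/sh
# Hypothetical helper: print access logs last modified more than N days ago.
# Usage: stale_logs /var/log/httpd 30
stale_logs() {
    dir=$1
    days=$2
    find "$dir" -type f -name 'access_log*' -mtime +"$days"
}
```

If this prints anything you cannot justify keeping, tighten the rotate count in your logrotate settings.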

Automating the Crunch

You do not want to run this manually. It takes time. Add a cron job to update your stats every hour. This keeps the load distributed rather than spiking your CPU at midnight.

0 * * * * /usr/bin/perl /usr/share/awstats/wwwroot/cgi-bin/awstats.pl -config=yourdomain.com -update > /dev/null
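One thing to watch with an hourly schedule: if a run ever takes longer than an hour, the next one starts on top of it. A minimal guard, sketched here with a hypothetical run_exclusive wrapper and a lock directory of your choosing, uses the fact that mkdir is atomic:

```shell
#!/bin/sh
# Hypothetical wrapper: run a command only if no previous run holds the lock.
# Usage: run_exclusive /var/run/awstats-update.lock /usr/bin/perl \
#            /usr/share/awstats/wwwroot/cgi-bin/awstats.pl \
#            -config=yourdomain.com -update
run_exclusive() {
    lockdir=$1
    shift
    # Only one caller can create the directory; everyone else fails here.
    if mkdir "$lockdir" 2>/dev/null; then
        "$@"
        status=$?
        rmdir "$lockdir"
        return $status
    fi
    echo "previous run still active, skipping" >&2
    return 1
}
```

Note that a run killed mid-parse leaves the lock directory behind, so remove it by hand if updates stop appearing.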

Performance: The VDS Advantage

Running log analysis on a live web server is a trade-off. It consumes CPU cycles that could be serving HTTP requests. This is where the architecture matters.
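Whatever you run on, you can soften the hit by lowering the parser's CPU priority. A minimal sketch, where run_low_priority is a hypothetical wrapper around the standard nice command:

```shell
#!/bin/sh
# Hypothetical wrapper: run a command at the lowest CPU scheduling priority
# so Apache workers win whenever both want the CPU.
# Usage: run_low_priority /usr/bin/perl \
#            /usr/share/awstats/wwwroot/cgi-bin/awstats.pl \
#            -config=yourdomain.com -update
run_low_priority() {
    nice -n 19 "$@"
}
```

This only affects CPU scheduling. Disk I/O from the parse still competes with everything else on the box.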

Hosting Type         AWStats Impact            Risk
Shared Hosting       Process limited/Killed    High (Incomplete stats)
CoolVDS (Xen/KVM)    Dedicated CPU Core        None (It's your metal)

With a CoolVDS Virtual Dedicated Server, you have the isolation of Xen. Even if the log parse spikes CPU usage to 100% on one core, your other cores keep serving traffic. That is the stability serious businesses pay for.

Secure the Interface

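Unless you restrict it, the AWStats interface may be reachable by anyone who knows the URL. Do not let competitors see your traffic sources. Lock it down in your Apache config with HTTP Basic authentication, backed by a password file created with htpasswd: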

<Directory "/usr/share/awstats/wwwroot">
    AuthType Basic
    AuthName "Restricted Access"
    AuthUserFile /etc/awstats/htpasswd.users
    Require valid-user
</Directory>
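The AuthUserFile has to exist before Apache can check anything against it. The sketch below shows one way to populate it. add_stats_user is a hypothetical helper; the -c and -b flags are standard htpasswd options. Keep in mind that -c overwrites an existing file, so the helper only passes it for the first user:

```shell
#!/bin/sh
# Hypothetical helper: add a user to the AWStats password file.
# Usage: add_stats_user /etc/awstats/htpasswd.users admin
#        (prompts for the password)
# With a third argument the password is taken from the command line (-b),
# which is visible in the process list; use that form only for scripted
# provisioning.
add_stats_user() {
    file=$1
    user=$2
    create_flag=""
    # -c creates the file and would wipe existing users, so only use it
    # when the file does not exist yet.
    [ -f "$file" ] || create_flag="-c"
    if [ $# -ge 3 ]; then
        htpasswd -b $create_flag "$file" "$user" "$3"
    else
        htpasswd $create_flag "$file" "$user"
    fi
}
```

Reload Apache afterwards and confirm that an unauthenticated request to the stats URL is refused.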

Security isn't optional. Neither is performance. If you are tired of "noisy neighbors" on oversold shared hosting slowing down your reporting, it's time to move. Get a VDS with dedicated resources in our Oslo datacenter. Low latency, high compliance, zero nonsense.

Ready to see what's actually hitting your server? Deploy a CentOS 5 VDS on CoolVDS today.