Surviving the Kernel Panic: A Manual for Disaster Recovery in 2013

When `uptime` Becomes a Liability: Architecting for Failure

It is not a matter of if your primary node will fail. It is a matter of when. I learned this the hard way two years ago, watching a degraded RAID-5 array on a legacy dedicated server struggle to rebuild for 19 hours. The load average hit 40.0, the IOwait was practically 100%, and the client was calling me every ten minutes. We lost data that day. Not because we didn't have backups—we had plenty of tarballs—but because we didn't have a recovery plan.

In the systems administration world, backups are just files. Disaster Recovery (DR) is a strategy. If you are running mission-critical applications in Norway, relying on a single point of failure is professional negligence.

The Norwegian Context: Latency and Law

Before we touch `vim`, we need to address the physical and legal reality. Hosting your failover node in Ashburn, Virginia might save you a few kroner on hosting fees, but it introduces two critical problems:

  1. Latency: Round-trip time (RTT) from Oslo to the US East Coast averages 90-110ms. For a static site, maybe that's fine. For a database application doing synchronous replication, where every commit waits on that round trip? It's a performance death sentence.
  2. Data Sovereignty: Under the Personopplysningsloven (Personal Data Act of 2000), you are responsible for where your user data lives. While Safe Harbor is currently in effect, keeping data within the EEA (or ideally, on Norwegian soil) simplifies compliance with Datatilsynet immensely.
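
To put a number on the latency point for your own link, a rough sketch like the one below may help. It assumes the summary line printed by the common Linux `ping` (`rtt min/avg/max/mdev = ...`); the function name `avg_rtt_ms` and the address 10.0.0.20 are illustrative, not part of any tool.

```shell
#!/bin/bash
# Sketch: extract the average RTT (ms) from a ping summary line.
# Assumes a summary such as:
#   rtt min/avg/max/mdev = 1.204/1.377/1.602/0.141 ms
# Splitting on "/" puts the average in field 5.
avg_rtt_ms() {
    awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Usage (10.0.0.20 stands in for the DR node):
#   ping -c 10 -q 10.0.0.20 | avg_rtt_ms
```

Run it from the primary towards each candidate DR location before you commit to one.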

For this guide, we assume you are running a primary node in Oslo and a secondary failover node, perhaps in a geographically separated datacenter. This is where CoolVDS shines—their KVM-based infrastructure in Norway provides the low-latency link (typically <5ms within national routing) required for near-real-time replication.

The Stack: Simplicity Survives

Forget complex, bleeding-edge orchestration tools that just hit version 0.1. We are using tools that have survived the test of time: CentOS 6.4, MySQL 5.5, and Rsync.

Step 1: Database Replication (Master-Slave)

We need the slave to be a near-exact copy of the master. First, configure the Master server. Open /etc/my.cnf and ensure you are listening on the private interface, not just localhost.

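[mysqld]
server-id = 1
# Listen on the private interface (the Master's address in this guide)
bind-address = 10.0.0.10
log-bin = /var/lib/mysql/mysql-bin
# With statement-based logging this filters on the default (USE) database
binlog-do-db = production_db
# Safety first: Ensure ACID compliance
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1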

On the Slave server (your CoolVDS instance), the config looks similar, but with a different ID:

[mysqld]
server-id = 2
relay-log = /var/lib/mysql/mysql-relay-bin
read_only = 1

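Pro Tip: Setting `read_only = 1` on the slave is crucial. It stops ordinary accounts (or a rogue script) from accidentally writing to the backup, which would make the slave diverge from the master and can halt replication at the next conflicting statement. Accounts with the SUPER privilege bypass `read_only`, so don't run your application as one.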

Next, we create the replication user on the Master:

CREATE USER 'repl_user'@'10.0.0.%' IDENTIFIED BY 'StrongPassword123!';
GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'10.0.0.%';
FLUSH PRIVILEGES;
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;

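Take note of the File and Position. Do not close this terminal session yet, or the lock will release. Open a new terminal to dump the data using `mysqldump` or, if you have a large dataset, use Percona's `xtrabackup` for a hot backup. When the dump finishes, run `UNLOCK TABLES;` in the first session so the Master accepts writes again. Once the data is imported to the slave, point it at the Master using the File and Position you recorded (the values below are examples):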

CHANGE MASTER TO
MASTER_HOST='10.0.0.10',
MASTER_USER='repl_user',
MASTER_PASSWORD='StrongPassword123!',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=107;
START SLAVE;
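
Once the slave is started, it is worth confirming that both replication threads are running and that lag stays bounded. The following is a minimal sketch that reads the output of `SHOW SLAVE STATUS\G` on stdin; the function name `check_slave_status` and the 60-second default threshold are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: read `SHOW SLAVE STATUS\G` output on stdin and report health.
# Exit status: 0 = healthy, 1 = lagging or lag unknown, 2 = a thread stopped.
# The first argument is an assumed lag threshold in seconds (default 60).
check_slave_status() {
    local max_lag="${1:-60}"
    awk -v max_lag="$max_lag" '
        $1 == "Slave_IO_Running:"      { io  = $2 }
        $1 == "Slave_SQL_Running:"     { sql = $2 }
        $1 == "Seconds_Behind_Master:" { lag = $2 }
        END {
            if (io != "Yes" || sql != "Yes") {
                print "CRITICAL: replication thread stopped"
                exit 2
            }
            if (lag == "NULL" || lag + 0 > max_lag + 0) {
                print "WARNING: slave is " lag "s behind"
                exit 1
            }
            print "OK: slave is " lag "s behind"
        }'
}

# Usage on the slave:
#   mysql -e 'SHOW SLAVE STATUS\G' | check_slave_status 60
```

Hooking this into whatever monitoring you already run means a broken slave gets noticed before the day you need it.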

Step 2: Filesystem Synchronization

Database replication handles the structured data, but what about user uploads or configuration files? `rsync` is the standard here. We don't want to run this every minute (too much overhead), but every 5 minutes is usually acceptable for DR.

Create a script at /root/sync_dr.sh:

#!/bin/bash
# Sync Web Root to DR Node
# Exclude logs and cache to save bandwidth

rsync -avz -e "ssh -p 22" \
--exclude 'cache/' \
--exclude 'logs/' \
/var/www/vhosts/mysite.no/httpdocs/ \
root@10.0.0.20:/var/www/vhosts/mysite.no/httpdocs/

Warning: Ensure you are using SSH keys for password-less authentication. If your private key has a passphrase, this cron job will fail silently.
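
One way to keep the job from failing silently is to wrap the command so every run leaves a trace. The sketch below is an illustration under assumptions: the helper name `run_logged`, the log path, and the lock file path are all hypothetical. The `flock -n` call and ssh's `BatchMode` option are standard, but check that both are available on your system.

```shell
#!/bin/bash
# Sketch: run a command and record its exit status so failures leave a trace.
# LOG_FILE is an assumed location; override it in the environment.
LOG_FILE="${LOG_FILE:-/var/log/sync_dr.log}"

run_logged() {
    local rc=0
    # Capture the command's output and exit status in the log
    "$@" >>"$LOG_FILE" 2>&1 || rc=$?
    echo "$(date '+%F %T') exit=$rc cmd=$*" >>"$LOG_FILE"
    return "$rc"
}

# Example cron entry (every 5 minutes). flock -n skips a run if the
# previous one still holds the lock:
#   */5 * * * * flock -n /var/lock/sync_dr.lock /root/sync_dr.sh
#
# Inside sync_dr.sh, BatchMode makes ssh fail instead of prompting:
#   run_logged rsync -avz -e "ssh -p 22 -o BatchMode=yes" SRC DEST
```

A non-zero `exit=` entry in the log is then something your monitoring can alert on.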

The Hardware Factor: Why Virtualization Matters

In 2013, we are seeing a lot of budget providers overselling OpenVZ containers. For a test or dev box, OpenVZ is fine. For Disaster Recovery? It's a gamble.

OpenVZ shares the host kernel. If the host kernel panics, your "isolated" container dies with it. Furthermore, you cannot load your own kernel modules (like DRBD for block-level replication) inside a standard OpenVZ container.

This is why we architect DR plans on KVM (Kernel-based Virtual Machine), which is the standard at CoolVDS. KVM runs your VPS as a fully virtualized machine. You get your own kernel, your own swap partition, and true isolation. If a neighbor on the host node fork-bombs their server, they exhaust their own guest's process table, not yours, and your KVM instance keeps humming along.
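
If you are not sure what an existing VPS runs on, a quick heuristic can tell you. The sketch below assumes the commonly used check that an OpenVZ container exposes /proc/vz but not /proc/bc, while the host node exposes both. The function name `detect_openvz` and its optional root-prefix argument are hypothetical and exist only to make the check testable.

```shell
#!/bin/bash
# Sketch: guess whether this system is an OpenVZ container.
# Heuristic: a container has /proc/vz but not /proc/bc; the OpenVZ
# host node has both. Anything else is reported as not OpenVZ.
# The optional argument is a root prefix, useful only for testing.
detect_openvz() {
    local root="${1:-}"
    if [ -d "$root/proc/vz" ] && [ ! -d "$root/proc/bc" ]; then
        echo "openvz-container"
    elif [ -d "$root/proc/vz" ]; then
        echo "openvz-host"
    else
        echo "not-openvz"
    fi
}

# Usage:
#   detect_openvz
```

Treat the result as a hint rather than proof; ask your provider if it matters.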

Storage I/O Bottlenecks

When you are restoring a database from a dump file, your biggest enemy is Disk I/O. On a traditional 7.2k RPM SATA drive, restoring a 10GB MySQL dump can take 45 minutes. On the enterprise-grade SSD RAID-10 arrays used by CoolVDS, I've seen that same restore happen in under 4 minutes. Speed isn't just a luxury; during a disaster, it's the difference between a minor outage and a business-ending event.
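
To make those two figures comparable, it helps to turn them into an effective restore rate. This is plain arithmetic on the numbers quoted above, taking 1 GB as 1024 MB; the function name `restore_mb_per_s` is illustrative.

```shell
#!/bin/bash
# Sketch: effective restore throughput in MB/s for a dump of a given
# size (GB) restored in a given time (minutes). 1 GB = 1024 MB here.
restore_mb_per_s() {
    awk -v gb="$1" -v min="$2" \
        'BEGIN { printf "%.1f\n", (gb * 1024) / (min * 60) }'
}

restore_mb_per_s 10 45   # SATA figure from the text: prints 3.8
restore_mb_per_s 10 4    # SSD figure from the text: prints 42.7
```

Both rates sit well below what either disk type can stream sequentially, which suggests the restore is dominated by small random writes rather than raw bandwidth.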

The "Switch"

Automated failover is dangerous. Without proper fencing it can lead to "split-brain" scenarios where both servers think they are the master, which corrupts data. In 2013, unless you have a fencing device (like a networked PDU to kill power to the master), manual failover is safer.

The Emergency Procedure:

  1. Update DNS TTL to 300 seconds (do this now, not during the crash).
  2. Stop the Master (if it's not already dead).
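  3. On Slave: STOP SLAVE; SET GLOBAL read_only = 0; RESET MASTER; (the application cannot write while read_only is on, and RESET MASTER expects binary logging to be enabled on the Slave).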
  4. On Slave: Update application config to point to localhost DB.
  5. Update DNS A record to point to the Slave IP.

Once you switch, your Slave is the new Master. When the old Master comes back online, it must be wiped and rebuilt as a Slave.
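
During an outage, nobody types SQL accurately. The sketch below keeps step 3 of the procedure in a reviewed script that only prints the statements. The function name `promotion_sql` is hypothetical. `SET GLOBAL read_only = 0` is included because the slave was configured with `read_only = 1` earlier. `RESET MASTER` is placed last on the assumption that binary logging may not be enabled on the slave, in which case that statement is expected to fail.

```shell
#!/bin/bash
# Sketch: print the SQL that promotes the slave to a standalone master.
# Nothing is executed here; pipe the output to mysql once reviewed.
promotion_sql() {
    cat <<'SQL'
STOP SLAVE;
SET GLOBAL read_only = 0;
RESET MASTER;
SQL
}

# Usage on the slave, after confirming the old master is down:
#   promotion_sql | mysql
```

Remember to remove `read_only = 1` from the slave's /etc/my.cnf as well, or the setting returns at the next restart.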

Final Thoughts

A backup is a copy of your data. A Disaster Recovery plan is a copy of your infrastructure. By leveraging standard tools like MySQL replication and `rsync` on robust KVM platforms like CoolVDS, you ensure that a hardware failure is an inconvenience, not a resume-generating event.

Don't wait for the RAID card to smoke. Provision a secondary KVM instance today, set up the replication, and sleep better knowing you have a parachute.