
Surviving the Meltdown: A Battle-Hardened Sysadmin’s Guide to Disaster Recovery in Norway (2014 Edition)

It’s 03:14 AM on a Tuesday. The coffee is cold, the office is dark, and your Nagios dashboard is a sea of red. Your primary database server just stopped responding to ping. No SSH, no console output. Just silence.

If you don't have a plan right now, you aren't a sysadmin; you're a spectator to your own funeral. In the world of systems administration, particularly here in the Nordic region where we pride ourselves on reliability, downtime isn't just an inconvenience—it's a breach of trust.

I’ve seen seasoned engineers weep over corrupted InnoDB tablespaces because they trusted a RAID-5 array as a "backup strategy." Let’s be clear: RAID is redundancy, not backup. If you `rm -rf /` on a RAID-10 array, the controller will faithfully mirror that deletion across all disks faster than you can blink.

Today, we are going to talk about real Disaster Recovery (DR) planning. Not the paper-pushing kind you show to auditors, but the tactical, command-line reality of keeping your infrastructure alive in 2014.

The Norwegian Context: Latency and Law

Before we touch the config files, we need to address the geography. Many dev teams tempted by cheap hosting across the Atlantic forget two things: the speed of light and the Datatilsynet (Norwegian Data Protection Authority).

1. Latency Kills Conversion: A packet round-trip from Oslo to a data center in Texas takes about 130-150ms. From Oslo to a VPS in Norway (like CoolVDS infrastructure), it’s often under 5ms via NIX (Norwegian Internet Exchange). For a high-transaction e-commerce site running Magento, those milliseconds stack up on every database query. (You can verify the numbers yourself with the quick check after this list.)

2. Data Sovereignty: Under the current implementation of the Personal Data Act (Personopplysningsloven), keeping sensitive Norwegian customer data within the EEA is crucial. While Safe Harbor exists, local storage is the only way to sleep soundly knowing you aren't subject to foreign subpoenas.
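
You don't have to take my word for the latency numbers. A quick check from your office or an existing server; the hostname below is a placeholder for whatever endpoint you are evaluating:

# Average round-trip time over 10 packets
ping -c 10 host-under-evaluation.example.no

# Per-hop latency and packet loss (yum install mtr)
mtr --report --report-cycles 10 host-under-evaluation.example.no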

The Tech Stack: Stability Over Hype

We aren't chasing the latest beta software here. For a bulletproof DR plan in 2014, we stick to the rock-solid triumvirate: CentOS 6.5, Nginx 1.4, and Percona Server (or MySQL 5.6).

1. The Foundation: KVM over OpenVZ

At CoolVDS, we exclusively use KVM (Kernel-based Virtual Machine) for our serious tiers. Why? Because OpenVZ containers share a kernel. If a "noisy neighbor" triggers a kernel panic, your node goes down too. With KVM, you have hardware-level virtualization. You can run your own kernel. You can install kernel modules for DRBD (Distributed Replicated Block Device).

Pro Tip: Always check your virtualization type. If `uname -a` shows a kernel version that looks suspiciously old or modified (like `2.6.32-042stab...`), you are likely in a container. For DR, insist on KVM.
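
If you want more certainty than a kernel string, a couple of quick checks usually settle it. A rough sketch, assuming the `virt-what` package is installed (it is in the standard CentOS repositories) and you are running as root:

# OpenVZ containers expose this file; KVM guests do not
ls /proc/user_beancounters 2>/dev/null && echo "OpenVZ container"

# Prints the detected hypervisor, e.g. "kvm"
virt-what

# A KVM guest with paravirtualized drivers will also show virtio devices
lspci | grep -i virtio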

Strategy 1: The "Poor Man's" Hot Spare with Rsync

For static content (uploaded images, config files), `rsync` is still king. Don't overcomplicate this with heavy enterprise tools if you just need to mirror `/var/www/html` to a secondary VPS in a different datacenter.

Here is a robust script pattern I use for hourly syncs. It uses SSH keys for authentication and limits bandwidth to avoid choking the production interface.

#!/bin/bash
# /usr/local/bin/sync_failover.sh

SOURCE_DIR="/var/www/vhosts/"
REMOTE_HOST="backup-user@192.168.10.55"
REMOTE_DIR="/var/www/vhosts/"
LOG_FILE="/var/log/sync_failover.log"

# -a: archive mode (preserves permissions, times, symlinks)
# -v: verbose
# -z: compress during transfer
# --delete: remove files on destination that are gone on source
# --bwlimit: limit to 5000 KB/s to save bandwidth for web traffic

echo "Starting sync at $(date)" >> "$LOG_FILE"

rsync -avz --delete --bwlimit=5000 -e "ssh -p 22" \
    "$SOURCE_DIR" "$REMOTE_HOST:$REMOTE_DIR" >> "$LOG_FILE" 2>&1

if [ $? -eq 0 ]; then
    echo "Sync successful at $(date)" >> "$LOG_FILE"
else
    echo "CRITICAL: Sync failed at $(date)" >> "$LOG_FILE"
    echo "CRITICAL: Sync failed at $(date)" | mail -s "Backup Fail" admin@example.no
fi

Configure this in your crontab to run every hour. In the event of a total catastrophic failure of your main server, your static assets are waiting for you on the failover node. You just need to switch the DNS A-record, so keep the TTL on that record low (300 seconds or so) in advance, or the "switch" will take hours to propagate.
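
The cron entry itself is a one-liner. A minimal example using the /etc/cron.d format (which requires the user field), firing at 15 minutes past every hour:

# /etc/cron.d/sync_failover
15 * * * * root /usr/local/bin/sync_failover.sh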

Strategy 2: MySQL Master-Slave Replication

Files are easy. Databases are hard. You cannot `rsync` a running MySQL data directory; you will get corrupted MyISAM or InnoDB tables. You need replication.

In a disaster scenario, we want a "Hot Standby." This is a secondary server that replicates data from the master in near real-time. If the Master dies, you promote the Slave.

Configuration for the Master (my.cnf)

Add these lines to `/etc/my.cnf` on your primary server. We use `binlog_format=MIXED` for the best balance of safety and performance in MySQL 5.6.

[mysqld]
server-id = 1
log-bin = /var/lib/mysql/mysql-bin
binlog_format = MIXED
expire_logs_days = 7
max_binlog_size = 100M

# Safety net: ensure InnoDB creates a file per table
innodb_file_per_table = 1

# Bind to private IP for security!
bind-address = 10.0.0.1
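
One step the config file doesn't cover: the slave needs an account to replicate with, and you need the master's current binary log coordinates. A rough sketch to run on the master after restarting MySQL with the settings above; the `repl` user, password, and 10.0.0.% subnet are placeholders for your own private network:

CREATE USER 'repl'@'10.0.0.%' IDENTIFIED BY 'use-a-strong-password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.%';
FLUSH PRIVILEGES;

-- Note the File and Position values; the slave starts replicating from here
SHOW MASTER STATUS;

If the master already holds data, seed the slave from a consistent dump first (mysqldump --single-transaction --master-data), which records those coordinates for you.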

Configuration for the Slave

On the secondary VPS (the failover), your config looks similar but with a different `server-id`.

[mysqld]
server-id = 2
relay-log = /var/lib/mysql/mysql-relay-bin
read_only = 1  # Crucial: prevents accidental writes to the slave
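
With both servers restarted, point the slave at the master from the MySQL prompt on the slave. A sketch, assuming the replication user from the master section; the log file name and position are placeholders for whatever `SHOW MASTER STATUS` (or your dump's --master-data header) reported:

CHANGE MASTER TO
    MASTER_HOST='10.0.0.1',
    MASTER_USER='repl',
    MASTER_PASSWORD='use-a-strong-password',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=120;

START SLAVE;

-- Slave_IO_Running and Slave_SQL_Running should both report Yes
SHOW SLAVE STATUS\G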

Once replication is running and the slave has caught up, you have a near-real-time copy of your data without buying a single expensive enterprise license. If the master server melts down, you run this on the slave:

-- First confirm the slave has applied everything it received:
SHOW SLAVE STATUS\G

STOP SLAVE;
RESET SLAVE ALL;  -- drop the old replication config (MySQL 5.6.3+)
-- Remove read_only in my.cnf or via runtime
SET GLOBAL read_only = 0;

Suddenly, your slave is the master. Your app comes back online.

The Hardware Reality: Why SSD Matters

In 2014, disk I/O is still the single biggest bottleneck for database recovery. When you are restoring a 50GB dump file or catching up on replication logs, mechanical spinning rust (HDDs) will choke.

Standard SATA drives push maybe 100-150 IOPS. Under load, your actual recovery time balloons from minutes to hours, blowing straight past any sane recovery time objective (RTO).

This is why CoolVDS invests heavily in Enterprise SSD storage. We are talking about 10,000+ IOPS compared to the paltry 100 of a standard drive. When you are replaying binary logs to get your store back online, that speed difference isn't just a metric; it's the difference between a minor outage and a business-ending event. While some providers are still selling you "Cached RAID-10 SAS," we provide pure flash storage on our performance tiers.
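
Don't take IOPS claims on faith either, ours included. fio (available from EPEL on CentOS 6) gives you a hard number. A rough sketch of a 4K random-read test; adjust the size and runtime to your environment, and don't aim it at a busy production volume:

# Watch the iops figure in the output
fio --name=randread --filename=/var/tmp/fio.test --size=1G \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=32 --runtime=60 --time_based --group_reporting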

Network-Level High Availability

If you want to get fancy (and we always do), you can automate the IP failover. On Linux, we use keepalived, which implements VRRP (Virtual Router Redundancy Protocol).

Here is a basic `/etc/keepalived/keepalived.conf` snippet for a failover IP setup:

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cr3t
    }
    virtual_ipaddress {
        192.168.1.100
    }
}
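
The standby node runs a near-identical config; only `state` and `priority` change. A sketch of the backup side (both nodes must agree on `virtual_router_id`, `auth_pass`, and the virtual IP):

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cr3t
    }
    virtual_ipaddress {
        192.168.1.100
    }
}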

If the Master node stops broadcasting, the Backup node automatically claims the virtual IP `192.168.1.100`. Your users won't even notice the server burned down.

Conclusion: Don't Wait for the Kernel Panic

Disaster recovery is not a product you buy; it is a mindset you adopt. It requires testing. When was the last time you actually tried to restore your backups? If the answer is "never," you don't have backups—you have wishes.
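
A restore drill doesn't need to be elaborate. As a rough monthly exercise (the database and table names below are placeholders), dump from the slave so production isn't touched, load it into a scratch schema, and prove the data is really there:

mysqldump --single-transaction --routines shop > /var/tmp/shop_drill.sql

mysql -e "CREATE DATABASE IF NOT EXISTS shop_drill"
mysql shop_drill < /var/tmp/shop_drill.sql
mysql shop_drill -e "SELECT COUNT(*) FROM orders"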

At CoolVDS, we provide the raw power and the architectural freedom to build these systems. We give you full root access, KVM isolation, and the low-latency bandwidth you need to synchronize data between nodes instantly. Whether you script it yourself with Bash or deploy it with Puppet, the infrastructure needs to be solid.

Stop gambling with your uptime. Deploy a high-availability test environment on CoolVDS today and see the I/O difference for yourself.