Disaster Recovery 101: Why Your "Backups" Won't Save You (And What Will)
It was 3:17 AM on a Tuesday when the RAID controller on our primary database server decided to commit seppuku. It wasn't a graceful failure. It began writing garbage data across the stripes before locking up entirely. We had backups, sure. But restoring a 400GB MySQL dump takes hours, not minutes. That is hours of downtime, lost revenue, and angry executives breathing down my neck.
Backup is not a strategy. Recovery is a strategy.
In the Nordic hosting market, we often get complacent. We have stable power grids and low latency through NIX (Norwegian Internet Exchange). But a stable grid doesn't protect you from a kernel panic, a corrupted file system, or a fat-fingered rm -rf /. If you are running mission-critical applications in 2014 without a Warm Spare, you are playing Russian Roulette with a fully loaded pistol.
The 3-2-1 Rule is Dead (For Web Apps)
The old-school IT crowd loves the 3-2-1 rule: 3 copies, 2 media types, 1 off-site. That's fine for archiving tax records. It is useless for a high-traffic e-commerce store running Magento or a SaaS platform. If your recovery time objective (RTO) is under 15 minutes, you need active replication, not tape drives.
We need to talk about Geographic Redundancy within Norwegian borders. If your primary server is in Oslo, your failover cannot be in the same rack. A switch failure takes out both. You need a secondary location, but—and this is critical for those of us dealing with Datatilsynet (The Norwegian Data Protection Authority)—you often need to keep that data within Norway to satisfy the Personal Data Act (Personopplysningsloven) and banking compliance standards.
Step 1: The Database (MySQL 5.6 Master-Slave)
Forget clustering for a moment. It introduces complexity that kills more systems than it saves. In 2014, the most robust, battle-tested method for DR is standard MySQL Master-Slave replication. With the release of MySQL 5.6, we finally have GTID (Global Transaction Identifiers) which makes failover significantly less painful than the old binary log position method.
Here is how you configure the Master (Active Node) in /etc/my.cnf. Note the innodb_flush_log_at_trx_commit=1 setting—this is non-negotiable for ACID compliance, even if it costs you some I/O performance.
[mysqld]
server-id = 1
log-bin = mysql-bin
gtid_mode = ON
enforce_gtid_consistency = true
log_slave_updates = true
# Safety Nets
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
innodb_buffer_pool_size = 4G # Adjust based on RAM
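One thing the config file alone does not cover: the slave needs an account to replicate with. A minimal sketch, run in the mysql client on the master; the account name, password and the slave's address (10.0.0.2) are placeholders for your own values:
-- Run on the master
CREATE USER 'repl'@'10.0.0.2' IDENTIFIED BY 'use-a-strong-password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.2';
FLUSH PRIVILEGES;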
On your Slave (Warm Spare), running on a separate CoolVDS instance:
[mysqld]
server-id = 2
relay-log = /var/log/mysql/mysql-relay-bin.log
# GTID mode on 5.6 refuses to start without binary logging and
# log_slave_updates, even on the slave
log-bin = mysql-bin
log_slave_updates = true
gtid_mode = ON
enforce_gtid_consistency = true
read_only = 1 # Crucial: prevents accidental writes to the slave
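Once both daemons are restarted with their new configs, point the slave at the master. GTID auto-positioning means no fiddling with binary log file names and offsets. A minimal sketch, assuming the master sits at 10.0.0.1 and the repl account from above:
-- Run in the mysql client on the slave
CHANGE MASTER TO
  MASTER_HOST='10.0.0.1',
  MASTER_USER='repl',
  MASTER_PASSWORD='use-a-strong-password',
  MASTER_AUTO_POSITION=1;
START SLAVE;
-- Both Slave_IO_Running and Slave_SQL_Running should say Yes
SHOW SLAVE STATUS\G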
Pro Tip: Never run your database on OpenVZ or container-based virtualization if you care about data integrity. You need full kernel isolation. We use KVM (Kernel-based Virtual Machine) at CoolVDS because it treats your allocated RAM as strictly yours. No "noisy neighbors" stealing your inode cache.
Step 2: Filesystem Synchronization
Databases are only half the battle. What about your user uploads, configuration files, and application code? You don't need a fancy distributed file system like GlusterFS for a simple two-node setup—it adds latency and complexity. You need rsync.
I use a simple cron job that runs every 5 minutes. Is it real-time? No. Does it survive network partitions better than NFS? Absolutely.
#!/bin/bash
# /root/scripts/sync_failover.sh
SRC="/var/www/html/"
DEST="user@10.0.0.2:/var/www/html/"
# No quotes inside the variable: they would be passed to rsync as literal
# characters and the excludes would silently stop matching
EXCLUDE="--exclude=cache/ --exclude=logs/"
# The -z flag compresses data to save bandwidth on the WAN link
rsync -avz -e "ssh -p 22" $EXCLUDE --delete "$SRC" "$DEST"
Add this to your crontab. Note the --delete flag. It ensures that if you delete a file on the master, it disappears from the slave. Be careful with this; if a hacker wipes your master, the wipe replicates. This is why you also need distinct daily snapshots. CoolVDS offers snapshotting at the block level, which saves your skin in ransomware scenarios.
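As for the crontab entry itself, it is a one-liner; the log path is just my own habit, point it wherever you like:
# crontab -e (as root): run the sync every 5 minutes and keep a log of it
*/5 * * * * /root/scripts/sync_failover.sh >> /var/log/sync_failover.log 2>&1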
Step 3: The Failover Script
When the Master dies, you need to promote the Slave. Do not automate this unless you have a very sophisticated heartbeat mechanism (like Pacemaker/Corosync) and fencing. Automated failover often leads to "split-brain" scenarios where both servers think they are the master.
Here is the manual promotion sequence for MySQL 5.6:
# On the Slave Server
# 1. Stop the replication threads
mysql -u root -p -e "STOP SLAVE;"
# 2. Discard the replication configuration (RESET SLAVE ALL, not RESET MASTER,
#    which would wipe the binary logs and GTID history of the node you are promoting)
mysql -u root -p -e "RESET SLAVE ALL;"
# 3. Turn off read-only mode
# Edit my.cnf or run dynamically:
mysql -u root -p -e "SET GLOBAL read_only = OFF;"
# 4. Update your application config (IP change)
sed -i 's/10.0.0.1/127.0.0.1/g' /var/www/html/config/database.php
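One sanity check before you touch step 1: make sure the slave actually applied everything it managed to pull from the dying master, or you are promoting a stale copy. A quick look at the replication status is enough:
# Seconds_Behind_Master should be 0 (or NULL if the master is already dead),
# and Retrieved_Gtid_Set should be contained in Executed_Gtid_Set
mysql -u root -p -e "SHOW SLAVE STATUS\G" | egrep "Running|Seconds_Behind_Master|Gtid_Set"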
Infrastructure Matters: The SSD Factor
We are seeing a shift in 2014. Traditional spinning HDDs are becoming the bottleneck for recovery. Restoring a 50GB backup from a SATA drive takes roughly 15-20 minutes depending on fragmentation. On the new SSD-backed storage arrays we are deploying at CoolVDS, that drops to under 4 minutes.
When you are down, every second is a lost customer. While standard VPS providers are still overselling HDD space 40:1, we are pushing raw I/O throughput. If your database is I/O bound, you are already losing SEO rankings because Google is starting to penalize slow Time-To-First-Byte (TTFB).
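Not sure whether your current box is the bottleneck? A crude sequential-write test tells you a lot. This writes a 1GB scratch file, so make sure the volume has room and delete it afterwards; spinning disks typically land around 100 MB/s, SSDs several times that:
# oflag=direct bypasses the page cache so you measure the disk, not RAM
dd if=/dev/zero of=/var/tmp/ddtest bs=1M count=1024 oflag=direct
rm /var/tmp/ddtest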
Comparison: Recovery Times
| Scenario (100GB Data) | Standard HDD VPS | CoolVDS SSD KVM |
|---|---|---|
| Full Restore (tar.gz) | ~45 Minutes | ~12 Minutes |
| MySQL Import | ~90 Minutes | ~25 Minutes |
| Reboot Time | 2-3 Minutes | 15 Seconds |
Legal & Compliance in Norway
If you are hosting data for Norwegian citizens, you are bound by the Personopplysningsloven. Using a cheap VPS in the US might violate Safe Harbor agreements if you aren't careful. By keeping your primary and disaster recovery nodes within Norway (or the EEA), you simplify your compliance overhead significantly.
Furthermore, latency matters. The round-trip time (RTT) from Oslo to Amsterdam is decent, but Oslo to Oslo is a rounding error. For synchronous replication or heavy rsync jobs, that low latency prevents the "lag" that causes data inconsistency.
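If you want to see the difference for yourself, measure the RTT between your nodes before committing to a secondary location (10.0.0.2 being the slave from the examples above):
# The last lines summarize packet loss and min/avg/max RTT over 20 probes
ping -c 20 10.0.0.2 | tail -2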
The Verdict
Hardware fails. It is a law of physics. If you don't have a plan, you plan to fail. By leveraging MySQL 5.6 GTID replication, simple bash scripting, and the high-performance KVM architecture of CoolVDS, you can sleep through the night knowing that when (not if) the primary server melts, your business survives.
Don't wait for the kernel panic. Provision a secondary KVM instance on CoolVDS today and test your replication lag. It takes 55 seconds to deploy, but it saves your reputation forever.