Disaster Recovery for Ops: Why Your Backups Won't Save You When the Datacenter Goes Dark
Let’s be honest. If your Disaster Recovery (DR) plan consists solely of a cron job running a tarball script every midnight, you don't have a plan. You have a false sense of security.
I learned this the hard way three years ago. A catastrophic power failure in a "Tier 3" facility—not in Norway, thankfully—corrupted the RAID arrays on our primary database cluster. We had backups. But we hadn't accounted for the Mean Time to Recovery (MTTR). It took us 14 hours to transfer, decrypt, and re-import 600GB of SQL data because our recovery server was running on spinning rust (HDD) instead of NVMe. The CEO was screaming. The customers were leaving.
In 2019, with ransomware targeting Linux servers and GDPR fines from Datatilsynet looming over any data loss incident, "best effort" isn't enough. You need a war strategy.
The RTO/RPO Reality Check
Before touching a single config file, define these two metrics. If you can't recite them, stop deploying.
- RPO (Recovery Point Objective): How much data can you afford to lose? (e.g., "Max 15 minutes of transactions").
- RTO (Recovery Time Objective): How long can you stay offline? (e.g., "Max 1 hour before bankruptcy").
Achieving an RPO of zero requires synchronous replication, which introduces latency. If your servers are in Oslo and your DR site is in Frankfurt, the speed of light is your enemy: every synchronous commit waits out roughly 20-30 ms of round trip, which caps a single connection at a few dozen transactions per second. For most Norwegian businesses, an asynchronous setup with a warm standby in a separate local availability zone is the sweet spot between performance and safety.
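Before you commit to a topology, measure the actual round trip from your primary to the candidate DR site. A quick check, with a placeholder hostname standing in for your own standby:

# Round-trip latency to the candidate DR site (hostname is a placeholder)
ping -c 20 dr-standby.example.net
# mtr shows per-hop latency and packet loss if the numbers look off
mtr --report --report-cycles 20 dr-standby.example.net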
Database Consistency: Stop Copying /var/lib/mysql
I still see sysadmins running `rsync` against live database directories. InnoDB never stops writing to its data files, so the files you copy at second 1 no longer match the files you copy at second 30; the result is an inconsistent copy that may refuse to start at all. Unless you stop the MySQL service (which means downtime), you must use tools that understand the InnoDB storage engine.
For a robust 2019-era stack, Percona XtraBackup is the standard. It performs hot backups without locking your database.
The "War Room" Backup Script
Here is a battle-tested wrapper for XtraBackup that we use on our high-performance CoolVDS instances. It handles the backup, compression, and encryption in a single pipe.
#!/bin/bash
# Fail the whole pipeline if any stage (xtrabackup, qpress, openssl) fails
set -o pipefail

# Configuration
BACKUP_DIR="/mnt/backups/mysql"
DATE=$(date +%Y-%m-%d_%H-%M-%S)
LOG_FILE="/var/log/mysql_dr.log"

# Ensure we have a secure place for credentials
# Do not put passwords in this script directly in production.
# /root/.db_creds must define ENCRYPTION_KEY; xtrabackup reads its DB
# credentials from the usual MySQL option files (e.g. /root/.my.cnf)
source /root/.db_creds
export ENCRYPTION_KEY

echo "[INFO] Starting Hot Backup at $DATE" >> "$LOG_FILE"

# Stream backup -> compress (qpress) -> encrypt (openssl) -> disk
# Note: NVMe I/O allows us to do this without choking the CPU.
# qpress -io needs a name to store inside the archive; "env:" keeps the
# passphrase out of the process list.
xtrabackup --backup --stream=xbstream --extra-lsndir="$BACKUP_DIR/chkpoint" \
    --target-dir="$BACKUP_DIR" | \
    qpress -io "mysql_$DATE.xbstream" | \
    openssl enc -aes-256-cbc -salt -pass env:ENCRYPTION_KEY \
        -out "$BACKUP_DIR/full_backup_$DATE.xb.enc"

if [ $? -eq 0 ]; then
    echo "[SUCCESS] Backup completed successfully" >> "$LOG_FILE"
else
    echo "[CRITICAL] Backup FAILED" >> "$LOG_FILE"
    # Trigger PagerDuty or send alert via mailx
    echo "DR FAILURE" | mailx -s "CRITICAL: DB Backup Failed" ops@yourdomain.no
fi
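A backup you have never restored is a hypothesis, not a plan. The reverse path looks roughly like this; the paths and the DATE placeholder are illustrative, and the stored filename matches whatever name you passed to `qpress -io`:

# 1. Decrypt (needs the same ENCRYPTION_KEY exported in the environment)
mkdir -p /tmp/restore
openssl enc -d -aes-256-cbc -pass env:ENCRYPTION_KEY \
    -in /mnt/backups/mysql/full_backup_DATE.xb.enc -out /tmp/restore/backup.qp
# 2. Decompress the qpress archive into the working directory
cd /tmp/restore && qpress -d backup.qp .
# 3. Unpack the xbstream into an empty target datadir
mkdir -p /var/lib/mysql.restore
xbstream -x -C /var/lib/mysql.restore < /tmp/restore/mysql_DATE.xbstream
# 4. Apply the redo log so the datadir is crash-consistent
xtrabackup --prepare --target-dir=/var/lib/mysql.restore

Time this end-to-end on the actual DR hardware. That number, not the optimistic one in the runbook, is what your restore contributes to RTO.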
The Offsite Imperative: BorgBackup
Storing backups on the same physical host is suicide. Storing them unencrypted is negligence. In the Nordic market, we often use BorgBackup. It offers deduplication (saving massive amounts of space) and authenticated encryption.
Why Borg? Because it mounts your backup repository as a FUSE filesystem. You can browse your disaster recovery points like normal directories.
# Initialize the repo (do this once)
borg init --encryption=repokey user@offsite-storage.coolvds.net:backups/main-repo
# The daily run command
borg create --stats --compression lz4 \
--exclude '*.tmp' \
user@offsite-storage.coolvds.net:backups/main-repo::{now} \
/etc \
/var/www/html \
/mnt/backups/mysql
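To browse a recovery point or pull back a single file, mount the repository; and prune it on a schedule so deduplication does not turn into hoarding. A sketch, with the mount point and retention policy as examples only:

# Mount the repo read-only via FUSE and browse the archives
mkdir -p /mnt/borg-restore
borg mount user@offsite-storage.coolvds.net:backups/main-repo /mnt/borg-restore
ls /mnt/borg-restore
borg umount /mnt/borg-restore

# Keep the repository bounded: 7 dailies, 4 weeklies, 6 monthlies
borg prune --stats --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
user@offsite-storage.coolvds.net:backups/main-repo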
Infrastructure: The "Warm Standby" Approach
Cold backups take too long to restore. If you have a high-traffic e-commerce site targeting Norway, you need a Warm Standby.
This is a secondary VPS—scaled down to save costs—that runs a replica of your stack. The database replicates from Master to Slave. The code is synced via CI/CD pipelines. When the Master dies, you promote the Slave.
Pro Tip: On CoolVDS, we recommend using KVM (Kernel-based Virtual Machine) for this. Unlike OpenVZ, KVM provides true hardware virtualization. If your neighbor is under a DDoS attack, your kernel doesn't panic. This isolation is critical when you are already in a crisis mode.
Configuring MySQL Replication (Async)
Modify your `my.cnf` on the Master. We need binary logging enabled.
[mysqld]
# Unique ID for the server
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
expire_logs_days = 7
# Durability settings (ACID)
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
On the Slave (your DR node), set `server-id = 2` and mark it read-only to prevent accidental writes during peacetime.
[mysqld]
server-id = 2
relay-log = /var/log/mysql/mysql-relay-bin.log
read_only = 1
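The config files only enable replication; you still have to point the replica at the master and start it. Here is a minimal sketch of the classic binlog-position method, where the host, replication user, password, and the log file/position (taken from `SHOW MASTER STATUS`) are all placeholders for your own values:

# On the Master: create a dedicated replication account
mysql -e "CREATE USER 'repl'@'10.0.0.%' IDENTIFIED BY 'use-a-real-password';"
mysql -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.%';"
mysql -e "SHOW MASTER STATUS;"   # note the File and Position values

# On the Slave: point it at the master and start replicating
mysql -e "CHANGE MASTER TO
    MASTER_HOST='10.0.0.10',
    MASTER_USER='repl',
    MASTER_PASSWORD='use-a-real-password',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=154;"
mysql -e "START SLAVE;"
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'

In practice you seed the slave from the XtraBackup copy above first, then use the binlog coordinates recorded in the backup's xtrabackup_binlog_info file instead of typing them by hand.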
Network Failover: The DNS Trap
You have your data on the backup server. Great. How do users get there?
If you rely on changing DNS records manually, you are at the mercy of TTL (Time To Live). If your TTL is 3600 seconds, some users will see the down server for an hour.
Set your DNS TTL to 60 seconds or use a Floating IP (if available within the same datacenter). For cross-datacenter failover (e.g., Oslo to Stockholm), use a DNS Failover service that monitors HTTP 200 OK responses and updates the A-record automatically.
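If you roll this yourself instead of buying a managed DNS failover product, the health check is the trivial part; the record update is whatever API your DNS provider exposes, so that step in the sketch below is a pure placeholder:

#!/bin/bash
# Poll the primary every 30 seconds; after 3 consecutive failures, trigger failover
PRIMARY_URL="https://www.example.no/healthz"   # hypothetical health endpoint
FAILS=0
while true; do
    CODE=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$PRIMARY_URL")
    if [ "$CODE" = "200" ]; then
        FAILS=0
    else
        FAILS=$((FAILS + 1))
    fi
    if [ "$FAILS" -ge 3 ]; then
        echo "$(date) primary unhealthy, initiating DNS failover" >> /var/log/dns_failover.log
        # Replace this with your DNS provider's API call to repoint the A-record
        break
    fi
    sleep 30
done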
The Legal Angle: GDPR & Schrems II
This is specific to those of us operating in Europe. You cannot just dump your DR backups into an AWS bucket in US-East-1 without navigating a legal minefield around data transfer mechanisms, and with the Schrems II case still pending before the CJEU, those ground rules could shift again.
Keeping your DR site within the EEA (European Economic Area) simplifies compliance significantly. When auditing your infrastructure, Datatilsynet will ask where the backups live. "I don't know, the cloud handles it" is an answer that leads to fines.
Testing: The "Scream Test"
A disaster recovery plan is theoretical until tested. We perform a "Scream Test" quarterly. We artificially sever the connection to the primary database and measure exactly how long it takes for the standby to accept writes.
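Here is a rough version of the drill as run from an application host; the IPs, the dr_test.heartbeat table, and the promotion step are stand-ins for your own environment:

# Block the primary's MySQL port to simulate the outage (maintenance window only)
iptables -A OUTPUT -d 10.0.0.10 -p tcp --dport 3306 -j DROP
START=$(date +%s)

# Now execute your promotion runbook (manually or via your orchestrator), e.g.:
#   mysql -h 10.0.0.20 -e "STOP SLAVE; SET GLOBAL read_only = 0;"

# Poll until the standby actually accepts a write
until mysql -h 10.0.0.20 -e "INSERT INTO dr_test.heartbeat VALUES (NOW());" 2>/dev/null; do
    sleep 1
done
echo "Standby accepted writes after $(( $(date +%s) - START )) seconds"

# Remove the firewall rule once the drill is over
iptables -D OUTPUT -d 10.0.0.10 -p tcp --dport 3306 -j DROP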
If your current hosting provider suffers from "noisy neighbor" syndrome, your restore speeds will fluctuate wildly. Predictable I/O is not a luxury; it is a requirement for meeting RTO.
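Before trusting your RTO math, verify that the standby's disks can sustain a restore-sized write load, and repeat the test at different times of day to see whether throughput is actually stable. A quick fio sketch; the target path is an example:

# 4 GB sequential write with direct I/O, roughly the shape of a restore
fio --name=restore-sim --rw=write --bs=1M --size=4G --direct=1 \
    --filename=/var/lib/mysql.restore/fio.test --group_reporting
rm -f /var/lib/mysql.restore/fio.test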
Don't wait for the fire. Audit your `rsync` scripts, verify your XtraBackup checksums, and ensure your secondary instance has the NVMe throughput to handle the load once it becomes the primary.