When RAID Fails: A Battle-Hardened Guide to Disaster Recovery in 2018
It is 3:14 AM. Your phone buzzes. It’s not a text from a friend; it’s PagerDuty. Your primary database cluster in Frankfurt just went dark. SSH times out. The hosting provider's status page is ominously static. This is the moment where careers are either saved by preparation or destroyed by optimism.
If your strategy relies solely on local RAID controllers or the hope that "the cloud never fails," you are already a casualty. In the post-GDPR world of 2018, losing data isn't just an operational failure; it is a legal liability that draws the ire of the Datatilsynet (Norwegian Data Protection Authority). We need to talk about real Disaster Recovery (DR), not just glorified file copying.
The RPO/RTO Reality Check
Before we touch a single config file, we must define two variables that dictate your budget and your sleep schedule:
- Recovery Point Objective (RPO): How much data can you afford to lose? One hour? One second?
- Recovery Time Objective (RTO): How long can you stay offline?
Clients often demand zero RPO and zero RTO. This is physically impossible without synchronous multi-site replication, which introduces latency penalties that kill performance for most applications. A pragmatic architect accepts a non-zero RPO (e.g., 15 minutes) to maintain high throughput on the primary node.
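Whatever number you settle on, monitor it. Here is a rough sketch of an RPO watchdog for the DR node, assuming backups land in timestamped directories like the rsync setup shown later (adjust the paths and the 900-second target to taste):
#!/bin/bash
# rpo-check.sh - alert if the newest snapshot on the DR node breaches the RPO target
RPO_SECONDS=900                              # 15-minute RPO
BACKUP_ROOT="/data/backups/web01/backups"    # same layout as the sync script further down
# Modification time (epoch seconds) of the newest snapshot directory
NEWEST=$(find "$BACKUP_ROOT" -mindepth 1 -maxdepth 1 -type d -printf '%T@\n' | sort -n | tail -1)
NOW=$(date +%s)
if [ -z "$NEWEST" ] || [ $((NOW - ${NEWEST%.*})) -gt "$RPO_SECONDS" ]; then
    echo "RPO breach: newest backup is older than ${RPO_SECONDS}s" | mail -s "RPO ALERT" ops@example.com
fi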
Geographic Redundancy: Why Oslo?
Hosting your primary infrastructure and your DR site in the same data center is suicide. A single power grid failure or fiber cut takes out both. You need geographic separation, but you also need data sovereignty.
With the uncertainty surrounding the US Cloud Act and the stringent requirements of GDPR (Article 32), moving backup data outside the EEA is a compliance nightmare. This is why I increasingly architect DR solutions using Norwegian infrastructure. Norway offers a unique blend: outside the EU but fully aligned with EEA privacy laws, cheap hydroelectric power keeping costs down, and rock-solid connectivity via NIX (Norwegian Internet Exchange).
Pro Tip: Latency matters even in backups. The round-trip time (RTT) from Frankfurt/Amsterdam to Oslo is often under 20ms. This makes Oslo an ideal "warm standby" location compared to cheaper, high-latency locations in the US.
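Don't take my word for the latency; measure it from your primary before you commit. A quick check, assuming mtr is installed and dr-node-01.coolvds.com stands in for your prospective DR endpoint:
# Average RTT and packet loss over 100 probes, run from the Frankfurt/Amsterdam node
mtr --report --report-cycles 100 dr-node-01.coolvds.com
# Or with plain ping - read the avg figure in the rtt min/avg/max/mdev summary
ping -c 20 dr-node-01.coolvds.com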
The Technical Implementation: Simplicity Saves Systems
Complex DR tools fail when you are panicked. We stick to standard, battle-tested tools: rsync, SSH, and MySQL replication.
1. The Filesystem Sync
Do not use FTP. Do not use unencrypted transfers. We use rsync over SSH. Here is a robust script structure I deploy on production web servers. It handles the transfer and rotation of snapshots.
#!/bin/bash
# /usr/local/bin/dr-sync.sh
# Executed via cron every 15 minutes
REMOTE_USER="dr_user"
REMOTE_HOST="dr-node-01.coolvds.com"   # Your CoolVDS NVMe instance
REMOTE_DIR="/data/backups/web01"
SOURCE_DIR="/var/www/html/"
SSH_KEY="/root/.ssh/id_rsa_dr"
# We use --link-dest to create hard-link based snapshots:
# unchanged files are hard-linked to the previous run, so each snapshot
# costs almost no extra space on the CoolVDS instance while giving us history.
CURRENT_DATE=$(date +%Y-%m-%d-%H%M%S)
echo "Starting sync at $CURRENT_DATE"
# Push the snapshot. Abort and alert if rsync itself fails, so we never
# point the 'current' symlink at an incomplete backup.
if ! rsync -avzPh \
    --delete \
    --link-dest="$REMOTE_DIR/current" \
    -e "ssh -i $SSH_KEY" \
    "$SOURCE_DIR" \
    "$REMOTE_USER@$REMOTE_HOST:$REMOTE_DIR/backups/$CURRENT_DATE"; then
    echo "CRITICAL: Sync failed" | mail -s "DR ALERT" ops@example.com
    exit 1
fi
# Update the 'current' symlink on the remote side for the next run
if ssh -i "$SSH_KEY" "$REMOTE_USER@$REMOTE_HOST" \
    "rm -f $REMOTE_DIR/current && ln -s $REMOTE_DIR/backups/$CURRENT_DATE $REMOTE_DIR/current"; then
    echo "Sync successful"
else
    echo "CRITICAL: Symlink update failed" | mail -s "DR ALERT" ops@example.com
    exit 1
fi
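Wire it into cron to match your RPO. A minimal root crontab entry (the log path is just a suggestion):
# crontab -e (as root)
*/15 * * * * /usr/local/bin/dr-sync.sh >> /var/log/dr-sync.log 2>&1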
2. The Database Strategy
Files are easy. Databases are hard. Running rsync against a live MySQL data directory (/var/lib/mysql) will hand you an inconsistent, unusable copy, because InnoDB is writing to those files mid-transfer. You have two valid paths in 2018:
Option A: Replication (Low RPO)
Set up a Master-Slave configuration. The slave runs on your CoolVDS instance in Oslo. Ensure you enable SSL for replication traffic.
# /etc/mysql/mysql.conf.d/mysqld.cnf (On the Master)
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = production_db
bind-address = 0.0.0.0 # Strict firewall rules required!
require_secure_transport = ON
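The config above only covers the master. On the Oslo slave you need a distinct server-id in mysqld.cnf (e.g. server-id = 2), then point it at the master over SSL. A rough sketch, assuming MySQL 5.7, a replication user named repl, and binlog coordinates read from SHOW MASTER STATUS on the master (hostnames and passwords here are placeholders):
# On the master: create a replication account that is forced to use TLS
mysql -u root -p -e "CREATE USER 'repl'@'%' IDENTIFIED BY '********' REQUIRE SSL;
                     GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';"
# On the CoolVDS slave: attach to the master and start replicating
mysql -u root -p -e "CHANGE MASTER TO
    MASTER_HOST='primary.example.com',
    MASTER_USER='repl',
    MASTER_PASSWORD='********',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=154,
    MASTER_SSL=1;
  START SLAVE;"
# Verify: both Slave_IO_Running and Slave_SQL_Running must say 'Yes'
mysql -u root -p -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'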
Option B: Consistent Dumps (Higher RPO, Lower Cost)
If you don't want to manage replication lag, use mysqldump with --single-transaction to ensure consistency without locking InnoDB tables.
# Tip: keep credentials in /root/.my.cnf instead of -p$PASSWORD so they never show up in ps output
mysqldump --single-transaction --quick --lock-tables=false \
    -u root -p$PASSWORD production_db | gzip > /tmp/db_backup.sql.gz
# Immediately ship it off-site
scp -i /root/.ssh/id_rsa_dr /tmp/db_backup.sql.gz dr_user@dr-node-01.coolvds.com:/data/sql/
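The restore path is the part people forget to script. On the DR node, something along these lines (database name and dump location taken from the commands above) brings it back:
# On the DR node: recreate the schema and load the latest dump
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS production_db"
gunzip -c /data/sql/db_backup.sql.gz | mysql -u root -p production_db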
The Hardware Factor: Why KVM and NVMe?
When you are restoring 500GB of data, IOPS (Input/Output Operations Per Second) is the only metric that matters. On a traditional HDD VPS, restoring that dump file could take 6 hours. On NVMe storage, it takes minutes.
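Don't guess at your restore window; benchmark the disk you would actually restore onto. A rough fio run (fio must be installed; the file name and size are arbitrary):
# 4K random reads with direct I/O, so the page cache cannot flatter the result
fio --name=dr-iops-test --filename=/data/fio.test --size=2G \
    --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
    --direct=1 --runtime=60 --time_based --group_reporting
rm -f /data/fio.test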
Furthermore, virtualization technology dictates your stability. Many budget providers use OpenVZ (Container-based). In OpenVZ, if a "noisy neighbor" kernel panics, the whole node goes down—including your DR site. This is unacceptable for mission-critical infrastructure.
This is why we treat CoolVDS as the reference implementation for these setups. They enforce KVM (Kernel-based Virtual Machine) virtualization. KVM provides true hardware abstraction. Your kernel is your kernel. If another customer crashes their OS, your disaster recovery node keeps humming. Combined with their pure NVMe backing, the "Time to Recovery" drops drastically.
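Not sure what your current provider is running underneath you? One command settles it (systemd-detect-virt ships with systemd; virt-what is the fallback on older distributions):
# Prints 'kvm' on a KVM guest, 'openvz' inside an OpenVZ container
systemd-detect-virt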
Testing the Failover
A backup that hasn't been restored is just a rumor. You need to simulate a failure.
- Spin up your CoolVDS instance.
- Modify /etc/hosts on your laptop to point your domain at the CoolVDS IP (see the sketch after this list).
- Run the restore script.
- Check application logs for database connection errors.
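For the /etc/hosts step, the edit and a quick sanity check look roughly like this (203.0.113.50 stands in for your CoolVDS IP, example.com for your domain):
# Force the domain to resolve to the DR node on this laptop only
echo "203.0.113.50 example.com www.example.com" | sudo tee -a /etc/hosts
# Or skip the hosts edit entirely and pin the resolution per request
curl -I --resolve example.com:443:203.0.113.50 https://example.com/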
The Security/Compliance Balance
By storing data in Norway, you navigate the GDPR landscape effectively. However, security responsibilities remain yours. Your DR node should not be sitting wide open.
Configure iptables or ufw to only accept SSH and MySQL connections from your specific office IPs and your primary data center IP. Even better, set up a VPN tunnel.
# Simple UFW setup for the DR node
ufw default deny incoming
ufw allow from 192.0.2.10 to any port 22 proto tcp # Your Office
ufw allow from 198.51.100.20 to any port 3306 proto tcp # Primary Server
ufw enable
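Before you log out, confirm the rules say what you think they say:
ufw status verbose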
Conclusion
The cost of a standby VPS is negligible compared to the cost of business downtime. For the price of a few coffees, you can have a remote, KVM-isolated, NVMe-powered safety net in Oslo.
Do not wait for the drives to click or the RAID controller to smoke. The hardware will fail. It is not a matter of if, but when. Ensure your infrastructure is resilient enough to handle it.
Ready to harden your infrastructure? Deploy a KVM-based, NVMe-powered disaster recovery node on CoolVDS today and sleep through the next 3 AM alarm.