Disaster Recovery in 2019: It’s Not Just Backups, It’s About Survival
I have a simple rule for junior sysadmins: If you haven't tested the restore, the backup doesn't exist.
We all nod politely when the CTO talks about "business continuity," but those of us with pager duty know the reality. When a kernel panic hits your primary load balancer at 3:00 AM on a Saturday, or someone accidentally runs a destructive query on the master database, nobody cares about your PDF policy documents. They care about one thing: How fast can we be back online?
In Norway, where connectivity via NIX (Norwegian Internet Exchange) is stellar but strict data laws (GDPR) bind our hands regarding where data lives, building a solid Disaster Recovery (DR) plan is an engineering challenge, not an administrative one. Let's look at how to architect a failover strategy that actually works, using tools available right now in 2019.
The RPO vs. RTO Reality Check
Most VPS providers sell you "backups" that run once a day. That is a Recovery Point Objective (RPO) of 24 hours. Can your business lose 24 hours of data? Probably not. Then there is Recovery Time Objective (RTO)—how long it takes to restore. If you are restoring 500GB of data from a spinning rust SATA drive over a congested network, your RTO could be 6+ hours.
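A quick back-of-envelope calculation shows why (the ~25 MB/s figure is an assumption for a congested link, not a measurement):

# Hours to move 500GB at ~25 MB/s of effective throughput
echo "scale=1; (500 * 1024) / 25 / 3600" | bc
# => 5.6 -- call it six hours once you add decompression and index warm-up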
We need to do better. We need near-real-time replication and instant provisioning. This is where high-performance KVM architecture, like what we see on CoolVDS, becomes critical. You cannot automate disaster recovery on inconsistent hardware.
War Story: The "Silent" Corruption
Two years ago, I managed a Magento cluster for a retailer in Oslo. We had nightly backups. What we didn't catch was a silent file system corruption on the image server. The backups were running successfully—backing up corrupted JPEGs every single night. When the disk finally died, we restored the backup, only to find three months of product images were digital noise.
The Lesson: Backup verification must be automated, and your DR site must be a living environment, not a cold storage bucket.
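What does automated verification actually look like? Here is a minimal sketch of the kind of check that would have caught those dead JPEGs; the paths, sample size, and alert address are all assumptions you should adapt to your own stack:

#!/bin/bash
# verify-backup-sample.sh -- pull a random sample from the backup set and
# prove the files are usable, not merely present.
BACKUP_MEDIA="/mnt/dr_storage/media"
BROKEN=0

while read -r f; do
    # ImageMagick's identify exits non-zero on damaged images;
    # -regard-warnings treats truncated files as failures too.
    if ! identify -regard-warnings "$f" > /dev/null 2>&1; then
        echo "CORRUPT: $f"
        BROKEN=1
    fi
done < <(find "$BACKUP_MEDIA" -name '*.jpg' | shuf -n 20)

if [ "$BROKEN" -ne 0 ]; then
    echo "Backup verification FAILED on $(hostname)" | mail -s "DR verify failed" ops@example.com
    exit 1
fi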
Step 1: Consistent Database Snapshots (MySQL/MariaDB)
Stopping the database to copy files is not an option in 2019. We use Percona XtraBackup for hot backups. It ensures InnoDB consistency without locking tables for the duration of the copy.
Here is a robust script wrapper for innobackupex that streams the backup, compresses it, and encrypts it before sending it off-site (perhaps to a secondary CoolVDS instance in a different availability zone).
#!/bin/bash
# reliable-backup.sh
# Dependencies: percona-xtrabackup-24, qpress, openssl
# MySQL credentials come from /root/.my.cnf; the encryption passphrase lives in
# a root-only file so it never shows up in `ps` output or shell history.
set -o pipefail

TIMESTAMP=$(date +%F_%H-%M)
BACKUP_DIR="/mnt/dr_storage/mysql"
LOG_FILE="/var/log/mysql_dr.log"

echo "Starting backup at $TIMESTAMP" >> "$LOG_FILE"

# Stream, compress, and encrypt on the fly
innobackupex --stream=xbstream --parallel=4 . | \
    qpress -io | \
    openssl enc -aes-256-cbc -pass file:/etc/backup_enc_pass \
    > "$BACKUP_DIR/db_$TIMESTAMP.xbstream.enc"

if [ $? -eq 0 ]; then
    echo "Backup Success: $TIMESTAMP" >> "$LOG_FILE"
    # Trigger replication to DR site
    rsync -avz -e "ssh -p 2222" "$BACKUP_DIR/" remote_user@dr-node.coolvds.com:/backups/mysql/
else
    echo "Backup FAILED: $TIMESTAMP" >> "$LOG_FILE"
    # Alert OpsGenie or PagerDuty here
    exit 1
fi
Pro Tip: Never rely on `mysqldump` for datasets larger than 10GB. The restore time (RTO) is too slow because it has to rebuild indexes. Binary backups like XtraBackup are physical copies; restoring them is as fast as disk I/O allows.
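The restore side of that pipeline is short enough to memorize, and it is the half you will be running under pressure. After you reverse the stream (decrypt with openssl, decompress with qpress, unpack with xbstream) into a working directory such as /var/restore/mysql (an assumed path), the prepare and copy-back steps look roughly like this:

# Apply the redo log so the data files are transactionally consistent
innobackupex --apply-log /var/restore/mysql

# copy-back requires an empty datadir; this is where disk write speed pays off
systemctl stop mysql
innobackupex --copy-back /var/restore/mysql
chown -R mysql:mysql /var/lib/mysql
systemctl start mysql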
Step 2: Filesystem Replication with Restic
For file assets (configs, uploads), `rsync` is standard, but `restic` (which has matured significantly by v0.9.4 this year) offers encrypted, deduplicated snapshots. This saves massive amounts of space and bandwidth on your VPS.
Initialize a repository on your CoolVDS secondary storage:
export RESTIC_REPOSITORY=sftp:user@dr-host:/srv/restic-repo
export RESTIC_PASSWORD_FILE=/etc/restic_pass
# Initialize once
restic init
# Run via cron every 15 minutes
restic backup /var/www/html --exclude-file=/etc/restic_excludes
Because CoolVDS offers local peering in Norway, latency between instances is often under 2ms. This allows you to run these backups frequently without choking the network.
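Snapshots you never check or prune are the silent-corruption story waiting to repeat itself. A small maintenance pass, run less often than the backups themselves (the retention values are examples, tune them to your RPO):

# Verify repository integrity (add --read-data for a full, slower check)
restic check

# Keep a sane retention window and reclaim space
restic forget --keep-hourly 24 --keep-daily 14 --keep-weekly 8 --prune

# Rehearse the restore path, not just the backup path
restic restore latest --target /srv/restore-test --include /var/www/html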
Step 3: The Infrastructure is Code (Terraform)
In 2019, if you are configuring servers by hand during a disaster, you have already failed. You need Terraform to provision the replacement infrastructure instantly. While we wait for Terraform 0.12, the 0.11 syntax is stable enough for production.
Define your DR environment. If your primary datacenter in Oslo goes dark, you need to spin up the compute resources immediately.
resource "openstack_compute_instance_v2" "dr_web_node" {
  name            = "dr-web-01"
  image_name      = "Ubuntu 18.04 LTS"
  flavor_name     = "v2-standard-4cpu-8gb"
  key_pair        = "deployer-key"
  security_groups = ["web-public"]

  network {
    name = "dr-private-net"
  }

  # cloud-init bootstrap so the node comes up ready to serve traffic
  user_data = <<EOF
#!/bin/bash
# Bootstrap goes here: pull configs, install packages, start services
EOF
}
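Keep this code and its variables in a repository that lives outside your primary datacenter. When the primary goes dark, bringing up the DR compute is three commands from any workstation or CI runner:

terraform init
terraform plan -out=dr.tfplan
terraform apply dr.tfplan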
Why Hardware Matters: The NVMe Factor
Here is the bottleneck nobody discusses: Disk Rehydration. When you need to restore 500GB of data to get back online, your disk write speed defines your downtime.
Storage Type            | Sequential Write Speed | Restore Time (500GB)
----------------------- | ---------------------- | --------------------
Standard HDD (7.2k RPM) | ~120 MB/s              | ~70 minutes
SATA SSD                | ~500 MB/s              | ~17 minutes
CoolVDS NVMe            | ~2500 MB/s             | ~3.5 minutes
This is why we deploy on CoolVDS. Their standard KVM instances run on NVMe arrays. In a disaster, saving 60 minutes on restore time is the difference between a minor hiccup and a reputation-destroying outage.
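Don't take that table (or any provider's spec sheet) on faith; benchmark the volume you will actually restore onto. A quick sequential-write test with fio (the target path and test size here are arbitrary):

# Sequential 1MB writes, bypassing the page cache (O_DIRECT)
fio --name=seqwrite --rw=write --bs=1M --size=4G \
    --direct=1 --ioengine=libaio \
    --filename=/mnt/dr_storage/fio-test --group_reporting

# Remove the test file afterwards
rm /mnt/dr_storage/fio-test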
The Norwegian Context: GDPR and Datatilsynet
We are operating under the strict requirements of GDPR. Article 32 mandates the "ability to restore the availability and access to personal data in a timely manner."
Storing your DR backups in an AWS bucket in Virginia is a legal grey area that makes compliance officers nervous. Storing them on a secondary CoolVDS instance in a different Norwegian facility keeps the data within national borders, satisfying both latency requirements and Datatilsynet's oversight. You get data sovereignty without sacrificing speed.
Automating the Failover (Nginx)
Your DNS TTL should be low (300 seconds), but you can also handle failover at the load balancer level if you have a floating IP. Here is a simple Nginx upstream config that marks the primary as down and routes to the DR site (functioning as a 'sorry' server or read-only replica) automatically:
upstream backend_cluster {
    server 10.0.0.10:80 max_fails=3 fail_timeout=10s;
    server 10.0.0.20:80 backup;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend_cluster;
        proxy_set_header Host $host;
        proxy_connect_timeout 2s;
    }
}
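Then rehearse it. A failover path you have never triggered on purpose is just a hope. A quick drill (the load balancer hostname is a placeholder; the IPs match the upstream block above):

# Simulate a primary failure
ssh root@10.0.0.10 'systemctl stop nginx'

# Hammer the load balancer; failed attempts are retried against the backup
# transparently, and after max_fails=3 the primary is marked down for fail_timeout
for i in $(seq 1 10); do
    curl -s -o /dev/null -w "%{http_code}\n" http://lb.example.com/
done

# Bring the primary back and confirm it rejoins the pool
ssh root@10.0.0.10 'systemctl start nginx'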
Conclusion
Hope is not a strategy. Scripts, redundancy, and fast hardware are. By combining robust tools like XtraBackup and Restic with high-IOPS infrastructure, you turn disaster recovery from a nightmare into a checklist item.
Don't wait for the inevitable hardware failure to test your theories. Spin up a sandbox instance on CoolVDS today—deployments take less than 55 seconds—and break things on purpose. Your future self will thank you.