
When RAID Fails: A Battle-Tested Disaster Recovery Guide for 2017

There are two types of system administrators: those who have lost data, and those who will. I learned this the hard way in 2014 when a degraded RAID 5 array on a legacy server decided to fail its second drive during a rebuild. The silence in the server room was deafening.

Uptime guarantees are marketing fluff. SLAs refund your money, but they don't restore your customer database. In the current landscape of 2017, with ransomware threats escalating and the upcoming General Data Protection Regulation (GDPR) looming over every European business, having a "we copy files sometimes" strategy is professional negligence.

This isn't a high-level overview for managers. This is for the people with root access. We are going to look at how to architect a Disaster Recovery (DR) plan that actually works, specifically for infrastructure hosted in Norway, leveraging modern KVM virtualization.

The 3-2-1 Rule is Non-Negotiable

If you take nothing else from this post, remember the 3-2-1 rule. It’s the industry standard for a reason:

  • 3 copies of your data: One primary, two backups.
  • 2 different media types: E.g., NVMe storage on your VPS and object storage/cold storage.
  • 1 offsite location: Because data centers can catch fire.

Pro Tip: Snapshots are NOT backups. If the underlying storage array on your host node corrupts, your snapshot dies with it. CoolVDS provides automated snapshots for quick rollbacks, but we always advise clients to pipe critical data out of the datacenter for true redundancy.

RTO and RPO: The Metrics That Matter

Before writing scripts, define your failure tolerance:

  • Recovery Time Objective (RTO): How long can you be down? (e.g., "We need to be back up in 4 hours.")
  • Recovery Point Objective (RPO): How much data can you lose? (e.g., "We can lose up to 1 hour of transactions.")

If your boss asks for zero data loss and 100% uptime, ask for an unlimited budget. For the rest of us, we optimize.

The Technical Implementation: Automating Offsite Backups

Let's get our hands dirty. We will assume you are running a standard LAMP/LEMP stack on CentOS 7 or Ubuntu 16.04. We need to dump the database consistently and ship files to a remote secure location.

1. Consistent Database Dumps

Copying /var/lib/mysql while the server is running is a recipe for corruption. You need a logical dump. For MySQL/MariaDB, use mysqldump with the --single-transaction flag to ensure InnoDB consistency without locking tables for the duration of the backup.

#!/bin/bash

# Variables
# Tip: a ~/.my.cnf (or --defaults-extra-file) keeps the password out of `ps` output;
# hardcoding it here is shown for brevity only.
DB_USER="backup_user"
DB_PASS="ComplexPassword123!"
BACKUP_DIR="/backup/mysql"
DATE=$(date +%F_%H-%M)

# Ensure directory exists
mkdir -p "$BACKUP_DIR"

# Dump all databases in a single consistent InnoDB snapshot
mysqldump -u"$DB_USER" -p"$DB_PASS" --all-databases --single-transaction --quick --lock-tables=false > "$BACKUP_DIR/full_dump_$DATE.sql"

# Compress to save space (gzip is standard, pigz is faster if you have cores to spare)
gzip "$BACKUP_DIR/full_dump_$DATE.sql"
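Dumps only help if they actually run. A minimal cron sketch, assuming the script above is saved as /usr/local/bin/db_backup.sh (path and schedule are examples, not gospel) and should fire every night at 02:30:

# /etc/cron.d/db-backup — nightly logical dump at 02:30
30 2 * * * root /usr/local/bin/db_backup.sh >> /var/log/db_backup.log 2>&1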

2. The Transport Layer: Rsync over SSH

FTP is dead. Do not use it. We use rsync over SSH. It allows for incremental backups, meaning you only transfer the changes, saving massive amounts of bandwidth and time.

First, set up SSH key authentication between your CoolVDS instance and your backup server so the script can run without a password prompt.

ssh-keygen -t rsa -b 4096
ssh-copy-id user@backup-server.example.com
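Since that key grants password-less access, it is worth restricting it on the backup server. A minimal hardening sketch for the corresponding authorized_keys entry — the source IP below is a placeholder, adjust to your instance:

# ~/.ssh/authorized_keys on the backup server (source IP is an example)
from="203.0.113.10",no-pty,no-agent-forwarding,no-X11-forwarding ssh-rsa AAAA...truncated... user@coolvds-instance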

Now, the synchronization script:

#!/bin/bash

SOURCE_DIR="/var/www/html"
BACKUP_DEST="user@backup-server.example.com:/home/backups/web"
LOG_FILE="/var/log/backup_job.log"

echo "Starting backup at $(date)" >> $LOG_FILE

# -a: archive mode (preserves permissions, times, owners)
# -v: verbose
# -z: compress during transfer
# --delete: remove files in destination that no longer exist in source

# Note: no trailing slash on SOURCE_DIR, so rsync recreates the "html" directory itself inside the destination
rsync -avz --delete -e "ssh -p 22" "$SOURCE_DIR" "$BACKUP_DEST" >> "$LOG_FILE" 2>&1

if [ $? -eq 0 ]; then
    echo "Backup Success" >> $LOG_FILE
else
    echo "Backup FAILED" >> $LOG_FILE
    # In 2017, mail is still king for alerts. 
    echo "Backup Failed" | mail -s "CRITICAL: Backup Failed" admin@example.com
fi
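Incremental syncs keep bandwidth low, but the compressed dumps still pile up on local disk. A small retention sketch, assuming the /backup/mysql directory from the first script and a 14-day window (size the window to your RPO and disk budget):

# Prune local dumps older than 14 days
find /backup/mysql -name "full_dump_*.sql.gz" -type f -mtime +14 -delete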

The "CoolVDS" Factor: Why Infrastructure Choice Impacts Recovery

Your scripts are only as fast as the I/O they run on. In a disaster scenario, RTO is usually bottlenecked by disk speed.

Many VPS providers in Europe still run on shared SATA SSDs or even cached HDD arrays. If you need to restore 500GB of data, SATA speeds will keep you offline for hours. This is why CoolVDS standardized on NVMe storage for all instances. An NVMe drive can handle parallel read/write operations significantly faster than SATA, slashing restoration times by up to 60%.

Furthermore, we use KVM (Kernel-based Virtual Machine). Unlike OpenVZ, where resources are shared and a "noisy neighbor" can steal your CPU cycles during a restore operation, KVM provides strict resource isolation. When you are restoring a backup, you need every cycle of CPU and every IOP of disk performance you paid for.

The Norwegian Context: Data Sovereignty

With the invalidation of Safe Harbor and the skepticism surrounding the Privacy Shield framework, relying on US-based cloud storage for your backups is a legal minefield. The Norwegian Data Protection Authority (Datatilsynet) is clear about the responsibilities of data controllers.

Hosting your primary infrastructure and your disaster recovery site within Norway—or at least within the EEA—is the safest bet for compliance. Latency is also a factor; ping times from Oslo to a backup server in Frankfurt are acceptable (~20ms), but keeping traffic local to NIX (Norwegian Internet Exchange) ensures maximum throughput and minimum latency.
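If you want to verify those figures against your own backup target rather than trust the averages, a quick round-trip check from the instance is enough (the hostname is a placeholder):

# Measure round-trip latency to the backup target
ping -c 10 backup-server.example.com

# Or inspect the path hop by hop
mtr --report --report-cycles 10 backup-server.example.com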

Testing: The Missing Step

A backup is not a backup until you have successfully restored from it. Schedule a "Fire Drill" once a quarter. Spin up a fresh CoolVDS instance and attempt to rebuild your production environment using only your offsite backups.

Scenario        | Action Plan                    | Expected Downtime
Corrupted File  | Restore single file from rsync | < 5 Minutes
Database Drop   | Import mysqldump to fresh DB   | 15-30 Minutes
Server Failure  | Redeploy App + Restore Data    | 1-4 Hours
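The drill itself can reuse the same tooling in reverse. A minimal restore sketch, assuming you also ship the compressed dumps to /home/backups/mysql on the backup host (the filename below is an example):

# Pull the latest dump back and load it into a fresh MySQL instance
scp user@backup-server.example.com:/home/backups/mysql/full_dump_2017-03-01_02-30.sql.gz /tmp/
gunzip /tmp/full_dump_2017-03-01_02-30.sql.gz
mysql -u root -p < /tmp/full_dump_2017-03-01_02-30.sql

# Restore the web root in the other direction (note the trailing slashes)
rsync -avz -e "ssh -p 22" user@backup-server.example.com:/home/backups/web/html/ /var/www/html/

Time each scenario against the table above; if your real numbers drift past the targets, your plan has rotted and needs revisiting.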

Conclusion

Disaster recovery isn't about pessimism; it's about professionalism. Hardware fails. Humans make typos. Updates break dependencies. By automating your backups and choosing a provider that offers the raw I/O performance of NVMe, you turn a potential catastrophe into a minor inconvenience.

Don't wait for the inevitable kernel panic. Deploy a high-performance NVMe KVM instance on CoolVDS today and build your safety net before you fall.