Disaster Recovery in 2017: Why Your "Backups" Will Probably Fail When You Need Them Most

Let’s address the elephant in the server room. On January 31st—barely five weeks ago—GitLab.com suffered a catastrophic database failure. An exhausted sysadmin accidentally ran `rm -rf` on the production database directory instead of the standby. Five distinct backup mechanisms failed. Five. If a tech giant can lose 300GB of production data in a blink, what chance does your standard LEMP stack have?

I’ve spent the last decade staring at terminal screens across Europe, and I’ve learned one immutable truth: Backups are irrelevant. Recovery is everything.

Nobody cares if the backup cron job ran successfully. They care if the business can be back online before the CEO starts throwing chairs. In this guide, we aren't talking about theory. We are talking about raw, scriptable, battle-tested recovery strategies for the Norwegian market, keeping the looming 2018 GDPR enforcement and the current Datatilsynet strictness in mind.

The "I/O Choke" Problem

Most recovery plans fail because of physics. You calculated that you have 500GB of data. You have a 1Gbps uplink. Math says you can download it in an hour. Wrong.

The bottleneck isn't the network; it's the disk write speed on the receiving end. If you are restoring a MySQL database to a cheap VPS with spinning rust or shared SATA SSDs, your write IOPS will hit a wall. I've seen database imports crawl at 5MB/s because the underlying storage couldn't handle the random writes of rebuilding indexes.

Pro Tip: Always calculate your RTO (Recovery Time Objective) based on the target hardware's write IOPS. This is why we standardized on NVMe storage at CoolVDS. Restoring a 50GB InnoDB table on NVMe takes minutes. On standard SSDs, it can take an hour. On HDDs? You might as well go home for the day.
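
A quick back-of-the-envelope calculation makes the point concrete. The throughput figures in this sketch are illustrative assumptions, not benchmarks:

#!/bin/bash
# Rough RTO estimate: dump size divided by effective import throughput.
# Index rebuilds and random writes keep the effective figure far below
# the disk's headline sequential speed.
DUMP_SIZE_MB=51200          # ~50GB dump
EFFECTIVE_WRITE_MBPS=15     # assumed import speed on a busy SATA SSD
echo "Estimated restore: $(( DUMP_SIZE_MB / EFFECTIVE_WRITE_MBPS / 60 )) minutes"
# Roughly 56 minutes; swap in 300 for NVMe-class throughput and it lands in the 2-3 minute range.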

The Architecture of Survival

For a robust setup targeting Norwegian users (latency matters), you need a primary node in Oslo and a disaster recovery (DR) node that is geographically separated but low-latency enough for rapid sync.

1. Database Consistency is King

Stop using `cp -r /var/lib/mysql`. On a running server it produces an inconsistent copy that InnoDB may refuse to start from. You need a consistent snapshot. If you are running MySQL 5.7 or MariaDB 10.1 (which you should be on CentOS 7 or Ubuntu 16.04), `mysqldump` is your first line of defense, but only if you use the right flags.

Here is the script I use for hot backups that don't lock your tables:

#!/bin/bash
# /usr/local/bin/backup-db.sh

TIMESTAMP=$(date +"%F")
BACKUP_DIR="/backup/mysql"
MYSQL_USER="root"
MYSQL_PASS="YourSecurePassword"

# Ensure dir exists
mkdir -p "$BACKUP_DIR"

# The magic flags:
# --single-transaction: consistent backup for InnoDB without locking tables
# --quick: fetch rows one at a time instead of buffering whole tables in memory
# --routines --triggers: don't forget your stored procs!

mysqldump -u"$MYSQL_USER" -p"$MYSQL_PASS" --all-databases \
  --single-transaction \
  --quick \
  --routines \
  --triggers \
  | gzip > "$BACKUP_DIR/full-backup-$TIMESTAMP.sql.gz"

# Rotation: Delete backups older than 7 days
find "$BACKUP_DIR" -type f -name "*.sql.gz" -mtime +7 -exec rm {} \;
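
Hook the script into cron so it runs without anyone having to remember it. A minimal entry, assuming the script lives at /usr/local/bin/backup-db.sh as in the header comment (the 02:30 schedule is just an example):

# /etc/cron.d/mysql-backup
30 2 * * * root /usr/local/bin/backup-db.sh >> /var/log/mysql-backup.log 2>&1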

2. The Filesystem Sync (Rsync over SSH)

For your web files, `rsync` is still the undisputed champion. It’s efficient, it’s standard, and it handles permissions correctly. Do not use FTP. It’s 2017, not 1999.

Run this from your Backup Server (pull model is safer than push model—if the main server gets hacked, the hacker can't wipe the backups).

#!/bin/bash
# Pull data from Production to DR site

SOURCE_USER="backup-user"
SOURCE_HOST="10.0.0.5" # Your Private IP within the CoolVDS VLAN
SOURCE_DIR="/var/www/html/"
DEST_DIR="/backup/www/"

rsync -avz --delete \
  --exclude 'var/cache' \
  --exclude 'var/log' \
  -e "ssh -i /home/backup-user/.ssh/id_rsa_backup" \
  "$SOURCE_USER@$SOURCE_HOST:$SOURCE_DIR" "$DEST_DIR"
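
The pull model only pays off if the key sitting on the production box cannot be abused. One way to lock it down is rrsync, a helper script distributed with rsync (packaged under /usr/share/doc/rsync/scripts/ on Debian-family systems; copy it somewhere executable first). A sketch, assuming the backup server's VLAN IP is 10.0.0.10:

# /home/backup-user/.ssh/authorized_keys on the production server:
# the key may only run read-only rsync of the web root, and only from the backup server
from="10.0.0.10",command="/usr/local/bin/rrsync -ro /var/www/html/",no-pty,no-agent-forwarding,no-port-forwarding,no-X11-forwarding ssh-rsa AAAA... backup-pull-key

One side effect: with rrsync pinning the root, the SOURCE_DIR in the pull script above becomes /, because paths are resolved relative to the restricted directory.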

3. Offsite Replication & Data Sovereignty

Since the invalidation of Safe Harbor, transferring user data outside the EEA (European Economic Area) is a legal minefield. If you are hosting Norwegian medical or financial data, putting your backups on AWS S3 in `us-east-1` is a compliance violation waiting to happen.

The solution? Keep it in the country or at least within the EEA. We often see clients setting up a secondary CoolVDS instance in a different availability zone. This ensures low latency (sub-10ms) for replication traffic while satisfying the data residency requirements of Datatilsynet.
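
If you run a warm DR node like that, native asynchronous replication over the private VLAN is the least painful way to keep it current. A minimal MySQL 5.7 sketch; the server IDs, the repl user, the 10.0.0.x addresses and the binlog coordinates (taken from SHOW MASTER STATUS on the primary) are all illustrative:

# /etc/my.cnf on the primary (10.0.0.5)
[mysqld]
server-id     = 1
log_bin       = mysql-bin
binlog_format = ROW

# /etc/my.cnf on the DR node (e.g. 10.0.0.6)
[mysqld]
server-id = 2
relay_log = relay-bin
read_only = 1

-- On the DR node, after creating a 'repl' user with REPLICATION SLAVE on the primary:
CHANGE MASTER TO
  MASTER_HOST='10.0.0.5',
  MASTER_USER='repl',
  MASTER_PASSWORD='UseAStrongOne',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=154;
START SLAVE;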

Testing Your Restore Speed (Benchmark)

You need to know how fast your disk can write. Use `dd` or `fio` to test. Here is a quick sanity check you should run on your backup destination right now:

# Test Write Speed (Bypass Buffer Cache)
dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct

If you see anything less than 300 MB/s, your recovery will be painful. On our NVMe-backed instances, we typically see speeds exceeding 1.2 GB/s. That’s the difference between a 10-minute downtime and a 2-hour outage.
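
Keep in mind that `dd` measures a sequential stream, while an InnoDB import is dominated by random 16K page writes as indexes rebuild. `fio` gives a more honest picture; the profile below is a common random-write sanity check, not a definitive benchmark:

# Simulate restore-style random writes (16K matches the InnoDB page size)
fio --name=restore-sim --ioengine=libaio --direct=1 --rw=randwrite \
    --bs=16k --size=2G --runtime=60 --group_reporting

Watch the IOPS figure in the output rather than the bandwidth; that is the number that decides how long your index rebuilds take.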

The "CoolVDS" Factor: KVM vs. Containers

In 2017, everyone is talking about containers. Docker is great for deployment, but for data persistence, I still trust full virtualization. OpenVZ and LXC containers share the host kernel, and with it the same I/O scheduler and memory limits. If a neighbor on the host node gets DDoS'd, your I/O suffers.

This is why we strictly use KVM (Kernel-based Virtual Machine). It provides hardware-level virtualization. When you are restoring a database, you need guaranteed CPU cycles and RAM. You cannot afford to have your `mysql` process killed by the host's OOM killer because another container spiked its usage.

Sample Recovery Configuration

When bringing up a recovery node, you often need to tune the kernel to handle the sudden influx of traffic. Add these to your `/etc/sysctl.conf` on the backup node so it's ready to take the load:

# Improve handling of frequent connections
net.ipv4.tcp_tw_reuse = 1
net.core.somaxconn = 4096

# Protect against SYN flood during recovery confusion
net.ipv4.tcp_syncookies = 1

# Increase backlog for incoming packets
net.core.netdev_max_backlog = 2500
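
Apply the changes without a reboot and confirm they stuck:

sysctl -p /etc/sysctl.conf
sysctl net.core.somaxconn net.ipv4.tcp_tw_reuse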

Conclusion: Don't Be GitLab

The GitLab incident taught us that redundancy layers (snapshots, disk mirroring, SQL dumps) can all fail if they aren't isolated and tested. A proper disaster recovery plan involves:

  1. Automated, consistent dumps (not just file copies).
  2. Offsite replication over a secure, private network.
  3. High-performance storage (NVMe) to ensure RTO is met.
  4. Regular fire drills where you actually restore the data.
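
Point 4 is the one everyone skips. A bare-bones drill against a scratch instance looks like this (the database and table names are placeholders):

#!/bin/bash
# Restore the newest dump into a throwaway MySQL instance and spot-check it
LATEST=$(ls -t /backup/mysql/full-backup-*.sql.gz | head -n 1)
gunzip < "$LATEST" | mysql -u root -p
# Compare row counts for a table you know well against production
mysql -u root -p -e "SELECT COUNT(*) FROM shop.orders"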

The Norwegian market demands stability. Latency to NIX (Norwegian Internet Exchange) matters, but data integrity matters more. If you are tired of wondering if your host's "backup solution" actually works, it might be time to take control of your own infrastructure.

Spin up a high-performance KVM instance on CoolVDS today. With our pure NVMe storage and local peering, you can replicate your data faster than you can say "kernel panic."