
Disaster Recovery in 2019: Why Your RAID Array Won't Save You From `rm -rf`

Hope Is Not A Strategy: Architecting Resilience in the Nordic Cloud

I once watched a senior developer turn a production database into an empty directory at 3:00 AM. It wasn't a hardware failure. It wasn't a power outage at the datacenter. It was a typo. One specific command executed on the wrong terminal window. At that moment, our RAID 10 array mirrored that deletion to four drives instantly. The redundancy worked perfectly—it redundantly destroyed our data.

If your Disaster Recovery (DR) plan relies solely on your hosting provider's hardware uptime, you do not have a DR plan. You have a gambling habit.

As we head into 2019, the landscape of Nordic hosting is shifting. With the implementation of GDPR last year and the increasing scrutiny of the US CLOUD Act, storing data physically in Norway isn't just about latency anymore—it's about survival. Here is how we architect true resilience using KVM isolation and battle-tested Linux utilities.

The RTO/RPO Equation

Before we touch a single configuration file, we need to define the acceptable thresholds. Management likes to say "zero downtime," but unless you have the budget of a state-owned oil company, that is a lie.

  • RPO (Recovery Point Objective): How much data can you afford to lose? Measured in time, e.g. "the last 15 minutes of orders."
  • RTO (Recovery Time Objective): How long can you afford to stay offline before service is restored? Also measured in time.

If you are running a high-traffic Magento store targeting Oslo, an RTO of 4 hours is business suicide. You need a warm standby.
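
One way to keep the RPO honest is to measure it. The sketch below is an illustration rather than production tooling (the backup path and the 15-minute threshold are placeholders), but the idea is simple: alert whenever the newest backup is older than the RPO you agreed on.

#!/bin/bash
# rpo-check.sh - warn when the newest backup is older than the agreed RPO.
# BACKUP_DIR and RPO_SECONDS are illustrative placeholders.

BACKUP_DIR="/backup/mysql"
RPO_SECONDS=900   # 15 minutes

# Epoch timestamp of the most recently modified file under BACKUP_DIR
latest=$(find "$BACKUP_DIR" -type f -printf '%T@\n' | sort -n | tail -1)

if [ -z "$latest" ]; then
    echo "RPO breach: no backups found in $BACKUP_DIR"
    exit 1
fi

age=$(( $(date +%s) - ${latest%.*} ))

if [ "$age" -gt "$RPO_SECONDS" ]; then
    echo "RPO breach: newest backup is ${age}s old (limit ${RPO_SECONDS}s)"
    exit 1
fi

echo "RPO OK: newest backup is ${age}s old"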

The Architecture: Primary vs. Standby

We utilize CoolVDS KVM instances for this. Why KVM? Because container-based virtualization (like OpenVZ) shares the host kernel. If the host kernel panics, your "isolated" container dies with it. KVM (Kernel-based Virtual Machine) gives us a true hardware abstraction layer. If a neighbor creates a fork bomb, your dedicated resources protect your uptime.
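
Trust, but verify: neither of the checks below assumes anything provider-specific. systemd-detect-virt reports which virtualization technology you are actually running on, and the st (steal) column in vmstat should sit at or near zero on a properly isolated KVM instance.

# Should print "kvm" on a genuine KVM instance
systemd-detect-virt

# Sample CPU statistics five times at 5-second intervals;
# watch the "st" (steal) column in the output
vmstat 5 5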

1. The Filesystem Layer

Forget FTP. We use rsync over SSH with strict host key checking. The goal is to transfer only the changed data, not whole files. This is critical when moving gigabytes of assets between data centers.

Here is the robust script we use to sync web assets from the Primary Node (Oslo) to the DR Node (Stavanger or remote backup):

#!/bin/bash
# /usr/local/bin/sync-dr.sh

SRC="/var/www/html/"
DEST="user@dr-node-ip:/var/www/html/"
LOG="/var/log/dr-sync.log"

# -a: archive mode (preserves permissions, owners, groups)
# -v: verbose
# -z: compress during transfer (saves bandwidth on external links)
# --delete: deletes files in DR that are gone in Primary (Be careful!)

echo "Starting sync at $(date)" >> "$LOG"

rsync -avz --delete -e "ssh -i /home/backup/.ssh/id_rsa_dr" \
    --bwlimit=5000 \
    --exclude 'var/cache/*' \
    --exclude 'var/log/*' \
    "$SRC" "$DEST" >> "$LOG" 2>&1

if [ $? -eq 0 ]; then
    echo "Sync Success" >> "$LOG"
else
    echo "Sync CRITICAL FAILURE" >> "$LOG"
    # Insert alert logic here (mail/nagios)
fi

Pro Tip: Notice the --bwlimit=5000. rsync reads that value in units of 1024 bytes per second, so this caps the transfer at roughly 5 MB/s. Never let your backup process saturate your network interface. There is nothing worse than your site timing out because your backup is too aggressive. On CoolVDS NVMe instances, disk I/O is rarely the bottleneck, but network throughput always has a ceiling.
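
Two finishing touches, both assumptions to adapt rather than part of the script above: schedule the sync at an interval that matches your RPO, and enforce the strict host key checking mentioned earlier by pinning the DR node's key.

# /etc/cron.d/dr-sync -- run the sync every 15 minutes as the backup user
*/15 * * * * backup /usr/local/bin/sync-dr.sh

# One-time: pin the DR node's host key...
#   ssh-keyscan dr-node-ip >> /home/backup/.ssh/known_hosts
# ...then make the ssh used by rsync refuse anything that does not match:
#   -e "ssh -i /home/backup/.ssh/id_rsa_dr -o StrictHostKeyChecking=yes"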

2. The Database Layer: Consistency is King

You cannot simply copy /var/lib/mysql while the database is running. You will get corrupted tables. For MySQL/MariaDB in 2019, mysqldump is too slow for large datasets (restoration takes forever). We use Percona XtraBackup.
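
A minimal XtraBackup run looks roughly like this (XtraBackup 2.4 syntax; the target directory and credentials are placeholders):

# Hot backup of a running server; InnoDB tables are copied without long locks
TARGET="/backup/mysql/$(date +%F)"
xtrabackup --backup --user=backup --password='secret' --target-dir="$TARGET"

# Apply the redo log so the copy is consistent and ready to restore
xtrabackup --prepare --target-dir="$TARGET"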

However, for real-time DR, Master-Slave replication is the standard. The catch is that replication faithfully propagates human error: an accidental DROP TABLE hits the slave within milliseconds, just as surely as the rm -rf from the intro wiped our RAID array. Therefore, we also need Point-in-Time Recovery (PITR) capabilities.

Configuration for Binary Logs

Ensure your my.cnf is configured to keep binary logs. This allows you to replay transactions up to the second before the disaster occurred.

[mysqld]
server-id                = 1
log_bin                  = /var/log/mysql/mysql-bin.log
expire_logs_days         = 7
max_binlog_size          = 100M
binlog_format            = ROW
# Essential for data integrity on crash
innodb_flush_log_at_trx_commit = 1
sync_binlog              = 1

Setting innodb_flush_log_at_trx_commit = 1 forces InnoDB to flush its redo log to disk on every transaction commit. Yes, this introduces latency. This is why we insist on the NVMe storage found in CoolVDS packages. Spinning rust (HDD) simply cannot handle the IOPS required for ACID compliance on a busy transactional DB.
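
Those binary logs are what make PITR possible. The replay below is only a sketch (the binlog file name and timestamp are placeholders): in practice you restore the latest full backup first, then feed the binary logs back in, stopping just before the destructive statement.

# Replay transactions up to 03:04:00, the second before the disaster
mysqlbinlog --stop-datetime="2019-03-12 03:04:00" \
    /var/log/mysql/mysql-bin.000042 \
    | mysql -u root -p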

3. The "Norwegian" Factor: Legal & Latency

Why host the DR node in Norway (or at least Northern Europe)?

  1. Latency: If your primary goes down and you failover to a server in the US, your latency jumps from 15ms to 150ms. The user experience degradation is noticeable.
  2. Datatilsynet & GDPR: Under the current interpretation of GDPR, moving user data outside the EEA requires strict legal frameworks (Privacy Shield is currently valid but under heavy fire). Keeping data inside Norway simplifies your compliance posture immensely.

Automating the Failover with Ansible

Manual failover is prone to panic-induced errors. We use Ansible (v2.7) to flip the DNS and promote the slave database. Here is a snippet of a playbook to promote a slave to master:

---
- name: Promote DR Database to Master
  hosts: dr_db
  become: yes
  tasks:
    - name: Stop Slave Replication
      mysql_replication:
        mode: stopslave

    - name: Clear Slave Configuration
      command: mysql -e "RESET SLAVE ALL;"

    - name: Ensure Write Access is Enabled
      lineinfile:
        path: /etc/mysql/my.cnf
        regexp: '^read_only'
        line: 'read_only = 0'
      notify: restart mysql

  handlers:
    - name: restart mysql
      service: name=mysql state=restarted
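
During an incident this is a single command (the inventory path and playbook file name here are assumptions):

ansible-playbook -i inventory/production promote-dr.yml

The handler restarts MySQL to pick up the config change. If even that brief restart is unacceptable, an additional task running SET GLOBAL read_only = 0 applies the change to the running server immediately, while the config edit keeps it persistent across reboots.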

Testing the Fire Drill

A DR plan that hasn't been tested is a theoretical hallucination. Every quarter, we schedule a maintenance window. We define a "fake disaster" and execute the failover.

In our last test on the CoolVDS platform, we managed to switch a high-traffic Joomla site from the primary node to the DR node in 4 minutes and 12 seconds. This speed is only possible because the underlying infrastructure—specifically the KVM hypervisor—allocates resources instantly without the "cpu steal" typical of budget VPS providers.
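
One way to get that number honestly, rather than from a stopwatch, is to run a simple availability probe during the drill. The sketch below (the URL and two-second interval are placeholders) timestamps every check, so the exact outage window can be read straight out of its log:

#!/bin/bash
# downtime-probe.sh - poll the site during the failover drill and
# timestamp every result, so the outage window is visible in the log.
URL="https://www.example.no/healthcheck"   # placeholder endpoint

while true; do
    if curl -fsS --max-time 5 -o /dev/null "$URL"; then
        echo "$(date -Is) UP"
    else
        echo "$(date -Is) DOWN"
    fi
    sleep 2
done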

Conclusion

Disaster recovery is about paranoia managed by engineering. You need to assume the worst: your primary server will burn, your root password will be compromised, or a developer will make a catastrophic error.

By leveraging tools like rsync, binary logs, and KVM virtualization, you can build a fortress around your data. But remember, the software is only as fast as the hardware it runs on. Slow I/O is the enemy of fast recovery.

Don't wait for the kernel panic to realize your backups are three days old. Spin up a secondary CoolVDS NVMe instance today, configure your private networking, and sleep better knowing your data is safe on Norwegian soil.