Disaster Recovery Protocols: Beyond `rsync` and Prayers

When the Smoke Clears: Engineering Resilience in 2020

It is not a matter of if your primary node fails. It is a matter of when. I have seen solid-state drives trigger firmware bugs that wipe partition tables in milliseconds. I have watched datacenter fire suppression systems trigger accidentally, screaming louder than a jet engine while vibrations destroy hard drives. If your disaster recovery (DR) plan relies on a manual playbook or, worse, a "trust me" SLA from a budget provider, you are already offline.

March 2020 has taught us that stability is fragile. With half the world shifting to remote work, the load on infrastructure is unprecedented. Here is the cold reality: RAID is not backup. Snapshots are not a DR strategy. Availability Zones (AZs) share dependencies. To survive a true meltdown, you need a decoupled, immutable, and verified recovery path.

The Sovereignty Variable: Why Location Matters

Before we touch a single config file, look at your map. If you are serving customers in Oslo, Bergen, or Trondheim, your backup strategy has a legal dimension. With the GDPR firmly in effect and the validity of the EU-US Privacy Shield looking increasingly shaky in the courts, keeping data within Norwegian borders is not just about latency—it is about survival.

Pro Tip: Data sovereignty isn't just a buzzword for the Datatilsynet. It affects your Recovery Time Objective (RTO). Pulling 500GB of backup archives from an AWS bucket in Virginia to a bare-metal server in Oslo introduces massive latency penalties. Recovering from a local CoolVDS instance via the NIX (Norwegian Internet Exchange) creates a local loop that saturates your port speed, not your patience.
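
Don't take the latency argument on faith: measure raw throughput to your backup target before you need it. A minimal sketch with iperf3, where the hostnames are placeholders for your own endpoints:

# On the backup node: start a temporary listener
iperf3 -s

# From the production server: measure sustained throughput for 30 seconds
iperf3 -c backup-node.coolvds.net -t 30

# Back-of-envelope: 500GB at ~200 Mbit/s transatlantic is roughly 5.5 hours;
# over a 1 Gbit/s local loop it is closer to an hour.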

The Storage Layer: ZFS and NVMe

In 2020, spinning rust (HDDs) should only be used for cold archival. For active recovery, we need NVMe. Why? Because when you are restoring a MySQL database, your bottleneck is almost always I/O Wait. On standard SSDs, restoring a 50GB dump can take 45 minutes. On the NVMe arrays we utilize at CoolVDS, we see that drop to under 8 minutes.
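
If you want to see where your own restores bottleneck, watch the disk while a test restore runs. A quick sketch using iostat from the sysstat package:

# Install sysstat if it is missing, then sample extended disk stats every 5 seconds
sudo apt-get install -y sysstat
iostat -xm 5

# Columns worth watching during a restore: %util (device saturation) and
# await (average ms per I/O). Pegged %util with high await means you are
# I/O bound; a faster CPU will not help, faster storage will.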

However, hardware fails. That is why we rely on file-system level snapshots. If you aren't using ZFS or Btrfs on your storage nodes, you are working too hard. Here is how we verify data integrity without stopping the world.
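
A minimal sketch, assuming a ZFS pool named `tank` with the web root on a dataset `tank/www`, and a pool named `backup` on the receiving node (adjust names to your own layout):

# Take an atomic, point-in-time snapshot (instant, no downtime)
zfs snapshot tank/www@nightly

# Walk every block on the pool and verify checksums while it stays online
zpool scrub tank
zpool status tank    # scrub progress and any checksum errors

# Ship the snapshot to a second node for an off-box copy
zfs send tank/www@nightly | ssh user@backup-node.coolvds.net zfs receive backup/www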

Automating Offsite Backups with Borg

Stop writing custom shell scripts that tarball your root directory. They fail silently. In 2020, the standard for deduplicated, encrypted, and authenticated backups is BorgBackup. It handles sparse files efficiently and encrypts client-side.

# Install Borg on Ubuntu 18.04 LTS
sudo apt-get update && sudo apt-get install borgbackup

# Initialize the encrypted repo (Do this ONCE)
borg init --encryption=repokey user@backup-node.coolvds.net:/var/backups/repo

Once initialized, your nightly cron job should look like this. Note the exclusion patterns to avoid backing up garbage.

#!/bin/bash
# /usr/local/bin/run-backup.sh

export BORG_PASSPHRASE='CorrectHorseBatteryStaple'

# Backup everything except caches and temp files
borg create --stats --progress --compression lz4 \
    user@backup-node.coolvds.net:/var/backups/repo::{hostname}-{now} \
    /etc /home /var/www \
    --exclude '/var/cache' \
    --exclude '/var/tmp'

# Prune old backups for this host only (Keep 7 dailies, 4 weeklies, 6 monthlies)
borg prune -v --list --prefix '{hostname}-' \
    --keep-daily=7 --keep-weekly=4 --keep-monthly=6 \
    user@backup-node.coolvds.net:/var/backups/repo
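
Scheduling it and verifying that archives actually land is the part most teams skip. A sketch, assuming the script above is saved as `/usr/local/bin/run-backup.sh`:

# /etc/cron.d/borg-backup  (nightly at 02:30, output kept for later review)
30 2 * * * root /usr/local/bin/run-backup.sh >> /var/log/borg-backup.log 2>&1

Pair it with a periodic `borg check` against the repository and a `borg list` after each run, so a silently failing job shows up as a missing archive instead of a failed restore six months later.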

Database Consistency: The Silent Killer

File backups are useless if your database is in an inconsistent state during the copy. I see this constantly: a developer runs `rsync` on `/var/lib/mysql` while the server is running. Congratulations, you have backed up a corrupted table.

For MySQL/MariaDB in a high-availability environment, you need Percona XtraBackup or a properly flagged `mysqldump`. If you are running a standard LAMP stack on a CoolVDS instance, here is the command that ensures you don't lock your tables and kill your production site during backup:

mysqldump --single-transaction --quick --lock-tables=false \
    -u root -pWrapperSecret production_db | gzip > /tmp/db_backup_$(date +%F).sql.gz

The `--single-transaction` flag is critical for InnoDB tables. It ensures the backup reflects the database state at the start of the dump, regardless of incoming writes.
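
The dump is only half the story; rehearse the restore path too. A minimal sketch, reusing the placeholder credentials from above and substituting whichever dump file you are restoring:

# Create the target database if it is missing, then stream the dump back in
mysql -u root -pWrapperSecret -e "CREATE DATABASE IF NOT EXISTS production_db"
gunzip < /tmp/db_backup_2020-03-20.sql.gz | mysql -u root -pWrapperSecret production_db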

Infrastructure as Code: Recovery is Deployment

If your server vanishes today, how long until you are back online? If you are manually installing Nginx and PHP via SSH, you have failed. Disaster Recovery in 2020 means Infrastructure as Code (IaC).

You should have an Ansible playbook that can configure a fresh CoolVDS node from scratch in under 5 minutes. Here is a snippet of a playbook ensuring your web server is configured exactly as it was before the crash:

---
- hosts: webservers
  become: yes
  vars:
    http_port: 80
    max_clients: 200

  tasks:
    - name: Ensure Nginx is at the latest version
      apt:
        name: nginx
        state: latest
        update_cache: yes

    - name: Write Nginx Configuration
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        owner: root
        group: root
        mode: '0644'
      notify:
        - restart nginx

  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted
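
Running it against a fresh node is one command, assuming a hypothetical inventory file `inventory.ini` that lists the new instance under `webservers` and the playbook saved as `recover-webserver.yml`:

# Dry-run first to see what would change, then apply for real
ansible-playbook -i inventory.ini --check recover-webserver.yml
ansible-playbook -i inventory.ini recover-webserver.yml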

The Architecture of Trust

We built CoolVDS on top of KVM (Kernel-based Virtual Machine) for a specific reason. Unlike container-based virtualization (LXC/OpenVZ) where you share the kernel with every other tenant on the host, KVM provides true hardware virtualization. If a "noisy neighbor" kernel panics, your instance keeps running. This isolation is mandatory for any serious DR plan.
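
You can verify from inside a guest what you are actually running on; on any systemd-based distribution it is a single command:

# Prints "kvm" on a KVM guest; container platforms report "lxc", "openvz", etc.
systemd-detect-virt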

Furthermore, network latency within Norway matters. Routing traffic from Oslo to a backup server in Frankfurt adds 20-30ms. Routing it to another datacenter in Norway is often sub-2ms. When you are syncing terabytes of data, that latency compounds into hours of difference in recovery time.
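
Again, measure rather than assume. mtr reports per-hop latency and loss in one pass (the hostname is a placeholder for your backup target):

# 100 probes in report mode; check the final hop's average latency and loss
mtr --report --report-cycles 100 backup-node.coolvds.net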

Testing the Untestable

A backup is not a backup until you have restored it. Schedule a "Fire Drill" once a quarter. Spin up a fresh CoolVDS instance (it takes about 55 seconds), deploy your Ansible playbooks, and restore your Borg backup. If the application doesn't load, your backup strategy is theoretical, not practical.
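
The drill itself can be scripted end to end, so there is no excuse to skip it. A rough sketch, assuming the Borg repository and playbook from earlier plus a hypothetical /healthz endpoint on the restored application:

#!/bin/bash
# /usr/local/bin/fire-drill.sh - restore onto a scratch instance and smoke-test it
set -euo pipefail

export BORG_PASSPHRASE='CorrectHorseBatteryStaple'
REPO='user@backup-node.coolvds.net:/var/backups/repo'

# 1. Pick the most recent archive in the repository
LATEST=$(borg list --short "$REPO" | tail -n 1)

# 2. Restore it into a scratch directory
mkdir -p /srv/restore-test && cd /srv/restore-test
borg extract "$REPO::$LATEST"

# 3. Re-apply configuration with the same playbook used in production
ansible-playbook -i inventory.ini recover-webserver.yml

# 4. Smoke test: the drill fails loudly if the application does not answer
curl --fail --max-time 10 http://localhost/healthz
echo "Fire drill passed: $LATEST restored and serving."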

Disaster recovery is about paranoia managed by process. Don't wait for the hardware to fail. Assume it already has.

Ready to harden your infrastructure? Deploy a KVM-backed, NVMe-powered instance on CoolVDS today and secure your data sovereignty.