Disaster Recovery in the North: Why 99.9% Uptime SLAs Won't Save Your Data

Let’s get one thing straight immediately: availability is not durability. Your cloud provider's 99.9% SLA guarantees the power stays on and the network routing works. It does not guarantee that your junior developer won't accidentally drop a production table at 16:45 on a Friday. It does not guarantee that a filesystem corruption won't silently eat your inodes. And as we learned from the Strasbourg datacenter fire in 2021, it certainly doesn't guarantee that the physical server won't turn into a pile of melted silicon.

I have spent the last decade managing infrastructure across the Nordics, and I have seen grown CTOs weep because their "backup strategy" was a single snapshot stored on the same SAN as the production disk. If you are running mission-critical workloads in Norway, you need a plan that assumes failure is inevitable.

The Norwegian Context: Latency and Law

In 2022, we operate under the shadow of Schrems II. If you are dumping your Norwegian customer data into a US-owned cloud bucket for disaster recovery, you are inviting a conversation with Datatilsynet (The Norwegian Data Protection Authority) that you do not want to have. Data sovereignty matters.

Furthermore, there is the physics of recovery. Recovery Time Objective (RTO) is directly correlated to bandwidth and I/O. If you need to restore 500GB of data from a cold storage glacier in Dublin to a server in Oslo, you are fighting physics. Keeping your DR site within the Norwegian borders—connected via NIX (Norwegian Internet Exchange)—drastically reduces the time it takes to hydrate a new environment.
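A back-of-the-envelope calculation makes the point. The numbers below are illustrative assumptions, not benchmarks, but this arithmetic is what your RTO ultimately hangs on:

#!/bin/bash
# Best-case transfer time = data size / effective throughput
# (ignores decompression, WAL replay, and index rebuilds on top)
SIZE_GB=500              # amount of data to restore
THROUGHPUT_MBIT=1000     # effective link speed in megabits per second

SECONDS_NEEDED=$(( SIZE_GB * 8 * 1000 / THROUGHPUT_MBIT ))
echo "Pure transfer time: $(( SECONDS_NEEDED / 60 )) minutes"
# 500 GB over a 1 Gbit/s link is ~66 minutes before the database is even usable.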

The "3-2-1" Rule is Dead. Long Live "3-2-1-1-0".

The old school rule was: 3 copies, 2 media types, 1 offsite. In modern DevOps, we extend this:

  • 1 copy on immutable storage (ransomware protection).
  • 0 errors on recoverability verification (automated restore tests).
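That last point is the one most teams skip, so here is a minimal sketch of an automated restore test. It assumes backups land as dated tarballs under /backups and that /var/www/html/index.php is a file that must exist after extraction; adapt both assumptions to your stack:

#!/bin/bash
# Automated restore test: the "0" in 3-2-1-1-0.
set -euo pipefail

SCRATCH=$(mktemp -d /tmp/restore-test.XXXXXX)
LATEST=$(ls -1t /backups/*.tar.gz | head -n 1)

tar -xzf "$LATEST" -C "$SCRATCH"

# Fail loudly if the restored tree is suspiciously small or missing key files
FILES=$(find "$SCRATCH" -type f | wc -l)
if [ "$FILES" -lt 100 ] || [ ! -f "$SCRATCH/var/www/html/index.php" ]; then
    echo "RESTORE TEST FAILED for $LATEST" | mail -s "Restore test failed on $(hostname)" ops@example.no
    exit 1
fi

rm -rf "$SCRATCH"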

Most VPS providers oversell their storage speeds. When you are in a disaster scenario, you need sustained write speeds to restore data, not burst speeds. This is where the underlying hardware matters. We reference CoolVDS often because they expose raw NVMe interfaces via KVM, rather than hiding behind a slow shared storage layer. In a recovery scenario, that IOPS difference is the gap between being down for 1 hour or 6 hours.

Technical Implementation: The Database

Stop using cron jobs with `mysqldump` for databases larger than 5GB. It locks tables, kills performance, and the restore time is agonizingly slow. For PostgreSQL (which you should be using in 2022), you need Point-in-Time Recovery (PITR) using WAL archiving.

Here is a production-ready `postgresql.conf` snippet for enabling archiving. This setup pushes Write-Ahead Logs to a secure, secondary location immediately.

# /etc/postgresql/14/main/postgresql.conf

# WAL Level must be replica or logical
wal_level = replica

# Enable archiving
archive_mode = on

# The command to push the WAL file to your DR server
# We use rsync here, but this could be a tool like wal-g
archive_command = 'rsync -a %p postgres@backup-node.local:/var/lib/postgresql/archived_wals/%f'

# Force a WAL segment switch every 60 seconds so low-traffic periods
# still get archived promptly (this bounds how much data you can lose)
archive_timeout = 60

With this configuration, you can replay your database state to within the last archived WAL segment, typically no more than a minute behind the moment of the disaster. Compare that to losing 24 hours of data because you relied on a nightly dump.
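Archiving is only half of PITR; the restore side matters just as much. On PostgreSQL 14, the recovery settings live in postgresql.conf and you signal recovery mode with an empty recovery.signal file. A minimal sketch, with placeholder paths and target time:

# /etc/postgresql/14/main/postgresql.conf (on the recovery node)

# Pull archived WAL segments back from the DR node
restore_command = 'rsync -a postgres@backup-node.local:/var/lib/postgresql/archived_wals/%f %p'

# Stop replay just before the bad deploy or dropped table
recovery_target_time = '2022-05-13 16:40:00+02'
recovery_target_action = 'promote'

# Then, with the base backup unpacked into the data directory:
#   touch /var/lib/postgresql/14/main/recovery.signal
#   systemctl start postgresql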

Pro Tip: Always monitor your replication lag. A backup system that fails silently is a time bomb. Use tools like Prometheus or Zabbix to alert if `pg_stat_archiver` shows failure counts increasing.
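A quick manual check from the shell (assuming local peer authentication for the postgres user) looks like this; the same query is what your exporter should be scraping:

# failed_count climbing or a recent last_failed_time means WAL is piling up on the primary
sudo -u postgres psql -c "SELECT archived_count, last_archived_time, failed_count, last_failed_time FROM pg_stat_archiver;"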

File System Synchronization

For file assets (user uploads, configuration files), `rsync` is still the king of reliability, but it needs to be used correctly. Do not just sync blindly; use the `--link-dest` flag for incremental backups that look like full backups without consuming extra space.
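A minimal sketch of that pattern, assuming dated snapshot directories under /backup on the DR node:

#!/bin/bash
# Unchanged files are hard-linked against yesterday's snapshot instead of
# being copied again, so each dated directory looks like a full backup.
# If yesterday's directory does not exist, rsync simply falls back to a full copy.
TODAY=$(date +%F)
YESTERDAY=$(date -d yesterday +%F)

rsync -a --delete \
    --link-dest="/backup/$YESTERDAY" \
    /var/www/html/ \
    "user@backup-node:/backup/$TODAY/"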

However, for a modern stack, I prefer BorgBackup. It offers deduplication, compression, and authenticated encryption. It is particularly effective on VPS environments where disk space costs money.

Automating Borg with a Wrapper Script

Do not run raw commands in cron. Wrap them. Here is a bash script structure we deploy on our management nodes:

#!/bin/bash

# Configuration
REPOSITORY="ssh://backup-user@backup.coolvds.net/./repo"
BACKUP_SOURCE="/var/www/html"
LOG="/var/log/borg-backup.log"

# Export passphrase for non-interactive runs
# (better: keep it in a root-only file and source it here)
export BORG_PASSPHRASE='CorrectHorseBatteryStaple'

# Log start
echo "Starting backup at $(date)" >> "$LOG"

# Create backup (--progress is dropped; it only produces noise in a log file)
borg create --stats \
    "$REPOSITORY"::'{hostname}-{now:%Y-%m-%d_%H:%M}' \
    "$BACKUP_SOURCE" \
    >> "$LOG" 2>&1
CREATE_STATUS=$?

# Prune old backups (keep 7 dailies, 4 weeklies, 6 monthlies)
borg prune -v --list --keep-daily=7 --keep-weekly=4 --keep-monthly=6 "$REPOSITORY" >> "$LOG" 2>&1
PRUNE_STATUS=$?

# A bare "$?" here would only reflect the prune; check both steps explicitly
if [ "$CREATE_STATUS" -ne 0 ] || [ "$PRUNE_STATUS" -ne 0 ]; then
    mail -s "BACKUP FAILED: $(hostname)" ops@example.no < "$LOG"
fi

This script handles rotation automatically. You store more history while using less space.
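Drop the script into cron and let it run unattended; the path below is an assumption, adjust it to wherever you keep operational scripts:

# /etc/cron.d/borg-backup: runs nightly at 02:30 as root
30 2 * * * root /usr/local/bin/borg-backup.sh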

Infrastructure as Code (IaC) is Your Lifeboat

If your server vanishes today, how long does it take to configure a new one? If you are logging in via SSH and running `apt install nginx`, you are doing it wrong. In 2022, you should be using Ansible, Terraform, or SaltStack.

Your disaster recovery plan should look like this:

  1. Spin up a fresh NVMe VPS on CoolVDS (takes ~55 seconds).
  2. Point your DNS (via API) to the new IP (see the sketch after this list).
  3. Run your Ansible playbook to install software.
  4. Restore data from the backup node.
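Step 2 is where people burn time clicking around a control panel. Most DNS providers expose a REST API; the endpoint, payload, and token below are hypothetical placeholders, but the shape is the same everywhere: authenticate, update the A record, and keep the TTL low so the change propagates quickly.

#!/bin/bash
# Hypothetical DNS provider API: swap in your provider's real endpoint and token
DNS_API="https://api.dns-provider.example/v1/zones/example.no/records/www"
NEW_IP="203.0.113.42"   # the freshly provisioned recovery VPS
# DNS_API_TOKEN is assumed to be exported in the environment

curl -s -X PUT "$DNS_API" \
    -H "Authorization: Bearer $DNS_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"type\": \"A\", \"content\": \"$NEW_IP\", \"ttl\": 300}"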

Here is a snippet of an Ansible playbook that ensures your recovery environment is identical to production:

---
- name: Disaster Recovery Restore
  hosts: recovery_web
  become: yes
  vars:
    nginx_worker_processes: "auto"
    keepalive_timeout: "65"

  tasks:
    - name: Install Nginx and Dependencies
      apt:
        name: ["nginx", "python3-certbot-nginx", "git", "htop"]
        state: present
        update_cache: yes

    - name: Push Nginx Configuration
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        mode: '0644'
      notify: Restart Nginx

    - name: Rate-Limit SSH Before Enabling Firewall
      ufw:
        rule: limit
        port: ssh
        proto: tcp

    - name: Ensure Firewall is Active (Default Deny)
      ufw:
        state: enabled
        policy: deny

  handlers:
    - name: Restart Nginx
      service:
        name: nginx
        state: restarted
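From the management node, recovery is then a single command. The inventory and playbook file names here are assumptions; use whatever your repository already calls them:

# Dry run first to see what would change, then apply for real
ansible-playbook -i inventory/recovery.ini disaster-recovery.yml --check --diff
ansible-playbook -i inventory/recovery.ini disaster-recovery.yml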

Quick Command Reference

Keep these commands in your runbook. When adrenaline is pumping, you will forget syntax.

1. Verify Postgres Backup Integrity:
pg_restore --list my_backup_file.dump

2. Fast Directory Sync over SSH:
rsync -avz --delete -e ssh /var/www/ user@backup-node:/backup/

3. Check Disk I/O Wait (Is your restore killing the CPU?):
iostat -xz 1

4. Snapshot ZFS filesystem (If using ZFS on Linux):
zfs snapshot zroot/data@$(date +%F-%H%M)

5. Test Web Server Response Header:
curl -I https://coolvds.net

The Hardware Reality

Software configurations are useless if the hardware chokes during a restore. I have tested restores on cheap "budget" VPS providers where the disk write speed capped at 30MB/s. Restoring a 200GB database took nearly two hours. That is unacceptable.

When selecting a host for your DR site, look for NVMe storage and high-frequency CPUs. You don't need them for daily traffic, but you desperately need them when you are decompressing 50GB of gzip logs while simultaneously writing to disk. CoolVDS consistently benchmarks high on sequential write speeds, which is exactly what you need when hydrating a new node from a backup.
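Don't take anyone's word for it, including mine. Before committing to a DR host, run a quick sequential-write benchmark yourself; fio with direct I/O keeps the page cache from flattering the numbers. The job parameters below are a reasonable starting point, not gospel:

# Writes a 4 GB test file in the current directory; requires the fio package
fio --name=seqwrite --rw=write --bs=1M --size=4G \
    --direct=1 --ioengine=libaio --iodepth=16 --group_reporting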

Conclusion

A disaster recovery plan is not a document; it is a practiced reflex. If you haven't restored your backups in the last 3 months, you don't have backups—you have hopeful files. Start small: script your database dumps, verify your offsite storage compliance for Norway, and ensure your infrastructure provider has the raw I/O throughput to handle a crisis.

Don't let slow I/O be the reason your business fails. Deploy a high-performance test instance on CoolVDS today and benchmark your recovery speeds.