Disaster Recovery for the Paranoid: Architecting Resilience in a Post-Schrems II World
Let’s be honest: most Disaster Recovery (DR) plans are documents that sit in a Google Drive folder, untouched until a server actually catches fire. By then, it is too late. In 2021, we watched a major European datacenter burn down. Companies that relied on "local snapshots" vanished overnight. If you are a CTO or Lead Architect in Norway, that fire should have been your wake-up call.
It is not just about physical disasters. Ransomware targeting /var/lib/mysql is statistically more likely than a flood. Furthermore, with the Schrems II ruling, relying on US-owned cloud giants for your primary failover strategy creates a legal minefield regarding data transfer mechanisms.
We are going to skip the corporate fluff. This is how you build a DR architecture that survives hardware failure, human error, and legal scrutiny, using standard Linux tools available right now in 2023.
The Mathematics of Failure: RTO and RPO
Before touching a single config file, define your constraints. If you cannot answer these two questions, you do not have a DR plan:
- RPO (Recovery Point Objective): How much data are you willing to lose? One hour? One transaction?
- RTO (Recovery Time Objective): How long can you be offline before the CEO calls you screaming?
For a high-traffic Magento store or a SaaS platform serving Oslo, an RPO of 24 hours is unacceptable. We aim for an RPO of < 5 minutes and an RTO of < 1 hour.
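Once the targets are agreed, monitor them. Here is a minimal sketch of an RPO check, assuming your backups land under /storage/backups/prod-01 (the target directory we use later in the pull script); adjust the path and threshold to your own numbers:

#!/bin/bash
# rpo_check.sh - a sketch: alert if the newest backup artifact is older than the RPO target.
BACKUP_DIR="/storage/backups/prod-01"
RPO_SECONDS=300   # 5-minute RPO target

# Epoch timestamp of the newest file under the backup directory
NEWEST=$(find "$BACKUP_DIR" -type f -printf '%T@\n' 2>/dev/null | sort -n | tail -1 | cut -d. -f1)

if [ -z "$NEWEST" ] || [ $(( $(date +%s) - NEWEST )) -gt "$RPO_SECONDS" ]; then
    echo "RPO breach: newest backup in $BACKUP_DIR is older than ${RPO_SECONDS}s" \
        | mail -s "CRITICAL: RPO check failed" ops@yourdomain.no
fi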
Strategy 1: Immutable Backups (The Ransomware Shield)
Attackers know you have backups. Modern ransomware hunts for your backup mount points and encrypts them too. The solution is immutability. You need a storage target that cannot be overwritten, even by root, until a retention period expires.
On CoolVDS NVMe instances, we often use the ext4 immutable attribute (chattr +i) for local hardening before shipping data off-site, but the real protection comes from tools like Restic or BorgBackup running in append-only mode.
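As a sketch of that append-only pattern with BorgBackup (host names, paths, and the key material here are illustrative), restrict the production node's key on the repository host so it can add archives but never delete them:

# On the backup repository host: ~/.ssh/authorized_keys for the borg user.
# The client can append data but cannot prune or delete it.
command="borg serve --append-only --restrict-to-path /srv/borg/prod-01",no-port-forwarding,no-pty,no-X11-forwarding ssh-ed25519 AAAAC3NzaC... borg-client@prod-01

# On the production node: initialise the encrypted repository and push an archive.
borg init --encryption=repokey-blake2 borg@dr-site:/srv/borg/prod-01
borg create --compression lz4 borg@dr-site:/srv/borg/prod-01::{hostname}-{now} /var/www /etc

Pruning old archives then happens only from the repository side, during a controlled maintenance window, never from the production node.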
Pro Tip: Never mount your backup server on your production server. The production server should push encrypted data blindly, or better yet, the backup server should pull it via restricted SSH keys.
Implementation: The Pull Method
Instead of a script on your web server running scp backup.tar.gz backup-server:, which implies the web server has write access to the backup server, reverse the flow.
Step 1: Restrict the SSH key on the production server to only allow the backup command. In /root/.ssh/authorized_keys on your CoolVDS production node:
command="/usr/bin/rrsync -ro /var/backups/",no-agent-forwarding,no-port-forwarding,no-pty,no-user-rc,no-X11-forwarding ssh-ed25519 AAAAC3NzaC... backup-user@dr-site
Step 2: On the DR server (situated perhaps in a different availability zone or utilizing our secondary Norwegian location), run the pull:
#!/bin/bash
# /opt/scripts/pull_backup.sh
# Runs on the DR server: pulls backups from production over the restricted SSH key.
TARGET="/storage/backups/prod-01"
DATE=$(date +%F_%H-%M)

# Ensure we have a dated destination directory
mkdir -p "$TARGET/$DATE"

# Pull data using rsync with a bandwidth limit (KiB/s) to protect the prod network
rsync -avz --bwlimit=5000 -e "ssh -i /home/backup/.ssh/id_prod_pull" \
    root@192.0.2.10:/var/backups/ \
    "$TARGET/$DATE/"

# Check the rsync exit code immediately
if [ $? -eq 0 ]; then
    echo "Backup successful at $DATE"
    # Lock the whole tree to prevent accidental deletion (rudimentary immutability)
    chattr -R +i "$TARGET/$DATE"
else
    echo "Backup FAILED at $DATE" | mail -s "CRITICAL: Backup Fail" ops@yourdomain.no
fi
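Schedule the pull from the DR side. Something like this in root's crontab on the DR server works (chattr needs root; the 15-minute interval is a judgment call against your RPO):

# crontab -e as root on the DR server: pull every 15 minutes, keep a log
*/15 * * * * /opt/scripts/pull_backup.sh >> /var/log/pull_backup.log 2>&1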
Strategy 2: Database Point-in-Time Recovery (PITR)
Dumping your database once a night (mysqldump) is 1999 technology. If your database crashes at 23:50 and your backup runs at 00:00, you lost 23 hours and 50 minutes of data.
For PostgreSQL (standard on our managed tech stacks), you must use WAL (Write Ahead Log) archiving. This allows you to replay transactions up to the exact second of failure.
Configuring WAL Archiving in PostgreSQL 14/15
Edit your postgresql.conf to ship logs to a secure, separate CoolVDS storage instance immediately upon creation.
# postgresql.conf snippet
wal_level = replica
archive_mode = on
# The command to execute when a WAL segment is ready.
# We use LZ4 compression for speed and ship it off-site.
archive_command = 'lz4 -q -z %p - | ssh -i /var/lib/postgresql/.ssh/id_wal_archive postgres@10.0.0.5 "cat > /var/lib/pgsql/wals/%f.lz4"'
This ensures that every time the database fills a 16MB WAL segment, it is shipped off-site; set archive_timeout as well if you want quiet periods flushed on a schedule. Your data loss window shrinks from 24 hours to seconds.
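WAL segments alone are not enough: you also need a periodic base backup to replay them against, and a restore path you have actually tested. Here is a minimal restore sketch for PostgreSQL 14/15, assuming the archive layout from the archive_command above and an existing replication role; the paths, role name, and target time are illustrative:

# 1. Take a regular base backup from the primary (run from the DR/archive host)
pg_basebackup -h 192.0.2.10 -U replicator -D /var/lib/pgsql/base/$(date +%F) -X stream -P

# 2. To restore: stop PostgreSQL, copy the base backup into a clean data directory,
#    then point recovery at the archived WAL and the moment just before the failure.
cat >> /var/lib/postgresql/14/main/postgresql.auto.conf <<'EOF'
restore_command = 'lz4 -d -q /var/lib/pgsql/wals/%f.lz4 %p'
recovery_target_time = '2023-06-12 23:49:00+02'
EOF
touch /var/lib/postgresql/14/main/recovery.signal

# 3. Start the server; it replays WAL up to the target, then pauses so you can
#    verify the data before promoting.
systemctl start postgresql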
Strategy 3: Infrastructure as Code (IaC) for Rapid Re-deployment
Backups save data. IaC saves time. If your server is compromised, you do not want to SSH in and start patching Apache. You want to terminate the instance and spawn a fresh one.
Using Terraform (v1.3.x is current standard), you can define your CoolVDS resources. If the primary site goes dark, you change one variable and apply.
# main.tf
resource "coolvds_instance" "web_node" {
  count  = 3
  name   = "web-prod-${count.index}"
  region = var.disaster_recovery_mode ? "no-bergen" : "no-oslo"
  image  = "ubuntu-22.04"
  plan   = "nvme-pro-16gb"

  # Cloud-init to bootstrap the node immediately
  user_data = templatefile("${path.module}/scripts/init.yaml", {
    db_ip = var.db_primary_ip
  })

  tags = {
    Environment = "Production"
    Role        = "Web"
  }
}
When the alarm sounds, you are not typing commands. You are running terraform apply -var="disaster_recovery_mode=true".
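Rehearse that step before you need it. A minimal drill, assuming disaster_recovery_mode is declared as a variable in your module:

# Dry-run the failover regularly so the DR config never rots
terraform plan -var="disaster_recovery_mode=true" -detailed-exitcode
# Exit code 0: DR state already matches; 2: changes would be made; 1: the config is broken

# During a real incident, skip the interactive prompt:
terraform apply -var="disaster_recovery_mode=true" -auto-approve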
The Network Layer: DNS Failover
You have restored the data. You have spawned new servers. But traffic is still hitting the dead IP. Waiting for global DNS propagation (TTL) is painful.
Use a Floating IP, or keep your DNS TTL permanently low so a record change actually takes effect during an emergency. A standard TTL of 300 seconds (5 minutes) is a sensible baseline.
Quick Check: Verify your current TTL settings.
dig +nocmd +noall +answer yourdomain.no
If you see 86400 (24 hours), change it now. You cannot afford to wait a day for users to see your recovered site.
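You can wire that check into monitoring so a creeping TTL gets caught long before an incident. A small sketch (record name and threshold are illustrative):

#!/bin/bash
# ttl_check.sh - warn if the A record's public TTL exceeds the failover budget
TTL=$(dig +noall +answer A yourdomain.no | awk 'NR==1 {print $2}')
if [ "${TTL:-0}" -gt 300 ]; then
    echo "DNS TTL is ${TTL}s - failover will be slow" \
        | mail -s "WARNING: DNS TTL too high" ops@yourdomain.no
fi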
Why Local Sovereignty Matters
Norwegian businesses often overlook Datatilsynet's requirements until they are audited. Storing backups of Norwegian citizens' data on cheap object storage run by US-owned providers (even if the datacenter is physically in Europe) exposes you to the CLOUD Act. This is a massive compliance risk.
CoolVDS infrastructure is owned, operated, and located strictly within Norway/Europe jurisdictions. We utilize NIX (Norwegian Internet Exchange) for peering, meaning if you are replicating data from your office in Oslo to our facility, the traffic likely never leaves the country. This reduces latency to sub-5ms levels, making synchronous replication actually feasible without killing application performance.
Testing the Latency
Before setting up sync replication, verify the link with full-size packets (1472 bytes of ICMP payload plus headers fills a standard 1500-byte MTU without fragmentation):
ping -c 10 -s 1472 192.0.2.50 | grep "avg"
If you see anything above 20ms, stick to asynchronous replication. High latency on a synchronous commit will freeze your database writes.
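To make that go/no-go decision scriptable, parse the average RTT out of ping's summary line; a small sketch against the same test target:

# Extract the average RTT (ms) from the min/avg/max/mdev summary line
AVG=$(ping -c 10 -s 1472 192.0.2.50 | awk -F'/' '/rtt|round-trip/ {print $5}')

if [ -n "$AVG" ] && awk -v a="$AVG" 'BEGIN { exit !(a < 20) }'; then
    echo "avg RTT ${AVG} ms: synchronous replication is viable"
else
    echo "avg RTT ${AVG} ms: stay asynchronous"
fi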
The CoolVDS Advantage in DR
We built our platform for the "Pragmatic CTO." We don't upsell you on proprietary backup "black boxes." We give you raw, high-performance NVMe block storage and KVM virtualization so you can build the architecture your compliance officer demands.
- 10Gbps Internal Network: Rehydrate 1TB of data in minutes, not hours.
- ISO 27001 Certified Datacenters: Physical security is not your problem.
- No Vendor Lock-in: Use standard Linux tools. If you leave us, you take your configs with you.
Disaster recovery is not a product you buy; it is a process you design. But the foundation matters. Don't build your safety net on budget spinning rust.
Is your infrastructure ready for the worst? Spin up a secondary DR node on CoolVDS today and test your rsync throughput. Stability starts with action.