Surviving the Blackout: A Battle-Tested Disaster Recovery Strategy for Norwegian Infrastructure
If your Disaster Recovery (DR) plan relies solely on nightly snapshots, you don't have a plan. You have a resignation letter waiting to be signed. In the last three years, we've seen major availability zones in Europe go dark due to cooling failures and fiber cuts. It's April 2025. There is no excuse for 24-hour downtime windows anymore.
Latency matters. Compliance matters. But when the primary node vanishes, speed matters most. I'm going to walk you through building a resilient, hot-standby architecture that keeps your data strictly within Norwegian borders, satisfying both the GDPR and Datatilsynet (the Norwegian Data Protection Authority), while keeping your RTO (Recovery Time Objective) under 60 seconds.
The Myth of "High Availability" vs. Actual DR
High Availability (HA) protects you from a service crash. Disaster Recovery protects you from the crater where your datacenter used to be. You need both. A load balancer distributing traffic between two nodes in the same rack is not DR. If that rack loses power, you are offline.
For a robust setup targeting the Norwegian market, you need geographic separation without a painful latency penalty. CoolVDS provides exactly that, with real isolation to match. We aren't talking about shared hosting containers here; we are talking about KVM virtualization, where your instance runs its own kernel on dedicated resources.
Step 1: Infrastructure as Code (The Foundation)
You cannot manually click your way out of a disaster. You need to spin up replacement infrastructure immediately. By 2025, OpenTofu and Terraform are the standards. Here is how we define a secondary CoolVDS recovery node programmatically. This ensures that your DR environment is identical to production.
# main.tf - Provisioning a CoolVDS DR node
resource "coolvds_instance" "dr_node" {
  name     = "norway-dr-01"
  region   = "oslo-zone-b" # Geographically separate from Zone A
  image    = "ubuntu-24-04-lts"
  plan     = "nvme-16gb-4vcpu"
  ssh_keys = [var.ssh_key_id]

  # Critical: private networking for replication traffic
  private_network {
    enabled = true
    ip      = "10.10.2.5"
  }

  tags = [
    "env:dr",
    "role:database-replica"
  ]

  # Cloud-init to install base dependencies immediately.
  # Ubuntu 24.04 ships PostgreSQL 16, so pull 17 from the PGDG repository first.
  user_data = <<-EOF
    #!/bin/bash
    apt-get update && apt-get install -y postgresql-common wireguard
    /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh -y
    apt-get install -y postgresql-17
    systemctl enable postgresql
  EOF
}
Pro Tip: Always use private networking for replication. Sending unencrypted WAL over the public internet is a security suicide mission. If private networking isn't an option, wrap everything in WireGuard. On CoolVDS, private LAN transfer is unmetered, so replication traffic costs you nothing in bandwidth.
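If you do end up crossing the public internet, a point-to-point WireGuard tunnel between the two nodes is only a few lines. A minimal sketch, with placeholder addresses, keys, and endpoints you must fill in (mirror the config on the DR node):
# Generate a keypair on each node
apt-get install -y wireguard
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key

# /etc/wireguard/wg0.conf on the primary (swap addresses and keys on the DR node)
cat > /etc/wireguard/wg0.conf <<'EOF'
[Interface]
Address = 10.99.0.1/24
PrivateKey = <contents of this node's private.key>
ListenPort = 51820

[Peer]
PublicKey = <the other node's public.key>
Endpoint = <the other node's public IP>:51820
AllowedIPs = 10.99.0.2/32
PersistentKeepalive = 25
EOF

systemctl enable --now wg-quick@wg0
In that case, point listen_addresses and primary_conninfo at the tunnel IPs (10.99.0.x) instead of the private LAN addresses used below.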
Step 2: Database Replication (The Pulse)
Data gravity is real. Moving 500GB of data from a backup archive takes time. Replaying 500GB of write-ahead logs (WAL) on standard SSDs takes hours. On NVMe storage, which is standard on CoolVDS, it takes minutes. This is why hardware selection is part of your DR strategy.
We will configure PostgreSQL 17 (the current stable workhorse) for streaming replication. The Primary node pushes changes to the DR node in near real-time.
Primary Node Configuration (postgresql.conf)
Set these parameters on the primary to allow streaming replication (the replication slot itself is created later by pg_basebackup). The standby connects as a dedicated replica_user role with a strong password; creating that role is covered right after the config.
# /etc/postgresql/17/main/postgresql.conf
listen_addresses = 'localhost,10.10.1.5'   # Listen on the private IP
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10

# Failover safety
synchronous_commit = on
# For true zero data loss (RPO=0) you must also set synchronous_standby_names
# to the standby's application_name; without it, 'on' only waits for the local flush.
# Synchronous replication adds latency to every commit, so fall back to
# 'local' if the latency between zones is > 5ms.

# Archive command for point-in-time recovery (optional but recommended)
archive_mode = on
archive_command = 'test ! -f /mnt/backups/%f && cp %p /mnt/backups/%f'
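The config above assumes a replica_user role already exists and that the DR node is allowed to connect for replication. A minimal sketch of that setup on the primary (the password is a placeholder; adjust the pg_hba.conf path to your install):
# Create a dedicated replication role on the primary
sudo -u postgres psql -c "CREATE ROLE replica_user WITH REPLICATION LOGIN PASSWORD 'change-me';"

# Allow the DR node (10.10.2.5) to connect for replication over the private LAN
echo "host replication replica_user 10.10.2.5/32 scram-sha-256" >> /etc/postgresql/17/main/pg_hba.conf
systemctl reload postgresql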
The Replica Node Setup
Don't just start from an empty cluster. Stop PostgreSQL on the replica, clear out its data directory, and use pg_basebackup to clone the primary (it refuses to write into a non-empty directory):
pg_basebackup -h 10.10.1.5 -D /var/lib/postgresql/17/main -U replica_user -P -v -R -X stream -C -S dr_slot_1
This command does the heavy lifting: it streams a full copy of the data, creates the replication slot (-C -S dr_slot_1), writes the standby.signal file automatically, and appends the connection settings to postgresql.auto.conf (-R). Start PostgreSQL on the replica, then check replication status from the primary with:
SELECT * FROM pg_stat_replication;
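On the replica itself, a quick sanity check confirms it is in recovery and shows how far behind it is replaying:
# Run on the replica: recovery status and replay lag
sudo -u postgres psql -c "SELECT pg_is_in_recovery() AS in_recovery, now() - pg_last_xact_replay_timestamp() AS replay_lag;"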
Step 3: The Failover Logic (The Switch)
Detection is the hardest part. Is the server down, or is the NIX (Norwegian Internet Exchange) just congested? You need a "quorum" approach. I use a simple monitor script running on a tiny third witness node (can be a cheap container).
Here is a robust bash script that handles the decision logic. It requires several consecutive failed health checks before promoting the standby, so a single network blip does not trigger a split-brain scenario.
#!/bin/bash
# failover_monitor.sh - runs on the witness node
PRIMARY_IP="10.10.1.5"
STANDBY_IP="10.10.2.5"
CHECK_INTERVAL=10
FAIL_COUNT=0
MAX_RETRIES=3

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1"
}

check_primary() {
    # pg_isready confirms the server is accepting connections on port 5432
    pg_isready -h "$PRIMARY_IP" -t 2 > /dev/null 2>&1
    return $?
}

promote_standby() {
    log "CRITICAL: Primary unreachable. Promoting Standby..."
    # SSH into the standby and promote it
    ssh root@"$STANDBY_IP" "su - postgres -c '/usr/lib/postgresql/17/bin/pg_ctl promote -D /var/lib/postgresql/17/main'"
    if [ $? -eq 0 ]; then
        log "SUCCESS: Standby is now Primary."
        # Trigger DNS update via API here
        ./update_dns.sh "$STANDBY_IP"
        exit 0
    else
        log "ERROR: Promotion failed!"
        exit 1
    fi
}

while true; do
    if check_primary; then
        FAIL_COUNT=0
    else
        ((FAIL_COUNT++))
        log "WARNING: Primary check failed ($FAIL_COUNT/$MAX_RETRIES)"
    fi
    if [ $FAIL_COUNT -ge $MAX_RETRIES ]; then
        promote_standby
    fi
    sleep $CHECK_INTERVAL
done
Configure this as a systemd service to ensure it survives reboots on the witness node.
[Unit]
Description=PostgreSQL Auto-Failover Monitor
After=network.target

[Service]
ExecStart=/usr/local/bin/failover_monitor.sh
# The monitor calls ./update_dns.sh, so run it from the directory holding both scripts
WorkingDirectory=/usr/local/bin
Restart=always
User=root

[Install]
WantedBy=multi-user.target
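The monitor hands off traffic redirection to update_dns.sh. What that script contains depends entirely on your DNS provider; here is a minimal sketch assuming a hypothetical HTTPS API, where the endpoint, token, and record ID are placeholders you must replace:
#!/bin/bash
# update_dns.sh - point the production A record at the new primary (hypothetical DNS API)
NEW_IP="$1"
API_TOKEN="replace-with-your-provider-token"
RECORD_ID="replace-with-your-record-id"

curl -fsS -X PUT "https://dns.example.com/api/v1/records/${RECORD_ID}" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"type\": \"A\", \"content\": \"${NEW_IP}\", \"ttl\": 60}"
Whatever provider you use, drop the TTL on that record to 60 seconds or less long before disaster strikes; otherwise clients will keep resolving the dead primary well after the promotion.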
Why CoolVDS? The Hardware Reality
Software configuration is only half the battle. When you trigger that failover, your new primary node gets hammered by the full weight of production traffic immediately. This is where "noisy neighbor" syndrome kills cheap VPS providers.
If your VPS shares disk I/O with 50 other users, your database will choke during the cache-warming phase. CoolVDS guarantees resource isolation. Our NVMe arrays provide the random I/O throughput necessary to ingest high-velocity writes while simultaneously serving reads, keeping your Magento store or SaaS platform responsive even during recovery.
Offsite Backups (The Last Resort)
Replication covers availability; backups cover corruption. A replica will faithfully replicate a DROP TABLE. Use restic for encrypted, deduplicated backups to S3-compatible object storage or, as below, over SFTP to a separate CoolVDS storage instance.
restic -r sftp:user@backup-node:/srv/restic-repo backup /var/lib/postgresql
Restic encrypts by default. This is mandatory for GDPR compliance. Do not store unencrypted SQL dumps anywhere.
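One caveat: a file-level copy of a running data directory is only consistent if you also keep the WAL archive produced by the archive_command above. If you would rather have a simple, self-contained logical backup, you can pipe pg_dump straight into restic. A minimal sketch (repository path and database name are illustrative):
# One-time: initialise the encrypted repository
restic -r sftp:user@backup-node:/srv/restic-repo init

# Stream a compressed logical dump directly into the repository
pg_dump -U postgres -Fc yourdb | \
  restic -r sftp:user@backup-node:/srv/restic-repo backup --stdin --stdin-filename yourdb.dump

# Retention: keep 7 daily and 4 weekly snapshots, prune the rest
restic -r sftp:user@backup-node:/srv/restic-repo forget --keep-daily 7 --keep-weekly 4 --prune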
Testing or Hoping?
A DR plan that hasn't been tested is a hypothesis. Schedule a "Game Day" every quarter: block the primary's network interface and watch the monitor do its job. Measure the downtime. If it's over 60 seconds, tighten CHECK_INTERVAL and MAX_RETRIES in the monitor first, then look at wal_receiver_timeout on the replica.
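One low-risk way to run the drill is to drop PostgreSQL traffic on the primary with a temporary firewall rule and time how long the witness takes to promote the standby. A rough sketch, run on the primary during a planned window:
# Simulate the outage: drop inbound PostgreSQL traffic and note the start time
date +%s > /tmp/gameday_start
iptables -I INPUT -p tcp --dport 5432 -j DROP

# ...wait for the witness to log "Standby is now Primary", then measure...
echo "Failover took $(( $(date +%s) - $(cat /tmp/gameday_start) )) seconds"

# Remove the rule once the drill is over
iptables -D INPUT -p tcp --dport 5432 -j DROP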
Disasters happen. The difference between a minor incident and a company-ending event is preparation. Don't let slow hardware be the bottleneck in your recovery.
Ready to harden your infrastructure? Deploy a high-performance NVMe instance on CoolVDS today and build a DR strategy that actually works.