Surviving the Blackout: A Battle-Tested Disaster Recovery Strategy for Norwegian Infrastructure

If your Disaster Recovery (DR) plan relies solely on nightly snapshots, you don't have a plan. You have a resignation letter waiting to be signed. In the last three years, we've seen major availability zones in Europe go dark due to cooling failures and fiber cuts. It's April 2025. There is no excuse for 24-hour downtime windows anymore.

Latency matters. Compliance matters. But when the primary node vanishes, speed matters most. I’m going to walk you through building a resilient, hot-standby architecture that keeps your data strictly within Norwegian borders—satisfying GDPR and Datatilsynet—while ensuring your RTO (Recovery Time Objective) stays under 60 seconds.

The Myth of "High Availability" vs. Actual DR

High Availability (HA) protects you from a service crash. Disaster Recovery protects you from the crater where your datacenter used to be. You need both. A load balancer distributing traffic between two nodes in the same rack is not DR. If that rack loses power, you are offline.

For a robust setup targeting the Norwegian market, you need geographic separation without the latency penalty. CoolVDS gives you both: separate zones within Norway and strict per-instance isolation. We aren't talking about shared hosting containers here. We are talking about KVM virtualization, where your instance runs its own kernel on dedicated resources.

Step 1: Infrastructure as Code (The Foundation)

You cannot manually click your way out of a disaster. You need to spin up replacement infrastructure immediately. By 2025, OpenTofu and Terraform are the standards. Here is how we define a secondary CoolVDS recovery node programmatically. This ensures that your DR environment is identical to production.

# main.tf - Provisioning a CoolVDS DR Node
resource "coolvds_instance" "dr_node" {
  name          = "norway-dr-01"
  region        = "oslo-zone-b" # Geographically separate from Zone A
  image         = "ubuntu-24-04-lts"
  plan          = "nvme-16gb-4vcpu" 
  ssh_keys      = [var.ssh_key_id]
  
  # Critical: Private Networking for Replication
  private_network {
    enabled = true
    ip      = "10.10.2.5"
  }

  tags = [
    "env:dr",
    "role:database-replica"
  ]

  # Cloud-init to install base dependencies immediately
  user_data = <<-EOF
    #!/bin/bash
    # PostgreSQL 17 is not in the stock Ubuntu 24.04 archive; add the PGDG repo first
    apt-get update && apt-get install -y postgresql-common wireguard
    /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh -y
    apt-get install -y postgresql-17
    systemctl enable postgresql
  EOF
}
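
Assuming the coolvds_instance resource above maps to a CoolVDS Terraform/OpenTofu provider (the names here are illustrative), rolling the DR node out is the standard workflow:

tofu init
tofu plan -out=dr.plan
tofu apply dr.plan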

Pro Tip: Always use private networking for replication. Sending unencrypted WAL logs over the public internet is a security suicide mission. If private networking isn't an option, wrap everything in WireGuard (a sketch follows below). On CoolVDS, private LAN transfer is unmetered, saving you bandwidth costs.
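
If you do have to cross the public internet, a point-to-point WireGuard tunnel between the two nodes is enough. Here is a minimal sketch for the primary side; the keys and endpoint are placeholders you generate yourself with wg genkey and wg pubkey:

# /etc/wireguard/wg0.conf on the primary (10.10.1.5)
[Interface]
Address    = 10.10.1.5/24
PrivateKey = <primary-private-key>
ListenPort = 51820

[Peer]
# The DR node in the other zone
PublicKey           = <dr-node-public-key>
Endpoint            = <dr-node-public-ip>:51820
AllowedIPs          = 10.10.2.5/32
PersistentKeepalive = 25

Bring the tunnel up with wg-quick up wg0 on both sides and point replication at the tunnel addresses instead of the public IPs.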

Step 2: Database Replication (The Pulse)

Data gravity is real. Moving 500GB of data from a backup archive takes time. Replaying 500GB of write-ahead logs (WAL) on standard SSDs takes hours. On NVMe storage, which is standard on CoolVDS, it takes minutes. This is why hardware selection is part of your DR strategy.

We will configure PostgreSQL 17 (the current stable workhorse) for streaming replication. The Primary node pushes changes to the DR node in near real-time.

Primary Node Configuration (postgresql.conf)

Set these parameters to enable streaming replication and replication slots. We use a dedicated role, replica_user, with a strong password; it is created right after this config block.

# /etc/postgresql/17/main/postgresql.conf

listen_addresses = 'localhost,10.10.1.5' # Listen on Private IP
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10

# Failover safety
synchronous_commit = on
# 'on' only gives zero data loss (RPO=0) if synchronous_standby_names also
# names the standby; that combination adds latency to every commit.
# Switch to 'local' if the latency between zones > 5ms and a small RPO is acceptable.

# Archive command for point-in-time recovery (optional but recommended)
archive_mode = on
archive_command = 'test ! -f /mnt/backups/%f && cp %p /mnt/backups/%f'
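
The primary also needs the replication role and a pg_hba.conf rule that lets the standby's private IP connect. A minimal sketch (the password is a placeholder; the file path matches the Debian/Ubuntu layout used above):

-- On the primary, as the postgres superuser
CREATE ROLE replica_user WITH REPLICATION LOGIN PASSWORD 'use-a-long-random-secret';

# /etc/postgresql/17/main/pg_hba.conf
# Allow streaming replication from the standby's private IP only
host    replication    replica_user    10.10.2.5/32    scram-sha-256

Reload PostgreSQL (systemctl reload postgresql) after editing pg_hba.conf.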

The Replica Node Setup

Don't just run an empty DB. Stop PostgreSQL on the replica, clear its data directory, and use pg_basebackup to clone the primary:

pg_basebackup -h 10.10.1.5 -D /var/lib/postgresql/17/main -U replica_user -P -v -R -X stream -C -S dr_slot_1

This command does the heavy lifting: it snapshots the data, creates the standby.signal file automatically, and writes the connection settings. Check replication status on the primary and look for state = 'streaming':

select * from pg_stat_replication;

Step 3: The Failover Logic (The Switch)

Detection is the hardest part. Is the server down, or is the NIX (Norwegian Internet Exchange) just congested? You need a "quorum" approach. I use a simple monitor script running on a tiny third witness node (can be a cheap container).

Here is a robust bash script that handles the decision logic. It verifies the primary is actually dead before promoting the standby to avoid split-brain scenarios.

#!/bin/bash
# failover_monitor.sh

PRIMARY_IP="10.10.1.5"
STANDBY_IP="10.10.2.5"
CHECK_INTERVAL=10
FAIL_COUNT=0
MAX_RETRIES=3

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1"
}

check_primary() {
    # pg_isready checks that the server accepts connections on port 5432
    pg_isready -h "$PRIMARY_IP" -t 2 > /dev/null 2>&1
    return $?
}

promote_standby() {
    log "CRITICAL: Primary unreachable. Promoting Standby..."
    
    # SSH into Standby and promote
    ssh root@$STANDBY_IP "su - postgres -c '/usr/lib/postgresql/17/bin/pg_ctl promote -D /var/lib/postgresql/17/main'"
    
    if [ $? -eq 0 ]; then
        log "SUCCESS: Standby is now Primary."
        # Trigger DNS update via API (a sketch of update_dns.sh follows the script)
        /usr/local/bin/update_dns.sh "$STANDBY_IP"
        exit 0
    else
        log "ERROR: Promotion failed!"
        exit 1
    fi
}

while true; do
    if check_primary; then
        FAIL_COUNT=0
    else
        ((FAIL_COUNT++))
        log "WARNING: Primary check failed ($FAIL_COUNT/$MAX_RETRIES)"
    fi

    if [ $FAIL_COUNT -ge $MAX_RETRIES ]; then
        promote_standby
    fi

    sleep $CHECK_INTERVAL
done
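
The monitor calls /usr/local/bin/update_dns.sh to repoint your service record at the promoted node. The exact call depends entirely on your DNS provider, so treat this as a placeholder sketch (endpoint, zone, record and token are values you would substitute), and in production pass the standby's public address rather than its private replication IP:

#!/bin/bash
# /usr/local/bin/update_dns.sh - repoint the A record at the promoted node.
# DNS_API_URL, DNS_ZONE_ID, DNS_RECORD_ID and DNS_API_TOKEN are placeholders
# for whatever your DNS provider actually exposes.
set -euo pipefail

NEW_IP="$1"
DNS_API_URL="https://dns-api.example.no/v1"
DNS_ZONE_ID="<zone-id>"
DNS_RECORD_ID="<record-id>"
DNS_API_TOKEN="<api-token>"

# Short TTL so clients pick up the new address quickly
curl -fsS -X PUT "${DNS_API_URL}/zones/${DNS_ZONE_ID}/records/${DNS_RECORD_ID}" \
  -H "Authorization: Bearer ${DNS_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"type\":\"A\",\"content\":\"${NEW_IP}\",\"ttl\":60}"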

Configure this as a systemd service to ensure it survives reboots on the witness node.

[Unit]
Description=PostgreSQL Auto-Failover Monitor
After=network.target

[Service]
ExecStart=/usr/local/bin/failover_monitor.sh
Restart=always
User=root

[Install]
WantedBy=multi-user.target
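
Drop the unit in as /etc/systemd/system/failover-monitor.service (the name is just a suggestion) and enable it:

systemctl daemon-reload
systemctl enable --now failover-monitor.service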

Why CoolVDS? The Hardware Reality

Software configuration is only half the battle. When you trigger that failover, your new primary node gets hammered by the full weight of production traffic immediately. This is where "noisy neighbor" syndrome kills cheap VPS providers.

If your VPS shares disk I/O with 50 other users, your database will choke during the cache-warming phase. CoolVDS guarantees resource isolation. Our NVMe arrays provide the random I/O throughput necessary to ingest high-velocity writes while simultaneously serving reads, keeping your Magento store or SaaS platform responsive even during recovery.

Offsite Backups (The Last Resort)

Replication covers availability. Backups cover corruption. Use restic for encrypted, deduplicated backups to S3-compatible object storage or, as below, over SFTP to a separate CoolVDS storage instance.

restic -r sftp:user@backup-node:/srv/restic-repo backup /var/lib/postgresql

Restic encrypts by default. This is mandatory for GDPR compliance. Do not store unencrypted SQL dumps anywhere.
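
The repository has to be initialised once before the first backup, and you will want a retention policy so it doesn't grow forever. A minimal sketch (the retention numbers are just an example):

# One-time: create the encrypted repository (you will be asked for a passphrase)
restic -r sftp:user@backup-node:/srv/restic-repo init

# Nightly: back up, then drop snapshots outside the retention window
restic -r sftp:user@backup-node:/srv/restic-repo backup /var/lib/postgresql
restic -r sftp:user@backup-node:/srv/restic-repo forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune

One caveat: a raw copy of a running cluster's data directory is not guaranteed to be consistent, so on an active primary back up a pg_dump output or a fresh pg_basebackup copy instead of the live /var/lib/postgresql.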

Testing or Hoping?

A DR plan that hasn't been tested is a hypothesis. Schedule a "Game Day" every quarter. Block the primary's network interface, watch the script run (one way to do this is sketched below), and measure the downtime. If it's over 60 seconds, tighten the monitor's CHECK_INTERVAL and MAX_RETRIES, shorten your DNS TTL, and only then look at PostgreSQL settings like wal_receiver_timeout.
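
One way to run the drill, assuming iptables on the primary and the failover-monitor unit name suggested earlier:

# On the primary: simulate a zone outage by blocking inbound PostgreSQL traffic
iptables -A INPUT -p tcp --dport 5432 -j DROP

# On the witness: watch the monitor count failures and promote the standby
journalctl -u failover-monitor -f

# After the drill: remove the rule to restore the (now former) primary
iptables -D INPUT -p tcp --dport 5432 -j DROP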

Disasters happen. The difference between a minor incident and a company-ending event is preparation. Don't let slow hardware be the bottleneck in your recovery.

Ready to harden your infrastructure? Deploy a high-performance NVMe instance on CoolVDS today and build a DR strategy that actually works.