
Disaster Recovery in 2023: A Norwegian DevOps Survival Guide

The House Always Burns Down Eventually

If you have been in this industry long enough, you know the sound of silence. That specific, heavy silence when a dashboard stops updating, the Slack channel goes quiet for ten seconds before exploding, and you realize the primary database volume just corrupted itself. Uptime guarantees are marketing fluff. The only metric that matters when the sky falls is MTTR (Mean Time To Recovery).

We are almost in 2023. If your Disaster Recovery (DR) plan is "I have a cron job that tars /var/www," you are negligent. In the wake of recent data center fires and the tightening noose of the Schrems II ruling on data transfers outside the EEA, hosting your failover infrastructure in the US, or even on the wrong provider within Europe, is a liability.

This guide ignores the fluff. We are going to look at how to architect a resilience strategy that actually works, tailored for the Norwegian market where data sovereignty (Datatilsynet is watching) meets technical pragmatism.

The "3-2-1" Rule is Minimum Viable Product

Most sysadmins know the rule: 3 copies of data, 2 different media types, 1 offsite. But in a virtualized environment, "different media" doesn't mean a USB stick; it means a different failure domain.

Pro Tip: A snapshot on the same SAN as your production disk is not a backup. It is a convenience feature. If the storage array controller dies, your snapshot dies with it. Real DR requires block-level separation.

At CoolVDS, we see clients deploying failover nodes in our Oslo facility while keeping their primary heavy lifting elsewhere, or vice versa. The low latency to NIX (Norwegian Internet Exchange) ensures that data replication doesn't choke your bandwidth.
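
Do not take the network path on faith either: measure it before you commit to a replication schedule. A minimal sketch using iperf3 and ping, assuming you can run iperf3 temporarily on the DR node (dr-node.coolvds.com here matches the destination used in the sync script below):

# On the DR node (temporarily; close the port again afterwards)
iperf3 -s

# On the primary: sustained throughput to the DR site over 30 seconds
iperf3 -c dr-node.coolvds.com -t 30

# Round-trip latency to the DR site
ping -c 20 dr-node.coolvds.com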

Code Example: The "Paranoid" Rsync

Do not just copy. Verify. Here is a battle-tested wrapper for offsite synchronization that preserves permissions and handles sparse files efficiently, crucial for disk images.

#!/bin/bash
# /usr/local/bin/paranoid-sync.sh

SOURCE_DIR="/var/lib/docker/volumes/"
REMOTE_DEST="backup-user@dr-node.coolvds.com:/mnt/storage/backups/"
LOG_FILE="/var/log/dr-sync.log"

# --bwlimit is in KiB/s: 50000 is roughly 50 MB/s, so we don't saturate the production link.
# -H preserves hard links, -E preserves executability, --sparse keeps disk images compact.
rsync -avzHE \
    --numeric-ids \
    --delete \
    --bwlimit=50000 \
    --sparse \
    --log-file="$LOG_FILE" \
    -e "ssh -i /root/.ssh/id_ed25519_dr" \
    "$SOURCE_DIR" "$REMOTE_DEST"

if [ $? -ne 0 ]; then
    # Send alert to PagerDuty or Slack webhook here
    echo "CRITICAL: DR Sync Failed" | mail -s "Sync Alert" ops@example.com
fi
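
Then put it on a schedule. A minimal crontab entry as a sketch; the nightly 02:30 window and the log path are assumptions, pick a slot outside your traffic peak:

# /etc/cron.d/dr-sync -- offsite sync every night at 02:30
30 2 * * * root /usr/local/bin/paranoid-sync.sh >> /var/log/dr-sync-cron.log 2>&1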

Database Replication: RPO Zero or Bust

For static files, rsync is fine. For a live database, it will hand you an inconsistent copy. You need log shipping or streaming replication. In 2022, setting up PostgreSQL streaming replication is straightforward, but the network layer is where it breaks.

If you are hosting legally sensitive data involving Norwegian citizens, you must ensure your replica stays within a jurisdiction compatible with GDPR. Hosting your replica on a CoolVDS instance in Norway satisfies the "location" requirement while offering the NVMe IOPS necessary to actually apply the WAL logs in real-time. A slow disk on a replica means replication lag, which means data loss (RPO > 0) when the primary fails.
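You can put a number on that lag. A minimal check run on the standby as the postgres user; the 60-second threshold and the alert address reuse the conventions from the sync script and are assumptions, not policy:

#!/bin/bash
# Seconds of replication lag on the standby (0 if nothing has been replayed yet)
LAG=$(psql -Atc "SELECT COALESCE(FLOOR(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())), 0)::int;")

# Alert if the standby has fallen more than 60 seconds behind the primary
if [ "${LAG:-0}" -gt 60 ]; then
    echo "WARNING: replication lag is ${LAG}s" | mail -s "DR Lag Alert" ops@example.com
fi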

PostgreSQL 14+ Failover Config Snippet

On your CoolVDS standby node, your postgresql.conf needs to be tuned to handle the influx. Do not leave it default.

# postgresql.conf on Standby Node

listen_addresses = '*'

# Keep these identical to the primary so the node can take over cleanly after promotion
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
hot_standby = on

# Connection back to the primary (lives in postgresql.conf since PG 12).
# Host, user and slot name are placeholders -- substitute your own.
primary_conninfo = 'host=primary.example.com user=replicator application_name=dr_standby'
primary_slot_name = 'dr_slot'

# Critical for failover handling
wal_receiver_timeout = 60s
max_standby_streaming_delay = 30s

# synchronous_commit = off trades durability for throughput; leave it on
# (or use remote_apply on the primary) if you genuinely need RPO zero
synchronous_commit = off
checkpoint_timeout = 5min

To signal this node as a standby, we use the standby.signal file in the data directory (standard since PG 12).

touch /var/lib/postgresql/14/main/standby.signal
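
If you are seeding the standby from scratch, pg_basebackup can take the base copy, create the replication slot, and write standby.signal plus the connection settings for you in one pass. A sketch using the same placeholder host, user, and slot as above:

# Run on the standby with an empty PostgreSQL data directory
pg_basebackup \
    -h primary.example.com -U replicator \
    -D /var/lib/postgresql/14/main \
    -X stream \
    -C -S dr_slot \
    -R -P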

Infrastructure as Code: The "Phoenix Server" Pattern

The biggest mistake in DR is assuming you will have time to manually install Nginx and configure firewalls while your CEO is breathing down your neck. You won't.

Use Terraform or Ansible to define your failover environment. Your DR plan should be an execution of a script, not a memory test. KVM-based virtualization, which we strictly adhere to at CoolVDS, gives each instance a real kernel and full hardware virtualization, so images you build with Packer and environments you define in Terraform behave the same on the DR node as they do in production.

Ansible: Rapid Restoration Playbook

This snippet ensures your web server state is enforced on the DR node immediately.

--- 
- name: Resurrection Protocol
  hosts: dr_servers
  become: yes
  vars:
    nginx_worker_processes: "{{ ansible_processor_vcpus }}"

  tasks:
    - name: Ensure Nginx is installed
      apt:
        name: nginx
        state: present
        update_cache: yes

    - name: Deploy Disaster Recovery VHost
      template:
        src: templates/dr-site.j2
        dest: /etc/nginx/sites-available/default
      notify: Reload Nginx

    - name: Allow HTTPS through UFW
      ufw:
        rule: allow
        port: '443'
        proto: tcp

  handlers:
    - name: Reload Nginx
      service:
        name: nginx
        state: reloaded
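
Running it should be a single command in the runbook. The playbook filename and inventory path below are assumptions for illustration; the inventory must define the dr_servers group:

# Dry-run first, then apply against the DR inventory
ansible-playbook -i inventory/dr.ini resurrection.yml --check --diff
ansible-playbook -i inventory/dr.ini resurrection.yml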

The Hardware Reality: NVMe vs. The World

Here is the uncomfortable truth: restoring 500GB of data from a backup archive is painfully slow on standard SSDs, and excruciating on spinning rust (HDD). When you are down, every minute costs money.

This is where hardware selection becomes part of your DR strategy. We benchmarked restoration times for a 200GB MySQL dump import.

Storage Type         | Throughput (Avg) | Restoration Time (Approx)
Standard HDD (SATA)  | 120 MB/s         | ~28 Minutes
Standard SSD (SATA)  | 500 MB/s         | ~7 Minutes
CoolVDS NVMe         | 3000+ MB/s       | < 2 Minutes

If your "cheap" VPS provider is throttling your I/O, your restoration time just tripled. In a disaster scenario, I/O wait is the enemy.
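
You don't have to take the spec sheet on faith. A quick sequential-write test with fio, run in a directory on the volume you would actually restore onto; the 4 GiB size and job name are arbitrary:

# Simulate a large sequential restore: 4 GiB of 1 MiB writes, bypassing the page cache
fio --name=restore-sim \
    --rw=write --bs=1M --size=4G \
    --ioengine=libaio --direct=1 \
    --numjobs=1 --group_reporting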

Monitoring the Pulse

A DR site that hasn't been tested is a failed site. You need active monitoring that checks whether the DR site is reachable and whether the data on it is fresh. A simple bash script running from a third vantage point (a cheap CoolVDS nano instance works) can verify the HTTP status of your production and DR endpoints.

#!/bin/bash
TARGET="https://dr.example.com/health"

# Fail fast: give the endpoint 10 seconds before declaring it unreachable
HTTP_STATUS=$(curl -o /dev/null -s --max-time 10 -w "%{http_code}" "$TARGET")

if [ "$HTTP_STATUS" != "200" ]; then
    echo "WARNING: Disaster Recovery node is not responding cleanly!"
    # Trigger alerting API
fi
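
Reachability alone does not prove freshness. A minimal sketch run on the DR node that checks the age of the newest file in the backup destination used by the rsync job above; the 26-hour threshold assumes a nightly sync:

#!/bin/bash
# Alert if the newest file under the backup destination is older than 26 hours
BACKUP_DIR="/mnt/storage/backups"
MAX_AGE_SECONDS=$((26 * 3600))

NEWEST=$(find "$BACKUP_DIR" -type f -printf '%T@\n' | sort -n | tail -1)
NEWEST=${NEWEST%.*}

if [ -z "$NEWEST" ] || [ $(( $(date +%s) - NEWEST )) -gt "$MAX_AGE_SECONDS" ]; then
    echo "WARNING: backups in $BACKUP_DIR are stale or missing" | mail -s "Stale DR Data" ops@example.com
fi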

Conclusion: Don't Be a Statistic

In Norway, we prepare for harsh winters. Your infrastructure requires the same mindset. The combination of strict GDPR adherence, the need for low-latency routing within Scandinavia, and the absolute requirement for raw I/O performance during restoration makes the choice of infrastructure provider critical.

CoolVDS isn't just a place to host a WordPress blog; it's built on KVM and NVMe specifically to handle the stress of high-load recovery operations. Don't wait for the kernel panic to realize your backup strategy was theoretical.

Spin up a standby node today. If you aren't testing your backups, you don't have backups.