Surviving the Blackout: A Pragmatic Disaster Recovery Guide for Norwegian Systems
There are two types of system administrators in 2023: those who have lost data, and those who are about to. It is a cliché because it is true. Just last month, I watched a seasoned DevOps lead at a mid-sized Oslo fintech turn pale. Their primary SAN in a Frankfurt datacenter corrupted silently. Their backups? Stored on the same logical volume for "fast access."
They didn't just lose data; they lost three days of transaction history and, subsequently, two major clients. The cost of downtime isn't just lost revenue; it is lost reputation.
In Norway, the stakes are higher. Between Datatilsynet enforcing GDPR strictures and the Schrems II ruling complicating data transfers to US-owned clouds, relying on a generic "cloud magic" backup button is professional negligence. You need a plan that is technically sound, legally compliant, and battle-tested.
The "RTO/RPO" Reality Check
Before we touch a single configuration file, define your metrics. If your CEO says "I want zero downtime," ask for an infinite budget. Since you won't get that, we talk about:
- RPO (Recovery Point Objective): How much data can you afford to lose? (e.g., 5 minutes).
- RTO (Recovery Time Objective): How fast must you be back online? (e.g., 1 hour).
Low RTO requires high IOPS and throughput. If you are trying to restore 500GB of PostgreSQL data from cold storage onto a spinning HDD VPS, you will be down for hours: at roughly 150 MB/s of sustained sequential writes, 500GB is close to an hour of raw copy time alone, before WAL replay and integrity checks, while NVMe at multi-GB/s turns that copy into minutes. This is where hardware matters. At CoolVDS, we enforce NVMe storage because during a restore, I/O wait is the enemy.
Architecture 1: The Immutable Backup (Ransomware Defense)
Ransomware in 2023 doesn't just encrypt your disk; it hunts for your mounted backups and encrypts those too. If your backup server is mounted as a standard NFS share with write access, it is not a backup. It is a target.
We use BorgBackup in append-only mode. It handles deduplication (saving space) and encryption (saving your job).
Implementation Strategy
Do not pull backups from your production server. Push them. But ensure the destination cannot delete old archives even if the production server is compromised.
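One way to enforce that on the Borg side (a sketch; the user name, repo path, and key material are placeholders) is to pin the production server's SSH key in the backup node's authorized_keys so it can only run borg serve in append-only mode against that one repository:
# On the backup node: restrict what the production server's key may do.
# The key below is a placeholder for the production server's public key.
cat >> /home/user/.ssh/authorized_keys <<'EOF'
command="borg serve --append-only --restrict-to-path /var/backups/prod-db",restrict ssh-ed25519 AAAA...prod-key...
EOF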
Here is a production-grade script snippet we use to initialize a secure repository. Note the encryption mode:
#!/bin/bash
# Initialize the repo with Repokey-Blake2 (high security, decent speed)
# Run this ONCE from the production node, pointing at the remote repo on the backup node
borg init --encryption=repokey-blake2 user@backup-node.coolvds.net:/var/backups/prod-db
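With repokey mode the key material lives inside the repository itself, so keep an out-of-band copy too. A quick sketch (the destination path is arbitrary; store the file anywhere except the backup node):
# Export a copy of the repository key for safekeeping
borg key export user@backup-node.coolvds.net:/var/backups/prod-db /root/borg-prod-db.key
# Or print a paper copy: borg key export --paper <repo>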
And here is the automation script for the daily run. We use --compression lz4 because CPU cycles on CoolVDS KVM instances are dedicated, not shared/stolen, allowing for high-speed compression without impacting the app.
#!/bin/bash
# /usr/local/bin/backup-runner.sh
LOG="/var/log/borg-backup.log"
export BORG_PASSPHRASE='Correctly_Managed_Secret_In_Vault_2023!'
# Create the archive
echo "Starting backup at $(date)" >> $LOG
# NOTE: consider pointing Borg at a pg_dump file or a pg_basebackup copy
# rather than the live data directory, so the archive is crash-consistent.
borg create --stats --compression lz4 \
    --exclude '/var/lib/postgresql/15/main/pg_wal' \
    user@backup-node.coolvds.net:/var/backups/prod-db::'{hostname}-{now:%Y-%m-%d_%H:%M}' \
    /var/lib/postgresql/15/main \
    /etc/nginx >> $LOG 2>&1
# Prune old backups to manage cost
# (Against an append-only repo, prune only takes real effect once a
# maintenance pass is run from the trusted backup node.)
borg prune -v --list --keep-daily=7 --keep-weekly=4 \
    user@backup-node.coolvds.net:/var/backups/prod-db >> $LOG 2>&1
Pro Tip: Always monitor your exit codes. A backup script that fails silently is a time bomb. Use `curl` to ping a health check endpoint if `$?` is not 0.
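A minimal sketch of that pattern, meant to sit right after the `borg create` call so `$?` still reflects the backup rather than the prune (the ping URL is a placeholder for whatever monitoring endpoint you use):
# Place directly after `borg create` so $? reflects the backup itself
BACKUP_EXIT=$?
if [ "$BACKUP_EXIT" -ne 0 ]; then
    echo "Backup FAILED (exit $BACKUP_EXIT) at $(date)" >> $LOG
    curl -fsS --retry 3 "https://hc-ping.com/<your-check-uuid>/fail" > /dev/null
else
    curl -fsS --retry 3 "https://hc-ping.com/<your-check-uuid>" > /dev/null
fi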
Architecture 2: Hot Standby with PostgreSQL 15
Backups are for disasters. Replication is for continuity. For a Norwegian e-commerce site, if the main node in Oslo (latency < 2ms via NIX) goes dark, you need a standby node ready to take over.
We set up Streaming Replication. The primary node handles writes; the standby handles reads and acts as insurance.
Primary Node Config (postgresql.conf):
# /etc/postgresql/15/main/postgresql.conf
listen_addresses = '*'
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1024MB # Vital to prevent standby falling too far behind
hot_standby = on
# Performance tuning for CoolVDS NVMe
random_page_cost = 1.1 # We trust our storage speed
effective_io_concurrency = 200
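The settings above only enable the machinery; the primary also needs a role with the REPLICATION privilege and a matching `pg_hba.conf` entry. A minimal sketch, assuming the standby sits at 10.10.0.6 and a role named replicator (both placeholders):
# On the primary: create a dedicated replication role (name and password are placeholders)
sudo -u postgres psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'vaulted-secret';"
# Allow the standby to stream WAL, then reload the config
echo "host replication replicator 10.10.0.6/32 scram-sha-256" >> /etc/postgresql/15/main/pg_hba.conf
sudo -u postgres psql -c "SELECT pg_reload_conf();"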
Standby Node Signal:
Since PostgreSQL 12, `recovery.conf` is gone. You must create a `standby.signal` file.
touch /var/lib/postgresql/15/main/standby.signal
The connection info (`primary_conninfo`) now lives in the standby's `postgresql.conf`, or is written to `postgresql.auto.conf` for you when the base backup is taken with `-R`. Check replication health from the primary to make sure the link is keeping up:
psql -x -c "SELECT * FROM pg_stat_replication;"
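To seed the standby in the first place, a common route (sketched here, assuming the primary answers on 10.10.0.5 and the replicator role from above) is `pg_basebackup` with `-R`, which writes `primary_conninfo` and creates `standby.signal` for you:
# On the standby: stop PostgreSQL and clone the primary's data directory
systemctl stop postgresql@15-main
rm -rf /var/lib/postgresql/15/main/*
sudo -u postgres pg_basebackup -h 10.10.0.5 -U replicator \
    -D /var/lib/postgresql/15/main -X stream -R -P
systemctl start postgresql@15-main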
Infrastructure as Code: Recovery Velocity
If a meteor hits the datacenter, you don't want to be reading wiki docs on how to install Nginx. You want to run `terraform apply`. While CoolVDS offers an intuitive UI, for DR scenarios, we recommend using our API-compatible drivers or Ansible.
Here is an Ansible playbook snippet that restores a web environment from scratch. This assumes you are deploying to a fresh CoolVDS instance running Debian 12 (Bookworm).
---
- name: Emergency Restore - Web Node
  hosts: recovery_group
  become: yes
  vars:
    nginx_worker_connections: 1024
  tasks:
    - name: Install dependencies
      apt:
        name: ['nginx', 'certbot', 'python3-certbot-nginx', 'ufw']
        state: present
        update_cache: yes

    - name: Tune kernel for high-load recovery
      sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        state: present
      loop:
        - { key: 'net.core.somaxconn', value: '65535' }
        - { key: 'fs.file-max', value: '2097152' }

    - name: Pull configuration from secure git repo
      git:
        repo: 'git@gitlab.com:company/nginx-configs.git'
        dest: /etc/nginx/sites-available/
        key_file: /root/.ssh/deploy_key
        accept_hostkey: yes
      # (A real run would also link sites-enabled and reload nginx.)

    - name: Allow critical ports before enabling the firewall
      ufw:
        rule: allow
        port: "{{ item }}"
        proto: tcp
      loop:
        - '22'
        - '80'
        - '443'

    - name: Enable firewall with default-deny policy
      ufw:
        state: enabled
        policy: deny
The Legal Latency: GDPR & Data Sovereignty
Technical recovery is useless if it creates a legal disaster. Storing backups of Norwegian citizens' data in a cheap object storage bucket in Virginia, USA, puts you on the wrong side of Chapter V of the GDPR unless you have valid transfer safeguards in place. Even with the EU-US Data Privacy Framework emerging in 2023, the legal ground is shaky.
The Safe Route: Keep data within the EEA (European Economic Area). CoolVDS infrastructure is strictly governed by European law. We don't just offer low latency in milliseconds; we offer low latency in legal compliance.
Network Verification
When setting up your DR site, verify the MTU and path. A common issue in 2023 is fragmentation over VPN tunnels causing slow replication.
ping -M do -s 1472 10.10.0.5
If that packet drops, adjust your MSS clamping in `iptables` or WireGuard config. Don't let a 1500 byte packet kill your recovery speed.
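As a sketch (interface and peer details are assumptions), either clamp TCP MSS to the discovered path MTU on the tunnel gateway or simply lower the WireGuard MTU:
# Clamp MSS to path MTU for TCP traffic forwarded over the tunnel
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
# WireGuard alternative: set a conservative MTU in /etc/wireguard/wg0.conf
#   [Interface]
#   MTU = 1420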
Why Bare Metal Performance Matters in a Cloud World
Virtualization overhead is real. In a disaster scenario, you are usually restoring compressed data. This is CPU intensive. Many providers oversell vCPUs, meaning your "4 Core" VPS is fighting for cycles with 20 other neighbors.
At CoolVDS, we utilize KVM (Kernel-based Virtual Machine) with strict resource guarantees. When you execute `tar -xzvf` on a 50GB log archive, you get the full clock speed of the underlying processor.
Final Thoughts: Test or Fail
A disaster recovery plan that hasn't been tested is just a theoretical document. It belongs in a university library, not a server room. Schedule a "Fire Drill" every quarter. Spin up a fresh CoolVDS instance, isolate it from the public net, and try to restore your service using only your offsite backups and your documentation.
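A minimal drill sketch, reusing the repo from the backup section (export `BORG_PASSPHRASE` first; the scratch path is arbitrary):
# Restore the most recent archive into a scratch directory and time it
mkdir -p /srv/restore-drill && cd /srv/restore-drill
REPO="user@backup-node.coolvds.net:/var/backups/prod-db"
LATEST=$(borg list --last 1 --format '{archive}' "$REPO")
time borg extract --list "$REPO::$LATEST"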
If it takes you longer than 4 hours, you have work to do.
Don't wait for the inevitable hardware failure or the next ransomware wave. Secure your infrastructure on a platform that respects your data and your uptime requirements. Deploy a high-availability test environment on CoolVDS today and sleep easier tonight.