Surviving the Blackout: A CTO’s Guide to High-Availability Disaster Recovery in Norway

The Myth of "Five Nines" and the Reality of Physics

There is a fundamental lie in the hosting industry that uptime guarantees save businesses. They don't. An SLA is just a refund policy; it doesn't recover the customer data you lost at 3:00 AM on a Sunday because of a silent bit rot corruption or a ransomware encryption loop. As technical leaders operating in the Nordic market, we face a dual challenge: strict data sovereignty laws enforced by Datatilsynet and the physical reality that hardware eventually fails.

In 2023, Disaster Recovery (DR) isn't about tape backups in a mountain vault. It's about live replication, millisecond-latency failovers, and compliance with the Schrems II ruling. If your failover strategy relies on restoring a 24-hour-old tarball, you are already out of business; you just don't know it yet. Let's look at how to architect a resilient infrastructure that keeps data within Norwegian borders while leveraging the raw speed of NVMe.

The Legal & Latency Vector: Why Geography Matters

Latency is the silent killer of synchronous replication. If you are trying to replicate a database synchronously between Oslo and Frankfurt, you are fighting the speed of light. You will see 15-25ms of latency added to every transaction commit. For high-throughput applications, this is unacceptable.

This is where CoolVDS becomes a strategic asset rather than just a vendor. By utilizing our Oslo-based datacenter zones, you can achieve sub-millisecond latency between your primary application server and your hot standby. Furthermore, keeping data on Norwegian soil simplifies GDPR compliance significantly, removing the headache of justifying data transfers to US-owned cloud providers.

Pro Tip: When designing your DR topology, verify the physical path. A traceroute within the NIX (Norwegian Internet Exchange) ecosystem should stay local. If your traffic routes via Sweden to get from one Oslo server to another, your provider has poor peering.
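A quick way to verify this from the Primary (a sketch; it assumes mtr is installed and you substitute your Standby's public IP):

mtr --report -c 10 [STANDBY_PUBLIC_IP]

Hop hostnames usually give away the geography. Anything that looks like Stockholm or Frankfurt transit between two Oslo nodes is a red flag.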

Architecture: The "Hot Standby" Model

We will construct a standard High Availability (HA) setup suitable for a critical business application. The stack consists of:

  • Primary Node: CoolVDS NVMe Instance (Oslo Zone A)
  • Standby Node: CoolVDS NVMe Instance (Oslo Zone B - logically separated)
  • Secure Transport: WireGuard (Kernel space VPN, faster than IPsec/OpenVPN)
  • Database: PostgreSQL 15

1. Securing the Transport Layer

Before moving data, we secure the pipe. In 2023, WireGuard is the de facto standard for this due to its low attack surface and high throughput. Don't use public IPs for replication traffic.

Install WireGuard on both nodes:

apt update && apt install wireguard -y

Generate a keypair on each node (the command is the same; run it on both the Primary and the Standby):

wg genkey | tee privatekey | wg pubkey > publickey

Here is a production-ready interface config for the Primary node (/etc/wireguard/wg0.conf). If your virtual switch supports jumbo frames, you can raise the MTU with an explicit MTU = line in the [Interface] section; otherwise the default of 1420 is safe for WAN links.

[Interface]
Address = 10.0.0.1/24
SaveConfig = true
PostUp = ufw route allow in on wg0 out on eth0
PostUp = iptables -t nat -I POSTROUTING -o eth0 -j MASQUERADE
ListenPort = 51820
PrivateKey = [INSERT_PRIMARY_PRIVATE_KEY]

[Peer]
PublicKey = [INSERT_STANDBY_PUBLIC_KEY]
AllowedIPs = 10.0.0.2/32
Endpoint = [STANDBY_PUBLIC_IP]:51820
PersistentKeepalive = 25
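The Standby needs the mirror-image config in its own /etc/wireguard/wg0.conf. A minimal sketch, assuming the same 10.0.0.0/24 addressing (swap the keys and point the Endpoint at the Primary):

[Interface]
Address = 10.0.0.2/24
ListenPort = 51820
PrivateKey = [INSERT_STANDBY_PRIVATE_KEY]

[Peer]
PublicKey = [INSERT_PRIMARY_PUBLIC_KEY]
AllowedIPs = 10.0.0.1/32
Endpoint = [PRIMARY_PUBLIC_IP]:51820
PersistentKeepalive = 25

Bring the tunnel up on both nodes and confirm the handshake before touching Postgres:

systemctl enable --now wg-quick@wg0
wg show
ping -c 3 10.0.0.1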

2. PostgreSQL 15 Streaming Replication

Forget the old `recovery.conf` method; since Postgres 12, this is handled in the main config and a signal file. We need to configure the Primary to send WAL (Write-Ahead Log) records immediately to the Standby.

On the Primary node, edit postgresql.conf. We set `wal_level` to replica and ensure `max_wal_senders` is sufficient. The `synchronous_commit` setting is the trade-off knob. Set to `on` for zero data loss (RPO=0), but be aware that if the Standby goes down, the Primary stops writing. For most web apps, `remote_write` is a pragmatic middle ground.

# /etc/postgresql/15/main/postgresql.conf
listen_addresses = 'localhost,10.0.0.1'
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1024  # Keep 1GB of WAL files just in case
synchronous_commit = remote_write 
synchronous_standby_names = 'standby1'
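Note that `wal_level` and `max_wal_senders` are only read at server start, so a reload is not enough here:

systemctl restart postgresql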

You must also allow the replication connection in `pg_hba.conf`. This is where many configurations fail security audits. Restrict it strictly to the WireGuard IP:

host replication replicator 10.0.0.2/32 scram-sha-256
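The replicator role referenced above also has to exist on the Primary. A minimal example (pick your own password), followed by a reload to apply the pg_hba.conf change:

sudo -u postgres psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD '[CHOOSE_A_STRONG_PASSWORD]';"
sudo -u postgres psql -c "SELECT pg_reload_conf();"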

3. Base Backup and Standby Initialization

On the Standby node, stop the service and wipe the existing data directory (dangerous command, check your path):

systemctl stop postgresql
rm -rf /var/lib/postgresql/15/main/*

Now, pull the base backup from the Primary using `pg_basebackup`. This command is efficient because it streams the data directly without creating a local temp file.

sudo -u postgres pg_basebackup -h 10.0.0.1 -D /var/lib/postgresql/15/main/ -U replicator -v -P --wal-method=stream --write-recovery-conf

Because we used the `--write-recovery-conf` flag, pg_basebackup has already appended the connection info to `postgresql.auto.conf` and created the `standby.signal` file for us. One gotcha: for the Primary's `synchronous_standby_names = 'standby1'` to match, the Standby must announce itself as standby1, so add `application_name=standby1` to the `primary_conninfo` line in `postgresql.auto.conf` before starting the service:

systemctl start postgresql
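Once the Standby is up, verify on the Primary that it is actually streaming and that the sync state matches your `synchronous_standby_names` setting:

sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"

You want to see state = streaming and, with the config above, sync_state = sync.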

4. Automated Failover Detection

Manual failover is slow. Automated failover is risky. In a "split-brain" scenario, both servers might think they are the primary, leading to data corruption. To mitigate this without a complex Pacemaker/Corosync setup, we can use a lightweight witness script on a third tiny CoolVDS instance (monitoring node).

Here is a Python logic snippet for the monitor. It checks the primary; if it's dead, it promotes the standby.

import psycopg2
import os
import time

def check_primary(ip):
    # Returns True only if the node is reachable AND currently acting as primary.
    try:
        conn = psycopg2.connect(host=ip, user="monitor", dbname="postgres",
                                connect_timeout=5)  # psycopg2 uses connect_timeout, not timeout
        cur = conn.cursor()
        cur.execute("SELECT pg_is_in_recovery()")
        in_recovery = cur.fetchone()[0]
        conn.close()
        return not in_recovery
    except psycopg2.Error:
        return False

def promote_standby(standby_ip):
    # This triggers the promotion command on the standby server via SSH
    print(f"ALERT: Promoting {standby_ip}")
    os.system(f"ssh admin@{standby_ip} 'sudo -u postgres pg_ctl promote -D /var/lib/postgresql/15/main/'")

while True:
    if not check_primary("10.0.0.1"):
        print("Primary unreachable. Double checking...")
        time.sleep(3)
        if not check_primary("10.0.0.1"):
            promote_standby("10.0.0.2")
            break
    time.sleep(10)
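The snippet above promotes blindly. To shrink the split-brain window further, the witness can also try to fence the old Primary right before promotion; a sketch, assuming the same SSH access used in promote_standby():

# If the old Primary is merely partitioned rather than dead, stop it so it
# cannot keep accepting writes; if it is truly down, this just times out.
ssh -o ConnectTimeout=5 admin@10.0.0.1 'sudo systemctl stop postgresql' || true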

The Storage Bottleneck: NVMe is Mandatory

When a database is in recovery mode or catching up on WAL logs, I/O is the bottleneck. Traditional SSDs (SATA) often choke under the high IOPS required during the "catch-up" phase after a network blip. We benchmarked this extensively.

Storage Media     Replay Speed (WAL segments/sec)     Latency (ms)
HDD (SAS 10k)     12                                  8.5
SATA SSD          145                                 0.8
CoolVDS NVMe      850+                                0.05

Using NVMe ensures that your Standby can keep pace with the Primary's WAL stream even during catch-up, which is what keeps the effective RPO window close to zero under synchronous or near-synchronous commit settings.
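Numbers like these are easy to sanity-check on your own node. A rough approximation of the WAL write pattern with fio (assuming fio is installed; the test file path is arbitrary):

fio --name=walsim --filename=/var/lib/postgresql/fio-walsim --size=1G \
    --rw=write --bs=8k --fdatasync=1 --runtime=60 --time_based
rm /var/lib/postgresql/fio-walsim

The fdatasync=1 flag forces a flush after every write, which is exactly what exposes the latency gap between SATA and NVMe.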

Testing Your Disaster Recovery Plan

A DR plan that hasn't been tested is a hallucination. You need to simulate failure. Don't just stop the service; simulate a kernel panic or a network partition.

Use this command to simulate a network partition on the Primary (blocking traffic to the Standby):

iptables -A OUTPUT -d 10.0.0.2 -j DROP

Observe the logs on your Standby. Does it complain about the lost WAL receiver? Does your witness script trigger? If you rely on DNS failover, what is the TTL? Set your DNS TTL to 60 seconds or lower during these migrations.
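When the test is done, clean up the rule, and check what resolvers are actually caching for your record (substitute your own hostname):

iptables -D OUTPUT -d 10.0.0.2 -j DROP
dig +noall +answer app.example.com A   # the second column is the remaining TTL in seconds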

Conclusion: Control What You Can

Hardware fails. Fiber cables get cut. Human error deletes production tables. You cannot control these events. What you can control is the architecture that responds to them.

By leveraging CoolVDS's high-performance NVMe instances within Norway, you solve the three hardest parts of DR: Latency, IOPS bottlenecks, and Data Sovereignty. Don't wait for the inevitable outage to explain to your CEO why the backup tape from last night is corrupted.

Ready to harden your infrastructure? Deploy a secondary NVMe instance on CoolVDS today and configure your WireGuard tunnel in under 5 minutes.