The 3:00 AM Panic Attack
I still remember the sound of my phone vibrating off the nightstand in 2019. It wasn't a monitoring alert; it was the CEO. A rogue `rm -rf` script, intended for a cache directory, had silently eaten through the `/var/lib/mysql` mount on our primary node. We had backups, sure. But we hadn't tested the Recovery Time Objective (RTO). It took us 19 hours to pull 4TB of data across a throttled WAN link. 19 hours of revenue lost. 19 hours of explaining to angry stakeholders why "we have backups" wasn't enough.
In 2023, if your Disaster Recovery (DR) plan is just "we do nightly snapshots," you are negligent. With ransomware now targeting backups specifically, and Datatilsynet (the Norwegian Data Protection Authority) strictly enforcing data-availability requirements, we need a sharper approach.
This is not a theoretical whitepaper. This is the architecture I deploy today for mission-critical Norwegian workloads, prioritizing data sovereignty, rapid recovery, and zero trust.
1. The Fallacy of "Offsite" (Latency Matters)
Many DevOps engineers push backups to cold storage in Amazon Glacier or a cheap bucket in Frankfurt. That's fine for archiving. It is catastrophic for recovery.
If your primary users are in Oslo or Bergen, your DR site needs to be close enough to maintain low latency during a failover, but geographically distinct enough to survive a datacenter power outage. We typically look for a secondary site with <10ms latency to the NIX (Norwegian Internet Exchange).
Pro Tip: Don't just ping. Test throughput. Restoring 500GB over a 100Mbps link takes ~11 hours. Over a 10Gbps link (standard on CoolVDS NVMe instances), it takes minutes. Bandwidth is the bottleneck of recovery.
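If you want to verify both numbers before committing to a site, a quick sanity check looks roughly like this (a sketch only; it assumes iperf3 is installed on both ends and reuses the dr-node.coolvds.com hostname from the backup examples further down):

# On the DR node: run a temporary iperf3 server
iperf3 -s -p 5201

# On the production server: measure raw TCP throughput for 30 seconds
iperf3 -c dr-node.coolvds.com -p 5201 -t 30

# Then test the path restic will actually use (SSH adds CPU overhead);
# dd prints the effective transfer rate when it finishes
dd if=/dev/zero bs=1M count=1024 | ssh backupuser@dr-node.coolvds.com 'cat > /dev/null'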
2. Immutable Backups with Restic
Ransomware loves to encrypt your mounted backup drives. To counter this, we use Restic in a push model, sending encrypted snapshots to a hardened repository that the web server can append to but never overwrite or delete.
Here is a standard production configuration for an encrypted backup sent to a CoolVDS storage instance via SFTP. We strictly use SSH keys with forced commands to prevent shell access.
Client Side (The Web Server)
First, initialize the repository. Note the use of `sftp`. We avoid mounting the filesystem directly.
# initialize the repo
export RESTIC_PASSWORD_FILE="/etc/restic/pw"
restic -r sftp:backupuser@dr-node.coolvds.com:/srv/backups/app1 init
# The backup command (put this in cron)
restic -r sftp:backupuser@dr-node.coolvds.com:/srv/backups/app1 backup \
  --verbose \
  --exclude-file=/etc/restic/excludes.txt \
  /var/www/html /etc/nginx

This encrypts data before it leaves the server. Even if your DR node is compromised, the data remains a blob of AES-256 gibberish without the key.
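On the DR node itself, the "no shell access" part is enforced by sshd, not by Restic. A minimal sketch of the lockdown for the backup user (the Match block is illustrative; adapt the username and repository path to your layout):

# /etc/ssh/sshd_config on the DR node
Match User backupuser
    ForceCommand internal-sftp
    AllowTcpForwarding no
    X11Forwarding no
    PermitTTY no

Reload sshd after editing. If you need hard append-only guarantees on top of this, Restic's companion rest-server ships an append-only mode that is worth evaluating.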
3. Real-Time Database Replication (The RPO Killer)
Backups cover you if you can afford to lose 24 hours of data (your Recovery Point Objective, or RPO). Most eCommerce sites cannot. For them, we need replication.
However, standard Master-Slave replication is fragile. If the master sends a corrupted binary log, the slave replicates the corruption. In 2023, we use GTID (Global Transaction ID) based replication with delayed application on the slave node. This gives us a "time machine" window to stop the replication if a disastrous query runs on the master.
Here is the critical `my.cnf` configuration for MariaDB 10.6+:
[mysqld]
# Unique ID for the server
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
expire_logs_days = 7
max_binlog_size = 100M
# GTID Safety
gtid_domain_id = 1
log_slave_updates = 1
# Crash Safety
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
# Networking
bind-address = 10.8.0.1 # Only listen on WireGuard VPN IP

On the CoolVDS slave node, we configure a delay:
CHANGE MASTER TO MASTER_DELAY = 3600; -- 1 hour delay

This `MASTER_DELAY` is your safety net. If someone drops the `users` table at 10:00 AM, you have until 11:00 AM to stop the slave and promote it to master.
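When that hour comes, the promotion itself is only a handful of statements on the slave. A rough MariaDB sketch (the GTID value is a placeholder; check SHOW SLAVE STATUS before and after each step):

-- On the delayed slave, the moment the disaster is spotted on the master:
STOP SLAVE;

-- (If needed, START SLAVE UNTIL master_gtid_pos = '1-1-123456' replays up to
--  the last known-good transaction; stop the slave again once it gets there.)

-- Detach the node and open it for writes:
RESET SLAVE ALL;
SET GLOBAL read_only = OFF;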
4. Secure Networking: WireGuard Mesh
Never expose your database replication ports (3306) to the public internet, even with firewalls. IP spoofing is trivial. Instead, we build a private mesh network using WireGuard. It’s faster than OpenVPN and built into the Linux kernel (5.6+).
This setup allows your primary server in a different datacenter to talk securely to your CoolVDS DR node in Norway.
| Feature | OpenVPN | WireGuard |
|---|---|---|
| Codebase Size | ~100,000 LOC | ~4,000 LOC |
| Handshake | Slow (TLS) | Instant (Noise Protocol) |
| Latency Impact | Moderate | Negligible |
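Key management is refreshingly simple: no certificates, no CA. Each node generates its own pair, and only the public half ever leaves the machine (the paths below are a convention, not a requirement):

# Run on each node
umask 077
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key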
Configuring `wg0.conf` on the DR Node:
[Interface]
Address = 10.8.0.1/24
ListenPort = 51820
PrivateKey = <Server_Private_Key>
# Peer: Production Web Server
[Peer]
PublicKey = <Client_Public_Key>
AllowedIPs = 10.8.0.2/32

Bring it up with:

wg-quick up wg0

Now, your database traffic flows over `10.8.0.x`, fully encrypted, invisible to the public web.
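For completeness, the web server's side of the tunnel mirrors this with the roles flipped. A sketch (the endpoint hostname reuses the one from the backup examples; `PersistentKeepalive` helps if the client sits behind NAT):

[Interface]
Address = 10.8.0.2/24
PrivateKey = <Client_Private_Key>

# Peer: CoolVDS DR Node
[Peer]
PublicKey = <Server_Public_Key>
Endpoint = dr-node.coolvds.com:51820
AllowedIPs = 10.8.0.1/32
PersistentKeepalive = 25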
5. The Infrastructure as Code (IaC) Approach
Manual recovery is error-prone. In a crisis, your hands shake. You will mistype commands. We use Terraform to define the recovery environment. If the primary site is incinerated, we run one command to provision the application servers on CoolVDS.
While I can't share the full module, here is how we define a compute resource that matches the high I/O requirements of a database:
resource "coolvds_instance" "dr_db" {
hostname = "dr-db-01.norway"
plan = "nvme-16gb-4cpu"
region = "oslo"
os = "ubuntu-22.04"
ssh_keys = [var.admin_ssh_key]
# Cloud-init to install necessary packages immediately
user_data = <<-EOF
#!/bin/bash
apt-get update
apt-get install -y mariadb-server wireguard
EOF
}Why Local Infrastructure Matters (Schrems II & GDPR)
Since the Schrems II ruling, transferring personal data of EU/EEA citizens to US-controlled clouds (AWS, Azure, Google) carries legal risk due to the US CLOUD Act. Even if the server is in Frankfurt, a US-based provider can be compelled to hand over the data.
Hosting your DR site on a provider like CoolVDS, which operates under Norwegian jurisdiction and adheres to GDPR without the US baggage, simplifies compliance massively. It’s not just about tech; it’s about legal survivability.
The "Fire Drill"
A plan you haven't tested is a hallucination. Every Friday, my team runs a "Chaos Monkey" script. We simulate a web server failure.
- DNS failover is triggered via the Cloudflare API (sketched below).
- The CoolVDS standby node spins up Nginx.
- The Read-Only database slave is promoted to Write.
If this process takes more than 5 minutes, we failed. Currently, on NVMe hardware, we average 45 seconds.
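The DNS flip in step one is a single API call. A sketch against the Cloudflare v4 API (the zone ID, record ID, hostname and standby IP are placeholders, and the token needs DNS edit permissions):

# Repoint the A record at the CoolVDS standby node
curl -sS -X PUT \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"type":"A","name":"www.example.com","content":"203.0.113.50","ttl":60,"proxied":false}'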
Final Thoughts
Disaster recovery is expensive until you need it. Then, it's priceless. Don't rely on hope. Build a fortress. Use immutable backups, delayed replication, and encrypted private networks. And choose infrastructure that respects your data sovereignty.
Ready to harden your stack? Spin up a dedicated NVMe instance on CoolVDS today and test your Restic throughput. If you aren't hitting 500MB/s, you're on the wrong platform.