Surviving the Meltdown: A Battle-Hardened Guide to Disaster Recovery in Norway
It’s 3:14 AM on a Tuesday. Your phone is buzzing on the nightstand. It’s not a text from your girlfriend; it’s Nagios. Your primary database node in Oslo just went dark. No ping, no SSH, nothing. You try the IPMI console, but even that is timing out. Panic sets in.
If you don’t have a plan right now, you aren't an engineer; you’re a liability. I’ve seen seasoned sysadmins freeze when `mdadm` reports a double disk failure. I’ve watched startups evaporate because they thought a daily mysqldump stored on the same server was a "backup strategy."
This isn't about buzzwords. This is about survival. In the Nordic hosting market, where latency to NIX (Norwegian Internet Exchange) is measured in single milliseconds and data sovereignty is scrutinized by Datatilsynet, you need a robust, battle-tested Disaster Recovery (DR) plan. Let’s build one.
The Cold Hard Truth: RAID Is Not A Backup
Let’s get this out of the way immediately. RAID protects you from a disk failure. It does not protect you from file corruption, `rm -rf /`, controller failure, or a flood in the server room. Disaster Recovery is about redundancy across failure domains.
For a robust setup in 2014, we are looking at three layers of redundancy:
- Data Replication: Real-time syncing of blocks or files.
- Database Replication: Master-Slave setups.
- Network Failover: Floating IPs or DNS swinging.
Layer 1: The Filesystem (Lsyncd vs. DRBD)
For static assets—user uploads, configuration files, web roots—you need them on your standby server instantly. `cron` + `rsync` every 5 minutes is amateur hour: in the worst case you lose 4 minutes and 59 seconds of data.
We use Lsyncd (Live Syncing Daemon). It watches the kernel’s `inotify` events and triggers `rsync` only when files change. It’s lightweight and reliable.
Configuration: /etc/lsyncd.conf.lua
settings {
    logfile    = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd.status"
}

sync {
    default.rsyncssh,
    source    = "/var/www/html",
    host      = "10.0.0.2",   -- Private IP of your failover node
    targetdir = "/var/www/html",
    rsync = {
        archive  = true,
        compress = true,
        _extra   = { "--bwlimit=5000" }   -- Don't saturate the link
    }
}
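Under the hood, every batch of inotify events ends in a plain rsync call. You can sanity-check the flags locally before trusting them with production data — the temp directories below are stand-ins for the master and standby web roots:

```shell
#!/bin/sh
# Replay what Lsyncd does on a change event, using local stand-in paths.
SRC=$(mktemp -d)   # plays /var/www/html on the master
DST=$(mktemp -d)   # plays /var/www/html on the standby

echo "user-upload" > "$SRC/avatar.png"

# Same flags as the config above: archive mode, compression, bandwidth cap.
rsync --archive --compress --bwlimit=5000 "$SRC/" "$DST/"

cat "$DST/avatar.png"   # -> user-upload
rm -rf "$SRC" "$DST"
```

Note the trailing slash on `$SRC/` — without it, rsync copies the directory itself into the target instead of its contents.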
If you need block-level replication (every single write confirmed on both nodes before it is acknowledged), you'd look at DRBD. But be warned: DRBD in dual-primary mode is a great way to corrupt two filesystems at once if you don't configure fencing (STONITH) correctly. Stick to Lsyncd for web files.
Layer 2: MySQL Master-Slave Replication
Your database is your business. If you lose it, you go home. We are seeing a lot of folks moving to MariaDB, but standard MySQL 5.5/5.6 is still the rock we build on. Setting up Master-Slave replication is mandatory.
On your Master (Primary Node), you need to enable the binary log in `my.cnf`:
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = production_db           # careful: filters on the default database, cross-db statements can escape the log
innodb_flush_log_at_trx_commit = 1     # flush the InnoDB redo log to disk on every commit
sync_binlog = 1                        # fsync the binary log on every commit -- slower, but the slave never sees a lie
On the Slave (CoolVDS Backup Node), you set a unique server ID and a relay log; the actual connection to the master is configured at runtime with `CHANGE MASTER TO`:
[mysqld]
server-id = 2
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = 1
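Wiring the two together takes a few statements on each side. Seed the slave from a consistent dump first (`mysqldump --master-data` embeds the matching log coordinates). The host IPs, the `repl` account, and the log file/position below are illustrative — take the real values from `SHOW MASTER STATUS`:

```sql
-- On the master: create a replication account (names and password are examples)
CREATE USER 'repl'@'10.0.0.2' IDENTIFIED BY 'choose-a-real-password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.2';
FLUSH PRIVILEGES;
SHOW MASTER STATUS;   -- note the File and Position columns

-- On the slave: point it at the master using those coordinates
CHANGE MASTER TO
  MASTER_HOST = '10.0.0.1',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'choose-a-real-password',
  MASTER_LOG_FILE = 'mysql-bin.000001',   -- from SHOW MASTER STATUS
  MASTER_LOG_POS  = 107;                  -- from SHOW MASTER STATUS
START SLAVE;
SHOW SLAVE STATUS\G   -- Slave_IO_Running and Slave_SQL_Running must both say Yes
```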
Pro Tip: Monitor the `Seconds_Behind_Master` metric. If it creeps up, the slave can't apply writes as fast as the master commits them, and on commodity hardware disk I/O is usually the culprit. This is where storage speed matters.
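That metric drops straight into a Nagios check. A minimal sketch, assuming the standard `mysql` client on the slave — the threshold and the canned demo input are mine:

```shell
#!/bin/sh
# Alert if replication lag exceeds a threshold (seconds).
# In production, feed it live data: mysql -e 'SHOW SLAVE STATUS\G' | check_lag
THRESHOLD=30

check_lag() {
    # Reads SHOW SLAVE STATUS\G output on stdin, prints OK or LAGGING.
    lag=$(awk -F': ' '/Seconds_Behind_Master/ {print $2}')
    if [ "$lag" = "NULL" ] || [ "$lag" -gt "$THRESHOLD" ]; then
        echo "LAGGING ($lag s)"
    else
        echo "OK ($lag s)"
    fi
}

# Demo with canned output:
printf 'Slave_IO_Running: Yes\nSeconds_Behind_Master: 4\n' | check_lag   # -> OK (4 s)
```

`Seconds_Behind_Master: NULL` means replication is broken outright, not merely slow, which is why the script treats it as a failure too.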
The Hardware Bottleneck: Why Storage Wins Wars
You can script the best failover logic in the world, but if your disk I/O chokes during the sync, you fail. Standard SATA SSDs are good, but we are starting to see the limits of the SATA III interface (6Gbps).
This is where CoolVDS is pushing the envelope. We are rolling out support for NVMe storage technologies (PCIe-based flash). While traditional VPS providers are still figuring out how to cache on spinning rust, we are offering I/O throughput that feels like direct memory access. For a high-transaction MySQL slave, this low latency is the difference between a 1-second lag and a 1-minute lag.
Layer 3: The Switch (Keepalived)
When the master dies, you need to switch traffic. If your nodes are in the same datacenter (e.g., our Oslo facility), use Keepalived with VRRP to float a shared IP address between servers.
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cr3t
    }
    virtual_ipaddress {
        192.168.1.100
    }
}
If the Master stops broadcasting VRRP advertisements, the Backup node claims the IP `192.168.1.100` within a few seconds (three missed advertisements at `advert_int 1`). Your users won't even notice the flicker.
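One caveat: bare VRRP only notices a dead host, not a dead service. Keepalived can track a health-check script and surrender the IP when the check fails — the nginx check below is an illustration, adapt it to whatever daemon the floating IP fronts:

```
vrrp_script chk_nginx {
    script "killall -0 nginx"   # exit 0 only if an nginx process exists
    interval 2                  # check every 2 seconds
    weight -20                  # on failure, drop priority below the backup's
}
```

Then add `track_script { chk_nginx }` inside `vrrp_instance VI_1`: with the master at priority 100 and weight -20, a failed check drops it to 80, and a backup configured at priority 90 takes the IP.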
The "CoolVDS" Factor: KVM vs. The World
Why do we insist on KVM virtualization at CoolVDS? Because in a disaster scenario, you need guaranteed resources. OpenVZ containers share a kernel. If a noisy neighbor kernel-panics the host, you go down with them.
With KVM, your kernel is yours. Your memory is yours. Combined with our DDoS protection (which scrubs attack traffic upstream before it ever reaches your node), you have a fortress. Whether you are hosting a high-traffic Magento store or a critical email gateway, isolation is safety.
Legal & Local: The Norwegian Context
We operate under the Norwegian Personal Data Act (Personopplysningsloven). Data sovereignty is a massive topic right now, especially with the Safe Harbor framework looking shaky after recent leaks. Hosting outside Norway adds a layer of legal complexity you don't need.
By keeping your primary and DR nodes within our Norwegian infrastructure (or syncing to a verified secondary EU location), you satisfy the compliance officers while keeping latency to the Oslo exchange negligible.
Final Thoughts
Disaster Recovery is expensive, tedious, and thankless—until the day it saves your job. Don't rely on "cloud magic." Build the redundancy yourself. Script it. Test it.
Action Item: Don't let slow I/O kill your replication stream. Deploy a test instance with our high-performance storage today. SSH in, run `ioping`, and see what low latency really looks like.
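No `ioping` installed yet? GNU `dd` with `oflag=dsync` forces a synchronous write per block — the same penalty `sync_binlog = 1` pays on every commit — and gives you a rough latency number on any Linux box:

```shell
#!/bin/sh
# Rough synchronous-write latency probe: 100 x 4 KB writes, each flushed to disk.
TMP=$(mktemp)
dd if=/dev/zero of="$TMP" bs=4k count=100 oflag=dsync 2>&1 | tail -n 1
rm -f "$TMP"
```

Flash-backed storage will report throughput an order of magnitude or more above contended spinning disks on this test; run it on your current host, then run it on ours.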