When "sudo rm -rf" Hits Production: Real Disaster Recovery in 2014
It is 3:00 AM on a Tuesday. Your monitoring system—Nagios, Zabbix, take your pick—starts screaming. The load average on your primary database server just hit 50.0. Then, silence. Ping timeout. SSH connection refused. If your strategy right now is "I'll restore from the backup tape," you are already dead. In the time it takes to provision a new server, install the OS, and rsync 500GB of data, your CEO has called you ten times and your customers have moved to a competitor.
I have seen this happen. In 2012, I watched a legacy RAID controller silently corrupt data for three days before finally dying. The backups? They were perfectly faithful copies of corrupted data. We spent 72 hours rebuilding InnoDB tablespaces with hex editors. Never again.
True Disaster Recovery (DR) isn't about backups. It is about continuity. It is about RTO (Recovery Time Objective) and RPO (Recovery Point Objective). For Norwegian businesses, where the Personopplysningsloven (Personal Data Act) places strict requirements on data availability and integrity, relying on a nightly tarball is negligence.
The Architecture of Survival
To survive a catastrophe, you need redundancy at every layer: Storage, Compute, and Network. Let's break down a robust stack using technologies available today, specifically referencing Ubuntu 14.04 LTS and MySQL 5.6.
1. Asynchronous Database Replication
Forget the old master-slave setup plagued by binary log file-and-position bookkeeping errors. MySQL 5.6 introduced GTIDs (Global Transaction Identifiers), making replication topology changes significantly less painful. If your master node melts, you need a hot standby ready to take writes instantly.
Here is the critical configuration for my.cnf on the master node. Do not ignore innodb_flush_log_at_trx_commit: setting it to 1 is the only way to guarantee that a committed transaction survives a crash or power loss (full durability), even if it costs you I/O performance.
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
# ROW format replicates the actual row changes rather than the statements that produced them
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = true
# Required alongside gtid_mode in 5.6; also lets this node feed further slaves after a failover
log_slave_updates = 1
# Safety first - ensure data is on disk before saying "OK"
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
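The configuration alone does nothing until the standby has an account to replicate through. Here is a minimal sketch, run once on the master; the user name, address range, and password are placeholders for illustration, not values from this article:

-- Executed on the master. 'repl', the 10.0.0.% range and the password are placeholders.
CREATE USER 'repl'@'10.0.0.%' IDENTIFIED BY 'use-a-long-random-password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.%';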
On your slave (a CoolVDS instance in a secondary zone), enable read-only mode so that accidental writes cannot break consistency:
[mysqld]
server-id = 2
gtid_mode = ON
enforce_gtid_consistency = true
# MySQL 5.6 refuses gtid_mode = ON unless binary logging and log_slave_updates are also enabled
log_bin = /var/log/mysql/mysql-bin.log
log_slave_updates = 1
read_only = 1
Pro Tip: Never rely on the default MySQL settings for network timeouts in a WAN replication setup. Set slave_net_timeout to 60 seconds (default is 3600) so your slave realizes the master is dead before your customers do.
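With both nodes configured, attaching the slave is a single statement thanks to GTID auto-positioning. A minimal sketch, run on the DR node; the hostname is a placeholder, and the credentials match the placeholder replication user created above:

-- Run on the DR node. The host below stands in for your master's address.
CHANGE MASTER TO
  MASTER_HOST = 'db-master.example.net',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'use-a-long-random-password',
  MASTER_AUTO_POSITION = 1;
START SLAVE;
-- Verify: Slave_IO_Running and Slave_SQL_Running should both report Yes.
SHOW SLAVE STATUS\G

No binlog file names, no byte offsets: MASTER_AUTO_POSITION lets the slave work out which transactions it is missing, and the same statement keeps working after topology changes.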
2. Real-Time File Synchronization
For static assets (user uploads, configuration files), rsync via cron is too slow. You need lsyncd (Live Syncing Daemon): it watches filesystem events through the kernel's inotify interface and triggers a sync the moment a file changes.
Here is a working Lua config for Lsyncd 2.1.5:
settings {
    logfile = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd.status"
}

sync {
    default.rsync,
    source = "/var/www/html/uploads",
    target = "root@dr-node.coolvds.net:/var/www/html/uploads",
    rsync = {
        compress = true,
        archive = true,
        verbose = true,
        -- Convenient, but this skips host-key verification; pre-seed known_hosts in production
        rsh = "/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"
    }
}
This setup ensures that if your primary web server vanishes, your DR node has the same content, minus at worst the changes still sitting in lsyncd's batching window (events are coalesced according to the delay option, 15 seconds by default, so lower it in the sync block if you need a tighter bound).
The Hardware Reality: KVM vs. OpenVZ
Software redundancy is useless if the underlying virtualization lies to you. Many budget providers use OpenVZ (containers). In OpenVZ, if the host kernel panics, everyone dies. Furthermore, you cannot load the kernel modules needed for advanced DR tooling such as DRBD (Distributed Replicated Block Device).
This is why we standardized on KVM (Kernel-based Virtual Machine) at CoolVDS. KVM provides full hardware virtualization. Each VPS has its own kernel. If a neighbor crashes their OS, your disaster recovery daemon keeps running. For serious workloads, we couple this with high-performance storage.
| Feature | OpenVZ (Common Budget VPS) | KVM (CoolVDS Standard) |
|---|---|---|
| Kernel Isolation | Shared (Single point of failure) | Isolated (True segregation) |
| Disk I/O Access | Contended / Buffered | Direct / VirtIO Drivers |
| Custom Modules (DRBD) | Impossible | Allowed |
The Storage Bottleneck
Replication is I/O intensive. Mechanical SAS drives are still the norm, but the industry is shifting: PCIe-based flash and early NVMe deployments are appearing in enterprise environments. While still expensive, putting your database's redo and binary logs on SSD-backed storage is the single best upgrade you can make to reduce replication lag. If your disk cannot write the relay log fast enough, your slave will fall behind.
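You can watch this happen in real time. A quick check on the DR node, using nothing beyond stock MySQL; if the lag counter keeps climbing under normal load, storage is your bottleneck:

-- Run on the DR node to gauge replication lag.
SHOW SLAVE STATUS\G
-- Seconds_Behind_Master: how far the SQL thread trails the master.
-- Slave_IO_Running / Slave_SQL_Running: both must say Yes, or lag is the least of your worries.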
Data Sovereignty: The Norwegian Context
Since the Snowden revelations last year, trust in US-based cloud giants has eroded. Relying on Safe Harbor agreements is becoming a legal risk. The Datatilsynet (Norwegian Data Protection Authority) is increasingly strict about where the personal data of Norwegian citizens resides.
Hosting your DR site on Amazon AWS in Ireland might be technically sound, but does it meet your compliance requirements? Keeping data within Norway—or at least within the EEA with strict guarantees—is crucial. Latency matters too. Round-trip time from Oslo to a server in Frankfurt is ~25ms; Oslo to a local NIX-connected datacenter is ~2ms. That 23ms difference adds up on every database round trip: twenty sequential queries or synchronous commits cost half a second of pure network wait at 25ms each, versus 40ms locally.
Deploying the Solution
Do not wait for the hardware failure. It is coming. It is a mathematical certainty.
- Audit your stack: Can you tolerate 4 hours of downtime? If not, you need a hot standby.
- Switch to KVM: Move critical databases off container-based hosting.
- Test the failover: A DR plan that hasn't been tested is just a hypothesis. Run a "Game Day" where you actually shut down the master server; a minimal promotion sequence is sketched below.
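For the Game Day itself, promoting the DR node is short work. A minimal sketch, assuming the GTID setup from section 1 and that you have already confirmed the old master is genuinely dead:

-- 1. Check that the SQL thread has applied everything it received:
--    Read_Master_Log_Pos and Exec_Master_Log_Pos should match.
SHOW SLAVE STATUS\G
-- 2. Stop replication and open the node for writes.
STOP SLAVE;
SET GLOBAL read_only = OFF;
-- (Also flip read_only in my.cnf so a restart does not lock you out again.)
-- 3. Repoint the application. When the old master comes back, attach it as a
--    slave of this node with CHANGE MASTER TO ... MASTER_AUTO_POSITION = 1.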
If you need a robust, KVM-based foundation with local Norwegian latency and bleeding-edge I/O performance to back your strategy, we are ready. Don't let slow I/O kill your SEO or your uptime.