Disaster Recovery in 2016: Surviving Data Loss and the Post-Safe Harbor Reality
Let’s be honest. If your disaster recovery plan relies on a manual restore process or, worse, a "hope nobody trips over the power cable" strategy, you are already negligent. In the wake of the European Court of Justice invalidating the Safe Harbor agreement last year, and with the EU-US Privacy Shield only just adopted last month (July 2016), the location of your backups is no longer just a technical detail. It is a legal minefield.
For Norwegian businesses, data sovereignty is now the critical metric alongside RPO (Recovery Point Objective) and RTO (Recovery Time Objective). If your primary server melts down in Oslo, and your backup is sitting in an Amazon bucket in Virginia, you might restore your data only to find yourself in a regulatory crisis with Datatilsynet.
This guide ignores the fluff. We are looking at implementing a robust, automated DR strategy using tools available right now: Ansible, MySQL 5.7 GTID replication, and secure off-site backups within Norwegian borders.
The "3-2-1" Rule: Adjusted for 2016
The classic rule remains valid: 3 copies of data, 2 different media types, 1 off-site. However, the "off-site" definition has changed. Latency matters. Restoring 500GB of data over a transatlantic link is slow. Restoring it from a secondary data center in Norway via peering at NIX (Norwegian Internet Exchange) is fast.
Pro Tip: When selecting a VPS provider for DR, verify their virtualization stack. We use KVM on CoolVDS because OpenVZ containers share the host's kernel. If that kernel panics, every "isolated" container on the node dies with it. KVM provides the hardware abstraction necessary for true stability.
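You can verify this from inside any guest you are evaluating. A quick sketch, assuming a systemd-based distro such as CentOS 7 or Ubuntu 16.04:
# Check the virtualization type from inside the guest
systemd-detect-virt                 # prints "kvm" on a KVM guest, "openvz" inside an OpenVZ container
# Fallback: the hypervisor CPU flag is set on hardware-virtualized guests
grep -q hypervisor /proc/cpuinfo && echo "hardware virtualization detected"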
Step 1: Database Resilience with MySQL 5.7 GTID
If you are still running MySQL 5.5, stop reading and upgrade. MySQL 5.6 introduced Global Transaction Identifiers (GTID), and 5.7 (released late last year) perfected it. GTID makes failover sanity-preserving because you don't have to manually calculate log file positions.
Here is the configuration required on your Master server to enable crash-safe replication:
# /etc/my.cnf
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
expire_logs_days = 7
max_binlog_size = 100M
# GTID Configuration for Crash Safety
gtid_mode = ON
enforce_gtid_consistency = ON
log_slave_updates = ON
# Durability settings (Critical for DR, slightly impacts write speed)
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
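Restart MySQL to apply these settings, then create a dedicated account the DR node can replicate with. A minimal sketch; the account name, password, and subnet below are placeholders for your own values:
# On the master: create a dedicated replication account
mysql -u root -p <<'SQL'
CREATE USER 'repl'@'10.20.30.%' IDENTIFIED BY 'ChangeMeToAStrongPassword';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.20.30.%';
FLUSH PRIVILEGES;
SQL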
On your Slave (the DR node hosted on a separate CoolVDS instance), the config is similar, but with server-id = 2 and read-only mode enabled:
read_only = 1
super_read_only = 1 # New in 5.7, blocks writes even from accounts with SUPER privilege (including root)
Setting sync_binlog=1 is non-negotiable for DR. Yes, it adds disk I/O latency. However, if your server loses power without this, your binary log might be missing the last few transactions, breaking replication integrity. This is why we insist on NVMe storage for our CoolVDS nodes—the high IOPS capability negates the performance penalty of strict ACID compliance.
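With GTID enabled on both nodes, attaching the slave to the master no longer involves binary log coordinates. A minimal sketch, assuming the repl account created above and a master reachable at 10.20.30.10 (substitute your own credentials and address):
# On the DR slave: attach to the master using GTID auto-positioning
mysql -u root -p <<'SQL'
CHANGE MASTER TO
  MASTER_HOST = '10.20.30.10',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'ChangeMeToAStrongPassword',
  MASTER_AUTO_POSITION = 1;
START SLAVE;
SHOW SLAVE STATUS\G
SQL
Check that Slave_IO_Running and Slave_SQL_Running both report Yes before you trust the DR node.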
Step 2: Automated File Replication
Databases are half the battle. Uploaded assets, configuration files, and SSL certificates must also be replicated. Do not overcomplicate this with distributed filesystems like GlusterFS unless you have a dedicated storage team. They are brittle.
For 99% of setups, a robust rsync wrapper is superior. Below is a production-grade script that handles locking and alerting. We run this via cron every 15 minutes (the cron entry is shown further down).
#!/bin/bash
# /opt/scripts/dr_sync.sh
SOURCE_DIR="/var/www/html"
DEST_HOST="dr-user@10.20.30.40"
DEST_DIR="/backup/www"
LOG_FILE="/var/log/dr_sync.log"
LOCK_FILE="/var/run/dr_sync.lock"
# Check for stale lock file (older than 1 hour)
if [ -f "$LOCK_FILE" ]; then
    if [ "$(find "$LOCK_FILE" -mmin +60)" ]; then
        echo "Stale lock found, removing..." >> "$LOG_FILE"
        rm -f "$LOCK_FILE"
    else
        echo "Sync already running." >> "$LOG_FILE"
        exit 1
    fi
fi
touch "$LOCK_FILE"
# Execute Sync
# -a: archive mode
# -v: verbose
# -z: compress
# --delete: remove files on destination that are gone on source
rsync -avz --delete -e "ssh -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa_dr" \
    "$SOURCE_DIR" "$DEST_HOST:$DEST_DIR" >> "$LOG_FILE" 2>&1
STATUS=$?
if [ $STATUS -ne 0 ]; then
    echo "CRITICAL: DR Sync Failed at $(date)" | mail -s "DR ALERT" admin@company.no
fi
rm -f "$LOCK_FILE"
Ensure you generate an SSH key pair specifically for this task: ssh-keygen -t rsa -b 4096. Never use password authentication for automated backups.
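For reference, the key setup and the cron schedule might look like the sketch below; the key path matches the script above, while the remote user, host, and timing are examples to adapt:
# Generate a dedicated key and push the public half to the DR node
ssh-keygen -t rsa -b 4096 -f /root/.ssh/id_rsa_dr -N "" -C "dr-sync"
ssh-copy-id -i /root/.ssh/id_rsa_dr.pub dr-user@10.20.30.40

# /etc/cron.d/dr_sync -- run the sync every 15 minutes as root
*/15 * * * * root /opt/scripts/dr_sync.sh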
Step 3: Infrastructure as Code (IaC) with Ansible
Having data is useless if you don't have a server configuration to host it. In 2016, manually editing /etc/nginx/nginx.conf is professional suicide. If your main server dies, you need to spin up a fresh CoolVDS instance and provision it in minutes.
We use Ansible (v2.1) for this. Here is a playbook snippet that ensures your web server stack is identical on production and DR nodes.
---
- hosts: webservers
  become: yes
  vars:
    http_port: 80
    max_clients: 200
  tasks:
    - name: Ensure Nginx is at the latest version
      apt:
        name: nginx
        state: latest
        update_cache: yes

    - name: Write Nginx Configuration
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        mode: 0644
      notify:
        - restart nginx

    - name: Ensure specific PHP 7.0 extensions are installed
      apt:
        name: "{{ item }}"
        state: present
      with_items:
        - php7.0-fpm
        - php7.0-mysql
        - php7.0-mbstring
        - php7.0-xml

  handlers:
    - name: restart nginx
      service: name=nginx state=restarted
By defining your infrastructure in YAML, your "Disaster Recovery Plan" isn't a Word document nobody reads; it's executable code.
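Running the playbook against a freshly provisioned DR node is then a one-liner. A sketch, assuming the play is saved as site.yml and your hosts live in inventory/production (both names are placeholders):
# Dry run first, then apply for real
ansible-playbook -i inventory/production site.yml --limit webservers --check
ansible-playbook -i inventory/production site.yml --limit webservers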
The Network Layer: IP Failover
The final piece is DNS. DNS propagation can take time, which kills your RTO. Use a low TTL (Time To Live) on your A-records, ideally 60 seconds. In the event of a disaster, you update your A-record to point at the IP address of the CoolVDS DR instance.
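You can verify what resolvers actually see with dig; the second field in the answer is the remaining TTL in seconds (www.example.no and the address below are stand-ins for your own records):
# Confirm the published TTL on your A-record
dig +noall +answer www.example.no A
# www.example.no.   60   IN   A   192.0.2.10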
Alternatively, if you are using a load balancer (like HAProxy), you can keep the DR node in the pool with a `backup` directive:
# haproxy.cfg
backend web_backend
    balance roundrobin
    server web01 192.168.1.10:80 check
    server web-dr 192.168.1.20:80 check backup
In this configuration, HAProxy sends traffic to `web-dr` only if `web01` fails health checks. This offers automatic failover without manual DNS intervention.
Testing: The "Scream Test"
A DR plan that hasn't been tested is a hypothesis. Schedule a maintenance window. Block port 80 on your firewall for the primary server. Watch your monitoring dashboard. Does traffic flow to the backup? Does the application connect to the slave database? If the answer is "I think so," you are not ready.
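A minimal drill might look like this, run inside the maintenance window; the firewall commands and the URL are examples, adapt them to your own stack:
# On the primary: simulate the outage by dropping inbound HTTP
iptables -I INPUT -p tcp --dport 80 -j DROP

# From outside: confirm the site still answers (HAProxy should route to web-dr)
curl -s -o /dev/null -w "%{http_code}\n" http://www.example.no/

# End the drill by removing the rule
iptables -D INPUT -p tcp --dport 80 -j DROP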
Why Infrastructure Choice Matters
Running this architecture requires underlying stability. Budget VPS providers often oversell CPU cycles. During a recovery scenario—where you are uncompressing gigabytes of logs and replaying database transactions—you need guaranteed CPU performance. Steal time (%st in top) is the enemy.
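Checking steal time takes thirty seconds, so do it before you commit to a provider. A quick sketch:
# Sample CPU usage twice; the "st" / "%st" figure should stay near zero under load
top -bn2 | grep "Cpu(s)" | tail -1
vmstat 1 5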
We engineered CoolVDS to eliminate the "noisy neighbor" problem. By strictly allocating CPU cores and utilizing pure NVMe storage arrays, we ensure that when you hit the "Recover" button, the hardware responds instantly. Don't let slow I/O kill your business when you are already vulnerable.
Secure your infrastructure today. Deploy your Disaster Recovery node on a platform that respects your data sovereignty and performance needs.