When rm -rf Meets Production: A DevOps Survival Guide
It has been roughly three months since the infamous GitLab incident where a fatigued systems engineer accidentally deleted 300GB of live production data. The internet laughed, memes were made, and post-mortems were written. But in server rooms across Oslo and Bergen, the laughter was nervous. We all know that feeling in the pit of the stomach when a command takes a little too long to return to the prompt.
If that happened to you today, would your business survive? Not "eventually" recover, but survive the immediate 24-hour fallout?
Most Virtual Private Server (VPS) setups I audit in Norway are fragile. They rely on local RAID arrays (which are for availability, not recovery) or manual tarballs that haven't been tested since 2015. In this guide, we are going to build a Disaster Recovery (DR) plan that actually works, using tools available in standard Linux distributions (CentOS 7/Ubuntu 16.04) and leveraging the high-speed infrastructure provided by CoolVDS.
The RTO/RPO Reality Check
Before touching the terminal, you need to define two metrics, or your DR plan is just wishful thinking.
- RPO (Recovery Point Objective): How much data can you lose? If you back up nightly, your RPO is 24 hours. Can you afford to lose a day of orders?
- RTO (Recovery Time Objective): How long does it take to restore? If you have 500GB of data, restoring over a standard 100Mbps link will take ~11 hours. That is unacceptable.
Pro Tip: Calculating restore time is simple math. Data Size / Network Speed = Time. This is why CoolVDS invests heavily in 10Gbps uplinks and NVMe storage. A restore that takes 5 hours on spinning rust takes minutes on NVMe.
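A quick back-of-the-envelope check, assuming a fully saturated 100Mbps link and ignoring protocol overhead (real-world restores will be slower):
# Rough restore time for 500GB over a 100Mbps link (full utilisation assumed)
# 500 GB ~= 500 * 8 * 1000 Mbit; divide by link speed, then by 3600 for hours
echo "scale=1; (500 * 8 * 1000) / 100 / 3600" | bc
# => 11.1 (hours)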
Step 1: The "Holistic" Backup Strategy
Stop writing backups to the same filesystem. I recently saw a client storing their MySQL dumps in /var/backups on the same root partition as the database. When the filesystem became corrupted, they lost both the live data and the backups.
The 3-2-1 Rule (Norwegian Edition)
1. 3 Copies of data: Live, Local Backup, Remote Backup.
2. 2 Different media: NVMe storage (fast restore) and Object Storage/Cold Storage (archival).
3. 1 Offsite: If your server is in Oslo, your backup should be in a different datacenter, or at least a different physical availability zone.
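For the offsite copy (rule 3), the simplest pattern is to push the local backup directory to a second instance over SSH. This is a minimal sketch; the hostname, key path and destination directory are placeholders you should replace with your own DR node:
# Push local backups to an offsite instance (host, key and paths are examples)
rsync -az --delete -e "ssh -i /root/.ssh/backup_key" \
    /mnt/backups/ backup@dr-node.example.com:/srv/offsite-backups/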
Step 2: Database Consistency is King
File-level backups of running databases are corrupted backups. You must ensure transactional consistency. For MySQL 5.7 (the current standard), you need to dump with a single transaction to avoid locking tables.
Here is a robust Bash script pattern for dumping MySQL safely:
#!/bin/bash
# /opt/scripts/db_backup.sh
set -o pipefail
BACKUP_DIR="/mnt/backups/mysql"
DATE=$(date +%Y-%m-%d_%H%M)
DB_USER="backup_user"
DB_PASS="StartUsingVaultOrSimilar"
# Ensure directory exists
mkdir -p "$BACKUP_DIR"
# Dump with consistent transaction guarantees
echo "Starting backup for $DATE..."
# --single-transaction: crucial for InnoDB tables so the dump does not lock the DB
# --quick: streams rows one at a time instead of buffering the whole result set
mysqldump -u"$DB_USER" -p"$DB_PASS" --all-databases --single-transaction --quick --events --routines | gzip > "$BACKUP_DIR/db_dump_$DATE.sql.gz"
# Check exit code (with pipefail, a mysqldump failure is caught, not just gzip)
if [ $? -eq 0 ]; then
    echo "Backup Success: $BACKUP_DIR/db_dump_$DATE.sql.gz"
    # Log success to syslog
    logger -t mysql_backup "Backup successful for $DATE"
else
    echo "Backup FAILED"
    logger -s -t mysql_backup "Backup FAILED for $DATE"
    exit 1
fi
# Retention: delete local dumps older than 7 days
find "$BACKUP_DIR" -type f -name "*.sql.gz" -mtime +7 -exec rm {} \;
Don't forget to make it executable:
chmod +x /opt/scripts/db_backup.sh
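To run it unattended, schedule it with cron. The 02:30 nightly slot below is only an example; pick a window outside your traffic peak:
# /etc/cron.d/db_backup -- nightly dump at 02:30, output appended to a log
30 2 * * * root /opt/scripts/db_backup.sh >> /var/log/db_backup.log 2>&1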
Step 3: Offsite Replication (The "Nuke-Proof" Method)
Dumps are great, but restoring a 100GB dump takes time. For a much shorter RTO, use Master-Slave replication: if the Master in Oslo fails, you promote the Slave. This requires a second VPS.
On CoolVDS, latency between our nodes is negligible, making asynchronous replication highly efficient.
Master Configuration (/etc/mysql/my.cnf):
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = production_db
# Safety for durability
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
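The master also needs a dedicated replication account and an initial consistent dump to seed the slave. A rough sketch, where the user name, password and host mask are placeholders:
# On the Master: create a replication user (example credentials)
mysql -u root -p -e "CREATE USER 'repl'@'10.%' IDENTIFIED BY 'ChangeMe'; GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.%'; FLUSH PRIVILEGES;"
# Seed the slave with a consistent dump that records the binlog coordinates
mysqldump -u root -p --all-databases --single-transaction --master-data=2 > seed.sql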
Slave Configuration:
[mysqld]
server-id = 2
relay-log = /var/log/mysql/mysql-relay-bin.log
read_only = 1 # Crucial: Prevents accidental writes to the slave
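After importing the seed dump on the slave, point it at the master and start replication. The host, credentials and binlog coordinates below are illustrative; take the real values from the CHANGE MASTER TO line embedded in the dump:
# On the Slave: attach to the master (all values are examples)
mysql -u root -p -e "CHANGE MASTER TO MASTER_HOST='10.0.0.1', MASTER_USER='repl', MASTER_PASSWORD='ChangeMe', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=154; START SLAVE;"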
Once configured, verify the connection on the slave:
SHOW SLAVE STATUS \G
Look for Slave_IO_Running: Yes and Slave_SQL_Running: Yes. If Seconds_Behind_Master is consistently high, your disk I/O is the bottleneck. This is where upgrading to NVMe-backed instances becomes a necessity, not a luxury.
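You can automate that check instead of relying on someone remembering to run it. A small sketch, assuming credentials live in ~/.my.cnf and using an arbitrary example threshold of 60 seconds:
#!/bin/bash
# /opt/scripts/check_replication.sh
# Warn via syslog when the slave lags or replication has stopped
LAG=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
LAG=${LAG:-NULL}
if [ "$LAG" = "NULL" ] || [ "$LAG" -gt 60 ]; then
    logger -s -t mysql_replication "Replication problem: Seconds_Behind_Master is $LAG"
fi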
Step 4: Infrastructure as Code (Ansible)
It is 2017. If you are configuring servers by hand, you are doing it wrong. Manual configuration drifts. When disaster strikes, you won't remember which PHP modules you installed three years ago.
Here is a snippet of an Ansible playbook to provision a recovery web server instantly. This ensures your recovery environment is identical to production.
---
- hosts: dr_site
  become: yes
  vars:
    http_port: 80
    max_clients: 200
  tasks:
    - name: Install Nginx and PHP-FPM
      apt:
        name: "{{ item }}"
        state: present
        update_cache: yes
      with_items:
        - nginx
        - php7.0-fpm
        - php7.0-mysql
    - name: Configure Nginx Site
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/sites-available/default
      notify:
        - restart nginx
    - name: Ensure Nginx is running
      service:
        name: nginx
        state: started
        enabled: yes
  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted
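Running it against a freshly provisioned recovery node is then a single command. The inventory path and playbook filename here are examples from my own layout:
# Provision the DR web server; dr_site must exist in your inventory
ansible-playbook -i inventory/dr_hosts dr_webserver.yml --limit dr_site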
The Legal Angle: Datatilsynet and GDPR
We are one year away from GDPR enforcement (May 2018). The "wait and see" approach is dangerous. Article 32 specifically mandates the "ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident."
Hosting outside of the EEA (European Economic Area) creates headaches regarding Privacy Shield (which is already under scrutiny). Keeping your primary and backup data within Norwegian or Northern European borders simplifies compliance significantly. CoolVDS data centers are strictly governed by local laws, ensuring that your disaster recovery plan doesn't turn into a legal disaster.
Testing: The Step You Will Likely Skip
A backup is not a backup until you have restored it. I recommend a "Fire Drill" once a quarter:
- Spin up a fresh CoolVDS instance (takes ~55 seconds).
- Run your Ansible playbook to configure the environment.
- Download your latest backup .sql.gz file.
- Import it: zcat db_dump.sql.gz | mysql -u user -p database.
- Check the integrity of the data (a quick sanity check is sketched below).
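For that last step, don't just eyeball the import. One pragmatic approach (the database and table names below are placeholders) is to let MySQL check the tables and compare a business-critical row count against the figure you expect from production:
# Verify table structures on the restored instance
mysqlcheck -u root -p --all-databases --check
# Spot-check a critical table against the production row count
mysql -u root -p -e "SELECT COUNT(*) FROM production_db.orders;"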
If you rely on KVM snapshots (which we support), verify that the filesystem is clean upon boot. fsck is your friend here.
Comparison: Storage Technologies for DR
| Feature | HDD (SATA) | SSD (SATA) | NVMe (CoolVDS Standard) |
|---|---|---|---|
| Throughput | ~150 MB/s | ~550 MB/s | ~3,500 MB/s |
| Restore Time (100GB) | ~12 Minutes | ~3 Minutes | < 1 Minute |
| IOPS | 80-100 | 5,000+ | 20,000+ |
When your shop is offline, every second is lost revenue. The cost difference between HDD and NVMe is trivial compared to the cost of downtime.
Final Thoughts
Disaster recovery isn't about pessimism; it's about professionalism. Hardware fails. Humans make mistakes. Ransomware is becoming an industrial sector of its own. Your job is to ensure that a total catastrophe is nothing more than a scheduled maintenance window in the logs.
Don't wait for the next GitLab-style headline to be about you. Audit your backup scripts today, verify your checksums, and if your current host is still spinning rusty platters, it's time to move.
Need a sandbox to test your Ansible recovery scripts? Deploy a high-performance NVMe instance on CoolVDS in under a minute and sleep better tonight.