When rm -rf Meets Production: A DevOps Survival Guide
It has been roughly three months since the infamous GitLab incident where a fatigued systems engineer accidentally deleted 300GB of live production data. The internet laughed, memes were made, and post-mortems were written. But in server rooms across Oslo and Bergen, the laughter was nervous. We all know that feeling in the pit of the stomach when a command takes a little too long to return to the prompt.
If that happened to you today, would your business survive? Not "eventually" recover, but survive the immediate 24-hour fallout?
Most Virtual Private Server (VPS) setups I audit in Norway are fragile. They rely on local RAID arrays (which are for availability, not recovery) or manual tarballs that haven't been tested since 2015. In this guide, we are going to build a Disaster Recovery (DR) plan that actually works, using tools available in standard Linux distributions (CentOS 7/Ubuntu 16.04) and leveraging the high-speed infrastructure provided by CoolVDS.
The RTO/RPO Reality Check
Before touching the terminal, you need to define two metrics, or your DR plan is just wishful thinking.
- RPO (Recovery Point Objective): How much data can you lose? If you back up nightly, your RPO is 24 hours. Can you afford to lose a day of orders?
- RTO (Recovery Time Objective): How long does it take to restore? If you have 500GB of data, restoring over a standard 100Mbps link will take ~11 hours. That is unacceptable.
Pro Tip: Calculating restore time is simple math. Data Size / Network Speed = Time. This is why CoolVDS invests heavily in 10Gbps uplinks and NVMe storage. A restore that takes 5 hours on spinning rust takes minutes on NVMe.
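A quick back-of-the-envelope check, assuming a fully saturated 100Mbps link and ignoring protocol overhead (real-world restores will be slower):
# Rough restore time for 500GB over a 100Mbps link (full utilisation assumed)
# 500 GB ~= 500 * 8 * 1000 Mbit; divide by link speed, then by 3600 for hours
echo "scale=1; (500 * 8 * 1000) / 100 / 3600" | bc
# => 11.1 (hours)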
Step 1: The "Holistic" Backup Strategy
Stop writing backups to the same filesystem. I recently saw a client storing their MySQL dumps in /var/backups on the same root partition as the database. When the filesystem became corrupted, they lost both the live data and the backups.
The 3-2-1 Rule (Norwegian Edition)
1. 3 Copies of data: Live, Local Backup, Remote Backup.
2. 2 Different media: NVMe storage (fast restore) and Object Storage/Cold Storage (archival).
3. 1 Offsite: If your server is in Oslo, your backup should be in a different datacenter, or at least a different physical availability zone.
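For the offsite copy (rule 3), the simplest pattern is to push the local backup directory to a second instance over SSH. This is a minimal sketch; the hostname, key path and destination directory are placeholders you should replace with your own DR node:
# Push local backups to an offsite instance (host, key and paths are examples)
rsync -az --delete -e "ssh -i /root/.ssh/backup_key" \
    /mnt/backups/ backup@dr-node.example.com:/srv/offsite-backups/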
Step 2: Database Consistency is King
File-level backups of running databases are corrupted backups. You must ensure transactional consistency. For MySQL 5.7 (the current standard), you need to dump with a single transaction to avoid locking tables.
Here is a robust Bash script pattern for dumping MySQL safely:
#!/bin/bash
# /opt/scripts/db_backup.sh
set -o pipefail
BACKUP_DIR="/mnt/backups/mysql"
DATE=$(date +%Y-%m-%d_%H%M)
DB_USER="backup_user"
DB_PASS="StartUsingVaultOrSimilar"
# Ensure directory exists
mkdir -p "$BACKUP_DIR"
# Dump with consistent transaction guarantees
echo "Starting backup for $DATE..."
# --single-transaction: crucial for InnoDB tables so the dump does not lock the DB
# --quick: streams rows one at a time instead of buffering the whole result set
mysqldump -u"$DB_USER" -p"$DB_PASS" --all-databases --single-transaction --quick --events --routines | gzip > "$BACKUP_DIR/db_dump_$DATE.sql.gz"
# Check exit code (with pipefail, a mysqldump failure is caught, not just gzip)
if [ $? -eq 0 ]; then
    echo "Backup Success: $BACKUP_DIR/db_dump_$DATE.sql.gz"
    # Log success to syslog
    logger -t mysql_backup "Backup successful for $DATE"
else
    echo "Backup FAILED"
    logger -s -t mysql_backup "Backup FAILED for $DATE"
    exit 1
fi
# Retention: delete local dumps older than 7 days
find "$BACKUP_DIR" -type f -name "*.sql.gz" -mtime +7 -exec rm {} \;
Don't forget to make it executable:
chmod +x /opt/scripts/db_backup.sh
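To run it unattended, schedule it with cron. The 02:30 nightly slot below is only an example; pick a window outside your traffic peak:
# /etc/cron.d/db_backup -- nightly dump at 02:30, output appended to a log
30 2 * * * root /opt/scripts/db_backup.sh >> /var/log/db_backup.log 2>&1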
Step 3: Offsite Replication (The "Nuke-Proof" Method)
Dumps are great, but restoring a 100GB dump takes time. For a much shorter RTO, use Master-Slave replication: if the Master in Oslo fails, you promote the Slave. This requires a second VPS.
On CoolVDS, latency between our nodes is negligible, making asynchronous replication highly efficient.
Master Configuration (/etc/mysql/my.cnf):
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = production_db
# Safety for durability
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
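The master also needs a dedicated replication account and an initial consistent dump to seed the slave. A rough sketch, where the user name, password and host mask are placeholders:
# On the Master: create a replication user (example credentials)
mysql -u root -p -e "CREATE USER 'repl'@'10.%' IDENTIFIED BY 'ChangeMe'; GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.%'; FLUSH PRIVILEGES;"
# Seed the slave with a consistent dump that records the binlog coordinates
mysqldump -u root -p --all-databases --single-transaction --master-data=2 > seed.sql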
Slave Configuration:
[mysqld]
server-id = 2
relay-log = /var/log/mysql/mysql-relay-bin.log
read_only = 1 # Crucial: Prevents accidental writes to the slave
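After importing the seed dump on the slave, point it at the master and start replication. The host, credentials and binlog coordinates below are illustrative; take the real values from the CHANGE MASTER TO line embedded in the dump:
# On the Slave: attach to the master (all values are examples)
mysql -u root -p -e "CHANGE MASTER TO MASTER_HOST='10.0.0.1', MASTER_USER='repl', MASTER_PASSWORD='ChangeMe', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=154; START SLAVE;"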
Once configured, verify the connection on the slave:
SHOW SLAVE STATUS \G
Look for Slave_IO_Running: Yes and Slave_SQL_Running: Yes. If Seconds_Behind_Master is consistently high, your disk I/O is the bottleneck. This is where upgrading to NVMe-backed instances becomes a necessity, not a luxury.
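You can automate that check instead of relying on someone remembering to run it. A small sketch, assuming credentials live in ~/.my.cnf and using an arbitrary example threshold of 60 seconds:
#!/bin/bash
# /opt/scripts/check_replication.sh
# Warn via syslog when the slave lags or replication has stopped
LAG=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
LAG=${LAG:-NULL}
if [ "$LAG" = "NULL" ] || [ "$LAG" -gt 60 ]; then
    logger -s -t mysql_replication "Replication problem: Seconds_Behind_Master is $LAG"
fi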
Step 4: Infrastructure as Code (Ansible)
It is 2017. If you are configuring servers by hand, you are doing it wrong. Manual configuration drifts. When disaster strikes, you won't remember which PHP modules you installed three years ago.
Here is a snippet of an Ansible playbook to provision a recovery web server instantly. This ensures your recovery environment is identical to production.
---
- hosts: dr_site
  become: yes
  vars:
    http_port: 80
    max_clients: 200
  tasks:
    - name: Install Nginx and PHP-FPM
      apt:
        name: "{{ item }}"
        state: present
        update_cache: yes
      with_items:
        - nginx
        - php7.0-fpm
        - php7.0-mysql
    - name: Configure Nginx Site
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/sites-available/default
      notify:
        - restart nginx
    - name: Ensure Nginx is running
      service:
        name: nginx
        state: started
        enabled: yes
  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted
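Running it against a freshly provisioned recovery node is then a single command. The inventory path and playbook filename here are examples from my own layout:
# Provision the DR web server; dr_site must exist in your inventory
ansible-playbook -i inventory/dr_hosts dr_webserver.yml --limit dr_site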
The Legal Angle: Datatilsynet and GDPR
We are one year away from GDPR enforcement (May 2018). The "wait and see" approach is dangerous. Article 32 specifically mandates the "ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident."
Hosting outside of the EEA (European Economic Area) creates headaches regarding Privacy Shield (which is already under scrutiny). Keeping your primary and backup data within Norwegian or Northern European borders simplifies compliance significantly. CoolVDS data centers are strictly governed by local laws, ensuring that your disaster recovery plan doesn't turn into a legal disaster.
Testing: The Step You Will Likely Skip
A backup is not a backup until you have restored it. I recommend a "Fire Drill" once a quarter:
- Spin up a fresh CoolVDS instance (takes ~55 seconds).
- Run your Ansible playbook to configure the environment.
- Download your latest backup .sql.gz file.
- Import it: zcat db_dump.sql.gz | mysql -u user -p database.
- Check the integrity of the data (a quick sanity check is sketched below).
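For that last step, don't just eyeball the import. One pragmatic approach (the database and table names below are placeholders) is to let MySQL check the tables and compare a business-critical row count against the figure you expect from production:
# Verify table structures on the restored instance
mysqlcheck -u root -p --all-databases --check
# Spot-check a critical table against the production row count
mysql -u root -p -e "SELECT COUNT(*) FROM production_db.orders;"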
If you rely on KVM snapshots (which we support), verify that the filesystem is clean upon boot. fsck is your friend here.
Comparison: Storage Technologies for DR
| Feature | HDD (SATA) | SSD (SATA) | NVMe (CoolVDS Standard) |
|---|---|---|---|
| Throughput | ~150 MB/s | ~550 MB/s | ~3,500 MB/s |
| Restore Time (100GB) | ~12 Minutes | ~3 Minutes | < 1 Minute |
| IOPS | 80-100 | 5,000+ | 20,000+ |
When your shop is offline, every second is lost revenue. The cost difference between HDD and NVMe is trivial compared to the cost of downtime.
Final Thoughts
Disaster recovery isn't about pessimism; it's about professionalism. Hardware fails. Humans make mistakes. Ransomware is becoming an industrial sector of its own. Your job is to ensure that a total catastrophe is nothing more than a scheduled maintenance window in the logs.
Don't wait for the next GitLab-style headline to be about you. Audit your backup scripts today, verify your checksums, and if your current host is still spinning rusty platters, it's time to move.
Need a sandbox to test your Ansible recovery scripts? Deploy a high-performance NVMe instance on CoolVDS in under a minute and sleep better tonight.