Disaster Recovery in 2019: Why Your Backups Won't Save You (But RTO Will)

The 3:00 AM Reality Check

There is a fundamental lie in systems administration: "We are safe because we have backups."

Backups are just files. They are cold, static, and theoretical. Disaster Recovery (DR) is the warm, sweating reality of trying to restore 500GB of transactional data while your CEO screams about lost revenue and Datatilsynet (The Norwegian Data Protection Authority) looms in the background with GDPR fines.

In early 2018, I watched a mid-sized Oslo e-commerce shop go dark. Their host had a catastrophic RAID controller failure. They had backups, sure. But they were stored on a cheap, spinning-disk storage array with capped I/O. It took them 26 hours just to transfer the data back to the production environment, and another 8 hours to replay the MySQL binary logs. That is 34 hours of downtime.

In 2019, with the current maturity of KVM virtualization and NVMe storage, a 34-hour RTO (Recovery Time Objective) is negligence. Here is how we engineer resilience.

The Legal Imperative: GDPR Article 32

Since May last year, GDPR has changed the calculus. Article 32 explicitly mandates the "ability to restore the availability and access to personal data in a timely manner."

If you are hosting critical user data on US-controlled servers, you are relying on the Privacy Shield framework. Given the invalidation of Safe Harbor a few years back, reliance on cross-border transfers remains legally shaky. The pragmatic move for Norwegian businesses is data sovereignty: keeping primary and secondary failover nodes within Norwegian borders or the EEA, governed by Norwegian law.

The Technical Stack: RTO vs. RPO

You need to define two metrics before touching a terminal:

  • RPO (Recovery Point Objective): How much data can you lose? (e.g., 5 minutes).
  • RTO (Recovery Time Objective): How fast must you be back online? (e.g., 1 hour).

Achieving near-zero RPO requires real-time replication, not nightly tarballs.
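
To make these numbers concrete, here is a back-of-the-envelope sketch in bash; the dataset size and throughput below are placeholders, so plug in your own figures:

# Rough RTO estimate: dataset size divided by sustained restore throughput.
# Both numbers are illustrative - measure your own.
DATASET_GB=500
WRITE_MBPS=50                         # sustained write speed during restore, in MB/s
RESTORE_SECS=$(( DATASET_GB * 1024 / WRITE_MBPS ))
echo "Estimated restore time: $(( RESTORE_SECS / 3600 ))h $(( RESTORE_SECS % 3600 / 60 ))m"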

1. Database Replication (The Heartbeat)

For a standard LAMP/LEMP stack on Ubuntu 18.04, relying on a nightly `mysqldump` means an RPO of up to 24 hours, which is suicide for a high-transaction site. You need Master-Slave replication. If the Master melts, you promote the Slave.

In your `my.cnf`, ensure you are using GTID (Global Transaction Identifiers) for safer failover. Don't disable ACID compliance unless you like corruption.

[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
# Safety first - this ensures data is written to disk
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1

Pro Tip: On spinning rust (HDD), `sync_binlog=1` kills performance. On CoolVDS NVMe instances, the latency penalty is negligible, giving you data safety without the I/O wait.
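
On the Slave side, the moving parts are a unique `server-id` in its own `my.cnf` and a `CHANGE MASTER TO` statement using GTID auto-positioning. A minimal sketch, assuming MySQL 5.7 as shipped with Ubuntu 18.04; the Master host, replication user, and password are placeholders:

# Run on the Slave (its my.cnf needs a different server-id, e.g. server-id = 2).
# Master host, replication user, and password are placeholders.
mysql -e "CHANGE MASTER TO
    MASTER_HOST='10.0.0.1',
    MASTER_USER='repl',
    MASTER_PASSWORD='change-me',
    MASTER_AUTO_POSITION=1;
  START SLAVE;"

# Sanity check: both threads should report 'Yes' and the lag should stay near zero.
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'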

2. The Filesystem: Deduplicated Remote Backups

Stop using `rsync` scripts that overwrite the previous backup. If you `rsync` a corrupted file, you now have a corrupted backup. In 2019, BorgBackup is the industry standard for efficient, encrypted, authenticated backups.

Borg dedupes chunks. If you change 10MB of a 100GB file, it only pushes the changed chunks (roughly 10MB). That keeps backup runs fast and cheap enough to schedule them far more often than nightly, which is what actually shrinks your RPO window.

# Initialize a repo (do this once)
borg init --encryption=repokey user@backup-server:/var/backups/repo.borg

# Create a snapshot
borg create --stats --compression lz4 \
    user@backup-server:/var/backups/repo.borg::{hostname}-{now} \
    /var/www/html \
    /etc/nginx

Automate this via cron. It handles encryption client-side, meaning your backup provider (even if it's a secondary CoolVDS storage instance) never sees the raw data.
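
Retention belongs in the same job: without pruning, the repo grows forever. A sketch of the cron side, where the schedule, the wrapper script path, and the retention counts are all illustrative:

# crontab -e: nightly snapshot at 02:30. Script path and schedule are illustrative.
30 2 * * * /usr/local/bin/borg-backup.sh >> /var/log/borg-backup.log 2>&1

# Inside borg-backup.sh, export BORG_PASSPHRASE (repokey encryption needs it),
# run the 'borg create' from above, then prune old archives:
borg prune --keep-daily=7 --keep-weekly=4 --keep-monthly=6 \
    user@backup-server:/var/backups/repo.borg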

The Infrastructure Factor: Why Hardware Matters

Software configuration is only half the battle. The underlying hardware determines your restoration speed.

When you trigger a restore, you are essentially writing data as fast as the disk will allow. On a traditional VPS sharing a SATA SSD array with 50 other noisy neighbors, your write speeds might fluctuate between 50MB/s and 200MB/s. Restoring a 500GB dataset at 50MB/s takes nearly 3 hours.

This is why we standardized on NVMe for CoolVDS. With NVMe, we see sustained write speeds vastly exceeding SATA limits. That same 500GB restore can happen in minutes, not hours. When your boss asks why the site isn't up yet, "I/O wait" is not an acceptable excuse.
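
Don't take write throughput on faith; measure the disk you will restore onto before the disaster, not during it. A quick sequential-write test with `fio` (the test file path and size are arbitrary):

# Sequential 1MB writes with direct I/O, so the page cache doesn't flatter the result.
fio --name=restore-sim --filename=/var/tmp/fio.test --rw=write \
    --bs=1M --size=4G --direct=1 --ioengine=libaio --numjobs=1 --group_reporting
rm /var/tmp/fio.test                 # clean up the 4GB test file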

Architect's Note: We use KVM (Kernel-based Virtual Machine) exclusively. Unlike OpenVZ, where every container shares the host kernel, KVM provides true isolation. If a neighboring guest's kernel panics, your DR environment stays up. In a disaster scenario, predictability is the only currency that matters.
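
You can verify what you are actually running on from inside the guest:

# Reports the virtualization type: 'kvm' on a KVM instance, 'openvz' on OpenVZ.
systemd-detect-virt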

The "Fire Drill"

A disaster recovery plan that hasn't been tested is a hallucination. You need to run a fire drill.

  1. Spin up a fresh CoolVDS instance (takes about 55 seconds).
  2. Deploy your Ansible playbooks or Docker Compose files.
  3. Restore the database from the Slave or latest dump.
  4. Point your DNS (lower the TTL beforehand) to the new IP.

If this process requires you to manually edit 15 config files, you have failed. Automate it. Use tools like Terraform (v0.11 is solid right now) or Ansible to define the infrastructure state.
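
To make "automate it" concrete, here is the skeleton of a drill script. It is a sketch, not a finished runbook: the IP address, playbook name, and health-check path are placeholders, and it assumes Borg and the repo passphrase are available on the new host.

#!/usr/bin/env bash
# Fire-drill skeleton: rebuild, restore, and smoke-test before touching DNS.
set -euo pipefail

NEW_HOST="203.0.113.10"              # the fresh instance you just spun up
REPO="user@backup-server:/var/backups/repo.borg"

# 1. Configure the box from code, never by hand.
ansible-playbook -i "${NEW_HOST}," site.yml

# 2. Restore the most recent Borg archive onto the new host (borg extract
#    writes into the current directory, hence the 'cd /').
ssh root@"${NEW_HOST}" \
    "cd / && borg extract ${REPO}::\$(borg list --last 1 --format '{archive}' ${REPO})"

# 3. Smoke-test before any DNS change.
curl -fsS -o /dev/null "http://${NEW_HOST}/healthz" && echo "Drill OK - safe to flip DNS"

# 4. Check the remaining TTL on the live record before you flip it (second column).
dig +noall +answer www.example.com A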

Ansible Recovery Snippet

Don't configure servers by hand. Here are a few playbook tasks to ensure your web server is back exactly as it was:

- name: Ensure Nginx is installed
  apt:
    name: nginx
    state: present
    update_cache: yes

- name: Deploy Vhost Configuration
  template:
    src: templates/site.conf.j2
    dest: /etc/nginx/sites-enabled/default
  notify: restart nginx

- name: Ensure Firewall allows Web Traffic
  ufw:
    rule: allow
    port: '443'
    proto: tcp
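
One gotcha: the tasks above notify a `restart nginx` handler, and Ansible will fail the play if no handler by that name exists. If you don't already have one, a minimal definition looks like this:

handlers:
  - name: restart nginx
    service:
      name: nginx
      state: restarted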

Conclusion: The Cost of Inaction

In the Norwegian market, trust is hard to gain and easy to lose. Downtime doesn't just cost sales; it costs reputation. By leveraging modern deduplication tools, enforcing strict database consistency, and utilizing high-performance NVMe infrastructure like CoolVDS, you turn a potential catastrophe into a minor log entry.

Don't wait for the RAID card to fail. Review your RTO today.