
Surviving the Blackout: A Pragmatic Disaster Recovery Guide for Norwegian Systems

The "It Won't Happen To Us" Fallacy: Engineering for Failure

It is June 2019. We like to think the Norwegian power grid is invincible and that the fiber lines running through NIX (Norwegian Internet Exchange) in Oslo are indestructible. But hardware fails. Routing tables get corrupted. And most frequently, a tired junior developer runs a destructive command on production at 16:45 on a Friday.

If your Disaster Recovery (DR) plan consists solely of "we take nightly backups," you do not have a DR plan. You have a data graveyard. In a high-transaction environment, restoring from a backup that is 23 hours old means you have lost a full day of revenue and customer trust. That is unacceptable.

I have spent the last decade cleaning up after "theoretical" outages that became very real. Today, we are going to look at how to architect a failover solution that minimizes both RPO (Recovery Point Objective) and RTO (Recovery Time Objective), using standard tools available in our current stack: CentOS 7, MySQL 5.7, and the raw speed of NVMe storage.

Defining the Metrics: RPO vs. RTO

Before touching a single config file, define your pain threshold.

  • RPO (Recovery Point Objective): How much data can you afford to lose? If you need RPO near zero, you need synchronous replication.
  • RTO (Recovery Time Objective): How long can you be offline? If you need RTO in minutes, you need hot standbys, not cold backups.

Pro Tip: Achieving 99.999% uptime is exponentially more expensive than 99.9%. Be honest about the budget. For most SMEs in Norway, a "Warm Standby" on a secondary CoolVDS instance offers the best balance of cost versus resilience.

Step 1: The Database Layer (MySQL 5.7 Replication)

Running a single database instance is a gamble you will eventually lose. We need Master-Slave replication. In this setup, your primary VPS handles writes, and the changes are asynchronously shipped to a secondary VPS (preferably in a different availability zone or data center).

Here is what a sane my.cnf looks like in 2019. Do not stick with the defaults.

The Master Configuration

On your primary node (Node A), edit /etc/my.cnf inside the [mysqld] block:

[mysqld]
server-id = 1
log_bin = /var/lib/mysql/mysql-bin.log
binlog_format = ROW
expire_logs_days = 7
max_binlog_size = 100M

# Reliability Settings
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
innodb_buffer_pool_size = 4G  # Adjust to 70% of your VPS RAM

The sync_binlog = 1 setting is crucial. It forces the binary log to disk after every transaction. Yes, it adds a slight I/O penalty, but without it a power loss can drop transactions from the binary log before they are ever shipped to the slave, leaving the replica behind the master and forcing you to rebuild it. On CoolVDS NVMe storage this penalty is negligible thanks to the high IOPS headroom.

The Slave Configuration

On the secondary node (Node B):

[mysqld]
server-id = 2
relay-log = /var/lib/mysql/mysql-relay-bin.log
read_only = 1

Setting read_only = 1 prevents accidental writes to your backup. Note that it does not stop accounts with the SUPER privilege; on MySQL 5.7 you can also set super_read_only = 1 to close that gap. I've seen applications misconfigured to write to the backup node, creating a split-brain scenario that took three days to merge by hand.
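
The configs above only prepare the two nodes; the slave still has to be pointed at the master. Below is a minimal sketch of that wiring on MySQL 5.7, assuming the primary answers on 192.168.1.10, the slave connects from 192.168.1.50, and the replication user, password, and binlog coordinates are placeholders you substitute with your own values.

# On Node A (master): create a dedicated replication account, then note the binlog coordinates
mysql -u root -p -e "CREATE USER 'repl'@'192.168.1.50' IDENTIFIED BY 'ChangeMeToAStrongPassword';"
mysql -u root -p -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.168.1.50';"
mysql -u root -p -e "SHOW MASTER STATUS;"   # note the File and Position columns

# On Node B (slave): point it at the master using the coordinates you just read
mysql -u root -p -e "CHANGE MASTER TO MASTER_HOST='192.168.1.10', MASTER_USER='repl', MASTER_PASSWORD='ChangeMeToAStrongPassword', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=154;"
mysql -u root -p -e "START SLAVE;"
mysql -u root -p -e "SHOW SLAVE STATUS\G"   # Slave_IO_Running and Slave_SQL_Running must both say Yes

If the master already holds data, seed the slave first (for example with mysqldump --single-transaction --master-data=2) before running CHANGE MASTER TO.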

Step 2: Real-Time Filesystem Synchronization

Databases are half the battle. What about user uploads, images, and config files? rsync on a cron job is the classic approach, but it leaves a gap between runs. If the cron fires hourly and the server dies at 13:59, everything uploaded since the 13:00 run is gone.

Instead, we use lsyncd (Live Syncing Daemon). It watches filesystem kernel events (inotify) and triggers a sync the moment a file changes.

Installing and Configuring Lsyncd on CentOS 7

yum install epel-release
yum install lsyncd rsync

Configure /etc/lsyncd.conf to sync your web root to the failover server:

settings {
    logfile = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd.status"
}

sync {
    default.rsyncssh,
    source = "/var/www/html",
    host = "192.168.1.50", -- Your Secondary VPS IP
    targetdir = "/var/www/html",
    delay = 1,
    rsync = {
        archive = true,
        compress = true,
        _extra = { "--omit-dir-times" }
    },
    ssh = {
        port = 22,
        identityFile = "/root/.ssh/id_rsa"
    }
}

This setup ensures that the moment a user uploads a PDF to your application, it exists on your failover server within seconds.
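
Two small prerequisites are easy to forget: the root SSH key referenced in the config must be authorized on the secondary node, and lsyncd has to survive a reboot. A quick sketch, assuming root-to-root key authentication between the two nodes:

# One-time: create the key referenced in /etc/lsyncd.conf and authorize it on Node B
ssh-keygen -t rsa -b 4096 -f /root/.ssh/id_rsa -N ""
ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.1.50

# Start the daemon and keep it enabled across reboots
mkdir -p /var/log/lsyncd
systemctl enable lsyncd
systemctl start lsyncd
tail -f /var/log/lsyncd/lsyncd.log   # the first run performs a full rsync of /var/www/html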

The Hardware Bottleneck: Why NVMe Matters for RTO

Let's discuss the physical reality of restoring data. If your replication fails and you must restore from a cold backup (a dump file), your bottleneck is Disk I/O. In 2019, many hosting providers still push SATA SSDs or even spinning rust (HDDs) for bulk storage to save costs.

Comparing restore times for a 100GB SQL Dump:

Storage Type                 | Avg. Throughput | Est. Restore Time
Traditional HDD (7200 RPM)   | 120 MB/s        | ~14 minutes (best case)
SATA SSD                     | 500 MB/s        | ~3.5 minutes
CoolVDS NVMe                 | 3,000+ MB/s     | ~30 seconds

When your CEO is breathing down your neck asking when the site will be back up, that 13-minute difference feels like a lifetime. We standardize on KVM and NVMe at CoolVDS not just for raw speed, but for exactly this recovery capability: high IOPS cuts the time you spend waiting for MySQL to import data.
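
These are back-of-the-envelope sequential figures, and a real mysql import is also CPU- and fsync-bound, so treat them as an upper bound. It is still worth measuring what your own node actually delivers before you depend on it in a crisis. A rough sketch with fio (the test file path is only an example; it targets the volume MySQL lives on, and you should delete it afterwards):

yum install -y fio

# Sequential 1M writes with direct I/O, roughly mimicking a large restore stream
fio --name=restore-sim --filename=/var/lib/mysql/fio-test.tmp \
    --rw=write --bs=1M --size=2G --direct=1 --numjobs=1 --group_reporting

rm -f /var/lib/mysql/fio-test.tmp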

Automated Health Checks: The "Pulse" Script

You need to know your primary server is down before your customers do. While services like Pingdom exist, internal monitoring is vital for triggering failover scripts.

A simple bash script on your secondary server can check the primary's health. If it fails, it can update DNS (via API) or simply alert you.

#!/bin/bash

PRIMARY_IP="192.168.1.10"
# Fail fast instead of hanging if the primary is unreachable
HTTP_STATUS=$(curl -o /dev/null -s --connect-timeout 5 --max-time 10 -w "%{http_code}" http://$PRIMARY_IP)

if [ "$HTTP_STATUS" != "200" ]; then
    echo "CRITICAL: Primary Server $PRIMARY_IP is down! Status: $HTTP_STATUS"
    # Insert logic here to switch VIP (Virtual IP) or update DNS
    # e.g., /usr/local/bin/switch_dns.sh failover
else
    echo "OK: Primary Server is alive."
fi
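
Drop the script into cron on the secondary node so it runs every minute; the script path and log file below are only examples. In production you would also require a few consecutive failures before triggering the failover action, so a single dropped request does not flip your DNS.

# /etc/cron.d/pulse-check -- poll the primary every minute as root
* * * * * root /usr/local/bin/pulse.sh >> /var/log/pulse.log 2>&1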

Data Sovereignty and The "Norwegian Fortress"

We operate under the jurisdiction of Datatilsynet. With GDPR in force since last year, keeping data within national borders is becoming a strong legal preference, if not a requirement, for many Norwegian enterprises. Moving your DR site to a cheap provider in the US or Asia might shave a little off your OpEx, but it exposes you to massive compliance risk.

Your failover plan must respect data residency. Replicating from Oslo to a secondary location within Norway (or the EEA) ensures that even in a disaster, you remain compliant. CoolVDS infrastructure is built to ensure low latency peering via NIX, meaning your replication traffic doesn't traverse the globe, keeping sync lag minimal and legal standing secure.

Conclusion

Disaster recovery is an investment in sleep. By implementing robust Master-Slave replication, utilizing lsyncd for file coherence, and ensuring your underlying infrastructure uses high-performance NVMe storage, you turn potential catastrophes into minor hiccups.

Don't wait for the inevitable hardware failure to test your theories. Spin up a secondary instance on CoolVDS today, configure your replication, and pull the plug on your test environment. See what happens. It is better to sweat in practice than to bleed in war.