Database Sharding: When Vertical Scaling Hits the Wall
Let’s be honest: nobody wants to shard a database. It turns a clean, single source of truth into a distributed nightmare of consistency checks, rebalancing acts, and application logic updates. I have seen perfectly good dev teams burn out trying to implement a sharding layer prematurely.
But there comes a moment in every high-growth platform's life when the biggest NVMe instance money can buy isn't enough. Your iowait is creeping up, your innodb_buffer_pool_size is maxed out at 80% of physical RAM, and vertical scaling has hit the physical ceiling. If you are running a high-traffic SaaS targeting the Nordic market, latency spikes are the enemy of retention.
This is not a guide on how to install MySQL. This is a survival guide for when your write-heavy workload exceeds the capacity of a single master node.
The "Oh Sh*t" Moment: Identifying the Bottleneck
Before you slice your data, you must prove you need to. I frequently audit systems where developers claim they need sharding, but they are really just suffering from bad indexing or slow I/O. In 2020, hardware is incredibly capable. If you aren't running on pure NVMe storage, stop reading and migrate your storage first.
However, if you are seeing the following metrics, it is time to talk sharding:
- Write Saturation: Your primary master is locked at 100% CPU due to write operations, not reads. Read replicas can't fix this.
- Working Set Size: Your active dataset exceeds the maximum available RAM of a single node (e.g., >512GB), forcing disk seeks for hot data.
- Replication Lag: The single-threaded replication SQL thread on your replicas cannot keep up with the master's write volume (a quick diagnostic sketch follows this list).
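If you want a repeatable way to watch those signals instead of eyeballing dashboards, the rough sketch below polls the relevant counters on each replica. Treat it as an illustration only: it assumes the pymysql driver, placeholder hosts and credentials, and a replica that still reports lag via SHOW SLAVE STATUS.

import pymysql

def check_node(host, user, password):
    """Print replication lag and a rough write counter for one node (illustrative only)."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            # Replication lag: Seconds_Behind_Master is NULL if the SQL thread is stopped
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            if status:
                print(f"{host}: Seconds_Behind_Master = {status['Seconds_Behind_Master']}")
            # Write pressure: a crude proxy via InnoDB row-write counters
            cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_rows_inserted'")
            print(f"{host}: {cur.fetchone()}")
    finally:
        conn.close()

# Hypothetical replica list -- replace with your own hosts and credentials
for replica in ("10.0.0.21", "10.0.0.22"):
    check_node(replica, "monitor", "secret")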
Pro Tip: Before sharding, attempt Functional Partitioning. Move your logs, audit_trails, and sessions tables to a completely separate database instance. This often reduces the main dataset size by 40% without the complexity of logic-based sharding.
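At the application layer, functional partitioning can be as dull as a routing table. The sketch below sends a handful of high-churn tables to a secondary instance while everything else stays on the primary; the DSNs, table list, and helper name are hypothetical, so adapt them to your own data-access layer.

# Hypothetical DSNs -- the point is the routing decision, not the driver
PRIMARY_DSN = "mysql://app@10.0.0.5/main"
SECONDARY_DSN = "mysql://app@10.0.0.9/bulk"

# Tables that generate volume but rarely join against core business data
OFFLOADED_TABLES = {"logs", "audit_trails", "sessions"}

def dsn_for_table(table_name: str) -> str:
    """Pick the instance that owns this table."""
    return SECONDARY_DSN if table_name in OFFLOADED_TABLES else PRIMARY_DSN

print(dsn_for_table("audit_trails"))  # -> mysql://app@10.0.0.9/bulk
print(dsn_for_table("orders"))        # -> mysql://app@10.0.0.5/main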
Sharding Strategies: The Architecture
If you have exhausted optimization and functional partitioning, you have two primary paths: Key-Based (Hash) Sharding and Directory-Based Sharding.
1. Key-Based (Hash) Sharding
This is the most common approach at massive scale. You take a value (like user_id), apply a hash function, and use the modulo of the result to determine which server holds the data. This gives a roughly uniform distribution of data across shards.
The fatal flaw: Resharding. If you go from 10 nodes to 11, the modulo changes, and you have to migrate nearly all your data. To solve this, we use Consistent Hashing.
Here is a simplified Python representation of how a router might determine the correct shard using a static hash; in practice this logic lives in middleware like ProxySQL or in custom application code:
import zlib

def get_shard_id(user_id, total_shards):
    """Map a user_id to a shard number using CRC32 and modulo."""
    # CRC32 provides a good distribution for this use case
    hash_val = zlib.crc32(str(user_id).encode('utf-8'))
    return hash_val % total_shards

# Example usage
user_id = 49201
total_shards = 4
shard_id = get_shard_id(user_id, total_shards)
print(f"User {user_id} belongs to Shard-{shard_id}")
2. Directory-Based (Lookup) Sharding
You maintain a separate "Lookup Service" that maps a partition key to a specific database instance. This is flexible—you can move heavy users to their own dedicated hardware without moving everyone else. However, the lookup table itself becomes a single point of failure and a performance bottleneck. It requires aggressive caching (Redis/Memcached).
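A bare-bones version of such a lookup service is sketched below. The in-process dict stands in for whatever cache you put in front of the directory (Redis, Memcached), and the load_mapping_from_db helper is hypothetical; in reality it would query the actual directory table.

# Hypothetical directory lookup: partition key -> shard DSN
def load_mapping_from_db(user_id):
    """Placeholder for a query against the directory table (the single source of truth)."""
    # e.g. SELECT shard_dsn FROM shard_directory WHERE user_id = %s
    return "mysql://10.0.0.6:3306/app" if user_id % 2 else "mysql://10.0.0.5:3306/app"

_cache = {}  # stands in for Redis/Memcached; cold lookups hit the directory DB

def lookup_shard(user_id):
    if user_id not in _cache:
        _cache[user_id] = load_mapping_from_db(user_id)
    return _cache[user_id]

# A "heavy" tenant can later be moved to dedicated hardware by updating
# one directory row and invalidating its cache entry -- nobody else moves.
print(lookup_shard(49201))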
The Infrastructure Reality: Latency is King
When you shard, you introduce network overhead. A query that used to be a local socket connection might now require cross-talk between a microservice, a router, and a database node. If your servers are scattered across cheap, oversold VPS providers with poor peering, your application will crawl.
This is where physical location matters. For applications serving Norway and Northern Europe, data sovereignty and latency go hand-in-hand. You want your application servers and your database shards in the same datacenter or connected via a high-speed private LAN.
At CoolVDS, we see this often. Clients deploy KVM instances for shards because they need guaranteed CPU cycles. In a sharded environment, a "noisy neighbor" stealing CPU cycles on Shard #3 causes the entire request to hang, waiting for that one node to respond. Consistency in performance is more important than raw burst speed.
Optimizing the Node Config
Regardless of the strategy, every shard needs to be tuned as if it were a standalone master. Here is a battle-tested my.cnf snippet for a write-heavy shard node on a system with 32GB RAM and NVMe storage:
[mysqld]
# Basic settings
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
# SAFETY
max_allowed_packet = 64M
skip-name-resolve
# INNODB SPECIFIC
# Set to 70-80% of TOTAL RAM
innodb_buffer_pool_size = 24G
innodb_log_file_size = 2G
innodb_file_per_table = 1
innodb_flush_method = O_DIRECT
# NVMe Optimization (High IOPS)
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
innodb_flush_neighbors = 0
# ACID vs Performance Trade-off
# 1 = Safest, 2 = Faster (risk of 1s data loss on OS crash)
innodb_flush_log_at_trx_commit = 1
Routing Traffic with ProxySQL
In 2020, hardcoding shard logic into your PHP or Node.js application is technical debt. Use middleware. ProxySQL is the standard for a reason: it sits between your app and your database, handling query routing invisibly.
Here is how you configure ProxySQL to split traffic based on user IDs (sharding) using query rules. This assumes you have sharding keys in comments or specific SQL structures.
-- Define your Hostgroups (HG)
-- HG 10: Shard 0
-- HG 11: Shard 1
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10, '10.0.0.5', 3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (11, '10.0.0.6', 3306);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
-- Routing Rules based on sharding key comment
-- Example query: SELECT * FROM orders /* shard_id:0 */ WHERE user_id=1;
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (1, 1, 'shard_id:0', 10, 1);
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (2, 1, 'shard_id:1', 11, 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
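On the application side, something still has to compute the shard and inject the shard_id comment those rules match on. Here is a hedged sketch of that step, reusing the CRC32-modulo routing from earlier; the table and column names are placeholders.

import zlib

def get_shard_id(user_id, total_shards):
    # Same CRC32-modulo routing as in the earlier example
    return zlib.crc32(str(user_id).encode("utf-8")) % total_shards

def tag_query_for_shard(sql, user_id, total_shards=2):
    """Append the shard hint that the ProxySQL rules above route on."""
    shard_id = get_shard_id(user_id, total_shards)
    return f"{sql} /* shard_id:{shard_id} */"

print(tag_query_for_shard("SELECT * FROM orders WHERE user_id = 49201", user_id=49201))
# Prints the query with /* shard_id:0 */ or /* shard_id:1 */ appended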
Data Sovereignty and The Norwegian Context
We are operating in a strict regulatory environment. The GDPR is in full swing, and the scrutiny on data transfers to the US is intensifying. Datatilsynet (The Norwegian Data Protection Authority) does not look kindly on nebulous cloud structures where you don't know exactly where the physical disk resides.
When you shard, you increase your attack surface and your compliance complexity. If Shard 1 is in Frankfurt and Shard 2 is in Oslo, you are managing cross-border data flows. Keeping your entire sharded cluster within a single jurisdiction—specifically Norway for local entities—simplifies compliance significantly. It also eliminates the millisecond penalties of cross-border routing.
Summary: The Checklist
Before you deploy your first shard, verify this list:
- Hardware: Are you on NVMe? Spinning rust (HDD) has no place in a sharded setup.
- Network: Is the latency between your app servers and DB nodes <1ms?
- Key Selection: Have you chosen a sharding key that prevents "hotspots"? (e.g., Don't use a timestamp, or all writes will go to the newest shard).
- Backup Strategy: How do you take a consistent snapshot across four distinct nodes simultaneously? (Hint: you need scripted LVM snapshots or stopped-slave backups; a rough orchestration sketch follows this list.)
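One way to line that up is to briefly quiesce writes on every shard, record the binlog coordinates, fire an LVM snapshot on each node, then release the locks. The sketch below is deliberately simplified and built on assumptions: pymysql as the client, passwordless root SSH to each node, a volume group named vg0 with the datadir on lv_mysql, and only minimal error handling.

import subprocess
import pymysql

SHARDS = ["10.0.0.5", "10.0.0.6"]  # hypothetical shard IPs

def snapshot_all_shards(user, password):
    conns = []
    try:
        # Phase 1: quiesce writes on every shard and record binlog positions
        for host in SHARDS:
            conn = pymysql.connect(host=host, user=user, password=password)
            cur = conn.cursor()
            cur.execute("FLUSH TABLES WITH READ LOCK")
            cur.execute("SHOW MASTER STATUS")
            print(host, cur.fetchone())  # keep these coordinates with the backup
            conns.append(conn)

        # Phase 2: take the LVM snapshots while all shards are still locked
        for host in SHARDS:
            subprocess.run(
                ["ssh", host, "lvcreate", "--snapshot", "--size", "10G",
                 "--name", "mysql_snap", "/dev/vg0/lv_mysql"],
                check=True,
            )
    finally:
        # Phase 3: release the locks as soon as the snapshots exist
        for conn in conns:
            conn.cursor().execute("UNLOCK TABLES")
            conn.close()

snapshot_all_shards("backup", "secret")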
Sharding is complex, but on the right infrastructure, it is the only way to scale to millions of users. If you need a sandbox to test your sharding logic with zero hardware contention, spin up a high-performance KVM instance.
Don't let storage bottlenecks kill your growth. Deploy a test instance on CoolVDS in 55 seconds and see what true NVMe throughput looks like.