Database Sharding: The Nuclear Option for Scaling (And How to Survive It)
Let’s be honest: if you are reading this, you are probably in trouble. Your master database is hitting 90% CPU during peak hours, your I/O wait is creeping up, and you’ve already optimized every query that can be optimized. You are staring at the terrifying precipice of horizontal scaling.
Sharding is not a feature; it is a headache. It breaks referential integrity, complicates backups, and makes analytics a nightmare. Yet, in 2018, as SaaS platforms in the Nordics grow from millions of rows into the billions, it is often the only path left.
I have spent the last week debugging a failed sharding migration for a client in Oslo. They tried to shard manually in their application layer and ended up with data inconsistencies that took 48 hours to reconcile. We are going to fix that. We will look at how to shard correctly using MySQL 5.7 (or the new 8.0 if you are brave), ProxySQL, and raw infrastructure logic.
Step 0: Do You Actually Need to Shard?
Before you slice your data into pieces, ask yourself if you have actually maxed out vertical scaling. Most "performance issues" I see are actually I/O bottlenecks disguised as CPU problems.
In a traditional HDD or standard SSD environment, random read/write operations kill performance. However, with the NVMe adoption we are seeing now, the ceiling is much higher. A properly tuned KVM instance with direct access to NVMe storage can handle terabytes of data before choking.
Pro Tip: Check your iowait (the %wa column in top, or iostat -x 1). If it is over 10% but your CPU user time is low, do not shard. Upgrade your storage. On CoolVDS NVMe instances, we typically see iowait stay below 1% even under heavy load because we don't throttle IOPS like budget providers. Maximize the hardware first.
The Configuration Sanity Check
Before splitting your database, ensure your my.cnf (or my.ini) isn't strangling your server. Here is the baseline config I use for 32GB RAM instances running MySQL 5.7:
[mysqld]
# 70-80% of Total RAM for Dedicated DB Server
innodb_buffer_pool_size = 24G
# Essential for heavy write workloads to prevent checkpointing lags
innodb_log_file_size = 2G
innodb_log_buffer_size = 64M
# If you are on SSD/NVMe, tell InnoDB it can push harder
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
innodb_flush_neighbors = 0
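# Worth adding on Linux with NVMe: bypass the OS page cache, since InnoDB buffers data itself
innodb_flush_method = O_DIRECT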
# Per-thread buffers - keep these conservative to avoid OOM
sort_buffer_size = 4M
join_buffer_size = 4M
If you have applied this and you are still locking up, welcome to sharding.
Strategy 1: Directory-Based Sharding (The Flexible Approach)
The simplest way to shard is to run a "Lookup Service": a small, highly available database that tells your application where each user's data lives.
Architecture:
- Lookup DB: Contains the user_id -> shard_id map.
- Shard A (db-shard-01): Users 1, 4, 7...
- Shard B (db-shard-02): Users 2, 5, 8...
This allows you to move users between shards easily. With GDPR enforcement starting next month (May 25th), this is critical. You might need to move a German user to a Frankfurt shard and keep a Norwegian user on a CoolVDS instance in Oslo to satisfy data residency requirements or simply to minimize latency.
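A minimal sketch of the lookup table, with the schema kept illustrative; the shard_host column is named to match the query the PHP code below runs against it:

CREATE TABLE user_shards (
    user_id    BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    shard_id   TINYINT UNSIGNED NOT NULL,
    shard_host VARCHAR(64) NOT NULL,   -- e.g. '10.0.0.5'
    KEY idx_shard_id (shard_id)        -- lets you enumerate a whole shard when rebalancing
) ENGINE=InnoDB;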
PHP Implementation Logic
Here is a simplified logic flow for how this looks in your backend code:
class ShardManager {
    private $lookupDb;
    private $cache;          // any cache wrapper exposing get()/set() (Redis/Memcached)
    private $connections = [];

    public function __construct(PDO $lookupDb, $cache) {
        $this->lookupDb = $lookupDb;
        $this->cache    = $cache;
    }

    public function getShardConnection($userId) {
        // 1. Check the cache first (Redis/Memcached)
        $shardHost = $this->cache->get("user_shard_" . $userId);
        if (!$shardHost) {
            // 2. Fall back to the Lookup DB
            $stmt = $this->lookupDb->prepare("SELECT shard_host FROM user_shards WHERE user_id = ?");
            $stmt->execute([$userId]);
            $shardHost = $stmt->fetchColumn();
            if ($shardHost === false) {
                throw new RuntimeException("No shard mapping for user {$userId}");
            }
            // Cache this. You don't want to hit the lookup DB on every request.
            $this->cache->set("user_shard_" . $userId, $shardHost, 3600);
        }
        // 3. Connect to the specific shard (reuse one handle per host)
        if (!isset($this->connections[$shardHost])) {
            $this->connections[$shardHost] = new PDO(
                "mysql:host={$shardHost};dbname=app_db", 'user', 'pass',
                [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
            );
        }
        return $this->connections[$shardHost];
    }
}
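Calling code stays short. A quick usage sketch, assuming the lookup PDO handle and a cache wrapper are already configured:

$shards = new ShardManager($lookupPdo, $cache);
$db = $shards->getShardConnection(42);
$stmt = $db->prepare("SELECT * FROM orders WHERE user_id = ?");
$stmt->execute([42]);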
The Risk: The Lookup DB becomes your Single Point of Failure (SPOF). You must replicate it and cache heavily.
Strategy 2: ProxySQL Query Routing (The Transparent Approach)
Application-level sharding makes your code messy. Developers hate it. A cleaner approach in 2018 is using ProxySQL. It sits between your app and your databases, inspecting SQL traffic and routing it based on rules.
This allows your application to connect to 127.0.0.1:6033 thinking it is one big database, while ProxySQL distributes the load.
Configuring ProxySQL for Sharding
Let's assume we are sharding based on user_id. All routing goes through mysql_query_rules, which match regular expressions against the incoming SQL, so you can key rules off query comments or off patterns in the statement text itself.
First, define your hostgroups (shards):
-- Admin interface of ProxySQL
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10, '10.0.0.5', 3306); -- Shard 1
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (20, '10.0.0.6', 3306); -- Shard 2
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
Now, the tricky part. If your app sends SELECT * FROM users WHERE id = 500, ProxySQL needs to know where ID 500 lives. Keep in mind that match_pattern is a regular expression tested against the query text: ProxySQL will not compute id % 2 for you. For two shards, you can approximate modulo by matching the last digit of the literal, and you must load the rules to runtime for them to take effect:
-- Rule for Shard 1 (odd IDs: the literal ends in an odd digit)
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (1, 1, 'WHERE id\s*=\s*\d*[13579]\b', 10, 1);
-- Rule for Shard 2 (even IDs: the literal ends in an even digit)
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (2, 1, 'WHERE id\s*=\s*\d*[02468]\b', 20, 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
Note: Modulo sharding is rigid. Adding a third shard later means rewriting every rule and rebalancing all existing data. Regex routing also only works while the literal value is visible in the SQL: server-side prepared statements send WHERE id = ?, which matches nothing (PDO's default prepare emulation keeps literals inline, which is the only reason this works at all). Use with caution.
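If regex gymnastics on literals feels fragile, that instinct is right. A cleaner alternative, described in ProxySQL sharding write-ups, is to have the application append a routing comment and match on that instead. The shard=N tag below is purely our own convention, and this assumes your driver passes comments through rather than stripping them:

-- The app sends: SELECT /* shard=2 */ * FROM users WHERE id = 500
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (3, 1, 'shard=1', 10, 1);
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (4, 1, 'shard=2', 20, 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;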
The Latency Factor: Why Geography Matters
When you shard, you increase network chatter. If your web server is in Oslo, your Lookup DB is in London, and your Shard is in Stockholm, your application will crawl. Light speed is finite.
For Norwegian businesses, the goal should be keeping the data packet inside the country or at least on the Scandinavian ring. This is why we engineered the CoolVDS network to peer directly at NIX (Norwegian Internet Exchange). Ping times from a CoolVDS instance to major Norwegian ISPs are typically under 2ms.
| Route | Latency (Avg) | Impact on a Sharded Query Chain (10 sequential round trips) |
|---|---|---|
| Oslo Local (CoolVDS) -> Oslo Local | < 0.5ms | Negligible |
| Oslo -> Frankfurt | ~18ms | ~180ms lag (Noticeable) |
| Oslo -> US East | ~90ms | ~900ms lag (Unusable) |
MySQL Partitioning: The "Soft Shard"
If you aren't ready for multiple servers, use MySQL Partitioning. It splits the table into multiple files on the disk, managed by a single MySQL instance. It is great for time-series data (logs, transactions).
CREATE TABLE logs (
id INT,
log_date DATE,
message VARCHAR(255)
)
PARTITION BY RANGE (YEAR(log_date)) (
PARTITION p2016 VALUES LESS THAN (2017),
PARTITION p2017 VALUES LESS THAN (2018),
PARTITION p2018 VALUES LESS THAN (2019)
);
This allows you to drop old data instantly with ALTER TABLE logs DROP PARTITION p2016; rather than running expensive DELETE queries that lock the table.
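One caveat: MySQL rejects an INSERT whose log_date falls outside every defined range, so add next year's partition before the calendar catches you out:

-- Run this before 2019 arrives, or January's inserts will start failing
ALTER TABLE logs ADD PARTITION (PARTITION p2019 VALUES LESS THAN (2020));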
Survival Guide Checklist
- Consistent Hashing: If you hash keys to shards, use a consistent hashing ring rather than plain modulo; adding a node then remaps only about 1/N of the keys instead of nearly all of them (see the sketch after this list).
- UUIDs vs Auto-Increment: Stop using Auto-Increment for primary keys if you plan to shard; two independent shards will happily generate the same ID. Use UUIDs (though they fragment indexes) or Twitter Snowflake IDs (time-sorted 64-bit integers; minimal generator sketched below).
- Cross-Shard Joins: They don't exist. You have to do the join in your application code. Fetch the user, then fetch their orders. Two queries. Yes, it's slower. That's the trade-off.
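For reference, here is a toy consistent-hash ring in PHP. This is a sketch, not a production library: the HashRing class name, the crc32 hash, and the 100 virtual nodes per shard are all our own illustrative choices.

class HashRing {
    private $ring = [];     // ring position => shard name
    private $sorted = false;

    public function addNode($node, $vnodes = 100) {
        // Virtual nodes spread each physical shard across many ring positions
        for ($i = 0; $i < $vnodes; $i++) {
            $this->ring[crc32($node . '#' . $i)] = $node;
        }
        $this->sorted = false;
    }

    public function getNode($key) {
        if (!$this->sorted) {
            ksort($this->ring);
            $this->sorted = true;
        }
        $hash = crc32((string)$key);
        // Walk clockwise to the first position at or past the key's hash...
        foreach ($this->ring as $position => $node) {
            if ($position >= $hash) {
                return $node;
            }
        }
        // ...wrapping around to the start of the ring if we fall off the end
        return reset($this->ring);
    }
}

$ring = new HashRing();
$ring->addNode('db-shard-01');
$ring->addNode('db-shard-02');
echo $ring->getNode(500); // stays mostly stable when a third shard is added

And a minimal Snowflake-style generator, again only a sketch (a real one must survive clock rollback and more than 4096 IDs per millisecond per worker; the 2018-01-01 epoch is our own choice):

function snowflakeId($workerId) {
    static $sequence = 0;
    $epoch = 1514764800000; // our custom epoch: 2018-01-01 UTC, in milliseconds
    $ms = (int) floor(microtime(true) * 1000) - $epoch;
    $sequence = ($sequence + 1) & 0xFFF;                          // 12-bit sequence
    return ($ms << 22) | (($workerId & 0x3FF) << 12) | $sequence; // 41 + 10 + 12 bits
}

echo snowflakeId(1); // time-sorted 64-bit integer, unique per worker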
Conclusion
Sharding is powerful, but it demands a level of infrastructure maturity that bargain-bin VPS hosting cannot provide. You need low-latency private networking, high IOPS on every shard, and rock-solid stability.
As we approach the GDPR era, knowing exactly where your data shards physically reside is not just a technical requirement; it is a legal one. If you are building the next big platform in the Nordics, build it on infrastructure that respects the physics of latency and the laws of the land.
Need to test a sharded architecture? Spin up three CoolVDS instances connected via our private VLAN in Oslo. Benchmark the difference NVMe makes before you commit to complex code changes.