Automating the Unmanageable: Why We Use Kubernetes Operators for StatefulSets in 2019

Let’s be honest for a second. Everyone claiming that running databases on Kubernetes is "solved" in 2019 is likely trying to sell you a managed service. It is not solved. It is hard. I have spent too many nights debugging split-brain scenarios in Elasticsearch clusters because a generic K8s controller didn't understand that Node A needs to flush to disk before Node B takes over. Stateless microservices are easy; you kill them, they respawn, nobody cries. But when you kill a primary database pod, you better hope your failover logic is smarter than a simple livenessProbe.
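
To be concrete about what "a simple livenessProbe" means: a probe like the one below (values are illustrative) only verifies that something answers on port 3306. It says nothing about whether this pod is safe to serve as a primary:

livenessProbe:
  tcpSocket:
    port: 3306          # "is the process alive?" -- yes, and that is all we learn
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3   # after 3 failures the pod is killed and restarted,
                        # with zero awareness of replication state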

This is where the Operator pattern changes the game. It is the only way to codify the "human" knowledge of a sysadmin into the cluster itself. If you are targeting the Norwegian market, where data protection (GDPR) and uptime are non-negotiable, you cannot rely on manual intervention.

The Gap Between StatefulSets and Reality

Kubernetes 1.14 is out and stable. StatefulSets (formerly PetSets) give us stable network identities (hostnames like mysql-0, mysql-1) and stable persistent storage. This is great, but it is purely infrastructure-level stability. The StatefulSet controller knows nothing about the application running inside.
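
For reference, here is a minimal sketch of the machinery that provides those guarantees (the image and storage class are borrowed from the CRD example further down; everything else is illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql        # headless Service -> stable DNS: mysql-0.mysql, mysql-1.mysql
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:5.7.25
        ports:
        - containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:     # each pod gets its own PVC that survives restarts
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: coolvds-nvme
      resources:
        requests:
          storage: 100Gi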

Pro Tip: A StatefulSet ensures your pod comes back with the same name and volume. It does not ensure your MySQL replication topology is healthy after a restart. That is your job.

In a recent project deploying an HA Redis cluster for a client in Oslo, we relied solely on StatefulSets. When a node degraded due to a noisy neighbor on a public cloud provider, the pod restarted. However, it tried to rejoin as a master because its config file hadn't been updated to reflect the new topology. Data corruption ensued. We needed something that could watch the cluster status and react intelligently.

Enter the Operator

An Operator is essentially a custom controller that uses Custom Resource Definitions (CRDs) to manage applications. It replaces the human who wakes up at 3 AM to fix the replication lag.

Instead of deploying a pile of ConfigMaps and StatefulSets manually, you define a CRD like this:

apiVersion: databases.coolvds.com/v1alpha1
kind: MySQLCluster
metadata:
  name: production-db
spec:
  replicas: 3
  version: "5.7.25"
  storage:
    size: 100Gi
    class: "coolvds-nvme"

The Operator watches for this resource. When it sees it, it spins up the necessary StatefulSets, Services, and Secrets. Crucially, it runs a control loop that checks application-specific metrics.
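
For completeness, the Operator also has to register that custom type with the API server. The CRD behind the example above would look roughly like this (group and kind are taken from the resource above; the rest is a sketch using the apiextensions v1beta1 API current as of K8s 1.14):

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  # name must be <plural>.<group>
  name: mysqlclusters.databases.coolvds.com
spec:
  group: databases.coolvds.com
  version: v1alpha1
  scope: Namespaced
  names:
    kind: MySQLCluster
    plural: mysqlclusters
    singular: mysqlcluster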

The Reconciliation Loop

The magic happens in the reconciliation loop. Here is a simplified logic flow of what a robust Operator does in Go:

import (
    "context"
    "time"

    "k8s.io/apimachinery/pkg/api/errors"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func (r *ReconcileMySQL) Reconcile(request reconcile.Request) (reconcile.Result, error) {
    // 1. Fetch the MySQLCluster instance
    instance := &MySQLCluster{}
    err := r.client.Get(context.TODO(), request.NamespacedName, instance)
    if err != nil {
        if errors.IsNotFound(err) {
            // The custom resource was deleted; nothing left to reconcile.
            return reconcile.Result{}, nil
        }
        return reconcile.Result{}, err
    }

    // 2. Check if the StatefulSet exists, create it if not
    // 3. Check MySQL internal status (e.g., 'SHOW SLAVE STATUS')

    if !isMasterHealthy(instance) {
        // PERFORM FAILOVER LOGIC
        // This is where the Operator outperforms standard K8s
        promoteSlave(instance)
    }

    // Requeue after a delay so the loop keeps watching without spinning hot.
    return reconcile.Result{RequeueAfter: 30 * time.Second}, nil
}

By embedding this logic, we automate the operational complexity. But software automation is only half the battle.

The Hardware Bottleneck: Why I/O Kills Operators

You can write the smartest Operator in the world, but if your underlying infrastructure has high I/O latency, your database will time out. K8s health checks are brutal. If etcd latency spikes because your VPS provider is overselling their spinning rust (HDD) storage, the cluster becomes unstable.

This is why we strictly use CoolVDS for our K8s nodes. In 2019, running databases on anything other than local NVMe is negligence.

Here is a benchmark comparison we ran using fio on a standard CoolVDS instance vs. a budget VPS provider often used by devs:

Metric                    | Budget VPS (SATA SSD)      | CoolVDS (NVMe)
Rand Read IOPS (4k)       | ~8,000                     | ~55,000
Latency (95th percentile) | 4.5 ms                     | 0.2 ms
etcd fsync duration       | Variable (spikes to 40 ms) | Stable (< 2 ms)

When an Operator tries to perform a failover, it queries the K8s API server, which talks to etcd. If etcd is slow, the Operator hangs. If the Operator hangs, your database stays down. Low latency infrastructure is not a luxury; it is a dependency for automated reliability.

Data Locality and Norwegian Compliance

Operating in Norway adds another layer: Datatilsynet, the Norwegian Data Protection Authority. With the GDPR having been in full effect for nearly a year now, knowing exactly where your bytes live is critical.

Using a CoolVDS KVM instance means you control the disk. You aren't sharing a kernel with 500 other containers (like in some CaaS offerings). You can encrypt the partition with LUKS for data-at-rest protection. This makes compliance audits significantly less painful.

Deploying a Prometheus Operator for Monitoring

To wrap this up, if you are running StatefulSets, you need visibility. The Prometheus Operator (by CoreOS) is the gold standard right now. It automatically configures Prometheus to scrape your pods based on ServiceMonitor definitions.

Here is how we configure a monitor for our custom DB:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mysql-monitor
  labels:
    team: backend
spec:
  selector:
    matchLabels:
      app: mysql-cluster
  endpoints:
  - port: metrics
    interval: 15s

This configuration automatically injects the scrape config into Prometheus. No restart required. Automation wins again.
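
One gotcha worth flagging: a ServiceMonitor selects Services, not pods directly. So the selector above needs a Service that carries the app: mysql-cluster label and exposes a port literally named metrics. A minimal sketch (the port number is an assumption, mysqld_exporter's default):

apiVersion: v1
kind: Service
metadata:
  name: mysql-metrics
  labels:
    app: mysql-cluster    # matched by the ServiceMonitor's selector
spec:
  selector:
    app: mysql-cluster
  ports:
  - name: metrics         # must match 'port: metrics' in the ServiceMonitor
    port: 9104            # mysqld_exporter's default port (assumption)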

Conclusion

Kubernetes Operators are the bridge between raw infrastructure and application logic. They allow us to run stateful workloads with confidence. However, do not let your software outpace your hardware. A brilliant Operator running on high-latency storage is just a faster way to crash your production environment.

If you need the raw power to back up your K8s clusters, stop gambling with shared resources. Deploy your nodes on CoolVDS NVMe instances. We offer the low latency and data sovereignty required for the Nordic market.

Ready to stabilize your stack? Deploy a high-performance CoolVDS instance today.