Taming the Whale: A 2014 Guide to Docker Orchestration in Production
Let’s be honest: running docker run on your laptop is fun. It’s clean. It’s isolated. But trying to run a multi-node cluster in production right now? It’s an absolute mess. If you are like me, you spent the last six months of 2014 fighting with port mapping, linking containers across hosts, and praying the Docker daemon doesn't hang (again).
The container revolution is here, but the tools to manage it are still in their infancy. While everyone is talking about the new Kubernetes project Google open-sourced back in June, it’s arguably still too alpha for anyone valuing their sleep. So, what do we actually use today to manage distributed containers without losing our minds?
I’ve tested three approaches on CoolVDS NVMe instances to see what holds up under load: the "dumb" Ansible approach, the CoreOS/Fleet method, and the heavy-lifting Apache Mesos.
1. The "Just Use Ansible" Approach
Sometimes, boring is best. If you aren't Google, you might not need a scheduler. You just need configuration management. We rely heavily on Ansible at CoolVDS for our internal tooling because it’s agentless. You can simply wrap your Docker commands in standard playbooks.
The problem isn't starting the container; it's handling the networking when you have a database on Host A and a web worker on Host B. Since Docker links don't span hosts natively yet, we have to manage ports manually.
Here is a snippet from a playbook I used last week to deploy a Redis cluster. Note the explicit port binding to the host interface:
- name: Start Redis Container
  docker:
    name: redis_master
    image: redis:2.8
    state: started
    ports:
      - "6379:6379"
    volumes:
      - /data/redis:/data
  register: redis_container

- name: Update Firewall for Redis
  ufw:
    rule: allow
    port: 6379
    proto: tcp
    src: "{{ web_worker_ip }}"
The Verdict: It works, but it's rigid. If a node dies, Ansible won't automatically reschedule the workload unless you wake up and run the playbook again. For static environments, it's fine. For dynamic scaling, it fails.
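There is a crude workaround, and I stress it's a sketch rather than a real scheduler: re-run the playbook from cron so a dead container at least comes back on the next pass. The paths, user, and playbook name below are placeholders for illustration.
# /etc/cron.d/reconverge-redis -- re-run the playbook every 10 minutes.
# 'state: started' is idempotent: healthy containers are untouched, missing ones are recreated.
# (Adjust the ansible-playbook path and user to match your install.)
*/10 * * * * deploy /usr/bin/ansible-playbook -i /etc/ansible/hosts /opt/playbooks/redis.yml >> /var/log/ansible-cron.log 2>&1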
2. The "Distributed Systemd": CoreOS & Fleet
This is where things get interesting. CoreOS is stripping Linux down to the bare essentials, and Fleet effectively treats your entire cluster as one giant init system. It’s clever. You write systemd unit files, and Fleet decides where to run them.
We've seen a lot of Norwegian dev teams adopting this because it feels native. You don't learn a new API; you just learn systemd. However, it relies heavily on etcd for consensus, and if your disk I/O latency is high, etcd's log fsyncs and heartbeats slow down until the cluster starts churning through leader elections. This is a common support ticket we see.
Pro Tip: Never run etcd on standard spinning rust (HDD). The fsync latency will cause leader elections to time out, and a cluster stuck in election churn can't schedule anything. We strictly provision CoolVDS instances with local SSD/NVMe storage for this exact reason. If `iowait` spikes, your cluster dies.
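If you suspect the disk, a minimal sanity check (assuming etcd is listening on its classic client port 4001 and you have sysstat installed) is to poll etcd's stats endpoint while watching the device:
# Ask the local etcd member about its state and who it currently sees as leader.
curl -s http://127.0.0.1:4001/v2/stats/self
# Watch disk latency for five seconds under load; sustained high await/%util is a red flag.
iostat -x 1 5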
Here is a fleet unit template (myapp@.service). You launch several instances of it (myapp@1, myapp@2, and so on), and the [X-Fleet] section forces each instance onto a separate machine:
[Unit]
Description=My Nginx App
After=docker.service
Requires=docker.service
[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill web_app
ExecStartPre=-/usr/bin/docker rm web_app
ExecStartPre=/usr/bin/docker pull coolvds/nginx-custom:1.0
ExecStart=/usr/bin/docker run --name web_app -p 80:80 coolvds/nginx-custom:1.0
ExecStop=/usr/bin/docker stop web_app
[X-Fleet]
Conflicts=myapp@*.service
The Conflicts directive is powerful. It tells Fleet: "Don't schedule an instance of this unit on a machine that is already running one." Instant HA.
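Getting the template into the cluster and spread across machines is a couple of fleetctl calls (the instance names after the @ are arbitrary; I just use numbers):
# Load the unit template into the cluster.
fleetctl submit myapp@.service
# Start two instances; the Conflicts rule guarantees they land on different machines.
fleetctl start myapp@1.service myapp@2.service
# See which machine each instance was scheduled on.
fleetctl list-units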
3. The Heavyweight: Apache Mesos + Marathon
If you are building the next Twitter, you use Mesos. It abstracts CPU, memory, and disk away from the machines. You use Marathon as the framework to launch Docker containers on top of Mesos.
It is robust, but the learning curve is a brick wall. Setting up ZooKeeper (which Mesos requires for master election) is not a Friday afternoon task. However, once it's running, it is bulletproof. We recently helped a client migrate a high-traffic e-commerce site to Mesos on our infrastructure. They needed to handle the holiday traffic spikes without manual intervention.
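To be fair, the ZooKeeper config itself is the easy part; the pain is in operating it. Here is a rough sketch of what a three-node ensemble needs, with hostnames and paths as placeholders rather than recommendations:
# Minimal zoo.cfg for a 3-node ensemble (identical file on every node).
cat > /etc/zookeeper/conf/zoo.cfg <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.internal:2888:3888
server.2=zk2.internal:2888:3888
server.3=zk3.internal:2888:3888
EOF
# Each node also needs its own id (1, 2 or 3) matching its server.N line above.
echo 1 > /var/lib/zookeeper/myid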
Posting a JSON payload to the Marathon API looks like this:
{
  "id": "frontend",
  "cpus": 0.5,
  "mem": 512.0,
  "instances": 3,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "coolvds/frontend:v2",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 80, "hostPort": 0, "servicePort": 9000, "protocol": "tcp" }
      ]
    }
  }
}
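Pushing that definition at Marathon is one HTTP call. The hostname below is a placeholder; I'm assuming the JSON above is saved as frontend.json and Marathon is on its default port 8080:
# Create the app; Marathon replies with the definition it stored and starts scheduling tasks.
curl -s -X POST http://marathon.internal:8080/v2/apps \
     -H "Content-Type: application/json" \
     -d @frontend.json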
Notice the "hostPort": 0. Marathon picks a free port from the resources Mesos offers on that host instead of you hard-coding one. This necessitates a service discovery mechanism (like HAProxy or Consul) to find where your containers actually landed. Complexity increases, but so does scalability.
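Marathon will also tell you where everything landed, which is exactly what a tool generating HAProxy config would poll (same placeholder hostname as above):
# One entry per running task, including the host and the ports Mesos handed out.
curl -s http://marathon.internal:8080/v2/apps/frontend/tasks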
Infrastructure Matters: The KVM Difference
No matter which orchestrator you choose—Fleet, Mesos, or standard Ansible—your containers are only as stable as the kernel they share. This is where the "Noisy Neighbor" effect kills performance.
Many providers oversell their container-based VPS (OpenVZ). If another user on the node forks a thousand processes, your Docker containers stall. At CoolVDS, we refuse to do that. We use KVM (Kernel-based Virtual Machine) virtualization. Each customer gets their own kernel.
Why KVM is non-negotiable for Docker in 2014:
- Security: Docker 1.3 added security profiles, but escaping a container is still a risk. KVM adds a hard hardware virtualization layer.
- Kernel Modules: Want to use specialized networking or storage drivers? You need your own kernel.
- Data Sovereignty: With the scrutiny on Safe Harbor and the EU Data Protection Directive, keeping your data on physical hardware located in Oslo (and not replicated to a US cloud) is critical for compliance with the Norwegian Personal Data Act.
When you are running a database inside a container (which I generally advise against, but I know you do it anyway), I/O contention becomes the bottleneck. We benchmarked `fio` random writes on our KVM instances versus standard cloud instances. The difference is stark.
# 4k random write test
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test \
--filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
On CoolVDS, we consistently see IOPS figures that comfortably sustain write-heavy etcd or ZooKeeper clusters. On budget VPS, the same test often chokes, causing leader election failures in your orchestration layer.
Conclusion
We are in a transition period. Fig (Docker bought its maker, Orchard, earlier this year, and it's slated to be relaunched as "Compose") is great for dev, but production is still the Wild West. If you need simple, go with Ansible. If you need clustering, Fleet is the modern choice. If you have a team of ten Ops engineers, Mesos is the beast.
But whatever you run, don't run it on weak foundations. Latency kills distributed systems faster than software bugs.
Ready to build a cluster that doesn't flake out? Deploy a high-performance KVM instance on CoolVDS today and get the raw I/O your containers are starving for.