Escaping Jenkins Hell: High-Performance CI/CD Pipelines on KVM Infrastructure

There is nothing quite as soul-crushing as watching a progress bar stall at 98% on a Friday afternoon. You’ve committed the patch. The unit tests passed locally. Yet, your continuous integration server is choking, timing out, or throwing inexplicable Java heap errors.

I have spent the last six months debugging a deployment pipeline for a high-traffic e-commerce client based here in Oslo. Their build times crept up from 5 minutes to 45 minutes. The culprit? It wasn't code bloat. It wasn't complex dependencies. It was the underlying infrastructure gasping for air.

In the emerging culture of DevOps, we talk a lot about tools—Jenkins, Travis, Puppet, Chef. But we rarely talk about the metal running them. Today, we are going to look at the mechanics of a robust CI/CD pipeline, why shared hosting environments (like OpenVZ) are killing your efficiency, and how to architect a solution that survives the real world.

The Hidden Bottleneck: Disk I/O

Continuous Integration is basically a torture test for storage subsystems. Every time Jenkins triggers a build, it performs thousands of small read/write operations: checking out Git repositories, generating binaries, creating temporary directories, and archiving artifacts.

On a standard HDD-based VPS, or worse, a heavily oversold node, your iowait spikes. I recently logged into a client's build server and saw this:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  4  12400 245000  45300 120400    0    0  4500  3200 1200 2400 15 10  5 70

See that wa (I/O wait) column hovering at 70%? That means the CPU is sitting idle 70% of the time, just waiting on the disk. You are paying for CPU cycles you can't use.

Pro Tip: If your Jenkins dashboard feels sluggish, check your $JENKINS_HOME location. If it's not on an SSD or high-performance storage, no amount of RAM will fix the latency. Moving to NVMe storage (or enterprise PCIe Flash) is the single biggest upgrade you can make in 2014.
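A quick way to verify this: check which filesystem $JENKINS_HOME actually sits on and whether the backing device claims to be rotational. The path below assumes the stock Debian/Ubuntu package location and a virtio disk named vda; adjust for your setup, and keep in mind that inside a VM the rotational flag only reflects what the hypervisor reports.

$ df -h /var/lib/jenkins                # default $JENKINS_HOME for the Debian/Ubuntu package
$ cat /sys/block/vda/queue/rotational   # 0 = flash/SSD, 1 = spinning disk (as reported to the guest)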

The Architecture: Jenkins + Ansible + KVM

To fix this, we moved the pipeline to a CoolVDS KVM instance. Why KVM? Because with the release of Docker 1.0 last month (June 2014), the landscape is shifting. OpenVZ containers share a kernel with the host, which makes running Docker containers inside them a nightmare (or impossible). KVM gives us a true hardware virtualization layer.
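Before assuming Docker will even start, it is worth confirming what kind of virtualization you actually landed on. A minimal check, assuming the virt-what package is installed:

$ sudo virt-what                # prints "kvm" on a KVM guest
$ ls -d /proc/vz 2>/dev/null    # if this directory exists, you are inside an OpenVZ container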

Here is the sanitized architecture we deployed:

  1. Build Server: Jenkins running on a CoolVDS KVM instance (4 vCPU, 8GB RAM).
  2. Configuration Management: Ansible 1.6 (agentless, clean, Python-based); see the inventory sketch below.
  3. Staging: A mirror of production, spun up via API.
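For reference, a minimal Ansible inventory for this layout might look like the following. The group names match the playbook further down; the hostnames are placeholders, not the client's real machines.

# /etc/ansible/hosts (hostnames are illustrative)
[webservers]
web01.example.no
web02.example.no

[staging]
staging01.example.no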

Optimizing the Jenkins Config

Out of the box, Jenkins is greedy. To prevent it from crashing under load, we need to tune the JVM. In /etc/default/jenkins, do not just accept the defaults.

# /etc/default/jenkins

# Increase heap size for heavy build queues
JAVA_ARGS="-Xmx2048m -XX:MaxPermSize=512m"

# Headless mode is mandatory for servers
JAVA_ARGS="$JAVA_ARGS -Djava.awt.headless=true"

# Garbage collection tuning to reduce pauses
JAVA_ARGS="$JAVA_ARGS -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled"

This configuration prevents the dreaded "PermGen space" errors that haunt Java developers.
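None of this takes effect until Jenkins is restarted. On a Debian/Ubuntu package install:

$ sudo service jenkins restart
$ ps aux | grep [j]ava    # confirm -Xmx2048m and -XX:MaxPermSize=512m are on the command line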

Deployment Logic with Ansible

Stop using Bash scripts for deployment. They are brittle. If a script fails halfway through, you are left with a broken state. We utilize Ansible because it is idempotent—you can run it ten times, and the result is always the same.

Here is a snippet of our playbook that handles the atomic switch-over of the application. This ensures zero downtime during deployment by using symbolic links.

---
- name: Deploy Web App
  hosts: webservers
  vars:
    release_path: "/var/www/releases/{{ ansible_date_time.iso8601 }}"
    current_path: "/var/www/current"

  tasks:
    - name: Clone Repository
      git:
        repo: 'git@github.com:coolvds/webapp.git'
        dest: "{{ release_path }}"
        version: master

    - name: Install Dependencies (Composer)
      command: composer install --no-dev --optimize-autoloader
      args:
        chdir: "{{ release_path }}"

    - name: Update Symlink (Atomic Switch)
      file:
        src: "{{ release_path }}"
        dest: "{{ current_path }}"
        state: link
        force: yes
      notify: Reload Nginx

  handlers:
    - name: Reload Nginx
      service:
        name: nginx
        state: reloaded

This approach isolates the build. If composer install fails, the symlink is never updated, and production remains untouched. Safety first.
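Wiring this into Jenkins is a single "Execute shell" build step. The playbook and inventory filenames below are illustrative; use whatever layout your repository follows:

ansible-playbook -i inventory/production deploy.yml --limit webservers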

Security and Data Sovereignty in Norway

For our Norwegian clients, latency and legality are paramount. Hosting your CI/CD pipeline or staging data in the US might violate the Personal Data Act (Personopplysningsloven) if you are handling real customer data for testing. You need data processing agreements in place.

Furthermore, ping times matter. If your dev team is in Oslo, pushing gigabytes of artifacts to a server in Virginia is inefficient. Hosting locally ensures latency to the NIX (Norwegian Internet Exchange) is kept under 5ms.
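Measuring this takes thirty seconds. Replace the hostname below with your own build server:

$ ping -c 5 build.example.no       # round-trip time from the office
$ mtr --report build.example.no    # per-hop latency, handy when the path leaves Norway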

Hardening the Build Server

Since build servers often have SSH keys with root access to production, they are a prime target. Do not leave them exposed.

# /etc/ssh/sshd_config

# Disable password auth completely
PasswordAuthentication no
PermitRootLogin no

# Restrict access to specific users
AllowUsers jenkins deployer

# Use non-standard port to avoid script kiddies
Port 2222

Additionally, with the recent Heartbleed vulnerability (CVE-2014-0160) still fresh in everyone's memory, ensure your OpenSSL libraries are patched. A managed hosting provider usually handles the hardware-level security, but the OS is your responsibility.
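Checking takes a minute. On Debian/Ubuntu the version string may still read 1.0.1e even when the fix has been backported, so look at the build date rather than the number (the fix landed on 7 April 2014), then upgrade the library and restart anything linked against it. Package names below assume Debian/Ubuntu:

$ openssl version -a | grep built    # a build date of 7 April 2014 or later means the fix is in
$ sudo apt-get update && sudo apt-get install --only-upgrade openssl libssl1.0.0
$ sudo service nginx restart         # repeat for every service that links against libssl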

Why Infrastructure is the Variable You Control

You cannot always control the quality of code your developers write. You cannot control the upstream changes in third-party APIs. But you can control the environment it runs on.

When we switched our client to CoolVDS, we utilized their KVM architecture to enable kernel-level tuning that wasn't possible on shared containers. We adjusted the `sysctl.conf` to handle massive amounts of concurrent connections during load testing:

# /etc/sysctl.conf optimization for high load
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_tw_reuse = 1
fs.file-max = 65535
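Apply the changes without a reboot and spot-check that they took:

$ sudo sysctl -p                             # reload values from /etc/sysctl.conf
$ sysctl net.ipv4.tcp_tw_reuse fs.file-max   # verify the new settings are live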

The result? Build times dropped from 45 minutes to 8 minutes. The iowait vanished thanks to the underlying SSD performance.

Final Thoughts

In 2014,