GitOps in Production: Stop kubectl apply from Ruining Your Weekend

It is 3:00 AM on a Saturday. Your pager just went off. Why? Because a junior developer manually tweaked a resource limit on the production cluster three days ago, didn't commit the change to the repo, and the new deployment just overwrote it, causing a cascading memory failure. If this sounds familiar, you are suffering from configuration drift. It is the silent killer of stability.

In 2018, we finally have a methodology that treats infrastructure with the same rigor as application code: GitOps. Coined by Weaveworks, it’s not just a buzzword; it is a survival strategy for anyone managing Kubernetes at scale.

At CoolVDS, we see this constantly. Clients migrate from shared hosting to our NVMe-backed KVM instances because they need the raw power to run container orchestrators, but then they manage those clusters like pet servers. Today, we are fixing that workflow.

The Core Principle: Git is the Single Source of Truth

The concept is brutally simple: If it is not in Git, it does not exist.

In a traditional CI/CD push model, your CI server (like Jenkins or GitLab CI) has a script that runs kubectl apply -f .... This is dangerous. It grants your CI server god-mode access to your cluster. If your CI gets breached, your production environment is gone. Furthermore, the CI server doesn't know if someone changed the cluster state manually.
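
To make the contrast concrete, a push-model job often looks something like the sketch below (a hypothetical GitLab CI job; it assumes kubectl is available on the runner and that KUBECONFIG is stored as a secret CI variable). That kubeconfig sitting in the CI system is exactly the credential GitOps lets you delete.

deploy_to_prod:
  stage: deploy
  script:
    # KUBECONFIG is a secret CI variable holding cluster-admin credentials --
    # the "god-mode" access we want to get out of the CI server
    - kubectl apply -f k8s/ --namespace=production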

GitOps flips this. You use an operator inside the cluster (like Weave Flux) to pull changes from Git. The cluster updates itself. It is safer, cleaner, and it provides an audit trail that keeps Datatilsynet (The Norwegian Data Protection Authority) happy.

Structuring Your Repositories

Do not mix your application source code with your Kubernetes manifests. You need two repositories:

  • App Repo: Source code, Dockerfile, unit tests.
  • Config Repo: YAML manifests (Deployments, Services, Ingress), Helm charts.
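
A minimal layout for the two repos might look like this (directory and file names are illustrative, arranged to match the pipeline and manifest shown later):

myapp/                      # App Repo
├── src/
├── Dockerfile
└── .gitlab-ci.yml

k8s-config/                 # Config Repo: what the cluster actually runs
├── deployment.yaml
├── service.yaml
└── ingress.yaml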

When you push code to the App Repo, your CI pipeline builds the Docker image and pushes it to a registry. Then—and this is the key—it makes a commit to the Config Repo updating the image tag. That is it. The CI never touches the cluster.

The CI Pipeline (GitLab CI Example)

Here is how you automate the tag update in 2018 using GitLab CI. This example tags images with the Git commit SHA; swap in a semantic version tag if that is your convention.

stages:
  - build
  - update-manifests

build_image:
  stage: build
  script:
    # Assumes the runner is already authenticated against registry.example.com
    - docker build -t registry.example.com/myapp:$CI_COMMIT_SHA .
    - docker push registry.example.com/myapp:$CI_COMMIT_SHA

update_config_repo:
  stage: update-manifests
  image: alpine:3.7
  before_script:
    - apk add --no-cache git
    - git config --global user.email "ci-bot@coolvds.com"
    - git config --global user.name "CI Bot"
  script:
    # ACCESS_TOKEN is a protected CI variable with write access to the config repo
    - git clone https://oauth2:${ACCESS_TOKEN}@gitlab.com/my-org/k8s-config.git
    - cd k8s-config
    # Point the manifest at the freshly built image tag
    - sed -i "s/image: .*myapp:.*/image: registry.example.com\/myapp:$CI_COMMIT_SHA/" deployment.yaml
    - git commit -am "Bump image to $CI_COMMIT_SHA"
    - git push origin master

The Pull: Convergence

Now that the Config Repo is updated, the agent inside your cluster sees the change. In late 2018, Weave Flux is the standard for this. It polls your Git repo and ensures the cluster matches the state defined in YAML.
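
A minimal sketch of the Flux daemon wired to the Config Repo is below. It is trimmed for brevity: RBAC, the deploy-key Secret, and the memcached deployment are omitted, and the image tag, repo URL, and intervals are illustrative. Start from the official Flux manifests for anything real.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flux
  namespace: flux
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flux
  template:
    metadata:
      labels:
        app: flux
    spec:
      serviceAccountName: flux
      containers:
      - name: flux
        image: quay.io/weaveworks/flux:1.8.1
        args:
        - --git-url=git@gitlab.com:my-org/k8s-config.git
        - --git-branch=master
        - --git-poll-interval=1m
        - --sync-interval=5m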

If a sysadmin manually changes the replica count from 3 to 5 using kubectl scale, Flux detects the drift and, on its next sync, resets it back to 3. The only way to make 5 stick is to submit a Pull Request. This enforces peer review on infrastructure changes.
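
You can watch the convergence happen yourself. The commands below assume the manifest from the next section and the fluxctl CLI, with Flux running in the flux namespace:

# Drift the live state away from Git
kubectl -n production scale deployment/nordic-payment-gateway --replicas=5

# Force a sync instead of waiting for the next poll
fluxctl sync --k8s-fwd-ns flux

# Replicas are back to 3 -- the value declared in the Config Repo
kubectl -n production get deployment nordic-payment-gateway -o jsonpath='{.spec.replicas}'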

The Deployment Manifest

Your declarative file in the Config Repo should look standard, but ensure you define liveness probes. Without them, automated rollouts are blind.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nordic-payment-gateway
  namespace: production
  annotations:
    flux.weave.works/automated: "true"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-gateway
  template:
    metadata:
      labels:
        app: payment-gateway
    spec:
      containers:
      - name: payment-gateway
        image: registry.example.com/myapp:a1b2c3d4
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 3

Infrastructure Matters: The Etcd Bottleneck

GitOps relies heavily on the Kubernetes API server and, by extension, etcd. Etcd is extremely sensitive to disk latency. If your disk write latency spikes (common on oversold VPS providers), etcd initiates leader elections, your API server hangs, and your GitOps operator stops syncing.

Pro Tip: Watch your etcd WAL fsync duration. If the 99th percentile consistently exceeds 10ms, your cluster is headed for instability.
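
If Prometheus already scrapes etcd, a query along these lines surfaces the problem (threshold in seconds, matching the 10ms rule of thumb):

# p99 WAL fsync latency over the last 5 minutes, in seconds
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01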

This is where hardware choice becomes architectural. At CoolVDS, we don't use spinning rust. Our NVMe storage arrays deliver the consistently low write latency (and ample IOPS) that etcd needs to stay healthy, even during high-churn deployments. Furthermore, because we utilize KVM virtualization, your resources are ring-fenced. You don't suffer from "noisy neighbor" syndrome where another customer's database backup kills your GitOps sync loop.
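
Don't take any provider's word for it, ours included. A rough fsync benchmark with fio (parameters chosen to approximate etcd's small sequential WAL writes; the directory is just an example) tells you within a minute whether a disk is up to the job. Look at the fsync latency percentiles in the output and keep the 99th percentile under 10ms.

fio --name=etcd-wal-test --directory=/var/lib/etcd-test \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300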

Compliance in the Nordic Market

For our Norwegian customers, data sovereignty is paramount. With the GDPR having come into full effect this May, you need to know exactly where your data lives. When you use a GitOps workflow on CoolVDS, your Git repos might live on GitLab or GitHub, but the actual runtime data and the persistent volumes stay right here in our Oslo datacenter (or our other European locations).

By enforcing infrastructure-as-code, you also create a perfect audit log for compliance. Every change to your production environment is timestamped, attributed to a user, and (if you enforce commit signing) cryptographically signed in Git. Try explaining a manual ssh change to an auditor; it doesn't work.
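
Producing that trail for an auditor is a one-liner against the Config Repo (the format string is just one convenient layout):

# Who changed what in production, and when
git log --pretty=format:'%h  %ad  %an  %s' --date=iso -- deployment.yaml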

Monitoring the Sync

You can't just set it and forget it. You need to know if the sync fails. Use Prometheus to scrape metrics from your GitOps operator.

# Example Prometheus query to detect sync failures
flux_daemon_sync_manifests_duration_seconds_count{success="false"} > 0
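
Rather than eyeballing that query, wrap the same expression in a Prometheus 2.x alerting rule so a failed sync actually pages someone (the group, alert, and label names here are illustrative):

groups:
- name: gitops
  rules:
  - alert: GitOpsSyncFailing
    expr: flux_daemon_sync_manifests_duration_seconds_count{success="false"} > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Flux is recording failed syncs against the Config Repo"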

Conclusion

Moving to GitOps requires a cultural shift. It forces you to stop logging into servers. It feels restrictive at first, like wearing a seatbelt. But once you realize you can recreate your entire production cluster in minutes just by pointing a new cluster at your Git repo, you will never go back.

However, software automation is only as good as the hardware it runs on. Don't let IO wait times throttle your innovation.

Ready to run Kubernetes without the lag? Deploy a high-performance, NVMe-backed instance on CoolVDS today and experience the stability your GitOps workflow demands.