Stop Guessing: A DevOps Guide to Application Performance Monitoring (APM) in 2020

It was 3:14 AM on a Tuesday when my phone buzzed. The alert simply said: "High Latency - API Gateway." I logged in. The CPU load was normal. Memory usage was flat. The logs showed HTTP 200 OK responses. Yet, customers in Trondheim were reporting timeouts. The system was gaslighting me.

If you have managed production systems long enough, you know this feeling. The dashboard says green, but the reality is red. This happens when you rely on availability monitoring (is it up?) rather than performance monitoring (is it fast?).

In 2020, with microservices becoming the default architecture and containerization (Docker/Kubernetes) adding layers of abstraction, you cannot afford to fly blind. You need deep observability. In this guide, we are going to build a monitoring stack that actually works, focusing on the specific constraints of the Nordic infrastructure landscape.

The Three Pillars: Logs, Metrics, and Tracing

Before we touch a config file, let's establish the ground rules. A robust APM strategy requires three distinct data types:

  • Metrics: Aggregatable data. "CPU is at 80%." Cheap to store, great for alerts.
  • Logs: Discrete events. "User X failed login at timestamp Y." Expensive to store, critical for debugging.
  • Tracing: The journey of a request. "Request entered Nginx, spent 20ms in Go app, and 400ms in MySQL."

Pro Tip: Do not try to use logs for trending data. Parsing text logs to calculate average response times is CPU-intensive and slow. Use metrics for trends, logs for context.
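
To make the distinction concrete, here is roughly what one unit of each data type looks like. These are illustrative samples, not output from a real system:

# Metric: one aggregatable sample (Prometheus exposition format)
http_requests_total{method="GET",status="200"} 10234

# Log: one discrete event (an Nginx access log line)
203.0.113.7 - - [14/Mar/2020:03:14:07 +0100] "GET /api/orders HTTP/1.1" 200 512

# Trace: one span of a request's journey (simplified; the exact format depends on your tracer)
span="SELECT orders" service=mysql duration=400ms parent="GET /api/orders"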

Step 1: The Foundation – Monitoring the Host

You cannot debug application slowness if your underlying VPS is gasping for air. The most overlooked metric in virtualized hosting is CPU Steal Time (`%st`). This occurs when your "dedicated" vCPU is waiting for the hypervisor to schedule it on a physical core.

On oversold budget hosting, steal time can spike to 10-20%, causing your application to stutter unpredictably. This is why we advocate for KVM-based virtualization with strict resource guarantees, like the NVMe instances we provision at CoolVDS. We don't play the noisy neighbor game.

Check your current steal time immediately:

top -b -n 1 | grep "Cpu(s)"

If the last value (`st`) is consistently above 0.0, migrate. Your code isn't slow; your host is greedy.
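
A single top snapshot can miss short bursts, so it is worth watching steal over a minute or two. vmstat ships with procps on practically every distribution; the last column (st) is steal time as a percentage:

# Sample CPU counters every 5 seconds, 12 times (about one minute)
vmstat 5 12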

Step 2: Building a Self-Hosted Metrics Stack

While SaaS solutions like New Relic or Datadog are powerful, they get expensive fast. Furthermore, with the strict interpretation of GDPR and data sovereignty in Norway, sending system metrics (which can inadvertently contain PII) to US servers is a risk many CTOs prefer to avoid.

We will deploy Prometheus (for scraping metrics) and Grafana (for visualizing them) using Docker. This setup runs beautifully on a standard CoolVDS instance running Ubuntu 18.04 LTS.

The Docker Compose Setup

Save this as docker-compose.yml:

version: '3.7'

services:
  prometheus:
    image: prom/prometheus:v2.16.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
    ports:
      - 9090:9090
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:6.6.2
    depends_on:
      - prometheus
    ports:
      - 3000:3000
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=SecretPassword123
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v0.18.1
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - 9100:9100
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

This configuration spins up Prometheus, Grafana, and Node Exporter. Node Exporter is the agent that reads Linux kernel metrics. On a CoolVDS instance, this gives you visibility into disk I/O wait times, which is crucial for verifying if your database is bottlenecked by storage speed.
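
If you want a sanity check on I/O wait before the dashboards exist, iostat (from the sysstat package, install it first if it is missing) reports the same figures Node Exporter will later export:

# Extended device statistics every 5 seconds, 3 samples
# Watch %iowait in the avg-cpu block and the per-device await columns
iostat -x 5 3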

Configuring Prometheus

Create a prometheus.yml file in the same directory:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']
      
  # Example: If you are running a Go application
  # - job_name: 'backend-api'
  #   static_configs:
  #     - targets: ['10.0.0.5:8080']
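
With prometheus.yml and docker-compose.yml in the same directory, bring the stack up and confirm Prometheus can reach its targets. A minimal sketch, assuming Docker and Docker Compose are already installed:

docker-compose up -d

# Quick health checks from the host itself
curl -s http://localhost:9090/-/healthy        # Prometheus liveness endpoint
curl -s http://localhost:9100/metrics | head   # raw Node Exporter output

Then open http://<server-ip>:9090/targets in a browser to confirm both jobs report as UP, log in to Grafana on port 3000, and add Prometheus (http://prometheus:9090) as a data source; the community Node Exporter dashboards on grafana.com are a quick way to get useful panels.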

Step 3: Database & Web Server Instrumentation

System metrics aren't enough. You need to know what your software is doing. Let's look at two critical components: Nginx and MySQL.

Nginx: Timing is Everything

Default Nginx logs are insufficient for APM. We need to know the Upstream Response Time (how long your PHP/Python/Node app took to reply) versus the total request time. Modify your nginx.conf:

log_format apm_combined '$remote_addr - $remote_user [$time_local] '
                        '"$request" $status $body_bytes_sent '
                        '"$http_referer" "$http_user_agent" '
                        'rt=$request_time uct="$upstream_connect_time" uht="$upstream_header_time" urt="$upstream_response_time"';

access_log /var/log/nginx/access.log apm_combined;

Now, when you grep your logs, you can see exactly where the latency lives. If rt=0.500 and urt=0.495, your backend is slow. If rt=0.500 and urt=0.005, the delay is likely in network transfer or Nginx buffering.
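
Because rt= is just another field in the log line, standard command-line tools are enough to surface the slow requests. Validate and reload Nginx first; the 0.5 second threshold below is only an example, so tune it to your own latency budget:

# Validate the new config and reload without dropping connections
nginx -t && nginx -s reload

# Print every request whose total time (rt=) exceeded 0.5 seconds
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^rt=/) { split($i, t, "="); if (t[2] + 0 > 0.5) print } }' /var/log/nginx/access.log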

MySQL: The Slow Query Log

Databases are usually the culprit. In 2020, with SSDs becoming standard, disk read latency is lower, but bad indexing is still fatal. Enable the slow query log in your my.cnf (or mysqld.cnf):

[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 1
log_queries_not_using_indexes = 1

Setting long_query_time to 1 second is a good start, but for high-performance apps, consider lowering it to 0.5 seconds.
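
Restart MySQL (or flip the same settings at runtime with SET GLOBAL) and let the log collect real traffic, then summarize it. mysqldumpslow ships with the MySQL server packages; a minimal sketch against the log path configured above:

# Top 10 statements by total query time (-s t), with literal values abstracted away
mysqldumpslow -s t -t 10 /var/log/mysql/mysql-slow.log

If you have Percona Toolkit installed, pt-query-digest over the same file gives a richer per-query breakdown.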

Step 4: The Latency of Geography

You can optimize your code for weeks, but you cannot beat the speed of light. If your target audience is in Norway, but your VPS is in Frankfurt or London, you are adding 20-40 ms of round-trip time (RTT) to every exchange between client and server.

For a modern TLS handshake (which requires multiple round trips), that distance adds up to perceptible lag. This is where local presence matters.

Route                               Approx. Ping (RTT)
Oslo User -> US East (Virginia)     ~90-110 ms
Oslo User -> Frankfurt              ~25-35 ms
Oslo User -> CoolVDS (Oslo)         ~1-3 ms

Hosting locally isn't just about patriotism or GDPR compliance (though Datatilsynet appreciates it); it is a raw performance optimization.
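
You do not have to take that table on faith. curl can break a single request into DNS, TCP connect, TLS handshake, and total time; run this from a machine close to your users, substituting your own endpoint:

curl -o /dev/null -s -w 'dns: %{time_namelookup}s  tcp: %{time_connect}s  tls: %{time_appconnect}s  total: %{time_total}s\n' https://example.com/

The gap between the tcp and tls figures is mostly round trips, and it shrinks roughly in proportion to the RTT numbers above.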

Implementation Strategy

To roll this out without downtime:

  1. Provision a Monitor Node: Spin up a dedicated CoolVDS instance. Do not run Prometheus on the same server as your production database; monitoring tools consume memory.
  2. Secure the Network: Use iptables or UFW to ensure only your monitor node can access port 9100 (Node Exporter) on your production servers; see the sketch after this list.
  3. Start Simple: Enable the Nginx custom logs first. That gives you immediate visibility into latency distribution.
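
For point 2, a minimal UFW sketch; 203.0.113.10 stands in for your monitor node's IP address:

# Allow only the monitor node to reach Node Exporter, reject everyone else
ufw allow from 203.0.113.10 to any port 9100 proto tcp
ufw deny 9100/tcp
ufw reload

Note that ports published by Docker can bypass UFW's rules, so on production hosts it is often simpler to run Node Exporter directly as a systemd service rather than inside a container.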

Final Thoughts

Performance monitoring is not a "nice to have" feature. In a market where users expect instant load times, it is your first line of defense against churn. By combining the raw power of NVMe-based virtualization with a granular, self-hosted monitoring stack, you reclaim control over your infrastructure.

Don't let your application perform in the dark. Deploy a test instance on CoolVDS today, install Node Exporter, and finally see what your CPU is actually doing.