Stop Guessing: A Battle-Hardened Guide to APM and Infrastructure Visibility
It’s 3:00 AM. Your pager is screaming. The Oslo-based eCommerce client is reporting 502 Bad Gateway errors, but your dashboard shows CPU load at a calm 2.5 on a quad-core server. You ssh in, run top, and everything looks... fine? This is the nightmare scenario for every sysadmin. The problem isn't that the server is down; the problem is that you are flying blind.
Most developers relying on standard VPS hosting in Norway stop at "is it pinging?" But in 2019, with microservices and heavy database interactions becoming the norm, relying on basic tools like top or htop is negligence. If you can't see the difference between User CPU, System CPU, and I/O Wait, you aren't engineering; you're gambling.
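You do not need a full APM stack to see that split. vmstat, part of the procps package on any stock Ubuntu install, already breaks CPU time into user (us), system (sy), idle (id), and I/O wait (wa) columns:

vmstat 1 5

A box whose load average looks calm but whose wa column is pinned at 30% is not idle; it is stuck waiting on disk.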
In this guide, we are going to build a proper Application Performance Monitoring (APM) stack using Prometheus 2.9 and Grafana 6.1 on Ubuntu 18.04 LTS. We will expose the invisible bottlenecks that cheap hosting providers try to hide: specifically, Steal Time and Disk Latency.
The Lie of "Dedicated" Resources
Before we touch a single config file, we need to address the infrastructure. You can have the most optimized Nginx config in the world, but if your underlying storage is spinning rust (HDD) masquerading as "Enterprise Storage," or if your hypervisor is oversubscribed, your metrics will lie to you.
Pro Tip: Always check the %st (steal time) column in top. If this is above 0.0 on a regular basis, your hosting provider is overselling their CPU cores. At CoolVDS, we use KVM virtualization to ensure strict resource isolation. We don't steal your cycles.
To verify this immediately on your current server, install the sysstat package and take a few live samples of CPU usage:
sudo apt-get update && sudo apt-get install sysstat -y
sar -u 1 5
If the %steal column shows anything other than 0.00, migrate immediately. Latency-sensitive applications cannot survive on stolen cycles.
Step 1: The Exporters (The Eyes)
Prometheus doesn't push data; it pulls it. We need to set up "exporters" on your nodes. For a standard Linux box, the Node Exporter is non-negotiable. It exposes kernel-level metrics that are critical for diagnosing I/O wait issues.
Download and run the exporter (version 0.18.0 is the current stable release as of May 2019):
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.0/node_exporter-0.18.0.linux-amd64.tar.gz
tar xvfz node_exporter-0.18.0.linux-amd64.tar.gz
cd node_exporter-0.18.0.linux-amd64
./node_exporter
This starts a metrics server on port 9100 in the foreground. From a second terminal, verify it by curling the local endpoint:
curl http://localhost:9100/metrics | grep node_cpu_seconds_total
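Running the binary from an interactive shell is fine for a first look, but it dies with your SSH session. A minimal systemd unit keeps it running across reboots; this is a sketch that assumes you copy the binary to /usr/local/bin and are content running it as the nobody user (adjust paths and users to your own conventions):

sudo cp node_exporter /usr/local/bin/
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=nobody
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter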
Step 2: Configuring Prometheus
Now we need the brain. We will configure Prometheus to scrape our node exporter. Create a prometheus.yml file. This is where we define our targets. In a production environment inside CoolVDS, you would likely use service discovery, but for this setup, static config works best for clarity.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          env: 'production'
          region: 'no-oslo-1'
Notice the label region: 'no-oslo-1'. When dealing with GDPR and Datatilsynet requirements here in Norway, tagging your data by region is critical for compliance auditing later.
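The labels pay off immediately in queries, too. For example, to chart I/O wait only for the Oslo production nodes (using the label values above), you can filter on them directly:

avg by (instance) (rate(node_cpu_seconds_total{region="no-oslo-1", mode="iowait"}[5m]))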
Step 3: Database Visibility (MySQL/MariaDB)
The database is guilty until proven innocent. Standard monitoring tells you if MySQL is running. We need to know if it's locking. We need the mysqld_exporter.
First, create a dedicated user in MySQL for the exporter to use. Do not let the exporter connect as root.
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'StrongPassword123';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
Then, create a .my.cnf file for the exporter credentials:
[client]
user=exporter
password=StrongPassword123
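With the credentials file in place, download and start the exporter pointing at it. This is a sketch assuming mysqld_exporter 0.11.0 and a credentials path of /home/exporter/.my.cnf (adjust the version and path to whatever you actually use); the exporter listens on port 9104, which you then add as another scrape target in prometheus.yml:

wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.11.0/mysqld_exporter-0.11.0.linux-amd64.tar.gz
tar xvfz mysqld_exporter-0.11.0.linux-amd64.tar.gz
cd mysqld_exporter-0.11.0.linux-amd64
./mysqld_exporter --config.my-cnf=/home/exporter/.my.cnf   # path is a placeholder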
Once the exporter is running, you can track the InnoDB buffer pool metrics directly. If you see high disk I/O on your CoolVDS instance, check Innodb_buffer_pool_reads: it counts reads that missed the buffer pool and had to hit disk, so a steadily rising rate means you are serving queries from storage instead of RAM. The fix is to raise innodb_buffer_pool_size (and the RAM behind it), or at least verify you are on NVMe storage so every miss hurts less.
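A rough PromQL sketch of that check, assuming the exporter's usual convention of exposing global status counters under the mysql_global_status_ prefix:

rate(mysql_global_status_innodb_buffer_pool_reads[5m])

Divide it by rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]) if you want the miss ratio rather than the raw miss rate.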
Step 4: Putting it all together with Docker Compose
Manual binary management is tedious. Let's wrap this entirely in Docker. This approach ensures reproducibility across your dev and production environments. We are using the standard Docker Compose file format v2.4.
version: '2.4'

services:
  prometheus:
    image: prom/prometheus:v2.9.2
    volumes:
      # mount the scrape config from Step 2 and persist the TSDB
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - 9090:9090
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:6.1.6
    depends_on:
      - prometheus
    ports:
      - 3000:3000
    volumes:
      - grafana_data:/var/lib/grafana
    env_file:
      - config.monitoring
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v0.18.0
    volumes:
      # read-only host mounts so the exporter reports the host, not the container
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points'
      - "^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($$|/)"
    ports:
      - 9100:9100
    networks:
      - monitoring

networks:
  monitoring:

volumes:
  prometheus_data:
  grafana_data:
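One gotcha: the grafana service above expects a config.monitoring env file sitting next to docker-compose.yml. A minimal sketch (the GF_* names are standard Grafana environment overrides), followed by the command to bring the stack up:

cat > config.monitoring <<'EOF'
# Leave GF_SECURITY_ADMIN_PASSWORD unset to keep Grafana's stock admin/admin
# login mentioned below, or set it here for anything reachable from the internet.
GF_USERS_ALLOW_SIGN_UP=false
EOF
docker-compose up -d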
Step 5: Interpreting the Data (The "Aha!" Moment)
Once Grafana is up (default login admin/admin), add Prometheus as a data source. Import dashboard ID 1860 (Node Exporter Full). Now, look at the Disk I/O time.
On standard hosting with SATA SSDs, you will often see latency spikes during backup windows or high traffic. On CoolVDS, we deploy strictly on enterprise NVMe arrays. The difference isn't just speed; it's concurrency.
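The dashboard panels are driven by counters you can also query directly. As a sketch (metric names as exposed by node_exporter 0.16 and later), the first expression approximates the fraction of each second a device spent busy with I/O, and the second approximates average latency per completed read:

rate(node_disk_io_time_seconds_total[5m])
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])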
Run this PromQL query to find your average request duration over the last 5 minutes (it assumes your application is instrumented with a Prometheus client library exposing http_request_duration_seconds; node_exporter alone will not give you this):
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
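Averages hide the tail. If your application also exports the matching histogram buckets, the 95th percentile is usually the more honest number:

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))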
Why Location Matters: The Oslo Factor
Latency isn't just about disk speed; it's about physics. If your customers are in Norway, hosting in Frankfurt or London adds 20-30ms of round-trip time (RTT) purely due to distance and fiber hops.
CoolVDS infrastructure is peered directly at the NIX (Norwegian Internet Exchange). When your server responds to a request in Oslo, it stays local. Low latency improves the "Time to First Byte" (TTFB), which is a significant ranking factor for SEO in 2019.
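You can put a number on this yourself from a machine in your target market; the hostname below is a placeholder for your own endpoint, and mtr-tiny is an apt-get install away if you want the per-hop breakdown:

ping -c 10 your-server.example.no          # placeholder hostname
mtr --report --report-cycles 10 your-server.example.no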
Conclusion
Visibility is the only defense against downtime. By implementing Prometheus and Grafana, you move from reactive panic to proactive engineering. But remember, software cannot fix bad hardware. If your monitoring shows high I/O wait or CPU steal time, no amount of caching will save you.
You need raw, isolated power. Stop fighting your infrastructure.
Deploy a CoolVDS High-Performance NVMe instance today and see what 0.0% steal time actually feels like.