Stop Guessing: A Battle-Hardened Guide to APM and Infrastructure Visibility
It’s 3:00 AM. Your pager is screaming. The Oslo-based eCommerce client is reporting 502 Bad Gateway errors, but your dashboard shows CPU load at a calm 2.5 on a quad-core server. You ssh in, run top, and everything looks... fine? This is the nightmare scenario for every sysadmin. The problem isn't that the server is down; the problem is that you are flying blind.
Most developers relying on standard VPS hosting in Norway stop at "is it pinging?" But in 2019, with microservices and heavy database interactions becoming the norm, relying on basic tools like top or htop is negligence. If you can't see the difference between User CPU, System CPU, and I/O Wait, you aren't engineering; you're gambling.
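You do not need a full APM stack to see that split. vmstat, part of the procps package on any stock Ubuntu install, already breaks CPU time into user (us), system (sy), idle (id), and I/O wait (wa) columns:

vmstat 1 5

A box whose load average looks calm but whose wa column is pinned at 30% is not idle; it is stuck waiting on disk.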
In this guide, we are going to build a proper Application Performance Monitoring (APM) stack using Prometheus 2.9 and Grafana 6.1 on Ubuntu 18.04 LTS. We will expose the invisible bottlenecks that cheap hosting providers try to hide: specifically, Steal Time and Disk Latency.
The Lie of "Dedicated" Resources
Before we touch a single config file, we need to address the infrastructure. You can have the most optimized Nginx config in the world, but if your underlying storage is spinning rust (HDD) masquerading as "Enterprise Storage," or if your hypervisor is oversubscribed, your metrics will lie to you.
Pro Tip: Always check the %st (steal time) column in top. If this is above 0.0 on a regular basis, your hosting provider is overselling their CPU cores. At CoolVDS, we use KVM virtualization to ensure strict resource isolation. We don't steal your cycles.
To verify this immediately on your current server, install the sysstat package and take a few live samples of CPU usage:
sudo apt-get update && sudo apt-get install sysstat -y
sar -u 1 5
If the %steal column shows anything other than 0.00, migrate immediately. Latency-sensitive applications cannot survive on stolen cycles.
Step 1: The Exporters (The Eyes)
Prometheus doesn't push data; it pulls it. We need to set up "exporters" on your nodes. For a standard Linux box, the Node Exporter is non-negotiable. It exposes kernel-level metrics that are critical for diagnosing I/O wait issues.
Download and run the exporter (version 0.18.0 is the current stable release as of May 2019):
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.0/node_exporter-0.18.0.linux-amd64.tar.gz
tar xvfz node_exporter-0.18.0.linux-amd64.tar.gz
cd node_exporter-0.18.0.linux-amd64
./node_exporter
This starts a metrics server on port 9100 in the foreground. From a second terminal, verify it by curling the local endpoint:
curl http://localhost:9100/metrics | grep node_cpu_seconds_total
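Running the binary from an interactive shell is fine for a first look, but it dies with your SSH session. A minimal systemd unit keeps it running across reboots; this is a sketch that assumes you copy the binary to /usr/local/bin and are content running it as the nobody user (adjust paths and users to your own conventions):

sudo cp node_exporter /usr/local/bin/
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=nobody
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter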
Step 2: Configuring Prometheus
Now we need the brain. We will configure Prometheus to scrape our node exporter. Create a prometheus.yml file. This is where we define our targets. In a production environment inside CoolVDS, you would likely use service discovery, but for this setup, static config works best for clarity.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          env: 'production'
          region: 'no-oslo-1'
Notice the label region: 'no-oslo-1'. When dealing with GDPR and Datatilsynet requirements here in Norway, tagging your data by region is critical for compliance auditing later.
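The labels pay off immediately in queries, too. For example, to chart I/O wait only for the Oslo production nodes (using the label values above), you can filter on them directly:

avg by (instance) (rate(node_cpu_seconds_total{region="no-oslo-1", mode="iowait"}[5m]))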
Step 3: Database Visibility (MySQL/MariaDB)
The database is guilty until proven innocent. Standard monitoring tells you if MySQL is running. We need to know if it's locking. We need the mysqld_exporter.
First, create a dedicated user in MySQL for the exporter to use. Do not let the exporter connect as root.
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'StrongPassword123';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
Then, create a .my.cnf file for the exporter credentials:
[client]
user=exporter
password=StrongPassword123
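With the credentials file in place, download and start the exporter pointing at it. This is a sketch assuming mysqld_exporter 0.11.0 and a credentials path of /home/exporter/.my.cnf (adjust the version and path to whatever you actually use); the exporter listens on port 9104, which you then add as another scrape target in prometheus.yml:

wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.11.0/mysqld_exporter-0.11.0.linux-amd64.tar.gz
tar xvfz mysqld_exporter-0.11.0.linux-amd64.tar.gz
cd mysqld_exporter-0.11.0.linux-amd64
./mysqld_exporter --config.my-cnf=/home/exporter/.my.cnf   # path is a placeholder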
Once the exporter is running, you can track the InnoDB buffer pool metrics directly. If you see high disk I/O on your CoolVDS instance, check Innodb_buffer_pool_reads: it counts reads that missed the buffer pool and had to hit disk, so a steadily rising rate means you are serving queries from storage instead of RAM. The fix is to raise innodb_buffer_pool_size (and the RAM behind it), or at least verify you are on NVMe storage so every miss hurts less.
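A rough PromQL sketch of that check, assuming the exporter's usual convention of exposing global status counters under the mysql_global_status_ prefix:

rate(mysql_global_status_innodb_buffer_pool_reads[5m])

Divide it by rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]) if you want the miss ratio rather than the raw miss rate.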
Step 4: Putting it all together with Docker Compose
Manual binary management is tedious. Let's wrap this entirely in Docker. This approach ensures reproducibility across your dev and production environments. We are using the standard Docker Compose file format v2.4.
version: '2.4'

services:
  prometheus:
    image: prom/prometheus:v2.9.2
    volumes:
      # mount the scrape config from Step 2 and persist the TSDB
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - 9090:9090
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:6.1.6
    depends_on:
      - prometheus
    ports:
      - 3000:3000
    volumes:
      - grafana_data:/var/lib/grafana
    env_file:
      - config.monitoring
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v0.18.0
    volumes:
      # read-only host mounts so the exporter reports the host, not the container
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points'
      - "^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($$|/)"
    ports:
      - 9100:9100
    networks:
      - monitoring

networks:
  monitoring:

volumes:
  prometheus_data:
  grafana_data:
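One gotcha: the grafana service above expects a config.monitoring env file sitting next to docker-compose.yml. A minimal sketch (the GF_* names are standard Grafana environment overrides), followed by the command to bring the stack up:

cat > config.monitoring <<'EOF'
# Leave GF_SECURITY_ADMIN_PASSWORD unset to keep Grafana's stock admin/admin
# login mentioned below, or set it here for anything reachable from the internet.
GF_USERS_ALLOW_SIGN_UP=false
EOF
docker-compose up -d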
Step 5: Interpreting the Data (The "Aha!" Moment)
Once Grafana is up (default login admin/admin), add Prometheus as a data source. Import dashboard ID 1860 (Node Exporter Full). Now, look at the Disk I/O time.
On standard hosting with SATA SSDs, you will often see latency spikes during backup windows or high traffic. On CoolVDS, we deploy strictly on enterprise NVMe arrays. The difference isn't just speed; it's concurrency.
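The dashboard panels are driven by counters you can also query directly. As a sketch (metric names as exposed by node_exporter 0.16 and later), the first expression approximates the fraction of each second a device spent busy with I/O, and the second approximates average latency per completed read:

rate(node_disk_io_time_seconds_total[5m])
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])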
Run this PromQL query to find your average request duration over the last 5 minutes (it assumes your application is instrumented with a Prometheus client library exposing http_request_duration_seconds; node_exporter alone will not give you this):
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
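Averages hide the tail. If your application also exports the matching histogram buckets, the 95th percentile is usually the more honest number:

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))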
Why Location Matters: The Oslo Factor
Latency isn't just about disk speed; it's about physics. If your customers are in Norway, hosting in Frankfurt or London adds 20-30ms of round-trip time (RTT) purely due to distance and fiber hops.
CoolVDS infrastructure is peered directly at the NIX (Norwegian Internet Exchange). When your server responds to a request in Oslo, it stays local. Low latency improves the "Time to First Byte" (TTFB), which is a significant ranking factor for SEO in 2019.
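You can put a number on this yourself from a machine in your target market; the hostname below is a placeholder for your own endpoint, and mtr-tiny is an apt-get install away if you want the per-hop breakdown:

ping -c 10 your-server.example.no          # placeholder hostname
mtr --report --report-cycles 10 your-server.example.no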
Conclusion
Visibility is the only defense against downtime. By implementing Prometheus and Grafana, you move from reactive panic to proactive engineering. But remember, software cannot fix bad hardware. If your monitoring shows high I/O wait or CPU steal time, no amount of caching will save you.
You need raw, isolated power. Stop fighting your infrastructure.
Deploy a CoolVDS High-Performance NVMe instance today and see what 0.0% steal time actually feels like.