
Monitoring is Dead. Long Live Observability: A 2018 Survival Guide for Norwegian DevOps

Stop Trusting Your Green Dashboards

It is May 22, 2018. We are exactly three days away from the GDPR enforcement deadline. While your legal team is likely hyperventilating over data processing agreements, you—the sysadmin, the DevOps engineer, the architect—have a different problem. Your dashboard is green. All systems operational. Yet the support tickets are piling up. "The checkout is slow." "Images aren't loading."

This is the failure of Monitoring. Monitoring tells you that your server is up. It answers the questions you knew to ask. Is CPU < 90%? Is disk space > 10%? Is the HTTP status 200?

Observability is different. It answers the questions you didn't know you needed to ask. It tells you why the checkout is slow only for users on Telenor mobile networks between 18:00 and 20:00. In a world moving rapidly toward Docker containers and microservices, simple health checks are obsolete.

The Three Pillars in 2018

To move from reactive fire-fighting to proactive engineering, we need to aggregate three distinct data types. If you are still relying solely on Nagios or Zabbix, you are flying blind.

1. Metrics: The "What" (Prometheus)

Time-series data is king. We aren't just looking at current load; we need the rate of change. In 2018, Prometheus has effectively won the metrics war against Graphite and InfluxDB for cloud-native environments. Its pull-based model works perfectly with the dynamic nature of container orchestration.

Pro Tip: Don't just monitor CPU usage. Monitor saturation. A CPU at 100% can be fine if the run queue is empty; a CPU at 50% is a disaster if I/O wait is high.

Here is a standard node_exporter scrape config for your prometheus.yml. Note the scrape interval: 15 seconds is usually the sweet spot between granularity and storage cost. The relabel block tags the production node so you can filter dashboards and alerts by environment later.

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds_nodes'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: '10\.0\.0\.5:9100'
        target_label: 'environment'
        replacement: 'production'
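
To act on the saturation tip above, pair the scrape config with an alerting rule. The sketch below is a starting point, not a prescription: the file name alerts.yml, the 30% threshold, and the node_cpu_seconds_total metric name (node_exporter 0.16+; older releases expose node_cpu) are assumptions to adjust for your environment. Load it from prometheus.yml with a rule_files entry.

# alerts.yml - fires when a node spends too much time waiting on I/O
groups:
  - name: saturation
    rules:
      - alert: HighIOWait
        # Average fraction of CPU time spent in iowait over the last 5 minutes.
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is I/O saturated - check disk latency before blaming the app"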

2. Logs: The "Context" (ELK Stack)

Metrics tell you there is a spike. Logs tell you it's because a botnet is hitting /wp-login.php. With the ELK Stack (Elasticsearch, Logstash, Kibana) currently at version 6.2, we have a robust pipeline.

However, standard Nginx logs are useless for debugging latency. You need to capture the upstream response time. Modify your nginx.conf to include $request_time (total time) and $upstream_response_time (time the backend took).

http {
    log_format apm '$remote_addr - $remote_user [$time_local] '
                   '"$request" $status $body_bytes_sent '
                   '"$http_referer" "$http_user_agent" '
                   'rt=$request_time uct="$upstream_connect_time" uht="$upstream_header_time" urt="$upstream_response_time"';

    access_log /var/log/nginx/access.log apm;
}

Now, when you ingest this into Elasticsearch, you can visualize exactly how much latency is introduced by your PHP-FPM or Node.js application versus the network.
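
To turn those timing fields into numbers Elasticsearch can aggregate, parse them in Logstash. The filter below is a sketch that assumes the apm format above is used verbatim and that logs reach Logstash via Filebeat; the upstream_* fields are kept as strings because Nginx writes "-" (or comma-separated lists on retried requests) there.

# /etc/logstash/conf.d/nginx-apm.conf (filter section only)
filter {
  grok {
    match => {
      "message" => '%{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] "%{DATA:request}" %{NUMBER:status:int} %{NUMBER:body_bytes_sent:int} "%{DATA:http_referer}" "%{DATA:http_user_agent}" rt=%{NUMBER:request_time:float} uct="%{DATA:upstream_connect_time}" uht="%{DATA:upstream_header_time}" urt="%{DATA:upstream_response_time}"'
    }
  }
  # Index on the Nginx timestamp rather than the ingestion time.
  date {
    match => [ "time_local", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}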

3. Tracing: The "Where" (OpenTracing/Jaeger)

If you are splitting a monolith into microservices (a common trend we are seeing across Oslo tech hubs), logs aren't enough. You need to trace a request as it hops from your load balancer to your frontend, then to the auth service, and finally to the database.

Jaeger (compatible with the OpenTracing standard) is the tool of choice here. It allows you to visualize the waterfall of a request.
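
The quickest way to try it is the all-in-one image, which bundles the agent, collector, query service, and UI in a single container with in-memory storage. The docker-compose sketch below assumes that setup; it is fine for evaluation, not for production.

# docker-compose.yml - Jaeger all-in-one for evaluation (spans live in memory)
version: '2'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest   # pin a specific release in practice
    ports:
      - "6831:6831/udp"   # agent endpoint used by OpenTracing client libraries
      - "16686:16686"     # web UI with the request waterfall view
      - "14268:14268"     # collector HTTP endpoint for direct span submission

Point your instrumented services at the agent port and open the UI on port 16686 to see where each hop spends its time.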

The Infrastructure Bottleneck

Here is the hard truth nobody tells you: Observability stacks are heavy.

Elasticsearch is a memory beast. Prometheus requires fast disk I/O to write time-series chunks. If you try to run an observability stack on cheap, oversold VPS hosting, the monitoring tools themselves will cause the outage.

I recently audited a setup where the JVM heap for Elasticsearch was constantly swapping because the host node was overcommitted. The result? Logs were delayed by 40 minutes. Useless.
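
The fix is twofold: give the JVM a fixed heap no larger than half of physical RAM, and forbid the OS from swapping it out. A sketch for Elasticsearch 6.2, assuming a node with 8 GB of dedicated memory:

# /etc/elasticsearch/jvm.options - identical min and max heap, half of physical RAM
-Xms4g
-Xmx4g

# /etc/elasticsearch/elasticsearch.yml - refuse to run with a swappable heap
bootstrap.memory_lock: true

On systemd hosts the elasticsearch unit also needs LimitMEMLOCK=infinity, or the memory lock will fail at startup.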

The Hardware Requirement

To run a reliable observability stack in 2018, you need:

  • Dedicated RAM: No "burstable" memory nonsense. Java needs guaranteed heap.
  • NVMe Storage: Spinning rust (HDD) cannot handle the indexing rate of a busy ELK stack. You need high IOPS.
  • KVM Virtualization: Container-based virtualization (like OpenVZ) shares the host kernel. If a neighbor exhausts shared kernel limits such as file descriptors, your Prometheus scrapes start failing.

This is why we built CoolVDS on pure KVM with local NVMe storage. When you deploy a monitoring node with us, you aren't fighting for resources. You get the raw throughput required to ingest thousands of metrics per second without the "noisy neighbor" effect.

GDPR and Data Residency

With May 25th looming, where you store your logs matters. IP addresses in access logs count as personal data under the GDPR. If you are dumping logs into a US-based cloud service, you are walking a compliance tightrope.

Hosting your observability stack on CoolVDS servers in Norway keeps that data within the jurisdiction. It simplifies your Article 30 records of processing activities. You know exactly where the physical drive sits.

Implementation Plan

Don't try to boil the ocean. Start small:

  1. Day 1: Install node_exporter on all your VMs. Set up a small Prometheus instance on a CoolVDS NVMe plan.
  2. Day 2: Update your Nginx/Apache configs to log timing data.
  3. Day 3: Visualize the data in Grafana 5.0. Create a dashboard that shows "99th Percentile Latency" rather than just "Average Load"; a rule sketch for that panel follows below.
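
For that latency panel, a Prometheus recording rule keeps the dashboard query cheap. The sketch assumes your application (or an exporter) exposes a latency histogram named http_request_duration_seconds; if all you have is the Nginx logs, Grafana's Elasticsearch datasource can compute the same percentile with a percentiles aggregation on request_time.

# rules.yml - precompute the 99th percentile so Grafana panels stay fast
groups:
  - name: latency
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))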

The era of "Is it up?" is over. Welcome to "How does it perform?". If your current hosting struggles to keep up with your logging IOPS, it's time to move.

Need a rock-solid foundation for your metrics? Deploy a high-performance KVM instance on CoolVDS today and get full visibility into your stack.