Beyond Green Dashboards: Why Monitoring Fails and Observability Saves Your Stack

Monitoring Tells You You're Down. Observability Tells You Why.

It is 3:00 AM on a Tuesday. Your phone lights up. PagerDuty is screaming. You open your laptop, squinting at the screen, and check your primary dashboard. Everything is green. CPU usage on the web nodes is at a comfortable 40%. Memory is fine. Disk space is ample. Yet, your support ticket queue is flooding with Norwegian customers reporting 502 Bad Gateway errors.

This is the failure of Monitoring.

In the complex distributed systems we build today—whether it's microservices on Kubernetes v1.19 or a monolith split across several VPS instances—knowing that a system is up is useless if you don't understand the internal state that caused a failure. That is the domain of Observability. As we close out 2020, if you are still just pinging endpoints and watching CPU graphs, you are flying blind.

The Core Difference: Known vs. Unknown Unknowns

I have had this argument with CTOs from Oslo to Berlin. They look at the bill for Datadog or New Relic and ask why we need so much data. Here is the reality:

  • Monitoring checks for "known unknowns." Is the disk full? Is the process running? You write a check because you predict a specific failure mode.
  • Observability allows you to ask questions about "unknown unknowns." Why is latency spiking only for users in Trondheim using iOS 14? Why did that specific database query hang for 5 seconds only when the cart contained three items?

To achieve observability, you need three pillars: Metrics, Logs, and Traces. And you need to own them.
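
Logs and traces get concrete treatment below; metrics are the cheapest pillar to start with. A minimal sketch using the Python prometheus_client library, assuming a self-hosted Prometheus scrapes the exporter on port 8000 (the metric and endpoint names are illustrative):

import random
import time

from prometheus_client import Histogram, start_http_server

# Label by endpoint so latency can be sliced per route in PromQL
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'Request latency in seconds',
    ['endpoint'],
)

def handle_checkout():
    # time() records one observation into the histogram buckets;
    # percentiles are computed at query time with histogram_quantile()
    with REQUEST_LATENCY.labels(endpoint='/checkout').time():
        time.sleep(random.uniform(0.05, 0.2))  # simulated work

if __name__ == '__main__':
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_checkout()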

The Infrastructure Bottleneck: IOPS and Storage

True observability generates a massive amount of high-cardinality data. If you are logging every request, every SQL query, and every function trace, your write operations will explode. This is where most generic VPS providers fail. They sell you vCPUs but choke your disk I/O.

Pro Tip: Never run an ELK (Elasticsearch, Logstash, Kibana) stack on standard SATA SSDs or shared block storage. The indexing latency will create a backlog, and your logs will arrive 20 minutes late. On CoolVDS, we utilize local NVMe storage specifically to handle the high IOPS required for real-time ingestion.
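
Before you point Logstash at a box, measure what the disk can actually do. The honest tool for this is fio, but even a rough probe of fsync latency (the operation that stalls Elasticsearch's translog on slow or shared block storage) tells you a lot. A quick sketch, not a benchmark; the path is illustrative and should sit on the volume you plan to use for data:

import os
import time

PATH = '/var/tmp/fsync_probe.bin'  # place this on the intended data volume
BLOCK = os.urandom(4096)

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
latencies = []
for _ in range(1000):
    start = time.perf_counter()
    os.write(fd, BLOCK)
    os.fsync(fd)  # force the block onto the device, as a translog flush does
    latencies.append(time.perf_counter() - start)
os.close(fd)
os.remove(PATH)

latencies.sort()
print(f"p99 fsync latency: {latencies[int(len(latencies) * 0.99)] * 1000:.2f} ms")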

1. Structured Logging: The Foundation

Grepping through text files is dead. You need structured logs (JSON) that can be ingested and queried. Here is how we configure Nginx to output actionable JSON logs that Fluentd or Logstash can parse instantly. This configuration has saved my skin more times than I can count this year.

http {
    log_format json_analytics escape=json
    '{'
        '"msec": "$msec", ' # connection time in seconds
        '"connection": "$connection", ' # connection serial number
        '"connection_requests": "$connection_requests", ' # number of requests made in this connection
        '"pid": "$pid", ' # process pid
        '"request_id": "$request_id", ' # the unique request id
        '"request_length": "$request_length", ' # request length (including headers and body)
        '"remote_addr": "$remote_addr", ' # client IP
        '"remote_user": "$remote_user", ' # client HTTP username
        '"remote_port": "$remote_port", ' # client port
        '"time_local": "$time_local", '
        '"time_iso8601": "$time_iso8601", ' # local time in the ISO 8601 standard format
        '"request": "$request", ' # full path no arguments if the request is GET
        '"request_uri": "$request_uri", ' # full path and arguments if the request is GET
        '"args": "$args", ' # args
        '"status": "$status", ' # response status code
        '"body_bytes_sent": "$body_bytes_sent", ' # the number of body bytes exclude headers sent to a client
        '"bytes_sent": "$bytes_sent", ' # the number of bytes sent to a client
        '"http_referer": "$http_referer", ' # HTTP referer
        '"http_user_agent": "$http_user_agent", ' # user agent
        '"http_x_forwarded_for": "$http_x_forwarded_for", ' # http_x_forwarded_for
        '"http_host": "$http_host", ' # the request Host: header
        '"server_name": "$server_name", ' # the name of the vhost serving the request
        '"request_time": "$request_time", ' # request processing time in seconds with msec resolution
        '"upstream": "$upstream_addr", ' # upstream backend server for proxied requests
        '"upstream_connect_time": "$upstream_connect_time", ' # upstream handshake time
        '"upstream_header_time": "$upstream_header_time", ' # header received time
        '"upstream_response_time": "$upstream_response_time", ' # time spent receiving upstream body
        '"upstream_response_length": "$upstream_response_length", ' # upstream response length
        '"upstream_cache_status": "$upstream_cache_status", ' # cache HIT/MISS where applicable
        '"ssl_protocol": "$ssl_protocol", ' # TLS protocol
        '"ssl_cipher": "$ssl_cipher", ' # TLS cipher
        '"scheme": "$scheme", ' # http or https
        '"request_method": "$request_method" ' # request method
    '}';

    access_log /var/log/nginx/analytics.log json_analytics;
}
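
Once the log is structured, ad-hoc questions stop requiring regex gymnastics, even before the data reaches Logstash. As a quick illustration (the log path matches the access_log above; the analysis itself is a hypothetical example, not part of our stack), a few lines of Python pull the p95 upstream response time per backend straight from the file:

import json
from collections import defaultdict

# Group upstream response times by backend address
times = defaultdict(list)
with open('/var/log/nginx/analytics.log') as fh:
    for line in fh:
        entry = json.loads(line)
        upstream = entry.get('upstream') or 'local'
        try:
            times[upstream].append(float(entry['upstream_response_time']))
        except (KeyError, ValueError):
            continue  # static files and errors carry no upstream timing

for upstream, values in sorted(times.items()):
    values.sort()
    p95 = values[int(len(values) * 0.95)]
    print(f"{upstream}: p95 upstream_response_time = {p95:.3f}s")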

2. Tracing: Finding the Needle

Metrics show trends. Logs show errors. Traces show the journey. In 2020, Jaeger is the de facto standard for open-source tracing. If you are running a Python application (Django or Flask), you need to instrument your code to see where the latency lives. Is it the Postgres query? Or the external API call to Stripe?

Here is a battle-tested snippet using the jaeger-client library. This isn't theoretical; this is what runs in production.

import logging
import time
from jaeger_client import Config

def init_tracer(service):
    logging.getLogger('').handlers = []
    logging.basicConfig(format='%(message)s', level=logging.DEBUG)

    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
            'local_agent': {
                'reporting_host': '127.0.0.1',
                'reporting_port': '6831',
            },
        },
        service_name=service,
        validate=True,
    )
    return config.initialize_tracer()

tracer = init_tracer('payment-service')

with tracer.start_span('process_payment') as span:
    span.set_tag('payment_type', 'credit_card')
    
    with tracer.start_span('validate_card', child_of=span) as child_span:  # child_of nests this under the parent span
        # Simulate work
        time.sleep(0.1)
        child_span.log_kv({'event': 'validation_success'})
        
    with tracer.start_span('charge_gateway', child_of=span) as child_span:
        # Simulate latency spike
        time.sleep(0.5)
        child_span.set_tag('gateway', 'stripe_eu')
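
One caveat before shipping this: the const sampler above records every single trace, which is fine on a staging box but will hammer both the application and the Jaeger agent under production load. The usual pattern (a sketch, tune the rate to your traffic) is a probabilistic sampler plus an explicit flush on shutdown:

# In production, swap the 'sampler' block in the Config above for a
# probabilistic sampler so only ~1% of requests are traced:
#
#     'sampler': {
#         'type': 'probabilistic',
#         'param': 0.01,
#     },

# jaeger_client reports spans over UDP in the background; close() flushes
# the reporter so the final spans are not dropped when the process exits.
tracer.close()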

The Elephant in the Room: Schrems II and GDPR

July 2020 changed everything. The CJEU's Schrems II ruling invalidated the EU-US Privacy Shield framework. If you are a Norwegian company pumping user logs, which inevitably contain IP addresses (personal data under the GDPR), into a US-owned SaaS observability platform, you are walking a legal tightrope. Datatilsynet, the Norwegian Data Protection Authority, is not lenient on this.

This is the pragmatic CTO argument for using CoolVDS. By self-hosting your Prometheus and ELK stack on our infrastructure in Norway, your data never crosses the Atlantic. You maintain full data sovereignty.

Optimizing Elasticsearch for Write-Heavy Loads

If you decide to self-host (which you should), Elasticsearch will be your resource hog. Default configurations are not sufficient for production logging. You need to tweak the jvm.options and elasticsearch.yml to handle the influx without crashing.

Here are the specific settings we apply for high-throughput logging clusters. Node-level settings live in elasticsearch.yml; the index-level tuning has to be applied per index or via an index template, shown after the config:

# /etc/elasticsearch/elasticsearch.yml

# Lock memory to prevent swapping (critical for performance)
bootstrap.memory_lock: true

# Discovery settings (essential for clustering)
discovery.seed_hosts: ["10.0.0.1", "10.0.0.2"]
cluster.initial_master_nodes: ["node-1", "node-2"]

# Thread pool management for write-heavy logging
thread_pool.write.queue_size: 1000

# NOTE: the index-level knobs below cannot live in elasticsearch.yml
# (Elasticsearch rejects index.* settings at the node level since 5.x);
# apply them per index or via an index template instead.
#
# index.translog.durability: async   # async durability trades safety for throughput on logs
# index.translog.sync_interval: 5s
# index.refresh_interval: 30s        # default is 1s; raising it reduces I/O pressure
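
The cleanest way to apply those index-level settings is an index template, so every new daily log index inherits them automatically. A minimal sketch using the Python requests library, assuming Elasticsearch listens on localhost:9200 without authentication (the template name and the logs-* pattern are illustrative):

import requests

# Hypothetical template for daily log indices such as logs-2020.11.24
template = {
    "index_patterns": ["logs-*"],
    "settings": {
        "index.translog.durability": "async",   # trade durability for throughput
        "index.translog.sync_interval": "5s",
        "index.refresh_interval": "30s",        # default 1s; 30s cuts I/O pressure
        "index.number_of_replicas": 1,
    },
}

resp = requests.put(
    "http://127.0.0.1:9200/_template/logs-write-heavy",
    json=template,
)
resp.raise_for_status()
print(resp.json())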

How does self-hosting compare with the SaaS platforms?

Feature             SaaS (Datadog / New Relic)        Self-Hosted (CoolVDS)
Data Sovereignty    Questionable (US CLOUD Act)       100% Norway / EU
Cost at Scale       Exponential ($$$ per host/GB)     Linear (fixed resource cost)
Retention           Expensive to keep >14 days        Limited only by disk size
Setup Effort        Low (agent install)               Medium (Ansible/Terraform)

Conclusion: Own Your Data, Own Your Uptime

Observability is not a product you buy; it is a culture of engineering. It requires the right tools, but more importantly, it requires the right infrastructure. You cannot debug a millisecond-level latency spike if your underlying hypervisor is stealing CPU cycles or your storage is thrashing.

At CoolVDS, we don't just sell virtual machines. We provide the raw, compliant, high-performance canvas for you to paint your observability masterpiece. Don't let a slow disk be the reason you didn't see the crash coming.

Ready to build a compliant, high-performance observability stack? Deploy a CoolVDS NVMe instance in Oslo today and keep your logs where they belong.