Introduction
Observable infrastructure needs systematic metric collection, alerting, and visualization. Prometheus has become the de facto monitoring standard for cloud-native environments, providing flexible metric collection, a powerful query language, and integration with visualization platforms. This guide covers Prometheus deployment and configuration for production monitoring.
Prometheus operates on a pull-based model, periodically scraping metrics from configured targets. This design simplifies configuration compared to push-based systems and enables dynamic service discovery in containerized environments. Understanding the architecture informs deployment and troubleshooting decisions.
Monitoring serves multiple purposes: incident detection, capacity planning, and performance optimization. Good monitoring strategies balance alert sensitivity against noise, ensuring notifications arrive for actionable issues while filtering expected variations.
Prometheus Architecture Fundamentals
Prometheus components collect, store, and expose metrics for analysis. Understanding how these components interact informs deployment architecture decisions.
Core Components
The Prometheus server handles metric collection, storage, and querying:
# Run Prometheus container
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest
# Verify Prometheus is running
curl http://localhost:9090/api/v1/status/config
Prometheus stores time series data locally with configurable retention. The query language (PromQL) enables flexible metric analysis and alerting rule definition.
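Retention is controlled with storage flags at startup. A minimal sketch with illustrative values, replacing the container started above (passing arguments overrides the image defaults, so the config file and storage path are repeated):
# Example: keep 30 days of data, capped at 10GB (values are illustrative)
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=10GB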
Service Discovery
Dynamic environments require automatic target detection rather than static configuration:
# prometheus.yml with Docker service discovery
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'docker'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
Service discovery labels enable flexible metric filtering and relabeling. The relabel_configs section transforms discovered labels into useful identifiers for monitoring and alerting.
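As a sketch, the docker job above could map the discovered container name into a container label and scrape only containers that opt in via a prometheus-job container label (the opt-in label name is a convention assumed here, not a Prometheus requirement):
# prometheus.yml (excerpt): relabeling for Docker-discovered targets
  - job_name: 'docker'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # Keep only containers carrying a 'prometheus-job' container label
      - source_labels: [__meta_docker_container_label_prometheus_job]
        regex: '.+'
        action: keep
      # Copy the discovered container name (stripping any leading slash) into 'container'
      - source_labels: [__meta_docker_container_name]
        regex: '/?(.*)'
        target_label: container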
Exporters and Metrics Sources
Exporters expose metrics in Prometheus format for systems that do not natively support Prometheus:
# Node Exporter for system metrics
docker run -d \
  --name node-exporter \
  --network monitor \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /:/rootfs:ro \
  prom/node-exporter:latest \
  --path.procfs=/host/proc \
  --path.sysfs=/host/sys \
  --path.rootfs=/rootfs
# Redis Exporter
docker run -d \
  --name redis-exporter \
  --network monitor \
  oliver006/redis_exporter:latest \
  --redis.addr=redis://redis-server:6379
Each exporter provides metrics relevant to its system: Node Exporter exposes system-level metrics including CPU, memory, disk, and network, while the Redis Exporter exposes Redis server statistics.
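Prometheus still needs scrape jobs pointing at the exporters. A minimal sketch, assuming the container names above on the shared monitor network and the exporters' default ports:
# prometheus.yml (excerpt): scrape the exporters by container name
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']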
Building a Complete Monitoring Stack
Deploy comprehensive monitoring with Prometheus, exporters, and visualization tools.
Docker Compose Monitoring Stack
Define complete monitoring infrastructure with Docker Compose:
# docker-compose.monitor.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alert.rules:/etc/prometheus/alert.rules:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
    networks:
      - monitor

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    # Point node-exporter at the host filesystems mounted above
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    networks:
      - monitor

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    networks:
      - monitor

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    networks:
      - monitor

networks:
  monitor:
    driver: bridge

volumes:
  prometheus-data:
The complete stack provides metric collection, storage, alerting, and visualization. Each component handles specific monitoring functions while integration enables unified infrastructure visibility.
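A quick smoke test after bringing the stack up, using the health endpoints each component exposes:
# Start the monitoring stack
docker compose -f docker-compose.monitor.yml up -d

# List scrape targets and their health
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'

# Confirm Alertmanager and Grafana respond
curl -s http://localhost:9093/-/healthy
curl -s http://localhost:3000/api/health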
Alerting Configuration
Define alerting rules that trigger notifications when metrics indicate problems:
# alert.rules
groups:
  - name: node.rules
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 5 minutes"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85%"

      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
Alert configurations specify expressions, duration thresholds, and notification labels. Appropriate label assignment routes alerts to correct teams and communication channels.
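The rules only take effect once Prometheus loads the file and knows where to send firing alerts. A sketch of the prometheus.yml additions, assuming the mount path and the alertmanager container name from the Compose stack:
# prometheus.yml (excerpt): load rules and point at Alertmanager
rule_files:
  - /etc/prometheus/alert.rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']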
Alertmanager Configuration
Configure Alertmanager to route notifications to appropriate destinations:
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook-server:5000/alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        severity: critical
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        api_url: '<SLACK_WEBHOOK_URL>'
Alert routing determines which receivers handle specific alerts. Grouping combines related alerts to reduce notification volume.
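Both configuration files can be checked before deployment with the tools bundled in the images; a sketch assuming the files sit in the current directory:
# Validate Prometheus configuration (also resolves referenced rule files)
docker run --rm -v $(pwd):/etc/prometheus:ro --entrypoint promtool \
  prom/prometheus:latest check config /etc/prometheus/prometheus.yml

# Validate Alertmanager routing configuration
docker run --rm -v $(pwd):/etc/alertmanager:ro --entrypoint amtool \
  prom/alertmanager:latest check-config /etc/alertmanager/alertmanager.yml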
PromQL Query Language
PromQL enables flexible metric querying for dashboards, alerting, and analysis.
Basic Queries
Simple queries retrieve time series data for visualization:
# All CPU metrics from all instances
node_cpu_seconds_total
# CPU usage percentage by instance (100 minus idle)
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
# Memory available bytes
node_memory_MemAvailable_bytes
# Memory utilization percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Network traffic in bytes per second
rate(node_network_receive_bytes_total[5m])
Rate functions calculate per-second rates for counter metrics, converting accumulated values into meaningful rates.
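The related increase() function reports total growth over a window rather than a per-second rate, which reads more naturally on some panels:
# Total bytes received over the last hour, per interface
increase(node_network_receive_bytes_total[1h])

# Per-second receive error rate over a shorter window
rate(node_network_receive_errs_total[5m])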
Aggregation and Time Functions
Aggregate metrics across dimensions:
# CPU time rate summed by mode, across all CPUs and instances
sum(rate(node_cpu_seconds_total[5m])) by (mode)
# Top 10 instances by memory usage
topk(10, (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100)
# Values from 1 hour ago
node_memory_MemAvailable_bytes offset 1h
# Moving average over 1 hour
avg_over_time(node_memory_MemAvailable_bytes[1h])
Time functions enable trend analysis, forecasting, and historical comparison useful for capacity planning.
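predict_linear() extends this to simple linear forecasting, commonly used for disk-capacity alerting; for example:
# Root filesystem projected to run out of space within 4 hours
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0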
Grafana Integration
Grafana transforms Prometheus metrics into actionable dashboards.
Data Source Configuration
Configure Grafana to consume Prometheus data through provisioning:
# Grafana provisioning datasource
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
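In the Compose stack above, this file would typically be mounted into Grafana's provisioning directory; a sketch (the host path is illustrative):
# docker-compose.monitor.yml (excerpt): mount datasource provisioning into Grafana
  grafana:
    volumes:
      - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources:ro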
Dashboard Creation
Build dashboards around the PromQL queries shown earlier: panels for CPU, memory, network, and disk metrics, with visualization types and thresholds matched to each signal.
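Dashboards can be provisioned the same way as data sources, so JSON dashboard definitions ship with the stack; a sketch with illustrative paths:
# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'default'
    type: file
    options:
      path: /var/lib/grafana/dashboards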
Conclusion
Prometheus provides flexible, scalable monitoring for modern infrastructure. The pull-based model, powerful query language, and extensive ecosystem of exporters make it suitable for diverse monitoring requirements.
Effective monitoring combines appropriate alert thresholds, thoughtful dashboard design, and reliable notification routing. Start with basic metrics and alerts, expanding coverage as application requirements evolve.