Database

Prometheus

Overview

Prometheus is an open-source monitoring and alerting system specialized for collecting, storing, and querying time-series data. A graduated project of the Cloud Native Computing Foundation (CNCF), it has become the de facto standard monitoring solution in the Kubernetes ecosystem. Built around a high-performance time-series database, it provides integrated system and application monitoring, metrics collection, and real-time alerting.

Details

Prometheus development began at SoundCloud in 2012, inspired by Google's internal monitoring system "Borgmon." It features a database engine optimized for time-series data, a pull-based metrics collection architecture, the powerful PromQL query language, and flexible alerting capabilities.

Prometheus 3.0, released in November 2024, is the first major update in 7 years, featuring a completely rewritten web UI, full UTF-8 support for metric and label names, native OpenTelemetry (OTLP) ingestion, and Remote Write 2.0, among other enhancements.
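
As a small illustration of the UTF-8 support, PromQL in 3.x allows metric and label names that are not legal classic identifiers to be quoted inside the selector braces. A minimal sketch (the dotted names below are hypothetical):

# Classic naming (pre-3.0 character set)
http_requests_total{job="api"}

# UTF-8 names (Prometheus 3.x): quote the metric and label names
{"http.requests.total", "service.name"="api"}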

Key features of Prometheus:

  • High-performance time-series database (TSDB)
  • Pull-based metrics collection architecture
  • Powerful query language "PromQL"
  • Multi-dimensional data model (metric names + labels)
  • Automatic service discovery
  • Alerting and notification capabilities (Alertmanager integration)
  • Integration with visualization tools like Grafana
  • Rich integrations with Kubernetes, Docker, AWS, etc.
  • Agent Mode (lightweight metrics collection mode)
  • Native Histogram support
  • OpenTelemetry (OTLP) integration

Pros and Cons

Pros

  • Cloud Native: Standard monitoring solution for Kubernetes
  • High Performance: Fast data ingestion and search with custom TSDB
  • Flexibility: Powerful and flexible querying with PromQL
  • Operational: Robust metrics collection with pull-based architecture
  • Ecosystem: Rich ecosystem with Grafana, Alertmanager, and other tools
  • Automation: Automatic monitoring in dynamic environments via service discovery
  • Lightweight: Simple deployment as a single binary
  • Extensibility: Rich monitoring targets via exporters and client libraries

Cons

  • Long-term Storage: Limited long-term data storage without Remote Storage
  • Horizontal Scaling: Complex federation and sharding
  • High Availability: Complex design and operation of HA configurations
  • Learning Curve: High learning cost for PromQL
  • Data Types: Numeric data focused (unsuitable for strings/log data)
  • Security: Only basic TLS/authentication is built in (see the web config sketch after this list); fine-grained authorization typically requires an external proxy
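
Note that Prometheus itself can serve TLS and HTTP basic authentication through a web configuration file passed with --web.config.file; a minimal sketch with illustrative paths and a placeholder bcrypt hash:

# web-config.yml (start Prometheus with --web.config.file=web-config.yml)
tls_server_config:
  cert_file: /etc/prometheus/tls/prometheus.crt
  key_file: /etc/prometheus/tls/prometheus.key

basic_auth_users:
  # username: bcrypt-hashed password
  admin: "$2y$10$REPLACE_WITH_BCRYPT_HASH"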

Key Links

  • Official site: https://prometheus.io/
  • Documentation: https://prometheus.io/docs/
  • GitHub repository: https://github.com/prometheus/prometheus
  • Alertmanager: https://github.com/prometheus/alertmanager

Code Examples

Installation & Setup

# Docker execution (recommended)
docker run -d --name prometheus \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest

# Binary execution
wget https://github.com/prometheus/prometheus/releases/download/v3.0.0/prometheus-3.0.0.linux-amd64.tar.gz
tar xvf prometheus-3.0.0.linux-amd64.tar.gz
cd prometheus-3.0.0.linux-amd64
./prometheus --config.file=prometheus.yml

# Kubernetes with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack

# Agent Mode execution (lightweight)
./prometheus --agent --config.file=prometheus.yml --storage.agent.path=./agent-data

# systemd service registration
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

sudo tee /etc/systemd/system/prometheus.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

Basic Configuration File

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  # Application monitoring
  - job_name: 'myapp'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
    scrape_interval: 10s

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # Docker Swarm service discovery
  - job_name: 'dockerswarm'
    dockerswarm_sd_configs:
      - host: unix:///var/run/docker.sock
        role: tasks
    relabel_configs:
      - source_labels: [__meta_dockerswarm_task_desired_state]
        regex: running
        action: keep

# OpenTelemetry integration (Prometheus 3.0; enable the endpoint with --web.enable-otlp-receiver)
otlp:
  promote_resource_attributes:
    - service.name
    - service.instance.id
    - service.namespace
    - service.version
    - k8s.cluster.name
    - k8s.namespace.name
    - k8s.pod.name

# Storage configuration
# Note: retention is set via command-line flags, not in prometheus.yml:
#   --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=10GB
storage:
  tsdb:
    out_of_order_time_window: 30m  # Allow out-of-order samples

PromQL Queries

# Basic instant vector
up

# Label matching
up{job="prometheus"}

# Time range (Range Vector)
rate(http_requests_total[5m])

# Aggregation functions
sum(rate(http_requests_total[5m])) by (instance)

# CPU usage calculation
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Top K queries
topk(5, rate(http_requests_total[5m]))

# Arithmetic operations
rate(http_requests_total[5m]) * 60

# Conditional expressions
up == 1

# Regular expression matching
http_requests_total{job=~"api.*"}

# Combining metrics: average request duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# Threshold alerting
avg_over_time(cpu_usage[5m]) > 80

# Percentile calculation
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Time series existence check
absent(up{job="critical-service"})

# Prediction queries
predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0

Alert Rules

# rules/alerts.yml
groups:
  - name: system
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for instance {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 90% for instance {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Disk space is below 10% for {{ $labels.mountpoint }} on {{ $labels.instance }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "Service {{ $labels.job }} on {{ $labels.instance }} is down"

      - alert: HighErrorRate
        expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "High error rate"
          description: "Error rate is above 5% for {{ $labels.job }}"

  - name: application
    rules:
      - alert: SlowResponse
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow response time"
          description: "95th percentile response time is above 500ms"

      - alert: HighRequestRate
        expr: rate(http_requests_total[5m]) > 1000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "High request rate"
          description: "Request rate is above 1000 req/s"

      # keep_firing_for (available since Prometheus 2.42)
      - alert: TransientIssue
        expr: error_rate > 0.1
        for: 1m
        keep_firing_for: 5m  # Keep firing alert for 5 minutes after condition stops
        labels:
          severity: warning
        annotations:
          summary: "Transient issue detected"

Recording Rules

# rules/recording.yml
groups:
  - name: cpu_recording_rules
    interval: 30s
    rules:
      - record: instance:cpu_usage_rate
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      - record: job:cpu_usage_rate:avg
        expr: avg by(job) (instance:cpu_usage_rate)

  - name: http_recording_rules
    interval: 15s
    rules:
      - record: job:http_requests_rate
        expr: sum by(job) (rate(http_requests_total[5m]))

      - record: job:http_error_rate
        expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m]))

      - record: instance:http_latency_p95
        expr: histogram_quantile(0.95, sum by(instance, le) (rate(http_request_duration_seconds_bucket[5m])))

HTTP API Usage

# List metrics
curl http://localhost:9090/api/v1/label/__name__/values

# Instant query
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up'

# Range query
curl -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=rate(http_requests_total[5m])' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'

# Label values
curl http://localhost:9090/api/v1/label/job/values

# Series metadata
curl -G http://localhost:9090/api/v1/series \
  --data-urlencode 'match[]=up'

# Alerts list
curl http://localhost:9090/api/v1/alerts

# Rules list
curl http://localhost:9090/api/v1/rules

# Configuration
curl http://localhost:9090/api/v1/status/config

# Build info
curl http://localhost:9090/api/v1/status/buildinfo

# TSDB status
curl http://localhost:9090/api/v1/status/tsdb

# Configuration reload (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Health check
curl http://localhost:9090/-/healthy

# Readiness check
curl http://localhost:9090/-/ready

Client Library (Go)

package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Counter - Cumulative values (request count, error count, etc.)
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    // Gauge - Current values (CPU usage, memory usage, etc.)
    cpuUsage = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "cpu_usage_percent",
            Help: "Current CPU usage percentage",
        },
        []string{"core"},
    )

    // Histogram - Distributions (response time, request size, etc.)
    httpDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "Duration of HTTP requests",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )

    // Summary - Quantiles (response time quantiles, etc.)
    httpSummary = promauto.NewSummaryVec(
        prometheus.SummaryOpts{
            Name: "http_request_duration_summary",
            Help: "Summary of HTTP request durations",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"method"},
    )
)

func httpHandler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    
    // Increment counter
    httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
    
    // Record processing time
    duration := time.Since(start).Seconds()
    httpDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
    httpSummary.WithLabelValues(r.Method).Observe(duration)
    
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("Hello World"))
}

func main() {
    // CPU monitoring sample
    go func() {
        for {
            // Actual CPU usage logic here
            cpuUsage.WithLabelValues("0").Set(45.2)
            cpuUsage.WithLabelValues("1").Set(67.8)
            time.Sleep(10 * time.Second)
        }
    }()

    http.HandleFunc("/", httpHandler)
    http.Handle("/metrics", promhttp.Handler())
    
    log.Println("Server starting on :8080")
    log.Fatal(http.ListenAndServe(":8080", nil))
}
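
Native histograms (listed under key features) can also be exposed from the Go client. A minimal, self-contained sketch, assuming a client_golang version with native histogram support and a server started with --enable-feature=native-histograms:

package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Native histogram: exponential buckets are chosen automatically by the
    // client, so no static Buckets list is required.
    latency := promauto.NewHistogram(prometheus.HistogramOpts{
        Name:                            "http_request_duration_native_seconds",
        Help:                            "Request duration as a native histogram",
        NativeHistogramBucketFactor:     1.1,       // growth factor between buckets
        NativeHistogramMaxBucketNumber:  100,       // cap on the number of buckets
        NativeHistogramMinResetDuration: time.Hour, // earliest allowed bucket reset
    })

    latency.Observe(0.042)

    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8081", nil))
}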

Python Client

from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
import time
import random

# Metrics definition
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
CPU_USAGE = Gauge('cpu_usage_percent', 'CPU usage percentage', ['core'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request duration')
RESPONSE_SIZE = Summary('http_response_size_bytes', 'HTTP response size')

def process_request():
    """Request processing simulation"""
    start_time = time.time()
    
    # Random processing time
    time.sleep(random.uniform(0.1, 0.5))
    
    # Update metrics
    REQUEST_COUNT.labels(method='GET', endpoint='/api').inc()
    REQUEST_DURATION.observe(time.time() - start_time)
    RESPONSE_SIZE.observe(random.randint(100, 1000))

def update_system_metrics():
    """System metrics update"""
    while True:
        # CPU usage simulation
        CPU_USAGE.labels(core='0').set(random.uniform(20, 80))
        CPU_USAGE.labels(core='1').set(random.uniform(20, 80))
        time.sleep(10)

if __name__ == '__main__':
    # Start metrics server
    start_http_server(8000)
    
    # Start system metrics update thread
    import threading
    threading.Thread(target=update_system_metrics, daemon=True).start()
    
    # Request processing simulation
    while True:
        process_request()
        time.sleep(1)

Node Exporter Configuration

# Node Exporter installation
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Run
./node_exporter

# Service registration
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo cp node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
  --web.listen-address=:9100

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Custom metrics (textfile collector)
sudo mkdir -p /var/lib/node_exporter/textfile_collector

# Custom metrics creation script
sudo tee /usr/local/bin/custom_metrics.sh > /dev/null << 'EOF'
#!/bin/bash
echo "# HELP custom_disk_usage Custom disk usage metric
# TYPE custom_disk_usage gauge" > /var/lib/node_exporter/textfile_collector/custom.prom
df -h | grep '^/dev/' | awk '{print "custom_disk_usage{device=\""$1"\",mountpoint=\""$6"\"} " ($5+0)/100}' >> /var/lib/node_exporter/textfile_collector/custom.prom
EOF

sudo chmod +x /usr/local/bin/custom_metrics.sh

# Cron configuration (installs into root's crontab, which can write to the textfile directory)
echo "*/5 * * * * /usr/local/bin/custom_metrics.sh" | sudo crontab -

Remote Storage Configuration

# Remote Write configuration (for long-term storage)
remote_write:
  - url: "https://cortex.example.com/api/v1/push"
    basic_auth:
      username: user
      password: pass
    queue_config:
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
    metadata_config:
      send: true

  # Grafana Cloud
  - url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
    basic_auth:
      username: "123456"
      password: "YOUR_API_KEY"

# Remote Read configuration
remote_read:
  - url: "https://cortex.example.com/api/v1/read"
    basic_auth:
      username: user
      password: pass
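
Remote Write 2.0 (mentioned in the Prometheus 3.0 notes above) is selected per remote_write endpoint via the protobuf_message field. A minimal sketch, assuming Prometheus 3.x and a receiver that accepts the v2 message:

# Remote Write 2.0 (receiver must support io.prometheus.write.v2.Request)
remote_write:
  - url: "https://receiver.example.com/api/v1/write"
    protobuf_message: io.prometheus.write.v2.Request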

Alertmanager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'

  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
        subject: 'Critical Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#critical-alerts'
        title: 'Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: 'warning-alerts'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        title: 'Warning Alert'
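
The configuration above can be validated and exercised with amtool, Alertmanager's CLI. A short sketch (the silence labels and URL are illustrative):

# Validate alertmanager.yml
amtool check-config alertmanager.yml

# Query active alerts and create a silence
amtool --alertmanager.url=http://localhost:9093 alert query
amtool --alertmanager.url=http://localhost:9093 silence add alertname=HighCPUUsage \
  --duration=2h --comment="maintenance window"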

Operations & Optimization

# Check database size
du -sh /var/lib/prometheus/

# TSDB block information
promtool tsdb list /var/lib/prometheus/

# Validate configuration file
promtool check config prometheus.yml

# Validate rules file
promtool check rules rules/*.yml

# Delete series (requires --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/delete_series \
  --data-urlencode 'match[]=up{job="old-job"}'

# Create snapshot (requires --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot

# Backup
tar czf prometheus-backup-$(date +%Y%m%d).tar.gz /var/lib/prometheus/

# Performance monitoring queries
# Memory usage
process_resident_memory_bytes{job="prometheus"}

# Ingestion rate (samples appended per second)
rate(prometheus_tsdb_head_samples_appended_total[5m])

# Chunk count
prometheus_tsdb_head_chunks

# WAL size
prometheus_tsdb_wal_storage_size_bytes
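
For deeper cardinality analysis and ad-hoc checks, promtool can inspect the TSDB and run queries directly; a short sketch using the paths and address from the examples above:

# Analyze the TSDB for high-cardinality metrics and labels
promtool tsdb analyze /var/lib/prometheus/

# Run an instant query from the command line
promtool query instant http://localhost:9090 'up'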