Database
Prometheus
Overview
Prometheus is an open-source monitoring and alerting system specialized for collecting, storing, and querying time-series data. A graduated Cloud Native Computing Foundation (CNCF) project, it has established itself as the de facto standard monitoring solution in the Kubernetes ecosystem. Built on a high-performance time-series database, it combines system and application monitoring, metrics collection, and real-time alerting in a single tool.
Details
Prometheus development began at SoundCloud in 2012, inspired by Google's internal monitoring system "Borgmon." It features a time-series optimized database engine, pull-based metrics collection architecture, the powerful PromQL query language, and flexible alerting capabilities.
Prometheus 3.0, released in November 2024, represents the first major update in 7 years, featuring a completely new Web UI, full UTF-8 support, OpenTelemetry integration, Remote Write 2.0, and significant functionality enhancements.
Key features of Prometheus:
- High-performance time-series database (TSDB)
- Pull-based metrics collection architecture
- Powerful query language "PromQL"
- Multi-dimensional data model (metric names + labels; see the exposition sample after this list)
- Automatic service discovery
- Alerting and notification capabilities (Alertmanager integration)
- Integration with visualization tools like Grafana
- Rich integrations with Kubernetes, Docker, AWS, etc.
- Agent Mode (lightweight metrics collection mode)
- Native Histogram support
- OpenTelemetry (OTLP) integration
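The multi-dimensional data model is easiest to see in the text exposition format Prometheus scrapes from each target: every time series is identified by a metric name plus a set of key/value labels. The metric and label values below are purely illustrative:

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api",status="200"} 1027
http_requests_total{method="POST",endpoint="/api",status="500"} 3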
Pros and Cons
Pros
- Cloud Native: Standard monitoring solution for Kubernetes
- High Performance: Fast data ingestion and search with custom TSDB
- Flexibility: Powerful and flexible querying with PromQL
- Operational: Robust metrics collection with pull-based architecture
- Ecosystem: Rich ecosystem with Grafana, Alertmanager, and other tools
- Automation: Automatic monitoring in dynamic environments via service discovery
- Lightweight: Simple deployment with single binary
- Extensibility: Rich monitoring targets via exporters and client libraries
Cons
- Long-term Storage: Limited long-term data storage without Remote Storage
- Horizontal Scaling: Complex federation and sharding
- High Availability: Complex design and operation of HA configurations
- Learning Curve: High learning cost for PromQL
- Data Types: Numeric data focused (unsuitable for strings/log data)
- Security: Only TLS and basic authentication built in (via the web config file); fine-grained authorization requires an external proxy (see the example below)
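On the security point, Prometheus 2.24 and later can serve HTTPS and HTTP basic authentication natively through a web configuration file passed with --web.config.file; anything finer-grained (per-user authorization, SSO) still needs an external proxy. A minimal sketch with placeholder certificate paths and password hash:

# web.yml (start with: ./prometheus --config.file=prometheus.yml --web.config.file=web.yml)
tls_server_config:
  cert_file: /etc/prometheus/prometheus.crt
  key_file: /etc/prometheus/prometheus.key
basic_auth_users:
  # username: bcrypt hash of the password (e.g. generated with: htpasswd -nBC 10 "" | tr -d ':\n')
  admin: $2y$10$REPLACE_WITH_BCRYPT_HASH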
Key Links
Code Examples
Installation & Setup
# Docker execution (recommended)
docker run -d --name prometheus \
-p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus:latest
# Binary execution
wget https://github.com/prometheus/prometheus/releases/download/v3.0.0/prometheus-3.0.0.linux-amd64.tar.gz
tar xvf prometheus-3.0.0.linux-amd64.tar.gz
cd prometheus-3.0.0.linux-amd64
./prometheus --config.file=prometheus.yml
# Kubernetes with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack
# Agent Mode execution (lightweight; requires a remote_write target in the config to forward samples)
./prometheus --agent --config.file=prometheus.yml --storage.agent.path=./agent-data
# systemd service registration
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir /etc/prometheus /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
sudo cp prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo tee /etc/systemd/system/prometheus.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=0.0.0.0:9090 \
--web.enable-lifecycle
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
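A quick sanity check after the service starts (assuming the default listen address :9090):

# Verify the unit, the configuration file, and the HTTP endpoint
sudo systemctl status prometheus
promtool check config /etc/prometheus/prometheus.yml
curl http://localhost:9090/-/healthy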
Basic Configuration File
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter (system metrics)
- job_name: 'node'
static_configs:
- targets: ['localhost:9100']
# Application monitoring
- job_name: 'myapp'
static_configs:
- targets: ['localhost:8080']
metrics_path: /metrics
scrape_interval: 10s
# Kubernetes service discovery
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Docker Swarm service discovery
- job_name: 'dockerswarm'
dockerswarm_sd_configs:
- host: unix:///var/run/docker.sock
role: tasks
relabel_configs:
- source_labels: [__meta_dockerswarm_task_desired_state]
regex: running
action: keep
# OpenTelemetry integration (Prometheus 3.0; enable the receiver with --web.enable-otlp-receiver)
otlp:
promote_resource_attributes:
- service.name
- service.instance.id
- service.namespace
- service.version
- k8s.cluster.name
- k8s.namespace.name
- k8s.pod.name
# Storage configuration
# (retention is set with CLI flags, not in prometheus.yml:
#  --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=10GB)
storage:
  tsdb:
    out_of_order_time_window: 30m  # Allow out-of-order samples
PromQL Queries
# Basic instant vector
up
# Label matching
up{job="prometheus"}
# Time range (Range Vector)
rate(http_requests_total[5m])
# Aggregation functions
sum(rate(http_requests_total[5m])) by (instance)
# CPU usage calculation
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Top K queries
topk(5, rate(http_requests_total[5m]))
# Arithmetic operations
rate(http_requests_total[5m]) * 60
# Conditional expressions
up == 1
# Regular expression matching
http_requests_total{job=~"api.*"}
# Combining two metrics (average request duration)
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# Threshold alerting
avg_over_time(cpu_usage[5m]) > 80
# Percentile calculation
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Time series existence check
absent(up{job="critical-service"})
# Prediction queries
predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
Alert Rules
# rules/alerts.yml
groups:
- name: system
rules:
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage detected"
description: "CPU usage is above 80% for instance {{ $labels.instance }}"
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 2m
labels:
severity: critical
annotations:
summary: "High memory usage detected"
description: "Memory usage is above 90% for instance {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
for: 1m
labels:
severity: critical
annotations:
summary: "Low disk space"
description: "Disk space is below 10% for {{ $labels.mountpoint }} on {{ $labels.instance }}"
- alert: ServiceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service is down"
description: "Service {{ $labels.job }} on {{ $labels.instance }} is down"
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 3m
labels:
severity: warning
annotations:
summary: "High error rate"
description: "Error rate is above 5% for {{ $labels.job }}"
- name: application
rules:
- alert: SlowResponse
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "Slow response time"
description: "95th percentile response time is above 500ms"
- alert: HighRequestRate
expr: rate(http_requests_total[5m]) > 1000
for: 2m
labels:
severity: info
annotations:
summary: "High request rate"
description: "Request rate is above 1000 req/s"
# keep_firing_for feature (available since Prometheus 2.42)
- alert: TransientIssue
expr: error_rate > 0.1
for: 1m
keep_firing_for: 5m # Keep firing alert for 5 minutes after condition stops
labels:
severity: warning
annotations:
summary: "Transient issue detected"
Recording Rules
# rules/recording.yml
groups:
- name: cpu_recording_rules
interval: 30s
rules:
- record: instance:cpu_usage_rate
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- record: job:cpu_usage_rate:avg
expr: avg by(job) (instance:cpu_usage_rate)
- name: http_recording_rules
interval: 15s
rules:
- record: job:http_requests_rate
expr: sum by(job) (rate(http_requests_total[5m]))
- record: job:http_error_rate
expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m]))
- record: instance:http_latency_p95
expr: histogram_quantile(0.95, sum by(instance, le) (rate(http_request_duration_seconds_bucket[5m])))
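Recorded series can then be referenced like any other metric, which keeps dashboards and alert expressions cheap to evaluate. For example, reusing the series defined above (the label value is illustrative):

# Alert or graph on the pre-computed series instead of re-evaluating the raw expression
job:http_error_rate{job="myapp"} > 0.05
topk(3, job:http_requests_rate)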
HTTP API Usage
# List metrics
curl http://localhost:9090/api/v1/label/__name__/values
# Instant query
curl -G http://localhost:9090/api/v1/query \
--data-urlencode 'query=up'
# Range query
curl -G http://localhost:9090/api/v1/query_range \
--data-urlencode 'query=rate(http_requests_total[5m])' \
--data-urlencode 'start=2024-01-01T00:00:00Z' \
--data-urlencode 'end=2024-01-01T01:00:00Z' \
--data-urlencode 'step=60s'
# Label values
curl http://localhost:9090/api/v1/label/job/values
# Series metadata
curl -G http://localhost:9090/api/v1/series \
--data-urlencode 'match[]=up'
# Alerts list
curl http://localhost:9090/api/v1/alerts
# Rules list
curl http://localhost:9090/api/v1/rules
# Configuration
curl http://localhost:9090/api/v1/status/config
# Build info
curl http://localhost:9090/api/v1/status/buildinfo
# TSDB status
curl http://localhost:9090/api/v1/status/tsdb
# Configuration reload (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Health check
curl http://localhost:9090/-/healthy
# Readiness check
curl http://localhost:9090/-/ready
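Responses come back as JSON with a status field and the matching series under data.result. A short sketch consuming the instant query endpoint with the Python requests library (a third-party dependency; Prometheus is assumed to be reachable on localhost:9090):

import requests

# Instant query against the HTTP API
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "up"},
    timeout=5,
)
resp.raise_for_status()
payload = resp.json()

if payload["status"] == "success":
    for series in payload["data"]["result"]:
        labels = series["metric"]           # e.g. {"__name__": "up", "job": "prometheus", ...}
        timestamp, value = series["value"]  # instant vectors carry a single [timestamp, value] pair
        print(labels.get("job"), labels.get("instance"), value)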
Client Library (Go)
package main
import (
"log"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
// Counter - Cumulative values (request count, error count, etc.)
httpRequestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
// Gauge - Current values (CPU usage, memory usage, etc.)
cpuUsage = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "cpu_usage_percent",
Help: "Current CPU usage percentage",
},
[]string{"core"},
)
// Histogram - Distributions (response time, request size, etc.)
httpDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "Duration of HTTP requests",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint"},
)
// Summary - Quantiles (response time quantiles, etc.)
httpSummary = promauto.NewSummaryVec(
prometheus.SummaryOpts{
Name: "http_request_duration_summary",
Help: "Summary of HTTP request durations",
Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
},
[]string{"method"},
)
)
func httpHandler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("Hello World"))
    // Increment the request counter
    httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
    // Record processing time after the response has been written
    duration := time.Since(start).Seconds()
    httpDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
    httpSummary.WithLabelValues(r.Method).Observe(duration)
}
func main() {
// CPU monitoring sample
go func() {
for {
// Actual CPU usage logic here
cpuUsage.WithLabelValues("0").Set(45.2)
cpuUsage.WithLabelValues("1").Set(67.8)
time.Sleep(10 * time.Second)
}
}()
http.HandleFunc("/", httpHandler)
http.Handle("/metrics", promhttp.Handler())
log.Println("Server starting on :8080")
log.Fatal(http.ListenAndServe(":8080", nil))
}
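Once the server is running, the exposed metrics can be inspected directly before adding a scrape job; the sample output line below is illustrative:

curl -s http://localhost:8080/metrics | grep http_requests_total
# http_requests_total{endpoint="/",method="GET",status="200"} 42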
Python Client
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
import time
import random
# Metrics definition
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
CPU_USAGE = Gauge('cpu_usage_percent', 'CPU usage percentage', ['core'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request duration')
RESPONSE_SIZE = Summary('http_response_size_bytes', 'HTTP response size')
def process_request():
"""Request processing simulation"""
start_time = time.time()
# Random processing time
time.sleep(random.uniform(0.1, 0.5))
# Update metrics
REQUEST_COUNT.labels(method='GET', endpoint='/api').inc()
REQUEST_DURATION.observe(time.time() - start_time)
RESPONSE_SIZE.observe(random.randint(100, 1000))
def update_system_metrics():
"""System metrics update"""
while True:
# CPU usage simulation
CPU_USAGE.labels(core='0').set(random.uniform(20, 80))
CPU_USAGE.labels(core='1').set(random.uniform(20, 80))
time.sleep(10)
if __name__ == '__main__':
# Start metrics server
start_http_server(8000)
# Start system metrics update thread
import threading
threading.Thread(target=update_system_metrics, daemon=True).start()
# Request processing simulation
while True:
process_request()
time.sleep(1)
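To have Prometheus scrape this example, a job pointing at the metrics server on port 8000 can be added to scrape_configs (the job name is arbitrary):

scrape_configs:
  - job_name: 'python-app'
    static_configs:
      - targets: ['localhost:8000']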
Node Exporter Configuration
# Node Exporter installation
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
# Run
./node_exporter
# Service registration
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo cp node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes \
    --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
    --web.listen-address=:9100
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
# Custom metrics (textfile collector; the directory must match --collector.textfile.directory above)
sudo mkdir -p /var/lib/node_exporter/textfile_collector
# Custom metrics creation script
sudo tee /usr/local/bin/custom_metrics.sh > /dev/null << 'EOF'
#!/bin/bash
echo "# HELP custom_disk_usage Custom disk usage metric
# TYPE custom_disk_usage gauge" > /var/lib/node_exporter/textfile_collector/custom.prom
df -h | grep '^/dev/' | awk '{print "custom_disk_usage{device=\""$1"\",mountpoint=\""$6"\"} " ($5+0)/100}' >> /var/lib/node_exporter/textfile_collector/custom.prom
EOF
sudo chmod +x /usr/local/bin/custom_metrics.sh
# Cron configuration (runs as root; note that this replaces root's existing crontab)
echo "*/5 * * * * /usr/local/bin/custom_metrics.sh" | sudo crontab -
Remote Storage Configuration
# Remote Write configuration (for long-term storage)
remote_write:
- url: "https://cortex.example.com/api/v1/push"
basic_auth:
username: user
password: pass
    queue_config:
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
metadata_config:
send: true
# Grafana Cloud
- url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
basic_auth:
username: "123456"
password: "YOUR_API_KEY"
# Remote Read configuration
remote_read:
- url: "https://cortex.example.com/api/v1/read"
basic_auth:
username: user
password: pass
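Remote write health can be watched through Prometheus' own metrics; the names below match recent releases but may differ slightly in older versions:

# Highest timestamp ingested vs. highest timestamp successfully sent per remote queue
prometheus_remote_storage_highest_timestamp_in_seconds
prometheus_remote_storage_queue_highest_sent_timestamp_seconds
# Rate of samples that failed to be sent (should stay at 0)
rate(prometheus_remote_storage_samples_failed_total[5m])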
Alertmanager Configuration
# alertmanager.yml
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: '[email protected]'
smtp_auth_username: '[email protected]'
smtp_auth_password: 'password'
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
- match:
severity: warning
receiver: 'warning-alerts'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://localhost:5001/webhook'
- name: 'critical-alerts'
email_configs:
- to: '[email protected]'
subject: 'Critical Alert: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
{{ end }}
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK_URL'
channel: '#critical-alerts'
title: 'Critical Alert'
text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
- name: 'warning-alerts'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK_URL'
channel: '#alerts'
title: 'Warning Alert'
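To validate the configuration and exercise a receiver, amtool (bundled with Alertmanager) and the v2 API can be used; the alert payload below is purely a test example:

# Validate the Alertmanager configuration
amtool check-config alertmanager.yml
# Send a test alert to a local Alertmanager (v2 API)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Test alert from curl"}}]'
# List currently firing alerts
amtool alert query --alertmanager.url=http://localhost:9093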
Operations & Optimization
# Check database size
du -sh /var/lib/prometheus/
# TSDB block information
promtool tsdb list /var/lib/prometheus/
# Validate configuration file
promtool check config prometheus.yml
# Validate rules file
promtool check rules rules/*.yml
# Delete metrics (requires --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/delete_series \
  --data-urlencode 'match[]=up{job="old-job"}'
# Create snapshot (requires --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Backup (for a consistent backup, prefer archiving a snapshot created above)
tar czf prometheus-backup-$(date +%Y%m%d).tar.gz /var/lib/prometheus/
# Performance monitoring queries
# Memory usage
process_resident_memory_bytes{job="prometheus"}
# Ingestion rate (samples appended per second)
rate(prometheus_tsdb_head_samples_appended_total[5m])
# Chunk count
prometheus_tsdb_head_chunks
# WAL size
prometheus_tsdb_wal_storage_size_bytes