Prometheus
The de facto standard monitoring system for Kubernetes and cloud-native environments. Provides a multi-dimensional data model, the PromQL query language, and an HTTP pull model for metrics collection.
Overview
Prometheus is the de facto standard monitoring system for Kubernetes and cloud-native environments. It stores metrics as time series, collects them over HTTP using a pull model, and organizes them in a multi-dimensional data model queried with the PromQL query language. Originally developed at SoundCloud and now a Cloud Native Computing Foundation (CNCF) graduated project, it is optimized for monitoring applications in containerized environments.
Details
Prometheus development began at SoundCloud in 2012. It joined the Cloud Native Computing Foundation in 2016 as its second hosted project (after Kubernetes) and graduated in 2018. It has since become the de facto standard for Kubernetes monitoring, with CNCF surveys reporting adoption around 79%. In 2024, Prometheus 3.0 shipped as the first major release in seven years, strengthening OpenTelemetry integration.
Key Technical Features
- Multi-dimensional Data Model: Flexible metric classification through labels
- PromQL Query Language: Powerful time series data query capabilities
- HTTP Pull Model: Scalable metric collection method
- Service Discovery: Automatic discovery of dynamic monitoring targets
- Alertmanager Integration: Flexible alert management functionality
Use Cases
- Kubernetes cluster monitoring
- Microservices monitoring
- Application Performance Monitoring (APM)
- Infrastructure monitoring
- SLI/SLO management
Advantages and Disadvantages
Advantages
- Kubernetes Standard: De facto standard monitoring system for K8s environments
- Flexible Queries: Powerful analysis capabilities through PromQL
- Scalability: High performance in large-scale environments
- Rich Ecosystem: Numerous Exporters and third-party tools
- Open Source: Free high-functionality monitoring system
- Cloud Native Ready: Design optimized for container environments
Disadvantages
- Learning Cost: PromQL mastery required
- Long-term Storage Limitations: Local storage is not designed for long-term retention; remote-write backends such as Thanos, Cortex, or Mimir are typically added
- HA Configuration Complexity: Complex setup for high availability
- Resource Consumption: High resource usage when collecting large volumes of metrics
- Dashboard Functionality: Separate tools like Grafana needed for visualization
Configuration Examples
Basic Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'prometheus-monitor'

rule_files:
  - "alert_rules.yml"
  - "recording_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'

  # Kubernetes API Server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Application monitoring
  - job_name: 'spring-boot-app'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets:
          - 'app1:8080'
          - 'app2:8080'
PromQL Query Examples
# Basic metric retrieval
up
# Specific job status
up{job="node"}
# Rate calculation (5-minute average rate)
rate(http_requests_total[5m])
# Aggregation functions usage
sum(rate(http_requests_total[5m])) by (job)
# Percentile calculation
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# CPU usage calculation
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage
100 - ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes)
# Response time aggregation
avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
Alert Rules Configuration
# alert_rules.yml
groups:
  - name: basic.rules
    rules:
      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% on {{ $labels.instance }}"

      # Low disk space
      - alert: DiskSpaceLow
        expr: 100 - ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk usage is above 85% on {{ $labels.instance }}"

      # HTTP error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 5% for more than 5 minutes"
Recording Rules Configuration
# recording_rules.yml
groups:
  - name: cpu_memory_rules
    interval: 30s
    rules:
      # CPU usage recording rule
      - record: instance:cpu_usage:rate5m
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # Memory usage recording rule
      - record: instance:memory_usage:ratio
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

  - name: http_rules
    interval: 30s
    rules:
      # HTTP request rate recording rule
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      # HTTP error rate recording rule
      - record: job:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)
Kubernetes Deployment
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v2.45.0
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus/'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
            - '--storage.tsdb.retention.time=30d'
            - '--web.enable-lifecycle'
            - '--web.enable-admin-api'
          ports:
            - containerPort: 9090
          resources:
            limits:
              memory: "2Gi"
              cpu: "1000m"
            requests:
              memory: "1Gi"
              cpu: "500m"
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus/
            - name: prometheus-storage
              mountPath: /prometheus/
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: prometheus-storage
          persistentVolumeClaim:
            claimName: prometheus-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090
  type: ClusterIP
Custom Metrics (Go Example)
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter metrics
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	// Histogram metrics
	httpDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)

	// Gauge metrics
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)
)

func recordMetrics() {
	go func() {
		for {
			activeConnections.Set(float64(time.Now().Unix() % 100))
			time.Sleep(2 * time.Second)
		}
	}()
}

func main() {
	recordMetrics()

	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/api/users", func(w http.ResponseWriter, r *http.Request) {
		timer := prometheus.NewTimer(httpDuration.WithLabelValues(r.Method, "/api/users"))
		defer timer.ObserveDuration()

		// Business logic processing
		time.Sleep(100 * time.Millisecond)

		httpRequestsTotal.WithLabelValues(r.Method, "/api/users", "200").Inc()
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("OK"))
	})

	http.ListenAndServe(":8080", nil)
}