Datadog
Overview
Datadog is a unified monitoring and visualization platform for cloud-native environments. It centralizes APM, infrastructure monitoring, log management, and security monitoring, and layers machine learning-powered anomaly detection on top of the telemetry it collects.
Details
Datadog was founded in 2010 by Olivier Pomel and Alexis Lê-Quôc and is now a leader in the SaaS monitoring market. With over $2B in annual revenue and adoption by many Fortune 500 companies, it continues to evolve into a next-generation monitoring platform with AI and ML integration. Its cloud-first strategy has driven sustained growth and established the product as an industry standard for unified monitoring.
Key Technical Features
- Unified Monitoring Platform: Centralized management of infrastructure, APM, logs, and security
- Machine Learning Integration: AI-assisted anomaly detection and predictive analysis (a monitor-creation sketch follows this list)
- Cloud-Native Support: Deep integration with major clouds including AWS, Azure, and GCP
- Real-Time Monitoring: High-frequency metrics collection and instant alerting
- Rich Integrations: 800+ service and tool integrations
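The anomaly detection mentioned above is exposed through the same monitor primitives as everything else, so it can be scripted. A minimal sketch using the datadogpy client, assuming placeholder API/application keys; the metric scope, bounds, and notification handle are illustrative:

# anomaly_monitor.py - create an ML-based anomaly monitor via the Datadog API
from datadog import initialize, api

initialize(api_key='YOUR_API_KEY', app_key='YOUR_APP_KEY')

# anomalies() wraps the metric query in the 'basic' detection algorithm
# with a band of 2 standard deviations; ">= 1" alerts whenever the series
# leaves the predicted range
monitor = api.Monitor.create(
    type='query alert',
    query="avg(last_4h):anomalies(avg:system.cpu.user{env:production} by {host}, 'basic', 2) >= 1",
    name='Anomalous CPU usage',
    message='CPU usage deviates from its predicted range @slack-alerts',
    tags=['team:sre', 'env:production'],
)
print(monitor.get('id'))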
Use Cases
- Cloud infrastructure monitoring
- Application performance monitoring (APM)
- Log management and analysis
- Security monitoring and SIEM
- Business metrics visualization
Pros and Cons
Pros
- Unified Platform: Comprehensive solution covering all monitoring domains
- AI/ML Capabilities: Advanced anomaly detection through machine learning
- Rich Integrations: 800+ service integrations
- Scalability: Proven performance and reliability at enterprise scale
- Usability: Intuitive UI/UX with excellent operability
- 24/7 Support: Comprehensive support infrastructure
Cons
- High Cost: Expensive pricing structure for enterprise use
- Vendor Lock-in: Dependency on SaaS platform
- Learning Curve: Complexity due to extensive features
- Data Sovereignty: Telemetry is stored in Datadog's cloud, which constrains data residency and retention control
- Customization Limits: Less room for deep customization than self-hosted stacks
Code Examples
Datadog Agent Configuration
# datadog.yaml
api_key: YOUR_API_KEY_HERE
site: datadoghq.com

# Hostname configuration (an explicit hostname overrides auto-detection)
hostname: web-server-01
# hostname_fqdn is a boolean: when no explicit hostname is set, use the
# machine's FQDN (e.g. web-server-01.example.com)
hostname_fqdn: true

# Tags configuration
tags:
  - env:production
  - service:web
  - team:backend
  - region:us-east-1

# Log collection configuration
logs_enabled: true
logs_config:
  container_collect_all: true
  use_http: true
  compression_level: 6

# APM configuration
apm_config:
  enabled: true
  receiver_port: 8126
  max_traces_per_second: 10

# Process monitoring
process_config:
  enabled: true
  scrub_args: true

# Network Performance Monitoring (requires the system probe)
network_config:
  enabled: true

# System probe
system_probe_config:
  enabled: true

# Check runner configuration
check_runners: 4
collection_timeout: 30

# Proxy configuration
proxy:
  http: http://proxy.example.com:8080
  https: https://proxy.example.com:8080
  no_proxy:
    - localhost
    - 127.0.0.1

# JMX configuration
jmx_use_cgroup_memory_limit: true

# Security Agent
security_agent:
  enabled: true

# Remote Configuration
remote_configuration:
  enabled: true
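Before relying on the Agent pipeline, it can be useful to confirm that the API key accepts data at all by pushing a single test point through the HTTP API with the datadogpy client. A minimal sketch; the metric name and tag are illustrative:

# verify_submission.py - smoke-test metric submission via the HTTP API
import time
from datadog import initialize, api

initialize(api_key='YOUR_API_KEY')

resp = api.Metric.send(
    metric='custom.smoke_test',
    points=[(int(time.time()), 1)],
    tags=['env:test'],
)
print(resp)  # expect {'status': 'ok'} on success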
Custom Metrics Submission
# custom_metrics.py
from datadog import initialize, api, statsd
import time
import psutil
import requests

# Datadog initialization
options = {
    'api_key': 'YOUR_API_KEY',
    'app_key': 'YOUR_APP_KEY'
}
initialize(**options)

class CustomMetricsCollector:
    def __init__(self):
        self.statsd = statsd

    def collect_system_metrics(self):
        """System metrics collection"""
        # CPU usage
        cpu_percent = psutil.cpu_percent(interval=1)
        self.statsd.gauge('custom.system.cpu_percent', cpu_percent,
                          tags=['host:web-01', 'env:prod'])

        # Memory usage
        memory = psutil.virtual_memory()
        self.statsd.gauge('custom.system.memory_percent', memory.percent,
                          tags=['host:web-01', 'env:prod'])

        # Disk usage
        disk = psutil.disk_usage('/')
        disk_percent = (disk.used / disk.total) * 100
        self.statsd.gauge('custom.system.disk_percent', disk_percent,
                          tags=['host:web-01', 'env:prod'])

    def collect_application_metrics(self):
        """Application metrics collection"""
        # API response time measurement
        start_time = time.time()
        try:
            response = requests.get('http://localhost:8080/health', timeout=5)
            response_time = (time.time() - start_time) * 1000
            self.statsd.histogram('custom.api.response_time', response_time,
                                  tags=['endpoint:health', 'env:prod'])
            # Status code
            self.statsd.increment('custom.api.requests',
                                  tags=[f'status_code:{response.status_code}', 'env:prod'])
        except requests.RequestException:
            # Covers timeouts, connection errors, and HTTP-level failures
            self.statsd.increment('custom.api.errors',
                                  tags=['error_type:request_failed', 'env:prod'])

    def send_custom_event(self, title, text, alert_type='info'):
        """Custom event submission"""
        api.Event.create(
            title=title,
            text=text,
            alert_type=alert_type,
            tags=['custom:event', 'env:prod']
        )

    def run_continuous_monitoring(self):
        """Run continuous monitoring"""
        while True:
            try:
                self.collect_system_metrics()
                self.collect_application_metrics()
                time.sleep(60)  # 1-minute interval
            except Exception as e:
                print(f"Monitoring error: {e}")
                time.sleep(60)

if __name__ == "__main__":
    collector = CustomMetricsCollector()
    collector.run_continuous_monitoring()
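An alternative to a standalone loop like run_continuous_monitoring is to fold the collection into the Agent itself as a custom check, which inherits the Agent's scheduling, retries, and tagging. A minimal sketch, assuming Agent 6/7 (which bundles datadog_checks.base); the file pairs with a conf.d/custom_health.yaml declaring at least one instance, and all names here are illustrative:

# /etc/datadog-agent/checks.d/custom_health.py
from datadog_checks.base import AgentCheck

class CustomHealthCheck(AgentCheck):
    def check(self, instance):
        # self.gauge is inherited from AgentCheck; the Agent calls check()
        # once per collection interval for each configured instance
        self.gauge('custom.health.up', 1, tags=['env:prod'])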
Docker Integration Setup
# docker-compose.yml
version: '3.8'

services:
  datadog-agent:
    image: gcr.io/datadoghq/agent:7
    container_name: datadog-agent
    restart: unless-stopped
    environment:
      - DD_API_KEY=${DD_API_KEY}
      - DD_SITE=datadoghq.com
      - DD_LOGS_ENABLED=true
      - DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
      - DD_APM_ENABLED=true
      - DD_APM_NON_LOCAL_TRAFFIC=true
      - DD_PROCESS_AGENT_ENABLED=true
      - DD_SYSTEM_PROBE_ENABLED=true
      - DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true
      - DD_AC_EXCLUDE=image:gcr.io/datadoghq/agent
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /opt/datadog-agent/run:/opt/datadog-agent/run:rw
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /etc/passwd:/etc/passwd:ro
      - ./datadog.yaml:/etc/datadog-agent/datadog.yaml:ro
      - ./conf.d:/etc/datadog-agent/conf.d:ro
    ports:
      - "8125:8125/udp"  # DogStatsD
      - "8126:8126"      # APM
    cap_add:
      - SYS_ADMIN
      - SYS_RESOURCE
      - SYS_PTRACE
      - NET_ADMIN
      - NET_BROADCAST
      - NET_RAW
      - IPC_LOCK
    security_opt:
      - apparmor:unconfined
    networks:
      - monitoring

  webapp:
    image: nginx:alpine
    container_name: webapp
    labels:
      - "com.datadoghq.ad.check_names=[\"nginx\"]"
      - "com.datadoghq.ad.init_configs=[{}]"
      - "com.datadoghq.ad.instances=[{\"nginx_status_url\":\"http://%%host%%:81/nginx_status\"}]"
      - "com.datadoghq.ad.logs=[{\"source\":\"nginx\",\"service\":\"webapp\"}]"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    ports:
      - "80:80"
      - "81:81"
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge
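With 8125/udp published and DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true set above, application containers on the monitoring network can send StatsD metrics to the Agent by its service name. A minimal datadogpy sketch; the metric name and tags are illustrative:

# app_metrics.py - emit DogStatsD metrics from a sibling container
from datadog.dogstatsd import DogStatsd

# 'datadog-agent' resolves through the shared 'monitoring' network
statsd = DogStatsd(host='datadog-agent', port=8125)
statsd.increment('webapp.requests', tags=['env:production', 'service:webapp'])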
Alert Configuration (Terraform)
# alerts.tf
terraform {
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0"
    }
  }
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
  api_url = "https://api.datadoghq.com/"
}

# CPU usage alert
resource "datadog_monitor" "high_cpu" {
  name    = "High CPU Usage"
  type    = "metric alert"
  message = <<-EOF
    CPU usage is above 80% for more than 5 minutes
    @slack-alerts
    @pagerduty-team
  EOF

  query = "avg(last_5m):avg:system.cpu.user{env:production} by {host} > 80"

  monitor_thresholds {
    critical = 80
    warning  = 70
  }

  notify_no_data    = true
  no_data_timeframe = 10

  tags = ["team:sre", "env:production", "service:system"]
}

# APM error rate alert
resource "datadog_monitor" "high_error_rate" {
  name    = "High Error Rate"
  type    = "metric alert"
  message = <<-EOF
    Error rate is above 5% for service {{service.name}}
    @slack-alerts
  EOF

  query = "avg(last_10m):sum:trace.web.request.errors{env:production} by {service}.as_rate() / sum:trace.web.request.hits{env:production} by {service}.as_rate() > 0.05"

  monitor_thresholds {
    critical = 0.05
    warning  = 0.03
  }

  tags = ["team:backend", "env:production", "service:apm"]
}

# Log-based alert
resource "datadog_monitor" "error_logs" {
  name    = "Error Log Spike"
  type    = "log alert"
  message = "Error log count is unusually high @slack-alerts"

  query = "logs(\"status:error env:production\").index(\"*\").rollup(\"count\").last(\"15m\") > 100"

  monitor_thresholds {
    critical = 100
    warning  = 50
  }

  tags = ["team:sre", "env:production", "service:logs"]
}

# Synthetic monitoring
resource "datadog_synthetics_test" "api_test" {
  type      = "api"
  subtype   = "http"
  name      = "API Health Check"
  status    = "live"
  message   = "API endpoint is down @slack-alerts"
  locations = ["aws:us-east-1", "aws:eu-west-1"]

  options_list {
    tick_every = 60

    retry {
      count    = 2
      interval = 300
    }

    monitor_options {
      renotify_interval = 120
    }
  }

  request_definition {
    method = "GET"
    url    = "https://api.example.com/health"
  }

  # Assertions are top-level blocks, not nested in request_definition
  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  assertion {
    type     = "responseTime"
    operator = "lessThan"
    target   = "2000"
  }

  tags = ["team:api", "env:production", "service:synthetics"]
}
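One constraint to keep in mind: for metric alerts, the comparison value in query must equal monitor_thresholds.critical (80 and 0.05 above), and warning must be less severe than critical; the Datadog API rejects monitors where they disagree.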
Application Integration (Python)
# app_integration.py
# Patch supported libraries before the instrumented frameworks are imported;
# Flask is auto-instrumented by patch(flask=True), and the traced service
# name is taken from the DD_SERVICE environment variable
from ddtrace import tracer, patch
patch(flask=True, sqlalchemy=True, requests=True, redis=True)

from datadog.dogstatsd import DogStatsd
import logging
import time
from flask import Flask, request

# Metrics configuration
statsd = DogStatsd(host='localhost', port=8125)

# Logging configuration
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = Flask(__name__)

@app.before_request
def before_request():
    request.start_time = time.time()
    # Request metrics
    statsd.increment('web.request.count',
                     tags=[f'endpoint:{request.endpoint}', 'env:production'])

@app.after_request
def after_request(response):
    # Response time
    response_time = (time.time() - request.start_time) * 1000
    statsd.histogram('web.request.duration', response_time,
                     tags=[f'endpoint:{request.endpoint}',
                           f'status_code:{response.status_code}'])
    # Status code
    statsd.increment('web.response.count',
                     tags=[f'status_code:{response.status_code}'])
    return response

@tracer.wrap('database', service='db')
def get_user_data(user_id):
    """Database access tracing"""
    with tracer.trace('db.query', service='postgresql') as span:
        span.set_tag('user.id', user_id)
        span.set_tag('db.operation', 'select')
        # Database access simulation
        time.sleep(0.01)
        return {"user_id": user_id, "name": "Test User"}

@app.route('/health')
def health_check():
    """Health check endpoint"""
    return {"status": "healthy", "timestamp": time.time()}

@app.route('/user/<int:user_id>')
def get_user(user_id):
    """User information retrieval"""
    try:
        with tracer.trace('user.get', service='web-app') as span:
            span.set_tag('user.id', user_id)
            user_data = get_user_data(user_id)
            logger.info("User data retrieved", extra={
                'user_id': user_id,
                'dd.trace_id': span.trace_id,
                'dd.span_id': span.span_id
            })
            return user_data
    except Exception as e:
        statsd.increment('web.error.count',
                         tags=['error_type:user_not_found'])
        logger.error(f"Error retrieving user: {e}", extra={
            'user_id': user_id,
            'error': str(e)
        })
        return {"error": "User not found"}, 404

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=False)
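The same tracing can also be attached without code changes by launching the process with ddtrace-run (ddtrace-run python app_integration.py) and setting the DD_SERVICE, DD_ENV, and DD_VERSION environment variables, which additionally enables unified service tagging so traces, logs, and metrics correlate automatically.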
Troubleshooting
# Agent status check
sudo datadog-agent status
# Configuration validation
sudo datadog-agent configcheck
# Log file locations
tail -f /var/log/datadog/agent.log
tail -f /var/log/datadog/trace-agent.log
# Metrics submission test
echo "custom.metric:1|g|#env:test" | nc -u -w0 127.0.0.1 8125
# APM trace submission test
curl -X POST "http://localhost:8126/v0.4/traces" \
-H "Content-Type: application/json" \
-d '[{"trace_id": 123, "span_id": 456, "name": "test", "resource": "GET /", "service": "test-service", "start": 1609459200000000000, "duration": 1000000}]'
# Permissions check
ls -la /var/run/docker.sock
ps aux | grep datadog