Datadog

Unified monitoring and visualization platform for cloud-native environments. Centralizes APM, infrastructure monitoring, log management, and security monitoring. Machine learning-powered anomaly detection.

Tags: Monitoring Server, SaaS Monitoring, APM, Infrastructure Monitoring, Machine Learning, Cloud Native, Unified Monitoring

Overview

Datadog is a unified monitoring and visualization platform for cloud-native environments. It centralizes APM, infrastructure monitoring, log management, and security monitoring, and layers machine learning-powered anomaly detection on top of the collected telemetry.

Details

Datadog was founded in 2010 by Olivier Pomel and Alexis Lê-Quôc and now holds a leading position in the SaaS monitoring market. With more than $2B in annual revenue and adoption by many Fortune 500 companies, it continues to evolve into a next-generation monitoring platform through AI and ML integration. Its cloud-first strategy has driven sustained growth and established the product as an industry standard for unified monitoring.

Key Technical Features

  • Unified Monitoring Platform: Centralized management of infrastructure, APM, logs, and security
  • Machine Learning Integration: AI-assisted anomaly detection and predictive analysis
  • Cloud-Native Support: Deep integration with major clouds including AWS, Azure, and GCP
  • Real-Time Monitoring: High-frequency metrics collection and instant alerting
  • Rich Integrations: 800+ service and tool integrations

Use Cases

  • Cloud infrastructure monitoring
  • Application performance monitoring (APM)
  • Log management and analysis
  • Security monitoring and SIEM
  • Business metrics visualization

Pros and Cons

Pros

  • Unified Platform: Comprehensive solution covering all monitoring domains
  • AI/ML Capabilities: Advanced anomaly detection through machine learning
  • Rich Integrations: 800+ service integrations
  • Scalability: High reliability at enterprise scale
  • Usability: Intuitive UI/UX with excellent operability
  • 24/7 Support: Comprehensive support infrastructure

Cons

  • High Cost: Expensive pricing structure for enterprise use
  • Vendor Lock-in: Dependency on SaaS platform
  • Learning Curve: Complexity due to extensive features
  • Data Sovereignty: Constraints on data management due to cloud-based nature
  • Customization Limits: Limitations in customization due to SaaS nature

Code Examples

Datadog Agent Configuration

# datadog.yaml
api_key: YOUR_API_KEY_HERE
site: datadoghq.com

# Hostname configuration
hostname: web-server-01
hostname_fqdn: web-server-01.example.com

# Tags configuration
tags:
  - env:production
  - service:web
  - team:backend
  - region:us-east-1

# Log collection configuration
logs_enabled: true
logs_config:
  container_collect_all: true
  use_http: true
  compression_level: 6

# APM configuration
apm_config:
  enabled: true
  receiver_port: 8126
  max_traces_per_second: 10

# Process monitoring
process_config:
  enabled: true
  scrub_args: true

# Network Performance Monitoring
network_config:
  enabled: true

# System probe
system_probe_config:
  enabled: true

# Check runner settings
check_runners: 4
collection_timeout: 30

# Proxy configuration
proxy:
  http: http://proxy.example.com:8080
  https: https://proxy.example.com:8080
  no_proxy:
    - localhost
    - 127.0.0.1

# JMX configuration
jmx_use_cgroup_memory_limit: true

# Security Agent
security_agent:
  enabled: true
  
# Remote Configuration
remote_configuration:
  enabled: true
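
Individual integrations are configured through per-check files under conf.d/. As a minimal sketch using the built-in http_check integration (the instance name and URL are placeholders), a probe against an internal health endpoint could look like this:

# /etc/datadog-agent/conf.d/http_check.d/conf.yaml
init_config:

instances:
  - name: internal-health          # display name of this check instance
    url: http://localhost:8080/health
    timeout: 5
    tags:
      - env:production
      - service:web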

Custom Metrics Submission

# custom_metrics.py
from datadog import initialize, api, statsd
import time
import psutil
import requests

# Datadog initialization
options = {
    'api_key': 'YOUR_API_KEY',
    'app_key': 'YOUR_APP_KEY'
}
initialize(**options)

class CustomMetricsCollector:
    def __init__(self):
        self.statsd = statsd
        
    def collect_system_metrics(self):
        """System metrics collection"""
        # CPU usage
        cpu_percent = psutil.cpu_percent(interval=1)
        self.statsd.gauge('custom.system.cpu_percent', cpu_percent, 
                         tags=['host:web-01', 'env:prod'])
        
        # Memory usage
        memory = psutil.virtual_memory()
        self.statsd.gauge('custom.system.memory_percent', memory.percent,
                         tags=['host:web-01', 'env:prod'])
        
        # Disk usage
        disk = psutil.disk_usage('/')
        disk_percent = (disk.used / disk.total) * 100
        self.statsd.gauge('custom.system.disk_percent', disk_percent,
                         tags=['host:web-01', 'env:prod'])
        
    def collect_application_metrics(self):
        """Application metrics collection"""
        # API response time measurement
        start_time = time.time()
        try:
            response = requests.get('http://localhost:8080/health', timeout=5)
            response_time = (time.time() - start_time) * 1000
            
            self.statsd.histogram('custom.api.response_time', response_time,
                                tags=['endpoint:health', 'env:prod'])
            
            # Status code
            self.statsd.increment('custom.api.requests',
                                tags=[f'status_code:{response.status_code}', 'env:prod'])
            
        except requests.RequestException as e:
            # Tag with the actual exception class rather than assuming a timeout
            self.statsd.increment('custom.api.errors',
                                tags=[f'error_type:{type(e).__name__}', 'env:prod'])
    
    def send_custom_event(self, title, text, alert_type='info'):
        """Custom event submission"""
        api.Event.create(
            title=title,
            text=text,
            alert_type=alert_type,
            tags=['custom:event', 'env:prod']
        )
    
    def run_continuous_monitoring(self):
        """Run continuous monitoring"""
        while True:
            try:
                self.collect_system_metrics()
                self.collect_application_metrics()
                time.sleep(60)  # 1-minute interval
            except Exception as e:
                print(f"Monitoring error: {e}")
                time.sleep(60)

if __name__ == "__main__":
    collector = CustomMetricsCollector()
    collector.run_continuous_monitoring()
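
When no local Agent is available to receive DogStatsD packets, metrics can also be pushed directly over the Datadog HTTP API using the same datadogpy package as above. A minimal sketch (the metric name and tags are illustrative):

# api_metrics.py
import time
from datadog import initialize, api

initialize(api_key='YOUR_API_KEY', app_key='YOUR_APP_KEY')

# Submit a single gauge point via the HTTP API instead of DogStatsD
api.Metric.send(
    metric='custom.deploys.count',
    points=[(int(time.time()), 1)],
    type='gauge',
    tags=['env:production', 'service:web']
)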

Docker Integration Setup

# docker-compose.yml
version: '3.8'

services:
  datadog-agent:
    image: gcr.io/datadoghq/agent:7
    container_name: datadog-agent
    restart: unless-stopped
    environment:
      - DD_API_KEY=${DD_API_KEY}
      - DD_SITE=datadoghq.com
      - DD_LOGS_ENABLED=true
      - DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
      - DD_APM_ENABLED=true
      - DD_APM_NON_LOCAL_TRAFFIC=true
      - DD_PROCESS_AGENT_ENABLED=true
      - DD_SYSTEM_PROBE_ENABLED=true
      - DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true
      - DD_AC_EXCLUDE=image:gcr.io/datadoghq/agent
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /opt/datadog-agent/run:/opt/datadog-agent/run:rw
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /etc/passwd:/etc/passwd:ro
      - ./datadog.yaml:/etc/datadog-agent/datadog.yaml:ro
      - ./conf.d:/etc/datadog-agent/conf.d:ro
    ports:
      - "8125:8125/udp"  # DogStatsD
      - "8126:8126"      # APM
    cap_add:
      - SYS_ADMIN
      - SYS_RESOURCE
      - SYS_PTRACE
      - NET_ADMIN
      - NET_BROADCAST
      - NET_RAW
      - IPC_LOCK
    security_opt:
      - apparmor:unconfined
    networks:
      - monitoring

  webapp:
    image: nginx:alpine
    container_name: webapp
    labels:
      - "com.datadoghq.ad.check_names=[\"nginx\"]"
      - "com.datadoghq.ad.init_configs=[{}]"
      - "com.datadoghq.ad.instances=[{\"nginx_status_url\":\"http://%%host%%:81/nginx_status\"}]"
      - "com.datadoghq.ad.logs=[{\"source\":\"nginx\",\"service\":\"webapp\"}]"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    ports:
      - "80:80"
      - "81:81"
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge
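
The autodiscovery labels on the webapp container assume that nginx exposes its stub_status page on port 81. A minimal nginx.conf sketch that satisfies that assumption (the application server block is only a placeholder):

# nginx.conf
events {}

http {
  # Application traffic
  server {
    listen 80;
    location / {
      return 200 "ok\n";
    }
  }

  # Status endpoint scraped by the Datadog nginx check
  server {
    listen 81;
    location /nginx_status {
      stub_status;
      access_log off;
    }
  }
}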

Alert Configuration (Terraform)

# alerts.tf
terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
      version = "~> 3.0"
    }
  }
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
  api_url = "https://api.datadoghq.com/"
}

# CPU usage alert
resource "datadog_monitor" "high_cpu" {
  name         = "High CPU Usage"
  type         = "metric alert"
  message      = <<-EOF
    CPU usage is above 80% for more than 5 minutes
    @slack-alerts
    @pagerduty-team
  EOF
  
  query = "avg(last_5m):avg:system.cpu.user{env:production} by {host} > 80"
  
  monitor_thresholds {
    critical = 80
    warning  = 70
  }
  
  notify_no_data    = true
  no_data_timeframe = 10
  
  tags = ["team:sre", "env:production", "service:system"]
}

# APM error rate alert
resource "datadog_monitor" "high_error_rate" {
  name    = "High Error Rate"
  type    = "metric alert"
  message = <<-EOF
    Error rate is above 5% for service {{service.name}}
    @slack-alerts
  EOF
  
  query = "avg(last_10m):sum:trace.web.request.errors{env:production} by {service}.as_rate() / sum:trace.web.request.hits{env:production} by {service}.as_rate() > 0.05"
  
  monitor_thresholds {
    critical = 0.05
    warning  = 0.03
  }
  
  tags = ["team:backend", "env:production", "service:apm"]
}

# Log-based alert
resource "datadog_monitor" "error_logs" {
  name    = "Error Log Spike"
  type    = "log alert"
  message = "Error log count is unusually high @slack-alerts"
  
  query = "logs(\"status:error env:production\").index(\"*\").rollup(\"count\").last(\"15m\") > 100"
  
  monitor_thresholds {
    critical = 100
    warning  = 50
  }
  
  tags = ["team:sre", "env:production", "service:logs"]
}

# Synthetic monitoring
resource "datadog_synthetics_test" "api_test" {
  type      = "api"
  subtype   = "http"
  name      = "API Health Check"
  message   = "API endpoint is down @slack-alerts"
  
  locations = ["aws:us-east-1", "aws:eu-west-1"]
  
  options_list {
    tick_every = 60
    
    retry {
      count    = 2
      interval = 300
    }
    
    monitor_options {
      renotify_interval = 120
    }
  }
  
  request_definition {
    method = "GET"
    url    = "https://api.example.com/health"
    
    assertion {
      type     = "statusCode"
      operator = "is"
      target   = "200"
    }
    
    assertion {
      type     = "responseTime"
      operator = "lessThan"
      target   = "2000"
    }
  }
  
  tags = ["team:api", "env:production", "service:synthetics"]
}
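
The machine learning features mentioned earlier surface in monitors through query functions such as anomalies(). A sketch of an anomaly-detection monitor in the same Terraform setup, assuming provider 3.x syntax (the metric, windows, and thresholds are illustrative):

# Anomaly detection monitor (ML-based)
resource "datadog_monitor" "anomalous_latency" {
  name    = "Anomalous Request Latency"
  type    = "query alert"
  message = "Request latency deviates from its predicted range @slack-alerts"

  query = "avg(last_4h):anomalies(avg:trace.web.request.duration{env:production}, 'agile', 2, direction='above', alert_window='last_15m', interval=60, count_default_zero='true') >= 1"

  monitor_thresholds {
    critical = 1.0
  }

  monitor_threshold_windows {
    trigger_window  = "last_15m"
    recovery_window = "last_15m"
  }

  tags = ["team:sre", "env:production", "service:apm"]
}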

Application Integration (Python)

# app_integration.py
from ddtrace import tracer, patch

# Tracing configuration: patch integrations before importing the frameworks
patch(flask=True, sqlalchemy=True, requests=True, redis=True)

import logging
import time

from flask import Flask, request
from datadog.dogstatsd import DogStatsd

# Metrics configuration (DogStatsD client from the datadogpy package)
statsd = DogStatsd(host='localhost', port=8125)

# Logging configuration
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Flask is traced automatically once patch(flask=True) has run;
# the service name is typically supplied via the DD_SERVICE environment variable
app = Flask(__name__)

@app.before_request
def before_request():
    request.start_time = time.time()
    
    # Request metrics
    statsd.increment('web.request.count', 
                    tags=[f'endpoint:{request.endpoint}', 'env:production'])

@app.after_request
def after_request(response):
    # Response time
    response_time = (time.time() - request.start_time) * 1000
    statsd.histogram('web.request.duration', response_time,
                    tags=[f'endpoint:{request.endpoint}', 
                          f'status_code:{response.status_code}'])
    
    # Status code
    statsd.increment('web.response.count',
                    tags=[f'status_code:{response.status_code}'])
    
    return response

@tracer.wrap('database', service='db')
def get_user_data(user_id):
    """Database access tracing"""
    with tracer.trace('db.query', service='postgresql') as span:
        span.set_tag('user.id', user_id)
        span.set_tag('db.operation', 'select')
        
        # Database access simulation
        time.sleep(0.01)
        return {"user_id": user_id, "name": "Test User"}

@app.route('/health')
def health_check():
    """Health check endpoint"""
    return {"status": "healthy", "timestamp": time.time()}

@app.route('/user/<int:user_id>')
def get_user(user_id):
    """User information retrieval"""
    try:
        with tracer.trace('user.get', service='web-app') as span:
            span.set_tag('user.id', user_id)
            
            user_data = get_user_data(user_id)
            
            logger.info(f"User data retrieved", extra={
                'user_id': user_id,
                'dd.trace_id': span.trace_id,
                'dd.span_id': span.span_id
            })
            
            return user_data
            
    except Exception as e:
        statsd.increment('web.error.count', 
                        tags=['error_type:user_not_found'])
        logger.error(f"Error retrieving user: {e}", extra={
            'user_id': user_id,
            'error': str(e)
        })
        return {"error": "User not found"}, 404

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=False)
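
In practice the service name, environment, and log correlation settings are usually supplied through environment variables and the ddtrace-run launcher rather than in code. A sketch of how the app above might be started (all values are illustrative):

# Run the Flask app with tracing enabled via ddtrace-run
DD_SERVICE=web-app \
DD_ENV=production \
DD_VERSION=1.0.0 \
DD_LOGS_INJECTION=true \
DD_AGENT_HOST=localhost \
ddtrace-run python app_integration.py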

Troubleshooting

# Agent status check
sudo datadog-agent status

# Configuration validation
sudo datadog-agent configcheck

# Log file locations
tail -f /var/log/datadog/agent.log
tail -f /var/log/datadog/trace-agent.log

# Metrics submission test
echo "custom.metric:1|g|#env:test" | nc -u -w0 127.0.0.1 8125

# APM trace submission test
curl -X POST "http://localhost:8126/v0.4/traces" \
  -H "Content-Type: application/json" \
  -d '[{"trace_id": 123, "span_id": 456, "name": "test", "resource": "GET /", "service": "test-service", "start": 1609459200000000000, "duration": 1000000}]'

# Permissions check
ls -la /var/run/docker.sock
ps aux | grep datadog
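
Two further commands that are often useful: running a single integration check on demand and generating a support flare (the check name is an example):

# Run one integration check and print the collected metrics
sudo datadog-agent check nginx

# Generate a diagnostic flare to send to Datadog support
sudo datadog-agent flare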