Complete DevOps Engineer Roadmap

DevOps, Kubernetes, Terraform, CI/CD, Cloud, Infrastructure, Automation

Technology

Overview

DevOps engineers bridge development and operations, using automation to make software delivery efficient and reliable. As of 2025, much of the field has shifted toward platform engineering, which focuses on building reusable, self-service infrastructure so that developers can deliver value rapidly. GitOps, AI-driven automation, and cloud-native technologies have become mainstream.

Details

Phase 1: Building the Foundation (3-6 months)

Linux Fundamentals and System Administration

Linux Mastery
- Shell scripting (Bash, Zsh)
- File systems and permissions
- Process management and system resources
- Network basics (TCP/IP, DNS, HTTP/HTTPS)
- systemd and service management
Basic Tools
- Text processing (grep, sed, awk)
- System monitoring (top, htop, iostat, netstat)
- Package management (apt, yum, dnf)
- SSH configuration and security

Programming Skills

Scripting Languages
- Python (automation scripts)
- Go (tool development)
- JavaScript (Node.js environment)
Version Control
- Complete Git understanding (branching strategies, merge, rebase)
- GitLab/GitHub/Bitbucket
- Code review best practices

Cloud Fundamentals

Major Cloud Platforms
- AWS (EC2, S3, IAM, VPC)
- Google Cloud Platform (Compute Engine, Cloud Storage)
- Azure (Virtual Machines, Storage Accounts)

Phase 2: Containers and Orchestration (6-12 months)

Complete Docker Mastery

Container Basics
- Dockerfile optimization
- Multi-stage builds
- Understanding image layers
- Security scanning (Trivy, Snyk)
Practical Docker
- Docker Compose
- Volumes and networks
- Registry management (Docker Hub, Harbor)
- Image size minimization (distroless, Alpine)

Kubernetes Mastery

Core Concepts
- Architecture (Control Plane, Worker Nodes)
- Basic resources (Pod, Service, Deployment, StatefulSet)
- Storage (PV, PVC, StorageClass)
- Networking (Service, Ingress, NetworkPolicy)
Advanced Features
- ConfigMap and Secrets management
- RBAC (Role-Based Access Control)
- Helm Charts
- Auto-scaling (HPA, VPA, Cluster Autoscaler)
- Service Mesh (Istio, Linkerd)
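
Of the auto-scaling options listed above, the Horizontal Pod Autoscaler is the easiest to reason about, because its documented rule is a simple ratio: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). The sketch below only illustrates that arithmetic. The function name, the replica bounds, and the way the tolerance is applied are assumptions for this example, not part of any Kubernetes client library.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Hypothetical helper mirroring the HPA scaling ratio."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # No scaling when the ratio is close enough to 1.0
    # (0.1 is the documented default tolerance).
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds, as minReplicas/maxReplicas would.
    return max(min_replicas, min(max_replicas, wanted))


# 3 pods averaging 90% CPU against a 60% target
print(desired_replicas(3, current_metric=90, target_metric=60))
```

Working through the ratio by hand like this is a quick way to check whether a target value will cause constant scaling up and down.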

Infrastructure as Code

Terraform
- HCL syntax
- Module development
- State management and remote backends
- Terraform workspaces
- Provider development
Other IaC Tools
- Ansible (configuration management)
- Pulumi (programmable IaC)
- CloudFormation (AWS-specific)

Phase 3: CI/CD and GitOps (12-18 months)

CI/CD Pipelines

GitHub Actions
- Workflow building
- Custom action development
- Matrix builds
- Security scan integration
- Deployment automation
Other CI/CD Tools
- Jenkins (Pipeline as Code)
- GitLab CI/CD
- CircleCI
- ArgoCD (GitOps)

GitOps Practice

GitOps Principles
- Declarative configuration
- Git as single source of truth
- Automated deployments
- Continuous synchronization
Implementation Patterns
- ArgoCD/Flux
- Kustomize
- Progressive Delivery (Flagger)
- Rollback strategies
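
The principles above (declarative configuration, Git as the source of truth, continuous synchronization) come down to a reconcile loop: compare the desired state from Git with the live state and work out what to change. The following is a minimal sketch of that comparison, assuming both states are plain dictionaries keyed by resource name. The names plan_sync and State are invented for this example and do not correspond to the Argo CD or Flux APIs.

```python
from typing import Dict, List, Tuple

State = Dict[str, dict]  # resource name -> spec


def plan_sync(desired: State, live: State,
              prune: bool = True) -> List[Tuple[str, str]]:
    """Return (action, resource) pairs that move live toward desired."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    if prune:
        # Resources that exist in the cluster but not in Git are removed.
        for name in sorted(live.keys() - desired.keys()):
            actions.append(("delete", name))
    return actions


desired = {"deployment/app": {"replicas": 3}, "service/app": {"port": 80}}
live = {"deployment/app": {"replicas": 2}, "configmap/old": {}}
for action, resource in plan_sync(desired, live):
    print(action, resource)
```

A rollback in this model is just a Git revert: the desired state changes back, and the same comparison produces the actions that undo the release.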

Monitoring and Observability

Metrics Collection
- Prometheus
- Grafana
- Alert configuration (Alertmanager)
Log Management
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Fluentd/Fluent Bit
- Loki
Distributed Tracing
- Jaeger
- Zipkin
- OpenTelemetry

Phase 4: Advanced DevOps and Platform Engineering (18-24 months)

Security (DevSecOps)

Container Security
- Image scanning
- Runtime security (Falco)
- Pod Security Standards
- OPA (Open Policy Agent)
Secret Management
- HashiCorp Vault
- Kubernetes Secrets
- External Secrets Operator
- SOPS
Compliance Automation
- CIS Benchmarks
- STIG compliance
- Audit logging

AI-Driven Automation

AI and MLOps Integration
- Self-healing infrastructure
- Predictive scaling
- Anomaly detection
- Intelligent alerting
ChatOps and AI
- Slack bot automation
- Automated incident response
- AI-assisted troubleshooting
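
Anomaly detection and intelligent alerting do not have to start with a machine-learning model. A common baseline is a rolling z-score: flag a sample when it sits several standard deviations away from the recent history. The sketch below assumes a plain list of metric values, and the function name, window size, and threshold are illustrative choices rather than recommendations.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(samples: Sequence[float], window: int = 5,
                   threshold: float = 3.0) -> List[int]:
    """Return indexes whose value deviates from the trailing window
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]
print(find_anomalies(latency_ms))
```

Note that a spike stays in the trailing window for a few samples and inflates the standard deviation, so production systems usually use more robust statistics than this baseline.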

Platform Engineering

Developer Experience
- Self-service portals
- Internal Developer Platform (IDP)
- Golden path provisioning
- Development environment standardization
Building a Platform as a Service
- Backstage (developer portal)
- Crossplane (cloud resource management)
- Service catalog
- Cost management and optimization

SRE (Site Reliability Engineering)

Reliability Principles
- SLI/SLO/SLA definition
- Error budgets
- Toil reduction
- Post-mortem culture
Chaos Engineering
- Chaos Monkey
- Litmus
- Failure injection testing
- Recovery procedure automation
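
The error budget listed under Reliability Principles is plain arithmetic: an availability SLO of 99.9% over 30 days leaves 0.1% of the period, roughly 43 minutes, as acceptable downtime. A small sketch of that calculation follows. The function names and the 30-day default window are assumptions made for this example.

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, period_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1 - downtime_minutes / budget


print(round(error_budget_minutes(99.9), 1))    # minutes of budget per 30 days
print(round(budget_remaining(99.9, 21.6), 2))  # share left after 21.6 min down
```

Teams typically use the remaining fraction as a release gate: when it approaches zero, risky deployments pause until the window rolls over.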

Advantages and Disadvantages

Advantages

High demand: DevOps skills remain one of the most sought-after skill sets in 2025
Broad skills: Gain knowledge across infrastructure, development, and security
Automation satisfaction: Automate manual tasks and significantly improve efficiency
Career flexibility: Paths to SRE, Platform Engineer, Cloud Architect roles
Innovation forefront: Regular exposure to cutting-edge technologies

Disadvantages

Broad learning scope: Need to master diverse technology stacks
On-call duty: Running systems 24/7 can mean emergency response at any hour
Rapid change: Tools and technologies update very quickly
Heavy responsibility: Infrastructure failures affect the entire system
Stress: Coordinating and communicating across multiple teams can be demanding

Reference Pages

Kubernetes Official Documentation - Complete Kubernetes guide
Terraform Official Documentation - Terraform reference
AWS Well-Architected Framework - AWS best practices
Google SRE Books - Google SRE handbooks
CNCF Landscape - Cloud native tools overview
DevOps Roadmap - Interactive DevOps roadmap
The Phoenix Project - DevOps novel
GitOps Working Group - GitOps standards

Code Examples

CI/CD Pipeline with GitHub Actions

# .github/workflows/deploy.yml
name: Build and Deploy to Kubernetes

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      security-events: write  # needed by the upload-sarif step below
      
    steps:
    - name: Checkout code
      uses: actions/checkout@v3

    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v2

    - name: Log in to GitHub Container Registry
      uses: docker/login-action@v2
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}

    - name: Extract metadata
      id: meta
      uses: docker/metadata-action@v4
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
        tags: |
          type=ref,event=branch
          type=ref,event=pr
          type=semver,pattern={{version}}
          type=semver,pattern={{major}}.{{minor}}
          type=sha

    - name: Build and push Docker image
      uses: docker/build-push-action@v4
      with:
        context: .
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}
        cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache
        cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache,mode=max

    - name: Run Trivy vulnerability scanner
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ steps.meta.outputs.version }}
        format: 'sarif'
        output: 'trivy-results.sarif'

    - name: Upload Trivy scan results to GitHub Security
      uses: github/codeql-action/upload-sarif@v3
      with:
        sarif_file: 'trivy-results.sarif'

  deploy:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    
    steps:
    - name: Checkout code
      uses: actions/checkout@v3

    - name: Install kubectl
      uses: azure/setup-kubectl@v3
      with:
        version: 'v1.28.0'

    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v2
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-east-1

    - name: Update kubeconfig
      run: |
        aws eks update-kubeconfig --name production-cluster --region us-east-1

    - name: Deploy to Kubernetes
      run: |
        kubectl set image deployment/app app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${GITHUB_SHA::7} -n production
        kubectl rollout status deployment/app -n production
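
The deploy step above only works if the tag it asks for matches a tag the build job pushed. It relies on the type=sha rule of the metadata action, which by default produces sha- followed by the first seven characters of the commit SHA. The sketch below reproduces that naming in Python so it can be checked outside the pipeline. The function name is hypothetical, and lowercasing the repository is an assumption added here because image references must be lowercase while github.repository may not be.

```python
def image_ref(registry: str, repository: str, commit_sha: str) -> str:
    """Build the image reference the deploy step is expected to use."""
    if len(commit_sha) < 7:
        raise ValueError("commit_sha looks truncated")
    # Image repository names must be lowercase.
    name = repository.lower()
    # Mirrors the workflow's sha-${GITHUB_SHA::7} tag.
    return f"{registry}/{name}:sha-{commit_sha[:7]}"


print(image_ref("ghcr.io", "MyOrg/MyApp",
                "860c1904a1ce19322e91ac35af1ab07466440c37"))
```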

Infrastructure Building with Terraform

# main.tf - EKS cluster construction
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
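    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23"
    }
    # The provider "helm" block further down uses the nested kubernetes {}
    # block syntax of the 2.x series, so that series is pinned here.
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.11"
    }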
  }

  backend "s3" {
    bucket = "terraform-state-bucket"
    key    = "eks/terraform.tfstate"
    region = "us-east-1"
  }
}

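# Input variables and data sources referenced by the modules below
variable "cluster_name" {
  description = "Name of the EKS cluster"
  type        = string
}

variable "tags" {
  description = "Tags applied to the EKS resources"
  type        = map(string)
  default     = {}
}

data "aws_availability_zones" "available" {
  state = "available"
}

# VPC module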
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "${var.cluster_name}-vpc"
  cidr = "10.0.0.0/16"

  # Use three AZs to match the three subnets of each type below
  azs             = slice(data.aws_availability_zones.available.names, 0, 3)
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = false
  enable_dns_hostnames = true

  tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }

  public_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                    = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"           = "1"
  }
}

# EKS cluster
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.15.3"

  cluster_name    = var.cluster_name
  cluster_version = "1.28"

  vpc_id                         = module.vpc.vpc_id
  subnet_ids                     = module.vpc.private_subnets
  cluster_endpoint_public_access = true

  eks_managed_node_group_defaults = {
    instance_types = ["t3.medium"]
  }

  eks_managed_node_groups = {
    main = {
      name = "node-group-1"

      instance_types = ["t3.medium"]

      min_size     = 2
      max_size     = 10
      desired_size = 3

      pre_bootstrap_user_data = <<-EOT
        echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.conf
        sysctl -p
      EOT
    }
  }

  # OIDC Provider for IRSA
  enable_irsa = true

  # Cluster add-ons
  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent = true
    }
  }

  tags = var.tags
}

# IRSA for ALB Controller
module "load_balancer_controller_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "5.30.0"

  role_name = "${var.cluster_name}-aws-load-balancer-controller"

  attach_load_balancer_controller_policy = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:aws-load-balancer-controller"]
    }
  }
}

# Helm provider configuration
provider "helm" {
  kubernetes {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      command     = "aws"
      args = ["eks", "get-token", "--cluster-name", var.cluster_name]
    }
  }
}

# AWS Load Balancer Controller
resource "helm_release" "aws_load_balancer_controller" {
  name       = "aws-load-balancer-controller"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"
  version    = "1.6.0"

  set {
    name  = "clusterName"
    value = var.cluster_name
  }

  set {
    name  = "serviceAccount.create"
    value = "true"
  }

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = module.load_balancer_controller_irsa.iam_role_arn
  }

  depends_on = [
    module.eks
  ]
}
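
The VPC module above carves six /24 subnets out of 10.0.0.0/16. Overlapping or out-of-range CIDR blocks are a common source of failed applies, and they can be caught before running terraform plan. The following sketch uses Python's standard ipaddress module for that check. The helper name validate_subnets is illustrative only and is not part of Terraform or the AWS modules.

```python
import ipaddress
from typing import List


def validate_subnets(vpc_cidr: str, subnet_cidrs: List[str]) -> None:
    """Raise ValueError if a subnet is outside the VPC or overlaps another."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(c) for c in subnet_cidrs]
    for net in subnets:
        if not net.subnet_of(vpc):
            raise ValueError(f"{net} is outside {vpc}")
    for i, a in enumerate(subnets):
        for b in subnets[i + 1:]:
            if a.overlaps(b):
                raise ValueError(f"{a} overlaps {b}")


private = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
validate_subnets("10.0.0.0/16", private + public)  # passes silently
print("subnet layout is consistent")
```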

Kubernetes Manifests and GitOps

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: production

resources:
  - deployment.yaml
  - service.yaml
  - ingress.yaml
  - configmap.yaml
  - hpa.yaml

configMapGenerator:
  - name: app-config
    envs:
      - config.env

secretGenerator:
  - name: app-secrets
    envs:
      - secrets.env

patches:
  - path: replica-patch.yaml

images:
  - name: myapp
    newName: ghcr.io/myorg/myapp
    newTag: v1.2.3

---
# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-configs
    targetRevision: HEAD
    path: apps/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
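
The retry block in the Application above describes an exponential backoff: start at 5 seconds, double after each failed attempt, and never wait longer than 3 minutes. The sketch below computes the resulting schedule. It assumes the first retry waits the base duration and that each later wait is the base multiplied by the factor once more. That reading of the fields is an assumption about Argo CD's behavior, and the function name is invented for this example.

```python
from typing import List


def retry_delays(limit: int, duration_s: float, factor: float,
                 max_duration_s: float) -> List[float]:
    """Delay before each retry: duration * factor**n, capped at the maximum."""
    return [min(duration_s * factor ** n, max_duration_s)
            for n in range(limit)]


# limit: 5, duration: 5s, factor: 2, maxDuration: 3m
print(retry_delays(5, 5, 2, 180))
```

With these settings the cap is never reached. It only starts to matter from the sixth attempt onward, which is one way to judge whether limit and maxDuration are set consistently.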

Monitoring Configuration with Prometheus

# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager:9093

    rule_files:
      - /etc/prometheus/rules/*.yml

    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

---
# alert-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
data:
  node-alerts.yml: |
    groups:
      - name: node-alerts
        interval: 30s
        rules:
          - alert: NodeCPUHigh
            expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Node CPU usage is high"
              description: "CPU usage is above 80% (current value: {{ $value }}%)"
          
          - alert: NodeMemoryHigh
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Node memory usage is high"
              description: "Memory usage is above 85% (current value: {{ $value }}%)"
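
The NodeCPUHigh expression above is easier to trust once the arithmetic is spelled out. The idle counter grows by up to one second per second on each core, so its per-second increase is the idle fraction, and 100 minus the average idle fraction (as a percentage) is the CPU usage. The sketch below reproduces that calculation from two samples per core. It deliberately ignores counter resets and the extrapolation Prometheus performs, and the type and function names are hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix_timestamp, idle_cpu_seconds_total)


def cpu_usage_percent(per_core: Sequence[Tuple[Sample, Sample]]) -> float:
    """Approximate 100 - avg(idle rate) * 100 from first/last samples per core."""
    if not per_core:
        raise ValueError("at least one core is required")
    idle_fractions = []
    for (t0, v0), (t1, v1) in per_core:
        if t1 <= t0:
            raise ValueError("samples must be in increasing time order")
        # Per-second increase of the idle counter = fraction of time idle.
        idle_fractions.append((v1 - v0) / (t1 - t0))
    return 100 - (sum(idle_fractions) / len(idle_fractions)) * 100


# Two cores over a 5-minute window: idle for 60 s and 30 s respectively
cores = [((0.0, 1000.0), (300.0, 1060.0)), ((0.0, 2000.0), (300.0, 2030.0))]
print(round(cpu_usage_percent(cores), 1))
```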

Automation with Shell Scripts

#!/bin/bash
# deploy.sh - Automated deployment script

set -euo pipefail

# Variable definitions
NAMESPACE="${NAMESPACE:-production}"
APP_NAME="${APP_NAME:-myapp}"
IMAGE_TAG="${IMAGE_TAG:-latest}"
TIMEOUT="${TIMEOUT:-300}"

# Function definitions
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"
}

error() {
    log "ERROR: $*" >&2
    exit 1
}

# Prerequisites check
check_prerequisites() {
    log "Checking prerequisites..."
    
    command -v kubectl >/dev/null 2>&1 || error "kubectl is not installed"
    command -v helm >/dev/null 2>&1 || error "helm is not installed"
    command -v jq >/dev/null 2>&1 || error "jq is not installed"
    
    kubectl cluster-info >/dev/null 2>&1 || error "Cannot connect to Kubernetes cluster"
    
    log "Prerequisites check passed"
}

# Execute deployment
deploy_application() {
    log "Starting deployment of ${APP_NAME} with tag ${IMAGE_TAG}..."
    
    # Create namespace (if not exists)
    kubectl create namespace "${NAMESPACE}" --dry-run=client -o yaml | kubectl apply -f -
    
    # Update ConfigMaps and Secrets
    kubectl apply -f configs/ -n "${NAMESPACE}"
    
    # Deploy with Helm chart
    helm upgrade --install "${APP_NAME}" ./charts/"${APP_NAME}" \
        --namespace "${NAMESPACE}" \
        --set image.tag="${IMAGE_TAG}" \
        --set replicaCount=3 \
        --wait \
        --timeout="${TIMEOUT}s"
    
    log "Deployment completed successfully"
}

# Health check
health_check() {
    log "Performing health check..."
    
    # Check deployment status
    kubectl rollout status deployment/"${APP_NAME}" -n "${NAMESPACE}" --timeout="${TIMEOUT}s"
    
    # Check pod readiness (fetch the pod list once and reuse it)
    local pods_json ready_pods total_pods
    pods_json=$(kubectl get pods -n "${NAMESPACE}" -l app="${APP_NAME}" -o json)
    ready_pods=$(jq '[.items[] | select(any(.status.conditions[]?; .type=="Ready" and .status=="True"))] | length' <<<"${pods_json}")
    total_pods=$(jq '.items | length' <<<"${pods_json}")
    
    if [[ "${ready_pods}" -ne "${total_pods}" ]]; then
        error "Not all pods are ready: ${ready_pods}/${total_pods}"
    fi
    
    log "Health check passed: ${ready_pods}/${total_pods} pods are ready"
}

# Main process
main() {
    log "Deployment script started"
    
    check_prerequisites
    deploy_application
    health_check
    
    log "Deployment completed successfully!"
}

# Execute script
main "$@"
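
The readiness check in health_check is the part of the script most likely to grow, and it is the kind of logic that is often moved into Python once it does. As a rough sketch, the function below counts ready pods from the JSON that kubectl get pods -o json prints. The function name is hypothetical, and the input is assumed to follow the usual items/status/conditions layout.

```python
import json
from typing import Tuple


def count_ready(pods_json: str) -> Tuple[int, int]:
    """Return (ready, total) from `kubectl get pods -o json` output."""
    items = json.loads(pods_json).get("items", [])
    ready = 0
    for pod in items:
        # Pods that were just created may not report any conditions yet.
        conditions = pod.get("status", {}).get("conditions") or []
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            ready += 1
    return ready, len(items)


sample = json.dumps({"items": [
    {"status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    {"status": {}},
]})
ready, total = count_ready(sample)
print(f"{ready}/{total} pods are ready")
```

Keeping the parsing in a pure function like this makes it testable without a cluster, since the kubectl output can be captured once and replayed.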