Complete DevOps Engineer Roadmap
Overview
DevOps engineers bridge development and operations, using automation to make software delivery efficient and reliable. By 2025, much of the field has shifted toward platform engineering: building reusable, self-service infrastructure that lets developers deliver value quickly. GitOps, AI-driven automation, and cloud-native technologies have become mainstream.
Details
Phase 1: Building Foundation (3-6 months)
Linux Fundamentals and System Administration
- Linux Mastery
  - Shell scripting (Bash, Zsh)
  - File systems and permissions
  - Process management and system resources
  - Network basics (TCP/IP, DNS, HTTP/HTTPS)
  - systemd and service management
- Basic Tools
  - Text processing (grep, sed, awk)
  - System monitoring (top, htop, iostat, netstat)
  - Package management (apt, yum, dnf)
  - SSH configuration and security
Programming Skills
- Scripting Languages
  - Python (automation scripts)
  - Go (tool development)
  - JavaScript (Node.js environment)
- Version Control
  - Complete Git understanding (branching strategies, merge, rebase)
  - GitLab/GitHub/Bitbucket
  - Code review best practices
Cloud Fundamentals
- Major Cloud Platforms
  - AWS (EC2, S3, IAM, VPC)
  - Google Cloud Platform (Compute Engine, Cloud Storage)
  - Azure (Virtual Machines, Storage Accounts)
Phase 2: Containers and Orchestration (6-12 months)
Complete Docker Mastery
- Container Basics
  - Dockerfile optimization
  - Multi-stage builds
  - Understanding image layers
  - Security scanning (Trivy, Snyk)
- Practical Docker
  - Docker Compose
  - Volumes and networks
  - Registry management (Docker Hub, Harbor)
  - Image size minimization (distroless, Alpine)
Kubernetes Mastery
- Core Concepts
  - Architecture (Control Plane, Worker Nodes)
  - Basic resources (Pod, Service, Deployment, StatefulSet)
  - Storage (PV, PVC, StorageClass)
  - Networking (Service, Ingress, NetworkPolicy)
- Advanced Features
  - ConfigMap and Secrets management
  - RBAC (Role-Based Access Control)
  - Helm Charts
  - Auto-scaling (HPA, VPA, Cluster Autoscaler)
  - Service Mesh (Istio, Linkerd)
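To make the auto-scaling item concrete, here is a minimal Python sketch of the replica calculation that the Kubernetes documentation describes for the Horizontal Pod Autoscaler: the desired count is the current count scaled by the ratio of the observed metric to the target, rounded up. The function name, the default bounds, and the sample numbers are illustrative assumptions; the real controller also handles missing metrics, pod readiness, and stabilization windows, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA scaling rule for a single metric.

    desired = ceil(current_replicas * current_metric / target_metric),
    skipped when the ratio is within the tolerance band around 1.0,
    then clamped to [min_replicas, max_replicas].
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 3 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(3, 90, 60))
```

With these sample values the ratio is 1.5, so the sketch asks for five replicas; a reading of 62% against the same target falls inside the tolerance band and leaves the count unchanged.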
Infrastructure as Code
- Terraform
  - HCL syntax
  - Module development
  - State management and remote backends
  - Terraform workspaces
  - Provider development
- Other IaC Tools
  - Ansible (configuration management)
  - Pulumi (programmable IaC)
  - CloudFormation (AWS-specific)
Phase 3: CI/CD and GitOps (12-18 months)
CI/CD Pipelines
- GitHub Actions
  - Workflow building
  - Custom action development
  - Matrix builds
  - Security scan integration
  - Deployment automation
- Other CI/CD Tools
  - Jenkins (Pipeline as Code)
  - GitLab CI/CD
  - CircleCI
  - ArgoCD (GitOps)
GitOps Practice
- GitOps Principles
  - Declarative configuration
  - Git as single source of truth
  - Automated deployments
  - Continuous synchronization
- Implementation Patterns
  - ArgoCD/Flux
  - Kustomize
  - Progressive Delivery (Flagger)
  - Rollback strategies
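The "continuous synchronization" principle boils down to repeatedly comparing the state declared in Git with the state observed in the cluster and acting on the difference. The following Python sketch illustrates that comparison only; it is not how ArgoCD or Flux is implemented. The `plan_sync` function, the resource keys, and the dictionary shape are hypothetical.

```python
def plan_sync(desired, live, prune=True):
    """Return the actions needed to move `live` toward `desired`.

    Both arguments map a resource key (for example "deployment/app")
    to its spec. Resources missing from the cluster are created,
    drifted ones are updated, and, when `prune` is set, resources
    that no longer exist in Git are deleted.
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    if prune:
        for name in sorted(live.keys() - desired.keys()):
            actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    git_state = {
        "deployment/app": {"replicas": 3},
        "service/app": {"port": 80},
    }
    cluster_state = {
        "deployment/app": {"replicas": 2},  # drifted from Git
        "configmap/old": {},                # no longer declared in Git
    }
    for action in plan_sync(git_state, cluster_state):
        print(action)
```

A GitOps controller runs a comparison like this in a loop, which is why a manual change in the cluster is reverted on the next sync when self-healing is enabled.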
Monitoring and Observability
- Metrics Collection
  - Prometheus
  - Grafana
  - Alert configuration (Alertmanager)
- Log Management
  - ELK Stack (Elasticsearch, Logstash, Kibana)
  - Fluentd/Fluent Bit
  - Loki
- Distributed Tracing
  - Jaeger
  - Zipkin
  - OpenTelemetry
Phase 4: Advanced DevOps and Platform Engineering (18-24 months)
Security (DevSecOps)
- Container Security
  - Image scanning
  - Runtime security (Falco)
  - Pod Security Standards
  - OPA (Open Policy Agent)
- Secret Management
  - HashiCorp Vault
  - Kubernetes Secrets
  - External Secrets Operator
  - SOPS
- Compliance Automation
  - CIS Benchmarks
  - STIG compliance
  - Audit logging
AI-Driven Automation
- AI and MLOps Integration
  - Self-healing infrastructure
  - Predictive scaling
  - Anomaly detection
  - Intelligent alerting
- ChatOps and AI
  - Slack bot automation
  - Automated incident response
  - AI-assisted troubleshooting
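As a small illustration of the anomaly detection item, the sketch below flags metric samples that deviate strongly from a rolling baseline using a z-score. This is one simple statistical technique chosen for illustration, not a description of any particular product; the function name, window size, threshold, and sample latencies are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples that are far from the recent baseline.

    Each sample is compared with the mean and sample standard deviation
    of the `window` values before it; a z-score above `threshold`
    marks the sample as anomalous.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to judge against.
            continue
        z_score = abs(samples[i] - mean(history)) / sigma
        if z_score > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]
    print(find_anomalies(latency_ms))
```

Production systems usually add seasonality handling and suppress repeated alerts, but the core idea of comparing a new value against a learned baseline is the same.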
Platform Engineering
- Developer Experience
  - Self-service portals
  - Internal Developer Platform (IDP)
  - Golden path provisioning
  - Development environment standardization
- Platform as a Service Building
  - Backstage (developer portal)
  - Crossplane (cloud resource management)
  - Service catalog
  - Cost management and optimization
SRE (Site Reliability Engineering)
- Reliability Principles
  - SLI/SLO/SLA definition
  - Error budgets
  - Toil reduction
  - Post-mortem culture
- Chaos Engineering
  - Chaos Monkey
  - Litmus
  - Failure injection testing
  - Recovery procedure automation
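An error budget is the amount of unreliability an SLO permits: with a 99.9% availability target, 0.1% of requests may fail before the objective is missed. The Python sketch below works through that arithmetic for a request-based SLO. The function name, the returned keys, and the sample traffic figures are illustrative assumptions.

```python
import math


def error_budget(slo_target, total_requests, failed_requests):
    """Compute how much of a request-based error budget has been used.

    slo_target is a fraction such as 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures:
        consumed = failed_requests / allowed_failures
    else:
        # No traffic means no budget to spend.
        consumed = math.inf
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,
    }


if __name__ == "__main__":
    report = error_budget(slo_target=0.999,
                          total_requests=1_000_000,
                          failed_requests=250)
    print(f"Allowed failures: {report['allowed_failures']:.0f}")
    print(f"Budget consumed:  {report['budget_consumed']:.0%}")
```

In this example roughly a thousand failures are allowed and a quarter of the budget is spent; teams commonly slow down risky releases as consumption approaches 100%.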
Advantages and Disadvantages
Advantages
- High demand: DevOps skills remain one of the most sought-after skill sets in 2025
- Broad skills: Gain knowledge across infrastructure, development, and security
- Automation satisfaction: Automate manual tasks and significantly improve efficiency
- Career flexibility: Paths to SRE, Platform Engineer, Cloud Architect roles
- Innovation forefront: Regular exposure to cutting-edge tools and practices
Disadvantages
- Broad learning scope: Need to master diverse technology stacks
- On-call duty: 24/7 operations may require emergency response outside working hours
- Rapid change: Tools and technologies update very quickly
- Heavy responsibility: Infrastructure failures affect the entire system
- Stress: Coordinating and communicating across multiple teams can be demanding
Code Examples
CI/CD Pipeline with GitHub Actions
# .github/workflows/deploy.yml
name: Build and Deploy to Kubernetes

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  # Registry image names must be lowercase; adjust if the repository name is not.
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      security-events: write  # needed to upload SARIF results
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha

      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache,mode=max

      - name: Run Trivy vulnerability scanner
        # Pin to a release tag or commit SHA instead of master for reproducible builds.
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ steps.meta.outputs.version }}
          format: 'sarif'
          output: 'trivy-results.sarif'

      - name: Upload Trivy scan results to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'

  deploy:
    needs: build-and-test
    runs-on: ubuntu-latest
    # Deploy only for pushes to main, never for pull requests.
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Install kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Update kubeconfig
        run: |
          aws eks update-kubeconfig --name production-cluster --region us-east-1

      - name: Deploy to Kubernetes
        run: |
          # The sha-<7 chars> tag matches the "type=sha" tag produced by the metadata step.
          kubectl set image deployment/app app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${GITHUB_SHA::7} -n production
          kubectl rollout status deployment/app -n production
Infrastructure Building with Terraform
# main.tf - EKS cluster construction
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23"
    }
    # Declared because the helm provider is configured below;
    # the nested kubernetes block syntax used there is the 2.x form.
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.11"
    }
  }

  backend "s3" {
    bucket = "terraform-state-bucket"
    key    = "eks/terraform.tfstate"
    region = "us-east-1"
  }
}

# Input variables referenced throughout this file
variable "cluster_name" {
  type = string
}

variable "tags" {
  type    = map(string)
  default = {}
}

# Availability zones in the current region
data "aws_availability_zones" "available" {
  state = "available"
}

# VPC module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "${var.cluster_name}-vpc"
  cidr = "10.0.0.0/16"

  # Three subnets of each kind are defined, so use exactly three zones.
  azs             = slice(data.aws_availability_zones.available.names, 0, 3)
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = false
  enable_dns_hostnames = true

  tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }

  public_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                    = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"           = "1"
  }
}

# EKS cluster
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.15.3"

  cluster_name    = var.cluster_name
  cluster_version = "1.28"

  vpc_id                         = module.vpc.vpc_id
  subnet_ids                     = module.vpc.private_subnets
  cluster_endpoint_public_access = true

  eks_managed_node_group_defaults = {
    instance_types = ["t3.medium"]
  }

  eks_managed_node_groups = {
    main = {
      name           = "node-group-1"
      instance_types = ["t3.medium"]

      min_size     = 2
      max_size     = 10
      desired_size = 3

      pre_bootstrap_user_data = <<-EOT
        echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.conf
        sysctl -p
      EOT
    }
  }

  # OIDC Provider for IRSA
  enable_irsa = true

  # Cluster add-ons
  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent = true
    }
  }

  tags = var.tags
}

# IRSA for ALB Controller
module "load_balancer_controller_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "5.30.0"

  role_name                              = "${var.cluster_name}-aws-load-balancer-controller"
  attach_load_balancer_controller_policy = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:aws-load-balancer-controller"]
    }
  }
}

# Helm provider configuration
provider "helm" {
  kubernetes {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
    }
  }
}

# AWS Load Balancer Controller
resource "helm_release" "aws_load_balancer_controller" {
  name       = "aws-load-balancer-controller"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"
  version    = "1.6.0"

  set {
    name  = "clusterName"
    value = var.cluster_name
  }

  set {
    name  = "serviceAccount.create"
    value = "true"
  }

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = module.load_balancer_controller_irsa.iam_role_arn
  }

  depends_on = [
    module.eks
  ]
}
Kubernetes Manifests and GitOps
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: production

resources:
  - deployment.yaml
  - service.yaml
  - ingress.yaml
  - configmap.yaml
  - hpa.yaml

configMapGenerator:
  - name: app-config
    envs:
      - config.env

# Plain-text secret files should not be committed to Git; in a GitOps
# repository, encrypt them (for example with SOPS) or reference an external store.
secretGenerator:
  - name: app-secrets
    envs:
      - secrets.env

# "patches" replaces the deprecated "patchesStrategicMerge" field.
patches:
  - path: replica-patch.yaml

images:
  - name: myapp
    newName: ghcr.io/myorg/myapp
    newTag: v1.2.3
---
# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-configs
    targetRevision: HEAD
    path: apps/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual changes made in the cluster
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
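The retry settings in the Application manifest describe an exponential backoff: each failed sync waits longer than the previous one, up to a cap. The Python sketch below shows the delays that such a policy would produce for the values used above (5 seconds, factor 2, 3 minute cap, 5 attempts). It assumes the usual interpretation of exponential backoff, where the delay for attempt n is the base duration multiplied by the factor n times; the function name is hypothetical and the exact timing inside Argo CD may differ.

```python
def backoff_schedule(base_seconds, factor, max_seconds, limit):
    """Return the wait time before each retry under capped exponential backoff."""
    return [
        min(base_seconds * factor ** attempt, max_seconds)
        for attempt in range(limit)
    ]


if __name__ == "__main__":
    # Values taken from the manifest: duration 5s, factor 2, maxDuration 3m, limit 5.
    for attempt, delay in enumerate(backoff_schedule(5, 2, 180, 5), start=1):
        print(f"retry {attempt}: wait {delay}s")
```

With five attempts the cap is never reached (the longest wait is 80 seconds); it would only take effect from the seventh retry onward.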
Monitoring Configuration with Prometheus
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    rule_files:
      - /etc/prometheus/rules/*.yml

    scrape_configs:
      # API server: keep only the default/kubernetes service's https endpoint
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https

      # Nodes: copy Kubernetes node labels onto the scraped series
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)

      # Pods: scrape only pods annotated with prometheus.io/scrape: "true"
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
---
# alert-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
data:
  node-alerts.yml: |
    groups:
      - name: node-alerts
        interval: 30s
        rules:
          - alert: NodeCPUHigh
            # rate() is used instead of irate(): irate reacts to the last two
            # samples only, which makes alert conditions flap.
            expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Node CPU usage is high"
              description: "CPU usage is above 80% (current value: {{ $value }}%)"

          - alert: NodeMemoryHigh
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Node memory usage is high"
              description: "Memory usage is above 85% (current value: {{ $value }}%)"
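The two alert expressions are easier to reason about with concrete numbers. The Python sketch below reproduces their arithmetic outside Prometheus: CPU usage is derived from how fast the idle-seconds counter grows, and memory usage from the ratio of available to total bytes. The function names and sample readings are illustrative assumptions, and the CPU function models a single counter series, whereas the rule averages across all cores of an instance.

```python
def cpu_usage_percent(idle_start, idle_end, interval_seconds):
    """Mirror `100 - idle_rate * 100` for one idle-seconds counter.

    The per-second increase of the idle counter is the fraction of
    time the CPU spent idle during the interval.
    """
    idle_fraction = (idle_end - idle_start) / interval_seconds
    return 100 - idle_fraction * 100


def memory_usage_percent(available_bytes, total_bytes):
    """Mirror `(1 - MemAvailable / MemTotal) * 100`."""
    return (1 - available_bytes / total_bytes) * 100


if __name__ == "__main__":
    # Idle counter grew by 45 seconds over a 5 minute (300 second) window.
    print(f"CPU:    {cpu_usage_percent(1000.0, 1045.0, 300):.1f}%")
    # 2 GiB available out of 16 GiB.
    print(f"Memory: {memory_usage_percent(2 * 2**30, 16 * 2**30):.1f}%")
```

Both sample readings cross their thresholds (80% and 85%), so each alert would fire once the condition has held for the five minutes given in the rule's `for` clause.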
Automation with Shell Scripts
#!/bin/bash
# deploy.sh - Automated deployment script
set -euo pipefail

# Variable definitions (override through the environment)
NAMESPACE="${NAMESPACE:-production}"
APP_NAME="${APP_NAME:-myapp}"
IMAGE_TAG="${IMAGE_TAG:-latest}"
TIMEOUT="${TIMEOUT:-300}"

# Function definitions
log() {
  echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"
}

error() {
  log "ERROR: $*" >&2
  exit 1
}

# Prerequisites check
check_prerequisites() {
  log "Checking prerequisites..."
  command -v kubectl >/dev/null 2>&1 || error "kubectl is not installed"
  command -v helm >/dev/null 2>&1 || error "helm is not installed"
  command -v jq >/dev/null 2>&1 || error "jq is not installed"  # used by health_check
  kubectl cluster-info >/dev/null 2>&1 || error "Cannot connect to Kubernetes cluster"
  log "Prerequisites check passed"
}

# Execute deployment
deploy_application() {
  log "Starting deployment of ${APP_NAME} with tag ${IMAGE_TAG}..."

  # Create namespace (if not exists)
  kubectl create namespace "${NAMESPACE}" --dry-run=client -o yaml | kubectl apply -f -

  # Update ConfigMaps and Secrets
  kubectl apply -f configs/ -n "${NAMESPACE}"

  # Deploy with Helm chart
  helm upgrade --install "${APP_NAME}" ./charts/"${APP_NAME}" \
    --namespace "${NAMESPACE}" \
    --set image.tag="${IMAGE_TAG}" \
    --set replicaCount=3 \
    --wait \
    --timeout="${TIMEOUT}s"

  log "Deployment completed successfully"
}

# Health check
health_check() {
  local pods_json ready_pods total_pods
  log "Performing health check..."

  # Check deployment status
  kubectl rollout status deployment/"${APP_NAME}" -n "${NAMESPACE}" --timeout="${TIMEOUT}s"

  # Check pod readiness (assumes the chart labels pods with app=<name>)
  pods_json=$(kubectl get pods -n "${NAMESPACE}" -l app="${APP_NAME}" -o json)
  # "[]?" tolerates pods that have no conditions yet
  ready_pods=$(jq '[.items[] | select(any(.status.conditions[]?; .type == "Ready" and .status == "True"))] | length' <<<"${pods_json}")
  total_pods=$(jq '.items | length' <<<"${pods_json}")

  if [[ "${ready_pods}" -ne "${total_pods}" ]]; then
    error "Not all pods are ready: ${ready_pods}/${total_pods}"
  fi

  log "Health check passed: ${ready_pods}/${total_pods} pods are ready"
}

# Main process
main() {
  log "Deployment script started"
  check_prerequisites
  deploy_application
  health_check
  log "Deployment completed successfully!"
}

# Execute script
main "$@"
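The readiness check in the script counts pods whose `Ready` condition is `True`. For readers more comfortable with Python than jq, the sketch below performs the same count on the JSON that `kubectl get pods -o json` returns. The function name and the sample payload are hypothetical; only the `items`, `status.conditions`, `type`, and `status` fields of the pod list are relied on.

```python
import json


def count_ready_pods(pod_list):
    """Return (ready, total) for a parsed `kubectl get pods -o json` document."""
    items = pod_list.get("items", [])
    ready = 0
    for pod in items:
        # Pods that have not been scheduled yet may have no conditions at all.
        conditions = pod.get("status", {}).get("conditions") or []
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            ready += 1
    return ready, len(items)


if __name__ == "__main__":
    sample = json.loads("""
    {"items": [
      {"status": {"conditions": [{"type": "Ready", "status": "True"}]}},
      {"status": {"conditions": [{"type": "Ready", "status": "False"}]}},
      {"status": {}}
    ]}
    """)
    ready, total = count_ready_pods(sample)
    print(f"{ready}/{total} pods are ready")
```

In a real pipeline the JSON would come from running kubectl, for example through `subprocess.run`, rather than from an inline string.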