DevOps and Cloud Interview Questions 2025: AWS, Docker, Kubernetes, CI/CD, and Infrastructure as Code
An interviewer asks: "Our deployment takes 2 hours and fails 30% of the time. How would you fix it?" A candidate responds: "I'd set up a CI/CD pipeline." That's where most people stop. The interviewer follows up: "Which tools? What stages? How do you handle rollbacks?" Silence.
DevOps interviews test whether you can actually ship reliable software, not just use buzzwords. You need to understand containers, orchestration, cloud services, infrastructure as code, and most importantly—how these pieces fit together to create reliable deployments. Here's everything you need to know.
What DevOps Interviews Actually Test
Companies hiring DevOps engineers want to know:
- Can you build reliable deployments? Manual deployments don't scale.
- Do you understand cloud infrastructure? Modern apps run on AWS/GCP/Azure.
- Can you debug production issues? Monitoring, logging, and troubleshooting.
- Do you automate repetitive tasks? Scripts, IaC, and CI/CD pipelines.
- Can you make cost-effective decisions? Cloud costs matter.
Docker: Containerization Fundamentals
Basic Concepts
Question: "What is Docker and why do we use it?"
Strong answer: "Docker packages applications with their dependencies into containers—isolated, lightweight environments that run consistently across development, staging, and production. It solves the 'works on my machine' problem.
Unlike VMs which virtualize hardware, containers share the host OS kernel, making them much lighter:
- VM: Includes full OS (GBs), slow to start (minutes)
- Container: Just app + dependencies (MBs), starts in seconds
Benefits:
- Consistency across environments
- Easy scaling (spin up/down containers quickly)
- Isolation (dependencies don't conflict)
- Efficient resource usage (vs VMs)"
Dockerfile Best Practices
Question: "Write an optimized Dockerfile for a Node.js application."
Bad example:
FROM node:18
COPY . /app
WORKDIR /app
RUN npm install
CMD ["node", "server.js"]
Problems:
- No layer caching optimization
- Installs dev dependencies
- Runs as root user (security risk)
- Large base image
Good example:
# Use Alpine Linux for smaller image (18-alpine vs 18 = 100MB vs 900MB)
FROM node:18-alpine AS builder
# Set working directory
WORKDIR /app
# Copy package files first (layer caching)
# Only re-runs npm install if package.json changes
COPY package*.json ./
# Install production dependencies only (npm ci installs from the lockfile;
# --omit=dev replaces the deprecated --only=production flag)
RUN npm ci --omit=dev
# Copy application code (add node_modules to .dockerignore so the host's
# local install doesn't leak into the image)
COPY . .
# Multi-stage build - final image doesn't include build tools
FROM node:18-alpine
WORKDIR /app
# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
adduser -S nodejs -u 1001
# Copy built app from builder stage
COPY --from=builder --chown=nodejs:nodejs /app .
# Switch to non-root user
USER nodejs
# Expose port
EXPOSE 3000
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s \
CMD node healthcheck.js || exit 1
# Start application
CMD ["node", "server.js"]
Key optimizations:
- Multi-stage build reduces final image size
- Layer caching (package.json copied separately)
- Non-root user (security)
- Alpine base image (smaller size)
- Health check for container orchestration
- npm ci instead of npm install (faster, reproducible installs from package-lock.json)
Docker Commands
Question: "Explain essential Docker commands and when to use them."
# Build image
docker build -t myapp:v1.0 .
docker build -t myapp:latest --no-cache . # Force rebuild
# Run container. Note: a comment after "\" breaks shell line continuation,
# so the flags are explained here instead:
#   -d      detached mode (background)    --env-file  environment variables
#   -p      port mapping (host:container) --restart   auto-restart policy
#   --name  container name                -v          volume mount
docker run -d \
  -p 3000:3000 \
  --name myapp \
  --env-file .env \
  --restart unless-stopped \
  -v "$(pwd)/data:/app/data" \
  myapp:latest
# List containers
docker ps # Running containers
docker ps -a # All containers (including stopped)
# View logs
docker logs myapp # View logs
docker logs -f myapp # Follow logs (like tail -f)
docker logs --tail 100 myapp # Last 100 lines
# Execute command in container
docker exec -it myapp bash # Interactive shell
docker exec myapp ls /app # Run command
# Stop/remove
docker stop myapp # Graceful stop
docker kill myapp # Force stop
docker rm myapp # Remove container
docker rmi myapp:latest # Remove image
# System cleanup
docker system prune -a # Remove unused containers/images/networks
docker volume prune # Remove unused volumes
# Inspect container
docker inspect myapp # Full container details
docker stats myapp # Resource usage (CPU, memory, network)
Docker Compose for Multi-Container Apps
Question: "Set up a web app with database and cache using Docker Compose."
# docker-compose.yml
version: '3.8'

services:
  # Application server
  app:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: production
      DATABASE_URL: postgres://user:password@db:5432/myapp
      REDIS_URL: redis://cache:6379
    depends_on:
      - db
      - cache
    volumes:
      - ./uploads:/app/uploads
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  # PostgreSQL database
  db:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: myapp
    volumes:
      - postgres-data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    ports:
      - "5432:5432"
    restart: unless-stopped

  # Redis cache
  cache:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis-data:/data
    restart: unless-stopped

  # Nginx reverse proxy
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
    depends_on:
      - app
    restart: unless-stopped

volumes:
  postgres-data:
  redis-data:

networks:
  default:
    driver: bridge
Commands (Compose v2 uses "docker compose"; the older "docker-compose" binary works the same way):
docker-compose up -d # Start all services
docker-compose down # Stop and remove all services
docker-compose logs -f app # Follow logs for app service
docker-compose exec app sh # Shell into app container
docker-compose restart app # Restart specific service
docker-compose ps # List services
Kubernetes: Container Orchestration
Core Concepts
Question: "Explain Kubernetes architecture and key components."
"Kubernetes orchestrates containers across multiple servers (nodes). Key components:
Control Plane (Master):
- API Server: Frontend for Kubernetes. All commands go through it
- etcd: Key-value store for cluster state
- Scheduler: Assigns pods to nodes based on resources
- Controller Manager: Maintains desired state (replicas, endpoints, etc.)
Worker Nodes:
- Kubelet: Agent that runs containers and reports status
- Container Runtime: Docker/containerd that actually runs containers
- Kube-proxy: Network proxy for service communication
Key Resources:
- Pod: Smallest unit, contains 1+ containers
- Deployment: Manages replica sets and rolling updates
- Service: Stable network endpoint for pods
- ConfigMap/Secret: Configuration and sensitive data
- Ingress: HTTP/HTTPS routing
- PersistentVolume: Storage that outlives pods"
Kubernetes Manifests
Question: "Deploy a web application with database on Kubernetes."
Deployment (app):
# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 3                  # Run 3 pods for high availability
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate        # Zero-downtime updates
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry/myapp:v1.0
          ports:
            - containerPort: 3000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: database-url
            - name: NODE_ENV
              value: "production"
          resources:
            requests:          # Minimum guaranteed
              memory: "256Mi"
              cpu: "250m"
            limits:            # Maximum allowed
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:       # Restart if fails
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:      # Remove from load balancer if fails
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
Service (load balancer):
# app-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  type: LoadBalancer     # Creates external load balancer
  selector:
    app: myapp           # Routes to pods with this label
  ports:
    - protocol: TCP
      port: 80           # External port
      targetPort: 3000   # Container port
ConfigMap (configuration):
# app-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  app.conf: |
    log_level=info
    max_connections=100
  feature_flags.json: |
    {
      "new_feature": true,
      "beta_mode": false
    }
Secret (sensitive data):
# app-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
data:
  database-url: cG9zdGdyZXM6Ly8uLi4=   # base64 encoded
  api-key: c2VjcmV0a2V5MTIz            # base64 encoded
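The values under data: can be produced in the shell. Note the -n flag, which keeps echo from appending a newline to the encoded secret; in practice, kubectl can do the encoding for you:

```shell
# Encode a value for the data: section of a Secret manifest
echo -n 'secretkey123' | base64   # → c2VjcmV0a2V5MTIz

# Or let kubectl build the Secret directly (keeps encoded secrets out of Git):
# kubectl create secret generic app-secrets \
#   --from-literal=api-key='secretkey123'
```

Remember that base64 is an encoding, not encryption: anyone who can read the manifest can decode it.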
Ingress (HTTP routing):
# app-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/limit-rps: "100"   # rate limit (requests/second)
spec:
  tls:
    - hosts:
        - myapp.com
      secretName: myapp-tls
  rules:
    - host: myapp.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-service
                port:
                  number: 80
Commands:
# Apply manifests
kubectl apply -f app-deployment.yaml
kubectl apply -f app-service.yaml
# Get resources
kubectl get pods
kubectl get deployments
kubectl get services
# Describe (detailed info)
kubectl describe pod myapp-xxx
# Logs
kubectl logs myapp-xxx
kubectl logs -f myapp-xxx --tail=100
# Execute command in pod
kubectl exec -it myapp-xxx -- /bin/sh
# Scale deployment
kubectl scale deployment myapp --replicas=5
# Update image (rolling update)
kubectl set image deployment/myapp myapp=myregistry/myapp:v2.0
# Rollback
kubectl rollout undo deployment/myapp
# Delete resources
kubectl delete -f app-deployment.yaml
kubectl delete pod myapp-xxx
AWS: Cloud Services
Core AWS Services
Question: "Design a highly available web application on AWS."
Architecture:
                 Internet
                    │
         ┌──────────┴──────────┐
    Route 53 (DNS)      CloudFront (CDN)
                               │
                        S3 (static assets)

         Application Load Balancer
                    │
       ┌────────────┼────────────┐
  EC2 (AZ-1)   EC2 (AZ-2)   EC2 (AZ-3)
       └────────────┼────────────┘
           ┌────────┴────────┐
     RDS Primary        ElastiCache
     (Multi-AZ)           (Redis)
           │
      RDS Standby
Key services explained:
Compute:
- EC2: Virtual servers, full control
- Lambda: Serverless functions, pay per execution
- ECS/EKS: Container orchestration (Docker/Kubernetes)
- Elastic Beanstalk: PaaS, deploys apps automatically
Storage:
- S3: Object storage (images, videos, backups)
- EBS: Block storage for EC2 (like hard drives)
- EFS: Shared file system across EC2 instances
- Glacier: Archive storage (cheap, slow retrieval)
Database:
- RDS: Managed relational databases (PostgreSQL, MySQL)
- DynamoDB: NoSQL, serverless, auto-scaling
- ElastiCache: Managed Redis/Memcached
- Redshift: Data warehousing (analytics)
Networking:
- VPC: Virtual private network, isolate resources
- Route 53: DNS service
- CloudFront: CDN, caches content globally
- API Gateway: Create/manage REST APIs
Monitoring/Logging:
- CloudWatch: Metrics, logs, alarms
- CloudTrail: Audit logs (who did what)
- X-Ray: Distributed tracing
EC2 and Auto Scaling
Question: "Set up auto-scaling for a web application."
Auto Scaling Group configuration:
# Create launch template
aws ec2 create-launch-template \
--launch-template-name web-app-template \
--version-description "v1" \
--launch-template-data '{
"ImageId": "ami-0abcdef1234567890",
"InstanceType": "t3.medium",
"KeyName": "my-key-pair",
"SecurityGroupIds": ["sg-0abc123"],
"UserData": "base64-encoded-startup-script",
"IamInstanceProfile": {
"Name": "EC2-S3-Access-Role"
}
}'
# Create Auto Scaling Group
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name web-app-asg \
--launch-template "LaunchTemplateName=web-app-template,Version=1" \
--min-size 2 \
--max-size 10 \
--desired-capacity 3 \
--target-group-arns arn:aws:elasticloadbalancing:... \
--vpc-zone-identifier "subnet-abc,subnet-def,subnet-ghi" \
--health-check-type ELB \
--health-check-grace-period 300
# Create scaling policies
# Scale up when CPU > 70%
aws autoscaling put-scaling-policy \
--auto-scaling-group-name web-app-asg \
--policy-name scale-up \
--scaling-adjustment 2 \
--adjustment-type ChangeInCapacity \
--cooldown 300
# CloudWatch alarm to trigger scale-up
aws cloudwatch put-metric-alarm \
--alarm-name cpu-high \
--alarm-description "Scale up when CPU exceeds 70%" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 70 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:autoscaling:...
IAM (Identity and Access Management)
Question: "Explain IAM policies and best practices."
IAM Policy (JSON):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::my-bucket/*"
},
{
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::my-bucket"
},
{
"Effect": "Deny",
"Action": "s3:DeleteBucket",
"Resource": "*"
}
]
}
Best practices:
- Principle of least privilege: Only grant permissions needed
- Use roles for EC2/Lambda: Don't embed credentials in code
- Enable MFA for privileged users
- Rotate access keys regularly
- Use policy conditions for additional security:
{
"Effect": "Allow",
"Action": "s3:*",
"Resource": "*",
"Condition": {
"IpAddress": {
"aws:SourceIp": "203.0.113.0/24" // Only from office IP
},
"DateGreaterThan": {
"aws:CurrentTime": "2025-01-01T00:00:00Z"
}
}
}
CI/CD Pipelines
Question: "Design a CI/CD pipeline for a Node.js application."
GitHub Actions example:
# .github/workflows/deploy.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY: myapp
  ECS_SERVICE: myapp-service
  ECS_CLUSTER: production

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linter
        run: npm run lint

      - name: Run tests
        run: npm test

      - name: Check code coverage
        run: npm run coverage
        env:
          COVERAGE_THRESHOLD: 80

  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Build, tag, and push image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG \
            $ECR_REGISTRY/$ECR_REPOSITORY:latest
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Deploy to ECS
        run: |
          aws ecs update-service \
            --cluster ${{ env.ECS_CLUSTER }} \
            --service ${{ env.ECS_SERVICE }} \
            --force-new-deployment

      - name: Wait for deployment
        run: |
          aws ecs wait services-stable \
            --cluster ${{ env.ECS_CLUSTER }} \
            --services ${{ env.ECS_SERVICE }}

      - name: Notify Slack
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          text: 'Deployment to production completed!'
          webhook_url: ${{ secrets.SLACK_WEBHOOK }}
        if: always()
Pipeline stages explained:
- Test: Run linter, unit tests, integration tests
- Build: Create Docker image, push to registry
- Deploy: Update ECS service, wait for stability
- Notify: Alert team via Slack/email
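The interviewer's follow-up at the top ("How do you handle rollbacks?") deserves an answer here. Since each image is tagged with the commit SHA, rolling back on ECS can mean pointing the service at an earlier task definition revision. A hypothetical, manually-triggered workflow sketch (the task definition family name myapp-service and the cluster name are assumptions carried over from the pipeline above):

```yaml
# .github/workflows/rollback.yml (hypothetical sketch)
name: Rollback
on:
  workflow_dispatch:
    inputs:
      revision:
        description: 'Task definition revision to roll back to'
        required: true

jobs:
  rollback:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Point the service at an earlier task definition
        run: |
          aws ecs update-service \
            --cluster production \
            --service myapp-service \
            --task-definition myapp-service:${{ github.event.inputs.revision }} \
            --force-new-deployment
```

Being able to describe a concrete rollback path like this is exactly the depth the opening question probes for.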
Infrastructure as Code (Terraform)
Question: "Use Terraform to provision AWS infrastructure."
# main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket = "mycompany-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" {
  region = var.aws_region
}

# Availability zones referenced by the subnets below
data "aws_availability_zones" "available" {
  state = "available"
}

# VPC
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "production-vpc"
    Environment = "production"
  }
}

# Subnets
resource "aws_subnet" "public" {
  count                   = 3
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.${count.index}.0/24"
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "public-subnet-${count.index + 1}"
  }
}

# Security Group
resource "aws_security_group" "web" {
  name        = "web-sg"
  description = "Security group for web servers"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# RDS Database
resource "aws_db_instance" "postgres" {
  identifier        = "myapp-db"
  engine            = "postgres"
  engine_version    = "15.3"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
  storage_type      = "gp3"

  db_name  = "myapp"
  username = var.db_username
  password = var.db_password

  multi_az                  = true
  backup_retention_period   = 7
  skip_final_snapshot       = false
  final_snapshot_identifier = "myapp-final-snapshot"

  # aws_security_group.db and aws_db_subnet_group.main are defined elsewhere
  vpc_security_group_ids = [aws_security_group.db.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  tags = {
    Environment = "production"
  }
}

# Outputs
output "db_endpoint" {
  value       = aws_db_instance.postgres.endpoint
  description = "Database endpoint"
  sensitive   = true
}
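The var.aws_region, var.db_username, and var.db_password references in main.tf assume a variables.tf alongside it; a minimal sketch:

```hcl
# variables.tf (assumed by main.tf above)
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "db_username" {
  type      = string
  sensitive = true
}

variable "db_password" {
  type      = string
  sensitive = true   # supply via TF_VAR_db_password or a tfvars file kept out of Git
}
```

Marking credentials sensitive keeps them out of plan output; the values themselves should come from environment variables (TF_VAR_*) or a secret store, never from committed files.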
Commands:
terraform init # Initialize Terraform
terraform plan # Preview changes
terraform apply # Apply changes
terraform destroy # Destroy infrastructure
terraform fmt # Format code
terraform validate # Validate configuration
terraform show # Show current state
terraform output db_endpoint # Show specific output
Monitoring and Logging
Question: "Set up monitoring for production application."
CloudWatch Metrics + Alarms:
# Create custom metric
aws cloudwatch put-metric-data \
--namespace MyApp \
--metric-name OrdersPerMinute \
--value 42 \
--timestamp $(date -u +"%Y-%m-%dT%H:%M:%SZ")
# Create alarm
aws cloudwatch put-metric-alarm \
--alarm-name high-error-rate \
--alarm-description "Alert when error rate exceeds 5%" \
--metric-name ErrorRate \
--namespace MyApp \
--statistic Average \
--period 300 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789:alerts
Prometheus + Grafana (Kubernetes):
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
ELK Stack for logging:
# filebeat-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
data:
  filebeat.yml: |
    filebeat.inputs:
      - type: container
        paths:
          - /var/log/containers/*.log
        processors:
          - add_kubernetes_metadata:
              host: ${NODE_NAME}
              matchers:
                - logs_path:
                    logs_path: "/var/log/containers/"
    output.elasticsearch:
      hosts: ['${ELASTICSEARCH_HOST}:${ELASTICSEARCH_PORT}']
      username: ${ELASTICSEARCH_USERNAME}
      password: ${ELASTICSEARCH_PASSWORD}
Common Interview Questions
"Explain blue-green deployment vs canary deployment."
Blue-Green:
Blue environment (current): v1.0 (100% traffic)
Green environment (new): v2.0 (0% traffic)
1. Deploy v2.0 to Green
2. Test Green environment
3. Switch traffic: Blue 0% → Green 100%
4. Keep Blue as backup for quick rollback
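In Kubernetes, blue-green is often modeled as two Deployments plus one Service whose selector is flipped. A hypothetical sketch (the version label and names are illustrative, not from the manifests earlier):

```yaml
# blue-green-service.yaml (hypothetical): the Service selects pods by version label
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  selector:
    app: myapp
    version: blue      # flip to "green" to cut all traffic over at once
  ports:
    - port: 80
      targetPort: 3000
```

Flipping the selector is a one-line change, e.g. `kubectl patch service myapp-service -p '{"spec":{"selector":{"version":"green"}}}'`, and keeping the blue Deployment running makes rollback equally instant.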
Canary:
Current: v1.0 (95% traffic)
Canary: v2.0 (5% traffic)
1. Deploy v2.0 to small subset
2. Monitor metrics (errors, latency)
3. Gradually increase: 5% → 25% → 50% → 100%
4. Rollback if issues detected
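With the NGINX ingress controller, the canary split can be expressed declaratively with canary annotations. A sketch, assuming a second Service (myapp-service-v2, a hypothetical name) fronting the v2.0 pods:

```yaml
# canary-ingress.yaml (hypothetical): sends 5% of traffic to v2.0
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # raise gradually: 5 → 25 → 50 → 100
spec:
  rules:
    - host: myapp.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-service-v2   # assumed Service for the v2.0 pods
                port:
                  number: 80
```

Increasing canary-weight (and watching error rates and latency between steps) implements the gradual rollout described above; deleting the canary Ingress is the rollback.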
"How do you handle secrets in production?"
Never commit secrets to Git. Use:
- AWS Secrets Manager/Parameter Store: Encrypted, rotatable, audit logs
- HashiCorp Vault: Centralized secret management
- Kubernetes Secrets: base64 encoded, not encrypted by default (enable etcd encryption at rest and restrict access with RBAC)
- Environment variables: From secret stores, not hardcoded
Example with AWS Secrets Manager:
# Store secret
aws secretsmanager create-secret \
--name prod/database/password \
--secret-string "super-secure-password"
# Retrieve in application
aws secretsmanager get-secret-value \
--secret-id prod/database/password \
--query SecretString --output text
How to Prepare
1. **Get hands-on experience:** Set up a personal project with Docker, Kubernetes, and CI/CD
2. **Use free tiers:** AWS, GCP, Azure all have free tiers for learning
3. **Practice on [Vibe Interviews](https://vibeinterviews.com):** Get DevOps questions with instant feedback
4. **Learn IaC:** Write Terraform for AWS resources
5. **Study production architectures:** Read engineering blogs from Netflix, Airbnb, Uber
6. **Understand monitoring:** Set up Prometheus/Grafana or CloudWatch
7. **Practice troubleshooting:** Intentionally break things and fix them
DevOps interviews test your ability to ship reliable, scalable systems. Master containerization, orchestration, cloud services, and automation—you'll stand out whether you're interviewing for DevOps, SRE, or platform engineering roles.
Vibe Interviews Team
Part of the Vibe Interviews team, dedicated to helping job seekers ace their interviews and land their dream roles.