Docker, Kubernetes & AWS
Containers, orchestration, and cloud architecture — from fundamentals to production patterns with trade-offs and bottlenecks.
Docker — Containers & Images
Container vs VM — Differences & Trade-offs
Must Know
| | Virtual Machine (VM) | Container |
|---|---|---|
| Isolation | Full OS isolation (hypervisor) | Process isolation (shared OS kernel) |
| Boot time | Minutes | Milliseconds |
| Size | GBs (full OS image) | MBs (only app + libs) |
| Security | Strong (separate kernel) | Weaker (shared kernel, namespaces) |
| Overhead | High (hypervisor + full OS) | Near-zero (host kernel) |
| Portability | Slower (large images) | Fast (OCI images, Docker Hub) |
| Use case | Multi-OS, legacy apps, full isolation | Microservices, CI/CD, cloud-native |
VM Stack: Container Stack:
┌─────────────────────┐ ┌─────────────────────┐
│ Application │ │ App A │ App B │ App C │
├─────────────────────┤ ├───────┴───────┴───────┤
│ Guest OS │ │ Container Runtime │
├─────────────────────┤ │ (Docker / containerd)│
│ Hypervisor │ ├───────────────────────┤
├─────────────────────┤ │ Host OS / Kernel │
│ Host Hardware │ ├───────────────────────┤
└─────────────────────┘ │ Host Hardware │
└───────────────────────┘
Docker uses: Linux namespaces (PID, net, mnt, uts, ipc)
cgroups (CPU, memory, I/O limits)
Union filesystem (OverlayFS — layer-based)
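A minimal sketch of those cgroup limits using the Docker SDK for Python (docker-py, pip install docker); the image and limit values here are arbitrary examples:
import docker

client = docker.from_env()

# Run a container with explicit cgroup limits (memory, CPU share, PID count)
container = client.containers.run(
    "alpine:3.19", ["sleep", "60"],
    detach=True,
    mem_limit="256m",        # cgroup memory limit; exceeding it gets the process OOM-killed
    nano_cpus=500_000_000,   # 0.5 CPU (cgroup CPU quota)
    pids_limit=100,          # cap process count inside the PID namespace
)
print(container.attrs["HostConfig"]["Memory"])  # 268435456 bytes
container.stop()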
Docker Images — Layers, Union FS, Build Cache
Docker
Union Filesystem (OverlayFS)
Image = stack of read-only layers. Container adds a writable layer on top. Layers are shared across images (same base layer stored once). Changes are copy-on-write to the writable layer.
Image Layers (bottom to top):
┌─────────────────────────┐ ← Writable container layer
├─────────────────────────┤ ← COPY app/ . (your code)
├─────────────────────────┤ ← RUN pip install -r requirements.txt
├─────────────────────────┤ ← COPY requirements.txt .
├─────────────────────────┤ ← RUN apt-get install python3
└─────────────────────────┘ ← FROM ubuntu:22.04 (base)
Cache Invalidation: if layer N changes, all layers N+1, N+2... rebuild!
→ Put rarely-changing things early (base, dependencies)
→ Put frequently-changing things late (your application code)
# Optimized Production Dockerfile (multi-stage)
# Stage 1: Build
FROM maven:3.9-eclipse-temurin-21 AS builder
WORKDIR /app
COPY pom.xml .
RUN mvn dependency:go-offline -q # cache deps layer separately
COPY src ./src
RUN mvn package -DskipTests -q
# Stage 2: Runtime (minimal image: JRE not JDK, alpine not ubuntu)
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
# Non-root user for security
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser
COPY --from=builder /app/target/app.jar ./app.jar
EXPOSE 8080
# Use exec form (not shell form) for proper signal handling
ENTRYPOINT ["java", "-XX:+UseContainerSupport", "-jar", "app.jar"]
Bottlenecks & Anti-patterns:
Running as root in container (security risk)
Large base images (ubuntu:latest = 77MB vs alpine = 7MB)
Storing secrets in image layers (they persist even if deleted later)
COPY . . first → any file change rebuilds all layers
RUN apt-get update && apt-get install... in separate layers (stale cache)
Docker Networking — Bridge, Host, Overlay
Networking
| Network Mode | How | Use Case | Trade-off |
|---|---|---|---|
| bridge (default) | Private virtual network on host; containers communicate by name within network | Single host multi-container apps | Must publish ports (-p 8080:80) to host |
| host | Container uses host's network namespace directly | Performance-critical, low latency | No isolation, port conflicts |
| none | No network | Fully isolated batch jobs | No connectivity |
| overlay | Multi-host virtual network (Docker Swarm/K8s) | Cross-host container communication | Encapsulation overhead (VXLAN) |
| macvlan | Container gets its own MAC/IP on physical network | Legacy apps needing real network presence | Requires promiscuous mode on NIC |
Container DNS
On user-defined bridge networks, Docker runs a built-in DNS server. Containers resolve each other by container name, e.g. the web service can curl http://db:5432 — Docker resolves "db" to the database container's IP automatically.
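A small sketch of that name resolution with the Docker SDK for Python; the network and container names are illustrative:
import docker

client = docker.from_env()
client.networks.create("app-net", driver="bridge")   # user-defined bridge => embedded DNS
client.containers.run("postgres:16-alpine", name="db", network="app-net",
                      environment={"POSTGRES_PASSWORD": "secret"}, detach=True)
# "db" resolves to the postgres container's IP via Docker's built-in DNS
output = client.containers.run("alpine:3.19", ["nslookup", "db"],
                               network="app-net", remove=True)
print(output.decode())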
Docker Compose — Multi-Container Apps
Docker · Dev
# docker-compose.yml — production-grade example
version: '3.9'
services:
api:
build: ./api
image: myapp/api:latest
ports:
- "8080:8080"
environment:
- DB_HOST=postgres
- REDIS_URL=redis://cache:6379
- KAFKA_BROKERS=kafka:9092
depends_on:
postgres: { condition: service_healthy } # wait until healthy
cache: { condition: service_started }
restart: unless-stopped
deploy:
resources:
limits: { cpus: '0.5', memory: 512M }
reservations: { memory: 256M }
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 10s
timeout: 5s
retries: 3
postgres:
image: postgres:16-alpine
volumes:
- pg_data:/var/lib/postgresql/data # named volume: persists data
environment:
POSTGRES_DB: mydb
POSTGRES_USER: user
POSTGRES_PASSWORD_FILE: /run/secrets/pg_password # use secrets not env
healthcheck:
test: ["CMD-SHELL", "pg_isready -U user -d mydb"]
interval: 5s
retries: 5
cache:
image: redis:7-alpine
command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
kafka:
image: confluentinc/cp-kafka:7.5.0
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
volumes:
pg_data: # named volume managed by Docker (survives container restart)
# docker-compose up -d -- start in background
# docker-compose logs -f api -- tail service logs
# docker compose up -d --scale api=3  -- run 3 instances of api
Docker Volumes — Bind Mounts, Named Volumes, tmpfs
Storage
| Type | Storage Location | Lifecycle | Use Case |
|---|---|---|---|
| Named Volume | Docker-managed (/var/lib/docker/volumes) | Persists until explicitly deleted | DB data, app data, production persistence |
| Bind Mount | Specific host path (e.g. ./src:/app/src) | Exists as long as host path exists | Dev mode: hot reload, config injection |
| tmpfs | Host RAM only | Container lifetime only | Sensitive data (secrets), high-speed temp |
# Named volume
docker run -v myapp_data:/var/lib/mysql mysql:8
# Bind mount (dev hot reload)
docker run -v $(pwd)/src:/app/src -v $(pwd)/node_modules:/app/node_modules node:20
# tmpfs (never written to disk)
docker run --tmpfs /tmp:size=100m,mode=1777 myapp
# Inspect volumes
docker volume ls
docker volume inspect myapp_data
# Backup volume
docker run --rm -v myapp_data:/source -v $(pwd):/backup alpine \
tar czf /backup/backup.tar.gz -C /source .
Kubernetes — Container Orchestration
Kubernetes Architecture — Control Plane & Nodes
Must Know
Kubernetes Cluster:
┌──────────────────────────────────────────────────────────────┐
│ Control Plane │
│ ├── kube-apiserver → REST API; all components talk here │
│ ├── etcd → Distributed KV store (cluster state)│
│ ├── kube-scheduler → Assigns pods to nodes │
│ ├── kube-controller-manager → Reconciliation loops │
│ └── cloud-controller → Cloud provider integration │
├──────────────────────────────────────────────────────────────┤
│ Worker Node 1 Worker Node 2 │
│ ├── kubelet (agent) ├── kubelet │
│ ├── kube-proxy (iptables) ├── kube-proxy │
│ ├── containerd ├── containerd │
│ ├── Pod A: [Container1] ├── Pod C: [Container1] │
│ └── Pod B: [Container1] └── Pod D: [Container1, C2] │
└──────────────────────────────────────────────────────────────┘
| Component | Role |
|---|---|
| etcd | Source of truth for all cluster state. Must be HA (3+ nodes). Loss = cluster state loss. |
| kube-apiserver | All requests go through here. Authentication, admission control, validation. |
| kube-scheduler | Watches for unscheduled pods. Selects best node based on resources, affinity, taints. |
| controller-manager | Runs control loops: ReplicaSet controller, Deployment controller, Node controller, etc. |
| kubelet | Agent on each node. Ensures containers in Pods are running and healthy. |
| kube-proxy | Maintains network rules (iptables/ipvs) for Service load balancing. |
Core Kubernetes Objects
Must Know
| Object | What it does |
|---|---|
| Pod | Smallest deployable unit. 1+ containers sharing network/storage. Ephemeral — don't manage directly. |
| Deployment | Manages ReplicaSets; rolling updates, rollback. For stateless apps. |
| StatefulSet | Like Deployment but for stateful apps (DBs). Stable pod names, ordered startup, persistent volumes per pod. |
| DaemonSet | One pod per node. For cluster-wide agents: log collector (Fluentd), metrics (Prometheus Node Exporter). |
| Service | Stable virtual IP + DNS for a set of pods. Types: ClusterIP, NodePort, LoadBalancer, ExternalName. |
| Ingress | HTTP/HTTPS routing rules → Services. Host-based + path-based routing. Needs Ingress Controller (Nginx, Traefik). |
| ConfigMap | Non-sensitive config as key-value pairs, injected as env vars or files. |
| Secret | Base64-encoded sensitive data (not encrypted by default! Use sealed-secrets or Vault for real security). |
| HPA | Horizontal Pod Autoscaler: scales pods based on CPU/memory/custom metrics. |
| PVC/PV | PersistentVolumeClaim: requests storage. PersistentVolume: actual storage resource. |
# Production Deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
labels: { app: api }
spec:
replicas: 3
selector: { matchLabels: { app: api } }
strategy:
type: RollingUpdate
rollingUpdate: { maxSurge: 1, maxUnavailable: 0 } # zero-downtime
template:
metadata: { labels: { app: api } }
spec:
containers:
- name: api
image: myapp/api:v1.2.3 # pin to digest in prod: @sha256:...
ports: [{ containerPort: 8080 }]
resources:
requests: { cpu: "100m", memory: "256Mi" } # scheduler uses this
limits: { cpu: "500m", memory: "512Mi" } # OOMKilled if exceeded
readinessProbe: # traffic only when ready
httpGet: { path: /ready, port: 8080 }
initialDelaySeconds: 5
periodSeconds: 5
livenessProbe: # restart if unhealthy
httpGet: { path: /health, port: 8080 }
initialDelaySeconds: 15
env:
- name: DB_PASSWORD
valueFrom: { secretKeyRef: { name: db-secret, key: password } }
---
apiVersion: v1
kind: Service
metadata: { name: api-service }
spec:
selector: { app: api }
ports: [{ port: 80, targetPort: 8080 }]
type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api-hpa }
spec:
scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api-service }
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
Service Types, Ingress & Service Mesh
Networking
| Service Type | Accessible From | Use Case |
|---|---|---|
| ClusterIP (default) | Within cluster only | Internal microservice communication |
| NodePort | NodeIP:Port from outside | Dev/test; not for production |
| LoadBalancer | Cloud LB public IP | Expose single service externally; costs money per service |
| Ingress + IngressController | HTTP/HTTPS routing via single LB | Route multiple services via path/host — cost-effective |
Ingress Routing:
Internet → ALB/NLB → Ingress Controller (Nginx pod)
api.example.com/v1/users → api-service:80
api.example.com/v1/orders → order-service:80
app.example.com/ → frontend-service:80
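A sketch of the same host/path routing created with the official Kubernetes Python client (pip install kubernetes); the service names match the diagram above, and an nginx ingress class is assumed:
from kubernetes import client, config

config.load_kube_config()

def path(p, svc):
    # Prefix path routed to a Service backend on port 80
    return client.V1HTTPIngressPath(
        path=p, path_type="Prefix",
        backend=client.V1IngressBackend(
            service=client.V1IngressServiceBackend(
                name=svc, port=client.V1ServiceBackendPort(number=80))))

ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(name="api-ingress"),
    spec=client.V1IngressSpec(
        ingress_class_name="nginx",
        rules=[client.V1IngressRule(
            host="api.example.com",
            http=client.V1HTTPIngressRuleValue(
                paths=[path("/v1/users", "api-service"),
                       path("/v1/orders", "order-service")]))]))

client.NetworkingV1Api().create_namespaced_ingress(namespace="default", body=ingress)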
Service Mesh (Istio/Linkerd):
Sidecar proxy (Envoy) injected into each pod
→ mTLS between all services
→ Circuit breaking, retries, timeout
→ Distributed tracing
→ Traffic shifting (canary: 10% to v2, 90% to v1)
Bottleneck: Service mesh adds ~2ms latency per hop (sidecar proxy). Worth it for security and observability at scale, but overhead for small clusters.
AWS Core Services
EC2 — Elastic Compute Cloud
AWS · Compute
| Purchase Option | Discount | Best For |
|---|---|---|
| On-Demand | 0% | Unpredictable, short-term workloads |
| Reserved (1–3yr) | Up to 72% | Steady-state baseline workloads |
| Spot Instances | Up to 90% | Fault-tolerant batch jobs, can be interrupted with 2-min notice |
| Savings Plans | Up to 66% | Flexible (compute family agnostic) |
| Dedicated Hosts | Variable | Compliance, licensing (SQL Server, Oracle) |
Instance Families
- m — General Purpose: m7g, m6i (balanced CPU/memory)
- c — Compute Optimized: c7g (CPU-intensive, ML inference)
- r — Memory Optimized: r7i (in-memory DB, Redis, Kafka)
- i — Storage Optimized: i4i (NVMe SSD, Cassandra, HDFS)
- g — GPU: g5 (ML training, graphics)
- t — Burstable: t3 (dev/test, variable workloads)
EC2 Bottlenecks & Best Practices
- Network: use enhanced networking (ENA), placement groups for HPC
- Storage I/O: use io2 Block Express for >64K IOPS (RDBMS)
- CPU: c instances for compute-bound; Graviton (ARM) = 20% cheaper, 40% better perf/watt
- Right-sizing: use Compute Optimizer + CloudWatch metrics
- Spot: use Spot with ASG + diversified instance types (fallback)
Auto Scaling Group (ASG)
- Min/desired/max capacity; scales on CloudWatch metrics (CPU, custom)
- Target Tracking: "keep CPU at 60%" — simplest, recommended (see the sketch below)
- Step Scaling: different actions at different thresholds
- Predictive Scaling: ML-based forecast, scales in advance
- Always pair with ELB — unhealthy instances replaced automatically
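A boto3 sketch of the target-tracking policy mentioned above ("keep CPU at 60%"); the ASG name is a placeholder:
import boto3

autoscaling = boto3.client("autoscaling")

# Target tracking: the ASG adds/removes instances to hold average CPU near 60%
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",              # placeholder ASG name
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)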
S3 — Simple Storage Service
AWS · Storage
| Storage Class | Use Case | Retrieval | Cost |
|---|---|---|---|
| S3 Standard | Frequently accessed (app assets, active data) | Immediate | $$$ |
| S3 Standard-IA | Infrequent access, must be available quickly | Immediate | $$ |
| S3 One Zone-IA | Recreatable data (thumbnails) | Immediate | $ |
| S3 Intelligent-Tiering | Unknown access patterns — auto-moves between tiers | Immediate | $$+monitoring fee |
| S3 Glacier Instant | Archive, rare access but instant needed | Milliseconds | $ |
| S3 Glacier Flexible | Archive, 12-hour retrieval OK | 1–12 hours | ¢ |
| S3 Glacier Deep Archive | Long-term (7-10 year retention) | Up to 48 hours | ¢¢ |
S3 Key Features & Best Practices
- Durability: 99.999999999% (11 9s) — stores across 3 AZs minimum
- Versioning: Enable on important buckets — protects from delete/overwrite
- Lifecycle rules: Auto-transition to cheaper tier or delete after N days
- Presigned URLs: Temporary access to private objects (e.g., 15-min download link; see the sketch below)
- S3 Transfer Acceleration: Uses CloudFront edge for faster uploads globally
- Multipart upload: Required for >5GB; recommended for >100MB
- Block Public Access: Enable at account level — prevents accidental public exposure
- S3 Event Notifications: Trigger Lambda/SQS/SNS on put/delete (data processing pipeline)
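A boto3 sketch of two of the features above (presigned URL, lifecycle rule); bucket and key names are placeholders:
import boto3

s3 = boto3.client("s3")

# Presigned URL: 15-minute download link for a private object
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "reports/2024-q1.pdf"},
    ExpiresIn=900,
)

# Lifecycle rule: move logs/ to Glacier after 90 days, delete after ~7 years
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "archive-logs",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 2555},
    }]},
)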
S3 Bottleneck: S3 has 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix. Distribute objects across multiple prefixes (don't use date prefix for high-throughput writes).
RDS & Aurora — Managed Databases
AWS · Database
| | RDS (Standard) | Aurora |
|---|---|---|
| Engines | MySQL, Postgres, MariaDB, Oracle, SQL Server | MySQL-compatible, Postgres-compatible |
| Storage | EBS volumes, manual scaling | Shared distributed storage, auto-scales to 128TB |
| Read Replicas | Up to 15 (engine-dependent), eventual consistency | Up to 15 Aurora Replicas, <10ms lag |
| Failover | 1–2 minutes (Multi-AZ standby) | ~30 seconds (Aurora Replica promoted) |
| Performance | Baseline | 5x MySQL, 3x Postgres performance |
| Cost | Lower for simple workloads | Higher base, better at scale |
Multi-AZ vs Read Replicas
- Multi-AZ: Synchronous standby in another AZ. Automatic failover. NOT for read scaling — standby is idle.
- Read Replicas: Async replication. Use for read scaling, analytics. Can be in different region.
- Aurora Serverless v2: Auto-scales compute capacity in fine-grained increments (0.5 ACU units). Good for variable workloads.
RDS Bottlenecks:
Connection pool exhaustion — use RDS Proxy (connection pooling, IAM auth)
Long-running queries blocking writes — use query timeout, read replica for analytics
Storage I/O limit — use io1/io2 with provisioned IOPS for write-heavy workloads
ElastiCache — Redis & Memcached
AWS · Cache
| | Redis | Memcached |
|---|---|---|
| Data types | Rich (String, Hash, List, Set, ZSet, Geo, Streams) | Simple key-value only |
| Persistence | RDB + AOF options | No persistence |
| Replication | Primary + replicas, cluster mode | No replication |
| Multi-threading | Single-threaded (I/O multithreading in 6.0+) | Multi-threaded |
| Choose when | Session, leaderboards, pub/sub, complex data, persistence | Simple cache, multi-CPU, horizontal scale |
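To make the Redis-side use cases in the table concrete, a short redis-py sketch against an assumed ElastiCache endpoint (hostname is a placeholder):
import redis

# Assumed ElastiCache Redis primary endpoint (placeholder hostname)
r = redis.Redis(host="my-cache.abc123.ng.0001.use1.cache.amazonaws.com",
                port=6379, decode_responses=True)

# Session store with a TTL
r.setex("session:u123", 3600, '{"user_id": "u123", "role": "admin"}')

# Leaderboard backed by a sorted set
r.zincrby("leaderboard:daily", 50, "player42")
top10 = r.zrevrange("leaderboard:daily", 0, 9, withscores=True)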
ElastiCache Redis — Cluster Mode
- Cluster disabled: Single primary + up to 5 replicas. All data on one shard. Simple.
- Cluster enabled: Sharded across up to 500 node groups. Scales horizontally. Cross-shard commands limited.
- Global Datastore: Cross-region replication for multi-region low-latency reads.
AWS Networking
VPC — Virtual Private Cloud
AWS · Must Know
VPC Architecture (Production):
VPC: 10.0.0.0/16 (65,536 IPs)
│
├── Public Subnet AZ-a: 10.0.0.0/24 → Internet Gateway → public internet
│ └── NAT Gateway (for private → internet outbound)
│ └── ALB / NLB (load balancers)
│ └── Bastion host (SSH jump server)
│
├── Private Subnet AZ-a: 10.0.10.0/24 → NO direct internet access
│ └── EC2 App Servers, ECS tasks
│ └── RDS (never in public subnet!)
│
├── Public Subnet AZ-b: 10.0.1.0/24 (for HA — multi-AZ)
└── Private Subnet AZ-b: 10.0.11.0/24
Route Tables:
Public: 0.0.0.0/0 → Internet Gateway
Private: 0.0.0.0/0 → NAT Gateway (outbound only)
| Component | Purpose |
|---|---|
| Internet Gateway (IGW) | Allows public subnet instances to communicate with internet |
| NAT Gateway | Allows private subnet instances to initiate outbound internet (not inbound) |
| Security Group | Stateful virtual firewall at instance level (allow rules only) |
| NACL | Stateless firewall at subnet level (allow + deny rules, order matters) |
| VPC Peering | Private network connection between two VPCs (no transitive routing) |
| Transit Gateway | Hub-and-spoke: connect many VPCs + on-prem (replaces VPC peering mesh) |
| VPC Endpoints | Private connection to AWS services (S3, DynamoDB) without NAT/IGW |
| PrivateLink | Expose services privately to other VPCs |
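For example, a security group allow rule added with boto3 (group ID is a placeholder); because security groups are stateful, the return traffic needs no extra rule:
import boto3

ec2 = boto3.client("ec2")

# Allow inbound HTTPS from anywhere; response traffic is implicitly allowed (stateful)
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",   # placeholder
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "HTTPS from internet"}],
    }],
)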
Load Balancers — ALB, NLB, CLB
AWS · Networking
| | ALB (Application) | NLB (Network) | CLB (Classic, deprecated) |
|---|---|---|---|
| Layer | L7 (HTTP/HTTPS/WebSocket/gRPC) | L4 (TCP/UDP/TLS) | L4+L7 |
| Routing | Path, host, header, query, IP | IP:Port only | Basic |
| Latency | Adds a few ms (L7 processing) | Sub-ms (ultra low, L4) | Medium |
| TLS offload | Yes (ACM certs) | Yes (passthrough optional) | Yes |
| WebSocket | Yes | Yes (TCP passthrough) | Limited |
| Static IP | No (DNS name only) | Yes (per-AZ Elastic IP) | No |
| Best for | Web apps, microservices, HTTP APIs | TCP/UDP, gaming, IoT, VoIP, Kubernetes service LoadBalancer | Legacy |
ALB Target Groups
Register targets as EC2 instances, IP addresses, Lambda functions, or another ALB. Health checks are configured per target group. Weighted target groups enable blue/green and canary deployments (see the sketch below).
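A boto3 sketch of such a weighted forward action for a canary split; the listener and target group ARNs are placeholders:
import boto3

elbv2 = boto3.client("elbv2")

# Canary: send 10% of traffic to the v2 target group, 90% to v1
elbv2.modify_listener(
    ListenerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/xxx/yyy",
    DefaultActions=[{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/api-v1/zzz", "Weight": 90},
                {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/api-v2/www", "Weight": 10},
            ]
        },
    }],
)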
CloudFront — CDN & Edge Caching
AWS · CDN
User Request Flow:
User → DNS → CloudFront Edge (450+ PoPs globally)
→ Cache HIT: serve from edge (<5ms)
→ Cache MISS: forward to Origin (S3 / ALB / custom HTTP)
→ Cache response at edge
→ Serve user
Origins: S3, ALB, EC2, API Gateway, on-prem HTTP endpoint
CloudFront Key Features
- Signed URLs/Cookies: Time-limited access to private content (streaming, downloads)
- Lambda@Edge / CloudFront Functions: Run code at edge (auth, redirects, header manipulation)
- Origin Shield: Additional caching layer to protect origin from traffic spikes
- Field-Level Encryption: Encrypt specific POST fields at edge (credit cards)
- WAF integration: Block malicious requests at edge before they reach your origin
- Invalidation: /images/* to clear specific cached paths (first 1,000 paths/month free, then $0.005 per path)
Cache Invalidation: CloudFront TTL-based. Use versioned file names (style.v1.2.3.css) instead of frequent invalidations for static assets.
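When an invalidation is needed anyway, a boto3 sketch (distribution ID is a placeholder):
import time
import boto3

cloudfront = boto3.client("cloudfront")

# Invalidate a path pattern at all edge locations
cloudfront.create_invalidation(
    DistributionId="E1ABCDEFGHIJKL",            # placeholder
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/images/*"]},
        "CallerReference": str(time.time()),    # must be unique per request
    },
)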
Route 53 — DNS & Traffic Policies
AWS · DNS
| Routing Policy | Use Case |
|---|---|
| Simple | Single resource, no health checks |
| Failover | Active-passive: route to secondary if primary unhealthy |
| Weighted | Traffic splitting (10% to v2, 90% to v1) — canary releases |
| Latency-based | Route to region with lowest latency from user |
| Geolocation | Route based on user's geographic location (compliance, language) |
| Geoproximity | Shift traffic between regions using bias (traffic routing, migration) |
| Multivalue Answer | Return multiple IPs, client-side LB with health checks |
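A boto3 sketch of the weighted policy from the table, splitting traffic 90/10 for a canary; the hosted zone ID and IPs are placeholders:
import boto3

route53 = boto3.client("route53")

def weighted(identifier, weight, ip):
    return {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api.example.com.", "Type": "A",
        "SetIdentifier": identifier, "Weight": weight,
        "TTL": 60, "ResourceRecords": [{"Value": ip}]}}

# 90% of DNS answers point at v1, 10% at v2
route53.change_resource_record_sets(
    HostedZoneId="Z0EXAMPLE12345",
    ChangeBatch={"Changes": [weighted("v1", 90, "203.0.113.10"),
                             weighted("v2", 10, "203.0.113.20")]},
)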
Route 53 Health Checks
Monitor endpoint health from multiple AWS regions. Integrates with failover routing and can trigger CloudWatch alarms. Also supports monitoring CloudWatch metrics (so you can fail over based on error rate, not just HTTP 200).
AWS Data & Analytics
DynamoDB — NoSQL at Scale
AWS · Popular
Data Model
- Table: Collection of items (no fixed schema except primary key)
- Partition Key: Determines partition (must have high cardinality). Hash-based.
- Sort Key: Optional. Enables range queries within a partition.
- GSI (Global Secondary Index): Query on non-key attributes. Separate partition+sort key.
- LSI (Local Secondary Index): Alternate sort key for same partition key.
import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('users')
# PutItem
table.put_item(Item={
'user_id': 'u123', # Partition key
'email': 'user@example.com',
'created_at': 1716825600,
'profile': {'name': 'Alice', 'age': 30}
})
# GetItem (single item by PK)
response = table.get_item(Key={'user_id': 'u123'})
user = response['Item']
# Query (items in a partition)
response = table.query(
KeyConditionExpression='user_id = :uid AND created_at BETWEEN :start AND :end',
ExpressionAttributeValues={':uid': 'u123', ':start': 1700000000, ':end': 1716825600}
)
# Conditional write (atomic)
table.update_item(
Key={'user_id': 'u123'},
UpdateExpression='SET #st = :active',
ConditionExpression='attribute_exists(user_id)',
ExpressionAttributeNames={'#st': 'status'},
ExpressionAttributeValues={':active': 'active'}
)
Hot Partition: If all writes hit one partition key (e.g., same date prefix), DynamoDB throttles. Distribute with high-cardinality keys or add random suffix (write sharding).
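A tiny Python sketch of that write-sharding idea (shard count and key format are arbitrary):
import random

N_SHARDS = 10

def sharded_partition_key(base_key: str) -> str:
    # e.g. "2024-05-27" becomes "2024-05-27#0" .. "2024-05-27#9"
    return f"{base_key}#{random.randint(0, N_SHARDS - 1)}"

# Writers spread load across N partitions; readers query all N suffixes and merge results.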
DynamoDB Capacity: Provisioned (predictable, auto-scaling) or On-Demand (variable, more expensive). Use On-Demand for new tables or unpredictable traffic.
AWS Data Pipeline — Kinesis, Glue, Athena, Redshift
AWS · Analytics
Modern Data Pipeline on AWS:
Sources → Kinesis Data Streams (real-time, like Kafka)
→ Kinesis Firehose (managed delivery to S3/Redshift/ES)
↓
S3 Data Lake (Parquet/ORC format, partitioned by date/source)
↓
┌───────────────────────────────────────────┐
│ AWS Glue (ETL, data catalog, crawlers) │
│ Amazon Athena (serverless SQL on S3 data) │
│ Amazon Redshift (data warehouse, OLAP) │
└───────────────────────────────────────────┘
↓
QuickSight (BI / dashboards)
| Service | Purpose | Best For |
|---|---|---|
| Kinesis Data Streams | Real-time stream ingestion (like Kafka but AWS-managed) | Real-time analytics, alerting; 24h default retention (extendable up to 365 days) |
| Kinesis Firehose | Fully managed delivery to S3/Redshift/OpenSearch | Simple streaming ETL, no coding required |
| AWS Glue | Serverless ETL, data catalog | Batch ETL, schema discovery, data catalog for Athena |
| Amazon Athena | Interactive SQL on S3 (Trino/Presto under the hood) | Ad-hoc queries on the data lake, pay per TB scanned |
| Amazon Redshift | Fully managed columnar data warehouse | Complex analytics, BI, large joins |
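A boto3 sketch of the ingest and query ends of this pipeline; stream, database, table, and bucket names are placeholders:
import json
import boto3

kinesis = boto3.client("kinesis")
athena = boto3.client("athena")

# Ingest: put a record onto a stream
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user_id": "u123", "event": "page_view"}).encode(),
    PartitionKey="u123",   # determines the shard
)

# Query the S3 data lake with Athena
athena.start_query_execution(
    QueryString="SELECT event, count(*) FROM clicks WHERE dt = '2024-05-27' GROUP BY event",
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)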
Serverless & Messaging
AWS Lambda — Serverless Functions
AWS · Popular
How Lambda Works
- Upload code (zip or container image up to 10GB)
- Set memory (128MB–10GB), timeout (max 15 min), concurrency limit
- AWS provisions execution environment (firecracker microVM) on invocation
- Pay per invocation + GB-seconds of execution
- Scale to 1000+ concurrent executions by default
import json, boto3, os
def handler(event, context):
# event: the triggering event payload (API GW request, S3 event, SQS message)
# context: runtime info (function name, memory limit, remaining time)
# Trigger sources: API Gateway, S3, SQS, SNS, DynamoDB Streams, EventBridge,
# Kinesis, ALB, Cognito, Step Functions, CloudWatch Events
# Example: S3 trigger
for record in event.get('Records', []):
bucket = record['s3']['bucket']['name']
key = record['s3']['object']['key']
# Process file...
# Warm vs Cold start
# Cold start: ~100ms (Python) to ~1s (Java) for first invocation after idle
# Warm: subsequent invocations reuse container (~1ms overhead)
return {
'statusCode': 200,
'headers': {'Content-Type': 'application/json'},
'body': json.dumps({'message': 'Success'})
}
# Best practices:
# - Initialize clients outside handler (reused across warm invocations)
db_client = boto3.client('rds-data') # module-level initialization!
Cold Start Bottleneck: Java Lambda cold starts ~1s. Mitigations: Provisioned Concurrency (keep N instances warm, costs money), Lambda SnapStart (Java — snapshot JVM state), use Python/Node for latency-sensitive.
Lambda Concurrency
- Unreserved: Shared pool (default 1000 per region)
- Reserved Concurrency: Guarantee N for a function; caps it to prevent throttling others
- Provisioned Concurrency: Pre-warm N instances → zero cold starts (costs $ always; see the sketch below)
- SQS trigger: Lambda scales to match SQS queue depth (up to concurrency limit)
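A boto3 sketch of the reserved and provisioned concurrency settings above; the function name and alias are placeholders:
import boto3

lambda_client = boto3.client("lambda")

# Reserved concurrency: cap this function at 100 concurrent executions
lambda_client.put_function_concurrency(
    FunctionName="order-processor",
    ReservedConcurrentExecutions=100,
)

# Provisioned concurrency: keep 10 warm environments for a published alias
lambda_client.put_provisioned_concurrency_config(
    FunctionName="order-processor",
    Qualifier="live",                      # alias or version (required)
    ProvisionedConcurrentExecutions=10,
)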
SQS, SNS & EventBridge
AWS · Messaging
| Service | Model | Best For |
|---|---|---|
| SQS Standard | Queue (pull-based), at-least-once, best-effort order | Decoupling, work queues, task distribution |
| SQS FIFO | Queue, exactly-once, strict order, 300 msg/s (3000 with batching) | Financial, inventory where order matters |
| SNS | Pub/Sub fan-out (push to SQS, Lambda, HTTP, email, SMS) | Fan-out to multiple consumers simultaneously |
| EventBridge | Event bus with routing rules (match JSON patterns) | Event-driven architectures, SaaS integrations, scheduled rules |
Fan-out Pattern (SNS + SQS):
S3 Upload Event → SNS Topic
├── SQS Queue A → Lambda (thumbnail generation)
├── SQS Queue B → Lambda (virus scanning)
└── SQS Queue C → Lambda (metadata extraction)
Benefits: each consumer has own queue (independent rate/failure handling)
SNS delivers to all, SQS buffers for each consumer
SQS Key Properties
- Visibility Timeout: Message hidden from other consumers while being processed (default 30s)
- Message Retention: 1 min to 14 days (default 4 days)
- DLQ: After maxReceiveCount failures → moves to Dead Letter Queue
- Long Polling: Wait up to 20s for a message (reduces empty responses + cost; see the sketch below)
- Max message size: 256KB (use S3 for large payloads, store reference in message)
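A boto3 sketch of a long-polling consumer that uses the properties above; the queue URL and handler are placeholders:
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

def process(body):
    print("processing", body)   # stand-in for real work

while True:
    # Long polling: wait up to 20s instead of hammering the API with empty receives
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
        VisibilityTimeout=60,    # hide each message while it is being processed
    )
    for msg in resp.get("Messages", []):
        process(msg["Body"])
        # Delete only after successful processing; otherwise the message reappears
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])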
ECS vs EKS — Container Orchestration
AWS · Containers
| | ECS (Elastic Container Service) | EKS (Elastic Kubernetes Service) |
|---|---|---|
| Control Plane | AWS proprietary | Kubernetes |
| Learning curve | Lower — AWS-native concepts | Higher — K8s expertise needed |
| Flexibility | AWS ecosystem only | Standard K8s (portable) |
| Cost | Free control plane + EC2/Fargate | $0.10/hr control plane + EC2/Fargate |
| Launch types | EC2 (manage servers) or Fargate (serverless) | EC2 or Fargate (for K8s) |
| Choose when | AWS-native team, simpler use case, cost-sensitive | Multi-cloud, existing K8s investment, more control |
Fargate — Serverless Containers
Run containers without managing EC2 instances. AWS provisions compute per task; you pay for the vCPU + memory allocated. Great for: variable workloads, CI/CD build runners, low-ops teams. Downside: slower startup than EC2, no GPU support, higher per-unit cost.
IAM — Identity & Access Management
AWS · Security Critical
IAM Concepts
- Users: Long-term identity (humans, CI/CD). Avoid; use Roles when possible.
- Groups: Collection of users with shared permissions.
- Roles: Temporary credentials assumed by AWS services, apps, or federated users. No stored credentials.
- Policies: JSON documents defining Allow/Deny on Actions/Resources.
- Conditions: Restrict by IP, MFA, time, tags.
// Least-privilege policy: Lambda can only list and read the uploads/ prefix of one bucket
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-bucket",
      "Condition": {
        "StringLike": { "s3:prefix": ["uploads/*"] }
      }
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/uploads/*"
    }
  ]
}
IAM Security Best Practices:
Principle of least privilege — only what's needed
Use roles (not users) for EC2/Lambda/ECS
Enable MFA on root and admin users
Rotate access keys; better: use OIDC/SSO
Use SCPs (Service Control Policies) in AWS Organizations
Never put AWS credentials in code, Docker images, or Git repos
AWS Architecture Patterns
Typical 3-Tier Web Architecture on AWS
AWS · Architecture
Internet
↓
Route 53 (DNS, health check, failover)
↓
CloudFront (CDN — static assets, edge caching)
↓
WAF (Web Application Firewall — OWASP rules, rate limiting)
↓
ALB (Application Load Balancer — HTTP routing, SSL termination)
↓
Auto Scaling Group (EC2 / ECS / EKS — app servers)
├── AZ-a: 2+ instances └── AZ-b: 2+ instances
↓
┌─────────────────────────────────────────────┐
│ RDS Aurora (Multi-AZ) ← Primary writes │
│ RDS Read Replicas ← Read-heavy queries │
│ ElastiCache Redis ← Session/cache layer │
│ S3 ← Blob storage │
└─────────────────────────────────────────────┘
↓
CloudWatch (metrics, logs, alarms)
X-Ray (distributed tracing)
Event-Driven Serverless Architecture
AWS · Serverless
Client → API Gateway → Lambda (auth + route)
↓
┌────────────────────────────────────┐
│ EventBridge (event bus) │
│ Rule: order.created → SQS │
│ Rule: payment.failed → SNS → email │
└────────────────────────────────────┘
↓
SQS → Lambda (order processor)
Lambda (inventory updater)
Lambda (notification sender)
↓
DynamoDB (orders table)
S3 (invoices, receipts)
Aurora Serverless (reports)
Serverless Trade-offs
- No infrastructure management; auto-scale to zero
- Pay-per-use (great for low/unpredictable traffic)
- Cold starts add latency (<1s but noticeable)
- 15-minute max execution time (not for long jobs)
- Stateless — use DynamoDB/ElastiCache for state
- Vendor lock-in; local dev is harder
- Difficult to debug distributed failures (use X-Ray)
AWS Well-Architected Framework — 6 Pillars
AWS · Interview
| Pillar | Key Principle | AWS Services |
|---|---|---|
| Security | Implement security at every layer; least privilege; encrypt in transit + at rest | IAM, KMS, WAF, Shield, CloudTrail, GuardDuty |
| Reliability | Recover from failures; test recovery; horizontal scale | Multi-AZ, ASG, Route53, CloudWatch, Backup |
| Performance Efficiency | Use right resource type; experiment; serverless | CloudFront, ElastiCache, Lambda, Graviton |
| Cost Optimization | Pay only for what you use; rightsize; spot instances | Cost Explorer, Reserved Instances, Savings Plans, Spot |
| Operational Excellence | Operations as code; make frequent small changes; learn from failure | CloudFormation/CDK, CodePipeline, CloudWatch, Systems Manager |
| Sustainability | Reduce environmental impact; use efficient resources | Graviton, Fargate spot, S3 Intelligent-Tiering |
Infrastructure as Code — CloudFormation & CDK
AWS · DevOps
# CloudFormation — Infrastructure as YAML
AWSTemplateFormatVersion: '2010-09-09'
Description: API with Lambda and DynamoDB
Parameters:
Environment: { Type: String, Default: prod }
Resources:
OrdersTable:
Type: AWS::DynamoDB::Table
Properties:
TableName: !Sub orders-${Environment}
BillingMode: PAY_PER_REQUEST
AttributeDefinitions:
- { AttributeName: order_id, AttributeType: S }
KeySchema:
- { AttributeName: order_id, KeyType: HASH }
ApiFunction:
Type: AWS::Lambda::Function
Properties:
FunctionName: !Sub api-handler-${Environment}
Runtime: python3.12
Handler: index.handler
Code:
  ZipFile: |
    def handler(event, context):
        return {'statusCode': 200, 'body': 'OK'}
Role: !GetAtt LambdaRole.Arn
Environment:
Variables:
TABLE_NAME: !Ref OrdersTable
LambdaRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Statement: [{Effect: Allow, Principal: {Service: lambda.amazonaws.com}, Action: sts:AssumeRole}]
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Outputs:
TableName: { Value: !Ref OrdersTable }
# AWS CDK — Infrastructure as Python code
from aws_cdk import (App, Stack, Duration, RemovalPolicy,
aws_lambda as _lambda,
aws_dynamodb as dynamodb,
aws_apigateway as apigw,
aws_iam as iam)
from constructs import Construct
class ApiStack(Stack):
def __init__(self, scope: Construct, id: str, **kwargs):
super().__init__(scope, id, **kwargs)
# DynamoDB Table
table = dynamodb.Table(self, "OrdersTable",
partition_key=dynamodb.Attribute(
name="order_id", type=dynamodb.AttributeType.STRING),
billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
removal_policy=RemovalPolicy.DESTROY # for dev
)
# Lambda Function
handler = _lambda.Function(self, "ApiHandler",
runtime=_lambda.Runtime.PYTHON_3_12,
code=_lambda.Code.from_asset("lambda"),
handler="index.handler",
timeout=Duration.seconds(30),
environment={"TABLE_NAME": table.table_name}
)
# Grant Lambda read/write access to DynamoDB
table.grant_read_write_data(handler)
# API Gateway
api = apigw.RestApi(self, "OrdersApi")
orders = api.root.add_resource("orders")
orders.add_method("POST", apigw.LambdaIntegration(handler))
app = App()
ApiStack(app, "ApiStack", env={"account": "123456789", "region": "us-east-1"})
app.synth()
CDK vs CloudFormation: CDK generates CloudFormation under the hood. CDK has full programming language power (loops, conditions, abstractions). CloudFormation is more explicit. Both are infrastructure-as-code — version controlled, reproducible, reviewable.
Observability on AWS — CloudWatch, X-Ray, OpenTelemetry
AWS · Monitoring
| Tool | What | Use For |
|---|---|---|
| CloudWatch Metrics | EC2/RDS/Lambda/custom metrics + dashboards | Alerts, autoscaling triggers, SLO monitoring |
| CloudWatch Logs | Log aggregation from all AWS services + custom | Centralized log search, Insights queries |
| CloudWatch Alarms | Threshold-based alerts → SNS/SQS/ASG actions | PagerDuty integration, auto-recovery |
| AWS X-Ray | Distributed tracing (service map, latency analysis) | Identify bottlenecks across Lambda/API GW/DynamoDB |
| CloudTrail | API activity audit log (who called what when) | Security auditing, compliance, incident investigation |
| AWS Config | Track config changes to AWS resources | Compliance, drift detection |
| Amazon Managed Grafana | Grafana dashboards connected to CloudWatch/Prometheus | Rich dashboards, open-source ecosystem |
CloudWatch Logs Insights Query Examples
# Error volume over time in Lambda logs (5-minute bins)
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) by bin(5m)

# Lambda cold start analysis (average init duration per hour)
fields @type, @duration, @billedDuration, @initDuration
| filter @type = "REPORT"
| stats avg(@initDuration) by bin(1h)