System Design Complete Guide


Fundamentals, architecture patterns, and real interview questions with full answers and trade-offs.

Core Fundamentals

CAP Theorem

Pick any 2: Consistency, Availability, Partition Tolerance

  • Consistency: every read returns the most recent write
  • Availability: every request gets a non-error response
  • Partition Tolerance: the system works despite network partitions
System                 | Type
PostgreSQL, MySQL      | CA (single node)
Cassandra, DynamoDB    | AP (eventual)
HBase, Zookeeper, etcd | CP (may reject)
MongoDB (default)      | CP (tunable)
Couchbase              | AP or CP (configurable)
Reality
P is always required in distributed systems. The real choice is C vs A during a partition.

ACID vs BASE

ACID (SQL)                      | BASE (NoSQL)
Atomicity: all or nothing       | Basically Available
Consistency: valid state always | Soft State
Isolation: no interference      | Eventually Consistent
Durability: committed = safe    | Prioritizes availability
Use ACID: banking, bookings, inventory. Use BASE: social feeds, analytics, caching, data catalog metadata.

Scalability: Vertical vs Horizontal + Strategies

Vertical Scaling (Scale Up)
  • Add CPU/RAM to existing server
  • Simple: no application changes
  • Single point of failure
  • Hard hardware limit; expensive
Horizontal Scaling (Scale Out)
  • Add more commodity machines
  • Theoretically unlimited
  • Requires stateless services + LB
  • Complex: distributed state, consistency

Key Scaling Strategies

  • Stateless Services: Store state in Redis/DB. Any instance handles any request.
  • Read Replicas: Primary for writes, replicas for reads (80% of traffic is reads)
  • Sharding: Partition data across nodes by shard key
  • CDN: Cache static assets globally, reduce origin server load drastically
  • Async Processing: queue slow work (emails, reports) for an instant API response
  • Denormalization: pre-compute joins for faster reads at the cost of storage + write complexity

PACELC Theorem (CAP Extension)

If (P)artition: choose (A) or (C). Else (E): choose (L)atency or (C)onsistency

  • CAP only models partition scenarios. PACELC says: even normally, there is a latency vs consistency tradeoff.
  • Replication gives lower latency OR stronger consistency, never both.
System         | Partition choice | Else choice
DynamoDB       | Availability     | Latency (low)
Cassandra      | Availability     | Latency (low)
Google Spanner | Consistency      | Consistency (strong)
MySQL          | Consistency      | Consistency
Databases

Database Indexing: B-Tree, Hash, Composite

Must Know
Type           | Structure                  | Best For                                       | Not For
B-Tree         | Balanced tree              | Range queries, ORDER BY, <, >, BETWEEN         | High-cardinality exact lookups (hash faster)
Hash Index     | Hash table                 | Exact equality (=), O(1)                       | Range queries, sorting
Composite      | B-Tree on N cols           | Multi-column queries (leftmost prefix rule)    | Queries not starting with leftmost column
Covering Index | Index has all queried cols | Satisfy query from index alone (no heap fetch) | Wide tables with many columns
Full-Text      | Inverted index             | LIKE '%word%', text search                     | Exact match (regular index faster)
Partial Index  | Index with WHERE clause    | WHERE status='active': index only active rows  | When all rows need indexing

Index Trade-offs

  • Reads faster: SELECT queries benefit dramatically (O(log n) vs O(n))
  • Writes slower: every INSERT/UPDATE/DELETE updates all indexes
  • Storage cost: indexes use ~20-30% extra disk space each
  • Cardinality: index on gender (2 values) = nearly useless; index on user_id = excellent
Leftmost Prefix Rule
Composite index (a, b, c) helps queries on: (a), (a,b), (a,b,c). Does NOT help: (b), (c), or (b,c) alone.

SQL vs NoSQL: When to Use What

Factor             | SQL (Relational)             | NoSQL
Schema             | Fixed, enforced (DDL)        | Flexible, schema-on-read
Relationships      | Foreign keys, JOINs          | Denormalized, embedded
Horizontal Scaling | Difficult (sharding complex) | Native (Cassandra, Mongo)
Transactions       | Full ACID                    | Limited / eventual
Query Language     | SQL (very expressive)        | API-specific, limited joins

NoSQL Types & Use Cases

  • Key-Value (Redis, DynamoDB): sessions, cache, feature flags
  • Document (MongoDB): product catalogs, CMS, user profiles
  • Wide-Column (Cassandra, HBase): IoT, time-series, write-heavy analytics
  • Graph (Neo4j): social network, knowledge graph, data lineage
  • Search (Elasticsearch): full-text search, log analytics
  • Time-Series (InfluxDB, TimescaleDB): metrics, monitoring

Choose SQL When

  • Complex queries with multiple JOINs
  • ACID transactions are critical (money, inventory)
  • Schema is stable and well-defined
  • Rich aggregation / reporting needs
  • E.g.: banking, ERP, booking, HR systems
Conversely, choose NoSQL when the schema evolves rapidly or write throughput is extreme.

Database Replication Strategies

Single Leader (Primary-Replica): Leader (writes) → async/sync replication → Replicas 1, 2, 3. Reads can go to any replica (scale reads 3x).
Multi-Leader: Leader A (DC West) ↔ replication ↔ Leader B (DC East). Both accept writes; conflict resolution required.
Leaderless (Dynamo-style): Client writes to W nodes and reads from R nodes; R + W > N = consistency.
Strategy      | Pros                                 | Cons
Single Leader | Simple, no conflicts, ordered writes | Leader bottleneck, failover needed
Multi-Leader  | Write availability in multiple DCs   | Conflict resolution required
Leaderless    | High availability, no SPOF           | Complex quorums, eventual consistency

Sync vs Async Replication

  • Synchronous: write confirmed only when all replicas ack. Safe but slow.
  • Asynchronous: write confirmed on the leader's write. Fast, but potential data loss on leader crash.
  • Semi-sync: at least 1 replica acks. Balance of safety and speed (MySQL's semi-sync replication).

Database Sharding Strategies & Pitfalls

Strategy           | How                        | Pros                             | Cons
Range Sharding     | A-M → shard1, N-Z → shard2 | Simple range queries             | Hotspots (all celebrities on one shard)
Hash Sharding      | hash(key) % N              | Even distribution                | Rebalancing when N changes remaps many keys
Consistent Hashing | Hash ring + virtual nodes  | Minimal remapping on add/remove  | More complex; virtual nodes needed for uniformity
Directory Sharding | Lookup table: key → shard  | Flexible, can move data          | Lookup service = SPOF, extra hop
Geo Sharding       | US → shard1, EU → shard2   | Data locality, compliance (GDPR) | Cross-region queries expensive
Sharding Pitfalls
  • Cross-shard JOINs = scatter-gather (expensive)
  • Distributed transactions across shards need 2PC
  • Rebalancing requires data migration
  • Choose the shard key carefully: it is hard to change later
  • Hot shard = one shard receives disproportionate traffic
Caching

Caching Strategies: All Patterns

Critical
Strategy                  | Flow                                          | Use When                               | Downside
Cache-Aside (Lazy)        | Check cache → miss → load DB → store in cache | Read-heavy, tolerate some stale data   | Cache miss penalty, thundering herd on cold start
Read-Through              | Cache handles DB fetch transparently on miss  | Simplified app code, uniform read path | Cache must understand data model
Write-Through             | Write to cache AND DB synchronously           | Strong consistency needed              | Write latency doubled
Write-Behind (Write-Back) | Write to cache; async flush to DB             | Write-heavy, latency-critical          | Data loss if cache crashes before flush
Write-Around              | Write directly to DB, skip cache              | Write-once, read-rarely data (logs)    | Cache miss on first read after write
Refresh-Ahead             | Proactively refresh cache before TTL expires  | Predictable, hot data access patterns  | Wasted refresh if data not re-accessed

Eviction Policies

  • LRU (Least Recently Used): evict least recently accessed. Best for most workloads.
  • LFU (Least Frequently Used): evict least accessed overall. Good when some items are permanently hot.
  • TTL (Time-To-Live): expire after fixed duration. Prevents stale data.
  • FIFO: simple but poor hit rate. Rarely used in production.
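The LRU policy above can be sketched in a few lines with `collections.OrderedDict`; the `LRUCache` class and its method names are illustrative, not any particular cache library's API.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used key when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the LRU entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.put("c", 3)  # evicts "b", the least recently used
```

Production caches layer TTLs and size-aware eviction on top of this core idea, but the access-order bookkeeping is the same.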

Redis Data Structures for System Design

  • String: counter (INCR), session, feature flag, simple KV
  • Hash: user profile fields (HSET/HGET), object properties
  • List: activity feed, queue (LPUSH/RPOP), recent N items
  • Set: unique visitors, followers, tags (SADD/SINTER for mutual friends)
  • Sorted Set (ZSet): leaderboard (ZADD/ZRANGE), rate limiting, priority queue
  • Pub/Sub: real-time notifications, chat, cross-server broadcast
  • Streams: durable event log (lightweight Kafka replacement)
  • Geo: GEOADD/GEORADIUS for nearest driver, nearby stores

Cache Problems & Solutions

Problem                          | Cause                                                           | Solution
Cache Stampede (Thundering Herd) | Many concurrent misses hit DB simultaneously after cache expiry | Mutex/distributed lock, probabilistic early expiry, background refresh, never-expire + async update
Cache Penetration                | Repeated queries for non-existent keys bypass cache, hammer DB  | Cache null values with short TTL; Bloom Filter to pre-check existence
Cache Avalanche                  | Mass cache expiry at the same time → DB overwhelmed             | Randomize TTL (base_ttl + random(300s)); circuit breaker; multi-layer cache
Hot Key Problem                  | Single key (celebrity post) gets extreme traffic on one shard   | Replicate hot key to multiple shards; local in-process L1 cache; rate limit
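The stampede mitigation row above can be demonstrated with a per-key mutex in plain Python; the dict-based cache, `load_from_db`, and `db_calls` counter are stand-ins for Redis and a real database.

```python
import threading

# In-memory stand-ins for the cache and DB; names are illustrative.
_cache = {}
_locks = {}
_locks_guard = threading.Lock()
db_calls = 0

def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def load_from_db(key):
    global db_calls
    db_calls += 1               # count how often the "DB" is actually hit
    return f"value-for-{key}"

def get(key):
    # Cache-aside read path with a per-key mutex: of N concurrent misses,
    # exactly one caller recomputes; the rest wait, then read the cache.
    if key in _cache:
        return _cache[key]
    with _lock_for(key):
        if key in _cache:       # double-check after acquiring the lock
            return _cache[key]
        value = load_from_db(key)
        _cache[key] = value
        return value

# Ten concurrent misses for the same key -> a single DB load
threads = [threading.Thread(target=get, args=("user:1",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In a distributed setup the same pattern uses a distributed lock (e.g. Redis SET NX EX) instead of a process-local mutex.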
Microservices & Architecture Patterns

Monolith vs Microservices

Monolith Pros
  • Simple to develop, test, deploy
  • No distributed system complexity
  • No network latency between components
  • ACID transactions are trivial
  • Single deployable artifact
Monolith Cons
  • Must scale entire app even if only one part needs it
  • One tech stack for everything
  • Large codebase = slow builds, high cognitive load
  • All-or-nothing deploys = risky releases
Microservices Pros
  • Independent deploy, scale, and fault isolation
  • Technology flexibility per service
  • Small teams own individual services (Conway's Law)
  • One service crash doesn't bring down the whole system
Microservices Cons
  • Distributed systems complexity (latency, partial failures)
  • Cross-service transactions require Saga / 2PC
  • More infrastructure: service discovery, API gateway, tracing
  • Debugging is harder; need distributed tracing (Jaeger, Zipkin)
When to Choose Microservices
When different parts have drastically different scaling needs, different teams need independent release cycles, or different components need different technology stacks. NOT for early-stage products: premature optimization kills velocity.

API Gateway Pattern

Client (Mobile / Web / Third-party) → API Gateway → User Service / Order Service / Catalog Service

Gateway responsibilities:
  • Auth (JWT validation, in one place)
  • Rate Limiting (per client / IP)
  • SSL Termination
  • Request Routing / Load Balancing
  • Request / Response Transformation
  • Logging, Distributed Tracing
  • Circuit Breaking
Tools
Kong, AWS API Gateway, NGINX, Traefik, Envoy, Spring Cloud Gateway

BFF Pattern (Backend for Frontend)

  • Separate API gateway per client type (mobile BFF, web BFF)
  • Each BFF aggregates exactly the data that client needs
  • Avoids over-fetching on mobile (bandwidth matters) vs web

CQRS: Command Query Responsibility Segregation

Architecture Pattern
Commands (writes) → Command Handler → Write Model (normalized SQL)
Queries (reads) → Query Handler → Read Model (denormalized view: Elasticsearch, Redis)
Events published → Kafka → consumers update the read model async
Example (Orders): POST /orders → write to MySQL → publish OrderPlaced → Kafka
Kafka consumer → build summary view in Redis / Elasticsearch
GET /orders/summary → reads from the fast read store
Pros
  • Optimize reads and writes independently
  • Scale read and write services separately
  • Read model tailored per use case
  • Natural fit with Event Sourcing
Cons
  • Eventual consistency between read/write models
  • Significantly more complex architecture
  • Duplicate data, sync must be maintained
  • Overkill for simple CRUD apps

Event Sourcing

Architecture Pattern

Core Idea: Store sequence of events, not current state

  • Every state change = immutable event appended to the event log
  • Current state = replay all events from the beginning
  • Like Git for your data: full history, time travel, audit trail
  • Snapshots: periodically save state to avoid a full replay from the beginning
Traditional: Account { balance: $150 }
Event Sourcing: Event Log (append-only):
  1. AccountOpened {balance: $0}
  2. Deposited {amount: $200}
  3. Withdrawn {amount: $50}
  Current state = replay all: $0 + $200 - $50 = $150
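The account example can be executed directly: state is never stored, only derived by folding the log. The event dicts and `replay` helper are illustrative.

```python
# The account example above, executed: current state = replay of the log.
events = [
    {"type": "AccountOpened", "balance": 0},
    {"type": "Deposited", "amount": 200},
    {"type": "Withdrawn", "amount": 50},
]

def replay(events):
    """Fold the append-only event log into the current balance."""
    balance = 0
    for e in events:
        if e["type"] == "AccountOpened":
            balance = e["balance"]
        elif e["type"] == "Deposited":
            balance += e["amount"]
        elif e["type"] == "Withdrawn":
            balance -= e["amount"]
    return balance
```

A snapshot is just `replay(events[:k])` persisted, so later reads only fold the events after position k.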
Relevant to Alation
Data lineage, audit trail, metadata versioning: these are core Alation use cases where event sourcing shines. Know this well.

Saga Pattern: Distributed Transactions

Distributed
Problem
How do you maintain consistency across multiple microservices without a distributed ACID transaction (2PC)?

Two Saga Types

  • Choreography: Services react to domain events. No central coordinator. Decoupled but hard to visualize flow.
  • Orchestration: Central Saga Orchestrator sends commands to each service and handles failures. Easier to trace, single place for flow logic.
Choreography (event-driven): OrderPlaced → InventoryReserved → PaymentProcessed → Shipped
  On failure: PaymentFailed → InventoryReleased → OrderCancelled
Orchestration: SagaOrchestrator → reserve-inventory → process-payment → ship → done
  On any failure: compensate backwards
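The orchestration variant can be sketched as a loop over (step, action, compensation) triples; the step names follow the diagram, while `run_saga` and its failure injection are purely illustrative.

```python
# Toy saga orchestrator: run steps in order; on failure, run the already
# completed steps' compensations in reverse order.
def run_saga(steps, fail_at=None):
    completed, log = [], []
    for name, action, compensate in steps:
        if name == fail_at:                 # simulate a step failing
            log.append(f"{name}:failed")
            for done_name, _, comp in reversed(completed):
                comp()
                log.append(f"{done_name}:compensated")
            return log
        action()
        log.append(f"{name}:ok")
        completed.append((name, action, compensate))
    return log

noop = lambda: None  # real steps would call services / emit commands
steps = [
    ("reserve-inventory", noop, noop),
    ("process-payment", noop, noop),
    ("ship", noop, noop),
]
happy = run_saga(steps)                  # all steps succeed
failed = run_saga(steps, fail_at="ship") # compensations run backwards
```

Note the compensations are business-level undos (release inventory, refund payment), not rollbacks: each service already committed locally.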

Outbox Pattern: Reliable Event Publishing

Problem
If you write to DB and then publish to Kafka, what if Kafka publish fails? You have inconsistent state.
Solution: Transactional Outbox
DB Transaction (atomic):
  1. INSERT INTO orders ...
  2. INSERT INTO outbox_events ...
Outbox Relay (Debezium CDC or polling) → Kafka → downstream consumers

How It Works

  • Write business record AND outbox event in same DB transaction (atomic)
  • Separate relay process reads outbox table and publishes to Kafka
  • Relay uses CDC (Debezium reads WAL) or polling loop
  • Guarantees at-least-once delivery; consumer must be idempotent
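The mechanics above can be simulated in memory; plain lists stand in for the MySQL tables and the Kafka topic, and `place_order` / `relay_once` are hypothetical names.

```python
# In-memory sketch of the transactional outbox: the order row and its event
# are committed together, then a relay drains the outbox to the broker.
orders, outbox, kafka = [], [], []

def place_order(order_id):
    # One atomic "transaction": both inserts succeed or neither does.
    orders.append({"id": order_id})
    outbox.append({"event": "OrderPlaced", "order_id": order_id,
                   "published": False})

def relay_once():
    """One relay pass: publish unpublished rows, then mark them published."""
    for row in outbox:
        if not row["published"]:
            kafka.append({"event": row["event"], "order_id": row["order_id"]})
            row["published"] = True

place_order(1)
place_order(2)
relay_once()
relay_once()  # a second pass publishes nothing new
```

If the relay crashes between publishing and marking, the row is re-published on restart, which is exactly why the text says delivery is at-least-once and consumers must be idempotent.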
Resilience Patterns

Circuit Breaker Pattern

Must Know
Problem
If Service B is slow, calls from A to B pile up and threads exhaust, so A also fails: a cascading failure across the system.
CLOSED (normal): all requests flow; failure rate exceeds threshold → OPEN
OPEN (failing): fail fast (no requests); timeout expires → HALF-OPEN
HALF-OPEN (testing): allow 1-2 test requests; fail → OPEN, succeed → CLOSED

Configuration

  • Failure threshold: 50% failure rate in a 10s window → OPEN
  • Open duration: 30s (fail fast, don't retry hammering broken service)
  • Success threshold to close: 3 consecutive successes in HALF-OPEN
  • Fallback: return cached response, default value, or degrade gracefully
  • Libraries: Resilience4j (Java), Hystrix (Netflix, deprecated), Polly (.NET)
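The three-state machine can be sketched as a small class. This version trips on a consecutive-failure count rather than the rate-in-window threshold described above, and the injectable `clock` parameter exists only to make the timeout testable; both are simplifications, not how Resilience4j or Hystrix implement it.

```python
import time

class CircuitBreaker:
    """Minimal CLOSED -> OPEN -> HALF-OPEN -> CLOSED state machine."""

    def __init__(self, failure_threshold=3, open_seconds=30.0,
                 close_after=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds    # how long to fail fast
        self.close_after = close_after      # successes needed to close
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = None

    def allow(self):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.open_seconds:
                self.state = "HALF-OPEN"    # let test requests through
                self.successes = 0
                return True
            return False                    # fail fast
        return True

    def record_success(self):
        if self.state == "HALF-OPEN":
            self.successes += 1
            if self.successes >= self.close_after:
                self.state = "CLOSED"
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self):
        if self.state == "HALF-OPEN":
            self._open()                    # test request failed: reopen
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
```

A caller wraps each downstream request: `if breaker.allow(): try the call and record the result; else: return the fallback`.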

Retry, Timeout, Bulkhead, Rate Limiting

Pattern         | What It Does                                                 | Key Config
Retry + Backoff | Re-attempt failed requests with exponential backoff + jitter | max_retries=3, delay=2^n * 100ms + random(100ms)
Timeout         | Fail fast if response not received in time limit             | connect_timeout=3s, read_timeout=10s
Bulkhead        | Isolate thread/connection pools per downstream service       | maxConcurrentCalls=10 per downstream
Rate Limiting   | Limit incoming requests per client/IP/API key                | 100 req/min per API key
Fallback        | Return degraded response instead of error                    | Return cached data, empty list, default value
Exponential Backoff + Jitter Formula
delay = min(cap, base * 2^attempt) + random(0, jitter_max). Without jitter, all retriers retry simultaneously, creating another thundering herd on the recovering service.
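The formula translates directly to code; `base`, `cap`, and `jitter_max` defaults here are illustrative, and `rng` is injectable only so the schedule can be inspected deterministically.

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0, jitter_max=0.1,
                  rng=random.random):
    # delay = min(cap, base * 2^attempt) + random(0, jitter_max)
    return min(cap, base * (2 ** attempt)) + rng() * jitter_max

# With jitter disabled, every client computes the identical schedule,
# which is exactly how retry storms stay synchronized:
deterministic = [backoff_delay(n, jitter_max=0) for n in range(5)]
# 0.1s, 0.2s, 0.4s, 0.8s, 1.6s, capped at 30s for large attempt counts
```

The random term spreads retries out in time, so a recovering service sees a trickle instead of a synchronized wave.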

Rate Limiting Algorithms

Algorithm              | How                                                 | Burst                         | Memory      | Notes
Token Bucket           | Bucket refills at rate R; each request uses 1 token | Yes (burst up to bucket size) | O(1)        | Most common; allows burst
Leaky Bucket           | Queue requests, process at fixed rate               | No (smooth output)            | O(queue)    | Smooth traffic shaping
Fixed Window           | Count per fixed minute/hour window                  | Yes (edge burst issue)        | O(1)        | Simple; 2x burst possible at boundary
Sliding Window Log     | Store timestamp per request, count within window    | Accurate                      | O(requests) | Accurate but memory-intensive
Sliding Window Counter | Blend current + previous window proportionally      | Approximate                   | O(1)        | Good balance; Redis-friendly

Redis Rate Limiting (Sliding Window Counter)

  • Key: rate:{user_id}:{current_minute}
  • INCR + EXPIRE (atomic with Lua or pipeline)
  • Blend: count = current_bucket + previous_bucket * (1 - elapsed/window)
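The blend formula above can be run against an in-memory dict; in production the per-window counters would live in Redis (INCR + EXPIRE), and the `allowed` helper, limit, and 60s window here are illustrative.

```python
# Sliding-window-counter rate limiter over an in-memory dict.
buckets = {}  # (user_id, window_index) -> request count

def allowed(user_id, now, limit=100, window=60):
    idx = int(now // window)
    elapsed = (now % window) / window
    current = buckets.get((user_id, idx), 0)
    previous = buckets.get((user_id, idx - 1), 0)
    # Blend: weight the previous window by the fraction still "visible"
    estimate = current + previous * (1 - elapsed)
    if estimate >= limit:
        return False
    buckets[(user_id, idx)] = current + 1
    return True
```

Because only two integer counters per user are kept, memory stays O(1) while the estimate smooths out the fixed-window boundary burst.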
Distributed Systems Concepts

Consistent Hashing

Hash Ring (0 to 2^32 - 1), servers placed by hash, e.g. S1 @ 100, S2 @ 200, S3 @ 300, S4 @ 350:
  Key lookup: hash(key) mod 2^32 → walk clockwise to the nearest server
  Adding S5 between S1 and S2: only keys in (hash(S1), hash(S5)] remap
  Removing S2: only S2's keys move to S3
  Virtual Nodes: S1 maps to 150 positions on the ring (S1-vn1, S1-vn2, ...)
  → much more uniform distribution, even with heterogeneous hardware

Used In

  • Cassandra, DynamoDB: partition data across nodes
  • Redis Cluster: slot assignment (actually uses fixed 16384 slots)
  • CDNs: route requests to nearest/least-loaded edge node
  • Load balancers: session affinity without lookup tables
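A ring with virtual nodes fits in a short class; the `ConsistentHashRing` name, MD5 hashing, and vnode count are illustrative choices, not any particular system's implementation.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Hash ring with virtual nodes; a key maps to the first node clockwise."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key):
        h = _hash(key)
        idx = bisect.bisect(self._ring, (h, ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]

ring = ConsistentHashRing(["S1", "S2", "S3"])
before = {f"key{i}": ring.get(f"key{i}") for i in range(1000)}
ring.add("S4")  # only keys landing in S4's arcs remap
after = {k: ring.get(k) for k in before}
moved = sum(1 for k in before if before[k] != after[k])
```

With 4 nodes, roughly a quarter of the keys move, and every moved key moves to S4; compare hash(key) % N, where changing N remaps almost everything.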

Distributed Locking

Approach            | How                                                                  | Trade-offs
Redis SET NX EX     | SET lock_key owner NX EX 30: atomic set-if-not-exists with TTL       | Fast, simple; Redlock algorithm for multi-node safety
Zookeeper           | Create ephemeral sequential znodes; lowest sequence = lock holder    | Strongly consistent; handles crashes via ephemeral nodes
DB Pessimistic Lock | SELECT ... FOR UPDATE on DB row                                      | Simple; slow; doesn't work cross-service
Optimistic Locking  | version column; UPDATE ... WHERE version=expected; retry on conflict | No blocking; great for low contention; retry logic needed
Redis Lock Safety Rule
Only release the lock if YOU own it: check owner == your_id before DEL. Use Lua script for atomic check-and-delete. Auto-expiry prevents deadlock if holder crashes.

Load Balancing Algorithms

Algorithm            | How                                            | Best For
Round Robin          | Rotate through servers in order                | Uniform servers, stateless requests
Weighted Round Robin | Proportional to server capacity                | Heterogeneous hardware
Least Connections    | Route to server with fewest active connections | Long-lived connections, variable request duration
IP Hash              | hash(client_ip) % N: same client → same server | Session stickiness (stateful apps)
Least Response Time  | Lowest latency + fewest connections combined   | Latency-sensitive APIs

L4 vs L7 Load Balancing

  • L4 (Transport): routes by IP + TCP/UDP port. Very fast, no HTTP inspection. AWS NLB.
  • L7 (Application): routes by HTTP path, headers, cookies. Smarter: /api → API servers; /static → CDN. AWS ALB, NGINX.

Bloom Filters

Probabilistic data structure: "definitely NOT in set" OR "probably in set"

  • Space-efficient: represents 1M items in ~1MB vs hash set in 100MB
  • False positives possible, false negatives never
  • Cannot delete (use Counting Bloom Filter for deletion)
Use Case             | How Bloom Filter Helps
URL Shortener        | Check if short code exists before DB query
Cache Penetration    | Block queries for non-existent keys before they hit DB
Email deduplication  | Check if email already sent before DB lookup
Chrome Safe Browsing | Quick check if URL is malicious
Cassandra read path  | Check if key exists in SSTable before disk read
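A tiny Bloom filter makes the "no false negatives, some false positives" property concrete; the `BloomFilter` class, bit count, and use of SHA-256 as the k hash functions are illustrative choices.

```python
import hashlib

class BloomFilter:
    """k hash positions over an m-bit array, stored in one Python int."""

    def __init__(self, m=1024, k=5):
        self.m, self.k = m, k
        self.bits = 0  # integer used as the bit array

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # All k bits set -> "probably in set"; any bit clear -> definitely not
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
for code in ("abc1234", "xYz9876"):  # e.g. existing short codes
    bf.add(code)
```

A short-code lookup would consult `might_contain` first and skip the DB entirely on a definite miss; sizing m and k against the expected item count controls the false-positive rate.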
Networking & Communication

REST vs GraphQL vs gRPC

API Design
Factor           | REST                      | GraphQL                              | gRPC
Protocol         | HTTP/1.1 + JSON           | HTTP/1.1 + JSON                      | HTTP/2 + Protobuf
Schema           | OpenAPI (optional)        | Strongly typed SDL                   | Strongly typed .proto
Over/Under-fetch | Common problem            | Client specifies exact fields needed | N/A (binary, compact)
Performance      | Good                      | Good (1 round trip for complex data) | Excellent (binary, HTTP/2 multiplexing)
Streaming        | SSE / WebSockets          | Subscriptions (WebSocket)            | Bidirectional streaming native
Use case         | Public APIs, simple CRUD  | Mobile BFF, complex nested queries   | Internal microservices, high-throughput
Caching          | Easy (HTTP cache headers) | Complex (POST-based, no HTTP cache)  | Custom only
Tooling          | Excellent (Postman, curl) | Good (GraphiQL, Apollo)              | Good (grpcurl, Buf)

WebSockets vs Long Polling vs SSE

Approach      | How                                                    | Latency         | Direction            | Use Case
Short Polling | Client requests every N seconds                        | Up to N seconds | Pull only            | Simple updates where delay is OK
Long Polling  | Server holds request open until data available         | Low             | Pull only            | Simple push before the WebSockets era
SSE           | Server pushes over persistent HTTP (text/event-stream) | Very low        | Server → Client only | Dashboards, live feeds, notifications
WebSockets    | Full-duplex TCP after HTTP upgrade handshake           | Very low        | Bidirectional        | Chat, gaming, collaborative editors
WebSocket at Scale Challenge
WebSocket servers are stateful (connection on specific server). Solution: Sticky sessions at LB + Redis Pub/Sub to broadcast messages across servers. Any server can publish to Redis; subscriber servers push to their connected users.

CDN & DNS

CDN Flow: User (India) → DNS → CDN Edge Node (Singapore, nearest)
  Cache Hit: return immediately
  Cache Miss: fetch from Origin (US) → cache → return
DNS Resolution: Browser → OS Cache → DNS Resolver (ISP / 8.8.8.8) → Root NS (".") → TLD NS (".com") → Authoritative NS → returns IP → browser connects (TTL cached)

CDN Cache Invalidation

  • Versioned URLs (best practice): /app.v3.2.1.js (new URL = new cache entry)
  • TTL-based: Cache-Control: max-age=86400
  • Purge API: Explicitly invalidate specific paths on deploy (Cloudflare, Fastly)

Reverse Proxy vs Forward Proxy

  • Forward Proxy: client-controlled. Client → Proxy → Internet. VPN, content filtering.
  • Reverse Proxy: server-controlled. Internet → Proxy → Servers. Load balancing, SSL termination, hiding server topology.
Authentication & Security

JWT vs Session Tokens

JWT (Stateless): Header.Payload.Signature
  Header: {"alg":"HS256","typ":"JWT"}
  Payload: {"sub":"user123","roles":["admin"],"exp":1234567890}
  Signature: HMAC-SHA256(base64(header) + "." + base64(payload), secret)
  Server validates the signature ONLY: no DB lookup
  Cannot revoke before expiry (use a short TTL: 15 min)
  Refresh token (long-lived, DB-backed) to get a new access token
Session (Stateful):
  Client sends: Cookie: session_id=abc123
  Server: SELECT * FROM sessions WHERE id='abc123' (Redis for speed)
  Can revoke instantly (delete the session record)
Factor      | JWT                                   | Session Token
Scalability | Excellent: no shared state            | Needs shared Redis/DB across all servers
Revocation  | Hard: need token blacklist in Redis   | Easy: delete session record
Token Size  | Large (~500 bytes)                    | Small (random 32-byte string)
Best For    | Microservices, stateless APIs, mobile | Traditional web apps, high-security (banking)

OAuth 2.0 Authorization Code Flow

1. User clicks "Login with Google"
2. App redirects: GET /oauth/authorize?client_id=APP_ID&redirect_uri=https://app.com/callback&scope=email+profile&response_type=code&state=random_csrf_token  (state = CSRF protection)
3. User authenticates & grants permission at Google
4. Google redirects: GET https://app.com/callback?code=AUTH_CODE&state=random_csrf_token  (verify state matches)
5. App backend calls: POST /oauth/token {code: AUTH_CODE, client_secret: SECRET, grant_type: authorization_code}
6. Google returns: {access_token, refresh_token, expires_in}
7. App calls: GET /userinfo with Bearer access_token
8. On expiry: POST /oauth/token {grant_type: refresh_token, refresh_token: RT}
Key Security Points
  • state param = CSRF protection
  • PKCE replaces client_secret for mobile/SPA
  • access_token: short-lived (15 min)
  • refresh_token: long-lived (30 days, stored securely server-side)
  • Never expose client_secret in the frontend
System Design Interview Questions
Interview Framework
1. Clarify Requirements (functional + non-functional, scale)
2. Estimate Scale (RPS, storage, bandwidth)
3. High-Level Design (major components)
4. API Design (endpoints, data models)
5. Deep Dive (critical components)
6. Trade-offs (what you chose and why)
Q1

Design a URL Shortener (TinyURL)

Requirements: Shorten URLs, redirect, analytics, expiry. Write: 100M URLs/day. Read: 1B redirects/day (10x writes).
Estimates: 1B reads/day = 12K RPS. 100M writes/day = 1.2K RPS. Storage: 100 bytes * 100M * 365 days * 5 years = ~18TB.
Short Code: Base62 (a-z, A-Z, 0-9). 7 chars = 62^7 ≈ 3.5 trillion unique codes. Use the DB auto-increment ID and convert it to Base62.
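The ID-to-code conversion is a few lines of repeated divmod; the alphabet ordering (digits, then lowercase, then uppercase) is a convention choice, not a standard.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int) -> str:
    """Convert an auto-increment ID into a Base62 short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def from_base62(code: str) -> int:
    """Invert the encoding for short-code -> ID lookups."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Because the mapping is bijective, the service can even skip storing the code and decode the ID straight out of the URL path.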
Write: POST /shorten {long_url, expiry} → ShortenerService → INSERT into MySQL (id, short_code, long_url, expiry) → return https://short.ly/abc1234
Read: GET /abc1234 → Redis cache lookup (short_code → long_url)
  HIT: 302 Redirect → long_url
  MISS: MySQL lookup → populate Redis (TTL 24h) → 302 Redirect
Analytics (async): on each redirect → Kafka event {short_code, timestamp, ip, user_agent} → Analytics Consumer → ClickHouse / DynamoDB for aggregates
Design Decisions
  • Auto-increment + Base62: no collision, simple
  • Redis cache: 99%+ redirects served from cache
  • 302 redirect (not 301): server sees every request → accurate analytics
  • Async analytics: redirect latency unaffected
Trade-offs
  • Auto-increment is guessable (sequential codes)
  • Hash-based codes: possible collisions → need retry logic
  • Pre-generated codes pool: batch-generate offline
  • Bloom filter: check code non-existence before DB query
Q2

Design a Chat System (WhatsApp / Slack)

Requirements: 1:1 and group chat, online presence, message history, delivery/read receipts, media sharing.
Real-time Transport: WebSocket (full-duplex, bidirectional). Long Polling as fallback for restrictive networks.
Message Storage: Cassandra. RowKey = channel_id, clustering by timestamp. Write-heavy, append-only, time-ordered = perfect fit.
Message Flow (Alice → Bob):
1. Alice sends msg via WebSocket → Chat Server A
2. Chat Server A: (a) persist to Cassandra async, (b) PUBLISH to Redis channel "user:bob:msgs"
3. Chat Server B (Bob's connection) subscribes → receives from Redis
4. Chat Server B delivers to Bob via WebSocket
5. Bob's client sends "delivered" ack → Chat Server B → Cassandra
6. Alice receives delivery receipt via WebSocket
Group Chat (fan-out): message → Kafka → fan-out consumer → deliver to each member's inbox. For large groups (10K+): pull model, members fetch on open.
Presence Service: WebSocket connect → SET user:online:{userId} 1 EX 60 (heartbeat every 30s). Heartbeat missed → key expires → user offline.
Key Decisions
  • Redis Pub/Sub connects stateful WebSocket servers
  • Cassandra: ideal for high-write time-series message history
  • Offline inbox: store in DB, deliver on reconnect
  • Media: upload to S3, share URL in message
Challenges
  • Large groups (100K): fan-out on write is too expensive → use pull
  • Message ordering: Cassandra stores by timestamp; tie-break with UUID
  • E2E encryption: client encrypts, server stores ciphertext only
  • Message dedup: idempotency key per message to prevent duplicates on retry
Q3

Design Twitter / News Feed

Core Challenge: Fan-out on Write vs Fan-out on Read

  • Push (Fan-out on Write): When user tweets, push to all followers' Redis feeds. Fast reads. Bad for celebrities (50M followers).
  • Pull (Fan-out on Read): Merge tweets from followed users at read time. Slow reads. DB expensive.
  • Hybrid (Twitter's approach): Push for regular users, Pull for celebrities at read time. Best of both.
Tweet Write: POST /tweet → TweetService → Cassandra (tweets by userId + timestamp) → Kafka: "tweet-created" event
FanoutWorker (async): for each follower (if follower_count < 10K): ZADD feed:{followerId} tweet.timestamp tweet.id. Celebrities skip fan-out: pulled at read time.
Feed Read: GET /feed/{userId} → Feed Service:
  1. ZRANGE feed:{userId} 0 20 → regular-user tweet IDs
  2. For each celebrity followed: fetch their latest 20 tweets directly
  3. Merge all, sort by timestamp, return top 20
Like/RT: Redis INCR likes:{tweetId} → async flush to Cassandra every 5 min
Architecture Decisions
  • Cassandra for tweets: write-heavy, time-series pattern
  • Redis Sorted Set for feed: O(log N) insert, O(1) range read
  • Async fan-out via Kafka: tweet API returns instantly
  • S3 + CDN for media content
Trade-offs
  • Push model: more storage (each tweet copied to N followers' feeds)
  • Pull model: slow read (query N followed users, merge, sort)
  • Celebrity tweets: fan-out to 100M followers takes minutes
  • Trending: sliding window count via Redis sorted sets
Q4

Design YouTube / Video Streaming

Upload: Chunked upload to S3 → Kafka event → Transcoder cluster (FFmpeg) → multiple resolutions (360p/720p/1080p/4K) → output to CDN
Streaming: HLS (HTTP Live Streaming). Video split into ~10s segments (.ts files). Manifest file (.m3u8) lists URLs. Client adapts resolution based on bandwidth (ABR).
Metadata: MySQL for structured data (title, description, tags, owner). Redis for view counts. Elasticsearch for search.
Upload Flow: Client → chunked multipart upload (resumable) → S3 (raw) → S3 event → Kafka: "video-uploaded"
Transcoder workers (GPU instances): input raw.mp4 → FFmpeg → [360p, 720p, 1080p, 4K] + thumbnails → upload outputs → CDN origin → CDN distributes globally
Metadata service: UPDATE videos SET status='published' WHERE id=...
Playback: GET /video/{id}/manifest → return .m3u8 (HLS manifest file) → client parses → fetches segments from CDN edge
Client detects bandwidth drop → switches to lower-quality segment URL
Key Decisions
  • CDN essential: video = 80%+ of internet traffic
  • HLS/DASH: industry standard, all browsers support
  • Async transcoding: instant upload response, background processing
  • Separate read/write paths (CQRS-like for metadata)
Challenges
  • Storage: 1 video → 5 qualities + thumbnail + captions ≈ 10x raw size
  • Transcoding cost: GPU instances expensive; queue management critical
  • View count: Redis INCR → async flush to DB (approximate but fast)
  • Copyright: Content-ID fingerprinting pipeline runs in background
Q5

Design a Distributed Cache (like Redis Cluster)

Data Distribution: Consistent hashing. hash(key) maps key to server. Virtual nodes for uniform distribution.
Replication: 1 primary + 2 replicas per shard. Reads go to replicas (eventual consistency). Writes always to primary.
Eviction: LRU per shard. Track access time. When memory full, evict least-recently-used entries first.
Persistence: RDB (point-in-time snapshot, compact) + AOF (append-only log, durable). Use both for full safety.
Design Decisions
  • In-memory: microsecond latency vs millisecond for disk
  • Single-threaded event loop: no lock contention
  • Pipelining: batch multiple commands in one round-trip
  • Lua scripting: atomic multi-command operations
Trade-offs
  • Memory limit: working set must fit in RAM (expensive)
  • Hot key: shard replicas, local L1 cache in application
  • Replication lag: replica may briefly serve stale reads
  • Split-brain: use Redis Sentinel or Cluster for HA failover
Q6

Design Search Autocomplete (Typeahead)

Requirements: Top-5 suggestions per prefix, under 100ms latency. 10M users, 10 queries each = 100M queries/day = ~1,200 RPS.
Data Structure: Trie with frequency at each node. Or Elasticsearch prefix query with score boosting.
Ranking: Query frequency + recency + personalization weight.
Query (read, latency-critical): GET /autocomplete?q=data+cat → CDN (cache popular prefixes like "data", "date", "da") → Redis: ZREVRANGE prefix:data_cat 0 4 → [top 5 sorted by score] → Trie Service (in-memory, sharded by first char) → Elasticsearch fallback
Update (async): user search logs → Kafka → Aggregation Service (5-min windows) → update term frequency → rebuild top-K per prefix → push to Redis ZSet (ZADD prefix:data_cat freq "data catalog")
Optimizations
  • Trie sharded by first 2 chars (26^2 = 676 shards)
  • Redis ZSet caches top-5 per prefix (1-hour TTL)
  • CDN caches most common prefixes globally
  • Client debounce: only query after 300ms idle
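The Trie Service can be sketched as a frequency-weighted trie with a top-k prefix query; the terms, counts, and the `top_k` method name are all made up for illustration (production systems precompute top-k per node instead of collecting at query time).

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.freq = 0  # > 0 marks the end of a stored term

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term: str, freq: int):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.freq = freq

    def top_k(self, prefix: str, k: int = 5):
        # Walk to the prefix node, then collect and rank its subtree.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []

        def collect(n, path):
            if n.freq:
                results.append((path, n.freq))
            for ch, child in n.children.items():
                collect(child, path + ch)

        collect(node, prefix)
        results.sort(key=lambda t: -t[1])
        return [term for term, _ in results[:k]]

trie = Trie()
for term, freq in [("data catalog", 90), ("data quality", 70),
                   ("database", 50), ("date", 10)]:
    trie.insert(term, freq)
```

Caching each prefix's top-k list (the Redis ZSet layer above) is what removes this subtree walk from the hot path.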
Challenges
  • Trie memory: 5M terms at avg 8 chars = large in-memory footprint
  • Real-time updates to the trie are expensive → batch updates every 5 min
  • Personalization: blend global score + user history with weights
  • Multi-language: Unicode = more complex trie / different sharding
Q7

Design a Notification System

Event Sources: User Action / Business Alert / Marketing Campaign → Notification Service API (validates payload, applies user preferences, deduplication check) → async → Kafka topics (one per channel):
  • notifications-email → Email Worker → SendGrid / SES (retry + DLQ)
  • notifications-sms → SMS Worker → Twilio (retry + DLQ)
  • notifications-push → Push Worker → FCM / APNs (retry + DLQ)
  • notifications-inapp → InApp Worker → WebSocket / DB
Notification DB:
  • Delivery status (sent, delivered, failed, read)
  • User preferences (email opt-in, push enabled, quiet hours 10pm-8am)
Key Decisions
  • Idempotency key: prevents duplicate sends across retries
  • User preferences service: opt-outs, quiet hours, frequency caps
  • Priority lanes: critical alerts bypass normal queue
  • Rate limit: max 3 emails/hour per user to prevent spam
Challenges
  • Deduplication: same event published twice → idempotency key check before send
  • Marketing blast to 10M: partition workers, throttle to ~100K/min
  • Delivery tracking: webhooks from SendGrid/Twilio update status
  • DLQ monitoring: alert on-call when DLQ grows beyond threshold
Q8

Design Uber / Ride-Hailing

Location Tracking: Drivers send GPS every 5s. GEOADD in Redis (O(log N) insert). GEORADIUS for nearest-driver query.
Geohashing: Encode lat/lng as a string. Nearby locations share prefix. Efficient area queries without lat/lng math.
Matching: Find N nearest available drivers → send offers simultaneously → first accept wins → create Trip (MySQL, ACID).
Driver Location: Driver app → WebSocket → Location Service → GEOADD drivers {lng, lat, driver_id}
Ride Request: User → RideRequest Service:
  1. GEORADIUS(user_lat, user_lng, radius=5km) → [driverA, driverB, driverC]
  2. Filter: available + rating > 4.5 + correct vehicle type
  3. Offer sent to top 3 simultaneously (first accept wins)
  4. Accepted: INSERT INTO trips (atomic, MySQL) → lock driver state
Real-time Trip Tracking: Driver GPS every 3s → Kafka → TripTrackingService → WebSocket → User
Key Decisions
  • Redis Geo: O(log N) radius search, in-memory speed
  • MySQL for trips: ACID needed for payment integrity
  • Cassandra for location history: write-heavy time-series
  • Surge pricing: pre-computed by geohash zone demand
Challenges
  • Simultaneous accept: optimistic lock on driver status (available → in-trip)
  • GPS accuracy: Kalman filter to smooth noisy GPS readings
  • Driver state machine: available → offered → in-ride → available
  • ETA: integrate OSRM / Google Maps API for routing and ETA
Q9

Design a Data Catalog (like Alation)

Core Features: Metadata ingestion from databases/BI tools, search, lineage tracking, data quality scoring, collaboration (stewardship, annotations).
Metadata Storage: Graph DB (Neo4j) for lineage (A → B → C relationships). Elasticsearch for full-text search. PostgreSQL for structured catalog data.
Connectors: Pull metadata via JDBC/REST from source systems (Snowflake, BigQuery, Tableau). Schedule crawls. Detect schema drift.
Metadata Ingestion: Connector Framework → Source System (Snowflake/BigQuery/Tableau)
  Extract: tables, columns, types, stats, query history
  Transform: normalize to internal metadata model
  Load: PostgreSQL (structured catalog: tables, columns, owners)
        Elasticsearch (search index: name, description, tags)
        Neo4j (lineage graph: table derived from upstream table)
        Event Store (schema change history for audit)
Search: user query → Elasticsearch (full-text + faceted filtering). Boost: verified assets, frequently accessed, recently updated.
Lineage: MATCH (a)-[:DERIVED_FROM*]->(b) WHERE a.name='revenue_report' → show upstream → impact analysis for data quality issues.
Key Decisions
  • Graph DB: lineage relationships are inherently graph-structured
  • Elasticsearch: fuzzy search, facets, ranking by relevance
  • Event Sourcing: full audit trail of metadata changes
  • CQRS: write-optimized ingest path, read-optimized search path
Challenges
  • Schema drift: source changes must be detected and affected assets re-indexed
  • Scale: 100K+ tables, millions of columns = large search index
  • Trust/quality scoring: aggregate from multiple signals (freshness, usage, docs)
  • Real-time lineage: parse SQL query logs to extract column-level lineage