System Design Complete Guide


Fundamentals, architecture patterns, and real interview questions with full answers and trade-offs.

Core Fundamentals

CAP Theorem

Pick any 2: Consistency, Availability, Partition Tolerance

  • Consistency: every read returns the most recent write
  • Availability: every request gets a non-error response
  • Partition Tolerance: the system works despite network partitions
System                 | Type
PostgreSQL, MySQL      | CA (single node)
Cassandra, DynamoDB    | AP (eventual)
HBase, Zookeeper, etcd | CP (may reject)
MongoDB (default)      | CP (tunable)
Couchbase              | AP or CP (configurable)
Reality
P is always required in distributed systems. The real choice is C vs A during a partition.

ACID vs BASE

ACID (SQL)                      | BASE (NoSQL)
Atomicity: all or nothing       | Basically Available
Consistency: valid state always | Soft State
Isolation: no interference      | Eventually Consistent
Durability: committed = safe    | Prioritizes availability
Use ACID: banking, bookings, inventory. Use BASE: social feeds, analytics, caching, data catalog metadata.

Scalability: Vertical vs Horizontal + Strategies

Vertical Scaling (Scale Up)
  • Add CPU/RAM to existing server
  • Simple: no application changes
  • Single point of failure
  • Hard hardware limit; expensive
Horizontal Scaling (Scale Out)
  • Add more commodity machines
  • Theoretically unlimited
  • Requires stateless services + LB
  • Complex: distributed state, consistency

Key Scaling Strategies

  • Stateless Services: Store state in Redis/DB. Any instance handles any request.
  • Read Replicas: Primary for writes, replicas for reads (80% of traffic is reads)
  • Sharding: Partition data across nodes by shard key
  • CDN: Cache static assets globally, reduce origin server load drastically
  • Async Processing: queue slow work (emails, reports) for an instant API response
  • Denormalization: pre-compute joins for faster reads at the cost of storage + write complexity

PACELC Theorem (CAP Extension)

If (P)artition: choose (A) or (C). Else (E): choose (L)atency or (C)onsistency

  • CAP only models partition scenarios. PACELC says: even normally, there is a latency vs consistency tradeoff.
  • Replication gives lower latency OR stronger consistency, never both.
System         | Partition choice | Else choice
DynamoDB       | Availability     | Latency (low)
Cassandra      | Availability     | Latency (low)
Google Spanner | Consistency      | Consistency (strong)
MySQL          | Consistency      | Consistency
Databases

Database Indexing: B-Tree, Hash, Composite

Must Know
Type           | Structure                  | Best For                                       | Not For
B-Tree         | Balanced tree              | Range queries, ORDER BY, <, >, BETWEEN         | High-cardinality exact lookups (hash faster)
Hash Index     | Hash table                 | Exact equality (=), O(1)                       | Range queries, sorting
Composite      | B-Tree on N cols           | Multi-column queries (leftmost prefix rule)    | Queries not starting with leftmost column
Covering Index | Index has all queried cols | Satisfy query from index alone (no heap fetch) | Wide tables with many columns
Full-Text      | Inverted index             | LIKE '%word%', text search                     | Exact match (regular index faster)
Partial Index  | Index with WHERE clause    | WHERE status='active': index only active rows  | When all rows need indexing

Index Trade-offs

  • Reads faster: SELECT queries benefit dramatically (O(log n) vs O(n))
  • Writes slower: every INSERT/UPDATE/DELETE updates all indexes
  • Storage cost: indexes use ~20-30% extra disk space each
  • Cardinality: index on gender (2 values) = nearly useless; index on user_id = excellent
Leftmost Prefix Rule
Composite index (a, b, c) helps queries on: (a), (a,b), (a,b,c). Does NOT help: (b), (c), or (b,c) alone.

SQL vs NoSQL: When to Use What

Factor             | SQL (Relational)             | NoSQL
Schema             | Fixed, enforced (DDL)        | Flexible, schema-on-read
Relationships      | Foreign keys, JOINs          | Denormalized, embedded
Horizontal Scaling | Difficult (sharding complex) | Native (Cassandra, Mongo)
Transactions       | Full ACID                    | Limited / eventual
Query Language     | SQL (very expressive)        | API-specific, limited joins

NoSQL Types & Use Cases

  • Key-Value (Redis, DynamoDB): sessions, cache, feature flags
  • Document (MongoDB): product catalogs, CMS, user profiles
  • Wide-Column (Cassandra, HBase): IoT, time-series, write-heavy analytics
  • Graph (Neo4j): social network, knowledge graph, data lineage
  • Search (Elasticsearch): full-text search, log analytics
  • Time-Series (InfluxDB, TimescaleDB): metrics, monitoring

Choose SQL When

  • Complex queries with multiple JOINs
  • ACID transactions are critical (money, inventory)
  • Schema is stable and well-defined
  • Rich aggregation / reporting needs
  • E.g.: banking, ERP, booking, HR systems
Conversely, choose NoSQL when the schema evolves rapidly or write throughput is extreme.

Database Replication Strategies

Single Leader (Primary-Replica): Leader (writes) → async/sync replication → Replicas 1, 2, 3. Reads can go to any replica (scale reads 3x).
Multi-Leader: Leader A (DC West) ↔ replication ↔ Leader B (DC East). Both accept writes; conflict resolution required.
Leaderless (Dynamo-style): Client writes to W nodes and reads from R nodes; R + W > N = consistency.
Strategy      | Pros                                 | Cons
Single Leader | Simple, no conflicts, ordered writes | Leader bottleneck, failover needed
Multi-Leader  | Write availability in multiple DCs   | Conflict resolution required
Leaderless    | High availability, no SPOF           | Complex quorums, eventual consistency

Sync vs Async Replication

  • Synchronous: write confirmed only when all replicas ack. Safe but slow.
  • Asynchronous: write confirmed on the leader's write. Fast, but potential data loss on leader crash.
  • Semi-sync: at least 1 replica acks. Balance of safety and speed (MySQL's semi-sync replication).

Database Sharding Strategies & Pitfalls

Strategy           | How                        | Pros                             | Cons
Range Sharding     | A-M → shard1, N-Z → shard2 | Simple range queries             | Hotspots (all celebrities on one shard)
Hash Sharding      | hash(key) % N              | Even distribution                | Rebalancing when N changes remaps many keys
Consistent Hashing | Hash ring + virtual nodes  | Minimal remapping on add/remove  | More complex; virtual nodes needed for uniformity
Directory Sharding | Lookup table: key → shard  | Flexible, can move data          | Lookup service = SPOF, extra hop
Geo Sharding       | US → shard1, EU → shard2   | Data locality, compliance (GDPR) | Cross-region queries expensive
Sharding Pitfalls
  • Cross-shard JOINs = scatter-gather (expensive)
  • Distributed transactions across shards need 2PC
  • Rebalancing requires data migration
  • Choose the shard key carefully: it is hard to change later
  • Hot shard = one shard receives disproportionate traffic
Caching

Caching Strategies: All Patterns

Critical
Strategy                  | Flow                                          | Use When                               | Downside
Cache-Aside (Lazy)        | Check cache → miss → load DB → store in cache | Read-heavy, tolerate some stale data   | Cache miss penalty, thundering herd on cold start
Read-Through              | Cache handles DB fetch transparently on miss  | Simplified app code, uniform read path | Cache must understand data model
Write-Through             | Write to cache AND DB synchronously           | Strong consistency needed              | Write latency doubled
Write-Behind (Write-Back) | Write to cache; async flush to DB             | Write-heavy, latency-critical          | Data loss if cache crashes before flush
Write-Around              | Write directly to DB, skip cache              | Write-once, read-rarely data (logs)    | Cache miss on first read after write
Refresh-Ahead             | Proactively refresh cache before TTL expires  | Predictable, hot data access patterns  | Wasted refresh if data not re-accessed

Eviction Policies

  • LRU (Least Recently Used): evict least recently accessed. Best for most workloads.
  • LFU (Least Frequently Used): evict least accessed overall. Good when some items are permanently hot.
  • TTL (Time-To-Live): expire after fixed duration. Prevents stale data.
  • FIFO: simple but poor hit rate. Rarely used in production.
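The LRU policy above can be sketched in a few lines with `collections.OrderedDict`; the `LRUCache` class and its method names are illustrative, not any particular cache library's API.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used key when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the LRU entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.put("c", 3)  # evicts "b", the least recently used
```

Production caches layer TTLs and size-aware eviction on top of this core idea, but the access-order bookkeeping is the same.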

Redis Data Structures for System Design

  • String: counter (INCR), session, feature flag, simple KV
  • Hash: user profile fields (HSET/HGET), object properties
  • List: activity feed, queue (LPUSH/RPOP), recent N items
  • Set: unique visitors, followers, tags (SADD/SINTER for mutual friends)
  • Sorted Set (ZSet): leaderboard (ZADD/ZRANGE), rate limiting, priority queue
  • Pub/Sub: real-time notifications, chat, cross-server broadcast
  • Streams: durable event log (lightweight Kafka replacement)
  • Geo: GEOADD/GEORADIUS for nearest driver, nearby stores

Cache Problems & Solutions

Problem                          | Cause                                                           | Solution
Cache Stampede (Thundering Herd) | Many concurrent misses hit DB simultaneously after cache expiry | Mutex/distributed lock, probabilistic early expiry, background refresh, never-expire + async update
Cache Penetration                | Repeated queries for non-existent keys bypass cache, hammer DB  | Cache null values with short TTL; Bloom Filter to pre-check existence
Cache Avalanche                  | Mass cache expiry at the same time → DB overwhelmed             | Randomize TTL (base_ttl + random(300s)); circuit breaker; multi-layer cache
Hot Key Problem                  | Single key (celebrity post) gets extreme traffic on one shard   | Replicate hot key to multiple shards; local in-process L1 cache; rate limit
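The stampede mitigation row above can be demonstrated with a per-key mutex in plain Python; the dict-based cache, `load_from_db`, and `db_calls` counter are stand-ins for Redis and a real database.

```python
import threading

# In-memory stand-ins for the cache and DB; names are illustrative.
_cache = {}
_locks = {}
_locks_guard = threading.Lock()
db_calls = 0

def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def load_from_db(key):
    global db_calls
    db_calls += 1               # count how often the "DB" is actually hit
    return f"value-for-{key}"

def get(key):
    # Cache-aside read path with a per-key mutex: of N concurrent misses,
    # exactly one caller recomputes; the rest wait, then read the cache.
    if key in _cache:
        return _cache[key]
    with _lock_for(key):
        if key in _cache:       # double-check after acquiring the lock
            return _cache[key]
        value = load_from_db(key)
        _cache[key] = value
        return value

# Ten concurrent misses for the same key -> a single DB load
threads = [threading.Thread(target=get, args=("user:1",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In a distributed setup the same pattern uses a distributed lock (e.g. Redis SET NX EX) instead of a process-local mutex.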
Microservices & Architecture Patterns

Monolith vs Microservices

Monolith Pros
  • Simple to develop, test, deploy
  • No distributed system complexity
  • No network latency between components
  • ACID transactions are trivial
  • Single deployable artifact
Monolith Cons
  • Must scale entire app even if only one part needs it
  • One tech stack for everything
  • Large codebase = slow builds, high cognitive load
  • All-or-nothing deploys = risky releases
Microservices Pros
  • Independent deploy, scale, and fault isolation
  • Technology flexibility per service
  • Small teams own individual services (Conway's Law)
  • One service crash doesn't bring down the whole system
Microservices Cons
  • Distributed systems complexity (latency, partial failures)
  • Cross-service transactions require Saga / 2PC
  • More infrastructure: service discovery, API gateway, tracing
  • Debugging is harder; need distributed tracing (Jaeger, Zipkin)
When to Choose Microservices
When different parts have drastically different scaling needs, different teams need independent release cycles, or different components need different technology stacks. NOT for early-stage products: premature optimization kills velocity.

API Gateway Pattern

Client (Mobile / Web / Third-party) → API Gateway → User Service / Order Service / Catalog Service

Gateway responsibilities:
  • Auth (JWT validation, in one place)
  • Rate Limiting (per client / IP)
  • SSL Termination
  • Request Routing / Load Balancing
  • Request / Response Transformation
  • Logging, Distributed Tracing
  • Circuit Breaking
Tools
Kong, AWS API Gateway, NGINX, Traefik, Envoy, Spring Cloud Gateway

BFF Pattern (Backend for Frontend)

  • Separate API gateway per client type (mobile BFF, web BFF)
  • Each BFF aggregates exactly the data that client needs
  • Avoids over-fetching on mobile (bandwidth matters) vs web

CQRS: Command Query Responsibility Segregation

Architecture Pattern
Commands (writes) → Command Handler → Write Model (normalized SQL)
Queries (reads) → Query Handler → Read Model (denormalized view: Elasticsearch, Redis)
Events published → Kafka → consumers update the read model async
Example (Orders): POST /orders → write to MySQL → publish OrderPlaced → Kafka
Kafka consumer → build summary view in Redis / Elasticsearch
GET /orders/summary → reads from the fast read store
Pros
  • Optimize reads and writes independently
  • Scale read and write services separately
  • Read model tailored per use case
  • Natural fit with Event Sourcing
Cons
  • Eventual consistency between read/write models
  • Significantly more complex architecture
  • Duplicate data, sync must be maintained
  • Overkill for simple CRUD apps

Event Sourcing

Architecture Pattern

Core Idea: Store sequence of events, not current state

  • Every state change = immutable event appended to the event log
  • Current state = replay all events from the beginning
  • Like Git for your data: full history, time travel, audit trail
  • Snapshots: periodically save state to avoid a full replay from the beginning
Traditional: Account { balance: $150 }
Event Sourcing: Event Log (append-only):
  1. AccountOpened {balance: $0}
  2. Deposited {amount: $200}
  3. Withdrawn {amount: $50}
  Current state = replay all: $0 + $200 - $50 = $150
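The account example can be executed directly: state is never stored, only derived by folding the log. The event dicts and `replay` helper are illustrative.

```python
# The account example above, executed: current state = replay of the log.
events = [
    {"type": "AccountOpened", "balance": 0},
    {"type": "Deposited", "amount": 200},
    {"type": "Withdrawn", "amount": 50},
]

def replay(events):
    """Fold the append-only event log into the current balance."""
    balance = 0
    for e in events:
        if e["type"] == "AccountOpened":
            balance = e["balance"]
        elif e["type"] == "Deposited":
            balance += e["amount"]
        elif e["type"] == "Withdrawn":
            balance -= e["amount"]
    return balance
```

A snapshot is just `replay(events[:k])` persisted, so later reads only fold the events after position k.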
Relevant to Alation
Data lineage, audit trail, metadata versioning: these are core Alation use cases where event sourcing shines. Know this well.

Saga Pattern: Distributed Transactions

Distributed
Problem
How do you maintain consistency across multiple microservices without a distributed ACID transaction (2PC)?

Two Saga Types

  • Choreography: Services react to domain events. No central coordinator. Decoupled but hard to visualize flow.
  • Orchestration: Central Saga Orchestrator sends commands to each service and handles failures. Easier to trace, single place for flow logic.
Choreography (event-driven): OrderPlaced → InventoryReserved → PaymentProcessed → Shipped
  On failure: PaymentFailed → InventoryReleased → OrderCancelled
Orchestration: SagaOrchestrator → reserve-inventory → process-payment → ship → done
  On any failure: compensate backwards
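The orchestration variant can be sketched as a loop over (step, action, compensation) triples; the step names follow the diagram, while `run_saga` and its failure injection are purely illustrative.

```python
# Toy saga orchestrator: run steps in order; on failure, run the already
# completed steps' compensations in reverse order.
def run_saga(steps, fail_at=None):
    completed, log = [], []
    for name, action, compensate in steps:
        if name == fail_at:                 # simulate a step failing
            log.append(f"{name}:failed")
            for done_name, _, comp in reversed(completed):
                comp()
                log.append(f"{done_name}:compensated")
            return log
        action()
        log.append(f"{name}:ok")
        completed.append((name, action, compensate))
    return log

noop = lambda: None  # real steps would call services / emit commands
steps = [
    ("reserve-inventory", noop, noop),
    ("process-payment", noop, noop),
    ("ship", noop, noop),
]
happy = run_saga(steps)                  # all steps succeed
failed = run_saga(steps, fail_at="ship") # compensations run backwards
```

Note the compensations are business-level undos (release inventory, refund payment), not rollbacks: each service already committed locally.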

Outbox Pattern: Reliable Event Publishing

Problem
If you write to DB and then publish to Kafka, what if Kafka publish fails? You have inconsistent state.
Solution: Transactional Outbox
DB Transaction (atomic):
  1. INSERT INTO orders ...
  2. INSERT INTO outbox_events ...
Outbox Relay (Debezium CDC or polling) → Kafka → downstream consumers

How It Works

  • Write business record AND outbox event in same DB transaction (atomic)
  • Separate relay process reads outbox table and publishes to Kafka
  • Relay uses CDC (Debezium reads WAL) or polling loop
  • Guarantees at-least-once delivery; consumer must be idempotent
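The mechanics above can be simulated in memory; plain lists stand in for the MySQL tables and the Kafka topic, and `place_order` / `relay_once` are hypothetical names.

```python
# In-memory sketch of the transactional outbox: the order row and its event
# are committed together, then a relay drains the outbox to the broker.
orders, outbox, kafka = [], [], []

def place_order(order_id):
    # One atomic "transaction": both inserts succeed or neither does.
    orders.append({"id": order_id})
    outbox.append({"event": "OrderPlaced", "order_id": order_id,
                   "published": False})

def relay_once():
    """One relay pass: publish unpublished rows, then mark them published."""
    for row in outbox:
        if not row["published"]:
            kafka.append({"event": row["event"], "order_id": row["order_id"]})
            row["published"] = True

place_order(1)
place_order(2)
relay_once()
relay_once()  # a second pass publishes nothing new
```

If the relay crashes between publishing and marking, the row is re-published on restart, which is exactly why the text says delivery is at-least-once and consumers must be idempotent.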
Resilience Patterns

Circuit Breaker Pattern

Must Know
Problem
If Service B is slow, calls from A to B pile up and threads exhaust, so A also fails: a cascading failure across the system.
CLOSED (normal): all requests flow; failure rate exceeds threshold → OPEN
OPEN (failing): fail fast (no requests); timeout expires → HALF-OPEN
HALF-OPEN (testing): allow 1-2 test requests; fail → OPEN, succeed → CLOSED

Configuration

  • Failure threshold: 50% failure rate in a 10s window → OPEN
  • Open duration: 30s (fail fast, don't retry hammering broken service)
  • Success threshold to close: 3 consecutive successes in HALF-OPEN
  • Fallback: return cached response, default value, or degrade gracefully
  • Libraries: Resilience4j (Java), Hystrix (Netflix, deprecated), Polly (.NET)
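The three-state machine can be sketched as a small class. This version trips on a consecutive-failure count rather than the rate-in-window threshold described above, and the injectable `clock` parameter exists only to make the timeout testable; both are simplifications, not how Resilience4j or Hystrix implement it.

```python
import time

class CircuitBreaker:
    """Minimal CLOSED -> OPEN -> HALF-OPEN -> CLOSED state machine."""

    def __init__(self, failure_threshold=3, open_seconds=30.0,
                 close_after=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds    # how long to fail fast
        self.close_after = close_after      # successes needed to close
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = None

    def allow(self):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.open_seconds:
                self.state = "HALF-OPEN"    # let test requests through
                self.successes = 0
                return True
            return False                    # fail fast
        return True

    def record_success(self):
        if self.state == "HALF-OPEN":
            self.successes += 1
            if self.successes >= self.close_after:
                self.state = "CLOSED"
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self):
        if self.state == "HALF-OPEN":
            self._open()                    # test request failed: reopen
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
```

A caller wraps each downstream request: `if breaker.allow(): try the call and record the result; else: return the fallback`.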

Retry, Timeout, Bulkhead, Rate Limiting

Pattern         | What It Does                                                 | Key Config
Retry + Backoff | Re-attempt failed requests with exponential backoff + jitter | max_retries=3, delay=2^n * 100ms + random(100ms)
Timeout         | Fail fast if response not received in time limit             | connect_timeout=3s, read_timeout=10s
Bulkhead        | Isolate thread/connection pools per downstream service       | maxConcurrentCalls=10 per downstream
Rate Limiting   | Limit incoming requests per client/IP/API key                | 100 req/min per API key
Fallback        | Return degraded response instead of error                    | Return cached data, empty list, default value
Exponential Backoff + Jitter Formula
delay = min(cap, base * 2^attempt) + random(0, jitter_max). Without jitter, all retriers retry simultaneously, creating another thundering herd on the recovering service.
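The formula translates directly to code; `base`, `cap`, and `jitter_max` defaults here are illustrative, and `rng` is injectable only so the schedule can be inspected deterministically.

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0, jitter_max=0.1,
                  rng=random.random):
    # delay = min(cap, base * 2^attempt) + random(0, jitter_max)
    return min(cap, base * (2 ** attempt)) + rng() * jitter_max

# With jitter disabled, every client computes the identical schedule,
# which is exactly how retry storms stay synchronized:
deterministic = [backoff_delay(n, jitter_max=0) for n in range(5)]
# 0.1s, 0.2s, 0.4s, 0.8s, 1.6s, capped at 30s for large attempt counts
```

The random term spreads retries out in time, so a recovering service sees a trickle instead of a synchronized wave.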

Rate Limiting Algorithms

Algorithm              | How                                                 | Burst                         | Memory      | Notes
Token Bucket           | Bucket refills at rate R; each request uses 1 token | Yes (burst up to bucket size) | O(1)        | Most common; allows burst
Leaky Bucket           | Queue requests, process at fixed rate               | No (smooth output)            | O(queue)    | Smooth traffic shaping
Fixed Window           | Count per fixed minute/hour window                  | Yes (edge burst issue)        | O(1)        | Simple; 2x burst possible at boundary
Sliding Window Log     | Store timestamp per request, count within window    | Accurate                      | O(requests) | Accurate but memory-intensive
Sliding Window Counter | Blend current + previous window proportionally      | Approximate                   | O(1)        | Good balance; Redis-friendly

Redis Rate Limiting (Sliding Window Counter)

  • Key: rate:{user_id}:{current_minute}
  • INCR + EXPIRE (atomic with Lua or pipeline)
  • Blend: count = current_bucket + previous_bucket * (1 - elapsed/window)
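The blend formula above can be run against an in-memory dict; in production the per-window counters would live in Redis (INCR + EXPIRE), and the `allowed` helper, limit, and 60s window here are illustrative.

```python
# Sliding-window-counter rate limiter over an in-memory dict.
buckets = {}  # (user_id, window_index) -> request count

def allowed(user_id, now, limit=100, window=60):
    idx = int(now // window)
    elapsed = (now % window) / window
    current = buckets.get((user_id, idx), 0)
    previous = buckets.get((user_id, idx - 1), 0)
    # Blend: weight the previous window by the fraction still "visible"
    estimate = current + previous * (1 - elapsed)
    if estimate >= limit:
        return False
    buckets[(user_id, idx)] = current + 1
    return True
```

Because only two integer counters per user are kept, memory stays O(1) while the estimate smooths out the fixed-window boundary burst.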
Distributed Systems Concepts

Consistent Hashing

Hash Ring (0 to 2^32 - 1), servers placed by hash, e.g. S1 @ 100, S2 @ 200, S3 @ 300, S4 @ 350:
  Key lookup: hash(key) mod 2^32 → walk clockwise to the nearest server
  Adding S5 between S1 and S2: only keys in (hash(S1), hash(S5)] remap
  Removing S2: only S2's keys move to S3
  Virtual Nodes: S1 maps to 150 positions on the ring (S1-vn1, S1-vn2, ...)
  → much more uniform distribution, even with heterogeneous hardware

Used In

  • Cassandra, DynamoDB: partition data across nodes
  • Redis Cluster: slot assignment (actually uses fixed 16384 slots)
  • CDNs: route requests to nearest/least-loaded edge node
  • Load balancers: session affinity without lookup tables
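A ring with virtual nodes fits in a short class; the `ConsistentHashRing` name, MD5 hashing, and vnode count are illustrative choices, not any particular system's implementation.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Hash ring with virtual nodes; a key maps to the first node clockwise."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key):
        h = _hash(key)
        idx = bisect.bisect(self._ring, (h, ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]

ring = ConsistentHashRing(["S1", "S2", "S3"])
before = {f"key{i}": ring.get(f"key{i}") for i in range(1000)}
ring.add("S4")  # only keys landing in S4's arcs remap
after = {k: ring.get(k) for k in before}
moved = sum(1 for k in before if before[k] != after[k])
```

With 4 nodes, roughly a quarter of the keys move, and every moved key moves to S4; compare hash(key) % N, where changing N remaps almost everything.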

Distributed Locking

Approach            | How                                                                  | Trade-offs
Redis SET NX EX     | SET lock_key owner NX EX 30: atomic set-if-not-exists with TTL       | Fast, simple; Redlock algorithm for multi-node safety
Zookeeper           | Create ephemeral sequential znodes; lowest sequence = lock holder    | Strongly consistent; handles crashes via ephemeral nodes
DB Pessimistic Lock | SELECT ... FOR UPDATE on DB row                                      | Simple; slow; doesn't work cross-service
Optimistic Locking  | version column; UPDATE ... WHERE version=expected; retry on conflict | No blocking; great for low contention; retry logic needed
Redis Lock Safety Rule
Only release the lock if YOU own it: check owner == your_id before DEL. Use Lua script for atomic check-and-delete. Auto-expiry prevents deadlock if holder crashes.

Load Balancing Algorithms

Algorithm            | How                                            | Best For
Round Robin          | Rotate through servers in order                | Uniform servers, stateless requests
Weighted Round Robin | Proportional to server capacity                | Heterogeneous hardware
Least Connections    | Route to server with fewest active connections | Long-lived connections, variable request duration
IP Hash              | hash(client_ip) % N: same client → same server | Session stickiness (stateful apps)
Least Response Time  | Lowest latency + fewest connections combined   | Latency-sensitive APIs

L4 vs L7 Load Balancing

  • L4 (Transport): routes by IP + TCP/UDP port. Very fast, no HTTP inspection. AWS NLB.
  • L7 (Application): routes by HTTP path, headers, cookies. Smarter: /api → API servers; /static → CDN. AWS ALB, NGINX.

Bloom Filters

Probabilistic data structure: "definitely NOT in set" OR "probably in set"

  • Space-efficient: represents 1M items in ~1MB vs hash set in 100MB
  • False positives possible, false negatives never
  • Cannot delete (use Counting Bloom Filter for deletion)
Use Case             | How Bloom Filter Helps
URL Shortener        | Check if short code exists before DB query
Cache Penetration    | Block queries for non-existent keys before they hit DB
Email deduplication  | Check if email already sent before DB lookup
Chrome Safe Browsing | Quick check if URL is malicious
Cassandra read path  | Check if key exists in SSTable before disk read
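A tiny Bloom filter makes the "no false negatives, some false positives" property concrete; the `BloomFilter` class, bit count, and use of SHA-256 as the k hash functions are illustrative choices.

```python
import hashlib

class BloomFilter:
    """k hash positions over an m-bit array, stored in one Python int."""

    def __init__(self, m=1024, k=5):
        self.m, self.k = m, k
        self.bits = 0  # integer used as the bit array

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # All k bits set -> "probably in set"; any bit clear -> definitely not
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
for code in ("abc1234", "xYz9876"):  # e.g. existing short codes
    bf.add(code)
```

A short-code lookup would consult `might_contain` first and skip the DB entirely on a definite miss; sizing m and k against the expected item count controls the false-positive rate.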
Networking & Communication

REST vs GraphQL vs gRPC

API Design
Factor           | REST                      | GraphQL                              | gRPC
Protocol         | HTTP/1.1 + JSON           | HTTP/1.1 + JSON                      | HTTP/2 + Protobuf
Schema           | OpenAPI (optional)        | Strongly typed SDL                   | Strongly typed .proto
Over/Under-fetch | Common problem            | Client specifies exact fields needed | N/A (binary, compact)
Performance      | Good                      | Good (1 round trip for complex data) | Excellent (binary, HTTP/2 multiplexing)
Streaming        | SSE / WebSockets          | Subscriptions (WebSocket)            | Bidirectional streaming native
Use case         | Public APIs, simple CRUD  | Mobile BFF, complex nested queries   | Internal microservices, high-throughput
Caching          | Easy (HTTP cache headers) | Complex (POST-based, no HTTP cache)  | Custom only
Tooling          | Excellent (Postman, curl) | Good (GraphiQL, Apollo)              | Good (grpcurl, Buf)

WebSockets vs Long Polling vs SSE

Approach      | How                                                    | Latency         | Direction            | Use Case
Short Polling | Client requests every N seconds                        | Up to N seconds | Pull only            | Simple updates where delay is OK
Long Polling  | Server holds request open until data available         | Low             | Pull only            | Simple push before the WebSockets era
SSE           | Server pushes over persistent HTTP (text/event-stream) | Very low        | Server → Client only | Dashboards, live feeds, notifications
WebSockets    | Full-duplex TCP after HTTP upgrade handshake           | Very low        | Bidirectional        | Chat, gaming, collaborative editors
WebSocket at Scale Challenge
WebSocket servers are stateful (connection on specific server). Solution: Sticky sessions at LB + Redis Pub/Sub to broadcast messages across servers. Any server can publish to Redis; subscriber servers push to their connected users.

CDN & DNS

CDN Flow: User (India) → DNS → CDN Edge Node (Singapore, nearest)
  Cache Hit: return immediately
  Cache Miss: fetch from Origin (US) → cache → return
DNS Resolution: Browser → OS Cache → DNS Resolver (ISP / 8.8.8.8) → Root NS (".") → TLD NS (".com") → Authoritative NS → returns IP → browser connects (TTL cached)

CDN Cache Invalidation

  • Versioned URLs (best practice): /app.v3.2.1.js (new URL = new cache entry)
  • TTL-based: Cache-Control: max-age=86400
  • Purge API: Explicitly invalidate specific paths on deploy (Cloudflare, Fastly)

Reverse Proxy vs Forward Proxy

  • Forward Proxy: client-controlled. Client → Proxy → Internet. VPN, content filtering.
  • Reverse Proxy: server-controlled. Internet → Proxy → Servers. Load balancing, SSL termination, hiding server topology.
Authentication & Security

JWT vs Session Tokens

JWT (Stateless): Header.Payload.Signature
  Header: {"alg":"HS256","typ":"JWT"}
  Payload: {"sub":"user123","roles":["admin"],"exp":1234567890}
  Signature: HMAC-SHA256(base64(header) + "." + base64(payload), secret)
  Server validates the signature ONLY: no DB lookup
  Cannot revoke before expiry (use a short TTL: 15 min)
  Refresh token (long-lived, DB-backed) to get a new access token
Session (Stateful):
  Client sends: Cookie: session_id=abc123
  Server: SELECT * FROM sessions WHERE id='abc123' (Redis for speed)
  Can revoke instantly (delete the session record)
Factor      | JWT                                   | Session Token
Scalability | Excellent: no shared state            | Needs shared Redis/DB across all servers
Revocation  | Hard: need token blacklist in Redis   | Easy: delete session record
Token Size  | Large (~500 bytes)                    | Small (random 32-byte string)
Best For    | Microservices, stateless APIs, mobile | Traditional web apps, high-security (banking)

OAuth 2.0 Authorization Code Flow

1. User clicks "Login with Google"
2. App redirects: GET /oauth/authorize?client_id=APP_ID&redirect_uri=https://app.com/callback&scope=email+profile&response_type=code&state=random_csrf_token  (state = CSRF protection)
3. User authenticates & grants permission at Google
4. Google redirects: GET https://app.com/callback?code=AUTH_CODE&state=random_csrf_token  (verify state matches)
5. App backend calls: POST /oauth/token {code: AUTH_CODE, client_secret: SECRET, grant_type: authorization_code}
6. Google returns: {access_token, refresh_token, expires_in}
7. App calls: GET /userinfo with Bearer access_token
8. On expiry: POST /oauth/token {grant_type: refresh_token, refresh_token: RT}
Key Security Points
  • state param = CSRF protection
  • PKCE replaces client_secret for mobile/SPA
  • access_token: short-lived (15 min)
  • refresh_token: long-lived (30 days, stored securely server-side)
  • Never expose client_secret in the frontend
System Design Interview Questions
Interview Framework
1. Clarify Requirements (functional + non-functional, scale)
2. Estimate Scale (RPS, storage, bandwidth)
3. High-Level Design (major components)
4. API Design (endpoints, data models)
5. Deep Dive (critical components)
6. Trade-offs (what you chose and why)
Q1

Design a URL Shortener (TinyURL)

Requirements: Shorten URLs, redirect, analytics, expiry. Write: 100M URLs/day. Read: 1B redirects/day (10x writes).
Estimates: 1B reads/day = 12K RPS. 100M writes/day = 1.2K RPS. Storage: 100 bytes * 100M * 365 days * 5 years = ~18TB.
Short Code: Base62 (a-z, A-Z, 0-9). 7 chars = 62^7 ≈ 3.5 trillion unique codes. Use the DB auto-increment ID and convert it to Base62.
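The ID-to-code conversion is a few lines of repeated divmod; the alphabet ordering (digits, then lowercase, then uppercase) is a convention choice, not a standard.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int) -> str:
    """Convert an auto-increment ID into a Base62 short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def from_base62(code: str) -> int:
    """Invert the encoding for short-code -> ID lookups."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Because the mapping is bijective, the service can even skip storing the code and decode the ID straight out of the URL path.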
Write: POST /shorten {long_url, expiry} → ShortenerService → INSERT into MySQL (id, short_code, long_url, expiry) → return https://short.ly/abc1234
Read: GET /abc1234 → Redis cache lookup (short_code → long_url)
  HIT: 302 Redirect → long_url
  MISS: MySQL lookup → populate Redis (TTL 24h) → 302 Redirect
Analytics (async): on each redirect → Kafka event {short_code, timestamp, ip, user_agent} → Analytics Consumer → ClickHouse / DynamoDB for aggregates
Design Decisions
  • Auto-increment + Base62: no collision, simple
  • Redis cache: 99%+ redirects served from cache
  • 302 redirect (not 301): server sees every request → accurate analytics
  • Async analytics: redirect latency unaffected
Trade-offs
  • Auto-increment is guessable (sequential codes)
  • Hash-based codes: possible collisions → need retry logic
  • Pre-generated codes pool: batch-generate offline
  • Bloom filter: check code non-existence before DB query
Q2

Design a Chat System (WhatsApp / Slack)

Requirements: 1:1 and group chat, online presence, message history, delivery/read receipts, media sharing.
Real-time Transport: WebSocket (full-duplex, bidirectional). Long Polling as fallback for restrictive networks.
Message Storage: Cassandra. RowKey = channel_id, clustering by timestamp. Write-heavy, append-only, time-ordered = perfect fit.
Message Flow (Alice → Bob):
1. Alice sends msg via WebSocket → Chat Server A
2. Chat Server A: (a) persist to Cassandra async, (b) PUBLISH to Redis channel "user:bob:msgs"
3. Chat Server B (Bob's connection) subscribes → receives from Redis
4. Chat Server B delivers to Bob via WebSocket
5. Bob's client sends "delivered" ack → Chat Server B → Cassandra
6. Alice receives delivery receipt via WebSocket
Group Chat (fan-out): message → Kafka → fan-out consumer → deliver to each member's inbox. For large groups (10K+): pull model, members fetch on open.
Presence Service: WebSocket connect → SET user:online:{userId} 1 EX 60 (heartbeat every 30s). Heartbeat missed → key expires → user offline.
Key Decisions
  • Redis Pub/Sub connects stateful WebSocket servers
  • Cassandra: ideal for high-write time-series message history
  • Offline inbox: store in DB, deliver on reconnect
  • Media: upload to S3, share URL in message
Challenges
  • Large groups (100K): fan-out on write is too expensive → use pull
  • Message ordering: Cassandra stores by timestamp; tie-break with UUID
  • E2E encryption: client encrypts, server stores ciphertext only
  • Message dedup: idempotency key per message to prevent duplicates on retry
Q3

Design Twitter / News Feed

Core Challenge: Fan-out on Write vs Fan-out on Read

  • Push (Fan-out on Write): When user tweets, push to all followers' Redis feeds. Fast reads. Bad for celebrities (50M followers).
  • Pull (Fan-out on Read): Merge tweets from followed users at read time. Slow reads. DB expensive.
  • Hybrid (Twitter's approach): Push for regular users, Pull for celebrities at read time. Best of both.
Tweet Write: POST /tweet → TweetService → Cassandra (tweets by userId + timestamp) → Kafka: "tweet-created" event
FanoutWorker (async): for each follower (if follower_count < 10K): ZADD feed:{followerId} tweet.timestamp tweet.id. Celebrities skip fan-out: pulled at read time.
Feed Read: GET /feed/{userId} → Feed Service:
  1. ZRANGE feed:{userId} 0 20 → regular-user tweet IDs
  2. For each celebrity followed: fetch their latest 20 tweets directly
  3. Merge all, sort by timestamp, return top 20
Like/RT: Redis INCR likes:{tweetId} → async flush to Cassandra every 5 min
Architecture Decisions
  • Cassandra for tweets: write-heavy, time-series pattern
  • Redis Sorted Set for feed: O(log N) insert, O(1) range read
  • Async fan-out via Kafka: tweet API returns instantly
  • S3 + CDN for media content
Trade-offs
  • Push model: more storage (each tweet copied to N followers' feeds)
  • Pull model: slow read (query N followed users, merge, sort)
  • Celebrity tweets: fan-out to 100M followers takes minutes
  • Trending: sliding window count via Redis sorted sets
Q4

Design YouTube / Video Streaming

Upload: Chunked upload to S3 → Kafka event → Transcoder cluster (FFmpeg) → multiple resolutions (360p/720p/1080p/4K) → output to CDN
Streaming: HLS (HTTP Live Streaming). Video split into ~10s segments (.ts files). Manifest file (.m3u8) lists URLs. Client adapts resolution based on bandwidth (ABR).
Metadata: MySQL for structured data (title, description, tags, owner). Redis for view counts. Elasticsearch for search.
Upload Flow: Client → chunked multipart upload (resumable) → S3 (raw) → S3 event → Kafka: "video-uploaded"
Transcoder workers (GPU instances): input raw.mp4 → FFmpeg → [360p, 720p, 1080p, 4K] + thumbnails → upload outputs → CDN origin → CDN distributes globally
Metadata service: UPDATE videos SET status='published' WHERE id=...
Playback: GET /video/{id}/manifest → return .m3u8 (HLS manifest file) → client parses → fetches segments from CDN edge
Client detects bandwidth drop → switches to lower-quality segment URL
Key Decisions
  • CDN essential: video = 80%+ of internet traffic
  • HLS/DASH: industry standard, all browsers support
  • Async transcoding: instant upload response, background processing
  • Separate read/write paths (CQRS-like for metadata)
Challenges
  • Storage: 1 video → 5 qualities + thumbnail + captions ≈ 10x raw size
  • Transcoding cost: GPU instances expensive; queue management critical
  • View count: Redis INCR → async flush to DB (approximate but fast)
  • Copyright: Content-ID fingerprinting pipeline runs in background
Q5

Design a Distributed Cache (like Redis Cluster)

Data Distribution: Consistent hashing. hash(key) maps key to server. Virtual nodes for uniform distribution.
Replication: 1 primary + 2 replicas per shard. Reads go to replicas (eventual consistency). Writes always to primary.
Eviction: LRU per shard. Track access time. When memory full, evict least-recently-used entries first.
Persistence: RDB (point-in-time snapshot, compact) + AOF (append-only log, durable). Use both for full safety.
Design Decisions
  • In-memory: microsecond latency vs millisecond for disk
  • Single-threaded event loop: no lock contention
  • Pipelining: batch multiple commands in one round-trip
  • Lua scripting: atomic multi-command operations
Trade-offs
  • Memory limit: working set must fit in RAM (expensive)
  • Hot key: shard replicas, local L1 cache in application
  • Replication lag: replica may briefly serve stale reads
  • Split-brain: use Redis Sentinel or Cluster for HA failover
Q6

Design Search Autocomplete (Typeahead)

Requirements: Top-5 suggestions per prefix, under 100ms latency. 10M users, 10 queries each = 100M queries/day = ~1,200 RPS.
Data Structure: Trie with frequency at each node. Or Elasticsearch prefix query with score boosting.
Ranking: Query frequency + recency + personalization weight.
Query (read, latency-critical): GET /autocomplete?q=data+cat → CDN (cache popular prefixes like "data", "date", "da") → Redis: ZREVRANGE prefix:data_cat 0 4 → [top 5 sorted by score] → Trie Service (in-memory, sharded by first char) → Elasticsearch fallback
Update (async): user search logs → Kafka → Aggregation Service (5-min windows) → update term frequency → rebuild top-K per prefix → push to Redis ZSet (ZADD prefix:data_cat freq "data catalog")
Optimizations
  • Trie sharded by first 2 chars (26^2 = 676 shards)
  • Redis ZSet caches top-5 per prefix (1-hour TTL)
  • CDN caches most common prefixes globally
  • Client debounce: only query after 300ms idle
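The Trie Service can be sketched as a frequency-weighted trie with a top-k prefix query; the terms, counts, and the `top_k` method name are all made up for illustration (production systems precompute top-k per node instead of collecting at query time).

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.freq = 0  # > 0 marks the end of a stored term

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term: str, freq: int):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.freq = freq

    def top_k(self, prefix: str, k: int = 5):
        # Walk to the prefix node, then collect and rank its subtree.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []

        def collect(n, path):
            if n.freq:
                results.append((path, n.freq))
            for ch, child in n.children.items():
                collect(child, path + ch)

        collect(node, prefix)
        results.sort(key=lambda t: -t[1])
        return [term for term, _ in results[:k]]

trie = Trie()
for term, freq in [("data catalog", 90), ("data quality", 70),
                   ("database", 50), ("date", 10)]:
    trie.insert(term, freq)
```

Caching each prefix's top-k list (the Redis ZSet layer above) is what removes this subtree walk from the hot path.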
Challenges
  • Trie memory: 5M terms at avg 8 chars = large in-memory footprint
  • Real-time updates to the trie are expensive → batch updates every 5 min
  • Personalization: blend global score + user history with weights
  • Multi-language: Unicode = more complex trie / different sharding
Q7

Design a Notification System

Event Sources: User Action / Business Alert / Marketing Campaign → Notification Service API (validates payload, applies user preferences, deduplication check) → async → Kafka topics (one per channel):
  • notifications-email → Email Worker → SendGrid / SES (retry + DLQ)
  • notifications-sms → SMS Worker → Twilio (retry + DLQ)
  • notifications-push → Push Worker → FCM / APNs (retry + DLQ)
  • notifications-inapp → InApp Worker → WebSocket / DB
Notification DB:
  • Delivery status (sent, delivered, failed, read)
  • User preferences (email opt-in, push enabled, quiet hours 10pm-8am)
Key Decisions
  • Idempotency key: prevents duplicate sends across retries
  • User preferences service: opt-outs, quiet hours, frequency caps
  • Priority lanes: critical alerts bypass normal queue
  • Rate limit: max 3 emails/hour per user to prevent spam
Challenges
  • Deduplication: same event published twice → idempotency key check before send
  • Marketing blast to 10M: partition workers, throttle to ~100K/min
  • Delivery tracking: webhooks from SendGrid/Twilio update status
  • DLQ monitoring: alert on-call when DLQ grows beyond threshold
Q8

Design Uber / Ride-Hailing

Location Tracking: Drivers send GPS every 5s. GEOADD in Redis (O(log N) insert). GEORADIUS for nearest-driver query.
Geohashing: Encode lat/lng as a string. Nearby locations share prefix. Efficient area queries without lat/lng math.
Matching: Find N nearest available drivers → send offers simultaneously → first accept wins → create Trip (MySQL, ACID).
Driver Location: Driver app → WebSocket → Location Service → GEOADD drivers {lng, lat, driver_id}
Ride Request: User → RideRequest Service:
  1. GEORADIUS(user_lat, user_lng, radius=5km) → [driverA, driverB, driverC]
  2. Filter: available + rating > 4.5 + correct vehicle type
  3. Offer sent to top 3 simultaneously (first accept wins)
  4. Accepted: INSERT INTO trips (atomic, MySQL) → lock driver state
Real-time Trip Tracking: Driver GPS every 3s → Kafka → TripTrackingService → WebSocket → User
Key Decisions
  • Redis Geo: O(log N) radius search, in-memory speed
  • MySQL for trips: ACID needed for payment integrity
  • Cassandra for location history: write-heavy time-series
  • Surge pricing: pre-computed by geohash zone demand
Challenges
  • Simultaneous accept: optimistic lock on driver status (available → in-trip)
  • GPS accuracy: Kalman filter to smooth noisy GPS readings
  • Driver state machine: available → offered → in-ride → available
  • ETA: integrate OSRM / Google Maps API for routing and ETA
Q9

Design a Data Catalog (like Alation)

Core Features: Metadata ingestion from databases/BI tools, search, lineage tracking, data quality scoring, collaboration (stewardship, annotations).
Metadata Storage: Graph DB (Neo4j) for lineage (A → B → C relationships). Elasticsearch for full-text search. PostgreSQL for structured catalog data.
Connectors: Pull metadata via JDBC/REST from source systems (Snowflake, BigQuery, Tableau). Schedule crawls. Detect schema drift.
Metadata Ingestion: Connector Framework → Source System (Snowflake/BigQuery/Tableau)
  Extract: tables, columns, types, stats, query history
  Transform: normalize to internal metadata model
  Load: PostgreSQL (structured catalog: tables, columns, owners)
        Elasticsearch (search index: name, description, tags)
        Neo4j (lineage graph: table derived from upstream table)
        Event Store (schema change history for audit)
Search: user query → Elasticsearch (full-text + faceted filtering). Boost: verified assets, frequently accessed, recently updated.
Lineage: MATCH (a)-[:DERIVED_FROM*]->(b) WHERE a.name='revenue_report' → show upstream → impact analysis for data quality issues.
Key Decisions
  • Graph DB: lineage relationships are inherently graph-structured
  • Elasticsearch: fuzzy search, facets, ranking by relevance
  • Event Sourcing: full audit trail of metadata changes
  • CQRS: write-optimized ingest path, read-optimized search path
Challenges
  • Schema drift: source changes must be detected and affected assets re-indexed
  • Scale: 100K+ tables, millions of columns = large search index
  • Trust/quality scoring: aggregate from multiple signals (freshness, usage, docs)
  • Real-time lineage: parse SQL query logs to extract column-level lineage