System Design
Fundamentals, architecture patterns, and real interview questions with full answers and trade-offs.
Core Fundamentals
CAP Theorem
Pick any 2: Consistency, Availability, Partition Tolerance.
- Consistency: every read returns the most recent write
- Availability: every request gets a non-error response
- Partition Tolerance: the system works despite network partitions
| System | Type |
|---|---|
| PostgreSQL, MySQL | CA (single node) |
| Cassandra, DynamoDB | AP (eventual) |
| HBase, Zookeeper, etcd | CP (may reject) |
| MongoDB (default) | CP (tunable) |
| Couchbase | AP or CP (configurable) |
Reality
P is always required in distributed systems. The real choice is C vs A during a partition.

ACID vs BASE
| ACID (SQL) | BASE (NoSQL) |
|---|---|
| Atomicity: all or nothing | Basically Available |
| Consistency: valid state always | Soft State |
| Isolation: no interference | Eventually Consistent |
| Durability: committed = safe | Prioritizes availability |
Use ACID: banking, bookings, inventory. Use BASE: social feeds, analytics, caching, data catalog metadata.
Scalability Vertical vs Horizontal + Strategies
Vertical Scaling (Scale Up)
- Add CPU/RAM to existing server
- Simple no application changes
- Single point of failure
- Hard hardware limit, expensive
Horizontal Scaling (Scale Out)
- Add more commodity machines
- Theoretically unlimited
- Requires stateless services + LB
- Complex: distributed state, consistency
Key Scaling Strategies
- Stateless Services: Store state in Redis/DB. Any instance handles any request.
- Read Replicas: primary for writes, replicas for reads (reads are often ~80% of traffic)
- Sharding: Partition data across nodes by shard key
- CDN: Cache static assets globally, reduce origin server load drastically
- Async Processing: queue slow work (emails, reports) → instant API response
- Denormalization: pre-compute joins → faster reads at the cost of storage + write complexity
PACELC Theorem (CAP Extension)
If (P)artition: choose (A) or (C). Else (E): choose (L)atency or (C)onsistency
- CAP only models partition scenarios. PACELC says: even normally, there is a latency vs consistency tradeoff.
- Replication = lower latency OR stronger consistency, never both.
| System | Partition choice | Else choice |
|---|---|---|
| DynamoDB | Availability | Latency (low) |
| Cassandra | Availability | Latency (low) |
| Google Spanner | Consistency | Consistency (strong) |
| MySQL | Consistency | Consistency |
Databases
Database Indexing B-Tree, Hash, Composite
Must Know
| Type | Structure | Best For | Not For |
|---|---|---|---|
| B-Tree | Balanced tree | Range queries, ORDER BY, <, >, BETWEEN | High-cardinality exact lookups (hash faster) |
| Hash Index | Hash table | Exact equality (=), O(1) | Range queries, sorting |
| Composite | B-Tree on N cols | Multi-column queries (leftmost prefix rule) | Queries not starting with leftmost column |
| Covering Index | Index has all queried cols | Satisfy query from index alone (no heap fetch) | Wide tables with many columns |
| Full-Text | Inverted index | LIKE '%word%', text search | Exact match (regular index faster) |
| Partial Index | Index with WHERE clause | WHERE status='active' index only active rows | When all rows need indexing |
Index Trade-offs
- Reads faster: SELECT queries benefit dramatically (O(log n) vs O(n))
- Writes slower: every INSERT/UPDATE/DELETE updates all indexes
- Storage cost: indexes use ~20-30% extra disk space each
- Cardinality: index on gender (2 values) = nearly useless; index on user_id = excellent
Leftmost Prefix Rule
Composite index (a, b, c) helps queries on: (a), (a,b), (a,b,c). Does NOT help: (b), (c), (b,c) alone.

SQL vs NoSQL When to Use What
| Factor | SQL (Relational) | NoSQL |
|---|---|---|
| Schema | Fixed, enforced (DDL) | Flexible, schema-on-read |
| Relationships | Foreign keys, JOINs | Denormalized, embedded |
| Horizontal Scaling | Difficult (sharding complex) | Native (Cassandra, Mongo) |
| Transactions | Full ACID | Limited / eventual |
| Query Language | SQL (very expressive) | API-specific, limited joins |
NoSQL Types & Use Cases
- Key-Value (Redis, DynamoDB): sessions, cache, feature flags
- Document (MongoDB): product catalogs, CMS, user profiles
- Wide-Column (Cassandra, HBase): IoT, time-series, write-heavy analytics
- Graph (Neo4j): social network, knowledge graph, data lineage
- Search (Elasticsearch): full-text search, log analytics
- Time-Series (InfluxDB, TimescaleDB): metrics, monitoring
Choose SQL When
- Complex queries with multiple JOINs
- ACID transactions are critical (money, inventory)
- Schema is stable and well-defined
- Rich aggregation / reporting needs
- E.g.: banking, ERP, booking, HR systems
Choose NoSQL When
- Schema evolves rapidly or write throughput is extreme
Database Replication Strategies
Single Leader (Primary-Replica):
Leader (writes) → async/sync replication → Replica 1, 2, 3
Reads can go to any replica (scale reads 3x)
Multi-Leader:
Leader A (DC West) ↔ replication ↔ Leader B (DC East)
Both accept writes; conflict resolution required
Leaderless (Dynamo-style):
Client writes to N nodes; write succeeds once W ack (quorum W > N/2)
Client reads from R nodes; R + W > N ⇒ reads see the latest write
| Strategy | Pros | Cons |
|---|---|---|
| Single Leader | Simple, no conflicts, ordered writes | Leader bottleneck, failover needed |
| Multi-Leader | Write availability in multiple DCs | Conflict resolution required |
| Leaderless | High availability, no SPOF | Complex quorums, eventual consistency |
Sync vs Async Replication
- Synchronous: write confirmed only when all replicas ack; safe but slow
- Asynchronous: write confirmed on the leader alone; fast, but potential data loss on leader crash
- Semi-sync: at least 1 replica acks; balances safety and speed (MySQL offers this via its semi-sync plugin; async is its default)
Database Sharding Strategies & Pitfalls
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Range Sharding | A-M → shard1, N-Z → shard2 | Simple range queries | Hotspots (all celebrities on one shard) |
| Hash Sharding | hash(key) % N | Even distribution | Rebalancing when N changes remaps many keys |
| Consistent Hashing | Hash ring + virtual nodes | Minimal remapping on add/remove | More complex, virtual nodes needed for uniformity |
| Directory Sharding | Lookup table: key → shard | Flexible, can move data | Lookup service = SPOF, extra hop |
| Geo Sharding | US shard1, EU shard2 | Data locality, compliance (GDPR) | Cross-region queries expensive |
Sharding Pitfalls
- Cross-shard JOINs = scatter-gather (expensive)
- Distributed transactions across shards need 2PC
- Rebalancing requires data migration
- Choose the shard key carefully; it is hard to change later
- Hot shard = one shard receives disproportionate traffic

Caching
Caching Strategies All Patterns
Critical
| Strategy | Flow | Use When | Downside |
|---|---|---|---|
| Cache-Aside (Lazy) | Check cache → miss → load from DB → store in cache | Read-heavy, tolerate some stale data | Cache miss penalty, thundering herd on cold start |
| Read-Through | Cache handles DB fetch transparently on miss | Simplified app code, uniform read path | Cache must understand data model |
| Write-Through | Write to cache AND DB synchronously | Strong consistency needed | Write latency doubled |
| Write-Behind (Write-Back) | Write to cache; async flush to DB | Write-heavy, latency-critical | Data loss if cache crashes before flush |
| Write-Around | Write directly to DB, skip cache | Write-once, read-rarely data (logs) | Cache miss on first read after write |
| Refresh-Ahead | Proactively refresh cache before TTL expires | Predictable, hot data access patterns | Wasted refresh if data not re-accessed |
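As a concrete illustration of cache-aside, a minimal Python sketch (redis-py assumed; `load_user_from_db` is a hypothetical stand-in for a real DB query):

```python
import json
import redis

r = redis.Redis()

def load_user_from_db(user_id: str) -> dict:
    # Hypothetical stand-in for a real database query
    return {"id": user_id, "name": "example"}

def get_user(user_id: str) -> dict:
    """Cache-aside: check cache first, fall back to DB, then populate cache."""
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit
    user = load_user_from_db(user_id)      # cache miss: hit the DB
    r.set(key, json.dumps(user), ex=300)   # store with a TTL to bound staleness
    return user
```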
Eviction Policies
- LRU (Least Recently Used): evict least recently accessed. Best for most workloads.
- LFU (Least Frequently Used): evict least accessed overall. Good when some items are permanently hot.
- TTL (Time-To-Live): expire after fixed duration. Prevents stale data.
- FIFO: simple but poor hit rate. Rarely used in production.
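A toy LRU cache to make the eviction policy concrete (plain Python; the capacity and API are illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)   # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the LRU entry
```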
Redis Data Structures for System Design
- String: counter (INCR), session, feature flag, simple KV
- Hash: user profile fields (HSET/HGET), object properties
- List: activity feed, queue (LPUSH/RPOP), recent N items
- Set: unique visitors, followers, tags (SADD/SINTER for mutual friends)
- Sorted Set (ZSet): leaderboard (ZADD/ZRANGE), rate limiting, priority queue
- Pub/Sub: real-time notifications, chat cross-server broadcast
- Streams: durable event log (lightweight Kafka replacement)
- Geo: GEOADD/GEORADIUS for nearest driver, nearby stores
Cache Problems & Solutions
| Problem | Cause | Solution |
|---|---|---|
| Cache Stampede (Thundering Herd) | Many concurrent misses hit DB simultaneously after cache expiry | Mutex/distributed lock, probabilistic early expiry, background refresh, never-expire + async update |
| Cache Penetration | Repeated queries for non-existent keys bypass cache, hammer DB | Cache null values with short TTL; Bloom Filter to pre-check existence |
| Cache Avalanche | Mass cache expiry at the same time → DB overwhelmed | Randomize TTL (base_ttl + random(300s)); circuit breaker; multi-layer cache |
| Hot Key Problem | Single key (celebrity post) gets extreme traffic on one shard | Replicate hot key to multiple shards; local in-process L1 cache; rate limit |
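The mutex defense from the stampede row above, sketched with redis-py (lock TTL, retry loop, and key names are illustrative choices):

```python
import json
import time
import redis

r = redis.Redis()

def get_with_stampede_guard(key: str, loader, ttl: int = 300):
    """Cache-aside with a mutex so only one caller rebuilds an expired entry."""
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    # Try to become the single rebuilder; the lock auto-expires to avoid deadlock.
    if r.set(f"lock:{key}", "1", nx=True, ex=10):
        try:
            value = loader()                       # only this caller hits the DB
            r.set(key, json.dumps(value), ex=ttl)
            return value
        finally:
            r.delete(f"lock:{key}")
    # Lost the race: briefly wait for the winner to populate the cache.
    for _ in range(50):
        time.sleep(0.1)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    return loader()  # fallback if the rebuilder failed
```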
Microservices & Architecture Patterns
Monolith vs Microservices
Monolith Pros
- Simple to develop, test, deploy
- No distributed system complexity
- No network latency between components
- ACID transactions are trivial
- Single deployable artifact
Monolith Cons
- Must scale entire app even if only one part needs it
- One tech stack for everything
- Large codebase = slow builds, heavy cognitive load
- All-or-nothing deploys → risky releases
Microservices Pros
- Independent deploy, scale, and fault isolation
- Technology flexibility per service
- Small teams own individual services (Conway's Law)
- One service crash doesn't bring down the whole system
Microservices Cons
- Distributed systems complexity (latency, partial failures)
- Cross-service transactions require Saga / 2PC
- More infrastructure: service discovery, API gateway, tracing
- Debugging is harder; need distributed tracing (Jaeger, Zipkin)
When to Choose Microservices
When different parts have drastically different scaling needs, different teams need independent release cycles, or different components need different technology stacks. NOT for early-stage products: premature optimization kills velocity.

API Gateway Pattern
Client (Mobile / Web / Third-party)
API Gateway:
- Auth (JWT validation, one place)
- Rate Limiting (per client / IP)
- SSL Termination
- Request Routing / Load Balancing
- Request / Response Transformation
- Logging, Distributed Tracing
- Circuit Breaking
→ User Service | Order Service | Catalog Service
Tools
Kong, AWS API Gateway, NGINX, Traefik, Envoy, Spring Cloud Gateway

BFF Pattern (Backend for Frontend)
- Separate API gateway per client type (mobile BFF, web BFF)
- Each BFF aggregates exactly the data that client needs
- Avoids over-fetching on mobile (bandwidth matters) vs web
CQRS Command Query Responsibility Segregation
Architecture Pattern
Commands (writes) → Command Handler → Write Model (normalized SQL)
Queries (reads) → Query Handler → Read Model (denormalized view, Elasticsearch, Redis)
Events published → Kafka → consumers update the read model async
Example: Orders
POST /orders → write to MySQL → publish OrderPlaced → Kafka
Kafka consumer → builds summary view in Redis / Elasticsearch
GET /orders/summary → reads from the fast read store
Pros
- Optimize reads and writes independently
- Scale read and write services separately
- Read model tailored per use case
- Natural fit with Event Sourcing
Cons
- Eventual consistency between read/write models
- Significantly more complex architecture
- Duplicate data, sync must be maintained
- Overkill for simple CRUD apps
Event Sourcing
Architecture Pattern
Core Idea: Store sequence of events, not current state
- Every state change = immutable event appended to event log
- Current state = replay all events from the beginning
- Like Git for your data: full history, time travel, audit trail
- Snapshots: periodically save state to avoid full replay from the beginning
Traditional: Account { balance: $150 }
Event Sourcing: Event Log (append-only):
1. AccountOpened {balance: $0}
2. Deposited {amount: $200}
3. Withdrawn {amount: $50}
Current state: replay all events → $0 + $200 - $50 = $150
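A minimal replay of that example in Python (event and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str      # "AccountOpened" | "Deposited" | "Withdrawn"
    amount: int

def replay(events: list[Event]) -> int:
    """Current balance = fold over the immutable event log."""
    balance = 0
    for e in events:
        if e.kind == "AccountOpened":
            balance = e.amount
        elif e.kind == "Deposited":
            balance += e.amount
        elif e.kind == "Withdrawn":
            balance -= e.amount
    return balance

log = [Event("AccountOpened", 0), Event("Deposited", 200), Event("Withdrawn", 50)]
assert replay(log) == 150  # matches the example: $0 + $200 - $50
```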
Relevant to Alation
Data lineage, audit trail, metadata versioning: these are core Alation use cases where event sourcing shines. Know this well.

Saga Pattern Distributed Transactions
Distributed
Problem
How do you maintain consistency across multiple microservices without a distributed ACID transaction (2PC)?

Two Saga Types
- Choreography: Services react to domain events. No central coordinator. Decoupled but hard to visualize flow.
- Orchestration: Central Saga Orchestrator sends commands to each service and handles failures. Easier to trace, single place for flow logic.
Choreography (Event-driven):
OrderPlaced → InventoryReserved → PaymentProcessed → Shipped
(on failure:)
PaymentFailed → InventoryReleased → OrderCancelled
Orchestration:
SagaOrchestrator → reserve-inventory → process-payment → ship → done
(on any failure → compensate backwards)
Outbox Pattern Reliable Event Publishing
Problem
If you write to the DB and then publish to Kafka, what if the Kafka publish fails? You have inconsistent state.

Solution: Transactional Outbox
DB Transaction (atomic):
1. INSERT INTO orders ...
2. INSERT INTO outbox_events ...
Outbox Relay (Debezium CDC or polling) → Kafka → downstream consumers
How It Works
- Write business record AND outbox event in same DB transaction (atomic)
- Separate relay process reads outbox table and publishes to Kafka
- Relay uses CDC (Debezium reads WAL) or polling loop
- Guarantees at-least-once delivery; consumer must be idempotent
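A sketch of the atomic write using sqlite3 purely for illustration (table and topic names are hypothetical; a production relay such as Debezium would tail the WAL rather than poll):

```python
import json
import sqlite3

conn = sqlite3.connect("shop.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS outbox_events "
             "(id INTEGER PRIMARY KEY, topic TEXT, payload TEXT, published INTEGER DEFAULT 0)")
conn.commit()

def place_order(order: dict):
    """Business row and outbox event commit in the SAME transaction."""
    with conn:  # atomic: both inserts commit or neither does
        conn.execute("INSERT INTO orders (payload) VALUES (?)", (json.dumps(order),))
        conn.execute("INSERT INTO outbox_events (topic, payload) VALUES (?, ?)",
                     ("order-placed", json.dumps(order)))

def relay_once(publish):
    """Polling relay: publish unpublished events, then mark them (at-least-once)."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox_events WHERE published = 0").fetchall()
    for event_id, topic, payload in rows:
        publish(topic, payload)  # e.g., a Kafka producer send
        with conn:
            conn.execute("UPDATE outbox_events SET published = 1 WHERE id = ?", (event_id,))
```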
Resilience Patterns
Circuit Breaker Pattern
Must Know
Problem
If Service B is slow, calls from A to B pile up, threads exhaust, and A also fails: a cascading failure across the system.

CLOSED (normal): all requests flow; failure rate exceeds threshold → OPEN
OPEN (failing): fail fast (no requests); timeout expires → HALF-OPEN
HALF-OPEN (testing): allow 1-2 test requests; fail → OPEN, succeed → CLOSED
Configuration
- Failure threshold: 50% failure rate in a 10s window → OPEN
- Open duration: 30s (fail fast, don't retry hammering broken service)
- Success threshold to close: 3 consecutive successes in HALF-OPEN
- Fallback: return cached response, default value, or degrade gracefully
- Libraries: Resilience4j (Java), Hystrix (Netflix, deprecated), Polly (.NET)
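A minimal circuit-breaker sketch (for brevity it trips on consecutive failures rather than the 50%-in-10s window above, and ignores thread safety):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after repeated failures; OPEN -> HALF-OPEN after a cooldown."""
    def __init__(self, failure_threshold=5, open_seconds=30, close_after=3):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.close_after = close_after
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, fallback=None):
        if self.state == "OPEN":
            if time.time() - self.opened_at >= self.open_seconds:
                self.state = "HALF-OPEN"       # allow a test request through
            else:
                return fallback                # fail fast while OPEN
        try:
            result = fn()
        except Exception:
            self.failures += 1
            self.successes = 0
            if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "OPEN", time.time()
            return fallback
        self.successes += 1
        self.failures = 0
        if self.state == "HALF-OPEN" and self.successes >= self.close_after:
            self.state = "CLOSED"
        return result
```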
Retry, Timeout, Bulkhead, Rate Limiting
| Pattern | What It Does | Key Config |
|---|---|---|
| Retry + Backoff | Re-attempt failed requests with exponential backoff + jitter | max_retries=3, delay=2^n * 100ms + random(100ms) |
| Timeout | Fail fast if response not received in time limit | connect_timeout=3s, read_timeout=10s |
| Bulkhead | Isolate thread/connection pools per downstream service | maxConcurrentCalls=10 per downstream |
| Rate Limiting | Limit incoming requests per client/IP/API key | 100 req/min per API key |
| Fallback | Return degraded response instead of error | Return cached data, empty list, default value |
Exponential Backoff + Jitter Formula
delay = min(cap, base * 2^attempt) + random(0, jitter_max). Without jitter, all retriers retry simultaneously, creating another thundering herd on the recovering service; a minimal sketch follows.
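The formula as a retry helper (parameter defaults are illustrative):

```python
import random
import time

def retry_with_backoff(fn, max_retries=3, base=0.1, cap=10.0, jitter_max=0.1):
    """Exponential backoff with jitter: delay = min(cap, base * 2^attempt) + jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: propagate the failure
            delay = min(cap, base * 2 ** attempt) + random.uniform(0, jitter_max)
            time.sleep(delay)  # jitter spreads retriers apart in time
```

Rate Limiting Algorithms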
| Algorithm | How | Burst | Memory | Notes |
|---|---|---|---|---|
| Token Bucket | Bucket refills at rate R, each request uses 1 token | Yes (burst up to bucket size) | O(1) | Most common; allows burst |
| Leaky Bucket | Queue requests, process at fixed rate | No (smooth output) | O(queue) | Smooth traffic shaping |
| Fixed Window | Count per fixed minute/hour window | Yes (edge burst issue) | O(1) | Simple; 2x burst possible at boundary |
| Sliding Window Log | Store timestamp per request, count within window | Accurate | O(requests) | Accurate but memory-intensive |
| Sliding Window Counter | Blend current + previous window proportionally | Approximate | O(1) | Good balance; Redis-friendly |
Redis Rate Limiting (Sliding Window Counter)
- Key: rate:{user_id}:{current_minute}
- INCR + EXPIRE (atomic with Lua or pipeline)
- Blend: count = current_bucket + previous_bucket * (1 - elapsed/window)
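Putting the key scheme and blend formula together, a sliding-window-counter sketch with redis-py (limit and window are illustrative; a Lua script would make the whole check atomic):

```python
import time
import redis

r = redis.Redis()
WINDOW = 60  # seconds

def allow_request(user_id: str, limit: int = 100) -> bool:
    """Sliding window counter: blend the current and previous minute buckets."""
    now = time.time()
    current_minute = int(now // WINDOW)
    cur_key = f"rate:{user_id}:{current_minute}"
    prev_key = f"rate:{user_id}:{current_minute - 1}"

    pipe = r.pipeline()
    pipe.incr(cur_key)
    pipe.expire(cur_key, WINDOW * 2)   # bucket must survive to serve as "previous"
    pipe.get(prev_key)
    current, _, previous = pipe.execute()

    elapsed = (now % WINDOW) / WINDOW  # fraction of current window elapsed
    blended = int(current) + int(previous or 0) * (1 - elapsed)
    return blended <= limit
```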
Distributed Systems Concepts
Consistent Hashing
Hash Ring (0 to 2^32 - 1), wrapping at 0 (= 2^32). Servers hashed onto the ring, e.g. S1 @ 100, S2 @ 200, S3 @ 300, S4 @ 350.
Key lookup: hash(key) % 2^32 → walk clockwise to the nearest server
Adding S5 between S1 and S2: only keys in [hash(S1), hash(S5)] remap
Removing S2: only S2's keys move to S3
Virtual Nodes: S1 maps to 150 positions on the ring (S1-vn1, S1-vn2, ...)
→ much more uniform distribution, even with heterogeneous hardware
Used In
- Cassandra, DynamoDB: partition data across nodes
- Redis Cluster: slot assignment (actually uses fixed 16384 slots)
- CDNs: route requests to nearest/least-loaded edge node
- Load balancers: session affinity without lookup tables
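A compact ring with virtual nodes (hash function and vnode count are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; a key maps to the first server clockwise."""
    def __init__(self, servers, vnodes=150):
        self.vnodes = vnodes
        self.ring = []            # sorted list of (position, server)
        for s in servers:
            self.add(s)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server: str):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{server}-vn{i}"), server))

    def remove(self, server: str):
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def lookup(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect_right(self.ring, (h, "")) % len(self.ring)  # clockwise walk
        return self.ring[idx][1]

ring = ConsistentHashRing(["S1", "S2", "S3", "S4"])
ring.add("S5")   # only the keys between S5 and its ring predecessor remap
print(ring.lookup("user:42"))
```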
Distributed Locking
| Approach | How | Trade-offs |
|---|---|---|
| Redis SET NX EX | SET lock_key owner NX EX 30: atomic set-if-not-exists with TTL | Fast, simple; Redlock algorithm for multi-node safety |
| Zookeeper | Create ephemeral sequential znodes; lowest sequence = lock holder | Strongly consistent; handles crashes via ephemeral nodes |
| DB Pessimistic Lock | SELECT ... FOR UPDATE on DB row | Simple; slow; doesn't work cross-service |
| Optimistic Locking | version column; UPDATE WHERE version=expected; retry on conflict | No blocking; great for low contention; retry logic needed |
Redis Lock Safety Rule
Only release the lock if YOU own it: check owner == your_id before DEL. Use a Lua script for atomic check-and-delete. Auto-expiry prevents deadlock if the holder crashes.
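A sketch of acquire/release following that rule, using redis-py with a Lua script for the atomic check-and-delete (the TTL is illustrative):

```python
import uuid
import redis

r = redis.Redis()

# Atomic check-and-delete: only the current owner may release the lock.
RELEASE = r.register_script("""
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0
""")

def acquire(lock_key: str, ttl_seconds: int = 30):
    """SET key owner NX EX ttl: set-if-not-exists with auto-expiry."""
    owner = str(uuid.uuid4())
    return owner if r.set(lock_key, owner, nx=True, ex=ttl_seconds) else None

def release(lock_key: str, owner: str) -> bool:
    return RELEASE(keys=[lock_key], args=[owner]) == 1
```

Load Balancing Algorithms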
| Algorithm | How | Best For |
|---|---|---|
| Round Robin | Rotate through servers in order | Uniform servers, stateless requests |
| Weighted Round Robin | Proportional to server capacity | Heterogeneous hardware |
| Least Connections | Route to server with fewest active connections | Long-lived connections, variable request duration |
| IP Hash | hash(client_ip) % N → same client hits same server | Session stickiness (stateful apps) |
| Least Response Time | Lowest latency + fewest connections combined | Latency-sensitive APIs |
L4 vs L7 Load Balancing
- L4 (Transport): routes by IP + TCP/UDP port. Very fast, no HTTP inspection. AWS NLB.
- L7 (Application): routes by HTTP path, headers, cookies. Smarter: /api → API servers; /static → CDN. AWS ALB, NGINX.
Bloom Filters
Probabilistic data structure: "definitely NOT in set" OR "probably in set"
- Space-efficient: represents 1M items in ~1MB vs hash set in 100MB
- False positives possible, false negatives never
- Cannot delete (use Counting Bloom Filter for deletion)
| Use Case | How Bloom Filter Helps |
|---|---|
| URL Shortener | Check if short code exists before DB query |
| Cache Penetration | Block queries for non-existent keys before they hit DB |
| Email deduplication | Check if email already sent before DB lookup |
| Chrome Safe Browsing | Quick check if URL is malicious |
| Cassandra read path | Check if key exists in SSTable before disk read |
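A from-scratch sketch using a Python int as the bit array (sizes and hash count are illustrative; real systems use tuned libraries or Redis modules):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: 'definitely not present' or 'probably present'."""
    def __init__(self, size_bits: int = 8 * 1024 * 1024, num_hashes: int = 7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("abc1234")
assert bf.might_contain("abc1234")   # "probably in set"
# might_contain() on an unseen key is almost always False: never a false negative
```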
Networking & Communication
REST vs GraphQL vs gRPC
API Design
| Factor | REST | GraphQL | gRPC |
|---|---|---|---|
| Protocol | HTTP/1.1 + JSON | HTTP/1.1 + JSON | HTTP/2 + Protobuf |
| Schema | OpenAPI (optional) | Strongly typed SDL | Strongly typed .proto |
| Over/Under-fetch | Common problem | Client specifies exact fields needed | N/A (binary, compact) |
| Performance | Good | Good (1 round trip for complex data) | Excellent (binary, HTTP/2 multiplexing) |
| Streaming | SSE / WebSockets | Subscriptions (WebSocket) | Bidirectional streaming native |
| Use case | Public APIs, simple CRUD | Mobile BFF, complex nested queries | Internal microservices, high-throughput |
| Caching | Easy (HTTP cache headers) | Complex (POST-based, no HTTP cache) | Custom only |
| Tooling | Excellent (Postman, curl) | Good (GraphiQL, Apollo) | Good (grpcurl, Buf) |
WebSockets vs Long Polling vs SSE
| Approach | How | Latency | Direction | Use Case |
|---|---|---|---|---|
| Short Polling | Client requests every N seconds | Up to N seconds | Pull only | Simple updates where delay is OK |
| Long Polling | Server holds request open until data available | Low | Pull only | Simple push before WebSockets era |
| SSE | Server pushes over persistent HTTP (text/event-stream) | Very low | Server → Client only | Dashboards, live feeds, notifications |
| WebSockets | Full-duplex TCP after HTTP upgrade handshake | Very low | Bidirectional | Chat, gaming, collaborative editors |
WebSocket at Scale Challenge
WebSocket servers are stateful (each connection lives on a specific server). Solution: sticky sessions at the LB + Redis Pub/Sub to broadcast messages across servers. Any server can publish to Redis; subscriber servers push to their connected users.

CDN & DNS
CDN Flow:
User (India) → DNS → CDN Edge Node (Singapore, nearest)
Cache Hit → return immediately
Cache Miss → fetch from Origin (US) → cache → return
DNS Resolution:
Browser → OS Cache → DNS Resolver (ISP / 8.8.8.8)
→ Root NS (".") → TLD NS (".com") → Authoritative NS
→ returns IP → browser connects (TTL cached)
CDN Cache Invalidation
- Versioned URLs (best practice): /app.v3.2.1.js → new URL = new cache entry
- TTL-based: Cache-Control: max-age=86400
- Purge API: Explicitly invalidate specific paths on deploy (Cloudflare, Fastly)
Reverse Proxy vs Forward Proxy
- Forward Proxy: client-controlled. Client → Proxy → Internet. VPN, content filtering.
- Reverse Proxy: server-controlled. Internet → Proxy → Servers. Load balancing, SSL termination, hiding server topology.
Authentication & Security
JWT vs Session Tokens
JWT (Stateless):
Header.Payload.Signature
Header: {"alg":"HS256","typ":"JWT"}
Payload: {"sub":"user123","roles":["admin"],"exp":1234567890}
Signature: HMAC-SHA256(base64(header)+"."+base64(payload), secret)
Server validates signature ONLY: no DB lookup
Cannot revoke before expiry (use short TTL: 15min)
Refresh token (long-lived, DB-backed) to get new access token
Session (Stateful):
Client sends: Cookie: session_id=abc123
Server: SELECT * FROM sessions WHERE id='abc123' (Redis for speed)
Can revoke instantly (delete from sessions table)
| Factor | JWT | Session Token |
|---|---|---|
| Scalability | Excellent: no shared state | Needs shared Redis/DB across all servers |
| Revocation | Hard: needs a token blacklist in Redis | Easy: delete the session record |
| Token Size | Large (~500 bytes) | Small (random 32-byte string) |
| Best For | Microservices, stateless APIs, mobile | Traditional web apps, high-security (banking) |
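A stdlib-only sketch of the HS256 sign/verify flow shown above (expiry checking omitted; production code should use a vetted library such as PyJWT):

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign_jwt(payload: dict, secret: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = hmac.new(secret, header + b"." + body, hashlib.sha256).digest()
    return b".".join([header, body, b64url(sig)]).decode()

def verify_jwt(token: str, secret: bytes):
    header, body, sig = token.encode().split(b".")
    expected = b64url(hmac.new(secret, header + b"." + body, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):   # signature check only, no DB lookup
        return None
    padded = body + b"=" * (-len(body) % 4)      # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))

token = sign_jwt({"sub": "user123", "roles": ["admin"]}, b"secret")
assert verify_jwt(token, b"secret")["sub"] == "user123"
```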
OAuth 2.0 Authorization Code Flow
1. User clicks "Login with Google"
2. App redirects: GET /oauth/authorize
?client_id=APP_ID
&redirect_uri=https://app.com/callback
&scope=email+profile
&response_type=code
&state=random_csrf_token <-- CSRF protection
3. User authenticates & grants permission at Google
4. Google redirects: GET https://app.com/callback
?code=AUTH_CODE
&state=random_csrf_token <-- verify state matches
5. App backend calls: POST /oauth/token
{code: AUTH_CODE, client_secret: SECRET, grant_type: authorization_code}
6. Google returns: {access_token, refresh_token, expires_in}
7. App calls: GET /userinfo with Bearer access_token
8. On expiry: POST /oauth/token {grant_type: refresh_token, refresh_token: RT}
Key Security Points
- state param = CSRF protection
- PKCE replaces client_secret for mobile/SPA
- access_token = short-lived (15 min)
- refresh_token = long-lived (30 days, stored securely server-side)
- Never expose client_secret in the frontend

System Design Interview Questions
Interview Framework
1. Clarify Requirements (functional + non-functional, scale)
2. Estimate Scale (RPS, storage, bandwidth)
3. High-Level Design (major components)
4. API Design (endpoints, data models)
5. Deep Dive (critical components)
6. Trade-offs (what you chose and why)
Q1
Design a URL Shortener (TinyURL)
Requirements: Shorten URLs, redirect, analytics, expiry. Write: 100M URLs/day. Read: 1B redirects/day (10x writes).
Estimates: 1B reads/day = 12K RPS. 100M writes/day = 1.2K RPS. Storage: 100 bytes * 100M * 365 days * 5 years = ~18TB.
Short Code: Base62 (a-z, A-Z, 0-9). 7 chars = 62^7 = 3.5 trillion unique codes. Use DB auto-increment ID → convert to Base62.
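A Base62 sketch of the ID-to-code conversion (alphabet order is arbitrary as long as encode and decode agree):

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # Base62

def encode_base62(n: int) -> str:
    """Convert an auto-increment ID into a short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def decode_base62(code: str) -> int:
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

assert decode_base62(encode_base62(123456789)) == 123456789
```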
Write: POST /shorten {long_url, expiry}
ShortenerService → INSERT into MySQL (id, short_code, long_url, expiry)
→ return https://short.ly/abc1234
Read: GET /abc1234
Redis cache lookup (short_code → long_url)
HIT: 302 Redirect → long_url
MISS: MySQL lookup → populate Redis (TTL 24h) → 302 Redirect
Analytics (async):
On each redirect → Kafka event {short_code, timestamp, ip, user_agent}
Analytics Consumer → ClickHouse / DynamoDB for aggregates
Design Decisions
- Auto-increment + Base62: no collision, simple
- Redis cache: 99%+ redirects served from cache
- 302 redirect (not 301): server sees every request → accurate analytics
- Async analytics: redirect latency unaffected
Trade-offs
- Auto-increment is guessable (sequential codes)
- Hash-based codes: possible collisions → need retry logic
- Pre-generated codes pool: batch-generate offline
- Bloom filter: check code non-existence before DB query
Q2
Design a Chat System (WhatsApp / Slack)
Requirements: 1:1 and group chat, online presence, message history, delivery/read receipts, media sharing.
Real-time Transport: WebSocket (full-duplex, bidirectional). Long Polling as fallback for restrictive networks.
Message Storage: Cassandra. RowKey = channel_id, clustering by timestamp. Write-heavy, append-only, time-ordered = perfect fit.
Message Flow (Alice → Bob):
1. Alice sends msg via WebSocket → Chat Server A
2. Chat Server A:
a) Persist to Cassandra async
b) PUBLISH to Redis channel "user:bob:msgs"
3. Chat Server B (Bob's connection) subscribes → receives from Redis
4. Chat Server B delivers to Bob via WebSocket
5. Bob's client sends "delivered" ack → Chat Server B → Cassandra
6. Alice receives delivery receipt via WebSocket
Group Chat (fan-out):
message → Kafka → fan-out consumer → deliver to each member's inbox
For large groups (10K+): pull model; members fetch on open
Presence Service:
WebSocket connect → SET user:online:{userId} 1 EX 60 (heartbeat every 30s)
Heartbeat miss → key expires → user offline
Key Decisions
- Redis Pub/Sub connects stateful WebSocket servers
- Cassandra: ideal for high-write time-series message history
- Offline inbox: store in DB, deliver on reconnect
- Media: upload to S3, share URL in message
Challenges
- Large groups (100K): fan-out on write too expensive → use pull
- Message ordering: Cassandra stores by timestamp; tie-break with UUID
- E2E encryption: client encrypts, server stores ciphertext only
- Message dedup: idempotency key per message to prevent duplicates on retry
Q3
Design Twitter / News Feed
Core Challenge: Fan-out on Write vs Fan-out on Read
- Push (Fan-out on Write): When user tweets, push to all followers' Redis feeds. Fast reads. Bad for celebrities (50M followers).
- Pull (Fan-out on Read): Merge tweets from followed users at read time. Slow reads. DB expensive.
- Hybrid (Twitter's approach): Push for regular users, Pull for celebrities at read time. Best of both.
Tweet Write:
POST /tweet → TweetService → Cassandra (tweets by userId + timestamp)
→ Kafka: "tweet-created" event
FanoutWorker (async):
for each follower (if follower_count < 10K):
ZADD feed:{followerId} tweet.timestamp tweet.id
(celebrities skipped → pulled at read time)
Feed Read:
GET /feed/{userId}
Feed Service:
1. ZRANGE feed:{userId} 0 20 → get regular-user tweet IDs
2. For each celebrity followed: fetch their latest 20 tweets directly
3. Merge all → sort by timestamp → return top 20
Like/RT:
Redis: INCR likes:{tweetId}
Async flush to Cassandra every 5min
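A sketch of the hybrid fan-out and read path in redis-py terms (the celebrity threshold and feed cap are illustrative):

```python
import time
import redis

r = redis.Redis()
CELEBRITY_THRESHOLD = 10_000

def fan_out_tweet(tweet_id: str, follower_ids: list[str]):
    """Push model: copy the tweet ID into each follower's feed (skip celebrities)."""
    if len(follower_ids) >= CELEBRITY_THRESHOLD:
        return  # celebrity: followers pull these tweets at read time instead
    now = time.time()
    pipe = r.pipeline()
    for fid in follower_ids:
        pipe.zadd(f"feed:{fid}", {tweet_id: now})       # score = timestamp
        pipe.zremrangebyrank(f"feed:{fid}", 0, -801)    # cap feed at 800 entries
    pipe.execute()

def read_feed(user_id: str, count: int = 20) -> list[bytes]:
    """Newest-first page of the precomputed feed; celebrity tweets merged separately."""
    return r.zrevrange(f"feed:{user_id}", 0, count - 1)
```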
Architecture Decisions
- Cassandra for tweets: write-heavy, time-series pattern
- Redis Sorted Set for feed: O(log N) insert, O(1) range read
- Async fan-out via Kafka: tweet API returns instantly
- S3 + CDN for media content
Trade-offs
- Push model: more storage (each tweet copied to N followers' feeds)
- Pull model: slow read (query N followed users, merge, sort)
- Celebrity tweets: fan-out to 100M followers takes minutes
- Trending: sliding window count via Redis sorted sets
Q4
Design YouTube / Video Streaming
Upload: Chunked upload to S3 → Kafka event → transcoder cluster (FFmpeg) → multiple resolutions (360p/720p/1080p/4K) → output to CDN
Streaming: HLS (HTTP Live Streaming). Video split into ~10s segments (.ts files). Manifest file (.m3u8) lists URLs. Client adapts resolution based on bandwidth (ABR).
Metadata: MySQL for structured data (title, description, tags, owner). Redis for view counts. Elasticsearch for search.
Upload Flow:
Client → chunked multipart upload (resumable) → S3 (raw)
S3 event → Kafka: "video-uploaded"
Transcoder workers (GPU instances):
input: raw.mp4 → FFmpeg → [360p, 720p, 1080p, 4K] + thumbnails
upload outputs → CDN origin → CDN distributes globally
Metadata service: UPDATE videos SET status='published' WHERE id=...
Playback:
GET /video/{id}/manifest → returns .m3u8 (HLS manifest file)
Client parses → fetches segments from CDN edge
Client detects bandwidth drop → switches to lower-quality segment URL
Key Decisions
- CDN essential: video = 80%+ of internet traffic
- HLS/DASH: industry standard, all browsers support
- Async transcoding: instant upload response, background processing
- Separate read/write paths (CQRS-like for metadata)
Challenges
- Storage: 1 video → 5 qualities + thumbnail + captions ≈ 10x raw size
- Transcoding cost: GPU instances expensive; queue management critical
- View count: Redis INCR → async flush to DB (approximate but fast)
- Copyright: Content-ID fingerprinting pipeline runs in background
Q5
Design a Distributed Cache (like Redis Cluster)
Data Distribution: Consistent hashing. hash(key) maps key to server. Virtual nodes for uniform distribution.
Replication: 1 primary + 2 replicas per shard. Reads go to replicas (eventual consistency). Writes always to primary.
Eviction: LRU per shard. Track access time; when memory is full, evict the least-recently-used entries first.
Persistence: RDB (point-in-time snapshot, compact) + AOF (append-only log, durable). Use both for full safety.
Design Decisions
- In-memory: microsecond latency vs millisecond for disk
- Single-threaded event loop: no lock contention
- Pipelining: batch multiple commands in one round-trip
- Lua scripting: atomic multi-command operations
Trade-offs
- Memory limit: working set must fit in RAM (expensive)
- Hot key: shard replicas, local L1 cache in application
- Replication lag: replica may briefly serve stale reads
- Split-brain: use Redis Sentinel or Cluster for HA failover
Q6
Design Search Autocomplete (Typeahead)
Requirements: Top-5 suggestions per prefix, under 100ms latency. 10M users, 10 queries each = 100M queries/day = ~1,200 RPS.
Data Structure: Trie with frequency at each node. Or Elasticsearch prefix query with score boosting.
Ranking: Query frequency + recency + personalization weight.
Query (read, latency-critical):
GET /autocomplete?q=data+cat
→ CDN (caches popular prefixes like "data", "date", "da")
→ Redis: ZREVRANGE prefix:data_cat 0 4 → [top 5 sorted by score]
→ Trie Service (in-memory, sharded by first char)
→ Elasticsearch fallback
Update (async):
User search logs → Kafka → Aggregation Service (5-min windows)
→ update term frequency → rebuild top-K per prefix
→ push to Redis ZSet (ZADD prefix:data_cat freq "data catalog")
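A sketch of the Redis ZSet side of this pipeline (keys mirror the prefix:data_cat convention above; trimming to the top 5 is illustrative):

```python
import redis

r = redis.Redis(decode_responses=True)

def index_term(term: str, frequency: int):
    """Store the term under every prefix, scored by frequency."""
    for i in range(1, len(term) + 1):
        key = f"prefix:{term[:i].replace(' ', '_')}"
        r.zadd(key, {term: frequency})
        r.zremrangebyrank(key, 0, -6)  # keep only the top 5 per prefix

def autocomplete(query: str, k: int = 5) -> list[str]:
    """Highest-scored completions first."""
    return r.zrevrange(f"prefix:{query.replace(' ', '_')}", 0, k - 1)

index_term("data catalog", 900)
index_term("data catalog governance", 300)
print(autocomplete("data cat"))  # ['data catalog', 'data catalog governance']
```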
Optimizations
- Trie sharded by first 2 chars (26^2 = 676 shards)
- Redis ZSet caches top-5 per prefix (1-hour TTL)
- CDN caches most common prefixes globally
- Client debounce: only query after 300ms idle
Challenges
- Trie memory: 5M terms × avg 8 chars = large in-memory footprint
- Real-time trie updates are expensive → batch updates every 5 min
- Personalization: blend global score + user history with weights
- Multi-language: Unicode = more complex trie / different sharding
Q7
Design a Notification System
Event Sources:
User Action / Business Alert / Marketing Campaign
→ Notification Service API
(validates payload, applies user preferences, deduplication check)
→ async → Kafka Topics (separate topics per channel):
-- notifications-email → Email Worker → SendGrid / SES (retry + DLQ)
-- notifications-sms → SMS Worker → Twilio (retry + DLQ)
-- notifications-push → Push Worker → FCM / APNs (retry + DLQ)
-- notifications-inapp → InApp Worker → WebSocket / DB
Notification DB:
-- Delivery status (sent, delivered, failed, read)
-- User preferences (email opt-in, push enabled, quiet hours 10pm-8am)
Key Decisions
- Idempotency key: prevents duplicate sends across retries
- User preferences service: opt-outs, quiet hours, frequency caps
- Priority lanes: critical alerts bypass normal queue
- Rate limit: max 3 emails/hour per user to prevent spam
Challenges
- Deduplication: same event published twice → idempotency-key check before send
- Marketing blast to 10M: partition workers, throttle to ~100K/min
- Delivery tracking: webhooks from SendGrid/Twilio update status
- DLQ monitoring: alert on-call when DLQ grows beyond threshold
Q8
Design Uber / Ride-Hailing
Location Tracking: Drivers send GPS every 5s. GEOADD in Redis (O(log N) insert). GEORADIUS for nearest-driver query.
Geohashing: Encode lat/lng as a string. Nearby locations share prefix. Efficient area queries without lat/lng math.
Matching: Find N nearest available drivers → send offers simultaneously → first accept wins → create Trip (MySQL, ACID).
Driver Location:
Driver app → WebSocket → Location Service → GEOADD drivers {lng, lat, driver_id}
Ride Request:
User → RideRequest Service:
1. GEORADIUS(user_lat, user_lng, radius=5km) → [driverA, driverB, driverC]
2. Filter: available + rating > 4.5 + correct vehicle type
3. Offer sent to top 3 simultaneously (first accept wins)
4. Accepted: INSERT INTO trips (atomic, MySQL) → lock driver state
Real-time Trip Tracking:
Driver GPS every 3s → Kafka → TripTrackingService → WebSocket → User
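A sketch of the Redis geo commands in the flow above (redis-py ≥ 4 and Redis ≥ 6.2 assumed for GEOSEARCH; coordinates are illustrative):

```python
import redis

r = redis.Redis(decode_responses=True)

def update_driver_location(driver_id: str, lng: float, lat: float):
    """Drivers report GPS periodically; GEOADD upserts their position."""
    r.geoadd("drivers", (lng, lat, driver_id))

def nearest_drivers(lng: float, lat: float, radius_km: float = 5, count: int = 3):
    """Radius query around the rider; closest driver IDs first."""
    return r.geosearch(
        "drivers",
        longitude=lng, latitude=lat,
        radius=radius_km, unit="km",
        sort="ASC", count=count,
    )

update_driver_location("driver:42", 77.5946, 12.9716)   # Bengaluru coordinates
print(nearest_drivers(77.60, 12.97))
```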
Key Decisions
- Redis Geo: O(log N) radius search, in-memory speed
- MySQL for trips: ACID needed for payment integrity
- Cassandra for location history: write-heavy time-series
- Surge pricing: pre-computed by geohash zone demand
Challenges
- Simultaneous accept: optimistic lock on driver status (available → in-trip)
- GPS accuracy: Kalman filter to smooth noisy GPS readings
- Driver state machine: available → offered → in-ride → available
- ETA: integrate OSRM / Google Maps API for routing and ETA
Q9
Design a Data Catalog (like Alation)
Core Features: Metadata ingestion from databases/BI tools, search, lineage tracking, data quality scoring, collaboration (stewardship, annotations).
Metadata Storage: Graph DB (Neo4j) for lineage (A → B → C relationships). Elasticsearch for full-text search. PostgreSQL for structured catalog data.
Connectors: Pull metadata via JDBC/REST from source systems (Snowflake, BigQuery, Tableau). Schedule crawls. Detect schema drift.
Metadata Ingestion:
Connector Framework → Source System (Snowflake/BigQuery/Tableau)
Extract: tables, columns, types, stats, query history
Transform: normalize to internal metadata model
Load:
→ PostgreSQL (structured catalog: tables, columns, owners)
→ Elasticsearch (search index: name, description, tags)
→ Neo4j (lineage graph: table derived from upstream table)
→ Event Store (schema change history for audit)
Search:
User query → Elasticsearch (full-text + faceted filtering)
Boost: verified assets, frequently accessed, recently updated
Lineage:
MATCH (a)-[:DERIVED_FROM*]->(b) WHERE a.name='revenue_report'
→ show upstream impact analysis for data quality issues
Key Decisions
- Graph DB: lineage relationships are inherently graph-structured
- Elasticsearch: fuzzy search, facets, ranking by relevance
- Event Sourcing: full audit trail of metadata changes
- CQRS: write-optimized ingest path, read-optimized search path
Challenges
- Schema drift: source changes → must detect and re-index affected assets
- Scale: 100K+ tables, millions of columns = large search index
- Trust/quality scoring: aggregate from multiple signals (freshness, usage, docs)
- Real-time lineage: parse SQL query logs to extract column-level lineage