Operating Systems, Concurrency & Rate Limiting
Core OS internals, thread synchronization, Java & Python concurrency patterns, and rate-limiting algorithms with real implementations.
OS Fundamentals
Process vs Thread
Must Know
| | Process | Thread |
|---|---|---|
| Definition | Independent program in execution | Lightweight unit of execution within a process |
| Memory | Own address space (heap, stack, code, data) | Shared heap with siblings; own stack |
| Communication | IPC (pipes, sockets, shared memory, signals) | Shared memory directly (needs synchronization) |
| Context switch cost | High (TLB flush, page table switch) | Low (same address space) |
| Failure isolation | Process crash doesn't affect others | Thread crash can kill entire process |
| Creation cost | Heavy (fork) | Light (clone) |
| Use when | Strong isolation needed (microservices, browser tabs) | Shared memory tasks (web server handling, GUI) |
Process Memory Layout:
┌─────────────────────┐ High Address
│ Stack (grows ↓) │ → local vars, function frames
├─────────────────────┤
│ ... │
├─────────────────────┤
│ Heap (grows ↑) │ → malloc/new, dynamic allocation
├─────────────────────┤
│ BSS (uninit data) │ → global/static uninit vars
├─────────────────────┤
│ Data (init data) │ → global/static init vars
├─────────────────────┤
│ Text (code) │ → compiled instructions (read-only)
└─────────────────────┘ Low Address
Thread within Process:
Thread 1: own Stack + PC + registers
Thread 2: own Stack + PC + registers
Both share: Heap, Code, Data, File descriptors
Bottleneck: Too many threads → context switch overhead dominates. Use thread pools and async I/O instead of 1-thread-per-request for high concurrency.
Process/Thread States & CPU Scheduling
Core OS
Process States:
NEW ──(admitted)──► READY ──(CPU assigned)──► RUNNING
▲ │
│ (I/O or wait)
│ ▼
(I/O done) WAITING
│ │
└──────────────────────────┘
RUNNING ──(exit)──► TERMINATED
| Scheduling Algorithm | How | Pros | Cons |
|---|---|---|---|
| FCFS (First Come First Served) | Queue order | Simple, no starvation | Convoy effect — long job blocks all |
| SJF (Shortest Job First) | Shortest burst time first | Optimal average wait time | Starvation of long jobs; hard to predict burst |
| Round Robin | Time quantum (e.g. 10ms) per process, rotate | Fair, low response time | High context switch if quantum too small |
| Priority Scheduling | Highest priority first | Important tasks run first | Starvation of low priority |
| Multilevel Queue | Separate queues per priority class | Flexible, OS uses this | Complex |
| CFS (Linux) | Fair share of CPU time based on weight | Fair for all processes, used in Linux kernel | Complex implementation |
Linux CFS — Completely Fair Scheduler
Uses a red-black tree sorted by virtual runtime. The process with the smallest vruntime runs next. Interactive processes get boosted (smaller vruntime increments).
Context Switching — What Happens?
OS Internals
1. CPU saves the current process/thread state (PC, registers, stack pointer, flags) into its PCB (Process Control Block)
2. Scheduler picks the next process/thread to run
3. Saved state from the new process's PCB is loaded into the CPU registers
4. For a process switch: the virtual memory map is also switched (TLB flush — expensive!)
Cost: ~1–10μs per context switch. At 1000 threads each switching 1000 times/sec = 1M switches/sec; at 1–10μs each that is 1–10 CPU-seconds of pure overhead every second — whole cores burned on switching. This is why goroutines (Go), green threads, and async I/O exist.
User Space vs Kernel Threads
Kernel threads (1:1): OS manages, real parallelism, expensive context switch — Java threads
Green threads (M:1): userspace scheduler, no true parallelism, fast switch — Python (old)
Hybrid (M:N): Many user threads → N kernel threads — Go goroutines, Kotlin coroutines
System Calls, Kernel Mode vs User Mode
OS Internals
User Mode
- Restricted — cannot access hardware directly
- Cannot access kernel memory
- Application code runs here
- Invalid instruction → trap (SIGSEGV)
Kernel Mode
- Full hardware access
- OS kernel runs here
- Manages devices, memory, processes
- Entered via system call (trap/interrupt)
System Call Flow:
App calls read() → glibc wrapper → INT 0x80 / SYSCALL instruction
→ CPU switches to kernel mode
→ Kernel validates args, performs I/O
→ Copies data to user buffer
→ Returns to user mode
→ App receives return value
Common System Calls:
fork() — create new process
exec() — replace process image
open() — open file descriptor
read() — read from fd
write() — write to fd
mmap() — map memory
clone() — create thread (Linux)
wait() — wait for child process
Bottleneck: Frequent system calls = frequent mode switches = overhead. Use buffered I/O (stdio), large read/write batches, sendfile() for zero-copy to minimize syscall cost.
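To make the syscall cost concrete, here is a minimal Python sketch (filenames and sizes are illustrative) comparing one write() syscall per line against buffered writes that batch many lines per syscall:
# Sketch: one write() syscall per line vs. buffered, batched writes (illustrative)
import os, time

lines = [b"log line %d\n" % i for i in range(100_000)]

# Unbuffered: one write() syscall — one user/kernel mode switch — per line
start = time.perf_counter()
fd = os.open("unbuffered.log", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
for line in lines:
    os.write(fd, line)          # syscall on every iteration
os.close(fd)
print("per-line syscalls:", time.perf_counter() - start)

# Buffered: userspace buffer flushed in large chunks → far fewer syscalls
start = time.perf_counter()
with open("buffered.log", "wb", buffering=1 << 20) as f:   # 1 MB buffer
    for line in lines:
        f.write(line)           # usually just a memcpy into the buffer
print("buffered writes:  ", time.perf_counter() - start)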
I/O Models — Blocking, Non-blocking, Async
Critical for Interviews
| Model | How it works | Thread blocks? | Examples |
|---|---|---|---|
| Blocking I/O | Thread calls read(), waits until data arrives | Yes — entire thread blocked | Traditional Java IO, Python file reads |
| Non-blocking I/O | read() returns immediately with EAGAIN if no data; app polls | No — but burns CPU polling | O_NONBLOCK sockets |
| I/O Multiplexing | select()/poll()/epoll() monitors multiple fds; blocks until any ready | Yes — on epoll, not on I/O itself | Nginx, Redis, Node.js event loop |
| Signal-driven I/O | Kernel sends SIGIO when fd is ready | No | Rare in practice |
| Async I/O (AIO) | Request I/O, callback/future invoked when done; completely non-blocking | No | Python asyncio, Java NIO2, io_uring |
epoll vs select
select: O(N) scan of all file descriptors, max 1024 fds
epoll: O(1) event-driven, scales to millions of connections — used by Nginx, Redis, Node.js
io_uring (Linux 5.1+): async I/O via shared ring buffers, near zero-copy, minimal syscall overhead (submissions are batched through the rings)
Nginx Event Loop (I/O Multiplexing with epoll):
1 master process + N worker processes (= N CPU cores)
Each worker: single-threaded + epoll event loop
Worker:
while(true):
events = epoll_wait(fd_list) // block until any connection has data
for event in events:
if event == new_connection: accept()
if event == data_ready: read() → process → write()
if event == write_ready: flush buffer
→ Handles 10K+ concurrent connections with 1 thread!
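The same epoll-style loop can be sketched with Python's standard selectors module (which picks epoll on Linux). This is a minimal single-threaded echo server, purely illustrative and not Nginx's actual code:
# Minimal single-threaded echo server via I/O multiplexing (selectors → epoll on Linux)
import selectors, socket

sel = selectors.DefaultSelector()

server = socket.socket()
server.bind(("127.0.0.1", 8080))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)      # new-connection events

while True:
    for key, _ in sel.select():                 # block until any fd is ready (epoll_wait)
        if key.fileobj is server:               # event == new_connection
            conn, _addr = server.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:                                   # event == data_ready
            conn = key.fileobj
            data = conn.recv(4096)
            if data:
                conn.sendall(data)              # echo back (simplified: may block if send buffer is full)
            else:                               # client closed the connection
                sel.unregister(conn)
                conn.close()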
IPC — Inter-Process Communication
OS
| IPC Mechanism | Speed | Persistence | Use Case |
|---|---|---|---|
| Pipes (unnamed) | Fast | Ephemeral | Parent→child, shell pipe (cmd1 \| cmd2) |
| Named Pipes (FIFO) | Fast | Ephemeral | Unrelated processes on same machine |
| Message Queues | Medium | Kernel-persisted | Structured messages, priority |
| Shared Memory | Fastest | Ephemeral | High-throughput data sharing, needs sync |
| Sockets (Unix domain) | Very fast | Ephemeral | Same-machine cross-process (Redis, nginx) |
| Sockets (TCP/UDP) | Network latency | Ephemeral | Cross-machine (microservices) |
| Signals | Instant | N/A | SIGTERM, SIGKILL, SIGUSR1 notifications |
| Memory-mapped files | Very fast | File-backed | Large data sharing, databases (LMDB) |
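A minimal sketch of the first row above — an unnamed pipe between a parent and its forked child (POSIX-only; the message is illustrative):
# Unnamed pipe: parent writes, forked child reads (POSIX only)
import os

read_fd, write_fd = os.pipe()
pid = os.fork()

if pid == 0:                         # child process
    os.close(write_fd)               # close the unused write end
    data = os.read(read_fd, 1024)
    print("child got:", data.decode())
    os.close(read_fd)
    os._exit(0)
else:                                # parent process
    os.close(read_fd)                # close the unused read end
    os.write(write_fd, b"hello from parent")
    os.close(write_fd)
    os.waitpid(pid, 0)               # wait() — reap the child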
Memory Management
Virtual Memory, Paging & TLB
Memory
Virtual Memory
- Each process has its own virtual address space (e.g. 0 to 2⁴⁸ on x86-64)
- OS + MMU translate virtual → physical addresses via page tables
- Enables: isolation between processes, more memory than physical RAM (swap), copy-on-write
Virtual → Physical Translation:
Process 1 Virtual Address 0x7fff1000
→ Page Table (per process)
→ Page Frame Number (physical RAM address 0x3A00000)
→ TLB caches recent translations (L1: ~1ns vs page table walk: ~100ns)
Page Fault:
Access address not in RAM → OS Page Fault handler
→ Load page from disk (swap) → 10ms+ cost!
→ Resume process
Copy-on-Write (fork):
fork() → child shares parent pages (read-only)
On write → OS copies just that page → each has own copy
→ Fast fork() used by shell, Redis BGSAVE
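A hedged sketch of that pattern (POSIX-only; the data and snapshot filename are illustrative): the forked child keeps a copy-on-write view of memory as it was at fork time while the parent keeps mutating — the same trick Redis BGSAVE relies on:
# Copy-on-write snapshot via fork(): the child sees the data as it was at fork time
import os, json

data = {"counter": 0}

pid = os.fork()
if pid == 0:
    # Child: its view of `data` is the COW snapshot taken at fork()
    with open("snapshot.json", "w") as f:
        json.dump(data, f)
    os._exit(0)
else:
    # Parent: keeps mutating; each write touches a page and triggers a per-page copy
    for _ in range(1_000_000):
        data["counter"] += 1
    os.waitpid(pid, 0)
    print("parent counter:", data["counter"], "— snapshot still holds 0")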
| Concept | Detail |
|---|---|
| Page size | Usually 4KB. Huge pages: 2MB/1GB (reduce TLB misses for large datasets) |
| TLB miss | Must walk page table (~100ns). Hot pages in TLB (~1ns) |
| Thrashing | Too many processes, RAM full → constant page faults → system grinding halt |
| Swap | Pages evicted to disk. Swap access 1000x slower than RAM — avoid swap in production! |
| OOM Killer | Linux kills processes when RAM exhausted; use cgroups + limits in containers |
Heap vs Stack — Deep Dive
Memory
| | Stack | Heap |
|---|---|---|
| Allocation | Automatic (compiler) on function call | Manual (malloc/new) or GC-managed |
| Speed | Very fast (just move stack pointer) | Slower (allocator must find free block) |
| Size | Fixed per thread (default 512KB–8MB) | Limited only by RAM+swap |
| Lifetime | Function scope — freed on return | Until free()/GC collects |
| Thread-safe? | Each thread has own stack | Shared heap needs synchronization |
| Overflow | StackOverflowError (deep recursion) | OutOfMemoryError (too many objects) |
Memory Leak: Allocated heap memory never freed. In Java: objects held in static collections. In C: missing free(). Use heap profilers (jmap, async-profiler, valgrind).
Java Memory Model — Quick Reference
Young Gen (Eden + S0 + S1): new objects allocated here. Minor GC is frequent and fast.
Old Gen (Tenured): long-lived objects promoted here. Major GC (stop-the-world) is expensive.
Metaspace: Class metadata (replaced PermGen in Java 8+).
G1GC: Default Java 9+. Concurrent, low pause. For <10ms pauses use ZGC/Shenandoah.
CPU Cache — L1/L2/L3, Cache Lines
Performance
| Level | Size | Latency | Scope |
|---|---|---|---|
| L1 Cache | 32-64 KB | ~1 ns | Per core |
| L2 Cache | 256 KB–2 MB | ~5 ns | Per core |
| L3 Cache (LLC) | 8–64 MB | ~20 ns | Shared across cores |
| Main Memory (RAM) | GB–TB | ~100 ns | Shared |
Cache Line & False Sharing
CPU loads/invalidates memory in 64-byte cache lines. If thread A writes x and thread B writes y, but x and y are on the same cache line, every write invalidates the other core's cache — false sharing. Fix: pad struct fields to 64 bytes, or @Contended in Java.
Cache-Friendly Code
- Access arrays sequentially (row-major order) — spatial locality
- Reuse recently accessed data (temporal locality)
- Avoid pointer-chasing linked lists in hot paths (random cache misses)
- Structure of Arrays (SoA) vs Array of Structures (AoS) — SoA more cache-friendly for column access
Synchronization Primitives
Mutex, Semaphore, Monitor, Spinlock
Must Know
| Primitive | What | Use Case | Trade-off |
|---|---|---|---|
| Mutex | Binary lock — only lock owner can unlock | Protect critical section; mutual exclusion | Only 1 thread at a time; sleeping threads |
| Semaphore | Counter-based: P() decrements, V() increments; blocks when 0 | Limit concurrent access (N connections, N workers) | No ownership — any thread can signal |
| Binary Semaphore | Semaphore initialized to 1 (similar to mutex but no ownership) | Signaling between threads | Can be released by non-owner |
| Monitor | Mutex + condition variables bundled (wait/notify) | Java synchronized blocks, Producer-Consumer | High-level, can have spurious wakeups |
| Spinlock | Busy-wait loop until lock acquired (no sleep) | Very short critical sections (<1μs), kernel code | Burns CPU while waiting |
| RW Lock | Multiple concurrent readers, exclusive writer | Read-heavy data structures (caches, configs) | Writer starvation possible |
Mutex vs Semaphore — Key Difference
Mutex: ownership-based — the thread that locked must unlock. Used for mutual exclusion.
Semaphore: signaling-based — any thread can release. Used for resource counting or thread signaling.
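A short Python sketch of that difference (worker count and sleep time are illustrative): a Lock guards the critical section, while a Semaphore merely caps how many threads use a resource at once:
# Mutex (Lock) for mutual exclusion vs. Semaphore for limiting concurrency
import threading, time

lock = threading.Lock()              # mutex: one thread at a time in the critical section
pool_slots = threading.Semaphore(3)  # semaphore: at most 3 "connections" in use
counter = 0

def use_connection():
    global counter
    with pool_slots:                 # blocks if 3 workers already hold a slot
        time.sleep(0.1)              # simulate work with the shared resource
        with lock:                   # protect the shared counter
            counter += 1

threads = [threading.Thread(target=use_connection) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
print("completed:", counter)         # 10 — all ran, never more than 3 at once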
Deadlock — Conditions, Detection, Prevention
Must Know
Coffman Conditions for Deadlock (ALL 4 must hold)
1. Mutual Exclusion — resource can only be held by one thread
2. Hold and Wait — thread holds one resource and waits for another
3. No Preemption — resource can't be taken away forcefully
4. Circular Wait — Thread A waits for B, B waits for A (or cycle)
Classic Deadlock:
Thread A: lock(mutex1) → try lock(mutex2) → blocks
Thread B: lock(mutex2) → try lock(mutex1) → blocks
→ DEADLOCK — both wait forever
Prevention — Break Circular Wait:
Always acquire locks in SAME ORDER (total ordering):
Thread A: lock(mutex1) → lock(mutex2)
Thread B: lock(mutex1) → lock(mutex2) ← same order!
→ Thread B waits for A to release mutex1 → no deadlock
| Strategy | How | Trade-off |
|---|---|---|
| Lock Ordering | Always acquire locks in consistent order | Must know all locks upfront |
| Timeout + Retry | tryLock(timeout) — back off and retry | May livelock (retry forever) |
| Lock-free / CAS | Atomic compare-and-swap, no locks | Complex, ABA problem |
| Banker's Algorithm | OS allocates resources only if safe state | Complex, needs advance resource declaration |
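A minimal sketch of the lock-ordering strategy in Python (the Account class and transfer scenario are illustrative): both directions of transfer acquire the two locks in one global order, so a cycle can never form:
# Deadlock prevention by total lock ordering: always acquire locks in the same order
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Order the two locks consistently (here: by object id), regardless of direction
    first, second = sorted((src, dst), key=id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account(100), Account(100)
t1 = threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(10_000)])
t2 = threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(10_000)])
t1.start(); t2.start(); t1.join(); t2.join()
print(a.balance, b.balance)   # 100 100 — and no deadlock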
Livelock: Threads keep changing state in response to each other but make no progress (both step aside simultaneously).
Starvation: Thread never gets CPU/resource (priority inversion — low-priority holds lock needed by high-priority).
Producer-Consumer Problem
Classic
// Java: Producer-Consumer with BlockingQueue (preferred approach)
import java.util.concurrent.*;
public class ProducerConsumer {
private static final BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(10);
static class Producer implements Runnable {
public void run() {
for (int i = 0; i < 100; i++) {
try {
queue.put(i); // blocks if queue full
System.out.println("Produced: " + i);
} catch (InterruptedException e) { Thread.currentThread().interrupt(); }
}
}
}
static class Consumer implements Runnable {
public void run() {
while (true) {
try {
int val = queue.take(); // blocks if queue empty
System.out.println("Consumed: " + val);
Thread.sleep(50);
} catch (InterruptedException e) { Thread.currentThread().interrupt(); break; }
}
}
}
public static void main(String[] args) {
new Thread(new Producer()).start();
new Thread(new Consumer()).start();
new Thread(new Consumer()).start(); // multiple consumers
}
}
With explicit wait/notify (lower level)
synchronized(monitor) {
while (queue.isEmpty()) monitor.wait(); // always use while not if (spurious wakeup)
item = queue.poll();
monitor.notifyAll(); // wake up producers
}
Race Conditions & Memory Visibility
Must Know
Race Condition: Two threads access shared state concurrently and at least one writes. Result depends on execution order — undefined behavior.
// BAD: Race condition — counter++ is NOT atomic
// It's: read counter, increment, write counter — 3 steps!
private int counter = 0;
public void increment() { counter++; } // NOT THREAD SAFE
// FIX 1: synchronized
public synchronized void increment() { counter++; }
// FIX 2: AtomicInteger (lock-free, CAS-based)
private AtomicInteger counter = new AtomicInteger(0);
public void increment() { counter.incrementAndGet(); }
// Memory Visibility: volatile
private volatile boolean running = true; // write from one thread visible to all
// Without volatile: thread may read stale cached value from register/L1 cache
Java Memory Model — happens-before
volatile write happens-before subsequent volatile read.
Unlock happens-before subsequent lock.
Thread.start() happens-before any action in started thread.
Without happens-before: compiler/CPU can reorder instructions!
Java Concurrency
Thread Lifecycle & Thread Pool
Java
Thread States:
NEW ──(start())──► RUNNABLE ──(CPU)──► RUNNING
▲ │
│ (sleep/wait/I/O)
│ ▼
(notify) BLOCKED/WAITING
│
RUNNABLE ◄────────────────┘
RUNNING ──(run() returns)──► TERMINATED
// Thread Pool: ALWAYS use instead of creating raw threads
ExecutorService pool = Executors.newFixedThreadPool(
Runtime.getRuntime().availableProcessors() * 2 // rule of thumb for I/O-bound
);
// For CPU-bound tasks: nCPU cores
// For I/O-bound tasks: nCPU * (1 + wait_time / compute_time), often 2x-10x cores
// Submit tasks
Future<String> future = pool.submit(() -> {
Thread.sleep(1000);
return "result";
});
String result = future.get(); // blocks; use get(timeout, unit) in prod
// Custom thread pool
ThreadPoolExecutor executor = new ThreadPoolExecutor(
4, // corePoolSize — always kept alive
16, // maximumPoolSize — max threads
60L, TimeUnit.SECONDS, // keepAliveTime for excess threads
new LinkedBlockingQueue<>(1000), // work queue (bounded!)
new ThreadPoolExecutor.CallerRunsPolicy() // rejection policy
);
// Rejection policies: AbortPolicy(default), CallerRunsPolicy, DiscardPolicy, DiscardOldestPolicy
// Always shutdown properly!
pool.shutdown();
pool.awaitTermination(5, TimeUnit.SECONDS);
Bottleneck: Unbounded queue (LinkedBlockingQueue with no limit) → OOM under load. Always set a queue capacity and define rejection policy.
Java Locks — ReentrantLock, ReadWriteLock, StampedLock
Java
// ReentrantLock — more flexible than synchronized
ReentrantLock lock = new ReentrantLock(true); // fair=true → FIFO order (prevents starvation)
lock.lock();
try {
// critical section
} finally {
lock.unlock(); // ALWAYS in finally!
}
// tryLock with timeout (avoids deadlock)
if (lock.tryLock(500, TimeUnit.MILLISECONDS)) {
try { /* ... */ } finally { lock.unlock(); }
} else {
// couldn't acquire — handle gracefully
}
// ReadWriteLock: many readers OR one writer
ReadWriteLock rwLock = new ReentrantReadWriteLock();
// Multiple readers can hold read lock simultaneously
rwLock.readLock().lock();
try { return map.get(key); } finally { rwLock.readLock().unlock(); }
// Exclusive write lock
rwLock.writeLock().lock();
try { map.put(key, val); } finally { rwLock.writeLock().unlock(); }
// StampedLock (Java 8+): optimistic read without locking
StampedLock sl = new StampedLock();
long stamp = sl.tryOptimisticRead();
int x = this.x; int y = this.y; // read without lock
if (!sl.validate(stamp)) { // check if write happened
stamp = sl.readLock();
try { x = this.x; y = this.y; } finally { sl.unlockRead(stamp); }
}
// ~3x faster than ReadWriteLock for read-heavy workloads
Atomic Classes & Lock-Free Data Structures
Java
// Atomic types — based on CPU CAS (Compare-And-Swap) instruction
AtomicInteger counter = new AtomicInteger(0);
counter.incrementAndGet(); // atomic i++
counter.compareAndSet(expected, newVal); // CAS — update only if current == expected
counter.getAndUpdate(x -> x * 2);
AtomicLong, AtomicBoolean, AtomicReference<T>
// LongAdder — better than AtomicLong under high contention
// Maintains per-CPU-stripe counters, sums on read
LongAdder adder = new LongAdder();
adder.increment(); // contention spread across stripes
long total = adder.sum(); // sum() is not an atomic snapshot under concurrent updates, but much faster
// ConcurrentHashMap — segment-level locking (Java 8: CAS + synchronized per bucket)
ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
map.putIfAbsent(key, 0);
map.compute(key, (k, v) -> v == null ? 1 : v + 1); // atomic update
// CopyOnWriteArrayList — writes create new copy; reads never block
// Good for: event listener lists (rare writes, many reads)
CopyOnWriteArrayList<Listener> listeners = new CopyOnWriteArrayList<>();
ABA Problem: CAS checks if value is still A, but it may have changed A→B→A. Use AtomicStampedReference (version counter) to solve.
CompletableFuture & Async Pipelines
JavaModern
// CompletableFuture — non-blocking async composition
CompletableFuture<User> userFuture = CompletableFuture.supplyAsync(
() -> userService.findById(userId), // without an executor arg this runs on ForkJoinPool.commonPool()
customExecutor // pass your own executor to control which pool runs it
);
// Chain operations (no blocking!)
CompletableFuture<Void> result = userFuture
.thenApply(user -> user.getEmail()) // transform
.thenCompose(email -> sendEmail(email)) // chain another async op
.thenAccept(res -> log.info("Sent: " + res)) // side effect, returns CompletableFuture<Void>
.exceptionally(ex -> { log.error("send failed", ex); return null; }); // error handling
// Run multiple in parallel, wait for all
CompletableFuture<User> cf1 = CompletableFuture.supplyAsync(() -> getUser(id));
CompletableFuture<Profile> cf2 = CompletableFuture.supplyAsync(() -> getProfile(id));
CompletableFuture.allOf(cf1, cf2).thenRun(() -> {
User user = cf1.join();
Profile profile = cf2.join();
// both ready here
});
// First one to complete wins
CompletableFuture.anyOf(cf1, cf2).thenAccept(result -> { /* handle */ });
// With timeout (Java 9+)
cf1.orTimeout(2, TimeUnit.SECONDS)
.exceptionally(ex -> DEFAULT_USER);
Virtual Threads — Java 21 (Project Loom)
Java 21Modern
What are Virtual Threads?
JVM-managed lightweight threads — millions can exist simultaneously. When a virtual thread blocks on I/O, the underlying OS thread (carrier) is freed to run other virtual threads. No more async/callback hell.
// Create virtual threads — Java 21
Thread.ofVirtual().start(() -> {
// blocking I/O here doesn't waste OS thread!
String result = httpClient.get("https://api.example.com/data");
});
// Virtual thread executor — best for I/O-bound server workloads
ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
// Submit 100,000 tasks — no problem, each gets own virtual thread
for (int i = 0; i < 100_000; i++) {
executor.submit(() -> processRequest()); // cheap!
}
// Spring Boot 3.2+: enable virtual threads
// application.properties: spring.threads.virtual.enabled=true
| | Platform Threads | Virtual Threads |
|---|---|---|
| Count | ~1K–10K max practical | Millions possible |
| Memory | ~1MB stack each | ~few KB stack each |
| Blocking I/O | Wastes OS thread | OS thread freed, remounted |
| CPU-bound | Same as virtual | No advantage over platform threads |
Python Concurrency
The GIL — What It Is and Why It Matters
Must Know
GIL — Global Interpreter Lock
CPython has a mutex (the GIL) that allows only ONE thread to execute Python bytecode at a time. This means Python threads cannot achieve true CPU parallelism for Python code. I/O operations release the GIL (see the timing sketch after the table).
| Task Type | Best Tool | Why |
|---|---|---|
| I/O bound (network, file) | threading or asyncio | GIL released during I/O; threads work well |
| CPU bound | multiprocessing | Separate processes = separate GILs = true parallelism |
| Mixed / many connections | asyncio | Single-threaded but non-blocking, high concurrency |
| C extensions | threading | C code can release GIL explicitly |
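A hedged timing sketch of the table above (exact numbers depend on the machine): the same CPU-bound function run on threads, which the GIL serializes, versus processes, which run truly in parallel:
# CPU-bound work: threads are serialized by the GIL, processes are not
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":          # guard required for multiprocessing on macOS/Windows
    work = [2_000_000] * 4

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(burn, work))
    print("threads:  ", round(time.perf_counter() - start, 2), "s  (GIL-bound)")

    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as ex:
        list(ex.map(burn, work))
    print("processes:", round(time.perf_counter() - start, 2), "s  (parallel)")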
GIL Removed in Python 3.13+
Experimental "free-threaded" CPython (PEP 703). Disabling GIL allows true multi-threaded CPU parallelism. Still evolving — not default yet.threading, multiprocessing, asyncio — When to Use What
Python
import threading, time, requests
from concurrent.futures import ThreadPoolExecutor
# I/O bound: fetching multiple URLs concurrently
urls = ["https://api1.com", "https://api2.com", "https://api3.com"]
def fetch(url):
r = requests.get(url, timeout=5) # GIL released during network I/O
return r.status_code
# ThreadPoolExecutor — recommended over raw threading
with ThreadPoolExecutor(max_workers=10) as executor:
results = list(executor.map(fetch, urls)) # parallel fetches
# Synchronization
lock = threading.Lock()
shared_counter = 0
def safe_increment():
global shared_counter
with lock: # context manager — auto releases on exception too
shared_counter += 1
# threading.Event for signaling
event = threading.Event()
def worker():
event.wait() # blocks until event.set() called
print("Go!")
t = threading.Thread(target=worker)
t.start()
time.sleep(1)
event.set() # unblock worker
from multiprocessing import Pool, Process, Queue, Manager
import os
# CPU bound: parallel number crunching
def compute(n):
return sum(i * i for i in range(n))
# Process pool — separate GIL per process = real parallelism
with Pool(processes=os.cpu_count()) as pool:
results = pool.map(compute, [10**6, 10**6, 10**6, 10**6])
# Sharing data between processes
manager = Manager()
shared_dict = manager.dict() # proxy object, safe across processes
q = Queue() # inter-process queue
def producer(q):
for i in range(5):
q.put(i)
q.put(None) # sentinel
def consumer(q):
while True:
item = q.get()
if item is None: break
print(f"Got: {item}")
p1 = Process(target=producer, args=(q,))
p2 = Process(target=consumer, args=(q,))
p1.start(); p2.start()
p1.join(); p2.join()
import asyncio, aiohttp
# asyncio: single-threaded, non-blocking, cooperative multitasking
# Best for: many concurrent I/O operations (HTTP, DB, file)
async def fetch(session, url):
async with session.get(url) as response:
return await response.text() # yield control while waiting
async def main():
urls = ["https://api1.com", "https://api2.com", "https://api3.com"]
async with aiohttp.ClientSession() as session:
# Run all fetches concurrently — NOT parallel, but concurrent!
tasks = [fetch(session, url) for url in urls]
results = await asyncio.gather(*tasks) # wait for all
# asyncio.gather with return_exceptions=True to not fail on first error
# Run event loop
asyncio.run(main())
# asyncio Semaphore — limit concurrent operations
sem = asyncio.Semaphore(10) # max 10 concurrent requests
async def rate_limited_fetch(session, url):
    async with sem: # blocks if 10 already in flight
        return await fetch(session, url)
# asyncio Queue — async producer-consumer
queue = asyncio.Queue(maxsize=100)
async def producer():
await queue.put(item) # blocks if full
async def consumer():
item = await queue.get() # blocks if empty
queue.task_done()
Common asyncio mistake: Calling blocking code inside async function blocks the entire event loop! Use loop.run_in_executor() to offload blocking calls to a thread pool.
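A minimal sketch of the fix (asyncio.to_thread, available since Python 3.9, is shorthand for run_in_executor; the blocking function here just sleeps as a stand-in for a real blocking call):
# Offloading a blocking call so it doesn't stall the event loop
import asyncio, time

def blocking_io():
    time.sleep(1)                 # stands in for requests.get(), a sync DB driver, etc.
    return "done"

async def main():
    # BAD: calling blocking_io() directly here would freeze every other task for 1s
    # GOOD: run it on a worker thread and await the result
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_io),          # Python 3.9+; wraps run_in_executor
        asyncio.sleep(1),                        # other tasks keep running meanwhile
    )
    print(result)

asyncio.run(main())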
Rate Limiting
Rate Limiting Algorithms — Deep Dive
Must Know
| Algorithm | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed Window Counter | Count requests per window (e.g. 100 req/min). Window resets at :00, :01... | Simple, O(1), low memory | Boundary burst: 100 req at :59 + 100 at :01 = 200 in 2 seconds | Simple APIs, rough limits |
| Sliding Window Log | Keep timestamp log; count entries within last N seconds | Precise — truly N req per N seconds | High memory: O(N) per user, N = max requests | Low-volume, high precision |
| Sliding Window Counter | Current window count + (prev window count × remaining fraction) | Memory efficient, ~accurate, no boundary burst | Approximation — slightly off at window edge | General purpose, high traffic |
| Token Bucket | Bucket holds max N tokens. Tokens added at rate R/sec. Request costs 1 token. | Allows controlled bursts (up to bucket size) | Burst can exceed per-second rate | APIs that need burst allowance (CDN, mobile) |
| Leaky Bucket | Requests queued; processed at fixed rate R/sec (queue leaks at constant rate) | Smooth, constant output rate | Queue can fill → requests dropped or delayed | Egress rate limiting, network QoS |
Token Bucket — Visual
Bucket capacity: 10 tokens
Refill rate: 1 token/second
t=0: tokens=10, request arrives → consume 1 → tokens=9
t=0: 5 more requests → tokens=4
t=1: refill → tokens=5, request → tokens=4
Burst case: 10 requests arrive at once → consume all 10 → tokens=0
next: requests REJECTED until refill
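A single-process token bucket sketch matching the trace above (capacity and refill rate are illustrative); the Redis/Lua version further down is its distributed counterpart:
# In-memory token bucket: capacity 10, refill 1 token/second
import time

class TokenBucket:
    def __init__(self, capacity=10, refill_rate=1.0):
        self.capacity = capacity
        self.refill_rate = refill_rate          # tokens per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket()
print([bucket.allow() for _ in range(12)])   # first 10 True (burst), then False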
Sliding Window Counter
Window: 1 min, limit: 100 req
At time :45 of current minute:
prev_window_count = 84
curr_window_count = 36
weight = (60-45)/60 = 0.25
estimate = 84×0.25 + 36 = 57
→ allow (57 < 100)
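The same estimate as a small single-node sketch (per-user dictionaries with no eviction of old windows — illustrative only; the Redis implementations below are the production route):
# Sliding window counter: prev_count * overlap_fraction + curr_count
import time
from collections import defaultdict

WINDOW = 60          # seconds
LIMIT = 100

counts = defaultdict(lambda: defaultdict(int))   # per-user: {window_index: count}

def allow(user_id):
    now = time.time()
    curr_win = int(now // WINDOW)
    prev_win = curr_win - 1
    # Fraction of the previous window still inside the sliding 60s window
    weight = 1.0 - (now % WINDOW) / WINDOW
    estimate = counts[user_id][prev_win] * weight + counts[user_id][curr_win]
    if estimate >= LIMIT:
        return False
    counts[user_id][curr_win] += 1
    return True

print(all(allow("alice") for _ in range(100)))   # True — first 100 allowed
print(allow("alice"))                            # False — estimate hit the limit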
Redis Rate Limiting — Production Implementations
CodeImportant
# Fixed Window Counter with Redis
import redis, time
r = redis.Redis()
def is_allowed_fixed_window(user_id: str, limit: int = 100, window_sec: int = 60) -> bool:
# Key includes current window timestamp to auto-expire
window_key = f"rl:fw:{user_id}:{int(time.time() // window_sec)}"
# Atomic pipeline: INCR + EXPIRE
pipe = r.pipeline()
pipe.incr(window_key)
pipe.expire(window_key, window_sec)
results = pipe.execute()
current_count = results[0]
return current_count <= limit
# Usage in API handler
def api_handler(user_id: str):
if not is_allowed_fixed_window(user_id, limit=100):
raise RateLimitError("429 Too Many Requests")
# ... handle request
# Sliding Window Log with Redis Sorted Set
import redis, time
r = redis.Redis()
def is_allowed_sliding_log(user_id: str, limit: int = 100, window_sec: int = 60) -> bool:
key = f"rl:swl:{user_id}"
now = time.time()
window_start = now - window_sec
pipe = r.pipeline()
# Remove old entries outside window
pipe.zremrangebyscore(key, '-inf', window_start)
# Count remaining entries in window
pipe.zcard(key)
# Add current request with timestamp as score
pipe.zadd(key, {str(now): now})
# Set key expiry (cleanup)
pipe.expire(key, window_sec + 1)
results = pipe.execute()
count_after_prune = results[1]
return count_after_prune < limit # count excludes the request just added (note: it is logged even if denied)
# NOTE: This is not fully atomic — use Lua script for strict correctness
# Token Bucket via Redis Lua script — ATOMIC
import redis, time
r = redis.Redis()
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local refill_rate = tonumber(ARGV[3]) -- tokens per second
local tokens_requested = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now
-- Calculate how many tokens to add since last refill
local elapsed = math.max(0, now - last_refill)
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)
-- Check if we have enough tokens
if new_tokens >= tokens_requested then
new_tokens = new_tokens - tokens_requested
redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return 1 -- allowed
else
redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return 0 -- denied
end
"""
token_bucket_script = r.register_script(TOKEN_BUCKET_LUA)
def is_allowed_token_bucket(user_id: str, capacity=10, refill_rate=1) -> bool:
key = f"rl:tb:{user_id}"
result = token_bucket_script(
keys=[key],
args=[time.time(), capacity, refill_rate, 1]
)
return result == 1
// Rate Limiting Middleware in Java (Spring Boot Filter)
@Component
public class RateLimitFilter extends OncePerRequestFilter {
private final StringRedisTemplate redis;
private static final int LIMIT = 100;
private static final int WINDOW_SECONDS = 60;
@Override
protected void doFilterInternal(HttpServletRequest req,
HttpServletResponse res,
FilterChain chain) throws IOException, ServletException {
String userId = extractUserId(req); // from JWT or API key
String key = "rl:" + userId + ":" + (System.currentTimeMillis() / 1000 / WINDOW_SECONDS);
Long count = redis.opsForValue().increment(key);
if (count == 1) {
redis.expire(key, WINDOW_SECONDS, TimeUnit.SECONDS);
}
// Set rate limit headers (good practice)
res.setHeader("X-RateLimit-Limit", String.valueOf(LIMIT));
res.setHeader("X-RateLimit-Remaining", String.valueOf(Math.max(0, LIMIT - count)));
if (count > LIMIT) {
res.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
res.setHeader("Retry-After", String.valueOf(WINDOW_SECONDS));
res.getWriter().write("{\"error\": \"Rate limit exceeded\"}");
return;
}
chain.doFilter(req, res);
}
}
Distributed Rate Limiting — Design & Trade-offs
System Design
Distributed Rate Limiting:
Request → Load Balancer → API Server (any node)
↓
Redis Cluster (shared state)
- key: user_id:window
- value: request count
↓
All nodes see same counters
→ consistent rate enforcement
Without shared state:
3 servers, limit=100
Each server: limit=100 → user can make 300 requests!
OR each server limit=33 → user may be blocked at 33 if on same server
| Approach | Consistency | Latency Added | Failure Mode |
|---|---|---|---|
| Centralized Redis | Strong | ~1ms Redis RTT | Redis down = limit enforcement fails |
| Redis Cluster | Strong (with slot ownership) | ~1ms | Node loss handled by failover (highly available) |
| Local (per-node) + gossip sync | Eventual | None | Slight over-limit possible |
| Sticky sessions (LB routes same user to same node) | Strong per-user | None | Node down = user reset |
Trade-off: Strict distributed rate limiting adds ~1ms Redis latency to every request. For very high-throughput APIs, use local counters with periodic sync or token bucket approximation.
Rate Limit Headers (429 status from RFC 6585; X-RateLimit-* headers are a de facto convention)
X-RateLimit-Limit: 100 — max requests in the window
X-RateLimit-Remaining: 43 — remaining in the window
X-RateLimit-Reset: 1716825600 — Unix timestamp when the window resets
Retry-After: 30 — seconds until next allowed request (on 429)
Rate Limiting in Real Systems — Design Interview Guide
Interview
Requirements: What kind? Per-user, per-IP, per-API-key, per-endpoint? Hard or soft limit? What on violation: drop, queue, degrade?
Algorithm: Token bucket (burst-friendly) for user-facing APIs. Sliding window counter for strict enforcement. Leaky bucket for egress smoothing.
Storage: Redis for distributed. In-memory for single-node. Lua scripts for atomicity (no race condition between read-check-write).
Where to enforce: API Gateway (centralized, easiest) → Service mesh (sidecar) → Individual service (fine-grained). Multiple layers = defense in depth.
Client handling: Return 429 with a Retry-After header. Clients should use exponential backoff + jitter; immediate retries cause a thundering herd (see the sketch after this list).
Monitoring: Track rate limit hits per user/endpoint. Alert on sudden spike = scraper/DDoS. Dashboard shows top offenders.
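A minimal client-side sketch of that retry behavior (the URL and retry parameters are placeholders): honor Retry-After when present, otherwise back off exponentially with full jitter:
# Client-side handling of 429: exponential backoff with full jitter + Retry-After
import random, time
import urllib.request, urllib.error

def get_with_backoff(url, max_retries=5, base=0.5, cap=30):
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            # Prefer the server's Retry-After hint (assumed to be in seconds here)
            retry_after = e.headers.get("Retry-After")
            if retry_after is not None:
                delay = float(retry_after)
            else:
                delay = random.uniform(0, min(cap, base * 2 ** attempt))  # full jitter
            time.sleep(delay)
    raise RuntimeError("rate limited: retries exhausted")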
Multi-Tier Rate Limiting
- L1 — IP level: 1000 req/min per IP (CDN/WAF, stops DDoS)
- L2 — API key: 100 req/min per API key (stops abuse)
- L3 — User: 60 req/min per user account (fair use)
- L4 — Endpoint: /login: 5 req/min (prevent brute force)