Operating Systems, Concurrency & Rate Limiting

Core OS internals, thread synchronization, Java & Python concurrency patterns, and rate-limiting algorithms with real implementations.

OS Fundamentals

Process vs Thread

Must Know
Aspect | Process | Thread
Definition | Independent program in execution | Lightweight unit of execution within a process
Memory | Own address space (heap, stack, code, data) | Shared heap with siblings; own stack
Communication | IPC (pipes, sockets, shared memory, signals) | Shared memory directly (needs synchronization)
Context switch cost | High (TLB flush, page table switch) | Low (same address space)
Failure isolation | Process crash doesn't affect others | Thread crash can kill entire process
Creation cost | Heavy (fork) | Light (clone)
Use when | Strong isolation needed (microservices, browser tabs) | Shared-memory tasks (web server handling, GUI)
Process Memory Layout:

┌─────────────────────┐  High Address
│  Stack (grows ↓)    │  → local vars, function frames
├─────────────────────┤
│        ...          │
├─────────────────────┤
│  Heap (grows ↑)     │  → malloc/new, dynamic allocation
├─────────────────────┤
│  BSS (uninit data)  │  → global/static uninitialized vars
├─────────────────────┤
│  Data (init data)   │  → global/static initialized vars
├─────────────────────┤
│  Text (code)        │  → compiled instructions (read-only)
└─────────────────────┘  Low Address

Thread within Process:
  Thread 1: own Stack + PC + registers
  Thread 2: own Stack + PC + registers
  Both share: Heap, Code, Data, File descriptors
Bottleneck: Too many threads → context switch overhead dominates. Use thread pools and async I/O instead of 1-thread-per-request for high concurrency.

Process/Thread States & CPU Scheduling

Core OS
Process States:

NEW ──(admitted)──► READY ──(CPU assigned)──► RUNNING ──(exit)──► TERMINATED
                      ▲                          │
                      │ (I/O done)               │ (I/O or wait)
                      └──────── WAITING ◄────────┘
Scheduling Algorithm | How | Pros | Cons
FCFS (First Come First Served) | Queue order | Simple, no starvation | Convoy effect — long job blocks all
SJF (Shortest Job First) | Shortest burst time first | Optimal average wait time | Starvation of long jobs; hard to predict burst
Round Robin | Time quantum (e.g. 10ms) per process, rotate | Fair, low response time | High context-switch cost if quantum too small
Priority Scheduling | Highest priority first | Important tasks run first | Starvation of low priority
Multilevel Queue | Separate queues per priority class | Flexible; OSes use this | Complex
CFS (Linux) | Fair share of CPU time based on weight | Fair for all processes; default in the Linux kernel | Complex implementation
Linux CFS — Completely Fair Scheduler
Uses a red-black tree sorted by virtual runtime. Process with smallest vruntime runs next. Interactive processes get boosted (smaller vruntime increment).
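A minimal sketch of the idea (a hypothetical toy scheduler; a heap stands in for the kernel's red-black tree): the runqueue is ordered by vruntime, the smallest entry runs next, and heavier-weight tasks accrue vruntime more slowly, so they receive a larger share of CPU.

# Toy CFS-style scheduler (illustration only — real CFS uses a red-black tree and nanosecond accounting)
import heapq

tasks = [("A", 1.0), ("B", 2.0), ("C", 1.0)]                # (name, weight); B has double the weight
runqueue = [(0.0, name, weight) for name, weight in tasks]  # entries ordered by vruntime
heapq.heapify(runqueue)

TIMESLICE_MS = 10
for _ in range(8):
    vruntime, name, weight = heapq.heappop(runqueue)        # smallest vruntime runs next
    print(f"run {name}  vruntime={vruntime:5.1f}")
    vruntime += TIMESLICE_MS / weight                       # heavier tasks accrue vruntime more slowly
    heapq.heappush(runqueue, (vruntime, name, weight))

Running this shows B scheduled roughly twice as often as A and C — the "fair share by weight" behavior the table describes.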

Context Switching — What Happens?

OS Internals
1. CPU saves the current process/thread state (PC, registers, stack pointer, flags) into its PCB (Process Control Block)
2. Scheduler picks the next process/thread to run
3. Saved state from the new process's PCB is loaded into the CPU registers
4. For a process switch: also switch the virtual memory map (TLB flush — expensive!)
Cost: ~1-10μs per context switch. 1000 threads each switching 1000 times/sec = 1M switches/sec → 1-10 seconds of pure switching work per wall-clock second, i.e. the overhead alone can saturate one or more cores. This is why goroutines (Go), green threads, and async I/O exist.
User Space vs Kernel Threads
Kernel threads (1:1): OS manages, real parallelism, expensive context switch — Java threads
Green threads (M:1): Userspace scheduler, no true parallelism, fast switch — Python (old)
Hybrid (M:N): Many user threads → N kernel threads — Go goroutines, Kotlin coroutines

System Calls, Kernel Mode vs User Mode

OS Internals

User Mode

  • Restricted — cannot access hardware directly
  • Cannot access kernel memory
  • Application code runs here
  • Privileged instruction or invalid memory access → trap (SIGILL / SIGSEGV)

Kernel Mode

  • Full hardware access
  • OS kernel runs here
  • Manages devices, memory, processes
  • Entered via system call (trap/interrupt)
System Call Flow:

App calls read()
  → glibc wrapper
  → INT 0x80 / SYSCALL instruction
  → CPU switches to kernel mode
  → Kernel validates args, performs I/O
  → Copies data to user buffer
  → Returns to user mode
  → App receives return value

Common System Calls:
  fork()  — create new process
  exec()  — replace process image
  open()  — open file descriptor
  read()  — read from fd
  write() — write to fd
  mmap()  — map memory
  clone() — create thread (Linux)
  wait()  — wait for child process
Bottleneck: Frequent system calls = frequent mode switches = overhead. Use buffered I/O (stdio), large read/write batches, sendfile() for zero-copy to minimize syscall cost.
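A small sketch of the buffered-I/O point (file paths are illustrative; Unix-like system assumed): the raw loop issues one write() syscall per line, while the buffered version batches them in user space. Comparing the two under strace -c makes the difference in syscall counts visible.

# Syscall cost illustration: unbuffered vs buffered writes
import os

fd = os.open("/tmp/unbuffered.log", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
for _ in range(1000):
    os.write(fd, b"line\n")            # one user→kernel mode switch per iteration (1000 write() calls)
os.close(fd)

with open("/tmp/buffered.log", "w", buffering=64 * 1024) as f:
    for _ in range(1000):
        f.write("line\n")              # stays in the user-space buffer; flushed in a handful of write() calls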

I/O Models — Blocking, Non-blocking, Async

Critical for Interviews
Model | How it works | Thread blocks? | Examples
Blocking I/O | Thread calls read(), waits until data arrives | Yes — entire thread blocked | Traditional Java IO, Python file reads
Non-blocking I/O | read() returns immediately with EAGAIN if no data; app polls | No — but burns CPU polling | O_NONBLOCK sockets
I/O Multiplexing | select()/poll()/epoll() monitors multiple fds; blocks until any is ready | Yes — on epoll_wait, not on the I/O itself | Nginx, Redis, Node.js event loop
Signal-driven I/O | Kernel sends SIGIO when fd is ready | No | Rare in practice
Async I/O (AIO) | Request I/O; callback/future invoked when done; completely non-blocking | No | Python asyncio, Java NIO.2, io_uring
epoll vs select
select: O(N) scan of all file descriptors, max 1024 fds
epoll: O(1) event-driven, scales to millions of connections — used by Nginx, Redis, Node.js
io_uring (Linux 5.1+): async I/O via shared ring buffers between user space and kernel — near zero-copy, with syscall overhead minimized by batching submissions
Nginx Event Loop (I/O Multiplexing with epoll):

1 master process + N worker processes (N = CPU cores)
Each worker: single-threaded + epoll event loop

Worker:
  while (true):
      events = epoll_wait(fd_list)      // block until any connection has data
      for event in events:
          if event == new_connection: accept()
          if event == data_ready:     read() → process → write()
          if event == write_ready:    flush buffer

→ Handles 10K+ concurrent connections with 1 thread!
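The same pattern in Python, as a rough sketch (port and buffer size are arbitrary): the selectors module picks epoll on Linux, and a single thread multiplexes every connection.

# Single-threaded echo server using I/O multiplexing (selectors → epoll on Linux)
import selectors
import socket

sel = selectors.DefaultSelector()

server = socket.socket()
server.bind(("0.0.0.0", 8080))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)

while True:
    for key, _ in sel.select():              # block until any registered fd is ready
        if key.fileobj is server:
            conn, _addr = server.accept()    # new connection
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = key.fileobj.recv(4096)    # data ready — won't block
            if data:
                key.fileobj.send(data)       # echo it back
            else:                            # empty read = peer closed the connection
                sel.unregister(key.fileobj)
                key.fileobj.close()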

IPC — Inter-Process Communication

OS
IPC Mechanism | Speed | Persistence | Use Case
Pipes (unnamed) | Fast | Ephemeral | Parent→child, shell pipe (cmd1 | cmd2)
Named Pipes (FIFO) | Fast | Ephemeral | Unrelated processes on same machine
Message Queues | Medium | Kernel-persisted | Structured messages, priority
Shared Memory | Fastest | Ephemeral | High-throughput data sharing, needs sync
Sockets (Unix domain) | Very fast | Ephemeral | Same-machine cross-process (Redis, nginx)
Sockets (TCP/UDP) | Network latency | Ephemeral | Cross-machine (microservices)
Signals | Instant | N/A | SIGTERM, SIGKILL, SIGUSR1 notifications
Memory-mapped files | Very fast | File-backed | Large data sharing, databases (LMDB)
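Two of these mechanisms sketched in Python (a minimal, Unix-oriented example; names are illustrative): a pipe for message passing between a parent and child process, and a shared-memory block another process could attach to by name.

# Pipe vs shared memory between processes
from multiprocessing import Process, Pipe, shared_memory

def child(conn):
    conn.send("hello from child")      # message passing over the pipe
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=child, args=(child_conn,))
    p.start()
    print(parent_conn.recv())          # "hello from child"
    p.join()

    # Shared memory: fastest IPC, but readers/writers must synchronize themselves
    shm = shared_memory.SharedMemory(create=True, size=1024)
    shm.buf[:5] = b"hello"             # another process attaches via SharedMemory(name=shm.name)
    shm.close()
    shm.unlink()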
Memory Management

Virtual Memory, Paging & TLB

Memory

Virtual Memory

  • Each process has its own virtual address space (e.g. 0 to 2⁴⁸ on x86-64)
  • OS + MMU translate virtual → physical addresses via page tables
  • Enables: isolation between processes, more memory than physical RAM (swap), copy-on-write
Virtual → Physical Translation:

Process 1 virtual address 0x7fff1000
  → Page Table (per process)
  → Page Frame Number (physical RAM address, e.g. 0x3A00000)
  → TLB caches recent translations (TLB hit: ~1ns vs page-table walk: ~100ns)

Page Fault:
  Access an address not in RAM → OS page-fault handler
  → Load page from disk (swap) → 10ms+ cost! → Resume process

Copy-on-Write (fork):
  fork() → child shares parent pages (read-only)
  On write → OS copies just that page → each has its own copy
  → Fast fork() used by shells, Redis BGSAVE
Concept | Detail
Page size | Usually 4KB. Huge pages: 2MB/1GB (reduce TLB misses for large datasets)
TLB miss | Must walk the page table (~100ns). Hot pages stay in the TLB (~1ns)
Thrashing | Too many processes, RAM full → constant page faults → system grinds to a halt
Swap | Pages evicted to disk. Swap access is ~1000x slower than RAM — avoid swap in production!
OOM Killer | Linux kills processes when RAM is exhausted; use cgroups + limits in containers
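Copy-on-write in action, as a small Unix-only sketch with os.fork(): the child starts out sharing the parent's pages, and its first write copies only the touched page, leaving the parent's data untouched.

# Copy-on-write demonstration (Unix only — os.fork() is unavailable on Windows)
import os

data = bytearray(100 * 1024 * 1024)    # 100 MB buffer allocated in the parent

pid = os.fork()                        # child shares the parent's pages, marked read-only
if pid == 0:                           # child process
    data[0] = 1                        # write → page fault → kernel copies just this page
    os._exit(0)
else:                                  # parent process
    os.wait()
    print(data[0])                     # prints 0 — the child's write never reached the parent's copy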

Heap vs Stack — Deep Dive

Memory
Aspect | Stack | Heap
Allocation | Automatic (compiler) on function call | Manual (malloc/new) or GC-managed
Speed | Very fast (just move the stack pointer) | Slower (allocator must find a free block)
Size | Fixed per thread (default 512KB–8MB) | Limited only by RAM + swap
Lifetime | Function scope — freed on return | Until free()/GC collects
Thread-safe? | Each thread has its own stack | Shared heap needs synchronization
Overflow | StackOverflowError (deep recursion) | OutOfMemoryError (too many objects)
Memory Leak: Allocated heap memory never freed. In Java: objects held in static collections. In C: missing free(). Use heap profilers (jmap, async-profiler, valgrind).
Java Memory Model — Quick Reference
Young Gen (Eden + S0 + S1): New objects allocated here. Minor GC frequent, fast.
Old Gen (Tenured): Long-lived objects promoted here. Major GC (stop-the-world) is expensive.
Metaspace: Class metadata (replaced PermGen in Java 8+).
G1GC: Default Java 9+. Concurrent, low pause. For <10ms pauses use ZGC/Shenandoah.

CPU Cache — L1/L2/L3, Cache Lines

Performance
Level | Size | Latency | Scope
L1 Cache | 32–64 KB | ~1 ns | Per core
L2 Cache | 256 KB–2 MB | ~5 ns | Per core
L3 Cache (LLC) | 8–64 MB | ~20 ns | Shared across cores
Main Memory (RAM) | GB–TB | ~100 ns | Shared
Cache Line & False Sharing
CPU loads/invalidates memory in 64-byte cache lines. If thread A writes x and thread B writes y, but x and y are on the same cache line, every write invalidates the other core's cache — false sharing. Fix: pad struct fields to 64 bytes, or @Contended in Java.

Cache-Friendly Code

  • Access arrays sequentially (row-major order) — spatial locality
  • Reuse recently accessed data (temporal locality)
  • Avoid pointer-chasing linked lists in hot paths (random cache misses)
  • Structure of Arrays (SoA) vs Array of Structures (AoS) — SoA more cache-friendly for column access
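A rough way to observe spatial locality from Python, assuming numpy is installed (array size is arbitrary): rows of a C-ordered array are contiguous, so row-wise sweeps stay in cache, while column-wise sweeps stride through memory and miss far more often.

# Sequential (row-wise) vs strided (column-wise) access over a C-ordered array
import time
import numpy as np

a = np.zeros((4000, 4000), dtype=np.float64)   # rows are contiguous in memory (row-major)

t0 = time.perf_counter()
for i in range(a.shape[0]):
    a[i, :].sum()                              # sequential access — spatial locality
row_secs = time.perf_counter() - t0

t0 = time.perf_counter()
for j in range(a.shape[1]):
    a[:, j].sum()                              # strided access — many more cache misses
col_secs = time.perf_counter() - t0

print(f"row-wise {row_secs:.2f}s   column-wise {col_secs:.2f}s")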
Synchronization Primitives

Mutex, Semaphore, Monitor, Spinlock

Must Know
Primitive | What | Use Case | Trade-off
Mutex | Binary lock — only the lock owner can unlock | Protect critical section; mutual exclusion | Only 1 thread at a time; blocked threads sleep
Semaphore | Counter-based: P() decrements, V() increments; blocks when 0 | Limit concurrent access (N connections, N workers) | No ownership — any thread can signal
Binary Semaphore | Semaphore initialized to 1 (similar to a mutex but no ownership) | Signaling between threads | Can be released by a non-owner
Monitor | Mutex + condition variables bundled (wait/notify) | Java synchronized blocks, Producer-Consumer | High-level; can have spurious wakeups
Spinlock | Busy-wait loop until lock acquired (no sleep) | Very short critical sections (<1μs), kernel code | Burns CPU while waiting
RW Lock | Multiple concurrent readers, exclusive writer | Read-heavy data structures (caches, configs) | Writer starvation possible
Mutex vs Semaphore — Key Difference
Mutex: ownership-based. Thread that locked must unlock. Used for mutual exclusion.
Semaphore: signaling-based. Any thread can release. Used for resource counting or thread signaling.
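The same distinction in Python, as a minimal sketch (the connection limit of 3 is arbitrary): the Lock guards a read-modify-write, while the Semaphore only counts how many threads may be inside at once.

# Mutex vs semaphore with Python's threading module
import threading
import time

lock = threading.Lock()                 # mutex: protects the shared counter
balance = 0

def deposit():
    global balance
    with lock:                          # only one thread inside the critical section
        balance += 1

connections = threading.Semaphore(3)    # semaphore: at most 3 "connections" in use at once

def query(i):
    with connections:                   # waits if 3 queries are already in flight
        time.sleep(0.1)                 # simulate holding a connection

threads = [threading.Thread(target=deposit) for _ in range(100)]
threads += [threading.Thread(target=query, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)                          # always 100 — no lost updates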

Deadlock — Conditions, Detection, Prevention

Must Know
Coffman Conditions for Deadlock (ALL 4 must hold)
1. Mutual Exclusion — resource can only be held by one thread
2. Hold and Wait — thread holds one resource and waits for another
3. No Preemption — resource can't be taken away forcefully
4. Circular Wait — Thread A waits for B, B waits for A (or cycle)
Classic Deadlock:

Thread A: lock(mutex1) → try lock(mutex2) → blocks
Thread B: lock(mutex2) → try lock(mutex1) → blocks
→ DEADLOCK — both wait forever

Prevention — Break Circular Wait:
Always acquire locks in the SAME ORDER (total ordering):

Thread A: lock(mutex1) → lock(mutex2)
Thread B: lock(mutex1) → lock(mutex2)   ← same order!
→ Thread B waits for A to release mutex1 → no deadlock
Strategy | How | Trade-off
Lock Ordering | Always acquire locks in a consistent order | Must know all locks upfront
Timeout + Retry | tryLock(timeout) — back off and retry | May livelock (retry forever)
Lock-free / CAS | Atomic compare-and-swap, no locks | Complex; ABA problem
Banker's Algorithm | OS allocates resources only if the resulting state is safe | Complex; needs advance resource declaration
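A short Python sketch of the first two strategies (lock names are illustrative): both workers take the locks in the same global order, and a timeout-based acquire gives a fallback when that order can't be guaranteed.

# Deadlock avoidance: consistent lock ordering, plus a timeout fallback
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker_1():
    with lock_a:            # always lock_a before lock_b
        with lock_b:
            pass            # critical section

def worker_2():
    with lock_a:            # same order — a circular wait can never form
        with lock_b:
            pass

def worker_with_timeout():
    if lock_a.acquire(timeout=0.5):          # tryLock-style: back off instead of waiting forever
        try:
            if lock_b.acquire(timeout=0.5):
                try:
                    pass                     # critical section
                finally:
                    lock_b.release()
        finally:
            lock_a.release()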
Livelock: Threads keep changing state in response to each other but make no progress (both step aside simultaneously).
Starvation: Thread never gets CPU/resource (priority inversion — low-priority holds lock needed by high-priority).

Producer-Consumer Problem

Classic
// Java: Producer-Consumer with BlockingQueue (preferred approach)
import java.util.concurrent.*;

public class ProducerConsumer {
    private static final BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(10);

    static class Producer implements Runnable {
        public void run() {
            for (int i = 0; i < 100; i++) {
                try {
                    queue.put(i);  // blocks if queue full
                    System.out.println("Produced: " + i);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
        }
    }

    static class Consumer implements Runnable {
        public void run() {
            while (true) {
                try {
                    int val = queue.take();  // blocks if queue empty
                    System.out.println("Consumed: " + val);
                    Thread.sleep(50);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); break; }
            }
        }
    }

    public static void main(String[] args) {
        new Thread(new Producer()).start();
        new Thread(new Consumer()).start();
        new Thread(new Consumer()).start(); // multiple consumers
    }
}

With explicit wait/notify (lower level)

synchronized(monitor) {
    while (queue.isEmpty()) monitor.wait();   // always use while not if (spurious wakeup)
    item = queue.poll();
    monitor.notifyAll();                      // wake up producers
}

Race Conditions & Memory Visibility

Must Know
Race Condition: Two threads access shared state concurrently and at least one writes. Result depends on execution order — undefined behavior.
// BAD: Race condition — counter++ is NOT atomic
// It's: read counter, increment, write counter — 3 steps!
private int counter = 0;
public void increment() { counter++; } // NOT THREAD SAFE

// FIX 1: synchronized
public synchronized void increment() { counter++; }

// FIX 2: AtomicInteger (lock-free, CAS-based)
private AtomicInteger counter = new AtomicInteger(0);
public void increment() { counter.incrementAndGet(); }

// Memory Visibility: volatile
private volatile boolean running = true;  // write from one thread visible to all
// Without volatile: thread may read stale cached value from register/L1 cache
Java Memory Model — happens-before
volatile write happens-before subsequent volatile read.
Unlock happens-before subsequent lock.
Thread.start() happens-before any action in started thread.
Without happens-before: compiler/CPU can reorder instructions!
Java Concurrency

Thread Lifecycle & Thread Pool

Java
Thread States:

NEW ──(start())──► RUNNABLE ──(CPU)──► RUNNING ──(run() returns)──► TERMINATED
                      ▲                   │
                      │ (notify)          │ (sleep/wait/I/O)
                      └── BLOCKED/WAITING ◄┘

(Java's Thread.State has no separate RUNNING value — a running thread still reports RUNNABLE.)
// Thread Pool: ALWAYS use instead of creating raw threads
ExecutorService pool = Executors.newFixedThreadPool(
    Runtime.getRuntime().availableProcessors() * 2  // rule of thumb for I/O-bound
);

// For CPU-bound tasks: nCPU cores
// For I/O-bound tasks: nCPU * (1 + wait_time / compute_time), often 2x-10x cores

// Submit tasks
Future<String> future = pool.submit(() -> {
    Thread.sleep(1000);
    return "result";
});
String result = future.get(); // blocks; use get(timeout, unit) in prod

// Custom thread pool
ThreadPoolExecutor executor = new ThreadPoolExecutor(
    4,           // corePoolSize — always kept alive
    16,          // maximumPoolSize — max threads
    60L, TimeUnit.SECONDS,  // keepAliveTime for excess threads
    new LinkedBlockingQueue<>(1000),  // work queue (bounded!)
    new ThreadPoolExecutor.CallerRunsPolicy() // rejection policy
);
// Rejection policies: AbortPolicy(default), CallerRunsPolicy, DiscardPolicy, DiscardOldestPolicy

// Always shutdown properly!
pool.shutdown();
pool.awaitTermination(5, TimeUnit.SECONDS);
Bottleneck: Unbounded queue (LinkedBlockingQueue with no limit) → OOM under load. Always set a queue capacity and define rejection policy.

Java Locks — ReentrantLock, ReadWriteLock, StampedLock

Java
// ReentrantLock — more flexible than synchronized
ReentrantLock lock = new ReentrantLock(true); // fair=true → FIFO order (prevents starvation)

lock.lock();
try {
    // critical section
} finally {
    lock.unlock(); // ALWAYS in finally!
}

// tryLock with timeout (avoids deadlock)
if (lock.tryLock(500, TimeUnit.MILLISECONDS)) {
    try { /* ... */ } finally { lock.unlock(); }
} else {
    // couldn't acquire — handle gracefully
}

// ReadWriteLock: many readers OR one writer
ReadWriteLock rwLock = new ReentrantReadWriteLock();
// Multiple readers can hold read lock simultaneously
rwLock.readLock().lock();
try { return map.get(key); } finally { rwLock.readLock().unlock(); }

// Exclusive write lock
rwLock.writeLock().lock();
try { map.put(key, val); } finally { rwLock.writeLock().unlock(); }

// StampedLock (Java 8+): optimistic read without locking
StampedLock sl = new StampedLock();
long stamp = sl.tryOptimisticRead();
int x = this.x; int y = this.y;           // read without lock
if (!sl.validate(stamp)) {                 // check if write happened
    stamp = sl.readLock();
    try { x = this.x; y = this.y; } finally { sl.unlockRead(stamp); }
}
// ~3x faster than ReadWriteLock for read-heavy workloads

Atomic Classes & Lock-Free Data Structures

Java
// Atomic types — based on CPU CAS (Compare-And-Swap) instruction
AtomicInteger counter = new AtomicInteger(0);
counter.incrementAndGet();                    // atomic i++
counter.compareAndSet(expected, newVal);      // CAS — update only if current == expected
counter.getAndUpdate(x -> x * 2);

AtomicLong, AtomicBoolean, AtomicReference<T>

// LongAdder — better than AtomicLong under high contention
// Maintains per-CPU-stripe counters, sums on read
LongAdder adder = new LongAdder();
adder.increment();  // contention spread across stripes
long total = adder.sum(); // not an atomic snapshot under concurrent updates, but much faster under contention

// ConcurrentHashMap — Java 7 used segment (lock-striping) locking; Java 8+: CAS + synchronized per bin
ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
map.putIfAbsent(key, 0);
map.compute(key, (k, v) -> v == null ? 1 : v + 1); // atomic update

// CopyOnWriteArrayList — writes create new copy; reads never block
// Good for: event listener lists (rare writes, many reads)
CopyOnWriteArrayList<Listener> listeners = new CopyOnWriteArrayList<>();
ABA Problem: CAS checks if value is still A, but it may have changed A→B→A. Use AtomicStampedReference (version counter) to solve.

CompletableFuture & Async Pipelines

JavaModern
// CompletableFuture — non-blocking async composition
CompletableFuture<User> userFuture = CompletableFuture.supplyAsync(
    () -> userService.findById(userId),  // runs on ForkJoinPool.commonPool()
    customExecutor                        // or specify your executor
);

// Chain operations (no blocking!)
CompletableFuture<Void> result = userFuture
    .thenApply(user -> user.getEmail())            // transform
    .thenCompose(email -> sendEmail(email))        // chain another async op
    .thenAccept(res -> log.info("Sent: " + res))   // side effect, returns Void
    .exceptionally(ex -> { log.error("Send failed", ex); return null; }); // error handling

// Run multiple in parallel, wait for all
CompletableFuture<User>    cf1 = CompletableFuture.supplyAsync(() -> getUser(id));
CompletableFuture<Profile> cf2 = CompletableFuture.supplyAsync(() -> getProfile(id));

CompletableFuture.allOf(cf1, cf2).thenRun(() -> {
    User user = cf1.join();
    Profile profile = cf2.join();
    // both ready here
});

// First one to complete wins
CompletableFuture.anyOf(cf1, cf2).thenAccept(result -> { /* handle */ });

// With timeout (Java 9+)
cf1.orTimeout(2, TimeUnit.SECONDS)
   .exceptionally(ex -> DEFAULT_USER);

Virtual Threads — Java 21 (Project Loom)

Java 21Modern
What are Virtual Threads?
JVM-managed lightweight threads. Millions can exist simultaneously. When virtual thread blocks on I/O, underlying OS thread (carrier) is freed for other virtual threads. No more async/callback hell.
// Create virtual threads — Java 21
Thread.ofVirtual().start(() -> {
    // blocking I/O here doesn't waste OS thread!
    String result = httpClient.get("https://api.example.com/data");
});

// Virtual thread executor — best for I/O-bound server workloads
ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
// Submit 100,000 tasks — no problem, each gets own virtual thread
for (int i = 0; i < 100_000; i++) {
    executor.submit(() -> processRequest()); // cheap!
}

// Spring Boot 3.2+: enable virtual threads
// application.properties: spring.threads.virtual.enabled=true
Aspect | Platform Threads | Virtual Threads
Count | ~1K–10K max practical | Millions possible
Memory | ~1MB stack each | ~a few KB stack each
Blocking I/O | Wastes an OS thread | OS thread freed, remounted later
CPU-bound | Same as virtual | No advantage over platform threads
Python Concurrency

The GIL — What It Is and Why It Matters

Must Know
GIL — Global Interpreter Lock
CPython has a mutex (GIL) that only allows ONE thread to execute Python bytecode at a time. This means Python threads cannot achieve true CPU parallelism for Python code. I/O operations release the GIL.
Task Type | Best Tool | Why
I/O bound (network, file) | threading or asyncio | GIL released during I/O; threads work well
CPU bound | multiprocessing | Separate processes = separate GILs = true parallelism
Mixed / many connections | asyncio | Single-threaded but non-blocking; high concurrency
C extensions | threading | C code can release the GIL explicitly
GIL Optional in Python 3.13+
Experimental "free-threaded" CPython build (PEP 703). Disabling the GIL allows true multi-threaded CPU parallelism. Still evolving — not the default build yet.
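A quick way to see the GIL's effect, as a hedged sketch (work sizes are arbitrary): the thread pool gives no speed-up on pure-Python CPU work, while the process pool does, because each worker process has its own interpreter and GIL.

# CPU-bound work: threads vs processes under the GIL
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    return sum(i * i for i in range(n))        # pure-Python CPU-bound loop

if __name__ == "__main__":                     # guard needed for process pools on spawn platforms
    jobs = [2_000_000] * 4

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(burn, jobs))               # GIL: only one thread executes bytecode at a time
    print("threads:  ", round(time.perf_counter() - t0, 2), "s")

    t0 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as ex:
        list(ex.map(burn, jobs))               # one GIL per process → true parallelism
    print("processes:", round(time.perf_counter() - t0, 2), "s")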

threading, multiprocessing, asyncio — When to Use What

Python
import threading, time, requests
from concurrent.futures import ThreadPoolExecutor

# I/O bound: fetching multiple URLs concurrently
urls = ["https://api1.com", "https://api2.com", "https://api3.com"]

def fetch(url):
    r = requests.get(url, timeout=5)  # GIL released during network I/O
    return r.status_code

# ThreadPoolExecutor — recommended over raw threading
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fetch, urls))  # parallel fetches

# Synchronization
lock = threading.Lock()
shared_counter = 0

def safe_increment():
    global shared_counter
    with lock:  # context manager — auto releases on exception too
        shared_counter += 1

# threading.Event for signaling
event = threading.Event()
def worker():
    event.wait()  # blocks until event.set() called
    print("Go!")

t = threading.Thread(target=worker)
t.start()
time.sleep(1)
event.set()  # unblock worker
from multiprocessing import Pool, Process, Queue, Manager
import os

# CPU bound: parallel number crunching
def compute(n):
    return sum(i * i for i in range(n))

# Process pool — separate GIL per process = real parallelism
with Pool(processes=os.cpu_count()) as pool:
    results = pool.map(compute, [10**6, 10**6, 10**6, 10**6])

# Sharing data between processes
manager = Manager()
shared_dict = manager.dict()   # proxy object, safe across processes

q = Queue()  # inter-process queue

def producer(q):
    for i in range(5):
        q.put(i)
    q.put(None)  # sentinel

def consumer(q):
    while True:
        item = q.get()
        if item is None: break
        print(f"Got: {item}")

if __name__ == "__main__":   # on spawn platforms (macOS/Windows) process creation must be guarded
    p1 = Process(target=producer, args=(q,))
    p2 = Process(target=consumer, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()
import asyncio, aiohttp

# asyncio: single-threaded, non-blocking, cooperative multitasking
# Best for: many concurrent I/O operations (HTTP, DB, file)

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()  # yield control while waiting

async def main():
    urls = ["https://api1.com", "https://api2.com", "https://api3.com"]
    async with aiohttp.ClientSession() as session:
        # Run all fetches concurrently — NOT parallel, but concurrent!
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)   # wait for all
        # asyncio.gather with return_exceptions=True to not fail on first error

# Run event loop
asyncio.run(main())

# asyncio Semaphore — limit concurrent operations
sem = asyncio.Semaphore(10)  # max 10 concurrent requests
async def rate_limited_fetch(session, url):
    async with sem:  # waits if 10 are already in flight
        return await fetch(session, url)

# asyncio Queue — async producer-consumer
queue = asyncio.Queue(maxsize=100)
async def producer():
    await queue.put(item)  # blocks if full

async def consumer():
    item = await queue.get()  # blocks if empty
    queue.task_done()
Common asyncio mistake: Calling blocking code inside async function blocks the entire event loop! Use loop.run_in_executor() to offload blocking calls to a thread pool.
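A minimal sketch of that fix (URL reused from the example above): the blocking requests call is pushed onto a thread pool so the event loop keeps serving other tasks; asyncio.to_thread is the Python 3.9+ shorthand for the same thing.

# Offloading a blocking call so it doesn't stall the event loop
import asyncio
import requests

def blocking_fetch(url):
    return requests.get(url, timeout=5).status_code   # blocking — must not run on the loop directly

async def main():
    loop = asyncio.get_running_loop()
    status = await loop.run_in_executor(None, blocking_fetch, "https://api1.com")  # default thread pool
    status = await asyncio.to_thread(blocking_fetch, "https://api1.com")           # Python 3.9+ shorthand
    print(status)

asyncio.run(main())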
Rate Limiting

Rate Limiting Algorithms — Deep Dive

Must Know
Algorithm | How It Works | Pros | Cons | Best For
Fixed Window Counter | Count requests per window (e.g. 100 req/min); window resets at :00, :01, ... | Simple, O(1), low memory | Boundary burst: 100 req at :59 + 100 at :01 = 200 in 2 seconds | Simple APIs, rough limits
Sliding Window Log | Keep a timestamp log; count entries within the last N seconds | Precise — truly N req per N seconds | High memory: O(N) per user, N = max requests | Low-volume, high precision
Sliding Window Counter | Current window count + (previous window count × remaining fraction) | Memory efficient, ~accurate, no boundary burst | Approximation — slightly off at window edges | General purpose, high traffic
Token Bucket | Bucket holds max N tokens; tokens added at rate R/sec; each request costs 1 token | Allows controlled bursts (up to bucket size) | Burst can exceed the per-second rate | APIs that need burst allowance (CDN, mobile)
Leaky Bucket | Requests queued; processed at fixed rate R/sec (queue leaks at a constant rate) | Smooth, constant output rate | Queue can fill → requests dropped or delayed | Egress rate limiting, network QoS

Token Bucket — Visual

Bucket capacity: 10 tokens
Refill rate: 1 token/second

t=0: tokens=10, request arrives → consume 1 → tokens=9
t=0: 5 more requests → tokens=4
t=1: refill → tokens=5, request → tokens=4

Burst scenario (fresh bucket):
t=0: 10 requests burst → consume all 10 → tokens=0
     next requests REJECTED until refill
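The same algorithm as a small in-process class (a single-node sketch; the distributed Redis Lua version appears further below): tokens refill continuously based on elapsed time, and a request is allowed only if enough tokens remain.

# In-process token bucket (single node)
import threading
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity              # max burst size
        self.refill_rate = refill_rate        # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow(self, cost=1):
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
            self.last_refill = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

bucket = TokenBucket(capacity=10, refill_rate=1)
print([bucket.allow() for _ in range(12)])    # first 10 allowed, then rejected until refill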

Sliding Window Counter

Window: 1 min, limit: 100 req

At second :45 of the current minute:
  prev_window_count = 84
  curr_window_count = 36
  weight = (60-45)/60 = 0.25
  estimate = 84 × 0.25 + 36 = 57 → allow (57 < 100)
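The same estimate in code, as an in-memory sketch (one instance per user; a per-user keyed version would wrap this): the previous window's count is weighted by the fraction of it that still overlaps the sliding window.

# Sliding window counter (approximate), matching the worked example above
import time
from collections import defaultdict

class SlidingWindowCounter:
    def __init__(self, limit, window_sec=60):
        self.limit = limit
        self.window = window_sec
        self.counts = defaultdict(int)                        # window index → request count

    def allow(self, now=None):
        now = time.time() if now is None else now
        curr = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        prev_weight = 1.0 - elapsed_fraction                  # e.g. 0.25 at second :45
        estimate = self.counts[curr - 1] * prev_weight + self.counts[curr]
        if estimate < self.limit:
            self.counts[curr] += 1
            return True
        return False

limiter = SlidingWindowCounter(limit=100)
print(limiter.allow())                                        # True until the estimate reaches 100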

Redis Rate Limiting — Production Implementations

CodeImportant
# Fixed Window Counter with Redis
import redis, time

r = redis.Redis()

def is_allowed_fixed_window(user_id: str, limit: int = 100, window_sec: int = 60) -> bool:
    # Key includes current window timestamp to auto-expire
    window_key = f"rl:fw:{user_id}:{int(time.time() // window_sec)}"
    
    # Atomic pipeline: INCR + EXPIRE
    pipe = r.pipeline()
    pipe.incr(window_key)
    pipe.expire(window_key, window_sec)
    results = pipe.execute()
    
    current_count = results[0]
    return current_count <= limit

# Usage in API handler
def api_handler(user_id: str):
    if not is_allowed_fixed_window(user_id, limit=100):
        raise RateLimitError("429 Too Many Requests")
    # ... handle request
# Sliding Window Log with Redis Sorted Set
import redis, time

r = redis.Redis()

def is_allowed_sliding_log(user_id: str, limit: int = 100, window_sec: int = 60) -> bool:
    key = f"rl:swl:{user_id}"
    now = time.time()
    window_start = now - window_sec
    
    pipe = r.pipeline()
    # Remove old entries outside window
    pipe.zremrangebyscore(key, '-inf', window_start)
    # Count remaining entries in window
    pipe.zcard(key)
    # Add current request with timestamp as score
    pipe.zadd(key, {str(now): now})
    # Set key expiry (cleanup)
    pipe.expire(key, window_sec + 1)
    results = pipe.execute()
    
    count_after_prune = results[1]
    return count_after_prune < limit  # check BEFORE adding current request

# NOTE: This is not fully atomic — use Lua script for strict correctness
# Token Bucket via Redis Lua script — ATOMIC
import redis, time

r = redis.Redis()

TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local refill_rate = tonumber(ARGV[3])  -- tokens per second
local tokens_requested = tonumber(ARGV[4])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

-- Calculate how many tokens to add since last refill
local elapsed = math.max(0, now - last_refill)
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

-- Check if we have enough tokens
if new_tokens >= tokens_requested then
    new_tokens = new_tokens - tokens_requested
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return 0  -- denied
end
"""

token_bucket_script = r.register_script(TOKEN_BUCKET_LUA)

def is_allowed_token_bucket(user_id: str, capacity=10, refill_rate=1) -> bool:
    key = f"rl:tb:{user_id}"
    result = token_bucket_script(
        keys=[key],
        args=[time.time(), capacity, refill_rate, 1]
    )
    return result == 1
// Rate Limiting Middleware in Java (Spring Boot Filter)
@Component
public class RateLimitFilter extends OncePerRequestFilter {
    private final StringRedisTemplate redis;
    private static final int LIMIT = 100;
    private static final int WINDOW_SECONDS = 60;

    public RateLimitFilter(StringRedisTemplate redis) { this.redis = redis; } // constructor injection

    @Override
    protected void doFilterInternal(HttpServletRequest req,
                                    HttpServletResponse res,
                                    FilterChain chain) throws IOException, ServletException {
        String userId = extractUserId(req); // from JWT or API key
        String key = "rl:" + userId + ":" + (System.currentTimeMillis() / 1000 / WINDOW_SECONDS);
        
        Long count = redis.opsForValue().increment(key);
        if (count == 1) {
            redis.expire(key, WINDOW_SECONDS, TimeUnit.SECONDS);
        }
        
        // Set rate limit headers (good practice)
        res.setHeader("X-RateLimit-Limit", String.valueOf(LIMIT));
        res.setHeader("X-RateLimit-Remaining", String.valueOf(Math.max(0, LIMIT - count)));
        
        if (count > LIMIT) {
            res.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
            res.setHeader("Retry-After", String.valueOf(WINDOW_SECONDS));
            res.getWriter().write("{\"error\": \"Rate limit exceeded\"}");
            return;
        }
        
        chain.doFilter(req, res);
    }
}

Distributed Rate Limiting — Design & Trade-offs

System Design
Distributed Rate Limiting:

Request → Load Balancer → API Server (any node)
                              ↓
                   Redis Cluster (shared state)
                     - key: user_id:window
                     - value: request count
                              ↓
    All nodes see the same counters → consistent rate enforcement

Without shared state: 3 servers, limit = 100
  Each server enforces 100 → a user can make 300 requests!
  OR each server enforces 33 → a user may be blocked at 33 if all requests hit one server
Approach | Consistency | Latency Added | Failure Mode
Centralized Redis | Strong | ~1ms Redis RTT | Redis down = limit enforcement fails
Redis Cluster | Strong (with slot ownership) | ~1ms | High availability
Local (per-node) counters + gossip sync | Eventual | None | Slight over-limit possible
Sticky sessions (LB routes same user to same node) | Strong per-user | None | Node down = user's counter resets
Trade-off: Strict distributed rate limiting adds ~1ms Redis latency to every request. For very high-throughput APIs, use local counters with periodic sync or token bucket approximation.
Rate Limit Headers (429 Too Many Requests is defined in RFC 6585; the X-RateLimit-* headers are a widely used convention)
X-RateLimit-Limit: 100 — max requests
X-RateLimit-Remaining: 43 — remaining in window
X-RateLimit-Reset: 1716825600 — Unix timestamp when window resets
Retry-After: 30 — seconds until next allowed request (on 429)

Rate Limiting in Real Systems — Design Interview Guide

Interview
Requirements: What kind of limit? Per-user, per-IP, per-API-key, per-endpoint? Hard or soft limit? What happens on violation: drop, queue, or degrade?
Algorithm: Token bucket (burst-friendly) for user-facing APIs. Sliding window counter for strict enforcement. Leaky bucket for egress smoothing.
Storage: Redis for distributed. In-memory for single-node. Lua scripts for atomicity (no race condition between read-check-write).
Where to enforce: API Gateway (centralized, easiest) → Service mesh (sidecar) → Individual service (fine-grained). Multiple layers = defense in depth.
Client handling: Return 429 with a Retry-After header. Clients should use exponential backoff + jitter — immediate retries cause a thundering herd (see the sketch after this list).
Monitoring: Track rate limit hits per user/endpoint. Alert on sudden spike = scraper/DDoS. Dashboard shows top offenders.
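A hedged sketch of that client behavior (URL and caps are placeholders): honor Retry-After when the server sends it, otherwise back off exponentially with full jitter so retries from many clients don't synchronize.

# Client-side handling of 429: Retry-After + exponential backoff with jitter
import random
import time
import requests

def call_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=5)
        if resp.status_code != 429:
            return resp
        retry_after = float(resp.headers.get("Retry-After", 0))                   # server hint, if present
        backoff = max(retry_after, random.uniform(0, min(60, 2 ** attempt)))      # full jitter, capped
        time.sleep(backoff)
    raise RuntimeError("still rate limited after retries")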

Multi-Tier Rate Limiting

  • L1 — IP level: 1000 req/min per IP (CDN/WAF, stops DDoS)
  • L2 — API key: 100 req/min per API key (stops abuse)
  • L3 — User: 60 req/min per user account (fair use)
  • L4 — Endpoint: /login: 5 req/min (prevent brute force)