Operating Systems, Concurrency & Rate Limiting
Core OS internals, thread synchronization, Java & Python concurrency patterns, and rate-limiting algorithms with real implementations.
OS Fundamentals
Process vs Thread
Must Know
| | Process | Thread |
|---|---|---|
| Definition | Independent program in execution | Lightweight unit of execution within a process |
| Memory | Own address space (heap, stack, code, data) | Shared heap with siblings; own stack |
| Communication | IPC (pipes, sockets, shared memory, signals) | Shared memory directly (needs synchronization) |
| Context switch cost | High (TLB flush, page table switch) | Low (same address space) |
| Failure isolation | Process crash doesn't affect others | Thread crash can kill entire process |
| Creation cost | Heavy (fork) | Light (clone) |
| Use when | Strong isolation needed (microservices, browser tabs) | Shared memory tasks (web server handling, GUI) |
Process Memory Layout:
┌─────────────────────┐ High Address
│ Stack (grows ↓) │ → local vars, function frames
├─────────────────────┤
│ ... │
├─────────────────────┤
│ Heap (grows ↑) │ → malloc/new, dynamic allocation
├─────────────────────┤
│ BSS (uninit data) │ → global/static uninit vars
├─────────────────────┤
│ Data (init data) │ → global/static init vars
├─────────────────────┤
│ Text (code) │ → compiled instructions (read-only)
└─────────────────────┘ Low Address
Thread within Process:
Thread 1: own Stack + PC + registers
Thread 2: own Stack + PC + registers
Both share: Heap, Code, Data, File descriptors
Bottleneck: Too many threads → context switch overhead dominates. Use thread pools and async I/O instead of 1-thread-per-request for high concurrency.
Process/Thread States & CPU Scheduling
Core OS
Process States:
NEW ──(admitted)──► READY ──(CPU assigned)──► RUNNING
▲ │
│ (I/O or wait)
│ ▼
(I/O done) WAITING
│ │
└──────────────────────────┘
RUNNING ──(exit)──► TERMINATED
| Scheduling Algorithm | How | Pros | Cons |
|---|---|---|---|
| FCFS (First Come First Served) | Queue order | Simple, no starvation | Convoy effect — long job blocks all |
| SJF (Shortest Job First) | Shortest burst time first | Optimal average wait time | Starvation of long jobs; hard to predict burst |
| Round Robin | Time quantum (e.g. 10ms) per process, rotate | Fair, low response time | High context switch if quantum too small |
| Priority Scheduling | Highest priority first | Important tasks run first | Starvation of low priority |
| Multilevel Queue | Separate queues per priority class | Flexible, OS uses this | Complex |
| CFS (Linux) | Fair share of CPU time based on weight | Fair for all processes, used in Linux kernel | Complex implementation |
Linux CFS — Completely Fair Scheduler
Uses a red-black tree sorted by virtual runtime. The process with the smallest vruntime runs next. Interactive processes get boosted (smaller vruntime increments).
Context Switching — What Happens?
OS Internals
1. CPU saves the current process/thread state (PC, registers, stack pointer, flags) into its PCB (Process Control Block)
2. Scheduler picks the next process/thread to run
3. Saved state from the new process's PCB is loaded into the CPU registers
4. For a process switch: the virtual memory map is also switched (TLB flush — expensive!)
Cost: ~1–10μs per context switch. At 1000 threads each switching 1000 times/sec = 1M switches/sec; at 1–10μs each that is 1–10 CPU-seconds of pure overhead every second — whole cores burned on switching. This is why goroutines (Go), green threads, and async I/O exist.
User Space vs Kernel Threads
Kernel threads (1:1): OS manages, real parallelism, expensive context switch — Java threads
Green threads (M:1): userspace scheduler, no true parallelism, fast switch — Python (old)
Hybrid (M:N): Many user threads → N kernel threads — Go goroutines, Kotlin coroutines
System Calls, Kernel Mode vs User Mode
OS Internals
User Mode
- Restricted — cannot access hardware directly
- Cannot access kernel memory
- Application code runs here
- Invalid instruction → trap (SIGSEGV)
Kernel Mode
- Full hardware access
- OS kernel runs here
- Manages devices, memory, processes
- Entered via system call (trap/interrupt)
System Call Flow:
App calls read() → glibc wrapper → INT 0x80 / SYSCALL instruction
→ CPU switches to kernel mode
→ Kernel validates args, performs I/O
→ Copies data to user buffer
→ Returns to user mode
→ App receives return value
Common System Calls:
fork() — create new process
exec() — replace process image
open() — open file descriptor
read() — read from fd
write() — write to fd
mmap() — map memory
clone() — create thread (Linux)
wait() — wait for child process
Bottleneck: Frequent system calls = frequent mode switches = overhead. Use buffered I/O (stdio), large read/write batches, sendfile() for zero-copy to minimize syscall cost.
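To make the syscall cost concrete, here is a minimal Python sketch (filenames and sizes are illustrative) comparing one write() syscall per line against buffered writes that batch many lines per syscall:
# Sketch: one write() syscall per line vs. buffered, batched writes (illustrative)
import os, time

lines = [b"log line %d\n" % i for i in range(100_000)]

# Unbuffered: one write() syscall — one user/kernel mode switch — per line
start = time.perf_counter()
fd = os.open("unbuffered.log", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
for line in lines:
    os.write(fd, line)          # syscall on every iteration
os.close(fd)
print("per-line syscalls:", time.perf_counter() - start)

# Buffered: userspace buffer flushed in large chunks → far fewer syscalls
start = time.perf_counter()
with open("buffered.log", "wb", buffering=1 << 20) as f:   # 1 MB buffer
    for line in lines:
        f.write(line)           # usually just a memcpy into the buffer
print("buffered writes:  ", time.perf_counter() - start)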
I/O Models — Blocking, Non-blocking, Async
Critical for Interviews
| Model | How it works | Thread blocks? | Examples |
|---|---|---|---|
| Blocking I/O | Thread calls read(), waits until data arrives | Yes — entire thread blocked | Traditional Java IO, Python file reads |
| Non-blocking I/O | read() returns immediately with EAGAIN if no data; app polls | No — but burns CPU polling | O_NONBLOCK sockets |
| I/O Multiplexing | select()/poll()/epoll() monitors multiple fds; blocks until any ready | Yes — on epoll, not on I/O itself | Nginx, Redis, Node.js event loop |
| Signal-driven I/O | Kernel sends SIGIO when fd is ready | No | Rare in practice |
| Async I/O (AIO) | Request I/O, callback/future invoked when done; completely non-blocking | No | Python asyncio, Java NIO2, io_uring |
epoll vs select
select: O(N) scan of all file descriptors, max 1024 fds
epoll: O(1) event-driven, scales to millions of connections — used by Nginx, Redis, Node.js
io_uring (Linux 5.1+): async I/O via shared ring buffers, near zero-copy, minimal syscall overhead (submissions are batched through the rings)
Nginx Event Loop (I/O Multiplexing with epoll):
1 master process + N worker processes (= N CPU cores)
Each worker: single-threaded + epoll event loop
Worker:
while(true):
events = epoll_wait(fd_list) // block until any connection has data
for event in events:
if event == new_connection: accept()
if event == data_ready: read() → process → write()
if event == write_ready: flush buffer
→ Handles 10K+ concurrent connections with 1 thread!
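The same epoll-style loop can be sketched with Python's standard selectors module (which picks epoll on Linux). This is a minimal single-threaded echo server, purely illustrative and not Nginx's actual code:
# Minimal single-threaded echo server via I/O multiplexing (selectors → epoll on Linux)
import selectors, socket

sel = selectors.DefaultSelector()

server = socket.socket()
server.bind(("127.0.0.1", 8080))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)      # new-connection events

while True:
    for key, _ in sel.select():                 # block until any fd is ready (epoll_wait)
        if key.fileobj is server:               # event == new_connection
            conn, _addr = server.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:                                   # event == data_ready
            conn = key.fileobj
            data = conn.recv(4096)
            if data:
                conn.sendall(data)              # echo back (simplified: may block if send buffer is full)
            else:                               # client closed the connection
                sel.unregister(conn)
                conn.close()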
IPC — Inter-Process Communication
OS
| IPC Mechanism | Speed | Persistence | Use Case |
|---|---|---|---|
| Pipes (unnamed) | Fast | Ephemeral | Parent→child, shell pipe (cmd1 \| cmd2) |
| Named Pipes (FIFO) | Fast | Ephemeral | Unrelated processes on same machine |
| Message Queues | Medium | Kernel-persisted | Structured messages, priority |
| Shared Memory | Fastest | Ephemeral | High-throughput data sharing, needs sync |
| Sockets (Unix domain) | Very fast | Ephemeral | Same-machine cross-process (Redis, nginx) |
| Sockets (TCP/UDP) | Network latency | Ephemeral | Cross-machine (microservices) |
| Signals | Instant | N/A | SIGTERM, SIGKILL, SIGUSR1 notifications |
| Memory-mapped files | Very fast | File-backed | Large data sharing, databases (LMDB) |
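A minimal sketch of the first row above — an unnamed pipe between a parent and its forked child (POSIX-only; the message is illustrative):
# Unnamed pipe: parent writes, forked child reads (POSIX only)
import os

read_fd, write_fd = os.pipe()
pid = os.fork()

if pid == 0:                         # child process
    os.close(write_fd)               # close the unused write end
    data = os.read(read_fd, 1024)
    print("child got:", data.decode())
    os.close(read_fd)
    os._exit(0)
else:                                # parent process
    os.close(read_fd)                # close the unused read end
    os.write(write_fd, b"hello from parent")
    os.close(write_fd)
    os.waitpid(pid, 0)               # wait() — reap the child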
Memory Management
Virtual Memory, Paging & TLB
Memory
Virtual Memory
- Each process has its own virtual address space (e.g. 0 to 2⁴⁸ on x86-64)
- OS + MMU translate virtual → physical addresses via page tables
- Enables: isolation between processes, more memory than physical RAM (swap), copy-on-write
Virtual → Physical Translation:
Process 1 Virtual Address 0x7fff1000
→ Page Table (per process)
→ Page Frame Number (physical RAM address 0x3A00000)
→ TLB caches recent translations (L1: ~1ns vs page table walk: ~100ns)
Page Fault:
Access address not in RAM → OS Page Fault handler
→ Load page from disk (swap) → 10ms+ cost!
→ Resume process
Copy-on-Write (fork):
fork() → child shares parent pages (read-only)
On write → OS copies just that page → each has own copy
→ Fast fork() used by shell, Redis BGSAVE
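A hedged sketch of that pattern (POSIX-only; the data and snapshot filename are illustrative): the forked child keeps a copy-on-write view of memory as it was at fork time while the parent keeps mutating — the same trick Redis BGSAVE relies on:
# Copy-on-write snapshot via fork(): the child sees the data as it was at fork time
import os, json

data = {"counter": 0}

pid = os.fork()
if pid == 0:
    # Child: its view of `data` is the COW snapshot taken at fork()
    with open("snapshot.json", "w") as f:
        json.dump(data, f)
    os._exit(0)
else:
    # Parent: keeps mutating; each write touches a page and triggers a per-page copy
    for _ in range(1_000_000):
        data["counter"] += 1
    os.waitpid(pid, 0)
    print("parent counter:", data["counter"], "— snapshot still holds 0")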
| Concept | Detail |
|---|---|
| Page size | Usually 4KB. Huge pages: 2MB/1GB (reduce TLB misses for large datasets) |
| TLB miss | Must walk page table (~100ns). Hot pages in TLB (~1ns) |
| Thrashing | Too many processes, RAM full → constant page faults → system grinding halt |
| Swap | Pages evicted to disk. Swap access 1000x slower than RAM — avoid swap in production! |
| OOM Killer | Linux kills processes when RAM exhausted; use cgroups + limits in containers |
Heap vs Stack — Deep Dive
Memory
| | Stack | Heap |
|---|---|---|
| Allocation | Automatic (compiler) on function call | Manual (malloc/new) or GC-managed |
| Speed | Very fast (just move stack pointer) | Slower (allocator must find free block) |
| Size | Fixed per thread (default 512KB–8MB) | Limited only by RAM+swap |
| Lifetime | Function scope — freed on return | Until free()/GC collects |
| Thread-safe? | Each thread has own stack | Shared heap needs synchronization |
| Overflow | StackOverflowError (deep recursion) | OutOfMemoryError (too many objects) |
Memory Leak: Allocated heap memory never freed. In Java: objects held in static collections. In C: missing free(). Use heap profilers (jmap, async-profiler, valgrind).
Java Memory Model — Quick Reference
Young Gen (Eden + S0 + S1): new objects allocated here. Minor GC is frequent and fast.
Old Gen (Tenured): long-lived objects promoted here. Major GC (stop-the-world) is expensive.
Metaspace: Class metadata (replaced PermGen in Java 8+).
G1GC: Default Java 9+. Concurrent, low pause. For <10ms pauses use ZGC/Shenandoah.
CPU Cache — L1/L2/L3, Cache Lines
Performance
| Level | Size | Latency | Scope |
|---|---|---|---|
| L1 Cache | 32-64 KB | ~1 ns | Per core |
| L2 Cache | 256 KB–2 MB | ~5 ns | Per core |
| L3 Cache (LLC) | 8–64 MB | ~20 ns | Shared across cores |
| Main Memory (RAM) | GB–TB | ~100 ns | Shared |
Cache Line & False Sharing
CPU loads/invalidates memory in 64-byte cache lines. If thread A writes x and thread B writes y, but x and y are on the same cache line, every write invalidates the other core's cache — false sharing. Fix: pad struct fields to 64 bytes, or @Contended in Java.
Cache-Friendly Code
- Access arrays sequentially (row-major order) — spatial locality
- Reuse recently accessed data (temporal locality)
- Avoid pointer-chasing linked lists in hot paths (random cache misses)
- Structure of Arrays (SoA) vs Array of Structures (AoS) — SoA more cache-friendly for column access
Synchronization Primitives
Mutex, Semaphore, Monitor, Spinlock
Must Know
| Primitive | What | Use Case | Trade-off |
|---|---|---|---|
| Mutex | Binary lock — only lock owner can unlock | Protect critical section; mutual exclusion | Only 1 thread at a time; sleeping threads |
| Semaphore | Counter-based: P() decrements, V() increments; blocks when 0 | Limit concurrent access (N connections, N workers) | No ownership — any thread can signal |
| Binary Semaphore | Semaphore initialized to 1 (similar to mutex but no ownership) | Signaling between threads | Can be released by non-owner |
| Monitor | Mutex + condition variables bundled (wait/notify) | Java synchronized blocks, Producer-Consumer | High-level, can have spurious wakeups |
| Spinlock | Busy-wait loop until lock acquired (no sleep) | Very short critical sections (<1μs), kernel code | Burns CPU while waiting |
| RW Lock | Multiple concurrent readers, exclusive writer | Read-heavy data structures (caches, configs) | Writer starvation possible |
Mutex vs Semaphore — Key Difference
Mutex: ownership-based — the thread that locked must unlock. Used for mutual exclusion.
Semaphore: signaling-based — any thread can release. Used for resource counting or thread signaling.
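A short Python sketch of that difference (worker count and sleep time are illustrative): a Lock guards the critical section, while a Semaphore merely caps how many threads use a resource at once:
# Mutex (Lock) for mutual exclusion vs. Semaphore for limiting concurrency
import threading, time

lock = threading.Lock()              # mutex: one thread at a time in the critical section
pool_slots = threading.Semaphore(3)  # semaphore: at most 3 "connections" in use
counter = 0

def use_connection():
    global counter
    with pool_slots:                 # blocks if 3 workers already hold a slot
        time.sleep(0.1)              # simulate work with the shared resource
        with lock:                   # protect the shared counter
            counter += 1

threads = [threading.Thread(target=use_connection) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
print("completed:", counter)         # 10 — all ran, never more than 3 at once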
Deadlock — Conditions, Detection, Prevention
Must Know
Coffman Conditions for Deadlock (ALL 4 must hold)
1. Mutual Exclusion — resource can only be held by one thread
2. Hold and Wait — thread holds one resource and waits for another
3. No Preemption — resource can't be taken away forcefully
4. Circular Wait — Thread A waits for B, B waits for A (or cycle)
Classic Deadlock:
Thread A: lock(mutex1) → try lock(mutex2) → blocks
Thread B: lock(mutex2) → try lock(mutex1) → blocks
→ DEADLOCK — both wait forever
Prevention — Break Circular Wait:
Always acquire locks in SAME ORDER (total ordering):
Thread A: lock(mutex1) → lock(mutex2)
Thread B: lock(mutex1) → lock(mutex2) ← same order!
→ Thread B waits for A to release mutex1 → no deadlock
| Strategy | How | Trade-off |
|---|---|---|
| Lock Ordering | Always acquire locks in consistent order | Must know all locks upfront |
| Timeout + Retry | tryLock(timeout) — back off and retry | May livelock (retry forever) |
| Lock-free / CAS | Atomic compare-and-swap, no locks | Complex, ABA problem |
| Banker's Algorithm | OS allocates resources only if safe state | Complex, needs advance resource declaration |
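A minimal sketch of the lock-ordering strategy in Python (the Account class and transfer scenario are illustrative): both directions of transfer acquire the two locks in one global order, so a cycle can never form:
# Deadlock prevention by total lock ordering: always acquire locks in the same order
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Order the two locks consistently (here: by object id), regardless of direction
    first, second = sorted((src, dst), key=id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account(100), Account(100)
t1 = threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(10_000)])
t2 = threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(10_000)])
t1.start(); t2.start(); t1.join(); t2.join()
print(a.balance, b.balance)   # 100 100 — and no deadlock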
Livelock: Threads keep changing state in response to each other but make no progress (both step aside simultaneously).
Starvation: Thread never gets CPU/resource (priority inversion — low-priority holds lock needed by high-priority).
Producer-Consumer Problem
Classic
// Java: Producer-Consumer with BlockingQueue (preferred approach)
import java.util.concurrent.*;
public class ProducerConsumer {
private static final BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(10);
static class Producer implements Runnable {
public void run() {
for (int i = 0; i < 100; i++) {
try {
queue.put(i); // blocks if queue full
System.out.println("Produced: " + i);
} catch (InterruptedException e) { Thread.currentThread().interrupt(); }
}
}
}
static class Consumer implements Runnable {
public void run() {
while (true) {
try {
int val = queue.take(); // blocks if queue empty
System.out.println("Consumed: " + val);
Thread.sleep(50);
} catch (InterruptedException e) { Thread.currentThread().interrupt(); break; }
}
}
}
public static void main(String[] args) {
new Thread(new Producer()).start();
new Thread(new Consumer()).start();
new Thread(new Consumer()).start(); // multiple consumers
}
}
With explicit wait/notify (lower level)
synchronized(monitor) {
while (queue.isEmpty()) monitor.wait(); // always use while not if (spurious wakeup)
item = queue.poll();
monitor.notifyAll(); // wake up producers
}
Race Conditions & Memory Visibility
Must Know
Race Condition: Two threads access shared state concurrently and at least one writes. Result depends on execution order — undefined behavior.
// BAD: Race condition — counter++ is NOT atomic
// It's: read counter, increment, write counter — 3 steps!
private int counter = 0;
public void increment() { counter++; } // NOT THREAD SAFE
// FIX 1: synchronized
public synchronized void increment() { counter++; }
// FIX 2: AtomicInteger (lock-free, CAS-based)
private AtomicInteger counter = new AtomicInteger(0);
public void increment() { counter.incrementAndGet(); }
// Memory Visibility: volatile
private volatile boolean running = true; // write from one thread visible to all
// Without volatile: thread may read stale cached value from register/L1 cache
Java Memory Model — happens-before
volatile write happens-before subsequent volatile read.
Unlock happens-before subsequent lock.
Thread.start() happens-before any action in started thread.
Without happens-before: compiler/CPU can reorder instructions!
Java Concurrency
Thread Lifecycle & Thread Pool
Java
Thread States:
NEW ──(start())──► RUNNABLE ──(CPU)──► RUNNING
▲ │
│ (sleep/wait/I/O)
│ ▼
(notify) BLOCKED/WAITING
│
RUNNABLE ◄────────────────┘
RUNNING ──(run() returns)──► TERMINATED
// Thread Pool: ALWAYS use instead of creating raw threads
ExecutorService pool = Executors.newFixedThreadPool(
Runtime.getRuntime().availableProcessors() * 2 // rule of thumb for I/O-bound
);
// For CPU-bound tasks: nCPU cores
// For I/O-bound tasks: nCPU * (1 + wait_time / compute_time), often 2x-10x cores
// Submit tasks
Future<String> future = pool.submit(() -> {
Thread.sleep(1000);
return "result";
});
String result = future.get(); // blocks; use get(timeout, unit) in prod
// Custom thread pool
ThreadPoolExecutor executor = new ThreadPoolExecutor(
4, // corePoolSize — always kept alive
16, // maximumPoolSize — max threads
60L, TimeUnit.SECONDS, // keepAliveTime for excess threads
new LinkedBlockingQueue<>(1000), // work queue (bounded!)
new ThreadPoolExecutor.CallerRunsPolicy() // rejection policy
);
// Rejection policies: AbortPolicy(default), CallerRunsPolicy, DiscardPolicy, DiscardOldestPolicy
// Always shutdown properly!
pool.shutdown();
pool.awaitTermination(5, TimeUnit.SECONDS);
Bottleneck: Unbounded queue (LinkedBlockingQueue with no limit) → OOM under load. Always set a queue capacity and define rejection policy.
Java Locks — ReentrantLock, ReadWriteLock, StampedLock
Java
// ReentrantLock — more flexible than synchronized
ReentrantLock lock = new ReentrantLock(true); // fair=true → FIFO order (prevents starvation)
lock.lock();
try {
// critical section
} finally {
lock.unlock(); // ALWAYS in finally!
}
// tryLock with timeout (avoids deadlock)
if (lock.tryLock(500, TimeUnit.MILLISECONDS)) {
try { /* ... */ } finally { lock.unlock(); }
} else {
// couldn't acquire — handle gracefully
}
// ReadWriteLock: many readers OR one writer
ReadWriteLock rwLock = new ReentrantReadWriteLock();
// Multiple readers can hold read lock simultaneously
rwLock.readLock().lock();
try { return map.get(key); } finally { rwLock.readLock().unlock(); }
// Exclusive write lock
rwLock.writeLock().lock();
try { map.put(key, val); } finally { rwLock.writeLock().unlock(); }
// StampedLock (Java 8+): optimistic read without locking
StampedLock sl = new StampedLock();
long stamp = sl.tryOptimisticRead();
int x = this.x; int y = this.y; // read without lock
if (!sl.validate(stamp)) { // check if write happened
stamp = sl.readLock();
try { x = this.x; y = this.y; } finally { sl.unlockRead(stamp); }
}
// ~3x faster than ReadWriteLock for read-heavy workloads
Atomic Classes & Lock-Free Data Structures
Java
// Atomic types — based on CPU CAS (Compare-And-Swap) instruction
AtomicInteger counter = new AtomicInteger(0);
counter.incrementAndGet(); // atomic i++
counter.compareAndSet(expected, newVal); // CAS — update only if current == expected
counter.getAndUpdate(x -> x * 2);
AtomicLong, AtomicBoolean, AtomicReference<T>
// LongAdder — better than AtomicLong under high contention
// Maintains per-CPU-stripe counters, sums on read
LongAdder adder = new LongAdder();
adder.increment(); // contention spread across stripes
long total = adder.sum(); // sum() is not an atomic snapshot under concurrent updates, but much faster
// ConcurrentHashMap — segment-level locking (Java 8: CAS + synchronized per bucket)
ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
map.putIfAbsent(key, 0);
map.compute(key, (k, v) -> v == null ? 1 : v + 1); // atomic update
// CopyOnWriteArrayList — writes create new copy; reads never block
// Good for: event listener lists (rare writes, many reads)
CopyOnWriteArrayList<Listener> listeners = new CopyOnWriteArrayList<>();
ABA Problem: CAS checks if value is still A, but it may have changed A→B→A. Use AtomicStampedReference (version counter) to solve.
CompletableFuture & Async Pipelines
JavaModern
// CompletableFuture — non-blocking async composition
CompletableFuture<User> userFuture = CompletableFuture.supplyAsync(
() -> userService.findById(userId), // without an executor arg this runs on ForkJoinPool.commonPool()
customExecutor // pass your own executor to control which pool runs it
);
// Chain operations (no blocking!)
CompletableFuture<Void> result = userFuture
.thenApply(user -> user.getEmail()) // transform
.thenCompose(email -> sendEmail(email)) // chain another async op
.thenAccept(res -> log.info("Sent: " + res)) // side effect, returns CompletableFuture<Void>
.exceptionally(ex -> { log.error("send failed", ex); return null; }); // error handling
// Run multiple in parallel, wait for all
CompletableFuture<User> cf1 = CompletableFuture.supplyAsync(() -> getUser(id));
CompletableFuture<Profile> cf2 = CompletableFuture.supplyAsync(() -> getProfile(id));
CompletableFuture.allOf(cf1, cf2).thenRun(() -> {
User user = cf1.join();
Profile profile = cf2.join();
// both ready here
});
// First one to complete wins
CompletableFuture.anyOf(cf1, cf2).thenAccept(result -> { /* handle */ });
// With timeout (Java 9+)
cf1.orTimeout(2, TimeUnit.SECONDS)
.exceptionally(ex -> DEFAULT_USER);
Virtual Threads — Java 21 (Project Loom)
Java 21Modern
What are Virtual Threads?
JVM-managed lightweight threads — millions can exist simultaneously. When a virtual thread blocks on I/O, the underlying OS thread (carrier) is freed to run other virtual threads. No more async/callback hell.
// Create virtual threads — Java 21
Thread.ofVirtual().start(() -> {
// blocking I/O here doesn't waste OS thread!
String result = httpClient.get("https://api.example.com/data");
});
// Virtual thread executor — best for I/O-bound server workloads
ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
// Submit 100,000 tasks — no problem, each gets own virtual thread
for (int i = 0; i < 100_000; i++) {
executor.submit(() -> processRequest()); // cheap!
}
// Spring Boot 3.2+: enable virtual threads
// application.properties: spring.threads.virtual.enabled=true
| | Platform Threads | Virtual Threads |
|---|---|---|
| Count | ~1K–10K max practical | Millions possible |
| Memory | ~1MB stack each | ~few KB stack each |
| Blocking I/O | Wastes OS thread | OS thread freed, remounted |
| CPU-bound | Same as virtual | No advantage over platform threads |
Python Concurrency
The GIL — What It Is and Why It Matters
Must Know
GIL — Global Interpreter Lock
CPython has a mutex (the GIL) that allows only ONE thread to execute Python bytecode at a time. This means Python threads cannot achieve true CPU parallelism for Python code. I/O operations release the GIL (see the timing sketch after the table).
| Task Type | Best Tool | Why |
|---|---|---|
| I/O bound (network, file) | threading or asyncio | GIL released during I/O; threads work well |
| CPU bound | multiprocessing | Separate processes = separate GILs = true parallelism |
| Mixed / many connections | asyncio | Single-threaded but non-blocking, high concurrency |
| C extensions | threading | C code can release GIL explicitly |
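A hedged timing sketch of the table above (exact numbers depend on the machine): the same CPU-bound function run on threads, which the GIL serializes, versus processes, which run truly in parallel:
# CPU-bound work: threads are serialized by the GIL, processes are not
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":          # guard required for multiprocessing on macOS/Windows
    work = [2_000_000] * 4

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(burn, work))
    print("threads:  ", round(time.perf_counter() - start, 2), "s  (GIL-bound)")

    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as ex:
        list(ex.map(burn, work))
    print("processes:", round(time.perf_counter() - start, 2), "s  (parallel)")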
GIL Removed in Python 3.13+
Experimental "free-threaded" CPython (PEP 703). Disabling GIL allows true multi-threaded CPU parallelism. Still evolving — not default yet.threading, multiprocessing, asyncio — When to Use What
Python
import threading, time, requests
from concurrent.futures import ThreadPoolExecutor
# I/O bound: fetching multiple URLs concurrently
urls = ["https://api1.com", "https://api2.com", "https://api3.com"]
def fetch(url):
r = requests.get(url, timeout=5) # GIL released during network I/O
return r.status_code
# ThreadPoolExecutor — recommended over raw threading
with ThreadPoolExecutor(max_workers=10) as executor:
results = list(executor.map(fetch, urls)) # parallel fetches
# Synchronization
lock = threading.Lock()
shared_counter = 0
def safe_increment():
global shared_counter
with lock: # context manager — auto releases on exception too
shared_counter += 1
# threading.Event for signaling
event = threading.Event()
def worker():
event.wait() # blocks until event.set() called
print("Go!")
t = threading.Thread(target=worker)
t.start()
time.sleep(1)
event.set() # unblock worker
from multiprocessing import Pool, Process, Queue, Manager
import os
# CPU bound: parallel number crunching
def compute(n):
return sum(i * i for i in range(n))
# Process pool — separate GIL per process = real parallelism
with Pool(processes=os.cpu_count()) as pool:
results = pool.map(compute, [10**6, 10**6, 10**6, 10**6])
# Sharing data between processes
manager = Manager()
shared_dict = manager.dict() # proxy object, safe across processes
q = Queue() # inter-process queue
def producer(q):
for i in range(5):
q.put(i)
q.put(None) # sentinel
def consumer(q):
while True:
item = q.get()
if item is None: break
print(f"Got: {item}")
p1 = Process(target=producer, args=(q,))
p2 = Process(target=consumer, args=(q,))
p1.start(); p2.start()
p1.join(); p2.join()
import asyncio, aiohttp
# asyncio: single-threaded, non-blocking, cooperative multitasking
# Best for: many concurrent I/O operations (HTTP, DB, file)
async def fetch(session, url):
async with session.get(url) as response:
return await response.text() # yield control while waiting
async def main():
urls = ["https://api1.com", "https://api2.com", "https://api3.com"]
async with aiohttp.ClientSession() as session:
# Run all fetches concurrently — NOT parallel, but concurrent!
tasks = [fetch(session, url) for url in urls]
results = await asyncio.gather(*tasks) # wait for all
# asyncio.gather with return_exceptions=True to not fail on first error
# Run event loop
asyncio.run(main())
# asyncio Semaphore — limit concurrent operations
sem = asyncio.Semaphore(10) # max 10 concurrent requests
async def rate_limited_fetch(session, url):
    async with sem: # blocks if 10 already in flight
        return await fetch(session, url)
# asyncio Queue — async producer-consumer
queue = asyncio.Queue(maxsize=100)
async def producer():
await queue.put(item) # blocks if full
async def consumer():
item = await queue.get() # blocks if empty
queue.task_done()
Common asyncio mistake: Calling blocking code inside async function blocks the entire event loop! Use loop.run_in_executor() to offload blocking calls to a thread pool.
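A minimal sketch of the fix (asyncio.to_thread, available since Python 3.9, is shorthand for run_in_executor; the blocking function here just sleeps as a stand-in for a real blocking call):
# Offloading a blocking call so it doesn't stall the event loop
import asyncio, time

def blocking_io():
    time.sleep(1)                 # stands in for requests.get(), a sync DB driver, etc.
    return "done"

async def main():
    # BAD: calling blocking_io() directly here would freeze every other task for 1s
    # GOOD: run it on a worker thread and await the result
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_io),          # Python 3.9+; wraps run_in_executor
        asyncio.sleep(1),                        # other tasks keep running meanwhile
    )
    print(result)

asyncio.run(main())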
Rate Limiting
Rate Limiting Algorithms — Deep Dive
Must Know
| Algorithm | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed Window Counter | Count requests per window (e.g. 100 req/min). Window resets at :00, :01... | Simple, O(1), low memory | Boundary burst: 100 req at :59 + 100 at :01 = 200 in 2 seconds | Simple APIs, rough limits |
| Sliding Window Log | Keep timestamp log; count entries within last N seconds | Precise — truly N req per N seconds | High memory: O(N) per user, N = max requests | Low-volume, high precision |
| Sliding Window Counter | Current window count + (prev window count × remaining fraction) | Memory efficient, ~accurate, no boundary burst | Approximation — slightly off at window edge | General purpose, high traffic |
| Token Bucket | Bucket holds max N tokens. Tokens added at rate R/sec. Request costs 1 token. | Allows controlled bursts (up to bucket size) | Burst can exceed per-second rate | APIs that need burst allowance (CDN, mobile) |
| Leaky Bucket | Requests queued; processed at fixed rate R/sec (queue leaks at constant rate) | Smooth, constant output rate | Queue can fill → requests dropped or delayed | Egress rate limiting, network QoS |
Token Bucket — Visual
Bucket capacity: 10 tokens
Refill rate: 1 token/second
t=0: tokens=10, request arrives → consume 1 → tokens=9
t=0: 5 more requests → tokens=4
t=1: refill → tokens=5, request → tokens=4
Burst case: 10 requests arrive at once → consume all 10 → tokens=0
next: requests REJECTED until refill
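A single-process token bucket sketch matching the trace above (capacity and refill rate are illustrative); the Redis/Lua version further down is its distributed counterpart:
# In-memory token bucket: capacity 10, refill 1 token/second
import time

class TokenBucket:
    def __init__(self, capacity=10, refill_rate=1.0):
        self.capacity = capacity
        self.refill_rate = refill_rate          # tokens per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket()
print([bucket.allow() for _ in range(12)])   # first 10 True (burst), then False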
Sliding Window Counter
Window: 1 min, limit: 100 req
At time :45 of current minute:
prev_window_count = 84
curr_window_count = 36
weight = (60-45)/60 = 0.25
estimate = 84×0.25 + 36 = 57
→ allow (57 < 100)
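The same estimate as a small single-node sketch (per-user dictionaries with no eviction of old windows — illustrative only; the Redis implementations below are the production route):
# Sliding window counter: prev_count * overlap_fraction + curr_count
import time
from collections import defaultdict

WINDOW = 60          # seconds
LIMIT = 100

counts = defaultdict(lambda: defaultdict(int))   # per-user: {window_index: count}

def allow(user_id):
    now = time.time()
    curr_win = int(now // WINDOW)
    prev_win = curr_win - 1
    # Fraction of the previous window still inside the sliding 60s window
    weight = 1.0 - (now % WINDOW) / WINDOW
    estimate = counts[user_id][prev_win] * weight + counts[user_id][curr_win]
    if estimate >= LIMIT:
        return False
    counts[user_id][curr_win] += 1
    return True

print(all(allow("alice") for _ in range(100)))   # True — first 100 allowed
print(allow("alice"))                            # False — estimate hit the limit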
Redis Rate Limiting — Production Implementations
CodeImportant
# Fixed Window Counter with Redis
import redis, time
r = redis.Redis()
def is_allowed_fixed_window(user_id: str, limit: int = 100, window_sec: int = 60) -> bool:
# Key includes current window timestamp to auto-expire
window_key = f"rl:fw:{user_id}:{int(time.time() // window_sec)}"
# Atomic pipeline: INCR + EXPIRE
pipe = r.pipeline()
pipe.incr(window_key)
pipe.expire(window_key, window_sec)
results = pipe.execute()
current_count = results[0]
return current_count <= limit
# Usage in API handler
def api_handler(user_id: str):
if not is_allowed_fixed_window(user_id, limit=100):
raise RateLimitError("429 Too Many Requests")
# ... handle request
# Sliding Window Log with Redis Sorted Set
import redis, time
r = redis.Redis()
def is_allowed_sliding_log(user_id: str, limit: int = 100, window_sec: int = 60) -> bool:
key = f"rl:swl:{user_id}"
now = time.time()
window_start = now - window_sec
pipe = r.pipeline()
# Remove old entries outside window
pipe.zremrangebyscore(key, '-inf', window_start)
# Count remaining entries in window
pipe.zcard(key)
# Add current request with timestamp as score
pipe.zadd(key, {str(now): now})
# Set key expiry (cleanup)
pipe.expire(key, window_sec + 1)
results = pipe.execute()
count_after_prune = results[1]
return count_after_prune < limit # count excludes the request just added (note: it is logged even if denied)
# NOTE: This is not fully atomic — use Lua script for strict correctness
# Token Bucket via Redis Lua script — ATOMIC
import redis, time
r = redis.Redis()
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local refill_rate = tonumber(ARGV[3]) -- tokens per second
local tokens_requested = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now
-- Calculate how many tokens to add since last refill
local elapsed = math.max(0, now - last_refill)
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)
-- Check if we have enough tokens
if new_tokens >= tokens_requested then
new_tokens = new_tokens - tokens_requested
redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return 1 -- allowed
else
redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return 0 -- denied
end
"""
token_bucket_script = r.register_script(TOKEN_BUCKET_LUA)
def is_allowed_token_bucket(user_id: str, capacity=10, refill_rate=1) -> bool:
key = f"rl:tb:{user_id}"
result = token_bucket_script(
keys=[key],
args=[time.time(), capacity, refill_rate, 1]
)
return result == 1
// Rate Limiting Middleware in Java (Spring Boot Filter)
@Component
public class RateLimitFilter extends OncePerRequestFilter {
private final StringRedisTemplate redis;
private static final int LIMIT = 100;
private static final int WINDOW_SECONDS = 60;
@Override
protected void doFilterInternal(HttpServletRequest req,
HttpServletResponse res,
FilterChain chain) throws IOException, ServletException {
String userId = extractUserId(req); // from JWT or API key
String key = "rl:" + userId + ":" + (System.currentTimeMillis() / 1000 / WINDOW_SECONDS);
Long count = redis.opsForValue().increment(key);
if (count == 1) {
redis.expire(key, WINDOW_SECONDS, TimeUnit.SECONDS);
}
// Set rate limit headers (good practice)
res.setHeader("X-RateLimit-Limit", String.valueOf(LIMIT));
res.setHeader("X-RateLimit-Remaining", String.valueOf(Math.max(0, LIMIT - count)));
if (count > LIMIT) {
res.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
res.setHeader("Retry-After", String.valueOf(WINDOW_SECONDS));
res.getWriter().write("{\"error\": \"Rate limit exceeded\"}");
return;
}
chain.doFilter(req, res);
}
}
Distributed Rate Limiting — Design & Trade-offs
System Design
Distributed Rate Limiting:
Request → Load Balancer → API Server (any node)
↓
Redis Cluster (shared state)
- key: user_id:window
- value: request count
↓
All nodes see same counters
→ consistent rate enforcement
Without shared state:
3 servers, limit=100
Each server: limit=100 → user can make 300 requests!
OR each server limit=33 → user may be blocked at 33 if on same server
| Approach | Consistency | Latency Added | Failure Mode |
|---|---|---|---|
| Centralized Redis | Strong | ~1ms Redis RTT | Redis down = limit enforcement fails |
| Redis Cluster | Strong (with slot ownership) | ~1ms | Node loss handled by failover (highly available) |
| Local (per-node) + gossip sync | Eventual | None | Slight over-limit possible |
| Sticky sessions (LB routes same user to same node) | Strong per-user | None | Node down = user reset |
Trade-off: Strict distributed rate limiting adds ~1ms Redis latency to every request. For very high-throughput APIs, use local counters with periodic sync or token bucket approximation.
Rate Limit Headers (429 status from RFC 6585; X-RateLimit-* headers are a de facto convention)
X-RateLimit-Limit: 100 — max requests in the window
X-RateLimit-Remaining: 43 — remaining in the window
X-RateLimit-Reset: 1716825600 — Unix timestamp when the window resets
Retry-After: 30 — seconds until next allowed request (on 429)
Rate Limiting in Real Systems — Design Interview Guide
Interview
Requirements: What kind? Per-user, per-IP, per-API-key, per-endpoint? Hard or soft limit? What on violation: drop, queue, degrade?
Algorithm: Token bucket (burst-friendly) for user-facing APIs. Sliding window counter for strict enforcement. Leaky bucket for egress smoothing.
Storage: Redis for distributed. In-memory for single-node. Lua scripts for atomicity (no race condition between read-check-write).
Where to enforce: API Gateway (centralized, easiest) → Service mesh (sidecar) → Individual service (fine-grained). Multiple layers = defense in depth.
Client handling: Return 429 with a Retry-After header. Clients should use exponential backoff + jitter; immediate retries cause a thundering herd (see the sketch after this list).
Monitoring: Track rate limit hits per user/endpoint. Alert on sudden spike = scraper/DDoS. Dashboard shows top offenders.
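A minimal client-side sketch of that retry behavior (the URL and retry parameters are placeholders): honor Retry-After when present, otherwise back off exponentially with full jitter:
# Client-side handling of 429: exponential backoff with full jitter + Retry-After
import random, time
import urllib.request, urllib.error

def get_with_backoff(url, max_retries=5, base=0.5, cap=30):
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            # Prefer the server's Retry-After hint (assumed to be in seconds here)
            retry_after = e.headers.get("Retry-After")
            if retry_after is not None:
                delay = float(retry_after)
            else:
                delay = random.uniform(0, min(cap, base * 2 ** attempt))  # full jitter
            time.sleep(delay)
    raise RuntimeError("rate limited: retries exhausted")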
Multi-Tier Rate Limiting
- L1 — IP level: 1000 req/min per IP (CDN/WAF, stops DDoS)
- L2 — API key: 100 req/min per API key (stops abuse)
- L3 — User: 60 req/min per user account (fair use)
- L4 — Endpoint: /login: 5 req/min (prevent brute force)