SQL, NoSQL, and Distributed Databases
Indexes, query optimization, ACID guarantees, isolation levels, normalization, window functions, CAP theorem, and deep dives into MongoDB, Cassandra, and Redis.
Indexes & EXPLAIN
Window Functions
JOINs
ACID & Isolation
Normalization
CTEs
CAP Theorem
MongoDB
Cassandra
Redis
SQL Indexes & Query Optimization
Index Types — B-tree, Hash, Composite, Covering, Partial
Must Know · SQL
| Type | Structure | Best For | Not For |
|---|---|---|---|
| B-tree (default) | Balanced tree; sorted | Range queries, ORDER BY, equality | Low-cardinality columns (sex, bool) |
| Hash | Hash table | Equality only (=) | Range queries, ORDER BY |
| Composite | B-tree on multiple columns | Multi-column WHERE clauses | Queries that skip the leading column |
| Covering | Index includes all needed columns | Avoid table lookups (index-only scan) | Write-heavy tables |
| Partial | Index on subset of rows (WHERE clause) | Index active records only | Full-table queries |
| Full-text | Inverted index on text tokens | Word/token search (tsvector @@, MATCH) | Exact matches (use B-tree); leading-wildcard LIKE needs a trigram index |
-- B-tree index (default in PostgreSQL/MySQL)
CREATE INDEX idx_orders_user_date ON orders (user_id, created_at DESC);
-- Covering index: includes all columns query needs
CREATE INDEX idx_users_email_cover ON users (email)
INCLUDE (name, status); -- PostgreSQL syntax; avoids heap fetch
-- Partial index: only active users (sparse, smaller)
CREATE INDEX idx_active_users ON users (email) WHERE status = 'active';
-- Expression index
CREATE INDEX idx_lower_email ON users (LOWER(email));
-- Enables: WHERE LOWER(email) = 'alice@example.com'
-- EXPLAIN ANALYZE to see index usage
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE user_id = 42 AND created_at > NOW() - INTERVAL '30 days';
-- Look for: Index Scan vs Seq Scan, cost estimates, actual rows
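The same check can be run locally with Python's bundled sqlite3, whose EXPLAIN QUERY PLAN is a rough analogue of EXPLAIN ANALYZE; a sketch with made-up table data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, created_at TEXT)")
conn.executemany("INSERT INTO orders (user_id, created_at) VALUES (?, ?)",
                 [(i % 100, f"2024-01-{i % 28 + 1:02d}") for i in range(1000)])

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN rows end with a detail string describing the scan
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT * FROM orders WHERE user_id = 42"
plan_before = plan(q)   # full table scan: detail contains "SCAN"
conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
plan_after = plan(q)    # index lookup: detail contains "USING INDEX"
print(plan_before, "->", plan_after)
```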
B-tree Index Structure:
Root node: [100 | 250 | 500]
Left: [20, 45, 80] Mid: [110, 180, 220] Right: [300, 400, 460]
Range scan WHERE id BETWEEN 45 AND 180:
→ Navigate to leaf 45, scan right until > 180
→ O(log n) to find start + O(k) to scan range
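A toy sketch of that leaf-level scan, using a sorted list as a stand-in for the B-tree's leaf entries:

```python
import bisect

leaf_keys = [20, 45, 80, 110, 180, 220, 300, 400, 460]  # sorted "leaf level"

def range_scan(keys, lo, hi):
    i = bisect.bisect_left(keys, lo)        # O(log n): find first key >= lo
    out = []
    while i < len(keys) and keys[i] <= hi:  # O(k): scan right until key > hi
        out.append(keys[i])
        i += 1
    return out

print(range_scan(leaf_keys, 45, 180))  # [45, 80, 110, 180]
```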
Composite Index (user_id, created_at):
WHERE user_id = 42 -- uses index
WHERE user_id = 42 AND created_at > X -- uses index
WHERE created_at > X -- CANNOT use index (skips leading col)
→ always put highest-cardinality, most-filtered col first
N+1 Query Problem
Fetching 100 users then SELECT posts WHERE user_id = ? for each = 101 queries.
Fix: JOIN (single query) or batch load: SELECT * FROM posts WHERE user_id IN (1,2,...100)
ORMs: SQLAlchemy selectinload(), Django prefetch_related(), Hibernate @BatchSize
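A minimal sqlite3 sketch of the batched fix (schema and row counts are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INT, title TEXT)")
conn.executemany("INSERT INTO posts (user_id, title) VALUES (?, ?)",
                 [(u, f"post-{u}-{n}") for u in range(1, 101) for n in range(3)])

user_ids = list(range(1, 101))

# N+1: one query per user (100 round trips, plus the original user fetch)
n_plus_1 = [conn.execute("SELECT title FROM posts WHERE user_id = ?", (u,)).fetchall()
            for u in user_ids]

# Batched: one IN query, then group in application code
placeholders = ",".join("?" * len(user_ids))
rows = conn.execute(f"SELECT user_id, title FROM posts WHERE user_id IN ({placeholders})",
                    user_ids).fetchall()
by_user = {}
for uid, title in rows:
    by_user.setdefault(uid, []).append(title)

print(len(by_user))  # all 100 users' posts fetched in a single query
```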
Window Functions
ROW_NUMBER, RANK, LEAD, LAG, SUM OVER PARTITION
Must Know · SQL
Window Functions vs GROUP BY
GROUP BY collapses rows into groups (aggregates). Window functions compute values across related rows while KEEPING each row in the result set.
-- Sample: employees(id, name, dept, salary, hire_date)
-- ── Ranking functions ─────────────────────────────────────
SELECT
name, dept, salary,
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS row_num,
-- No gaps, no ties: 1,2,3,4
RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank,
-- Gaps for ties: 1,2,2,4
DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dense_rank,
-- No gaps: 1,2,2,3
NTILE(4) OVER (ORDER BY salary) AS quartile
-- Divides into n equal buckets
FROM employees;
-- ── Lead / Lag — access adjacent rows ────────────────────
SELECT
name, hire_date, salary,
LAG(salary, 1, 0) OVER (PARTITION BY dept ORDER BY hire_date) AS prev_salary,
LEAD(salary, 1, 0) OVER (PARTITION BY dept ORDER BY hire_date) AS next_salary,
salary - LAG(salary, 1) OVER (ORDER BY hire_date) AS salary_change
FROM employees;
-- ── Running totals and moving averages ────────────────────
SELECT
name, dept, salary, hire_date,
SUM(salary) OVER (PARTITION BY dept ORDER BY hire_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
AVG(salary) OVER (PARTITION BY dept ORDER BY hire_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3,
COUNT(*) OVER (PARTITION BY dept) AS dept_headcount,
MAX(salary) OVER (PARTITION BY dept) AS dept_max_salary,
salary / SUM(salary) OVER (PARTITION BY dept) * 100 AS pct_of_dept_payroll
FROM employees;
-- ── Top-N per group (classic interview problem) ───────────
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
FROM employees
) ranked
WHERE rn <= 3; -- top 3 earners per department
-- ── First/Last value ──────────────────────────────────────
SELECT
name, dept, salary,
FIRST_VALUE(name) OVER (PARTITION BY dept ORDER BY salary DESC) AS top_earner,
LAST_VALUE(salary) OVER (
PARTITION BY dept ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS lowest_in_dept
FROM employees;
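The top-N-per-group pattern above runs as-is on SQLite (3.25+ ships window functions), so it can be sanity-checked without a server; the sample data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept TEXT, salary INT);
INSERT INTO employees VALUES
  ('alice','eng',120),('bob','eng',110),('carol','eng',100),('dan','eng',90),
  ('erin','sales',80),('frank','sales',70);
""")

# Top 2 earners per department via ROW_NUMBER() in a subquery
top2 = conn.execute("""
SELECT name, dept, salary FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
  FROM employees
) AS ranked
WHERE rn <= 2
ORDER BY dept, salary DESC
""").fetchall()
print(top2)
# [('alice', 'eng', 120), ('bob', 'eng', 110), ('erin', 'sales', 80), ('frank', 'sales', 70)]
```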
SQL JOINs
JOIN Types — INNER, LEFT, RIGHT, FULL OUTER, CROSS, SELF
SQL
Tables: Users(id, name) Orders(id, user_id, amount)
INNER JOIN — only rows with matches in BOTH tables
Users: [1:Alice, 2:Bob, 3:Carol]
Orders: [u1:$50, u1:$30, u2:$20]
Result: Alice→$50, Alice→$30, Bob→$20 (Carol excluded — no orders)
LEFT JOIN — all left rows + matched right rows (NULL if no match)
Result: Alice→$50, Alice→$30, Bob→$20, Carol→NULL
RIGHT JOIN — all right rows + matched left rows
(Rarely used; swap table order and use LEFT JOIN instead)
FULL OUTER JOIN — all rows from both, NULL on unmatched side
(Not supported in MySQL — use UNION of LEFT + RIGHT)
CROSS JOIN — cartesian product (m×n rows)
Use for generating combinations; no ON clause needed
SELF JOIN — join table with itself
SELECT e.name, m.name AS manager
FROM employees e JOIN employees m ON e.manager_id = m.id
-- Find users with no orders (LEFT JOIN + IS NULL pattern)
SELECT u.id, u.name
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE o.id IS NULL;
-- Alternative: NOT EXISTS (often faster with proper index)
SELECT id, name FROM users u
WHERE NOT EXISTS (SELECT 1 FROM orders WHERE user_id = u.id);
-- Self join: employee + manager
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id; -- LEFT to include top-level (CEO)
-- Multiple joins
SELECT o.id, u.name, p.name AS product, o.quantity
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN products p ON o.product_id = p.id
WHERE o.created_at >= CURRENT_DATE - INTERVAL '7 days';
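The LEFT JOIN + IS NULL anti-join can be verified with sqlite3 using the Users/Orders sample from above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, amount INT);
INSERT INTO users VALUES (1,'Alice'),(2,'Bob'),(3,'Carol');
INSERT INTO orders (user_id, amount) VALUES (1,50),(1,30),(2,20);
""")

# Users with no orders: unmatched left rows have NULL on the right side
no_orders = conn.execute("""
SELECT u.name FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE o.id IS NULL
""").fetchall()
print(no_orders)  # [('Carol',)]
```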
ACID & Transaction Isolation Levels
ACID Properties & Transaction Isolation
Must Know · Transactions
ACID
- Atomicity: Transaction is all-or-nothing. If any statement fails, the entire transaction rolls back. (WAL / undo logs)
- Consistency: Database moves from one valid state to another. Constraints, triggers, cascades are enforced.
- Isolation: Concurrent transactions don't see each other's uncommitted changes (configurable level).
- Durability: Committed transactions survive crashes. (Write-Ahead Log / fsync to disk)
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Performance |
|---|---|---|---|---|
| READ UNCOMMITTED | Possible | Possible | Possible | Fastest |
| READ COMMITTED (default PG) | Prevented | Possible | Possible | Good |
| REPEATABLE READ (default MySQL) | Prevented | Prevented | Possible | Moderate |
| SERIALIZABLE | Prevented | Prevented | Prevented | Slowest |
Anomalies explained:
Dirty Read: T2 reads T1's uncommitted change, T1 rolls back → T2 used invalid data
Non-Repeatable Read:
T1: SELECT salary FROM emp WHERE id=1 → $5000
T2: UPDATE emp SET salary=6000 WHERE id=1; COMMIT
T1: SELECT salary FROM emp WHERE id=1 → $6000 (changed!)
Phantom Read:
T1: SELECT COUNT(*) FROM orders WHERE total > 100 → 5
T2: INSERT INTO orders (total) VALUES (200); COMMIT
T1: SELECT COUNT(*) FROM orders WHERE total > 100 → 6 (new row appeared!)
-- Set isolation level
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- critical financial operation
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
UPDATE accounts SET balance = balance + 500 WHERE id = 2;
COMMIT;
-- Optimistic locking (avoid long-held locks)
BEGIN;
SELECT version FROM inventory WHERE product_id = 42; -- version=5
-- ... do business logic ...
UPDATE inventory SET qty = qty - 1, version = version + 1
WHERE product_id = 42 AND version = 5; -- fails if version changed
-- check affected rows; if 0, retry
COMMIT;
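The optimistic-locking loop can be sketched with sqlite3, using cursor.rowcount as the affected-rows check (schema and retry count are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id INT PRIMARY KEY, qty INT, version INT)")
conn.execute("INSERT INTO inventory VALUES (42, 10, 5)")

def decrement(product_id: int, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        qty, version = conn.execute(
            "SELECT qty, version FROM inventory WHERE product_id = ?",
            (product_id,)).fetchone()
        cur = conn.execute(
            "UPDATE inventory SET qty = qty - 1, version = version + 1 "
            "WHERE product_id = ? AND version = ?", (product_id, version))
        if cur.rowcount == 1:   # our version was still current: success
            conn.commit()
            return True
        conn.rollback()          # someone else committed first: retry
    return False

print(decrement(42))  # True
```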
Normalization — 1NF through BCNF
Normal Forms — 1NF, 2NF, 3NF, BCNF
SQL · Schema Design
| Form | Rule | Violation Example |
|---|---|---|
| 1NF | Atomic values; no repeating groups; primary key exists | phone = "555-1234, 555-5678" (multi-value) |
| 2NF | 1NF + no partial dependencies on composite PK | Order(order_id, product_id, product_name) — product_name depends only on product_id |
| 3NF | 2NF + no transitive dependencies (non-key → non-key) | Employee(emp_id, dept_id, dept_name) — dept_name depends on dept_id, not emp_id |
| BCNF | Every determinant is a candidate key | Rare — stricter than 3NF; split when 3NF allows anomalies |
De-normalization trade-offs:
Normalized (3NF): JOIN orders o → products p → categories c
Pros: no data redundancy, easier updates
Cons: complex queries, JOIN overhead
De-normalized (for read performance):
Add category_name, product_name to orders table directly
Pros: single-table reads, no JOINs
Cons: update anomalies (must update every order row if a category is renamed)
Decision: Normalize first; denormalize only when profiling shows JOIN bottleneck.
OLTP: normalize (frequent writes). OLAP/DWH: denormalize (star/snowflake schema).
CTEs & Recursive CTEs
CTEs — Common Table Expressions & Recursive Queries
SQL · Advanced
-- ── Basic CTE ─────────────────────────────────────────────
WITH dept_avg AS (
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
),
high_earners AS (
SELECT e.*, d.avg_salary,
e.salary - d.avg_salary AS above_avg
FROM employees e
JOIN dept_avg d ON e.dept_id = d.dept_id
WHERE e.salary > d.avg_salary
)
SELECT name, dept_id, salary, above_avg
FROM high_earners
ORDER BY above_avg DESC;
-- ── Recursive CTE — org chart (manager hierarchy) ─────────
WITH RECURSIVE org_tree AS (
-- Base case: top-level managers (no manager)
SELECT id, name, manager_id, 1 AS level, CAST(name AS VARCHAR(1000)) AS path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: employees with a manager
SELECT e.id, e.name, e.manager_id, ot.level + 1,
CAST(ot.path || ' > ' || e.name AS VARCHAR(1000))
FROM employees e
JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT id, name, level, path
FROM org_tree
ORDER BY path;
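SQLite also supports WITH RECURSIVE, so a trimmed version of the org-chart CTE can be run self-contained:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INT, name TEXT, manager_id INT);
INSERT INTO employees VALUES (1,'ceo',NULL),(2,'vp',1),(3,'eng',2);
""")

tree = conn.execute("""
WITH RECURSIVE org_tree AS (
  SELECT id, name, 1 AS level, name AS path
  FROM employees WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.name, ot.level + 1, ot.path || ' > ' || e.name
  FROM employees e JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT name, level, path FROM org_tree ORDER BY path
""").fetchall()
print(tree)
# [('ceo', 1, 'ceo'), ('vp', 2, 'ceo > vp'), ('eng', 3, 'ceo > vp > eng')]
```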
-- ── Recursive CTE — find all paths in a graph ──────────────
WITH RECURSIVE paths AS (
SELECT from_node, to_node, ARRAY[from_node, to_node] AS path
FROM edges
WHERE from_node = 1 -- start node
UNION ALL
SELECT p.from_node, e.to_node, p.path || e.to_node
FROM paths p
JOIN edges e ON p.to_node = e.from_node
WHERE NOT (e.to_node = ANY(p.path)) -- avoid cycles
)
SELECT * FROM paths WHERE to_node = 10; -- find all paths to node 10
CTE vs Subquery vs View
CTE: Named, reusable within one query. Improves readability. PostgreSQL 11 and earlier always materialized CTEs (an optimization fence); PostgreSQL 12+ inlines them unless marked MATERIALIZED. MySQL 8+ supports CTEs, including recursive ones.
Subquery: Inline, not reusable. Can be correlated (references outer query).
View: Persistent named query. A materialized view stores its results and can be indexed.
NoSQL & CAP Theorem
CAP Theorem & NoSQL Database Categories
Must Know · Distributed
CAP Theorem
A distributed system can only guarantee two of three properties simultaneously:
- Consistency (C): Every read receives the most recent write (or an error).
- Availability (A): Every request receives a response (not necessarily latest data).
- Partition Tolerance (P): System continues operating despite network partition.
In reality, network partitions happen, so you choose CP or AP.
BASE (NoSQL alternative to ACID): Basically Available, Soft state, Eventually consistent.
| Database | Type | CAP | Best For |
|---|---|---|---|
| PostgreSQL, MySQL | Relational (RDBMS) | CP | Transactions, reporting, joins |
| MongoDB | Document | CP (configurable) | Flexible schemas, hierarchical data |
| Cassandra | Wide-column | AP | High-write, time-series, sensor data |
| Redis | In-memory KV | AP-leaning (async replication can lose acked writes) | Caching, sessions, rate limiting |
| DynamoDB | KV/Document | AP (eventual) / CP (strong) | AWS-native, serverless scale |
| Neo4j | Graph | CA (single-node) | Relationship queries (fraud, social) |
| Elasticsearch | Search/Document | AP | Full-text search, log analytics |
| InfluxDB | Time-series | CP | Metrics, monitoring, IoT |
MongoDB
MongoDB — Document Model & Aggregation Pipeline
MongoDB · NoSQL
from pymongo import MongoClient, ASCENDING, DESCENDING
from datetime import datetime, timedelta
client = MongoClient("mongodb://localhost:27017")
db = client["ecommerce"]
orders = db["orders"]
# ── CRUD ─────────────────────────────────────────────────
# Insert
doc_id = orders.insert_one({
"user_id": "u123",
"items": [
{"product": "Laptop", "qty": 1, "price": 999.99},
{"product": "Mouse", "qty": 2, "price": 29.99},
],
"status": "pending",
"created_at": datetime.utcnow(),
"total": 1059.97
}).inserted_id
# Query
recent = list(orders.find(
{"status": "shipped", "created_at": {"$gte": datetime.utcnow() - timedelta(days=7)}},
{"user_id": 1, "total": 1, "_id": 0} # projection
).sort("created_at", DESCENDING).limit(20))
# Update — atomic operators
orders.update_one(
{"_id": doc_id},
{
"$set": {"status": "shipped"},
"$push": {"tracking": {"event": "dispatched", "ts": datetime.utcnow()}},
"$inc": {"update_count": 1}
}
)
# ── Aggregation Pipeline ──────────────────────────────────
pipeline = [
# Stage 1: Filter
{"$match": {
"created_at": {"$gte": datetime(2024, 1, 1)},
"status": {"$in": ["shipped", "delivered"]}
}},
# Stage 2: Unwind array
{"$unwind": "$items"},
# Stage 3: Group
{"$group": {
"_id": "$items.product",
"total_revenue": {"$sum": {"$multiply": ["$items.qty", "$items.price"]}},
"units_sold": {"$sum": "$items.qty"},
"order_count": {"$addToSet": "$_id"}
}},
# Stage 4: Add computed field
{"$addFields": {
"unique_orders": {"$size": "$order_count"},
"avg_price": {"$divide": ["$total_revenue", "$units_sold"]}
}},
# Stage 5: Sort
{"$sort": {"total_revenue": -1}},
# Stage 6: Limit
{"$limit": 10},
# Stage 7: Project output
{"$project": {
"product": "$_id", "_id": 0,
"total_revenue": {"$round": ["$total_revenue", 2]},
"units_sold": 1, "unique_orders": 1
}}
]
results = list(orders.aggregate(pipeline))
# ── Indexes ────────────────────────────────────────────────
orders.create_index([("user_id", ASCENDING), ("created_at", DESCENDING)])
orders.create_index(
    [("status", ASCENDING)],
    partialFilterExpression={"status": {"$eq": "active"}},  # $ne is not allowed in partial filters
)
orders.create_index([("items.product", "text")]) # full-text index
MongoDB Schema Design Patterns
- Embed vs Reference: Embed when data is accessed together and bounded in size (1:1, 1:few). Reference for unbounded 1:many (user → orders) and many-to-many.
- Bucket pattern: Group time-series data into buckets (e.g., 1 doc = 1 hour of sensor readings) to reduce document count.
- Computed pattern: Pre-compute aggregates (total_orders) on write to speed up reads.
- Outlier pattern: For documents that rarely exceed a threshold but occasionally balloon, use an overflow flag with linked documents.
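A pure-Python sketch of the bucket (plus computed) pattern, no MongoDB required; field names are illustrative:

```python
from collections import defaultdict
from datetime import datetime

# One reading per event — the naive model would store one document each
readings = [
    ("s1", datetime(2024, 1, 1, 10, 5), 21.5),
    ("s1", datetime(2024, 1, 1, 10, 40), 22.1),
    ("s1", datetime(2024, 1, 1, 11, 2), 22.8),
]

# Bucket pattern: one document per (sensor, hour) holding an array of readings
buckets = defaultdict(lambda: {"readings": [], "count": 0})
for sensor, ts, value in readings:
    key = (sensor, ts.replace(minute=0, second=0, microsecond=0))
    b = buckets[key]
    b["readings"].append({"ts": ts, "value": value})
    b["count"] += 1  # computed pattern: maintain the aggregate on write

print(len(buckets))  # 2 bucket documents instead of 3 reading documents
```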
Apache Cassandra
Cassandra — Wide-Column Model, Partition Design & Consistency
Cassandra · NoSQL
Core Concepts
- Partition key: Determines which node holds the data (hashed). All rows with same partition key are co-located.
- Clustering columns: Define sort order within a partition. Enables efficient range queries within a partition.
- No JOINs, no transactions across partitions — denormalize; design tables around queries.
- Write path: Write to commit log (WAL) + memtable (in-memory). Memtable flushes to SSTable on disk periodically.
- Read path: Check memtable → row cache → bloom filter → SSTable (merge results). Compaction merges SSTables periodically.
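How a partition key determines placement can be sketched in a few lines. This is illustrative only: Cassandra really uses Murmur3 tokens and virtual nodes, and md5 here just stands in for a hash function:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for(partition_key: str) -> str:
    # Hash the partition key to a token, then map the token to a node.
    token = int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")
    return NODES[token % len(NODES)]

# All rows sharing a partition key hash to the same token, hence the same
# node — that co-location is what makes within-partition queries cheap.
print(node_for("user-42"), node_for("user-99"))
```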
-- CQL (Cassandra Query Language)
-- Schema design: query-first approach
-- Q: "Get all events for a user, newest first"
CREATE TABLE user_events (
user_id UUID,
event_ts TIMESTAMP,
event_id UUID,
type TEXT,
payload MAP&lt;TEXT, TEXT&gt;,
PRIMARY KEY (user_id, event_ts, event_id)
) WITH CLUSTERING ORDER BY (event_ts DESC, event_id ASC)
AND default_time_to_live = 7776000; -- auto-expire after 90 days
-- Partition key = user_id (all events for a user on one node)
-- Clustering = event_ts DESC (newest first without sorting)
-- Q: "Get events for user in last 24 hours"
SELECT * FROM user_events
WHERE user_id = f47ac10b-...
AND event_ts >= toTimestamp(now()) - 1d -- duration arithmetic (Cassandra 4.0+)
LIMIT 100;
-- Tunable consistency is set per request by the driver or cqlsh,
-- not inline in CQL (the old USING CONSISTENCY syntax was removed):
-- cqlsh> CONSISTENCY QUORUM;   -- applies to subsequent statements
-- Python driver: SimpleStatement("INSERT INTO user_events ...",
--                consistency_level=ConsistencyLevel.QUORUM)
-- Read ONE = fastest, eventual consistency
-- QUORUM for both reads and writes = strong consistency (safe but slower)
-- Replication factor 3: QUORUM = 2 replicas must ack
Cassandra Consistency Levels (replication_factor = 3):
ANY — at least 1 node (weakest; even hinted handoff counts)
ONE — 1 replica responds
TWO — 2 replicas respond
QUORUM — floor(RF/2)+1 = 2 of 3 (most common for strong consistency)
ALL — all 3 replicas (strongest, least available)
Strong consistency: Write QUORUM + Read QUORUM
(At least one replica overlaps between write and read set)
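The overlap argument is plain arithmetic; a sketch of the quorum math:

```python
def quorum(rf: int) -> int:
    # floor(RF/2) + 1, e.g. 2 of 3 replicas
    return rf // 2 + 1

def is_strong(rf: int, r: int, w: int) -> bool:
    # Reads and writes are guaranteed to overlap in at least one
    # replica whenever R + W > RF — that overlap yields strong reads.
    return r + w > rf

print(quorum(3))                            # 2
print(is_strong(3, quorum(3), quorum(3)))   # True: QUORUM writes + QUORUM reads
print(is_strong(3, 1, 1))                   # False: ONE + ONE can miss the write
```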
Use case fit:
✅ Time-series / IoT sensor data
✅ Activity feeds, audit logs (append-heavy)
✅ User session data with TTL
❌ Complex relationships needing JOINs
❌ Transactions across multiple partition keys
Redis
Redis Data Types, Patterns & Persistence
Must Know · Redis · Caching
| Data Type | Commands | Use Cases |
|---|---|---|
| String | GET/SET/INCR/EXPIRE | Cache, counters, rate limiting, session tokens |
| Hash | HGET/HSET/HMGET | User profiles, object fields |
| List | LPUSH/RPUSH/LRANGE/BLPOP | Message queues, activity feeds, recent N items |
| Set | SADD/SMEMBERS/SINTER/SUNION | Unique visitors, tags, social graph |
| Sorted Set | ZADD/ZRANGE/ZRANGEBYSCORE | Leaderboards, rate limiting (sliding window), priority queues |
| Stream | XADD/XREAD/XGROUP | Event sourcing, log aggregation (Kafka-lite) |
| Bitmap | SETBIT/GETBIT/BITCOUNT | Daily active users (1 bit per user), feature flags |
| HyperLogLog | PFADD/PFCOUNT | Approximate unique count (12KB for any cardinality) |
import redis, time, json
r = redis.Redis(host='localhost', port=6379, decode_responses=True)
# ── Caching with TTL ───────────────────────────────────────
def get_user(user_id: str) -> dict:
key = f"user:{user_id}"
cached = r.get(key)
if cached:
return json.loads(cached)
user = db.query_user(user_id) # slow DB query
r.setex(key, 3600, json.dumps(user)) # cache 1 hour
return user
# ── Distributed Rate Limiting (sliding window log) ────────
def rate_limit(user_id: str, limit: int = 100, window: int = 60) -> bool:
key = f"rate:{user_id}"
pipe = r.pipeline()
now = time.time()
pipe.zremrangebyscore(key, 0, now - window) # remove old requests
pipe.zadd(key, {str(now): now}) # add current request
pipe.zcard(key) # count in window
pipe.expire(key, window)
_, _, count, _ = pipe.execute()
return count <= limit
# ── Distributed Lock (SETNX pattern) ──────────────────────
import uuid
def acquire_lock(resource: str, ttl: int = 10) -> str | None:
lock_id = str(uuid.uuid4())
acquired = r.set(f"lock:{resource}", lock_id, nx=True, ex=ttl)
return lock_id if acquired else None
def release_lock(resource: str, lock_id: str) -> bool:
# Lua script for atomic check-and-delete
lua = """
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else return 0 end"""
return bool(r.eval(lua, 1, f"lock:{resource}", lock_id))
# ── Leaderboard (Sorted Set) ───────────────────────────────
r.zadd("leaderboard:2024", {"alice": 9500, "bob": 8700, "carol": 9200})
# Top 10
r.zrevrange("leaderboard:2024", 0, 9, withscores=True)
# User rank
r.zrevrank("leaderboard:2024", "alice") # 0-indexed rank
# ── Pub/Sub ────────────────────────────────────────────────
# Publisher
r.publish("notifications:user:u123", json.dumps({"type": "order_shipped", "order_id": "o456"}))
# Subscriber (blocking)
pubsub = r.pubsub()
pubsub.subscribe("notifications:user:u123")
for message in pubsub.listen():
if message["type"] == "message":
print(json.loads(message["data"]))
Redis Persistence Modes
- RDB (snapshots): Periodic binary dump. Fast restart, compact file. Risk: lose data since last snapshot (typically minutes).
- AOF (Append Only File): Log every write command. Configurable fsync: always (safe but slow), everysec (1s data loss risk), no (OS decides). Larger file, slower restart but better durability.
- RDB + AOF: Best of both — use AOF for durability, RDB for fast restart. Redis 7+ uses newer AOF + RDB hybrid format.
- No persistence: Pure cache — fastest, full data loss on restart. Acceptable when data can be rebuilt from primary store.
Redis Cluster vs Sentinel vs Standalone
Standalone: Single node. Simplest; no HA.
Sentinel: HA setup — 3+ sentinel processes monitor master + replicas. Auto-failover. One shard (no horizontal scale).
Cluster: Horizontal sharding across 3-N master nodes (16384 hash slots). Each master has replicas. Scales writes; multi-key commands require all keys in the same slot (use hash tags {user}:*).
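Key-to-slot mapping is CRC16 (the XMODEM variant named in the cluster spec) mod 16384, computed over the {hash tag} when one is present; a pure-Python sketch:

```python
def crc16(data: bytes) -> int:
    # CRC-16/XMODEM: poly 0x1021, init 0, no reflection (Redis cluster spec)
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    # If the key contains a non-empty {hash tag}, only the tag is hashed,
    # which forces related keys onto the same slot (multi-key ops work).
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

print(hash_slot("123456789"))  # 12739 (CRC16 check value 0x31C3)
print(hash_slot("{user:1}:profile") == hash_slot("{user:1}:orders"))  # True
```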