SQL, NoSQL, and Distributed Databases
Indexes, query optimization, ACID guarantees, isolation levels, normalization, window functions, CAP theorem, and deep dives into MongoDB, Cassandra, and Redis.
Indexes & EXPLAIN
Window Functions
JOINs
ACID & Isolation
Normalization
CTEs
CAP Theorem
MongoDB
Cassandra
Redis
SQL Indexes & Query Optimization
Index Types — B-tree, Hash, Composite, Covering, Partial
Must Know · SQL
| Type | Structure | Best For | Not For |
|---|---|---|---|
| B-tree (default) | Balanced tree; sorted | Range queries, ORDER BY, equality | Low-cardinality columns (sex, bool) |
| Hash | Hash table | Equality only (=) | Range queries, ORDER BY |
| Composite | B-tree on multiple columns | Multi-column WHERE clauses | Queries that skip the leading column |
| Covering | Index includes all needed columns | Avoid table lookups (index-only scan) | Write-heavy tables |
| Partial | Index on subset of rows (WHERE clause) | Index active records only | Full-table queries |
| Full-text | Inverted index on text tokens | Word/token search (tsvector @@, MATCH) | Exact matches (use B-tree); leading-wildcard LIKE needs a trigram index |
-- B-tree index (default in PostgreSQL/MySQL)
CREATE INDEX idx_orders_user_date ON orders (user_id, created_at DESC);
-- Covering index: includes all columns query needs
CREATE INDEX idx_users_email_cover ON users (email)
INCLUDE (name, status); -- PostgreSQL syntax; avoids heap fetch
-- Partial index: only active users (sparse, smaller)
CREATE INDEX idx_active_users ON users (email) WHERE status = 'active';
-- Expression index
CREATE INDEX idx_lower_email ON users (LOWER(email));
-- Enables: WHERE LOWER(email) = 'alice@example.com'
-- EXPLAIN ANALYZE to see index usage
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE user_id = 42 AND created_at > NOW() - INTERVAL '30 days';
-- Look for: Index Scan vs Seq Scan, cost estimates, actual rows
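The same check can be run locally with Python's bundled sqlite3, whose EXPLAIN QUERY PLAN is a rough analogue of EXPLAIN ANALYZE; a sketch with made-up table data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, created_at TEXT)")
conn.executemany("INSERT INTO orders (user_id, created_at) VALUES (?, ?)",
                 [(i % 100, f"2024-01-{i % 28 + 1:02d}") for i in range(1000)])

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN rows end with a detail string describing the scan
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT * FROM orders WHERE user_id = 42"
plan_before = plan(q)   # full table scan: detail contains "SCAN"
conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
plan_after = plan(q)    # index lookup: detail contains "USING INDEX"
print(plan_before, "->", plan_after)
```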
B-tree Index Structure:
Root node: [100 | 250 | 500]
Left: [20, 45, 80] Mid: [110, 180, 220] Right: [300, 400, 460]
Range scan WHERE id BETWEEN 45 AND 180:
→ Navigate to leaf 45, scan right until > 180
→ O(log n) to find start + O(k) to scan range
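A toy sketch of that leaf-level scan, using a sorted list as a stand-in for the B-tree's leaf entries:

```python
import bisect

leaf_keys = [20, 45, 80, 110, 180, 220, 300, 400, 460]  # sorted "leaf level"

def range_scan(keys, lo, hi):
    i = bisect.bisect_left(keys, lo)        # O(log n): find first key >= lo
    out = []
    while i < len(keys) and keys[i] <= hi:  # O(k): scan right until key > hi
        out.append(keys[i])
        i += 1
    return out

print(range_scan(leaf_keys, 45, 180))  # [45, 80, 110, 180]
```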
Composite Index (user_id, created_at):
WHERE user_id = 42 -- uses index
WHERE user_id = 42 AND created_at > X -- uses index
WHERE created_at > X -- CANNOT use index (skips leading col)
→ always put highest-cardinality, most-filtered col first
N+1 Query Problem
Fetching 100 users then SELECT posts WHERE user_id = ? for each = 101 queries.
Fix: JOIN (single query) or batch load: SELECT * FROM posts WHERE user_id IN (1,2,...100)
ORMs: SQLAlchemy selectinload(), Django prefetch_related(), Hibernate @BatchSize
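A minimal sqlite3 sketch of the batched fix (schema and row counts are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INT, title TEXT)")
conn.executemany("INSERT INTO posts (user_id, title) VALUES (?, ?)",
                 [(u, f"post-{u}-{n}") for u in range(1, 101) for n in range(3)])

user_ids = list(range(1, 101))

# N+1: one query per user (100 round trips, plus the original user fetch)
n_plus_1 = [conn.execute("SELECT title FROM posts WHERE user_id = ?", (u,)).fetchall()
            for u in user_ids]

# Batched: one IN query, then group in application code
placeholders = ",".join("?" * len(user_ids))
rows = conn.execute(f"SELECT user_id, title FROM posts WHERE user_id IN ({placeholders})",
                    user_ids).fetchall()
by_user = {}
for uid, title in rows:
    by_user.setdefault(uid, []).append(title)

print(len(by_user))  # all 100 users' posts fetched in a single query
```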
Window Functions
ROW_NUMBER, RANK, LEAD, LAG, SUM OVER PARTITION
Must Know · SQL
Window Functions vs GROUP BY
GROUP BY collapses rows into groups (aggregates). Window functions compute values across related rows while KEEPING each row in the result set.
-- Sample: employees(id, name, dept, salary, hire_date)
-- ── Ranking functions ─────────────────────────────────────
SELECT
name, dept, salary,
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS row_num,
-- No gaps, no ties: 1,2,3,4
RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank,
-- Gaps for ties: 1,2,2,4
DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dense_rank,
-- No gaps: 1,2,2,3
NTILE(4) OVER (ORDER BY salary) AS quartile
-- Divides into n equal buckets
FROM employees;
-- ── Lead / Lag — access adjacent rows ────────────────────
SELECT
name, hire_date, salary,
LAG(salary, 1, 0) OVER (PARTITION BY dept ORDER BY hire_date) AS prev_salary,
LEAD(salary, 1, 0) OVER (PARTITION BY dept ORDER BY hire_date) AS next_salary,
salary - LAG(salary, 1) OVER (ORDER BY hire_date) AS salary_change
FROM employees;
-- ── Running totals and moving averages ────────────────────
SELECT
name, dept, salary, hire_date,
SUM(salary) OVER (PARTITION BY dept ORDER BY hire_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
AVG(salary) OVER (PARTITION BY dept ORDER BY hire_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3,
COUNT(*) OVER (PARTITION BY dept) AS dept_headcount,
MAX(salary) OVER (PARTITION BY dept) AS dept_max_salary,
salary / SUM(salary) OVER (PARTITION BY dept) * 100 AS pct_of_dept_payroll
FROM employees;
-- ── Top-N per group (classic interview problem) ───────────
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
FROM employees
) ranked
WHERE rn <= 3; -- top 3 earners per department
-- ── First/Last value ──────────────────────────────────────
SELECT
name, dept, salary,
FIRST_VALUE(name) OVER (PARTITION BY dept ORDER BY salary DESC) AS top_earner,
LAST_VALUE(salary) OVER (
PARTITION BY dept ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS lowest_in_dept
FROM employees;
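The top-N-per-group pattern above runs as-is on SQLite (3.25+ ships window functions), so it can be sanity-checked without a server; the sample data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept TEXT, salary INT);
INSERT INTO employees VALUES
  ('alice','eng',120),('bob','eng',110),('carol','eng',100),('dan','eng',90),
  ('erin','sales',80),('frank','sales',70);
""")

# Top 2 earners per department via ROW_NUMBER() in a subquery
top2 = conn.execute("""
SELECT name, dept, salary FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
  FROM employees
) AS ranked
WHERE rn <= 2
ORDER BY dept, salary DESC
""").fetchall()
print(top2)
# [('alice', 'eng', 120), ('bob', 'eng', 110), ('erin', 'sales', 80), ('frank', 'sales', 70)]
```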
SQL JOINs
JOIN Types — INNER, LEFT, RIGHT, FULL OUTER, CROSS, SELF
SQL
Tables: Users(id, name) Orders(id, user_id, amount)
INNER JOIN — only rows with matches in BOTH tables
Users: [1:Alice, 2:Bob, 3:Carol]
Orders: [u1:$50, u1:$30, u2:$20]
Result: Alice→$50, Alice→$30, Bob→$20 (Carol excluded — no orders)
LEFT JOIN — all left rows + matched right rows (NULL if no match)
Result: Alice→$50, Alice→$30, Bob→$20, Carol→NULL
RIGHT JOIN — all right rows + matched left rows
(Rarely used; swap table order and use LEFT JOIN instead)
FULL OUTER JOIN — all rows from both, NULL on unmatched side
(Not supported in MySQL — use UNION of LEFT + RIGHT)
CROSS JOIN — cartesian product (m×n rows)
Use for generating combinations; no ON clause needed
SELF JOIN — join table with itself
SELECT e.name, m.name AS manager
FROM employees e JOIN employees m ON e.manager_id = m.id
-- Find users with no orders (LEFT JOIN + IS NULL pattern)
SELECT u.id, u.name
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE o.id IS NULL;
-- Alternative: NOT EXISTS (often faster with proper index)
SELECT id, name FROM users u
WHERE NOT EXISTS (SELECT 1 FROM orders WHERE user_id = u.id);
-- Self join: employee + manager
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id; -- LEFT to include top-level (CEO)
-- Multiple joins
SELECT o.id, u.name, p.name AS product, o.quantity
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN products p ON o.product_id = p.id
WHERE o.created_at >= CURRENT_DATE - INTERVAL '7 days';
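The LEFT JOIN + IS NULL anti-join can be verified with sqlite3 using the Users/Orders sample from above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, amount INT);
INSERT INTO users VALUES (1,'Alice'),(2,'Bob'),(3,'Carol');
INSERT INTO orders (user_id, amount) VALUES (1,50),(1,30),(2,20);
""")

# Users with no orders: unmatched left rows have NULL on the right side
no_orders = conn.execute("""
SELECT u.name FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE o.id IS NULL
""").fetchall()
print(no_orders)  # [('Carol',)]
```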
ACID & Transaction Isolation Levels
ACID Properties & Transaction Isolation
Must Know · Transactions
ACID
- Atomicity: Transaction is all-or-nothing. If any statement fails, the entire transaction rolls back. (WAL / undo logs)
- Consistency: Database moves from one valid state to another. Constraints, triggers, cascades are enforced.
- Isolation: Concurrent transactions don't see each other's uncommitted changes (configurable level).
- Durability: Committed transactions survive crashes. (Write-Ahead Log / fsync to disk)
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Performance |
|---|---|---|---|---|
| READ UNCOMMITTED | Possible | Possible | Possible | Fastest |
| READ COMMITTED (default PG) | Prevented | Possible | Possible | Good |
| REPEATABLE READ (default MySQL) | Prevented | Prevented | Possible | Moderate |
| SERIALIZABLE | Prevented | Prevented | Prevented | Slowest |
Anomalies explained:
Dirty Read: T2 reads T1's uncommitted change, T1 rolls back → T2 used invalid data
Non-Repeatable Read:
T1: SELECT salary FROM emp WHERE id=1 → $5000
T2: UPDATE emp SET salary=6000 WHERE id=1; COMMIT
T1: SELECT salary FROM emp WHERE id=1 → $6000 (changed!)
Phantom Read:
T1: SELECT COUNT(*) FROM orders WHERE total > 100 → 5
T2: INSERT INTO orders (total) VALUES (200); COMMIT
T1: SELECT COUNT(*) FROM orders WHERE total > 100 → 6 (new row appeared!)
-- Set isolation level
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- critical financial operation
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
UPDATE accounts SET balance = balance + 500 WHERE id = 2;
COMMIT;
-- Optimistic locking (avoid long-held locks)
BEGIN;
SELECT version FROM inventory WHERE product_id = 42; -- version=5
-- ... do business logic ...
UPDATE inventory SET qty = qty - 1, version = version + 1
WHERE product_id = 42 AND version = 5; -- fails if version changed
-- check affected rows; if 0, retry
COMMIT;
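The optimistic-locking loop can be sketched with sqlite3, using cursor.rowcount as the affected-rows check (schema and retry count are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id INT PRIMARY KEY, qty INT, version INT)")
conn.execute("INSERT INTO inventory VALUES (42, 10, 5)")

def decrement(product_id: int, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        qty, version = conn.execute(
            "SELECT qty, version FROM inventory WHERE product_id = ?",
            (product_id,)).fetchone()
        cur = conn.execute(
            "UPDATE inventory SET qty = qty - 1, version = version + 1 "
            "WHERE product_id = ? AND version = ?", (product_id, version))
        if cur.rowcount == 1:   # our version was still current: success
            conn.commit()
            return True
        conn.rollback()          # someone else committed first: retry
    return False

print(decrement(42))  # True
```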
Normalization — 1NF through BCNF
Normal Forms — 1NF, 2NF, 3NF, BCNF
SQL · Schema Design
| Form | Rule | Violation Example |
|---|---|---|
| 1NF | Atomic values; no repeating groups; primary key exists | phone = "555-1234, 555-5678" (multi-value) |
| 2NF | 1NF + no partial dependencies on composite PK | Order(order_id, product_id, product_name) — product_name depends only on product_id |
| 3NF | 2NF + no transitive dependencies (non-key → non-key) | Employee(emp_id, dept_id, dept_name) — dept_name depends on dept_id, not emp_id |
| BCNF | Every determinant is a candidate key | Rare — stricter than 3NF; split when 3NF allows anomalies |
De-normalization trade-offs:
Normalized (3NF): JOIN orders o → products p → categories c
Pros: no data redundancy, easier updates
Cons: complex queries, JOIN overhead
De-normalized (for read performance):
Add category_name, product_name to orders table directly
Pros: single-table reads, no JOINs
Cons: update anomalies (must update every order row if a category is renamed)
Decision: Normalize first; denormalize only when profiling shows JOIN bottleneck.
OLTP: normalize (frequent writes). OLAP/DWH: denormalize (star/snowflake schema).
CTEs & Recursive CTEs
CTEs — Common Table Expressions & Recursive Queries
SQL · Advanced
-- ── Basic CTE ─────────────────────────────────────────────
WITH dept_avg AS (
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
),
high_earners AS (
SELECT e.*, d.avg_salary,
e.salary - d.avg_salary AS above_avg
FROM employees e
JOIN dept_avg d ON e.dept_id = d.dept_id
WHERE e.salary > d.avg_salary
)
SELECT name, dept_id, salary, above_avg
FROM high_earners
ORDER BY above_avg DESC;
-- ── Recursive CTE — org chart (manager hierarchy) ─────────
WITH RECURSIVE org_tree AS (
-- Base case: top-level managers (no manager)
SELECT id, name, manager_id, 1 AS level, CAST(name AS VARCHAR(1000)) AS path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: employees with a manager
SELECT e.id, e.name, e.manager_id, ot.level + 1,
CAST(ot.path || ' > ' || e.name AS VARCHAR(1000))
FROM employees e
JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT id, name, level, path
FROM org_tree
ORDER BY path;
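SQLite also supports WITH RECURSIVE, so a trimmed version of the org-chart CTE can be run self-contained:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INT, name TEXT, manager_id INT);
INSERT INTO employees VALUES (1,'ceo',NULL),(2,'vp',1),(3,'eng',2);
""")

tree = conn.execute("""
WITH RECURSIVE org_tree AS (
  SELECT id, name, 1 AS level, name AS path
  FROM employees WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.name, ot.level + 1, ot.path || ' > ' || e.name
  FROM employees e JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT name, level, path FROM org_tree ORDER BY path
""").fetchall()
print(tree)
# [('ceo', 1, 'ceo'), ('vp', 2, 'ceo > vp'), ('eng', 3, 'ceo > vp > eng')]
```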
-- ── Recursive CTE — find all paths in a graph ──────────────
WITH RECURSIVE paths AS (
SELECT from_node, to_node, ARRAY[from_node, to_node] AS path
FROM edges
WHERE from_node = 1 -- start node
UNION ALL
SELECT p.from_node, e.to_node, p.path || e.to_node
FROM paths p
JOIN edges e ON p.to_node = e.from_node
WHERE NOT (e.to_node = ANY(p.path)) -- avoid cycles
)
SELECT * FROM paths WHERE to_node = 10; -- find all paths to node 10
CTE vs Subquery vs View
CTE: Named, reusable within one query. Improves readability. PostgreSQL 11 and earlier always materialized CTEs (an optimization fence); PostgreSQL 12+ inlines them unless marked MATERIALIZED. MySQL 8+ supports CTEs, including recursive ones.
Subquery: Inline, not reusable. Can be correlated (references outer query).
View: Persistent named query. A materialized view stores its results and can be indexed.
NoSQL & CAP Theorem
CAP Theorem & NoSQL Database Categories
Must Know · Distributed
CAP Theorem
A distributed system can only guarantee two of three properties simultaneously:
- Consistency (C): Every read receives the most recent write (or an error).
- Availability (A): Every request receives a response (not necessarily latest data).
- Partition Tolerance (P): System continues operating despite network partition.
In reality, network partitions happen, so you choose CP or AP.
BASE (NoSQL alternative to ACID): Basically Available, Soft state, Eventually consistent.
| Database | Type | CAP | Best For |
|---|---|---|---|
| PostgreSQL, MySQL | Relational (RDBMS) | CP | Transactions, reporting, joins |
| MongoDB | Document | CP (configurable) | Flexible schemas, hierarchical data |
| Cassandra | Wide-column | AP | High-write, time-series, sensor data |
| Redis | In-memory KV | AP-leaning (async replication can lose acked writes) | Caching, sessions, rate limiting |
| DynamoDB | KV/Document | AP (eventual) / CP (strong) | AWS-native, serverless scale |
| Neo4j | Graph | CA (single-node) | Relationship queries (fraud, social) |
| Elasticsearch | Search/Document | AP | Full-text search, log analytics |
| InfluxDB | Time-series | CP | Metrics, monitoring, IoT |
MongoDB
MongoDB — Document Model & Aggregation Pipeline
MongoDB · NoSQL
from pymongo import MongoClient, ASCENDING, DESCENDING
from datetime import datetime, timedelta
client = MongoClient("mongodb://localhost:27017")
db = client["ecommerce"]
orders = db["orders"]
# ── CRUD ─────────────────────────────────────────────────
# Insert
doc_id = orders.insert_one({
"user_id": "u123",
"items": [
{"product": "Laptop", "qty": 1, "price": 999.99},
{"product": "Mouse", "qty": 2, "price": 29.99},
],
"status": "pending",
"created_at": datetime.utcnow(),
"total": 1059.97
}).inserted_id
# Query
recent = list(orders.find(
{"status": "shipped", "created_at": {"$gte": datetime.utcnow() - timedelta(days=7)}},
{"user_id": 1, "total": 1, "_id": 0} # projection
).sort("created_at", DESCENDING).limit(20))
# Update — atomic operators
orders.update_one(
{"_id": doc_id},
{
"$set": {"status": "shipped"},
"$push": {"tracking": {"event": "dispatched", "ts": datetime.utcnow()}},
"$inc": {"update_count": 1}
}
)
# ── Aggregation Pipeline ──────────────────────────────────
pipeline = [
# Stage 1: Filter
{"$match": {
"created_at": {"$gte": datetime(2024, 1, 1)},
"status": {"$in": ["shipped", "delivered"]}
}},
# Stage 2: Unwind array
{"$unwind": "$items"},
# Stage 3: Group
{"$group": {
"_id": "$items.product",
"total_revenue": {"$sum": {"$multiply": ["$items.qty", "$items.price"]}},
"units_sold": {"$sum": "$items.qty"},
"order_count": {"$addToSet": "$_id"}
}},
# Stage 4: Add computed field
{"$addFields": {
"unique_orders": {"$size": "$order_count"},
"avg_price": {"$divide": ["$total_revenue", "$units_sold"]}
}},
# Stage 5: Sort
{"$sort": {"total_revenue": -1}},
# Stage 6: Limit
{"$limit": 10},
# Stage 7: Project output
{"$project": {
"product": "$_id", "_id": 0,
"total_revenue": {"$round": ["$total_revenue", 2]},
"units_sold": 1, "unique_orders": 1
}}
]
results = list(orders.aggregate(pipeline))
# ── Indexes ────────────────────────────────────────────────
orders.create_index([("user_id", ASCENDING), ("created_at", DESCENDING)])
orders.create_index(
    [("status", ASCENDING)],
    partialFilterExpression={"status": {"$eq": "active"}},  # $ne is not allowed in partial filters
)
orders.create_index([("items.product", "text")]) # full-text index
MongoDB Schema Design Patterns
- Embed vs Reference: Embed when data is accessed together and bounded in size (1:1, 1:few). Reference for unbounded 1:many (user → orders) and many-to-many.
- Bucket pattern: Group time-series data into buckets (e.g., 1 doc = 1 hour of sensor readings) to reduce document count.
- Computed pattern: Pre-compute aggregates (total_orders) on write to speed up reads.
- Outlier pattern: For documents that rarely exceed a threshold but occasionally balloon, use an overflow flag with linked documents.
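A pure-Python sketch of the bucket (plus computed) pattern, no MongoDB required; field names are illustrative:

```python
from collections import defaultdict
from datetime import datetime

# One reading per event — the naive model would store one document each
readings = [
    ("s1", datetime(2024, 1, 1, 10, 5), 21.5),
    ("s1", datetime(2024, 1, 1, 10, 40), 22.1),
    ("s1", datetime(2024, 1, 1, 11, 2), 22.8),
]

# Bucket pattern: one document per (sensor, hour) holding an array of readings
buckets = defaultdict(lambda: {"readings": [], "count": 0})
for sensor, ts, value in readings:
    key = (sensor, ts.replace(minute=0, second=0, microsecond=0))
    b = buckets[key]
    b["readings"].append({"ts": ts, "value": value})
    b["count"] += 1  # computed pattern: maintain the aggregate on write

print(len(buckets))  # 2 bucket documents instead of 3 reading documents
```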
Apache Cassandra
Cassandra — Wide-Column Model, Partition Design & Consistency
Cassandra · NoSQL
Core Concepts
- Partition key: Determines which node holds the data (hashed). All rows with same partition key are co-located.
- Clustering columns: Define sort order within a partition. Enables efficient range queries within a partition.
- No JOINs, no transactions across partitions — denormalize; design tables around queries.
- Write path: Write to commit log (WAL) + memtable (in-memory). Memtable flushes to SSTable on disk periodically.
- Read path: Check memtable → row cache → bloom filter → SSTable (merge results). Compaction merges SSTables periodically.
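How a partition key determines placement can be sketched in a few lines. This is illustrative only: Cassandra really uses Murmur3 tokens and virtual nodes, and md5 here just stands in for a hash function:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for(partition_key: str) -> str:
    # Hash the partition key to a token, then map the token to a node.
    token = int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")
    return NODES[token % len(NODES)]

# All rows sharing a partition key hash to the same token, hence the same
# node — that co-location is what makes within-partition queries cheap.
print(node_for("user-42"), node_for("user-99"))
```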
-- CQL (Cassandra Query Language)
-- Schema design: query-first approach
-- Q: "Get all events for a user, newest first"
CREATE TABLE user_events (
user_id UUID,
event_ts TIMESTAMP,
event_id UUID,
type TEXT,
payload MAP&lt;TEXT, TEXT&gt;,
PRIMARY KEY (user_id, event_ts, event_id)
) WITH CLUSTERING ORDER BY (event_ts DESC, event_id ASC)
AND default_time_to_live = 7776000; -- auto-expire after 90 days
-- Partition key = user_id (all events for a user on one node)
-- Clustering = event_ts DESC (newest first without sorting)
-- Q: "Get events for user in last 24 hours"
SELECT * FROM user_events
WHERE user_id = f47ac10b-...
AND event_ts >= toTimestamp(now()) - 1d -- duration arithmetic (Cassandra 4.0+)
LIMIT 100;
-- Tunable consistency is set per request by the driver or cqlsh,
-- not inline in CQL (the old USING CONSISTENCY syntax was removed):
-- cqlsh> CONSISTENCY QUORUM;   -- applies to subsequent statements
-- Python driver: SimpleStatement("INSERT INTO user_events ...",
--                consistency_level=ConsistencyLevel.QUORUM)
-- Read ONE = fastest, eventual consistency
-- QUORUM for both reads and writes = strong consistency (safe but slower)
-- Replication factor 3: QUORUM = 2 replicas must ack
Cassandra Consistency Levels (replication_factor = 3):
ANY — at least 1 node (weakest; even hinted handoff counts)
ONE — 1 replica responds
TWO — 2 replicas respond
QUORUM — floor(RF/2)+1 = 2 of 3 (most common for strong consistency)
ALL — all 3 replicas (strongest, least available)
Strong consistency: Write QUORUM + Read QUORUM
(At least one replica overlaps between write and read set)
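The overlap argument is plain arithmetic; a sketch of the quorum math:

```python
def quorum(rf: int) -> int:
    # floor(RF/2) + 1, e.g. 2 of 3 replicas
    return rf // 2 + 1

def is_strong(rf: int, r: int, w: int) -> bool:
    # Reads and writes are guaranteed to overlap in at least one
    # replica whenever R + W > RF — that overlap yields strong reads.
    return r + w > rf

print(quorum(3))                            # 2
print(is_strong(3, quorum(3), quorum(3)))   # True: QUORUM writes + QUORUM reads
print(is_strong(3, 1, 1))                   # False: ONE + ONE can miss the write
```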
Use case fit:
✅ Time-series / IoT sensor data
✅ Activity feeds, audit logs (append-heavy)
✅ User session data with TTL
❌ Complex relationships needing JOINs
❌ Transactions across multiple partition keys
Redis
Redis Data Types, Patterns & Persistence
Must Know · Redis · Caching
| Data Type | Commands | Use Cases |
|---|---|---|
| String | GET/SET/INCR/EXPIRE | Cache, counters, rate limiting, session tokens |
| Hash | HGET/HSET/HMGET | User profiles, object fields |
| List | LPUSH/RPUSH/LRANGE/BLPOP | Message queues, activity feeds, recent N items |
| Set | SADD/SMEMBERS/SINTER/SUNION | Unique visitors, tags, social graph |
| Sorted Set | ZADD/ZRANGE/ZRANGEBYSCORE | Leaderboards, rate limiting (sliding window), priority queues |
| Stream | XADD/XREAD/XGROUP | Event sourcing, log aggregation (Kafka-lite) |
| Bitmap | SETBIT/GETBIT/BITCOUNT | Daily active users (1 bit per user), feature flags |
| HyperLogLog | PFADD/PFCOUNT | Approximate unique count (12KB for any cardinality) |
import redis, time, json
r = redis.Redis(host='localhost', port=6379, decode_responses=True)
# ── Caching with TTL ───────────────────────────────────────
def get_user(user_id: str) -> dict:
key = f"user:{user_id}"
cached = r.get(key)
if cached:
return json.loads(cached)
user = db.query_user(user_id) # slow DB query
r.setex(key, 3600, json.dumps(user)) # cache 1 hour
return user
# ── Distributed Rate Limiting (sliding window log) ────────
def rate_limit(user_id: str, limit: int = 100, window: int = 60) -> bool:
key = f"rate:{user_id}"
pipe = r.pipeline()
now = time.time()
pipe.zremrangebyscore(key, 0, now - window) # remove old requests
pipe.zadd(key, {str(now): now}) # add current request
pipe.zcard(key) # count in window
pipe.expire(key, window)
_, _, count, _ = pipe.execute()
return count <= limit
# ── Distributed Lock (SETNX pattern) ──────────────────────
import uuid
def acquire_lock(resource: str, ttl: int = 10) -> str | None:
lock_id = str(uuid.uuid4())
acquired = r.set(f"lock:{resource}", lock_id, nx=True, ex=ttl)
return lock_id if acquired else None
def release_lock(resource: str, lock_id: str) -> bool:
# Lua script for atomic check-and-delete
lua = """
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else return 0 end"""
return bool(r.eval(lua, 1, f"lock:{resource}", lock_id))
# ── Leaderboard (Sorted Set) ───────────────────────────────
r.zadd("leaderboard:2024", {"alice": 9500, "bob": 8700, "carol": 9200})
# Top 10
r.zrevrange("leaderboard:2024", 0, 9, withscores=True)
# User rank
r.zrevrank("leaderboard:2024", "alice") # 0-indexed rank
# ── Pub/Sub ────────────────────────────────────────────────
# Publisher
r.publish("notifications:user:u123", json.dumps({"type": "order_shipped", "order_id": "o456"}))
# Subscriber (blocking)
pubsub = r.pubsub()
pubsub.subscribe("notifications:user:u123")
for message in pubsub.listen():
if message["type"] == "message":
print(json.loads(message["data"]))
Redis Persistence Modes
- RDB (snapshots): Periodic binary dump. Fast restart, compact file. Risk: lose data since last snapshot (typically minutes).
- AOF (Append Only File): Log every write command. Configurable fsync: always (safe but slow), everysec (1s data loss risk), no (OS decides). Larger file, slower restart but better durability.
- RDB + AOF: Best of both — use AOF for durability, RDB for fast restart. Redis 7+ uses newer AOF + RDB hybrid format.
- No persistence: Pure cache — fastest, full data loss on restart. Acceptable when data can be rebuilt from primary store.
Redis Cluster vs Sentinel vs Standalone
Standalone: Single node. Simplest; no HA.
Sentinel: HA setup — 3+ sentinel processes monitor master + replicas. Auto-failover. One shard (no horizontal scale).
Cluster: Horizontal sharding across 3-N master nodes (16384 hash slots). Each master has replicas. Scales writes; multi-key commands require all keys in the same slot (use hash tags {user}:*).
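Key-to-slot mapping is CRC16 (the XMODEM variant named in the cluster spec) mod 16384, computed over the {hash tag} when one is present; a pure-Python sketch:

```python
def crc16(data: bytes) -> int:
    # CRC-16/XMODEM: poly 0x1021, init 0, no reflection (Redis cluster spec)
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    # If the key contains a non-empty {hash tag}, only the tag is hashed,
    # which forces related keys onto the same slot (multi-key ops work).
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

print(hash_slot("123456789"))  # 12739 (CRC16 check value 0x31C3)
print(hash_slot("{user:1}:profile") == hash_slot("{user:1}:orders"))  # True
```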