Alation-Specific Preparation
Company overview, most-asked problems, domain knowledge, and interview checklist.
Alation — Company Overview
What is Alation?
Core Product: Data Catalog
- Enterprise data catalog & data intelligence platform
- Helps organizations discover, understand, and govern their data
- Like "Google for your company's data" — search all tables, columns, reports
- Data lineage: track where data came from and how it transforms
- Collaboration: data stewards annotate metadata, others consume it
- Connects to: Snowflake, Redshift, BigQuery, Oracle, MySQL, S3, Tableau, etc.
Tech Stack (Known)
Python (backend services), React (frontend), PostgreSQL, Elasticsearch, Kafka, Redis, Kubernetes, AWS
Domain Concepts to Know
| Term | Meaning |
|---|---|
| Metadata | Data about data (schema, owner, tags, description, usage stats) |
| Data Lineage | Graph of data transformations: table A joins B → creates C |
| Data Steward | Person responsible for data quality/documentation |
| Data Governance | Policies about who can see/use what data |
| Catalog Object | Any indexed entity: table, column, query, report, user |
| Endorsement | User certifies a dataset as trustworthy |
| Column-level Lineage | Which source columns feed into each target column |
| Data Quality | Metrics: completeness, accuracy, freshness, uniqueness |
Most Asked DSA at Alation
Prioritize These Topics
Based on interview reports: Graph BFS/DFS, LRU Cache, Clone Graph, Trie operations, Sliding Window, Top-K with Heap, DP (Coin Change, LCS)
Graph Problems HOT
Why Graph for Alation?
- Data lineage is a DAG (Directed Acyclic Graph)
- Table relationships = graph nodes + edges
- Impact analysis = DFS/BFS on lineage graph (see the sketch below the table)
| Problem | Pattern |
|---|---|
| Clone Graph | BFS + HashMap |
| Course Schedule | Topological Sort (Cycle Detection) |
| Number of Islands | DFS/BFS + visited |
| Word Ladder | BFS Shortest Path |
| All Paths from Source to Target | DFS Backtracking |
| Find if Path Exists in Graph | BFS/Union-Find |
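Since impact analysis is just BFS over the lineage DAG, here is a minimal Python sketch (the table names and adjacency-list shape are hypothetical, not Alation's API):

from collections import deque

def downstream_impact(lineage, table):
    """BFS over a lineage graph (dict: table -> downstream tables).
    Returns everything affected if `table` changes."""
    impacted, queue = set(), deque([table])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

lineage = {"raw.orders": ["staging.orders"],
           "staging.orders": ["mart.revenue", "mart.churn"]}
print(downstream_impact(lineage, "raw.orders"))
# {'staging.orders', 'mart.revenue', 'mart.churn'}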
Design Problems HOT
| Problem | Key Concept |
|---|---|
| LRU Cache | OrderedDict or DLL+HashMap (sketch below) |
| LFU Cache | Two HashMaps + min frequency tracker |
| Design Search Autocomplete | Trie with top-K |
| Implement Trie (Prefix Tree) | Node with children dict + is_end |
| Design Hit Counter | Sliding window / deque |
| Design File System | Trie / HashMap of paths |
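LRU Cache is reported even more often than LFU; the OrderedDict version from the table above is short enough to memorize:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.cap = capacity
        self.data = OrderedDict()   # insertion order doubles as recency order

    def get(self, key):
        if key not in self.data:
            return -1
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.cap:
            self.data.popitem(last=False)  # evict least recently used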
LFU Cache — Hard Design Problem
Hard · Design · Two HashMaps
from collections import defaultdict, OrderedDict

class LFUCache:
    def __init__(self, capacity):
        self.cap = capacity
        self.min_freq = 0
        self.key_val = {}                          # key -> value
        self.key_freq = {}                         # key -> freq
        self.freq_keys = defaultdict(OrderedDict)  # freq -> OrderedDict of keys (insertion order = recency)

    def _update_freq(self, key):
        freq = self.key_freq[key]
        self.key_freq[key] = freq + 1
        del self.freq_keys[freq][key]
        if not self.freq_keys[freq] and freq == self.min_freq:
            self.min_freq += 1
        self.freq_keys[freq + 1][key] = None

    def get(self, key):
        if key not in self.key_val:
            return -1
        self._update_freq(key)
        return self.key_val[key]

    def put(self, key, value):
        if self.cap <= 0:
            return
        if key in self.key_val:
            self.key_val[key] = value
            self._update_freq(key)
        else:
            if len(self.key_val) >= self.cap:
                # Evict least frequent (oldest if tie)
                evict_key, _ = self.freq_keys[self.min_freq].popitem(last=False)
                del self.key_val[evict_key]
                del self.key_freq[evict_key]
            self.key_val[key] = value
            self.key_freq[key] = 1
            self.freq_keys[1][key] = None
            self.min_freq = 1
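A quick sanity check of the eviction order (lowest frequency first, oldest on a tie):

cache = LFUCache(2)
cache.put(1, 'a'); cache.put(2, 'b')
cache.get(1)          # key 1 now has freq 2, key 2 still freq 1
cache.put(3, 'c')     # evicts key 2: least frequent
print(cache.get(2))   # -1
print(cache.get(1))   # a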
Trie (Prefix Tree) — Implement & Use
Medium · Trie · Search
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False
        self.count = 0       # number of strings ending here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end = True
        node.count += 1

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.is_end

    def startsWith(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return False
            node = node.children[char]
        return True

    def autocomplete(self, prefix):
        """Return all words with the given prefix"""
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        results = []
        self._dfs(node, prefix, results)
        return results

    def _dfs(self, node, path, results):
        if node.is_end:
            results.append(path)
        for char, child in node.children.items():
            self._dfs(child, path + char, results)
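Usage on catalog-style names (result order follows the insertion order of the children dicts):

trie = Trie()
for name in ["customers", "customer_orders", "cust_dim"]:
    trie.insert(name)
print(trie.autocomplete("cust"))     # ['customers', 'customer_orders', 'cust_dim']
print(trie.search("customers"))      # True
print(trie.startsWith("customer_"))  # True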
LLD Problems — Alation Domain
Design Metadata Catalog System (LLD)
Must Know · Alation Core
If Asked "Design a Data Catalog LLD"
This is Alation's product. Knowing the domain deeply will impress the interviewer.
// Core Entities
enum ObjectType { DATABASE, SCHEMA, TABLE, COLUMN, QUERY, REPORT, DASHBOARD }

class CatalogObject {
    private long id;
    private ObjectType type;
    private String name;
    private String description;
    private String owner;
    private List<Tag> tags;
    private Map<String, String> customFields;
    private LocalDateTime createdAt, updatedAt;
    private User steward;
}

class Tag {
    private long id;
    private String name;
    private String color;
}

// Table Metadata
class TableMetadata extends CatalogObject {
    private String databaseName;
    private String schemaName;
    private List<ColumnMetadata> columns;
    private long rowCount;
    private LocalDateTime lastModified;
    private String location; // e.g., S3 path for data lake
    private EndorsementStatus endorsement;
    private List<String> popularQueries;
}

class ColumnMetadata extends CatalogObject {
    private String dataType;
    private boolean nullable;
    private boolean isPrimaryKey;
    private boolean isForeignKey;
    private String referencedTable;
    private SensitivityLevel sensitivity; // PII, sensitive, public
}

// Data Lineage Graph
class LineageNode {
    private CatalogObject object;
    private List<LineageEdge> upstreamEdges;   // where data comes from
    private List<LineageEdge> downstreamEdges; // where data goes to
    // assume getId() delegates to object's id
}

class LineageEdge {
    private LineageNode source;
    private LineageNode target;
    private String transformationSQL;
    private String jobName; // ETL job that created this edge
    private LocalDateTime capturedAt;
}

// Catalog Service
class CatalogService {
    private Map<Long, CatalogObject> objectStore;
    private SearchEngine searchEngine; // wraps Elasticsearch
    private LineageGraph lineageGraph;

    // Search across all metadata
    public List<CatalogObject> search(SearchQuery query) {
        return searchEngine.search(query.getKeyword(), query.getFilters());
    }

    // Get upstream lineage (where does this table get data from?)
    public List<CatalogObject> getUpstreamLineage(long tableId, int depth) {
        LineageNode node = lineageGraph.getNode(tableId);
        return lineageGraph.bfsUpstream(node, depth);
    }

    // Impact analysis: if table X changes, what breaks?
    public List<CatalogObject> getDownstreamImpact(long tableId) {
        LineageNode node = lineageGraph.getNode(tableId);
        return lineageGraph.dfsDownstream(node); // analogous to bfsUpstream, follows downstreamEdges
    }

    // Add custom description/tag
    public void annotate(long objectId, String userId, Annotation annotation) {
        CatalogObject obj = objectStore.get(objectId);
        obj.addAnnotation(annotation);
        searchEngine.reindex(obj); // update search index
        auditLog(userId, "ANNOTATE", objectId);
    }
}

// Lineage Graph using BFS/DFS
class LineageGraph {
    private Map<Long, LineageNode> nodes;

    public List<CatalogObject> bfsUpstream(LineageNode start, int maxDepth) {
        List<CatalogObject> result = new ArrayList<>();
        Queue<LineageNode> queue = new LinkedList<>();
        Set<Long> visited = new HashSet<>();
        queue.offer(start);
        visited.add(start.getId());
        int depth = 0;
        while (!queue.isEmpty() && depth < maxDepth) {
            int size = queue.size(); // snapshot: process exactly one lineage hop per pass
            for (int i = 0; i < size; i++) {
                LineageNode curr = queue.poll();
                for (LineageEdge edge : curr.getUpstreamEdges()) {
                    LineageNode upstream = edge.getSource();
                    if (!visited.contains(upstream.getId())) {
                        visited.add(upstream.getId());
                        result.add(upstream.getObject());
                        queue.offer(upstream);
                    }
                }
            }
            depth++;
        }
        return result;
    }
}
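Worth calling out in an interview: the size snapshot inside the while loop makes the traversal level-by-level, so each outer iteration consumes exactly one lineage hop and maxDepth cleanly bounds how far upstream you walk. That matters when a production lineage graph spans thousands of tables and the UI only needs the first few hops.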
Common Interview Questions & Answers
Q: How would you design the search feature for a data catalog?
Answer framework:
- Clarify: What types of objects? Tables, columns, reports? Real-time or batch indexing? How many objects?
- Search Engine: Elasticsearch with a custom analyzer; index table name, description, column names, tags, owner, data types
- Ranking: boost by endorsement score, usage frequency, recency of updates, and the user's team
- Faceted Search: filter by data source, owner, data type, tags, sensitivity level
- Autocomplete: Trie or the Elasticsearch suggest API on table/column names
- Indexing: when metadata changes (Kafka event), an Elasticsearch consumer updates the index (see the sketch below)
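A hedged sketch of the query side with the Python Elasticsearch client (v8-style call); the index name, fields, and boosts are illustrative, not Alation's actual schema:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Hypothetical "catalog" index: boost name/column matches over descriptions,
# then scale relevance by a stored usage signal.
resp = es.search(index="catalog", query={
    "function_score": {
        "query": {"multi_match": {
            "query": "customer revenue",
            "fields": ["name^3", "column_names^2", "description", "tags"],
        }},
        "field_value_factor": {"field": "usage_count", "modifier": "log1p"},
    },
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_score"])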
Q: Explain how data lineage can be represented and queried efficiently.
Data lineage as a DAG (Directed Acyclic Graph):
- Nodes = catalog objects (tables, columns, ETL jobs, reports)
- Edges = transformations/dependencies with metadata (SQL, job name, timestamp)
- Direction: from source to destination
Storage:
- Graph DB (Neo4j) for complex traversal queries
- Or: an adjacency list in PostgreSQL for simpler setups (recursive-CTE sketch below)
Queries:
- Upstream lineage: BFS/DFS backwards from target node
- Downstream impact: BFS/DFS forwards from source node
- Column-level lineage: Track which source columns map to target columns
- Neo4j Cypher:
MATCH (t:Table {id:123})-[:FEEDS_INTO*1..5]->(d) RETURN d
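For the PostgreSQL adjacency-list option, a recursive CTE answers "upstream within N hops" in one round trip. A sketch assuming a lineage_edges(source_id, target_id) table:

import psycopg2

# Walk upstream: start from edges feeding the target, then repeatedly
# pull in edges that feed the sources already found, up to max_depth hops.
UPSTREAM_SQL = """
WITH RECURSIVE upstream AS (
    SELECT source_id, target_id, 1 AS depth
    FROM lineage_edges
    WHERE target_id = %s
    UNION
    SELECT e.source_id, e.target_id, u.depth + 1
    FROM lineage_edges e
    JOIN upstream u ON e.target_id = u.source_id
    WHERE u.depth < %s
)
SELECT DISTINCT source_id FROM upstream;
"""

def upstream_lineage(conn, table_id, max_depth=5):
    with conn.cursor() as cur:
        cur.execute(UPSTREAM_SQL, (table_id, max_depth))
        return [row[0] for row in cur.fetchall()]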
Q: How would you handle real-time metadata updates at scale?
- CDC (Change Data Capture): use Debezium to stream DB transaction logs into Kafka
- Event Sourcing: every metadata change is an event (SchemaChanged, OwnerUpdated, TagAdded)
- Kafka Topics: separate topics per entity type, partitioned by entity ID so per-entity ordering is preserved
- Consumer Groups: the search indexer, lineage updater, and notification service each get their own group and process independently
- Consistency: eventual consistency is acceptable; the search index may run seconds behind
- Idempotency: each consumer is idempotent, so events can safely be replayed after a processing failure (see the consumer sketch below)
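A minimal consumer sketch with the kafka-python client; the topic name, event fields, and handler are assumptions for illustration:

import json
from kafka import KafkaConsumer  # kafka-python

# One consumer group per downstream concern (this one is the search indexer),
# so a slow indexer never blocks notifications or lineage updates.
consumer = KafkaConsumer(
    "metadata-events",                     # hypothetical topic name
    bootstrap_servers=["localhost:9092"],
    group_id="search-indexer",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    enable_auto_commit=False,              # commit only after successful processing
)

seen = set()  # in production: a persisted dedupe store, not process memory

for msg in consumer:
    event = msg.value
    key = (event["entity_id"], event["version"])  # assumed event fields
    if key not in seen:                           # idempotency: replays become no-ops
        # reindex(event)  <- hypothetical handler that updates Elasticsearch
        seen.add(key)
    consumer.commit()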
Q: Design a system to detect when a dataset has not been refreshed in 24 hours (data freshness alert).
Approach:
- Store last_refreshed_at timestamp in catalog metadata DB
- Scheduled job (cron every 30 min) queries: SELECT * FROM tables WHERE expected_refresh = 'daily' AND last_refreshed_at < NOW() - INTERVAL '24 hours'
- Stale datasets → publish to Kafka topic "freshness-alerts"
- Notification worker consumes → sends email/Slack to data owner
- Idempotency: track which alerts were sent (alert_sent_at column) to avoid duplicate alerts
- Scale: with millions of tables, partition alerts by owner team for parallel processing
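Put together, the cron job body could look like this sketch; the columns match those named above, and the tables schema (id, owner_email) is an assumption:

import json
import psycopg2
from kafka import KafkaProducer

# Find daily tables that are stale AND have not been alerted since going stale.
STALE_SQL = """
SELECT id, owner_email FROM tables
WHERE expected_refresh = 'daily'
  AND last_refreshed_at < NOW() - INTERVAL '24 hours'
  AND (alert_sent_at IS NULL OR alert_sent_at < last_refreshed_at)
"""

def check_freshness(conn, producer):
    with conn.cursor() as cur:
        cur.execute(STALE_SQL)
        for table_id, owner in cur.fetchall():
            producer.send("freshness-alerts",
                          json.dumps({"table_id": table_id, "owner": owner}).encode())
            cur.execute("UPDATE tables SET alert_sent_at = NOW() WHERE id = %s",
                        (table_id,))
    conn.commit()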
Q: What OOP design would you use for a metadata connector system (supporting MySQL, Snowflake, BigQuery)?
// Abstract Connector (Template Method pattern)
// Abstract Connector (Template Method pattern)
abstract class MetadataConnector {
    // Template Method — defines the algorithm skeleton
    public final List<TableMetadata> crawl() {
        connect();
        authenticate();
        List<TableMetadata> metadata = fetchMetadata();
        disconnect();
        return metadata;
    }
    protected abstract void connect();
    protected abstract void authenticate();
    protected abstract List<TableMetadata> fetchMetadata();
    protected abstract void disconnect();
    public abstract String getConnectorType();
}

class SnowflakeConnector extends MetadataConnector {
    private String account, username, password;
    // constructor taking ConnectionConfig omitted for brevity
    protected void connect() { /* JDBC connection */ }
    protected void authenticate() { /* OAuth or password auth */ }
    protected List<TableMetadata> fetchMetadata() {
        // Query INFORMATION_SCHEMA.TABLES
        return List.of();
    }
    protected void disconnect() { /* close connection */ }
    public String getConnectorType() { return "SNOWFLAKE"; }
}

class BigQueryConnector extends MetadataConnector {
    protected void connect() { /* Google Cloud client */ }
    protected void authenticate() { /* Service account */ }
    protected List<TableMetadata> fetchMetadata() {
        // Use BigQuery API to list datasets/tables
        return List.of();
    }
    protected void disconnect() { /* cleanup */ }
    public String getConnectorType() { return "BIGQUERY"; }
}

// Factory to create connectors
class ConnectorFactory {
    public static MetadataConnector create(ConnectionConfig config) {
        switch (config.getType()) {
            case "SNOWFLAKE": return new SnowflakeConnector(config);
            case "BIGQUERY": return new BigQueryConnector(config);
            case "MYSQL": return new MySQLConnector(config); // defined analogously
            default: throw new IllegalArgumentException("Unknown connector: " + config.getType());
        }
    }
}
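The split is worth naming out loud: the Template Method pins the crawl lifecycle (connect, authenticate, fetch, disconnect) so a subclass cannot reorder it, while the Factory confines construction, so adding a new source is one new class plus one switch case with no caller changes. That is the Open/Closed Principle in practice.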
Pre-Interview Checklist
Topics to Review
- Graph BFS/DFS (Clone Graph, Course Schedule)
- LRU Cache (OrderedDict & DLL+HashMap)
- Trie implementation & autocomplete
- Sliding Window (Longest Substring Without Repeating Characters, Minimum Window Substring)
- Dynamic Programming (Coin Change, LCS)
- Merge Intervals + sorting
- Top-K with Heap (nlargest, heapq)
- SOLID Principles (be able to explain with examples)
- Singleton, Factory, Observer, Strategy patterns
- Parking Lot LLD (full implementation)
- CAP Theorem + trade-offs
- Consistent Hashing
- Kafka architecture + consumer groups
- Data Catalog system design
- Alation product knowledge
Communication Tips
State assumptions out loud
"I'm assuming the input array is not sorted and can contain duplicates. Is that correct?"
Talk about trade-offs
"We could use DFS (O(V+E) space) or BFS (better for shortest path). Given we need shortest path, BFS is better."
Think about scale
Even in LLD: "For a high-traffic system, I'd add a cache here and use async processing via Kafka."
Name design patterns
"This is a classic Factory pattern — lets us add new connectors without modifying existing code."
Acknowledge imperfections
"This O(n²) approach works for small input, but for n=10⁶ we'd need O(n log n) sorting approach."
Python Syntax I Always Forget
Quick Reference
# ─── Collections ────────────────────────────────
from collections import defaultdict, Counter, deque, OrderedDict
import heapq, bisect
# defaultdict: no KeyError on missing key
dd = defaultdict(list) # default = empty list
dd = defaultdict(int) # default = 0
dd = defaultdict(set) # default = empty set
# Counter: frequency map
c = Counter("hello") # {'l':2,'h':1,'e':1,'o':1}
c.most_common(2) # [('l',2),('h',1)]
c.update("world") # add more counts
c1 + c2 # merge counters
# deque: O(1) append/pop from both ends
dq = deque([1,2,3])
dq.appendleft(0) # [0,1,2,3]
dq.popleft() # O(1) pop from front
# ─── Heap ────────────────────────────────────────
heap = []
heapq.heappush(heap, 5)
heapq.heappop(heap) # min element
heapq.heappushpop(heap, x) # push then pop (efficient)
heapq.nlargest(3, arr) # 3 largest elements
heapq.nsmallest(3, arr) # 3 smallest elements
# MAX HEAP: negate values
heapq.heappush(heap, -val); -heapq.heappop(heap)
# Heap with tuples: sorted by first element
heapq.heappush(heap, (distance, node_id, node))
dist, nid, node = heapq.heappop(heap)
# ─── Sorting ────────────────────────────────────
arr.sort(key=lambda x: (x[1], x[0])) # sort by [1], then [0]
arr.sort(key=lambda x: -x[0]) # descending
arr.sort(key=lambda x: x[0], reverse=True)
# ─── Binary Search ──────────────────────────────
import bisect
bisect.bisect_left(arr, x) # leftmost index where x can be inserted
bisect.bisect_right(arr, x) # index after any entries equal to x
bisect.insort(arr, x) # insert x keeping arr sorted
# ─── String ────────────────────────────────────
s.split() # split on whitespace
s.split(',') # split on comma
','.join(['a','b','c']) # "a,b,c"
s.strip(); s.lstrip(); s.rstrip()
s.startswith('abc'); s.endswith('xyz')
s.replace('old','new')
s[::-1] # reverse string
ord('a')   # -> 97
chr(97)    # -> 'a'
# ─── Dict / Set ─────────────────────────────────
d = {}
d.get(key, default) # safe get
d.setdefault(key, []).append(v)
{k:v for k,v in d.items() if v > 0} # dict comprehension
s = set(); s.add(x); s.discard(x)
s1 & s2 # intersection
s1 | s2 # union
s1 - s2 # difference
# ─── List / Matrix ──────────────────────────────
matrix = [[0]*cols for _ in range(rows)] # 2D array
flat = [x for row in matrix for x in row] # flatten
zip(*matrix) # transpose
# ─── Enumerate / Zip ────────────────────────────
for i, val in enumerate(arr): # (index, value)
for a, b in zip(arr1, arr2): # parallel iteration
# ─── Graph Template ─────────────────────────────
from collections import defaultdict, deque

def bfs(graph, start):
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited  # all nodes reachable from start

def dfs(graph, node, visited=None):
    if visited is None:
        visited = set()
    visited.add(node)
    for neighbor in graph[node]:
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
    return visited
Java Syntax I Always Forget
Quick Reference
// ─── Collections ──────────────────────────────────
import java.util.*;
import java.util.stream.*;
List<Integer> list = new ArrayList<>();
list.add(5); list.get(0); list.size();
Collections.sort(list);
Collections.sort(list, (a, b) -> b - a); // descending
list.sort(Comparator.comparingInt(x -> x[0])); // when list is a List<int[]>
Map<String, Integer> map = new HashMap<>();
map.put("key", 1);
map.get("key"); // null if not found
map.getOrDefault("key", 0); // safe get
map.putIfAbsent("key", 0);
map.containsKey("key");
for (Map.Entry<String, Integer> e : map.entrySet())
System.out.println(e.getKey() + " = " + e.getValue());
Set<String> set = new HashSet<>();
set.add("x"); set.contains("x"); set.remove("x");
// ─── PriorityQueue (Min Heap) ──────────────────
PriorityQueue<Integer> minHeap = new PriorityQueue<>();
PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> a[0] - b[0]); // sort by [0]
pq.offer(arr); pq.poll(); pq.peek();
// ─── Queue / Stack ────────────────────────────
Queue<Integer> queue = new LinkedList<>();
queue.offer(1); queue.poll(); queue.peek();
Deque<Integer> deque = new ArrayDeque<>();
deque.addFirst(x); deque.addLast(x);
deque.removeFirst(); deque.removeLast();
deque.peekFirst(); deque.peekLast();
Stack<Integer> stack = new Stack<>(); // or use Deque
stack.push(1); stack.pop(); stack.peek();
// ─── Arrays ───────────────────────────────────
int[] arr = new int[n];
Arrays.fill(arr, 0);
Arrays.sort(arr);
Arrays.sort(arr2d, (a, b) -> a[1] - b[1]); // Object arrays only (e.g., int[][] arr2d), not int[]
int[] copy = Arrays.copyOf(arr, arr.length);
int[][] matrix = new int[rows][cols];
// ─── String ────────────────────────────────────
String s = "hello";
s.charAt(0); s.length(); s.substring(1,3);
s.contains("el"); s.startsWith("he"); s.endsWith("lo");
s.toLowerCase(); s.toUpperCase();
String.valueOf(123); // int to String
Integer.parseInt("123"); // String to int
char[] chars = s.toCharArray();
String joined = String.join(",", "a","b","c");
StringBuilder sb = new StringBuilder();
sb.append("hello"); sb.toString();
// ─── Streams ────────────────────────────────────
List<Integer> result = list.stream()
.filter(x -> x > 0)
.map(x -> x * 2)
.sorted()
.collect(Collectors.toList());
long count = list.stream().filter(x -> x > 0).count();
Optional<Integer> max = list.stream().max(Integer::compareTo);
int sum = list.stream().mapToInt(Integer::intValue).sum();
// ─── Graph BFS Template ────────────────────────
public List<Integer> bfs(Map<Integer, List<Integer>> graph, int start) {
    List<Integer> result = new ArrayList<>();
    Set<Integer> visited = new HashSet<>();
    Queue<Integer> queue = new LinkedList<>();
    queue.offer(start);
    visited.add(start);
    while (!queue.isEmpty()) {
        int node = queue.poll();
        result.add(node);
        for (int neighbor : graph.getOrDefault(node, new ArrayList<>())) {
            if (!visited.contains(neighbor)) {
                visited.add(neighbor);
                queue.offer(neighbor);
            }
        }
    }
    return result;
}