LinkedIn Data Engineer Interview Walkthrough 2026: A Deep Dive into Graph Data Analysis and Engineering
Walkthrough of a Data Engineer interview at LinkedIn — SQL for professional network analysis, Python for graph ETL and job recommendation pipelines, graph data modeling at billion-node scale, and real architecture for the People Graph.
I walked into the Data Engineer interview at LinkedIn knowing I was in for something different from the typical analytics warehouse interview. LinkedIn’s core product is a graph — over a billion professionals, tens of billions of connections, companies, skills, endorsements, and job postings, all linked together in what they call the People Graph. As a data engineer, you’re not just moving rows from A to B. You’re building the infrastructure that powers network effects, job recommendations, and “People You May Know” for the world’s largest professional network.
The interview was structured around LinkedIn’s actual graph data stack and real challenges in professional network analytics. Below I’ll walk through each phase, the problems I was given, my solutions, and what the interviewers were really looking for.
Phase 1: Introduction and Graph Intuition
The interviewer started with a question that set the tone for the entire interview:
“If you could compute one graph metric on LinkedIn’s People Graph that would tell us something we don’t already know about how professionals find jobs, what would it be and how would you compute it?”
I thought about the core value proposition of LinkedIn. It’s not just a resume database — it’s a network that connects people through relationships, skills, and shared experiences. My answer:
Path-to-hire network distance. Specifically, I’d measure the average shortest path length between a person and their new employer’s key decision-makers before they apply versus after they connect with them through the network. This tells us whether LinkedIn’s network effects actually shorten the job search, and for which segments.
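To make the metric concrete, here's a toy networkx sketch of the before/after measurement (hypothetical member IDs on a tiny hand-built graph, not real LinkedIn data):
import networkx as nx
# Toy connection graph; integer node IDs stand in for members
g = nx.Graph()
g.add_edges_from([(1, 2), (2, 3), (3, 4), (1, 5), (5, 4)])
member = 1
decision_makers = [3, 4]  # hypothetical hiring managers at the target company
def avg_distance(graph, source, targets):
    """Average shortest-path length from source to the reachable targets."""
    lengths = nx.single_source_shortest_path_length(graph, source)
    reachable = [lengths[t] for t in targets if t in lengths]
    return sum(reachable) / len(reachable) if reachable else float("inf")
before = avg_distance(g, member, decision_makers)
g.add_edge(member, 3)  # the member connects with one decision-maker
after = avg_distance(g, member, decision_makers)
print(before, after)  # 2.0 -> 1.5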
The interviewer smiled and said:
“That’s exactly the kind of thinking we need. Now let’s see if you can actually compute it.”
Phase 2: SQL — Professional Network Analysis
I was given a simplified schema representing LinkedIn’s internal data warehouse — a star-like schema built on top of the graph data, optimized for analytical queries. Here’s what I was shown:
-- people: core profile table (one row per member)
people (
member_id BIGINT, -- unique member identifier
headline VARCHAR(256),
current_company_id BIGINT,
current_title VARCHAR(256),
location_id INT,
joined_date DATE,
profile_views_30d INT -- pre-aggregated for performance
)
-- connections: the graph edges (bidirectional, denormalized)
connections (
source_member_id BIGINT, -- member A
target_member_id BIGINT, -- member B
connection_degree INT, -- 1, 2, or 3
created_at TIMESTAMP,
is_mutual BOOLEAN -- TRUE if A follows B and B follows A
)
-- skills: member-to-skill mapping
skills (
member_id BIGINT,
skill_name VARCHAR(128),
endorsement_count INT,
added_at TIMESTAMP
)
-- jobs: current job postings
jobs (
job_id BIGINT,
company_id BIGINT,
title VARCHAR(256),
required_skills VARCHAR(256), -- comma-separated skill names
location_id INT,
posted_at TIMESTAMP,
application_count INT
)
-- applications: job applications
applications (
application_id BIGINT,
member_id BIGINT,
job_id BIGINT,
applied_at TIMESTAMP,
status VARCHAR(32) -- 'viewed', 'interviewing', 'offered', 'rejected'
)
Question 1: Second-Degree Network Influence on Job Applications
“Write a query that identifies, for each job posting, the top 5 members who have the most second-degree connections to applicants. These are the ‘influencers’ who could help spread awareness of the job through their network.”
This is a real problem LinkedIn solves — finding members to target for “job alerts” based on their network proximity to interested candidates.
My Solution
WITH applicants AS (
-- Get all members who applied to each job
SELECT
a.job_id,
a.member_id AS applicant_id
FROM applications a
WHERE a.status IN ('viewed', 'interviewing', 'offered')
),
second_degree_connections AS (
-- For each applicant, find their 1st-degree connections
-- Then find the 2nd-degree connections (connections of connections)
SELECT
ap.job_id,
c2.target_member_id AS influencer_id,
COUNT(DISTINCT ap.applicant_id) AS connected_applicants
FROM applicants ap
-- First degree: applicant's connections
JOIN connections c1
ON c1.source_member_id = ap.applicant_id
AND c1.connection_degree = 1
-- Second degree: their connections' connections
JOIN connections c2
ON c2.source_member_id = c1.target_member_id
AND c2.connection_degree = 1
AND c2.target_member_id != ap.applicant_id -- exclude self
AND c2.target_member_id NOT IN (
-- Exclude members who already applied
SELECT applicant_id
FROM applicants a2
WHERE a2.job_id = ap.job_id
)
GROUP BY ap.job_id, c2.target_member_id
HAVING COUNT(DISTINCT ap.applicant_id) >= 2
),
ranked_influencers AS (
SELECT
job_id,
influencer_id,
connected_applicants,
ROW_NUMBER() OVER (
PARTITION BY job_id
ORDER BY connected_applicants DESC
) AS rank
FROM second_degree_connections
)
SELECT
ri.job_id,
ri.influencer_id,
p.headline AS influencer_headline,
p.current_company_id AS influencer_company,
ri.connected_applicants,
j.title AS job_title,
j.application_count
FROM ranked_influencers ri
JOIN people p
ON ri.influencer_id = p.member_id
JOIN jobs j
ON ri.job_id = j.job_id
WHERE ri.rank <= 5
ORDER BY ri.job_id, ri.connected_applicants DESC;
What the Interviewer Was Testing
The interviewer noted several key things:
- Self-exclusion — I correctly excluded the applicant themselves from their own second-degree network
- Already-applied exclusion — I filtered out members who already applied, since the goal is to reach new candidates
- Minimum threshold — the HAVING >= 2 clause ensures we only surface influencers with meaningful overlap
- Window function ranking — ROW_NUMBER() cleanly picks the top 5 per job
Then came the scale question:
“This query joins connections twice. That table has 50 billion rows. How do you make this actually run?”
I laid out a three-part strategy:
-- Pre-compute second-degree connections in a materialized table
-- refreshed daily, partitioned by date
CREATE OR REPLACE TABLE daily_second_degree_influencers (
snapshot_date DATE,
job_id BIGINT,
influencer_id BIGINT,
connected_applicants INT
) AS
-- Materialize the heavy CTE above, partitioned by job_id
WITH precomputed AS (
/* same logic as above */
)
SELECT * FROM precomputed;
-- The serving query becomes a cheap, partition-pruned lookup
SELECT * FROM daily_second_degree_influencers
WHERE snapshot_date = CURRENT_DATE
AND job_id = 987654321 -- top 5 influencers for a single job
ORDER BY connected_applicants DESC
LIMIT 5;
The interviewer nodded and added that LinkedIn actually does this — they maintain precomputed adjacency aggregations in their Hadoop-based data platform, refreshed in hourly micro-batches.
Question 2: Job Match Score with Skill Overlap and Network Proximity
“Now build a query that scores each member’s fit for a specific job posting, combining their skill match and their network proximity to current employees at the hiring company.”
This is the core of LinkedIn’s “Jobs you may be interested in” feature — it’s not just about skills. The network matters. If you have 3 friends who work at a company, you’re more likely to thrive there.
My Solution
WITH target_job AS (
SELECT * FROM jobs WHERE job_id = 987654321
),
member_skills AS (
-- Get the candidate's skills
SELECT
s.member_id,
s.skill_name,
s.endorsement_count
FROM skills s
WHERE s.member_id IN (
-- Only consider members in the relevant location
SELECT member_id FROM people
WHERE location_id = (SELECT location_id FROM target_job)
)
),
skill_match AS (
-- Calculate skill overlap score
SELECT
ms.member_id,
COUNT(DISTINCT ms.skill_name) AS matched_skills,
SUM(ms.endorsement_count) AS total_endorsements_for_matched,
-- Normalize: ratio of matched skills to required skills
ROUND(
1.0 * COUNT(DISTINCT ms.skill_name)
/ NULLIF(
(SELECT CARDINALITY(
STRING_TO_ARRAY((SELECT required_skills FROM target_job), ','))
), 0
),
4
) AS skill_overlap_ratio
FROM member_skills ms
CROSS JOIN LATERAL (
SELECT unnest(
STRING_TO_ARRAY(
(SELECT required_skills FROM target_job),
','
)
) AS required_skill
) req
WHERE LOWER(TRIM(ms.skill_name)) = LOWER(TRIM(req.required_skill))
GROUP BY ms.member_id
),
network_proximity AS (
-- Count first and second-degree connections to current employees
-- at the hiring company
SELECT
c.source_member_id AS candidate_id,
COUNT(DISTINCT CASE WHEN c.connection_degree = 1
THEN c.target_member_id END) AS direct_connections,
COUNT(DISTINCT CASE WHEN c.connection_degree = 2
THEN c.target_member_id END) AS second_degree_connections
FROM connections c
WHERE c.target_member_id IN (
-- Current employees at the hiring company
SELECT member_id FROM people
WHERE current_company_id = (SELECT company_id FROM target_job)
)
GROUP BY c.source_member_id
),
combined_score AS (
SELECT
sm.member_id,
p.headline,
p.current_title,
-- Skill component (60% weight)
COALESCE(sm.skill_overlap_ratio, 0) * 0.6 AS skill_score,
-- Network component (40% weight)
LEAST(
1.0,
(COALESCE(np.direct_connections, 0) * 0.1
+ COALESCE(np.second_degree_connections, 0) * 0.01)
) * 0.4 AS network_score,
-- Combined
ROUND(
COALESCE(sm.skill_overlap_ratio, 0) * 0.6
+ LEAST(
1.0,
(COALESCE(np.direct_connections, 0) * 0.1
+ COALESCE(np.second_degree_connections, 0) * 0.01)
) * 0.4,
4
) AS job_match_score
FROM skill_match sm
JOIN people p ON sm.member_id = p.member_id
LEFT JOIN network_proximity np ON sm.member_id = np.candidate_id
)
SELECT
member_id,
headline,
current_title,
skill_score,
network_score,
job_match_score,
RANK() OVER (ORDER BY job_match_score DESC) AS candidate_rank
FROM combined_score
WHERE job_match_score > 0.2 -- minimum threshold
ORDER BY job_match_score DESC
LIMIT 100;
Discussion: Tuning the Weights
The interviewer pushed back on my 60/40 split:
“What if we’re hiring for a very niche role where there are only 3 people in the world with the right skills? Network proximity should dominate.”
I agreed and proposed a dynamic weighting approach:
-- Dynamic weight based on skill scarcity
-- If few members have all required skills, weight network higher
WITH skill_scarcity AS (
SELECT
COUNT(*) AS members_with_skills,
(SELECT COUNT(*) FROM people WHERE location_id = 123) AS total_pool,
ROUND(1.0 * COUNT(*) / NULLIF((SELECT COUNT(*) FROM people WHERE location_id = 123), 0), 4) AS scarcity_ratio
FROM skill_match
)
-- If scarcity_ratio < 0.01, shift to 30% skills / 70% network
-- If scarcity_ratio > 0.1, use 70% skills / 30% network
-- This is implemented as a linear interpolation in the scoring function
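Translating those comments into code, here is a minimal Python sketch of the weight blending. The 0.01 and 0.1 breakpoints come from the comments above; the function itself is my own illustration, not LinkedIn's scoring code:
def blend_weights(scarcity_ratio: float) -> tuple[float, float]:
    """Return (skill_weight, network_weight) from the skill-scarcity ratio.
    Below 1% scarcity the split is 30/70 (network-heavy); above 10% it is
    70/30 (skill-heavy); in between we interpolate linearly."""
    lo, hi = 0.01, 0.10
    t = min(max((scarcity_ratio - lo) / (hi - lo), 0.0), 1.0)
    skill_weight = 0.3 + t * (0.7 - 0.3)
    return skill_weight, 1.0 - skill_weight
print(blend_weights(0.005))  # (0.3, 0.7) -- niche role, network dominates
print(blend_weights(0.055))  # (0.5, 0.5) -- middle of the range
print(blend_weights(0.200))  # (0.7, 0.3) -- common skills, skills dominate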
The interviewer said this is exactly how LinkedIn’s recommendation system works — the weights are not hardcoded but learned from click-through and application data using a gradient-boosted model. The data engineering challenge is building the feature pipeline that feeds those weights.
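Purely as an illustration of the "learned weights" idea — synthetic data and a stock scikit-learn model, not LinkedIn's actual system — the hand-tuned split effectively becomes feature importances fit on past outcomes:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
rng = np.random.default_rng(7)
# Synthetic feature matrix: [skill_overlap_ratio, network_proximity_score]
X = rng.random((5000, 2))
# Synthetic label: did the member apply after seeing the recommendation?
y = (0.4 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(0, 0.1, 5000) > 0.55).astype(int)
model = GradientBoostingClassifier(n_estimators=100, max_depth=3)
model.fit(X, y)
# Feature importances play the role of the hand-tuned 60/40 weights
print(dict(zip(["skill_overlap_ratio", "network_proximity_score"],
               model.feature_importances_.round(3))))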
Phase 3: Python — Graph ETL Pipeline
Next came the coding exercise. I was asked to build a graph ETL pipeline that processes raw connection data and computes network features for the recommendation system.
The Requirements
- Load raw connection data from CSV (simulated — in production it’s Parquet on HDFS)
- Build an adjacency graph using networkx
- Compute per-node features: degree centrality, betweenness centrality, PageRank
- Compute community detection (connected components or Louvain)
- Export feature vectors as a DataFrame for the ML pipeline
The scenario: LinkedIn’s recommendation team needs a daily batch of graph features to train their “People You May Know” model. The features are computed on a subgraph of active users (top 10M by profile_views_30d).
My Solution
import csv
import logging
from pathlib import Path
from collections import defaultdict
from typing import Dict, List, Tuple
import networkx as nx
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
logger = logging.getLogger(__name__)
class PeopleGraphETL:
"""
Builds the People Graph from raw connection data and computes
per-node features for LinkedIn's recommendation pipeline.
Features computed:
- degree_centrality: how connected is this member
- betweenness_centrality: how often this member is a bridge
- pagerank: overall influence in the network
- clustering_coefficient: how tightly-knit their circle is
- community_id: which cluster/community they belong to
- avg_neighbor_degree: quality of connections
- component_size: size of connected component
"""
def __init__(self, input_path: str | Path, output_path: str | Path):
self.input_path = Path(input_path)
self.output_path = Path(output_path)
self.graph: nx.Graph = nx.Graph()
def load_connections(self) -> List[Tuple[int, int]]:
"""Load raw connections from CSV: source_id, target_id, degree, timestamp."""
connections = []
row_count = 0
with open(self.input_path, "r") as f:
reader = csv.DictReader(f)
for row in reader:
connections.append((
int(row["source_member_id"]),
int(row["target_member_id"]),
))
row_count += 1
logger.info("Loaded %d connections from %s", row_count, self.input_path)
return connections
def build_graph(self, connections: List[Tuple[int, int]]) -> nx.Graph:
"""Build undirected graph from connections."""
self.graph.add_edges_from(connections)
logger.info(
"Graph built: %d nodes, %d edges",
self.graph.number_of_nodes(),
self.graph.number_of_edges(),
)
return self.graph
def compute_degree_features(self) -> Dict[int, float]:
"""Compute degree centrality for all nodes."""
logger.info("Computing degree centrality...")
return nx.degree_centrality(self.graph)
def compute_pagerank(self, max_iter: int = 100, tol: float = 1e-6) -> Dict[int, float]:
"""Compute PageRank scores for all nodes."""
logger.info("Computing PageRank (max_iter=%d, tol=%f)...", max_iter, tol)
return nx.pagerank(self.graph, max_iter=max_iter, tol=tol)
def compute_betweenness(self, sample_size: int = 1000) -> Dict[int, float]:
"""
Compute betweenness centrality using sampling (full computation
is O(V*E) which is infeasible for billion-node graphs).
"""
logger.info(
"Computing betweenness centrality (sample_size=%d)...",
sample_size,
)
        # networkx samples k pivot nodes internally, so cap k at the node count
        k = min(sample_size, self.graph.number_of_nodes())
        return nx.betweenness_centrality(self.graph, k=k)
def compute_clustering(self) -> Dict[int, float]:
"""Compute local clustering coefficient for each node."""
logger.info("Computing clustering coefficients...")
return nx.clustering(self.graph)
def detect_communities(self) -> Dict[int, int]:
"""
Detect communities using connected components.
In production, LinkedIn uses Louvain or label propagation
for better quality on billion-scale graphs.
"""
logger.info("Detecting connected components...")
components = list(nx.connected_components(self.graph))
node_to_community: Dict[int, int] = {}
for idx, component in enumerate(components):
for node in component:
node_to_community[node] = idx
logger.info("Found %d communities", len(components))
return node_to_community
def compute_avg_neighbor_degree(self) -> Dict[int, float]:
"""Compute average degree of neighbors for each node."""
logger.info("Computing average neighbor degree...")
return nx.average_neighbor_degree(self.graph)
def assemble_features(
self,
degree: Dict[int, float],
pagerank: Dict[int, float],
betweenness: Dict[int, float],
clustering: Dict[int, float],
communities: Dict[int, int],
avg_neighbor_deg: Dict[int, float],
component_sizes: Dict[int, int],
) -> pd.DataFrame:
"""
Assemble all per-node features into a feature matrix.
Returns DataFrame with columns:
member_id, degree_centrality, pagerank, betweenness_centrality,
clustering_coefficient, community_id, avg_neighbor_degree,
component_size, is_high_influence
"""
all_nodes = sorted(self.graph.nodes())
records = []
for node in all_nodes:
records.append({
"member_id": node,
"degree_centrality": degree.get(node, 0.0),
"pagerank": pagerank.get(node, 0.0),
"betweenness_centrality": betweenness.get(node, 0.0),
"clustering_coefficient": clustering.get(node, 0.0),
"community_id": communities.get(node, -1),
"avg_neighbor_degree": avg_neighbor_deg.get(node, 0.0),
"component_size": component_sizes.get(communities.get(node, -1), 0),
})
df = pd.DataFrame(records)
# Derived features
df["is_high_influence"] = (
(df["degree_centrality"] > df["degree_centrality"].quantile(0.95))
& (df["pagerank"] > df["pagerank"].quantile(0.95))
).astype(int)
# Normalize continuous features to [0, 1] for ML
scaler = MinMaxScaler()
numeric_cols = [
"degree_centrality",
"pagerank",
"betweenness_centrality",
"clustering_coefficient",
"avg_neighbor_degree",
]
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
logger.info("Assembled feature matrix: %d rows, %d columns", *df.shape)
return df
def run(self) -> pd.DataFrame:
"""Execute the full ETL pipeline."""
# Step 1: Load and build
connections = self.load_connections()
self.build_graph(connections)
# Step 2: Compute features (independent — could parallelize)
degree = self.compute_degree_features()
pagerank = self.compute_pagerank()
betweenness = self.compute_betweenness(sample_size=5000)
clustering = self.compute_clustering()
communities = self.detect_communities()
avg_neighbor_deg = self.compute_avg_neighbor_degree()
# Component sizes
component_sizes = {
idx: len(comp)
for idx, comp in enumerate(nx.connected_components(self.graph))
}
# Step 3: Assemble and output
features = self.assemble_features(
degree, pagerank, betweenness, clustering,
communities, avg_neighbor_deg, component_sizes,
)
# Write output
self.output_path.parent.mkdir(parents=True, exist_ok=True)
features.to_parquet(self.output_path, index=False)
logger.info("Wrote features to %s", self.output_path)
return features
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(name)s] %(message)s")
etl = PeopleGraphETL(
input_path="./data/connections.csv",
output_path="./output/graph_features.parquet",
)
features = etl.run()
print(f"\nFeature matrix shape: {features.shape}")
print(f"\nTop 10 influencers by PageRank:")
print(
features.nlargest(10, "pagerank")[
["member_id", "pagerank", "degree_centrality", "community_id"]
].to_string(index=False)
)
Follow-Up: Scaling to Billions
The interviewer immediately asked:
“networkx won’t scale to a billion nodes. How does LinkedIn actually do this?”
I outlined the production approach:
1. Distributed graph computation with Spark + GraphFrames:
# Production: Use PySpark + GraphFrames for billion-scale
from pyspark.sql import SparkSession
from graphframes import GraphFrame
spark = SparkSession.builder.appName("people-graph-features").getOrCreate()
# GraphFrames expects an `id` column on vertices and `src`/`dst` columns on edges
vertices = spark.read.parquet("s3://linkedin-data/people/")
edges = spark.read.parquet("s3://linkedin-data/connections/")
g = GraphFrame(vertices, edges)
# PageRank distributed across cluster
pagerank_df = g.pageRank(resetProbability=0.15, maxIter=20)
# Connected components
cc_df = g.connectedComponents()
# Triangle count (for clustering coefficient proxy)
triangles = g.triangleCount()
2. Incremental graph updates: Instead of recomputing from scratch every day, LinkedIn maintains a persistent graph state and applies daily deltas. New connections are added, removed connections are deleted, and only affected nodes have their scores recomputed (a rough sketch of this delta merge follows the list below).
3. Feature store: Computed features land in a feature store (LinkedIn uses their own internal system, similar to Feast) so the ML training pipeline can read them without recomputation.
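To illustrate point 2, here's a minimal PySpark sketch of a daily delta merge. The paths, the `op` column, and the `src`/`dst` column names are assumptions for the example, not LinkedIn's actual layout:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("connection-delta-merge").getOrCreate()
# Persisted edge state and today's delta (paths and schema are assumed)
edges = spark.read.parquet("s3://example-bucket/connections/current/")
delta = spark.read.parquet("s3://example-bucket/connections/delta/dt=2026-05-01/")
adds = delta.filter(F.col("op") == "add").select("src", "dst")
removes = delta.filter(F.col("op") == "remove").select("src", "dst")
# Drop removed edges, append new ones, dedupe
updated = (
    edges.select("src", "dst")
         .join(removes, ["src", "dst"], "left_anti")
         .unionByName(adds)
         .dropDuplicates(["src", "dst"])
)
# Only members touched by the delta need their graph features recomputed
touched = (
    delta.select(F.col("src").alias("member_id"))
         .union(delta.select(F.col("dst").alias("member_id")))
         .distinct()
)
updated.write.mode("overwrite").parquet("s3://example-bucket/connections/next/")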
The interviewer was impressed that I understood the gap between the interview exercise (networkx on a laptop) and production (Spark GraphFrames on a cluster).
Phase 4: System Design — Job Recommendation Data Pipeline
The final technical round was a system design question:
“Design the data pipeline that powers ‘Jobs You May Be Interested In’. It needs to score millions of job-member pairs daily, using both content features (skills, location, seniority) and graph features (network proximity, community signals).”
I sketched this architecture:
┌───────────────────────────────────────────────────┐
│ LinkedIn Job Recommendation Pipeline │
└───────────────────────────────────────────────────┘
┌───────────────┐ ┌──────────────────┐ ┌─────────────────────────┐
│ People │ │ Company │ │ Job Postings │
│ Graph │ │ Graph │ │ (Hiring API) │
│ (Members, │ │ (Organizations, │ │ │
│ Skills, │ │ Departments) │ │ ┌───────────────────┐ │
│ Experience) │ │ │ │ │ Job Requirements │ │
└───────┬───────┘ └───────┬──────────┘ │ │ Required Skills │ │
│ │ │ │ Location/Remote │ │
│ HDFS/Parquet │ HDFS/Parquet │ │ Seniority Level │ │
▼ ▼ │ └────────┬──────────┘ │
┌───────────────────────────────────────────────────────────────────────┐
│ HDFS Data Lake (Raw Zone) │
│ /people/year=2026/month=05/ /companies/ /jobs/ /applications/ │
└───────────────────────────┬───────────────────────────────────────────┘
│
┌─────────────┴─────────────┐
▼ ▼
┌──────────────────────┐ ┌──────────────────────────┐
│ Spark ETL Cluster │ │ GraphFrames Cluster │
│ │ │ │
│ ┌────────────────┐ │ │ ┌────────────────────┐ │
│ │ Content │ │ │ │ Graph Features: │ │
│ │ Feature │ │ │ │ - PageRank │ │
│ │ Extraction │ │ │ │ - Betweenness │ │
│ │ - Skill match │ │ │ │ - Degree centrality│ │
│ │ - Title NLP │ │ │ │ - Community ID │ │
│ │ - Seniority │ │ │ │ - Component size │ │
│ │ - Location │ │ │ └────────┬───────────┘ │
│ └────────┬───────┘ │ └──────────┬───────────────┘
└───────────┼──────────┘ │
│ │
▼ ▼
┌────────────────────────────────────────────────────────────┐
│ Feature Store (Internal) │
│ ┌─────────────────┐ ┌────────────────────┐ ┌─────────┐ │
│ │ Content Features│ │ Graph Features │ │ History │ │
│ │ (skill_score, │ │ (pagerank, degree, │ │ (past │ │
│ │ location_match)│ │ community, etc.) │ │ clicks)│ │
│ └────────┬────────┘ └────────┬───────────┘ └────┬────┘ │
└───────────┼────────────────────┼───────────────────┼───────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ Recommendation Scoring Engine │
│ │
│ For each (member, job) pair: │
│ score = w1*content_score + w2*graph_score + w3*behavioral │
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌────────────────────┐ │
│ │ Content Score │ │ Graph Score │ │ Behavioral Score │ │
│ │ (0.45 weight) │ │ (0.30 weight) │ │ (0.25 weight) │ │
│ │ │ │ │ │ (click history, │ │
│ │ skill_overlap │ │ network │ │ past applications)│ │
│ │ title_sim │ │ proximity │ │ │ │
│ │ location_fit │ │ community │ │ │ │
│ └───────────────┘ └───────────────┘ └────────────────────┘ │
└──────────────────────────┬──────────────────────────────────────┘
│
┌─────────┴──────────┐
▼ ▼
┌───────────────┐ ┌──────────────────┐
│ Member Job │ │ A/B Test │
│ Feed Table │ │ Framework │
│ (Redis) │ │ (weight tuning) │
└───────┬───────┘ └──────────────────┘
│
▼
┌───────────────┐
│ LinkedIn UI │
│ (Jobs Tab) │
└───────────────┘
┌────────────────────────────────────────────────────────────────────┐
│ Orchestration & Monitoring │
│ ┌────────────┐ ┌──────────────┐ ┌─────────────────┐ ┌───────┐ │
│ │ Oozie/ │ │ Data Quality │ │ Feature Fresh- │ │ Alert │ │
│ │ Airflow │ │ (schema+ │ │ ness SLA checks │ │ System│ │
│ │ (DAG mgmt) │ │ null checks)│ │ (< 4hr lag) │ │ (Pager)│ │
│ └────────────┘ └──────────────┘ └─────────────────┘ └───────┘ │
└────────────────────────────────────────────────────────────────────┘
Key Design Decisions
I explained several critical decisions:
1. Why not real-time scoring for every job view?
The candidate-job matrix is enormous — 900M members × 20M active job postings = 18 quadrillion pairs. You can’t score all of them in real-time. Instead:
- Precompute a top-K list for each member (e.g., top 500 jobs)
- Rank at query time using lightweight features (recency, freshness)
- Recompute the top-K list in a daily micro-batch
2. Feature freshness vs. computation cost trade-off:
- Content features (skills, title): updated weekly
- Graph features (PageRank, communities): updated daily
- Behavioral features (click history): updated hourly
- The scoring engine reads from the feature store and handles stale features gracefully
3. A/B testing the weights:
The w1, w2, w3 weights aren’t hardcoded — they’re tuned through continuous A/B testing. The data engineering pipeline supports multiple weight configurations in parallel, each serving a different experiment bucket.
4. Cold start problem for new job postings:
New jobs have no application history and no network data. I proposed:
- Content-only scoring for the first 48 hours (no graph/behavioral features)
- Bootstrap from similar jobs — cluster new jobs with existing jobs by (title, skills, company) and use their click patterns as priors
The interviewer liked the cold start approach and asked about the clustering algorithm. I mentioned using TF-IDF on skill sets + cosine similarity for job clustering, computed incrementally as new jobs arrive.
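As a toy version of that clustering idea (synthetic job rows, treating each job's skill list as a "document" — a simplification of whatever LinkedIn actually runs):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Existing jobs with click history (synthetic examples)
existing = pd.DataFrame({
    "job_id": [101, 102, 103],
    "skills": [
        "python spark sql airflow",
        "java kafka streaming microservices",
        "python pandas sql dashboards",
    ],
})
new_job_skills = ["python sql dbt"]  # freshly posted job, no history yet
vectorizer = TfidfVectorizer()
existing_vecs = vectorizer.fit_transform(existing["skills"])
new_vec = vectorizer.transform(new_job_skills)
# Borrow click-through priors from the most similar existing jobs
sims = cosine_similarity(new_vec, existing_vecs).ravel()
neighbors = existing.assign(similarity=sims).nlargest(2, "similarity")
print(neighbors[["job_id", "similarity"]])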
Phase 5: Behavioral and Cultural Fit
LinkedIn’s interview includes a strong behavioral component focused on their core values. Here’s what came up:
Question: “Tell me about a time you worked with ambiguous requirements on a data project.”
I described building a referral quality score at my previous company. The initial requirement was simply “measure how good our referral program is.” I had to:
- Define “quality” — was it hire rate? retention at 6 months? performance rating? I proposed a composite score.
- Build the data model — link referrals to hires to performance reviews, spanning 3 different systems.
- Validate with stakeholders — show the score to HR and engineering managers, iterate on the definition.
- Ship incrementally — start with hire rate only, add retention after 2 weeks, add performance after another 2.
The interviewer emphasized that at LinkedIn, data engineers often define the metrics, not just compute them. The ability to translate vague business questions into precise data models is critical.
Question: “How do you handle a situation where your data pipeline breaks and it affects a product feature?”
My approach: blameless incident response with clear severity levels.
- Detect and contain — monitoring alerts trigger a page. First action is to route to a cached/stale version of the data so the UI doesn’t show errors.
- Diagnose — check pipeline logs, data quality gates, upstream source changes.
- Fix and backfill — repair the pipeline, reprocess the affected window.
- Post-mortem — document root cause, implement a guardrail (e.g., new data quality check) so it doesn’t happen again.
I mentioned that at LinkedIn, this is especially critical because job recommendation scores affect people’s careers. A broken pipeline could mean qualified candidates never see relevant jobs.
Interview Summary
Here’s a structured recap of the entire interview:
Round 1 — Graph Intuition (30 min)
- Topic: Identifying novel graph metrics for professional networks
- Key skill: Translating product questions into graph-theoretic concepts
- What they valued: Understanding that LinkedIn’s product is a graph, not just a database
Round 2 — SQL: Network Analysis (60 min)
- Topic: Second-degree influence detection, skill-match scoring with network proximity
- Key skills: Multi-hop joins, CTEs, window functions, handling edge cases
- Bonus: Dynamic weighting based on skill scarcity, partition pruning for 50B-row tables
Round 3 — Python: Graph ETL (60 min)
- Topic: Building network features (PageRank, betweenness, communities) from raw connection data
- Key skills: networkx graph operations, feature engineering, MinMaxScaler normalization
- Bonus: Understanding the leap from networkx to Spark GraphFrames for production scale
Round 4 — System Design: Recommendation Pipeline (45 min)
- Topic: End-to-end job recommendation data pipeline with content + graph + behavioral signals
- Key skills: Feature store architecture, micro-batch scoring, cold start strategies, A/B testing
- Bonus: Practical awareness of the 18 quadrillion candidate-job pairs problem
Round 5 — Behavioral (30 min)
- Topic: Ambiguous requirements, incident response, stakeholder management
- Key skills: Translating vague questions into data models, blameless post-mortems
Total time: ~4 hours (spread across 2 days)
What Made the Difference
Three things stood out as differentiators:
- Graph-first thinking. Most candidates tried to solve graph problems with flat SQL joins. Showing fluency in graph concepts — centrality, communities, path length — immediately signaled I could work with the People Graph.
- Knowing the gap between interview code and production. When I used networkx, I immediately acknowledged it wouldn’t scale and pivoted to Spark GraphFrames. The interviewers want engineers who know when their tools stop working and what to reach for next.
- Connecting graph features to product outcomes. I didn’t just compute PageRank — I explained how it maps to “who should we show this job to.” The best data engineers at LinkedIn bridge the gap between abstract graph theory and real user experiences.
Recommended Reading and Resources
If you’re preparing for a Data Engineer role at LinkedIn or a graph-heavy data position, here’s what I’d study:
Graph Data Engineering:
- Networks, Crowds, and Markets by Easley and Kleinberg — the foundational text on network analysis for applied engineers
- NetworkX Documentation — essential for prototyping graph algorithms
- Apache Spark GraphFrames — distributed graph computation for production pipelines
- Graph Database patterns — InfoQ — practical modeling approaches
LinkedIn-Specific:
- LinkedIn Engineering Blog — their most-read posts cover graph algorithms, the People Graph architecture, and recommendation systems
- LinkedIn’s “People You May Know” paper — understanding their two-tower recommendation architecture
- LinkedIn Data Platform blog posts — covers their Hadoop/Spark/Kafka stack evolution
- LinkedIn’s open-source data projects on GitHub — look at DataHub (metadata), Kafka contributions, and Samza (stream processing)
Graph Analytics & SQL:
- Neo4j Graph Data Science library — PageRank, community detection, node similarity — concepts transfer to any platform
- LeetCode Graph problems — BFS, DFS, shortest path fundamentals
- Window functions practice: SQLBolt — critical for ranking and scoring queries
Feature Engineering for Recommendations:
- Feast — Feature Store — open-source feature store for ML pipelines
- A Unified Social Recommendation Framework — LinkedIn Research — the seminal paper on their two-tower approach
- Machine Learning Feature Engineering — O’Reilly book on practical feature design
System Design for Data Platforms:
- Designing Data-Intensive Applications by Martin Kleppmann — the bible for data platform architecture
- Martin Kleppmann’s blog on streaming architectures — stream processing at LinkedIn scale
Final Thoughts
The LinkedIn Data Engineer interview is unique because it forces you to think in graphs, not tables. Most data engineering interviews focus on ETL pipelines, warehouse modeling, and dashboard queries. LinkedIn throws in centrality metrics, community detection, and network proximity — concepts that sit at the intersection of graph theory and data engineering.
The good news: you don’t need to be a graph theorist. You need to understand the intuition behind graph metrics and know how to implement them at scale using distributed tools. The interviewers care more about whether you can translate “People You May Know” into “compute k-nearest neighbors on a bipartite graph with learned embeddings” than whether you can prove the convergence of the Louvain algorithm.
If you can show that you understand both the data engineering fundamentals (SQL, Python, Spark, pipelines) and the graph thinking that makes LinkedIn’s product unique — how networks create value, how proximity drives trust, how communities form and evolve — you’ll stand out in a way that’s hard to replicate.
Good luck. And remember: at LinkedIn, every data engineer is also a network theorist — whether they know it yet or not.