LinkedIn Data Engineer Interview Walkthrough 2026: A Deep Dive into Graph Data Analysis and Engineering
Walkthrough of a Data Engineer interview at LinkedIn — SQL for professional network analysis, Python for graph ETL and job recommendation pipelines, graph data modeling at billion-node scale, and real architecture for the People Graph.
I walked into the Data Engineer interview at LinkedIn knowing I was in for something different from the typical analytics warehouse interview. LinkedIn’s core product is a graph — over a billion professionals, tens of billions of connections, companies, skills, endorsements, and job postings, all linked together in what they call the People Graph. As a data engineer, you’re not just moving rows from A to B. You’re building the infrastructure that powers network effects, job recommendations, and “People You May Know” for the world’s largest professional network.
The interview was structured around LinkedIn’s actual graph data stack and real challenges in professional network analytics. Below I’ll walk through each phase, the problems I was given, my solutions, and what the interviewers were really looking for.
Phase 1: Introduction and Graph Intuition
The interviewer started with a question that set the tone for the entire interview:
“If you could compute one graph metric on LinkedIn’s People Graph that would tell us something we don’t already know about how professionals find jobs, what would it be and how would you compute it?”
I thought about the core value proposition of LinkedIn. It’s not just a resume database — it’s a network that connects people through relationships, skills, and shared experiences. My answer:
Path-to-hire network distance. Specifically, I’d measure the average shortest path length between a person and their new employer’s key decision-makers before they apply versus after they connect with them through the network. This tells us whether LinkedIn’s network effects actually shorten the job search, and for which segments.
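To make the metric concrete, here's a toy networkx sketch of the before/after measurement (hypothetical member IDs on a tiny hand-built graph, not real LinkedIn data):
import networkx as nx
# Toy connection graph; integer node IDs stand in for members
g = nx.Graph()
g.add_edges_from([(1, 2), (2, 3), (3, 4), (1, 5), (5, 4)])
member = 1
decision_makers = [3, 4]  # hypothetical hiring managers at the target company
def avg_distance(graph, source, targets):
    """Average shortest-path length from source to the reachable targets."""
    lengths = nx.single_source_shortest_path_length(graph, source)
    reachable = [lengths[t] for t in targets if t in lengths]
    return sum(reachable) / len(reachable) if reachable else float("inf")
before = avg_distance(g, member, decision_makers)
g.add_edge(member, 3)  # the member connects with one decision-maker
after = avg_distance(g, member, decision_makers)
print(before, after)  # 2.0 -> 1.5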
The interviewer smiled and said:
“That’s exactly the kind of thinking we need. Now let’s see if you can actually compute it.”
Phase 2: SQL — Professional Network Analysis
I was given a simplified schema representing LinkedIn’s internal data warehouse — a star-like schema built on top of the graph data, optimized for analytical queries. Here’s what I was shown:
-- people: core profile table (one row per member)
people (
member_id BIGINT, -- unique member identifier
headline VARCHAR(256),
current_company_id BIGINT,
current_title VARCHAR(256),
location_id INT,
joined_date DATE,
profile_views_30d INT -- pre-aggregated for performance
)
-- connections: the graph edges (bidirectional, denormalized)
connections (
source_member_id BIGINT, -- member A
target_member_id BIGINT, -- member B
connection_degree INT, -- 1, 2, or 3
created_at TIMESTAMP,
is_mutual BOOLEAN -- TRUE if A follows B and B follows A
)
-- skills: member-to-skill mapping
skills (
member_id BIGINT,
skill_name VARCHAR(128),
endorsement_count INT,
added_at TIMESTAMP
)
-- jobs: current job postings
jobs (
job_id BIGINT,
company_id BIGINT,
title VARCHAR(256),
required_skills VARCHAR(256), -- comma-separated skill names
location_id INT,
posted_at TIMESTAMP,
application_count INT
)
-- applications: job applications
applications (
application_id BIGINT,
member_id BIGINT,
job_id BIGINT,
applied_at TIMESTAMP,
status VARCHAR(32) -- 'viewed', 'interviewing', 'offered', 'rejected'
)
Question 1: Second-Degree Network Influence on Job Applications
“Write a query that identifies, for each job posting, the top 5 members who have the most second-degree connections to applicants. These are the ‘influencers’ who could help spread awareness of the job through their network.”
This is a real problem LinkedIn solves — finding members to target for “job alerts” based on their network proximity to interested candidates.
My Solution
WITH applicants AS (
-- Get all members who applied to each job
SELECT
a.job_id,
a.member_id AS applicant_id
FROM applications a
WHERE a.status IN ('viewed', 'interviewing', 'offered')
),
second_degree_connections AS (
-- For each applicant, find their 1st-degree connections
-- Then find the 2nd-degree connections (connections of connections)
SELECT
ap.job_id,
c2.target_member_id AS influencer_id,
COUNT(DISTINCT ap.applicant_id) AS connected_applicants
FROM applicants ap
-- First degree: applicant's connections
JOIN connections c1
ON c1.source_member_id = ap.applicant_id
AND c1.connection_degree = 1
-- Second degree: their connections' connections
JOIN connections c2
ON c2.source_member_id = c1.target_member_id
AND c2.connection_degree = 1
AND c2.target_member_id != ap.applicant_id -- exclude self
AND c2.target_member_id NOT IN (
-- Exclude members who already applied
SELECT applicant_id
FROM applicants a2
WHERE a2.job_id = ap.job_id
)
GROUP BY ap.job_id, c2.target_member_id
HAVING COUNT(DISTINCT ap.applicant_id) >= 2
),
ranked_influencers AS (
SELECT
job_id,
influencer_id,
connected_applicants,
ROW_NUMBER() OVER (
PARTITION BY job_id
ORDER BY connected_applicants DESC
) AS rank
FROM second_degree_connections
)
SELECT
ri.job_id,
ri.influencer_id,
p.headline AS influencer_headline,
p.current_company_id AS influencer_company,
ri.connected_applicants,
j.title AS job_title,
j.application_count
FROM ranked_influencers ri
JOIN people p
ON ri.influencer_id = p.member_id
JOIN jobs j
ON ri.job_id = j.job_id
WHERE ri.rank <= 5
ORDER BY ri.job_id, ri.connected_applicants DESC;
What the Interviewer Was Testing
The interviewer noted several key things:
- Self-exclusion — I correctly excluded the applicant themselves from their own second-degree network
- Already-applied exclusion — I filtered out members who already applied, since the goal is to reach new candidates
- Minimum threshold — the HAVING >= 2 clause ensures we only surface influencers with meaningful overlap
- Window function ranking — ROW_NUMBER() cleanly picks the top 5 per job
Then came the scale question:
“This query joins connections twice. That table has 50 billion rows. How do you make this actually run?”
I laid out a three-part strategy:
-- Pre-compute second-degree connections in a materialized table
-- refreshed daily, partitioned by date
CREATE OR REPLACE TABLE daily_second_degree_influencers (
snapshot_date DATE,
job_id BIGINT,
influencer_id BIGINT,
connected_applicants INT
) AS
-- Materialize the heavy CTE above, partitioned by job_id
WITH precomputed AS (
/* same logic as above */
)
SELECT * FROM precomputed;
-- The serving query becomes a cheap, partition-pruned lookup
SELECT * FROM daily_second_degree_influencers
WHERE snapshot_date = CURRENT_DATE
AND job_id = 987654321 -- top 5 influencers for a single job
ORDER BY connected_applicants DESC
LIMIT 5;
The interviewer nodded and added that LinkedIn actually does this — they maintain precomputed adjacency aggregations in their Hadoop-based data platform, refreshed in hourly micro-batches.
Question 2: Job Match Score with Skill Overlap and Network Proximity
“Now build a query that scores each member’s fit for a specific job posting, combining their skill match and their network proximity to current employees at the hiring company.”
This is the core of LinkedIn’s “Jobs you may be interested in” feature — it’s not just about skills. The network matters. If you have 3 friends who work at a company, you’re more likely to thrive there.
My Solution
WITH target_job AS (
SELECT * FROM jobs WHERE job_id = 987654321
),
member_skills AS (
-- Get the candidate's skills
SELECT
s.member_id,
s.skill_name,
s.endorsement_count
FROM skills s
WHERE s.member_id IN (
-- Only consider members in the relevant location
SELECT member_id FROM people
WHERE location_id = (SELECT location_id FROM target_job)
)
),
skill_match AS (
-- Calculate skill overlap score
SELECT
ms.member_id,
COUNT(DISTINCT ms.skill_name) AS matched_skills,
SUM(ms.endorsement_count) AS total_endorsements_for_matched,
-- Normalize: ratio of matched skills to required skills
ROUND(
1.0 * COUNT(DISTINCT ms.skill_name)
/ NULLIF(
(SELECT CARDINALITY(
STRING_TO_ARRAY((SELECT required_skills FROM target_job), ','))
), 0
),
4
) AS skill_overlap_ratio
FROM member_skills ms
CROSS JOIN LATERAL (
SELECT unnest(
STRING_TO_ARRAY(
(SELECT required_skills FROM target_job),
','
)
) AS required_skill
) req
WHERE LOWER(TRIM(ms.skill_name)) = LOWER(TRIM(req.required_skill))
GROUP BY ms.member_id
),
network_proximity AS (
-- Count first and second-degree connections to current employees
-- at the hiring company
SELECT
c.source_member_id AS candidate_id,
COUNT(DISTINCT CASE WHEN c.connection_degree = 1
THEN c.target_member_id END) AS direct_connections,
COUNT(DISTINCT CASE WHEN c.connection_degree = 2
THEN c.target_member_id END) AS second_degree_connections
FROM connections c
WHERE c.target_member_id IN (
-- Current employees at the hiring company
SELECT member_id FROM people
WHERE current_company_id = (SELECT company_id FROM target_job)
)
GROUP BY c.source_member_id
),
combined_score AS (
SELECT
sm.member_id,
p.headline,
p.current_title,
-- Skill component (60% weight)
COALESCE(sm.skill_overlap_ratio, 0) * 0.6 AS skill_score,
-- Network component (40% weight)
LEAST(
1.0,
(COALESCE(np.direct_connections, 0) * 0.1
+ COALESCE(np.second_degree_connections, 0) * 0.01)
) * 0.4 AS network_score,
-- Combined
ROUND(
COALESCE(sm.skill_overlap_ratio, 0) * 0.6
+ LEAST(
1.0,
(COALESCE(np.direct_connections, 0) * 0.1
+ COALESCE(np.second_degree_connections, 0) * 0.01)
) * 0.4,
4
) AS job_match_score
FROM skill_match sm
JOIN people p ON sm.member_id = p.member_id
LEFT JOIN network_proximity np ON sm.member_id = np.candidate_id
)
SELECT
member_id,
headline,
current_title,
skill_score,
network_score,
job_match_score,
RANK() OVER (ORDER BY job_match_score DESC) AS candidate_rank
FROM combined_score
WHERE job_match_score > 0.2 -- minimum threshold
ORDER BY job_match_score DESC
LIMIT 100;
Discussion: Tuning the Weights
The interviewer pushed back on my 60/40 split:
“What if we’re hiring for a very niche role where there are only 3 people in the world with the right skills? Network proximity should dominate.”
I agreed and proposed a dynamic weighting approach:
-- Dynamic weight based on skill scarcity
-- If few members have all required skills, weight network higher
WITH skill_scarcity AS (
SELECT
COUNT(*) AS members_with_skills,
(SELECT COUNT(*) FROM people WHERE location_id = 123) AS total_pool,
ROUND(1.0 * COUNT(*) / NULLIF((SELECT COUNT(*) FROM people WHERE location_id = 123), 0), 4) AS scarcity_ratio
FROM skill_match
)
-- If scarcity_ratio < 0.01, shift to 30% skills / 70% network
-- If scarcity_ratio > 0.1, use 70% skills / 30% network
-- This is implemented as a linear interpolation in the scoring function
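Translating those comments into code, here is a minimal Python sketch of the weight blending. The 0.01 and 0.1 breakpoints come from the comments above; the function itself is my own illustration, not LinkedIn's scoring code:
def blend_weights(scarcity_ratio: float) -> tuple[float, float]:
    """Return (skill_weight, network_weight) from the skill-scarcity ratio.
    Below 1% scarcity the split is 30/70 (network-heavy); above 10% it is
    70/30 (skill-heavy); in between we interpolate linearly."""
    lo, hi = 0.01, 0.10
    t = min(max((scarcity_ratio - lo) / (hi - lo), 0.0), 1.0)
    skill_weight = 0.3 + t * (0.7 - 0.3)
    return skill_weight, 1.0 - skill_weight
print(blend_weights(0.005))  # (0.3, 0.7) -- niche role, network dominates
print(blend_weights(0.055))  # (0.5, 0.5) -- middle of the range
print(blend_weights(0.200))  # (0.7, 0.3) -- common skills, skills dominate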
The interviewer said this is exactly how LinkedIn’s recommendation system works — the weights are not hardcoded but learned from click-through and application data using a gradient-boosted model. The data engineering challenge is building the feature pipeline that feeds those weights.
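Purely as an illustration of the "learned weights" idea — synthetic data and a stock scikit-learn model, not LinkedIn's actual system — the hand-tuned split effectively becomes feature importances fit on past outcomes:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
rng = np.random.default_rng(7)
# Synthetic feature matrix: [skill_overlap_ratio, network_proximity_score]
X = rng.random((5000, 2))
# Synthetic label: did the member apply after seeing the recommendation?
y = (0.4 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(0, 0.1, 5000) > 0.55).astype(int)
model = GradientBoostingClassifier(n_estimators=100, max_depth=3)
model.fit(X, y)
# Feature importances play the role of the hand-tuned 60/40 weights
print(dict(zip(["skill_overlap_ratio", "network_proximity_score"],
               model.feature_importances_.round(3))))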
Phase 3: Python — Graph ETL Pipeline
Next came the coding exercise. I was asked to build a graph ETL pipeline that processes raw connection data and computes network features for the recommendation system.
The Requirements
- Load raw connection data from CSV (simulated — in production it’s Parquet on HDFS)
- Build an adjacency graph using networkx
- Compute per-node features: degree centrality, betweenness centrality, PageRank
- Compute community detection (connected components or Louvain)
- Export feature vectors as a DataFrame for the ML pipeline
The scenario: LinkedIn’s recommendation team needs a daily batch of graph features to train their “People You May Know” model. The features are computed on a subgraph of active users (top 10M by profile_views_30d).
My Solution
import csv
import logging
from pathlib import Path
from collections import defaultdict
from typing import Dict, List, Tuple
import networkx as nx
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
logger = logging.getLogger(__name__)
class PeopleGraphETL:
"""
Builds the People Graph from raw connection data and computes
per-node features for LinkedIn's recommendation pipeline.
Features computed:
- degree_centrality: how connected is this member
- betweenness_centrality: how often this member is a bridge
- pagerank: overall influence in the network
- clustering_coefficient: how tightly-knit their circle is
- community_id: which cluster/community they belong to
- avg_neighbor_degree: quality of connections
- component_size: size of connected component
"""
def __init__(self, input_path: str | Path, output_path: str | Path):
self.input_path = Path(input_path)
self.output_path = Path(output_path)
self.graph: nx.Graph = nx.Graph()
def load_connections(self) -> List[Tuple[int, int]]:
"""Load raw connections from CSV: source_id, target_id, degree, timestamp."""
connections = []
row_count = 0
with open(self.input_path, "r") as f:
reader = csv.DictReader(f)
for row in reader:
connections.append((
int(row["source_member_id"]),
int(row["target_member_id"]),
))
row_count += 1
logger.info("Loaded %d connections from %s", row_count, self.input_path)
return connections
def build_graph(self, connections: List[Tuple[int, int]]) -> nx.Graph:
"""Build undirected graph from connections."""
self.graph.add_edges_from(connections)
logger.info(
"Graph built: %d nodes, %d edges",
self.graph.number_of_nodes(),
self.graph.number_of_edges(),
)
return self.graph
def compute_degree_features(self) -> Dict[int, float]:
"""Compute degree centrality for all nodes."""
logger.info("Computing degree centrality...")
return nx.degree_centrality(self.graph)
def compute_pagerank(self, max_iter: int = 100, tol: float = 1e-6) -> Dict[int, float]:
"""Compute PageRank scores for all nodes."""
logger.info("Computing PageRank (max_iter=%d, tol=%f)...", max_iter, tol)
return nx.pagerank(self.graph, max_iter=max_iter, tol=tol)
def compute_betweenness(self, sample_size: int = 1000) -> Dict[int, float]:
"""
Compute betweenness centrality using sampling (full computation
is O(V*E) which is infeasible for billion-node graphs).
"""
logger.info(
"Computing betweenness centrality (sample_size=%d)...",
sample_size,
)
        # networkx samples k pivot nodes internally, so cap k at the node count
        k = min(sample_size, self.graph.number_of_nodes())
        return nx.betweenness_centrality(self.graph, k=k)
def compute_clustering(self) -> Dict[int, float]:
"""Compute local clustering coefficient for each node."""
logger.info("Computing clustering coefficients...")
return nx.clustering(self.graph)
def detect_communities(self) -> Dict[int, int]:
"""
Detect communities using connected components.
In production, LinkedIn uses Louvain or label propagation
for better quality on billion-scale graphs.
"""
logger.info("Detecting connected components...")
components = list(nx.connected_components(self.graph))
node_to_community: Dict[int, int] = {}
for idx, component in enumerate(components):
for node in component:
node_to_community[node] = idx
logger.info("Found %d communities", len(components))
return node_to_community
def compute_avg_neighbor_degree(self) -> Dict[int, float]:
"""Compute average degree of neighbors for each node."""
logger.info("Computing average neighbor degree...")
return nx.average_neighbor_degree(self.graph)
def assemble_features(
self,
degree: Dict[int, float],
pagerank: Dict[int, float],
betweenness: Dict[int, float],
clustering: Dict[int, float],
communities: Dict[int, int],
avg_neighbor_deg: Dict[int, float],
component_sizes: Dict[int, int],
) -> pd.DataFrame:
"""
Assemble all per-node features into a feature matrix.
Returns DataFrame with columns:
member_id, degree_centrality, pagerank, betweenness_centrality,
clustering_coefficient, community_id, avg_neighbor_degree,
component_size, is_high_influence
"""
all_nodes = sorted(self.graph.nodes())
records = []
for node in all_nodes:
records.append({
"member_id": node,
"degree_centrality": degree.get(node, 0.0),
"pagerank": pagerank.get(node, 0.0),
"betweenness_centrality": betweenness.get(node, 0.0),
"clustering_coefficient": clustering.get(node, 0.0),
"community_id": communities.get(node, -1),
"avg_neighbor_degree": avg_neighbor_deg.get(node, 0.0),
"component_size": component_sizes.get(communities.get(node, -1), 0),
})
df = pd.DataFrame(records)
# Derived features
df["is_high_influence"] = (
(df["degree_centrality"] > df["degree_centrality"].quantile(0.95))
& (df["pagerank"] > df["pagerank"].quantile(0.95))
).astype(int)
# Normalize continuous features to [0, 1] for ML
scaler = MinMaxScaler()
numeric_cols = [
"degree_centrality",
"pagerank",
"betweenness_centrality",
"clustering_coefficient",
"avg_neighbor_degree",
]
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
logger.info("Assembled feature matrix: %d rows, %d columns", *df.shape)
return df
def run(self) -> pd.DataFrame:
"""Execute the full ETL pipeline."""
# Step 1: Load and build
connections = self.load_connections()
self.build_graph(connections)
# Step 2: Compute features (independent — could parallelize)
degree = self.compute_degree_features()
pagerank = self.compute_pagerank()
betweenness = self.compute_betweenness(sample_size=5000)
clustering = self.compute_clustering()
communities = self.detect_communities()
avg_neighbor_deg = self.compute_avg_neighbor_degree()
# Component sizes
component_sizes = {
idx: len(comp)
for idx, comp in enumerate(nx.connected_components(self.graph))
}
# Step 3: Assemble and output
features = self.assemble_features(
degree, pagerank, betweenness, clustering,
communities, avg_neighbor_deg, component_sizes,
)
# Write output
self.output_path.parent.mkdir(parents=True, exist_ok=True)
features.to_parquet(self.output_path, index=False)
logger.info("Wrote features to %s", self.output_path)
return features
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(name)s] %(message)s")
etl = PeopleGraphETL(
input_path="./data/connections.csv",
output_path="./output/graph_features.parquet",
)
features = etl.run()
print(f"\nFeature matrix shape: {features.shape}")
print(f"\nTop 10 influencers by PageRank:")
print(
features.nlargest(10, "pagerank")[
["member_id", "pagerank", "degree_centrality", "community_id"]
].to_string(index=False)
)
Follow-Up: Scaling to Billions
The interviewer immediately asked:
“networkx won’t scale to a billion nodes. How does LinkedIn actually do this?”
I outlined the production approach:
1. Distributed graph computation with Spark + GraphFrames:
# Production: Use PySpark + GraphFrames for billion-scale
from pyspark.sql import SparkSession
from graphframes import GraphFrame
spark = SparkSession.builder.appName("people-graph-features").getOrCreate()
# GraphFrames expects an `id` column on vertices and `src`/`dst` columns on edges
vertices = spark.read.parquet("s3://linkedin-data/people/")
edges = spark.read.parquet("s3://linkedin-data/connections/")
g = GraphFrame(vertices, edges)
# PageRank distributed across cluster
pagerank_df = g.pageRank(resetProbability=0.15, maxIter=20)
# Connected components
cc_df = g.connectedComponents()
# Triangle count (for clustering coefficient proxy)
triangles = g.triangleCount()
2. Incremental graph updates: Instead of recomputing from scratch every day, LinkedIn maintains a persistent graph state and applies daily deltas. New connections are added, removed connections are deleted, and only affected nodes have their scores recomputed (a rough sketch of this delta merge follows the list below).
3. Feature store: Computed features land in a feature store (LinkedIn uses their own internal system, similar to Feast) so the ML training pipeline can read them without recomputation.
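To illustrate point 2, here's a minimal PySpark sketch of a daily delta merge. The paths, the `op` column, and the `src`/`dst` column names are assumptions for the example, not LinkedIn's actual layout:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("connection-delta-merge").getOrCreate()
# Persisted edge state and today's delta (paths and schema are assumed)
edges = spark.read.parquet("s3://example-bucket/connections/current/")
delta = spark.read.parquet("s3://example-bucket/connections/delta/dt=2026-05-01/")
adds = delta.filter(F.col("op") == "add").select("src", "dst")
removes = delta.filter(F.col("op") == "remove").select("src", "dst")
# Drop removed edges, append new ones, dedupe
updated = (
    edges.select("src", "dst")
         .join(removes, ["src", "dst"], "left_anti")
         .unionByName(adds)
         .dropDuplicates(["src", "dst"])
)
# Only members touched by the delta need their graph features recomputed
touched = (
    delta.select(F.col("src").alias("member_id"))
         .union(delta.select(F.col("dst").alias("member_id")))
         .distinct()
)
updated.write.mode("overwrite").parquet("s3://example-bucket/connections/next/")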
The interviewer was impressed that I understood the gap between the interview exercise (networkx on a laptop) and production (Spark GraphFrames on a cluster).
Phase 4: System Design — Job Recommendation Data Pipeline
The final technical round was a system design question:
“Design the data pipeline that powers ‘Jobs You May Be Interested In’. It needs to score millions of job-member pairs daily, using both content features (skills, location, seniority) and graph features (network proximity, community signals).”
I sketched this architecture:
┌───────────────────────────────────────────────────┐
│ LinkedIn Job Recommendation Pipeline │
└───────────────────────────────────────────────────┘
┌───────────────┐ ┌──────────────────┐ ┌─────────────────────────┐
│ People │ │ Company │ │ Job Postings │
│ Graph │ │ Graph │ │ (Hiring API) │
│ (Members, │ │ (Organizations, │ │ │
│ Skills, │ │ Departments) │ │ ┌───────────────────┐ │
│ Experience) │ │ │ │ │ Job Requirements │ │
└───────┬───────┘ └───────┬──────────┘ │ │ Required Skills │ │
│ │ │ │ Location/Remote │ │
│ HDFS/Parquet │ HDFS/Parquet │ │ Seniority Level │ │
▼ ▼ │ └────────┬──────────┘ │
┌───────────────────────────────────────────────────────────────────────┐
│ HDFS Data Lake (Raw Zone) │
│ /people/year=2026/month=05/ /companies/ /jobs/ /applications/ │
└───────────────────────────┬───────────────────────────────────────────┘
│
┌─────────────┴─────────────┐
▼ ▼
┌──────────────────────┐ ┌──────────────────────────┐
│ Spark ETL Cluster │ │ GraphFrames Cluster │
│ │ │ │
│ ┌────────────────┐ │ │ ┌────────────────────┐ │
│ │ Content │ │ │ │ Graph Features: │ │
│ │ Feature │ │ │ │ - PageRank │ │
│ │ Extraction │ │ │ │ - Betweenness │ │
│ │ - Skill match │ │ │ │ - Degree centrality│ │
│ │ - Title NLP │ │ │ │ - Community ID │ │
│ │ - Seniority │ │ │ │ - Component size │ │
│ │ - Location │ │ │ └────────┬───────────┘ │
│ └────────┬───────┘ │ └──────────┬───────────────┘
└───────────┼──────────┘ │
│ │
▼ ▼
┌────────────────────────────────────────────────────────────┐
│ Feature Store (Internal) │
│ ┌─────────────────┐ ┌────────────────────┐ ┌─────────┐ │
│ │ Content Features│ │ Graph Features │ │ History │ │
│ │ (skill_score, │ │ (pagerank, degree, │ │ (past │ │
│ │ location_match)│ │ community, etc.) │ │ clicks)│ │
│ └────────┬────────┘ └────────┬───────────┘ └────┬────┘ │
└───────────┼────────────────────┼───────────────────┼───────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ Recommendation Scoring Engine │
│ │
│ For each (member, job) pair: │
│ score = w1*content_score + w2*graph_score + w3*behavioral │
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌────────────────────┐ │
│ │ Content Score │ │ Graph Score │ │ Behavioral Score │ │
│ │ (0.45 weight) │ │ (0.30 weight) │ │ (0.25 weight) │ │
│ │ │ │ │ │ (click history, │ │
│ │ skill_overlap │ │ network │ │ past applications)│ │
│ │ title_sim │ │ proximity │ │ │ │
│ │ location_fit │ │ community │ │ │ │
│ └───────────────┘ └───────────────┘ └────────────────────┘ │
└──────────────────────────┬──────────────────────────────────────┘
│
┌─────────┴──────────┐
▼ ▼
┌───────────────┐ ┌──────────────────┐
│ Member Job │ │ A/B Test │
│ Feed Table │ │ Framework │
│ (Redis) │ │ (weight tuning) │
└───────┬───────┘ └──────────────────┘
│
▼
┌───────────────┐
│ LinkedIn UI │
│ (Jobs Tab) │
└───────────────┘
┌────────────────────────────────────────────────────────────────────┐
│ Orchestration & Monitoring │
│ ┌────────────┐ ┌──────────────┐ ┌─────────────────┐ ┌───────┐ │
│ │ Oozie/ │ │ Data Quality │ │ Feature Fresh- │ │ Alert │ │
│ │ Airflow │ │ (schema+ │ │ ness SLA checks │ │ System│ │
│ │ (DAG mgmt) │ │ null checks)│ │ (< 4hr lag) │ │ (Pager)│ │
│ └────────────┘ └──────────────┘ └─────────────────┘ └───────┘ │
└────────────────────────────────────────────────────────────────────┘
Key Design Decisions
I explained several critical decisions:
1. Why not real-time scoring for every job view?
The candidate-job matrix is enormous — 900M members × 20M active job postings = 18 quadrillion pairs. You can’t score all of them in real-time. Instead:
- Precompute a top-K list for each member (e.g., top 500 jobs)
- Rank at query time using lightweight features (recency, freshness)
- Recompute the top-K list in a daily micro-batch
2. Feature freshness vs. computation cost trade-off:
- Content features (skills, title): updated weekly
- Graph features (PageRank, communities): updated daily
- Behavioral features (click history): updated hourly
- The scoring engine reads from the feature store and handles stale features gracefully
3. A/B testing the weights:
The w1, w2, w3 weights aren’t hardcoded — they’re tuned through continuous A/B testing. The data engineering pipeline supports multiple weight configurations in parallel, each serving a different experiment bucket.
4. Cold start problem for new job postings:
New jobs have no application history and no network data. I proposed:
- Content-only scoring for the first 48 hours (no graph/behavioral features)
- Bootstrap from similar jobs — cluster new jobs with existing jobs by (title, skills, company) and use their click patterns as priors
The interviewer liked the cold start approach and asked about the clustering algorithm. I mentioned using TF-IDF on skill sets + cosine similarity for job clustering, computed incrementally as new jobs arrive.
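As a toy version of that clustering idea (synthetic job rows, treating each job's skill list as a "document" — a simplification of whatever LinkedIn actually runs):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Existing jobs with click history (synthetic examples)
existing = pd.DataFrame({
    "job_id": [101, 102, 103],
    "skills": [
        "python spark sql airflow",
        "java kafka streaming microservices",
        "python pandas sql dashboards",
    ],
})
new_job_skills = ["python sql dbt"]  # freshly posted job, no history yet
vectorizer = TfidfVectorizer()
existing_vecs = vectorizer.fit_transform(existing["skills"])
new_vec = vectorizer.transform(new_job_skills)
# Borrow click-through priors from the most similar existing jobs
sims = cosine_similarity(new_vec, existing_vecs).ravel()
neighbors = existing.assign(similarity=sims).nlargest(2, "similarity")
print(neighbors[["job_id", "similarity"]])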
Phase 5: Behavioral and Cultural Fit
LinkedIn’s interview includes a strong behavioral component focused on their core values. Here’s what came up:
Question: “Tell me about a time you worked with ambiguous requirements on a data project.”
I described building a referral quality score at my previous company. The initial requirement was simply “measure how good our referral program is.” I had to:
- Define “quality” — was it hire rate? retention at 6 months? performance rating? I proposed a composite score.
- Build the data model — link referrals to hires to performance reviews, spanning 3 different systems.
- Validate with stakeholders — show the score to HR and engineering managers, iterate on the definition.
- Ship incrementally — start with hire rate only, add retention after 2 weeks, add performance after another 2.
The interviewer emphasized that at LinkedIn, data engineers often define the metrics, not just compute them. The ability to translate vague business questions into precise data models is critical.
Question: “How do you handle a situation where your data pipeline breaks and it affects a product feature?”
My approach: blameless incident response with clear severity levels.
- Detect and contain — monitoring alerts trigger a page. First action is to route to a cached/stale version of the data so the UI doesn’t show errors.
- Diagnose — check pipeline logs, data quality gates, upstream source changes.
- Fix and backfill — repair the pipeline, reprocess the affected window.
- Post-mortem — document root cause, implement a guardrail (e.g., new data quality check) so it doesn’t happen again.
I mentioned that at LinkedIn, this is especially critical because job recommendation scores affect people’s careers. A broken pipeline could mean qualified candidates never see relevant jobs.
Interview Summary
Here’s a structured recap of the entire interview:
Round 1 — Graph Intuition (30 min)
- Topic: Identifying novel graph metrics for professional networks
- Key skill: Translating product questions into graph-theoretic concepts
- What they valued: Understanding that LinkedIn’s product is a graph, not just a database
Round 2 — SQL: Network Analysis (60 min)
- Topic: Second-degree influence detection, skill-match scoring with network proximity
- Key skills: Multi-hop joins, CTEs, window functions, handling edge cases
- Bonus: Dynamic weighting based on skill scarcity, partition pruning for 50B-row tables
Round 3 — Python: Graph ETL (60 min)
- Topic: Building network features (PageRank, betweenness, communities) from raw connection data
- Key skills: networkx graph operations, feature engineering, MinMaxScaler normalization
- Bonus: Understanding the leap from networkx to Spark GraphFrames for production scale
Round 4 — System Design: Recommendation Pipeline (45 min)
- Topic: End-to-end job recommendation data pipeline with content + graph + behavioral signals
- Key skills: Feature store architecture, micro-batch scoring, cold start strategies, A/B testing
- Bonus: Practical awareness of the 18 quadrillion candidate-job pairs problem
Round 5 — Behavioral (30 min)
- Topic: Ambiguous requirements, incident response, stakeholder management
- Key skills: Translating vague questions into data models, blameless post-mortems
Total time: ~4 hours (spread across 2 days)
What Made the Difference
Three things stood out as differentiators:
- Graph-first thinking. Most candidates tried to solve graph problems with flat SQL joins. Showing fluency in graph concepts — centrality, communities, path length — immediately signaled I could work with the People Graph.
- Knowing the gap between interview code and production. When I used networkx, I immediately acknowledged it wouldn’t scale and pivoted to Spark GraphFrames. The interviewers want engineers who know when their tools stop working and what to reach for next.
- Connecting graph features to product outcomes. I didn’t just compute PageRank — I explained how it maps to “who should we show this job to.” The best data engineers at LinkedIn bridge the gap between abstract graph theory and real user experiences.
Recommended Reading and Resources
If you’re preparing for a Data Engineer role at LinkedIn or a graph-heavy data position, here’s what I’d study:
Graph Data Engineering:
- Networks, Crowds, and Markets by Easley and Kleinberg — the foundational text on network analysis for applied engineers
- NetworkX Documentation — essential for prototyping graph algorithms
- Apache Spark GraphFrames — distributed graph computation for production pipelines
- Graph Database patterns — InfoQ — practical modeling approaches
LinkedIn-Specific:
- LinkedIn Engineering Blog — their most-read posts cover graph algorithms, the People Graph architecture, and recommendation systems
- LinkedIn’s “People You May Know” paper — understanding their two-tower recommendation architecture
- LinkedIn Data Platform blog posts — covers their Hadoop/Spark/Kafka stack evolution
- LinkedIn’s open-source data projects on GitHub — look at DataHub (metadata), Kafka contributions, and Samza (stream processing)
Graph Analytics & SQL:
- Neo4j Graph Data Science library — PageRank, community detection, node similarity — concepts transfer to any platform
- LeetCode Graph problems — BFS, DFS, shortest path fundamentals
- Window functions practice: SQLBolt — critical for ranking and scoring queries
Feature Engineering for Recommendations:
- Feast — Feature Store — open-source feature store for ML pipelines
- A Unified Social Recommendation Framework — LinkedIn Research — the seminal paper on their two-tower approach
- Machine Learning Feature Engineering — O’Reilly book on practical feature design
System Design for Data Platforms:
- Designing Data-Intensive Applications by Martin Kleppmann — the bible for data platform architecture
- Martin Kleppmann’s blog on streaming architectures — stream processing at LinkedIn scale
Final Thoughts
The LinkedIn Data Engineer interview is unique because it forces you to think in graphs, not tables. Most data engineering interviews focus on ETL pipelines, warehouse modeling, and dashboard queries. LinkedIn throws in centrality metrics, community detection, and network proximity — concepts that sit at the intersection of graph theory and data engineering.
The good news: you don’t need to be a graph theorist. You need to understand the intuition behind graph metrics and know how to implement them at scale using distributed tools. The interviewers care more about whether you can translate “People You May Know” into “compute k-nearest neighbors on a bipartite graph with learned embeddings” than whether you can prove the convergence of the Louvain algorithm.
If you can show that you understand both the data engineering fundamentals (SQL, Python, Spark, pipelines) and the graph thinking that makes LinkedIn’s product unique — how networks create value, how proximity drives trust, how communities form and evolve — you’ll stand out in a way that’s hard to replicate.
Good luck. And remember: at LinkedIn, every data engineer is also a network theorist — whether they know it yet or not.