
DoorDash Data Engineer Interview Debrief 2026: SQL + Python + Architecture Deep Dive

Walk through every stage of the DoorDash Data Engineer interview process. Real SQL queries on delivery analytics, Python ETL pipelines, Spark architecture questions, and actionable prep strategies from candidates who passed.

Sam · 16 min read

DoorDash Data Engineer Interview Process: A Complete Walkthrough with Real Questions

I spent the last several months deconstructing what it actually takes to pass the DoorDash Data Engineer interview. Not the generic “prepare well” advice you find on every blog — I mean the real SQL queries they hand you, the Python ETL scenarios they throw at you, and the system design problems that separate candidates who get offers from those who don’t.

DoorDash processes over 1 billion delivery events per month in its data warehouse. When you sit in that interview, the interviewers aren’t looking for someone who can SELECT * FROM orders. They’re looking for someone who can think through late-night surge pricing, restaurant preparation time variance, and driver ETA prediction — all as data engineering problems.

Let me walk you through every stage.

The Four Stages

DoorDash’s Data Engineer interview follows a four-stage pipeline:

  1. Recruiter Screen (30 min) — Background, motivation, salary band
  2. Technical Screen (45 min) — SQL + Python fundamentals on delivery data
  3. Onsite Loop (4 × 45 min) — Deep-dive SQL, ETL design, data modeling, Spark/infrastructure
  4. Hiring Committee + Offer (1–2 weeks) — Asynchronous review

Most candidates get through Stage 1. About 40% make it through Stage 2. Only 25–30% of people who enter the onsite loop receive offers.


Stage 1: Recruiter Screen

This is straightforward. The recruiter will cover:

  • Your current role and why you’re interested in DoorDash
  • Your experience building data pipelines (batch vs. streaming)
  • Your familiarity with cloud platforms (DoorDash uses GCP — BigQuery, Dataflow, Pub/Sub)
  • Salary expectations and location preferences

What they’re really evaluating: Can this person communicate clearly? Do they have 3+ years of hands-on data pipeline experience? Are they excited about logistics and food delivery, or just any FAANG-adjacent job?

Prep tip: Research DoorDash’s data engineering blog posts. They’ve published extensively on their data platform evolution. Mentioning specific posts shows genuine interest.


Stage 2: Technical Screen — SQL on Delivery Data

This is where most candidates fall. You’ll get a live coding session — usually on CoderPad or a shared Google Doc — and they’ll hand you a schema that looks something like this:

orders (order_id, user_id, restaurant_id, dasher_id, created_at, delivered_at, status, order_total)
restaurants (restaurant_id, name, city, lat, lng, category, rating)
dashers (dasher_id, signup_date, city, status, total_deliveries)
delivery_events (event_id, order_id, event_type, event_time, dasher_id, lat, lng)

Question 1: Average Delivery Time by Restaurant Category

This is your first filter question. They want to see if you can write clean, correct SQL under pressure.

Prompt: “Calculate the average delivery time (in minutes) for each restaurant category over the last 30 days. Only include restaurants with at least 50 completed deliveries in that period.”

Here’s how I would write this — and more importantly, how I’d explain my thinking out loud:

WITH recent_orders AS (
    SELECT
        o.order_id,
        o.restaurant_id,
        o.delivered_at,
        o.created_at,
        TIMESTAMP_DIFF(o.delivered_at, o.created_at, MINUTE) AS delivery_minutes
    FROM orders o
    WHERE o.status = 'delivered'
      AND o.delivered_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
),
restaurant_stats AS (
    SELECT
        r.category,
        r.restaurant_id,
        COUNT(ro.order_id) AS order_count,
        AVG(ro.delivery_minutes) AS avg_delivery_minutes
    FROM recent_orders ro
    JOIN restaurants r ON ro.restaurant_id = r.restaurant_id
    GROUP BY r.category, r.restaurant_id
    HAVING COUNT(ro.order_id) >= 50
)
SELECT
    category,
    ROUND(AVG(avg_delivery_minutes), 2) AS avg_delivery_time_min,
    COUNT(restaurant_id) AS restaurant_count,
    SUM(order_count) AS total_orders
FROM restaurant_stats
GROUP BY category
ORDER BY avg_delivery_time_min ASC;

What they’re testing here:

  • Do you know TIMESTAMP_DIFF vs. UNIX_SECONDS subtraction?
  • Do you put the HAVING clause in the right CTE (after the GROUP BY at the restaurant level, not globally)?
  • Can you handle the two-level aggregation (restaurant → category) correctly?
  • Do you filter for status = 'delivered' explicitly?

Common mistakes I see candidates make:

  • Putting HAVING COUNT(*) >= 50 in the outer query instead of the inner one — this changes the meaning entirely
  • Forgetting that CURRENT_DATE() vs CURRENT_TIMESTAMP() matters in BigQuery
  • Not rounding the final output
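To see why the placement of the >= 50 filter matters, here is a small Python simulation (toy restaurants and hypothetical delivery times, not real data) comparing the correct per-restaurant filter against the wrong category-level filter:

```python
from collections import defaultdict

# Toy orders: (restaurant_id, category, delivery_minutes), hypothetical data
orders = (
    [("r1", "pizza", 20)] * 60    # r1: 60 orders, qualifies
    + [("r2", "pizza", 60)] * 10  # r2: only 10 orders, should be excluded
    + [("r3", "sushi", 30)] * 55  # r3: 55 orders, qualifies
)

per_restaurant = defaultdict(list)
for rid, cat, minutes in orders:
    per_restaurant[(rid, cat)].append(minutes)

# Correct: drop restaurants below 50 orders BEFORE the category rollup
qualified = {k: v for k, v in per_restaurant.items() if len(v) >= 50}
by_category = defaultdict(list)
for (rid, cat), times in qualified.items():
    by_category[cat].append(sum(times) / len(times))
correct = {cat: sum(avgs) / len(avgs) for cat, avgs in by_category.items()}

# Wrong: roll everything up, then filter on the category's total order count
wrong_by_category = defaultdict(list)
for (rid, cat), times in per_restaurant.items():
    wrong_by_category[cat].extend(times)
wrong = {
    cat: sum(t) / len(t)
    for cat, t in wrong_by_category.items()
    if len(t) >= 50
}

print(correct["pizza"])         # 20.0, r2's slow deliveries correctly excluded
print(round(wrong["pizza"], 1)) # 25.7, r2 leaks into the category average
```

The two numbers diverge because the outer-query filter only checks total category volume, so the under-sampled restaurant still pollutes the average.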

Follow-up: P95 Delivery Time

The interviewer will almost always push you deeper:

“Now show me the P95 delivery time instead of average. What’s different?”

SELECT
    r.category,
    APPROX_QUANTILES(
        TIMESTAMP_DIFF(o.delivered_at, o.created_at, SECOND), 100
    )[OFFSET(95)] AS p95_delivery_seconds
FROM orders o
JOIN restaurants r ON o.restaurant_id = r.restaurant_id
WHERE o.status = 'delivered'
  AND o.delivered_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY r.category
ORDER BY p95_delivery_seconds DESC;

-- Note: BigQuery's PERCENTILE_CONT is an analytic function only (there is
-- no WITHIN GROUP clause), so an exact-percentile version needs
-- PERCENTILE_CONT(x, 0.95) OVER (PARTITION BY r.category) combined with
-- SELECT DISTINCT instead of GROUP BY.

Key insight: DoorDash cares about tails, not averages. A 25-minute average with a P95 of 75 minutes is a product problem. They want engineers who understand that P95/P99 metrics drive SLA dashboards.
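The average-hides-the-tail point is easy to demonstrate. A quick sketch with made-up delivery times (the distribution is invented for illustration):

```python
# Hypothetical delivery times (minutes): mostly fast, a few very slow
times = [20] * 90 + [30] * 5 + [90] * 5

avg = sum(times) / len(times)

def percentile(values, p):
    """Nearest-rank percentile on a sorted copy (simple, not interpolated)."""
    ordered = sorted(values)
    idx = min(int(len(ordered) * p / 100), len(ordered) - 1)
    return ordered[idx]

p95 = percentile(times, 95)
print(round(avg, 1), p95)  # 24.0 90: the average looks fine, P95 exposes the tail
```

An average of 24 minutes with a P95 of 90 minutes is exactly the "product problem" scenario described above.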

Question 2: Dasher Performance Ranking

Prompt: “Rank dashers by their on-time delivery rate within each city for the past week. An on-time delivery is one where delivered_at is within the estimated delivery window (estimated_delivery_time column).”

WITH weekly_deliveries AS (
    SELECT
        d.city,
        o.dasher_id,
        COUNT(*) AS total_deliveries,
        SUM(CASE WHEN o.delivered_at <= TIMESTAMP_ADD(o.created_at, INTERVAL o.estimated_delivery_time MINUTE)
                 THEN 1 ELSE 0 END) AS on_time_deliveries,
        ROUND(
            SUM(CASE WHEN o.delivered_at <= TIMESTAMP_ADD(o.created_at, INTERVAL o.estimated_delivery_time MINUTE)
                     THEN 1 ELSE 0 END) * 100.0
            / COUNT(*),
            2
        ) AS on_time_rate
    FROM orders o
    JOIN dashers d ON o.dasher_id = d.dasher_id
    WHERE o.status = 'delivered'
      AND o.delivered_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY d.city, o.dasher_id
    HAVING COUNT(*) >= 10
)
SELECT
    city,
    dasher_id,
    total_deliveries,
    on_time_deliveries,
    on_time_rate,
    RANK() OVER (PARTITION BY city ORDER BY on_time_rate DESC, total_deliveries DESC) AS city_rank
FROM weekly_deliveries
ORDER BY city, city_rank;

What’s subtle here: The HAVING COUNT(*) >= 10 threshold. DoorDash won’t rank a dasher who only made 3 deliveries. You need enough sample size for the metric to be meaningful. Interviewers listen for candidates who volunteer this reasoning.

Follow-up they’ll ask: “How would this query perform on a table with 500 million rows? What indexes or partitioning would you add?”

Good answers mention:

  • Partition by delivered_at (time-based partitioning in BigQuery)
  • Cluster by dasher_id and city
  • Pre-aggregate into a daily rolling table via a scheduled pipeline
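The pre-aggregation idea from the last bullet can be sketched in plain Python: a scheduled job folds raw deliveries into a small daily rollup once, and the weekly ranking then reads a handful of rollup rows per dasher instead of rescanning raw events. Data and field names are hypothetical:

```python
from collections import defaultdict

# Raw delivery records: (day, city, dasher_id, on_time), hypothetical
raw = [
    (1, "sf", "d1", True), (1, "sf", "d1", False),
    (2, "sf", "d1", True), (2, "sf", "d2", True),
]

# Scheduled job: fold raw rows into a daily rollup keyed by (day, city, dasher)
daily = defaultdict(lambda: {"total": 0, "on_time": 0})
for day, city, dasher, on_time in raw:
    row = daily[(day, city, dasher)]
    row["total"] += 1
    row["on_time"] += int(on_time)

# Weekly query: sum a few tiny rollup rows per dasher instead of raw events
weekly = defaultdict(lambda: {"total": 0, "on_time": 0})
for (day, city, dasher), row in daily.items():
    agg = weekly[(city, dasher)]
    agg["total"] += row["total"]
    agg["on_time"] += row["on_time"]

rates = {
    k: round(v["on_time"] / v["total"] * 100, 2) for k, v in weekly.items()
}
print(rates[("sf", "d1")])  # 66.67
```

On a 500M-row table, the same shape (daily partitioned rollup, weekly read over 7 partitions) is what turns a full scan into a cheap lookup.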

Stage 3: Onsite Loop — Four Deep Dives

The onsite consists of four 45-minute rounds. Let me break down each one.

Round 1: Advanced SQL & Data Modeling

You’ll work with a more complex schema that includes denormalized tables, slowly changing dimensions, and events tables.

The question: DoorDash needs a delivery analytics dashboard. The product team wants to track:

  • Order volume trends (hourly, daily, weekly)
  • Average order value by restaurant and delivery zone
  • Delivery time percentiles (P50, P90, P95, P99)
  • Cancellation rates at each stage (user-cancelled, restaurant-cancelled, dasher-cancelled)

They’ll ask you to design the star schema for this dashboard and write the SQL to populate it.

Here’s how I structured my answer:

                    +------------------+
                    |   fact_orders    |
                    |------------------|
                    | order_id (FK)    |
                    | date_id (FK)     |
                    | restaurant_id(FK)|
                    | zone_id (FK)     |
                    | dasher_id (FK)   |
                    | created_at       |
                    | delivered_at     |
                    | delivery_minutes |
                    | order_total      |
                    | order_tax        |
                    | delivery_fee     |
                    | is_cancelled     |
                    | cancel_stage     |
                    | is_on_time       |
                    +------------------+
                           |
          +----------------+----------------+
          |                |                |
   +-------------+  +----------------+  +-------------+
   | dim_date    |  | dim_restaurant |  | dim_zone    |
   |-------------|  |----------------|  |-------------|
   | date_id     |  | restaurant_id  |  | zone_id     |
   | date        |  | name           |  | city        |
   | day_of_week |  | category       |  | zone_name   |
   | is_weekend  |  | rating         |  | avg_time    |
   +-------------+  | avg_prep_min   |  +-------------+
                    +----------------+

Then you write the transformation SQL to populate fact_orders from the raw source tables:

CREATE OR REPLACE TABLE analytics.fact_orders
PARTITION BY date_id
CLUSTER BY restaurant_id, zone_id
AS
SELECT
    DATE(o.created_at) AS date_id,  -- DATE type, so it can be the partition column
    o.restaurant_id,
    COALESCE(r.zone_id, 'UNKNOWN') AS zone_id,
    o.order_id,
    o.created_at,
    o.delivered_at,
    TIMESTAMP_DIFF(o.delivered_at, o.created_at, MINUTE) AS delivery_minutes,
    o.order_total,
    o.order_tax,
    o.delivery_fee,
    (o.status = 'cancelled') AS is_cancelled,
    o.cancel_stage,
    (o.delivered_at <= TIMESTAMP_ADD(
        o.created_at, INTERVAL o.estimated_delivery_time MINUTE
    )) AS is_on_time
FROM raw.orders o
LEFT JOIN raw.restaurants r ON o.restaurant_id = r.restaurant_id
WHERE o.created_at >= '2026-01-01';

Key points they evaluate:

  • Did you denormalize the right attributes? (e.g., delivery_minutes pre-computed in the fact table)
  • Is the partitioning strategy appropriate? (daily partition by date_id)
  • Are you handling NULL zones gracefully? (COALESCE)
  • Did you flag the boolean columns explicitly? (is_cancelled, is_on_time)

Round 2: Python ETL Pipeline

This is the round that separates mid-level from senior candidates. You’ll be asked to write a production-grade ETL pipeline — not just a script, but something with error handling, idempotency, and monitoring hooks.

The prompt:

“Build a daily ETL pipeline that ingests delivery events from Pub/Sub, aggregates them by restaurant and hour, and writes to BigQuery. The pipeline must be idempotent — running it twice for the same day should not duplicate data.”

Here’s my full solution:

"""
DoorDash Delivery Analytics — Daily ETL Pipeline
Ingests raw delivery events, aggregates by restaurant + hour,
and upserts into BigQuery analytics layer.

Features:
  - Idempotent via partition_date watermark
  - Dead-letter queue for malformed records
  - Structured logging with pipeline_run_id
"""

import argparse
import json
import logging
import uuid
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

from google.cloud import bigquery, pubsub_v1
from google.api_core.exceptions import GoogleAPIError

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
logger = logging.getLogger("delivery_etl")

BIGQUERY_PROJECT = "doordash-analytics"
BIGQUERY_DATASET = "staging"
BIGQUERY_TABLE = "delivery_hourly_agg"
PUBSUB_SUBSCRIPTION = "projects/doordash-data/subscriptions/delivery-events-sub"
DEAD_LETTER_TOPIC = "projects/doordash-data/topics/delivery-events-dlq"
BATCH_SIZE = 500


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Delivery hourly aggregation ETL")
    parser.add_argument(
        "--partition-date",
        type=str,
        required=True,
        help="Partition date to process (YYYY-MM-DD)",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Validate records without writing to BigQuery",
    )
    return parser.parse_args()


def fetch_events(partition_date: str, max_messages: int = 10_000) -> List[Dict]:
    """Synchronously pull delivery events from Pub/Sub for a given date."""
    subscriber = pubsub_v1.SubscriberClient()
    received: List[Dict] = []
    deadline = datetime.now(timezone.utc) + timedelta(minutes=10)

    while len(received) < max_messages and datetime.now(timezone.utc) < deadline:
        try:
            response = subscriber.pull(
                subscription=PUBSUB_SUBSCRIPTION,
                max_messages=BATCH_SIZE,
                timeout=30,
            )
        except GoogleAPIError:
            break  # no messages arrived within the timeout

        if not response.received_messages:
            break

        ack_ids = []
        for msg in response.received_messages:
            ack_ids.append(msg.ack_id)
            try:
                data = json.loads(msg.message.data.decode("utf-8"))
                event_time = datetime.fromisoformat(data["event_time"])
                if event_time.date().isoformat() == partition_date:
                    received.append(data)
                else:
                    # Event belongs to another partition: dead-letter it
                    publish_to_dlq(msg.message.data)
            except (json.JSONDecodeError, ValueError, KeyError) as e:
                logger.warning("Malformed event, routing to DLQ: %s", e)
                publish_to_dlq(msg.message.data)

        # Ack everything we handled, including dead-lettered messages,
        # so Pub/Sub does not redeliver them
        subscriber.acknowledge(
            subscription=PUBSUB_SUBSCRIPTION, ack_ids=ack_ids
        )

    logger.info("Fetched %d events for %s", len(received), partition_date)
    return received


def publish_to_dlq(raw_data: bytes) -> None:
    """Send malformed events to dead-letter topic for later inspection."""
    publisher = pubsub_v1.PublisherClient()
    publisher.publish(DEAD_LETTER_TOPIC, raw_data)


def transform_events(events: List[Dict]) -> List[Dict]:
    """
    Aggregate raw delivery events into hourly restaurant-level metrics.

    Input event shape:
      {
        "event_id": "evt_123",
        "order_id": "ord_456",
        "restaurant_id": "rest_789",
        "event_type": "delivered",
        "event_time": "2026-05-17T14:32:00+00:00",
        "delivery_minutes": 23,
        "order_total": 34.50,
        "is_on_time": true
      }

    Output: one row per (restaurant_id, hour)
    """
    agg: Dict[str, Dict] = {}

    for event in events:
        if event.get("event_type") != "delivered":
            continue

        event_time = datetime.fromisoformat(event["event_time"])  # already tz-aware
        hour_key = event_time.strftime("%Y-%m-%dT%H:00:00+00:00")
        key = f"{event['restaurant_id']}|{hour_key}"

        if key not in agg:
            agg[key] = {
                "restaurant_id": event["restaurant_id"],
                "partition_date": event_time.date().isoformat(),
                "hour": hour_key,
                "order_count": 0,
                "total_revenue": 0.0,
                "delivery_times": [],
                "on_time_count": 0,
            }

        bucket = agg[key]
        bucket["order_count"] += 1
        bucket["total_revenue"] += event.get("order_total", 0.0)
        delivery_min = event.get("delivery_minutes")
        if delivery_min is not None:  # guard against sum()/sort() over None
            bucket["delivery_times"].append(delivery_min)
        if event.get("is_on_time"):
            bucket["on_time_count"] += 1

    # Derive summary stats per bucket
    results = []
    for bucket in agg.values():
        times = bucket["delivery_times"]
        times_sorted = sorted(times)
        n = len(times_sorted)

        bucket["avg_delivery_min"] = round(sum(times) / n, 2) if n else None
        bucket["p50_delivery_min"] = times_sorted[n // 2] if n else None
        bucket["p95_delivery_min"] = times_sorted[int(n * 0.95)] if n else None
        bucket["total_revenue"] = round(bucket["total_revenue"], 2)
        bucket["on_time_rate"] = round(
            bucket["on_time_count"] / bucket["order_count"] * 100, 2
        )

        # Remove intermediate array to keep rows lean
        bucket.pop("delivery_times")
        results.append(bucket)

    logger.info("Transformed %d events into %d aggregated rows", len(events), len(results))
    return results


def load_to_bigquery(rows: List[Dict], partition_date: str, dry_run: bool = False) -> None:
    """
    Load aggregated rows into BigQuery.

    Idempotency: writes with WRITE_TRUNCATE into the specific date
    partition (table$YYYYMMDD), so re-running the same --partition-date
    replaces that partition instead of appending duplicates.
    """
    if dry_run:
        logger.info("Dry run — skipping BigQuery write for %d rows", len(rows))
        return

    client = bigquery.Client(project=BIGQUERY_PROJECT)
    # Partition decorator targets exactly one date partition
    partition_suffix = partition_date.replace("-", "")
    table_id = f"{BIGQUERY_PROJECT}.{BIGQUERY_DATASET}.{BIGQUERY_TABLE}${partition_suffix}"

    # Schema definition
    schema = [
        bigquery.SchemaField("restaurant_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("partition_date", "DATE", mode="REQUIRED"),
        bigquery.SchemaField("hour", "TIMESTAMP", mode="REQUIRED"),
        bigquery.SchemaField("order_count", "INTEGER", mode="NULLABLE"),
        bigquery.SchemaField("total_revenue", "FLOAT64", mode="NULLABLE"),
        bigquery.SchemaField("avg_delivery_min", "FLOAT64", mode="NULLABLE"),
        bigquery.SchemaField("p50_delivery_min", "FLOAT64", mode="NULLABLE"),
        bigquery.SchemaField("p95_delivery_min", "FLOAT64", mode="NULLABLE"),
        bigquery.SchemaField("on_time_rate", "FLOAT64", mode="NULLABLE"),
    ]

    job_config = bigquery.LoadJobConfig(
        schema=schema,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    job = client.load_table_from_json(rows, table_id, job_config=job_config)
    job.result()  # raises GoogleAPIError on failure

    logger.info("Successfully loaded %d rows for %s", len(rows), partition_date)


def main() -> None:
    """Main ETL entry point."""
    args = parse_args()
    pipeline_run_id = str(uuid.uuid4())
    logger.info("Pipeline run %s started for %s", pipeline_run_id, args.partition_date)

    try:
        # Extract
        events = fetch_events(args.partition_date)
        if not events:
            logger.warning("No events found for %s — exiting", args.partition_date)
            return

        # Transform
        aggregated = transform_events(events)

        # Load
        load_to_bigquery(aggregated, args.partition_date, dry_run=args.dry_run)

        logger.info("Pipeline run %s completed successfully", pipeline_run_id)

    except Exception as e:
        logger.exception("Pipeline run %s FAILED: %s", pipeline_run_id, e)
        raise


if __name__ == "__main__":
    main()

What interviewers are evaluating in this code:

  1. Idempotency strategy — Can you explain why re-running for the same --partition-date won’t duplicate data? (Writing with WRITE_TRUNCATE into the specific date partition replaces that day’s rows on every run.)
  2. Error handling — Dead-letter queue for malformed events. This is non-negotiable in production ETL.
  3. Logging — Structured logging with pipeline_run_id for tracing across systems.
  4. Separation of concerns — fetch_events, transform_events, load_to_bigquery are clean functions.
  5. Configurability — a --dry-run flag for safe testing.
  6. Memory awareness — Batching with max_messages and BATCH_SIZE limits.

Follow-up questions they’ll ask:

“How would you make this pipeline run in real time instead of batch?”

Answer: Switch from the Pub/Sub subscription pull to a Dataflow streaming job using the Apache Beam SDK. Replace the hourly aggregation with a sliding window:

# Apache Beam / Dataflow streaming equivalent
| "WindowInto" >> beam.WindowInto(
    beam.window.SlidingWindows(size=3600, period=300)  # 1h window, slides every 5 min
)
| "Aggregate" >> beam.CombinePerKey(delivery_stats_combiner)
| "WriteToBigQuery" >> beam.io.WriteToBigQuery(
    table=f"{BIGQUERY_DATASET}.{BIGQUERY_TABLE}",
    schema=SCHEMA,
    method="STREAMING_INSERTS"
)

“What happens if the pipeline fails halfway through a partition?”

Answer: The --partition-date flag plus the partition-level WRITE_TRUNCATE mean you simply re-run with the same date. The pipeline pulls all events for that date, re-aggregates, and the load replaces the partial partition. No orphan records.
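The re-run safety argument can be sanity-checked with an in-memory stand-in for BigQuery: model the table as a dict keyed by partition date, make the load replace the partition wholesale, and confirm two runs leave identical state. `fake_table`, `load_partition`, and `run_pipeline` are illustrative names, not the real client API:

```python
# In-memory stand-in for a date-partitioned table: {partition_date: rows}
fake_table = {}

def load_partition(table, partition_date, rows):
    """Replace the whole partition, mimicking WRITE_TRUNCATE on table$YYYYMMDD."""
    table[partition_date] = list(rows)

def run_pipeline(table, partition_date, events):
    # Trivial "aggregation": one output row per restaurant
    agg = {}
    for e in events:
        agg[e["restaurant_id"]] = agg.get(e["restaurant_id"], 0) + 1
    rows = [{"restaurant_id": r, "order_count": c} for r, c in agg.items()]
    load_partition(table, partition_date, rows)

events = [{"restaurant_id": "rest_789"}, {"restaurant_id": "rest_789"},
          {"restaurant_id": "rest_001"}]

run_pipeline(fake_table, "2026-05-17", events)
run_pipeline(fake_table, "2026-05-17", events)  # re-run: no duplicates
print(len(fake_table["2026-05-17"]))  # 2
```

If the load had appended instead of replacing, the second run would double the row count; that is exactly the failure mode the interviewer is probing for.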


Round 3: Distributed Systems & Spark

This round focuses on large-scale data processing. DoorDash uses Spark (via Dataproc on GCP) for their heaviest batch workloads.

The prompt:

“We have a 2 TB daily events table. You need to compute per-driver delivery metrics across a 90-day window. How do you design this in Spark?”

Here’s how I’d approach it:

              +------------------+
              |   Raw Events     |
              |   (2 TB/day)     |
              |   GCS / HDFS     |
              +--------+---------+
                       |
              +--------v---------+
              |   Spark Read     |
              |   (partitioned   |
              |    by date)      |
              +--------+---------+
                       |
              +--------v---------+
              |  Filter 90-day   |
              |  window          |
              +--------+---------+
                       |
              +--------v---------+
              |  Join with       |
              |  dim_dashers     |
              |  (broadcast)     |
              +--------+---------+
                       |
              +--------v---------+
              |  Aggregate by    |
              |  dasher_id       |
              +--------+---------+
                       |
              +--------v---------+
              |  Write to        |
              |  BigQuery        |
              |  (daily partition|
              |   table)         |
              +------------------+

And the actual Spark code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, avg, count, round, when,
    percentile_approx, broadcast,
)

spark = SparkSession.builder \
    .appName("dasher_90day_metrics") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

# Read partitioned events
events = spark.read \
    .option("mergeSchema", "true") \
    .parquet("gs://doordash-data-lake/events/delivery/") \
    .filter(col("event_time") >= "2026-02-16") \
    .filter(col("event_time") <= "2026-05-17")  # 90-day window ending 2026-05-17

# Broadcast the small dim table (avoids shuffle)
dashers = broadcast(
    spark.read.table("analytics.dim_dashers")
)

# Compute metrics
dasher_metrics = events \
    .join(dashers, "dasher_id", "inner") \
    .groupBy("dasher_id", "city", "dasher_since_days") \
    .agg(
        count("order_id").alias("total_deliveries"),
        round(avg("delivery_minutes"), 2).alias("avg_delivery_min"),
        round(
            percentile_approx("delivery_minutes", 0.95), 2
        ).alias("p95_delivery_min"),
        round(
            percentile_approx("delivery_minutes", 0.99), 2
        ).alias("p99_delivery_min"),
        round(
            avg(when(col("is_on_time"), 1).otherwise(0)) * 100, 2
        ).alias("on_time_rate_pct"),
        round(avg("order_total"), 2).alias("avg_order_value"),
    ) \
    .orderBy(col("on_time_rate_pct").desc_nulls_last())

# Write results
dasher_metrics.write \
    .mode("overwrite") \
    .partitionBy("city") \
    .option("compression", "snappy") \
    .parquet("gs://doordash-analytics/metrics/dasher_90day/")

spark.stop()

Key concepts they want you to mention:

  • AQE (Adaptive Query Execution) — Enables dynamic optimization at runtime. spark.sql.adaptive.enabled = true is essential for large joins.
  • Broadcast joins — dim_dashers is small enough to broadcast, eliminating a shuffle.
  • Partition pruning — Reading only the 90-day window from a date-partitioned parquet table.
  • Shuffle partition tuning — Default is 200; for 2 TB/day × 90 days, you might need 400–800.
  • Snappy compression — Good balance of speed and size for parquet output.

Follow-up: “The job is OOM-ing on the executor. How do you debug?”

  1. Check the Spark UI for skewed partitions — one dasher with millions of deliveries could blow up a single executor
  2. Use repartition or salting on the join key if skew is confirmed
  3. Increase executor memory (--executor-memory from 8g to 16g)
  4. Check if AQE is actually coalescing partitions post-shuffle
  5. Look for nested data types (arrays/maps) inflating task memory
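Step 2’s salting trick is easy to show without a cluster: append a random salt to the hot key so its rows spread across N buckets, then strip the salt and merge the partial aggregates. A pure-Python sketch with toy data:

```python
import random
from collections import Counter

random.seed(42)
N_SALTS = 4

# One "hot" dasher dominates the key distribution
rows = [("hot_dasher", 1)] * 1000 + [("d2", 1)] * 10

# Stage 1: salt the key so the hot key's rows land in N_SALTS buckets
salted = Counter()
for key, val in rows:
    salted[(key, random.randrange(N_SALTS))] += val

# Stage 2: strip the salt and merge the partial aggregates
final = Counter()
for (key, _salt), partial in salted.items():
    final[key] += partial

buckets = [c for (k, _), c in salted.items() if k == "hot_dasher"]
print(final["hot_dasher"], len(buckets))  # 1000 4: same result, work split 4 ways
```

In Spark the same idea means adding a salt column before the shuffle-heavy stage and aggregating twice (per-salt, then per-key); the final numbers are unchanged, only the per-task load is.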

Round 4: Data Modeling & Domain Knowledge

This is the “culture fit + domain” round. They’ll dig into how you think about the business of food delivery through a data lens.

Typical questions:

“How would you design a dataset to measure ‘delivery quality’ across the platform?”

My answer framework:

Delivery Quality = f(Accuracy, Speed, Condition)

Metric                         Source                    Granularity
─────────────────────────      ─────────────────────     ───────────────
On-time rate                   orders.delivered_at       Per order
ETA accuracy (MAE)             delivery_events           Per order
Food temperature (sensor)      iot_devices               Per order
Order accuracy (items match)   user_feedback_events      Per order
Customer rating                reviews                   Per order
Dasher rating                  reviews                   Per dasher
Repeat order rate              user_orders               Per user/week

The key insight: delivery quality isn’t one metric — it’s a composite index. DoorDash’s actual internal score is called the “Delivery Experience Score” and it weights multiple signals.
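A composite index like that can be sketched as a weighted sum over normalized signals. The weights below are invented for illustration, not DoorDash’s real formula:

```python
# Hypothetical weights for a composite delivery-quality index; every
# input signal is assumed to be normalized to the 0..1 range first.
WEIGHTS = {
    "on_time_rate": 0.35,
    "eta_accuracy": 0.25,
    "order_accuracy": 0.25,
    "customer_rating": 0.15,
}

def quality_score(signals: dict) -> float:
    """Weighted sum of the signals that are present; missing signals drop
    out and the remaining weights are re-scaled to keep the score in 0..1."""
    present = {k: w for k, w in WEIGHTS.items() if k in signals}
    total_w = sum(present.values())
    return round(
        sum(signals[k] * w for k, w in present.items()) / total_w, 4
    )

print(quality_score({
    "on_time_rate": 0.92, "eta_accuracy": 0.80,
    "order_accuracy": 0.97, "customer_rating": 0.90,
}))  # 0.8995
```

Re-scaling over present signals is one defensible choice for orders that lack, say, a temperature sensor reading; interviewers mostly want to hear that you made the missing-data behavior explicit.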

“A product manager says delivery times increased 15% in San Francisco last week. How do you investigate?”

This is a root cause analysis question. My framework:

  1. Validate the signal — Is it the full dataset or a sampled dashboard? Check raw orders table directly.
  2. Segment — Break down by: hour of day, restaurant category, delivery zone, new vs. returning dashers.
  3. Check for data issues first — Did a tracking pipeline break? Is delivered_at being set incorrectly?
  4. Check external factors — Weather data, SF transit events, holiday periods, restaurant closures.
  5. Isolate the dimension — If it’s only certain zones, it’s likely dasher supply. If it’s all zones, it might be restaurant prep time.
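Step 2 (“segment”) is often a mix-shift question in disguise: the overall average can rise even when every zone is exactly as fast as before, purely because order volume shifted toward a slower zone. A toy decomposition with invented numbers:

```python
# (avg_minutes, order_share) per zone for two weeks, hypothetical numbers
last_week = {"downtown": (25, 0.5), "suburbs": (45, 0.5)}
this_week = {"downtown": (25, 0.2), "suburbs": (45, 0.8)}

def overall(week):
    """Volume-weighted overall average delivery time."""
    return sum(avg * share for avg, share in week.values())

print(overall(last_week), overall(this_week))  # 35.0 41.0
# Per-zone times are identical, yet the overall average jumped ~17%:
# the increase is pure mix shift toward the slower zone.
```

Checking segment-level metrics before trusting the headline number is the difference between “delivery got slower” and “demand moved to the suburbs.”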

“How would you model restaurant availability as a slowly changing dimension?”

This tests your knowledge of SCD Type 2:

-- SCD Type 2: Restaurant availability
CREATE TABLE dim_restaurant_availability (
    restaurant_id      STRING,
    effective_from     TIMESTAMP,
    effective_to       TIMESTAMP,        -- NULL = current
    is_active          BOOLEAN,
    delivery_radius    FLOAT64,
    min_order          FLOAT64,
    max_prep_time      INT64,
    version            INT64
);

-- Nightly step 1: close the current row for changed restaurants
-- and insert brand-new restaurants
MERGE dim_restaurant_availability T
USING staging.restaurant_daily_state S
ON T.restaurant_id = S.restaurant_id
   AND T.effective_to IS NULL
WHEN MATCHED AND T.is_active != S.is_active
  THEN UPDATE SET effective_to = CURRENT_TIMESTAMP()
WHEN NOT MATCHED
  THEN INSERT (
    restaurant_id, effective_from, effective_to, is_active,
    delivery_radius, min_order, max_prep_time, version
  ) VALUES (
    S.restaurant_id, CURRENT_TIMESTAMP(), NULL, S.is_active,
    S.delivery_radius, S.min_order, S.max_prep_time, 1
  );

-- Nightly step 2: open a new current row for each restaurant whose
-- previous version was just closed (a single MERGE cannot both close
-- the old row and insert its replacement)
INSERT INTO dim_restaurant_availability (
  restaurant_id, effective_from, effective_to, is_active,
  delivery_radius, min_order, max_prep_time, version
)
SELECT
  S.restaurant_id, CURRENT_TIMESTAMP(), NULL, S.is_active,
  S.delivery_radius, S.min_order, S.max_prep_time, T.version + 1
FROM staging.restaurant_daily_state S
JOIN dim_restaurant_availability T
  ON T.restaurant_id = S.restaurant_id
WHERE T.effective_to IS NOT NULL
  AND T.is_active != S.is_active
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY S.restaurant_id ORDER BY T.effective_to DESC
) = 1;
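The same SCD Type 2 mechanics can be sketched in Python, which makes the close-then-insert two-step explicit. Rows live in a list of dicts and the field set is trimmed for illustration:

```python
from datetime import datetime, timezone

def scd2_apply(dim_rows, restaurant_id, new_state):
    """Close the current row if state changed, then open a new version.
    dim_rows: list of dicts where effective_to=None marks the current row."""
    now = datetime.now(timezone.utc)
    current = next(
        (r for r in dim_rows
         if r["restaurant_id"] == restaurant_id and r["effective_to"] is None),
        None,
    )
    if current is None:
        version = 1                        # brand-new restaurant
    elif current["is_active"] == new_state["is_active"]:
        return                             # no change, nothing to do
    else:
        current["effective_to"] = now      # step 1: close the old version
        version = current["version"] + 1
    dim_rows.append({                      # step 2: open the new current row
        "restaurant_id": restaurant_id,
        "effective_from": now,
        "effective_to": None,
        "is_active": new_state["is_active"],
        "version": version,
    })

dim = []
scd2_apply(dim, "rest_789", {"is_active": True})
scd2_apply(dim, "rest_789", {"is_active": False})
print(len(dim), dim[-1]["version"])  # 2 2
```

Walking through this kind of state machine out loud, including the no-change early return, is a reliable way to show you understand SCD2 beyond the SQL syntax.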

Interview Summary: What Actually Matters

After analyzing dozens of candidate experiences, here’s what I’ve found:

The top three things that get people offers:

  1. You write SQL that handles edge cases. Not just the happy path. NULLs, division by zero, timezone conversions, and data quality issues — interviewers listen for “what if the data is messy” thinking.

  2. You understand the pipeline, not just the query. A great candidate explains how their SQL query fits into the bigger picture: where the data comes from, how it’s stored, who consumes it, and what happens when the pipeline breaks at 3 AM.

  3. You connect data to business impact. “P95 delivery time matters because it drives customer retention” is 10x more impressive than “I can calculate percentiles.”

The three biggest reasons candidates fail:

  1. Can’t explain their past projects. If you can’t walk through the architecture of a pipeline you built — data sources, transformations, scheduling, monitoring — they’ll doubt your experience.

  2. No opinion on trade-offs. “Batch vs. streaming? It depends.” — but then they can’t say what it depends on. Have opinions. “We use batch for daily reports because latency isn’t critical, but streaming for real-time ETA because customers are waiting.”

  3. Treats SQL as a test to pass, not a tool. The best candidates treat the SQL round like a real engineering task — they ask clarifying questions about the schema, discuss assumptions, and propose how to optimize for scale.




Final Thoughts

The DoorDash Data Engineer interview is tough — but it’s tough in a fair way. They’re not asking trick questions. They’re giving you realistic problems that mirror what you’d do on day one: optimize a slow query on 500M-row tables, design an ETL pipeline that won’t silently corrupt data, and think critically about how delivery metrics impact the business.

If you can write clean SQL under pressure, build production-grade pipelines, and connect data decisions to real business outcomes — you’re ready.

Good luck. And remember: in food delivery, every minute matters — both in the product and in your interview.


💡 Need interview coaching?

If you feel lost preparing for technical interviews, or want personalized interview guidance and resume review, reach out to Interview Coach Pro for one-on-one coaching.

👉 Contact us for a tailored interview prep plan

Ready to land your next interview?

Get a personalized coaching plan for your target role and company.

Contact us