DoorDash Data Engineer Interview Deep Dive 2026: SQL + Python + Architecture Rounds
Walk through every stage of the DoorDash Data Engineer interview process. Real SQL queries on delivery analytics, Python ETL pipelines, Spark architecture questions, and actionable prep strategies from candidates who passed.
DoorDash Data Engineer Interview Process: A Complete Walkthrough with Real Questions
I spent the last several months deconstructing what it actually takes to pass the DoorDash Data Engineer interview. Not the generic “prepare well” advice you find on every blog — I mean the real SQL queries they hand you, the Python ETL scenarios they throw at you, and the system design problems that separate candidates who get offers from those who don’t.
DoorDash processes over 1 billion delivery events per month in its data warehouse. When you sit in that interview, the interviewers aren’t looking for someone who can SELECT * FROM orders. They’re looking for someone who can think through late-night surge pricing, restaurant preparation time variance, and driver ETA prediction — all as data engineering problems.
Let me walk you through every stage.
The Four Stages
DoorDash’s Data Engineer interview follows a four-stage pipeline:
- Recruiter Screen (30 min) — Background, motivation, salary band
- Technical Screen (45 min) — SQL + Python fundamentals on delivery data
- Onsite Loop (4 × 45 min) — Deep-dive SQL, ETL design, data modeling, Spark/infrastructure
- Hiring Committee + Offer (1–2 weeks) — Asynchronous review
Most candidates get through Stage 1. About 40% make it through Stage 2. Only 25–30% of people who enter the onsite loop receive offers.
Stage 1: Recruiter Screen
This is straightforward. The recruiter will cover:
- Your current role and why you’re interested in DoorDash
- Your experience building data pipelines (batch vs. streaming)
- Your familiarity with cloud platforms (DoorDash uses GCP — BigQuery, Dataflow, Pub/Sub)
- Salary expectations and location preferences
What they’re really evaluating: Can this person communicate clearly? Do they have 3+ years of hands-on data pipeline experience? Are they excited about logistics and food delivery, or just any FAANG-adjacent job?
Prep tip: Research DoorDash’s data engineering blog posts. They’ve published extensively on their data platform evolution. Mentioning specific posts shows genuine interest.
Stage 2: Technical Screen — SQL on Delivery Data
This is where most candidates get cut. You’ll get a live coding session — usually on CoderPad or a shared Google Doc — and they’ll hand you a schema that looks something like this:
orders (order_id, user_id, restaurant_id, dasher_id, created_at, delivered_at, status, order_total)
restaurants (restaurant_id, name, city, lat, lng, category, rating)
dashers (dasher_id, signup_date, city, status, total_deliveries)
delivery_events (event_id, order_id, event_type, event_time, dasher_id, lat, lng)
Question 1: Average Delivery Time by Restaurant Category
This is your first filter question. They want to see if you can write clean, correct SQL under pressure.
Prompt: “Calculate the average delivery time (in minutes) for each restaurant category over the last 30 days. Only include restaurants with at least 50 completed deliveries in that period.”
Here’s how I would write this — and more importantly, how I’d explain my thinking out loud:
WITH recent_orders AS (
SELECT
o.order_id,
o.restaurant_id,
o.delivered_at,
o.created_at,
TIMESTAMP_DIFF(o.delivered_at, o.created_at, MINUTE) AS delivery_minutes
FROM orders o
WHERE o.status = 'delivered'
AND o.delivered_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
),
restaurant_stats AS (
SELECT
r.category,
r.restaurant_id,
COUNT(ro.order_id) AS order_count,
AVG(ro.delivery_minutes) AS avg_delivery_minutes
FROM recent_orders ro
JOIN restaurants r ON ro.restaurant_id = r.restaurant_id
GROUP BY r.category, r.restaurant_id
HAVING COUNT(ro.order_id) >= 50
)
SELECT
category,
ROUND(AVG(avg_delivery_minutes), 2) AS avg_delivery_time_min,
COUNT(restaurant_id) AS restaurant_count,
SUM(order_count) AS total_orders
FROM restaurant_stats
GROUP BY category
ORDER BY avg_delivery_time_min ASC;
What they’re testing here:
- Do you know TIMESTAMP_DIFF vs. UNIX_SECONDS subtraction?
- Do you put the HAVING clause in the right CTE (after the GROUP BY at the restaurant level, not globally)?
- Can you handle the two-level aggregation (restaurant → category) correctly?
- Do you filter for status = 'delivered' explicitly?
Common mistakes I see candidates make:
- Putting HAVING COUNT(*) >= 50 in the outer query instead of the inner one — this changes the meaning entirely (see the contrast below)
- Forgetting that CURRENT_DATE() vs. CURRENT_TIMESTAMP() matters in BigQuery
- Not rounding the final output
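To make the first mistake concrete, here is the contrast using the restaurant_stats CTE from the query above — the threshold has to be applied per restaurant, not after the category roll-up:
-- Right: per-restaurant threshold, inside restaurant_stats
GROUP BY r.category, r.restaurant_id
HAVING COUNT(ro.order_id) >= 50

-- Wrong: the same threshold in the outer query filters whole categories,
-- so a category made up of many tiny restaurants still slips through
SELECT category, ROUND(AVG(avg_delivery_minutes), 2) AS avg_delivery_time_min
FROM restaurant_stats
GROUP BY category
HAVING SUM(order_count) >= 50;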
Follow-up: P95 Delivery Time
The interviewer will almost always push you deeper:
“Now show me the P95 delivery time instead of average. What’s different?”
-- APPROX_QUANTILES gives an approximate P95 as a true aggregate; the exact
-- PERCENTILE_CONT is only available as a window function in BigQuery, so it
-- would need a windowed subquery plus DISTINCT instead of a GROUP BY.
SELECT
  r.category,
  APPROX_QUANTILES(
    TIMESTAMP_DIFF(o.delivered_at, o.created_at, SECOND), 100
  )[OFFSET(95)] AS p95_delivery_seconds
FROM orders o
JOIN restaurants r ON o.restaurant_id = r.restaurant_id
WHERE o.status = 'delivered'
  AND o.delivered_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY r.category
ORDER BY p95_delivery_seconds DESC;
Key insight: DoorDash cares about tails, not averages. A 25-minute average with a P95 of 75 minutes is a product problem. They want engineers who understand that P95/P99 metrics drive SLA dashboards.
Question 2: Dasher Performance Ranking
Prompt: “Rank dashers by their on-time delivery rate within each city for the past week. An on-time delivery is one where delivered_at is within the estimated delivery window (estimated_delivery_time column).”
WITH weekly_deliveries AS (
SELECT
d.city,
o.dasher_id,
COUNT(*) AS total_deliveries,
SUM(CASE WHEN o.delivered_at <= o.created_at + INTERVAL o.estimated_delivery_time MINUTE THEN 1 ELSE 0 END)
AS on_time_deliveries,
ROUND(
SUM(CASE WHEN o.delivered_at <= o.created_at + INTERVAL o.estimated_delivery_time MINUTE THEN 1 ELSE 0 END) * 100.0
/ COUNT(*),
2
) AS on_time_rate
FROM orders o
JOIN dashers d ON o.dasher_id = d.dasher_id
WHERE o.status = 'delivered'
AND o.delivered_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY d.city, o.dasher_id
HAVING COUNT(*) >= 10
)
SELECT
city,
dasher_id,
total_deliveries,
on_time_deliveries,
on_time_rate,
RANK() OVER (PARTITION BY city ORDER BY on_time_rate DESC, total_deliveries DESC) AS city_rank
FROM weekly_deliveries
ORDER BY city, city_rank;
What’s subtle here: The HAVING COUNT(*) >= 10 threshold. DoorDash won’t rank a dasher who only made 3 deliveries. You need enough sample size for the metric to be meaningful. Interviewers listen for candidates who volunteer this reasoning.
Follow-up they’ll ask: “How would this query perform on a table with 500 million rows? What indexes or partitioning would you add?”
Good answers mention:
- Partition by delivered_at (time-based partitioning in BigQuery)
- Cluster by dasher_id and city
- Pre-aggregate into a daily rolling table via a scheduled pipeline (a sketch follows)
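A rough sketch of that pre-aggregated rollup in BigQuery DDL — the table and dataset names here are illustrative, not part of the prompt:
-- (The raw orders table itself would be partitioned by DATE(delivered_at)
--  and clustered by dasher_id, city.)
-- Daily rollup maintained by a scheduled query; the weekly ranking then scans
-- ~7 small partitions instead of 500M raw rows.
CREATE TABLE analytics.dasher_daily_stats
PARTITION BY delivery_date      -- time-based partition pruning
CLUSTER BY city, dasher_id      -- co-locates the rows the ranking query touches
AS
SELECT
  DATE(o.delivered_at) AS delivery_date,
  d.city,
  o.dasher_id,
  COUNT(*) AS deliveries,
  COUNTIF(o.delivered_at <= o.created_at + INTERVAL o.estimated_delivery_time MINUTE) AS on_time_deliveries
FROM orders o
JOIN dashers d ON o.dasher_id = d.dasher_id
WHERE o.status = 'delivered'
GROUP BY delivery_date, d.city, o.dasher_id;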
Stage 3: Onsite Loop — Four Deep Dives
The onsite consists of four 45-minute rounds. Let me break down each one.
Round 1: Advanced SQL & Data Modeling
You’ll work with a more complex schema that includes denormalized tables, slowly changing dimensions, and event tables.
The question: DoorDash needs a delivery analytics dashboard. The product team wants to track:
- Order volume trends (hourly, daily, weekly)
- Average order value by restaurant and delivery zone
- Delivery time percentiles (P50, P90, P95, P99)
- Cancellation rates at each stage (user-cancelled, restaurant-cancelled, dasher-cancelled)
They’ll ask you to design the star schema for this dashboard and write the SQL to populate it.
Here’s how I structured my answer:
+------------------+
| fact_orders |
|------------------|
| order_id (FK) |
| date_id (FK) |
| restaurant_id(FK)|
| zone_id (FK) |
| dasher_id (FK) |
| created_at |
| delivered_at |
| delivery_minutes |
| order_total |
| order_tax |
| delivery_fee |
| is_cancelled |
| cancel_stage |
| is_on_time |
+------------------+
|
+----------------+----------------+
| | |
+------+------+ +------+------+ +------+------+
| dim_date | | dim_restaurant| | dim_zone |
|------------| |--------------| |------------|
| date_id | | restaurant_id| | zone_id |
| date | | name | | city |
| day_of_week| | category | | zone_name |
| is_weekend | | rating | | avg_time |
+------------+ | avg_prep_min | +------------+
+--------------+
Then you write the transformation SQL to populate fact_orders from the raw source tables:
CREATE OR REPLACE TABLE analytics.fact_orders
PARTITION BY date_id
CLUSTER BY restaurant_id, zone_id
AS
SELECT
  DATE(o.created_at) AS date_id,  -- keep as DATE (not STRING) so PARTITION BY date_id is valid
o.restaurant_id,
COALESCE(r.zone_id, 'UNKNOWN') AS zone_id,
o.order_id,
o.created_at,
o.delivered_at,
TIMESTAMP_DIFF(o.delivered_at, o.created_at, MINUTE) AS delivery_minutes,
o.order_total,
o.order_tax,
o.delivery_fee,
CASE WHEN o.status IN ('cancelled') THEN TRUE ELSE FALSE END AS is_cancelled,
o.cancel_stage,
CASE
WHEN o.delivered_at <= o.created_at + INTERVAL o.estimated_delivery_time MINUTE
THEN TRUE ELSE FALSE
END AS is_on_time
FROM raw.orders o
LEFT JOIN raw.restaurants r ON o.restaurant_id = r.restaurant_id
WHERE o.created_at >= '2026-01-01';
Key points they evaluate:
- Did you denormalize the right attributes? (e.g., delivery_minutes pre-computed in the fact table)
- Is the partitioning strategy appropriate? (daily partition by date_id)
- Are you handling NULL zones gracefully? (COALESCE)
- Did you flag the boolean columns explicitly? (is_cancelled, is_on_time)
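Once fact_orders exists, the dashboard metrics listed earlier reduce to simple aggregates. A sketch of the cancellation-rate and percentile panels — the cancel_stage values are illustrative, not confirmed by the prompt:
SELECT
  date_id,
  ROUND(COUNTIF(is_cancelled) / COUNT(*), 4) AS cancel_rate,
  ROUND(COUNTIF(is_cancelled AND cancel_stage = 'user') / COUNT(*), 4) AS user_cancel_rate,
  ROUND(COUNTIF(is_cancelled AND cancel_stage = 'restaurant') / COUNT(*), 4) AS restaurant_cancel_rate,
  ROUND(COUNTIF(is_cancelled AND cancel_stage = 'dasher') / COUNT(*), 4) AS dasher_cancel_rate,
  APPROX_QUANTILES(delivery_minutes, 100)[OFFSET(50)] AS p50_delivery_min,
  APPROX_QUANTILES(delivery_minutes, 100)[OFFSET(95)] AS p95_delivery_min
FROM analytics.fact_orders
WHERE date_id >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY date_id
ORDER BY date_id;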
Round 2: Python ETL Pipeline
This is the round that separates mid-level from senior candidates. You’ll be asked to write a production-grade ETL pipeline — not just a script, but something with error handling, idempotency, and monitoring hooks.
The prompt:
“Build a daily ETL pipeline that ingests delivery events from Pub/Sub, aggregates them by restaurant and hour, and writes to BigQuery. The pipeline must be idempotent — running it twice for the same day should not duplicate data.”
Here’s my full solution:
"""
DoorDash Delivery Analytics — Daily ETL Pipeline
Ingests raw delivery events, aggregates by restaurant + hour,
and upserts into BigQuery analytics layer.
Features:
- Idempotent via partition_date watermark
- Dead-letter queue for malformed records
- Structured logging with pipeline_run_id
"""
import argparse
import json
import logging
import uuid
from datetime import datetime, timedelta, timezone
from concurrent import futures
from typing import Dict, List
from google.cloud import bigquery, pubsub_v1
from google.api_core.exceptions import GoogleAPIError
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s — %(message)s",
)
logger = logging.getLogger("delivery_etl")
BIGQUERY_PROJECT = "doordash-analytics"
BIGQUERY_DATASET = "staging"
BIGQUERY_TABLE = "delivery_hourly_agg"
PUBSUB_SUBSCRIPTION = "projects/doordash-data/subscriptions/delivery-events-sub"
DEAD_LETTER_TOPIC = "projects/doordash-data/topics/delivery-events-dlq"
BATCH_SIZE = 500
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Delivery hourly aggregation ETL")
parser.add_argument(
"--partition-date",
type=str,
required=True,
help="Partition date to process (YYYY-MM-DD)",
)
parser.add_argument(
"--dry-run",
action="store_true",
help="Validate records without writing to BigQuery",
)
return parser.parse_args()
def fetch_events(partition_date: str, max_messages: int = 10_000) -> List[Dict]:
    """Pull delivery events from the Pub/Sub subscription for a given date."""
    subscriber = pubsub_v1.SubscriberClient()
    received: List[Dict] = []
    deadline = datetime.now(timezone.utc) + timedelta(minutes=10)

    def callback(message: pubsub_v1.subscriber.message.Message) -> None:
        if len(received) >= max_messages:
            message.nack()  # leave it for a later run once this batch is full
            return
        try:
            data = json.loads(message.data.decode("utf-8"))
            # event_time already carries a UTC offset, e.g. "2026-05-17T14:32:00+00:00"
            event_time = datetime.fromisoformat(data["event_time"])
            if event_time.date().isoformat() == partition_date:
                received.append(data)
            else:
                # Route events outside the target partition to the dead-letter topic
                publish_to_dlq(message.data)
            message.ack()
        except (json.JSONDecodeError, ValueError, KeyError) as e:
            logger.warning("Malformed event, routing to DLQ: %s", e)
            publish_to_dlq(message.data)
            message.ack()  # ack so the poison message is not redelivered forever

    streaming_pull = subscriber.subscribe(PUBSUB_SUBSCRIPTION, callback=callback)
    try:
        # Poll until the batch is full, the stream shuts down, or the deadline passes
        while len(received) < max_messages and datetime.now(timezone.utc) < deadline:
            try:
                streaming_pull.result(timeout=5)
                break  # stream shut down on its own
            except futures.TimeoutError:
                continue
    finally:
        streaming_pull.cancel()
        streaming_pull.result()

    logger.info("Fetched %d events for %s", len(received), partition_date)
    return received
def publish_to_dlq(raw_data: bytes) -> None:
"""Send malformed events to dead-letter topic for later inspection."""
publisher = pubsub_v1.PublisherClient()
publisher.publish(DEAD_LETTER_TOPIC, raw_data)
def transform_events(events: List[Dict]) -> List[Dict]:
"""
Aggregate raw delivery events into hourly restaurant-level metrics.
Input event shape:
{
"event_id": "evt_123",
"order_id": "ord_456",
"restaurant_id": "rest_789",
"event_type": "delivered",
"event_time": "2026-05-17T14:32:00+00:00",
"delivery_minutes": 23,
"order_total": 34.50,
"is_on_time": true
}
Output: one row per (restaurant_id, hour)
"""
agg: Dict[str, Dict] = {}
for event in events:
if event.get("event_type") != "delivered":
continue
        event_time = datetime.fromisoformat(event["event_time"])  # timestamps already include the UTC offset
hour_key = event_time.strftime("%Y-%m-%dT%H:00:00+00:00")
key = f"{event['restaurant_id']}|{hour_key}"
if key not in agg:
agg[key] = {
"restaurant_id": event["restaurant_id"],
"partition_date": event_time.date().isoformat(),
"hour": hour_key,
"order_count": 0,
"total_revenue": 0.0,
"delivery_times": [],
"on_time_count": 0,
}
bucket = agg[key]
bucket["order_count"] += 1
bucket["total_revenue"] += event.get("order_total", 0.0)
bucket["delivery_times"].append(event.get("delivery_minutes"))
if event.get("is_on_time"):
bucket["on_time_count"] += 1
# Derive summary stats per bucket
results = []
for bucket in agg.values():
times = bucket["delivery_times"]
times_sorted = sorted(times)
n = len(times_sorted)
bucket["avg_delivery_min"] = round(sum(times) / n, 2) if n else None
bucket["p50_delivery_min"] = times_sorted[n // 2] if n else None
bucket["p95_delivery_min"] = times_sorted[int(n * 0.95)] if n else None
bucket["total_revenue"] = round(bucket["total_revenue"], 2)
bucket["on_time_rate"] = round(
bucket["on_time_count"] / bucket["order_count"] * 100, 2
)
# Remove intermediate array to keep rows lean
bucket.pop("delivery_times")
results.append(bucket)
logger.info("Transformed %d events into %d aggregated rows", len(events), len(results))
return results
def load_to_bigquery(rows: List[Dict], partition_date: str, dry_run: bool = False) -> None:
    """
    Upsert aggregated rows into BigQuery.

    Idempotency: rows land in a staging table first, then a MERGE on the
    (partition_date, restaurant_id, hour) composite key upserts them into the
    target, so re-runs for the same partition are no-ops.
    """
    if dry_run:
        logger.info("Dry run — skipping BigQuery write for %d rows", len(rows))
        return

    client = bigquery.Client(project=BIGQUERY_PROJECT)
    target_table = f"{BIGQUERY_PROJECT}.{BIGQUERY_DATASET}.{BIGQUERY_TABLE}"
    staging_table = f"{target_table}_staging_{partition_date.replace('-', '')}"

    # Schema definition
    schema = [
        bigquery.SchemaField("restaurant_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("partition_date", "DATE", mode="REQUIRED"),
        bigquery.SchemaField("hour", "TIMESTAMP", mode="REQUIRED"),
        bigquery.SchemaField("order_count", "INTEGER", mode="NULLABLE"),
        bigquery.SchemaField("total_revenue", "FLOAT64", mode="NULLABLE"),
        bigquery.SchemaField("avg_delivery_min", "FLOAT64", mode="NULLABLE"),
        bigquery.SchemaField("p50_delivery_min", "FLOAT64", mode="NULLABLE"),
        bigquery.SchemaField("p95_delivery_min", "FLOAT64", mode="NULLABLE"),
        bigquery.SchemaField("on_time_rate", "FLOAT64", mode="NULLABLE"),
    ]

    # MERGE for the idempotent upsert on the composite key
    merge_sql = f"""
        MERGE `{target_table}` T
        USING `{staging_table}` S
        ON  T.partition_date = S.partition_date
        AND T.restaurant_id = S.restaurant_id
        AND T.hour = S.hour
        WHEN MATCHED THEN UPDATE SET
            order_count      = S.order_count,
            total_revenue    = S.total_revenue,
            avg_delivery_min = S.avg_delivery_min,
            p50_delivery_min = S.p50_delivery_min,
            p95_delivery_min = S.p95_delivery_min,
            on_time_rate     = S.on_time_rate
        WHEN NOT MATCHED THEN INSERT (
            restaurant_id, partition_date, hour, order_count, total_revenue,
            avg_delivery_min, p50_delivery_min, p95_delivery_min, on_time_rate
        ) VALUES (
            S.restaurant_id, S.partition_date, S.hour, S.order_count, S.total_revenue,
            S.avg_delivery_min, S.p50_delivery_min, S.p95_delivery_min, S.on_time_rate
        )
    """

    try:
        # Stage the batch (replaced on every run), then merge into the target
        job_config = bigquery.LoadJobConfig(
            schema=schema,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        )
        client.load_table_from_json(rows, staging_table, job_config=job_config).result()
        client.query(merge_sql).result()
        client.delete_table(staging_table, not_found_ok=True)
    except GoogleAPIError:
        logger.exception("BigQuery load/merge failed for partition %s", partition_date)
        raise

    logger.info("Successfully loaded %d rows for %s", len(rows), partition_date)
def main() -> None:
"""Main ETL entry point."""
args = parse_args()
pipeline_run_id = str(uuid.uuid4())
logger.info("Pipeline run %s started for %s", pipeline_run_id, args.partition_date)
try:
# Extract
events = fetch_events(args.partition_date)
if not events:
logger.warning("No events found for %s — exiting", args.partition_date)
return
# Transform
aggregated = transform_events(events)
# Load
load_to_bigquery(aggregated, args.partition_date, dry_run=args.dry_run)
logger.info("Pipeline run %s completed successfully", pipeline_run_id)
except Exception as e:
logger.exception("Pipeline run %s FAILED: %s", pipeline_run_id, e)
raise
if __name__ == "__main__":
main()
What interviewers are evaluating in this code:
- Idempotency strategy — Can you explain why re-running for the same --partition-date won’t duplicate data? (The MERGE/upsert on the composite key partition_date + restaurant_id + hour.)
- Error handling — Dead-letter queue for malformed events. This is non-negotiable in production ETL.
- Logging — Structured logging with pipeline_run_id for tracing across systems.
- Separation of concerns — fetch_events, transform_events, load_to_bigquery are clean functions.
- Configurability — --dry-run flag for safe testing.
- Memory awareness — Batching with max_messages and BATCH_SIZE limits.
Follow-up questions they’ll ask:
“How would you make this pipeline run in real time instead of batch?”
Answer: Switch from the Pub/Sub subscription pull to a Dataflow streaming job using the Apache Beam SDK. Replace the hourly aggregation with a sliding window:
# Apache Beam / Dataflow streaming equivalent
| "WindowInto" >> beam.WindowInto(
    beam.window.SlidingWindows(size=3600, period=300)  # 1-hour windows emitted every 5 minutes
)
| "Aggregate" >> beam.CombinePerKey(delivery_stats_combiner)
| "WriteToBigQuery" >> beam.io.WriteToBigQuery(
table=f"{BIGQUERY_DATASET}.{BIGQUERY_TABLE}",
schema=SCHEMA,
method="STREAMING_INSERTS"
)
“What happens if the pipeline fails halfway through a partition?”
Answer: The --partition-date flag + MERGE semantics mean you simply re-run with the same date. The pipeline pulls all events for that date, re-aggregates, and the MERGE upserts replace the partial data. No orphan records.
Round 3: Distributed Systems & Spark
This round focuses on large-scale data processing. DoorDash uses Spark (via Dataproc on GCP) for their heaviest batch workloads.
The prompt:
“We have a 2 TB daily events table. You need to compute per-driver delivery metrics across a 90-day window. How do you design this in Spark?”
Here’s how I’d approach it:
+------------------+
| Raw Events |
| (2 TB/day) |
| GCS / HDFS |
+--------+---------+
|
+--------v---------+
| Spark Read |
| (partitioned |
| by date) |
+--------+---------+
|
+--------v---------+
| Filter 90-day |
| window |
+--------+---------+
|
+--------v---------+
| Join with |
| dim_dashers |
| (broadcast) |
+--------+---------+
|
+--------v---------+
| Aggregate by |
| dasher_id |
+--------+---------+
|
+--------v---------+
| Write to |
| BigQuery |
| (daily partition|
| table) |
+------------------+
And the actual Spark code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
col, avg, count, round, when,
percentile_approx, broadcast,
)
spark = SparkSession.builder \
.appName("dasher_90day_metrics") \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
.config("spark.sql.shuffle.partitions", "200") \
.getOrCreate()
# Read partitioned events
events = spark.read \
.option("mergeSchema", "true") \
.parquet("gs://doordash-data-lake/events/delivery/") \
.filter(col("event_time") >= "2026-06-20") \
.filter(col("event_time") <= "2026-05-17")
# Broadcast the small dim table (avoids shuffle)
dashers = broadcast(
spark.read.table("analytics.dim_dashers")
)
# Compute metrics
dasher_metrics = events \
.join(dashers, "dasher_id", "inner") \
.groupBy("dasher_id", "city", "dasher_since_days") \
.agg(
count("order_id").alias("total_deliveries"),
round(avg("delivery_minutes"), 2).alias("avg_delivery_min"),
round(
percentile_approx("delivery_minutes", 0.95), 2
).alias("p95_delivery_min"),
round(
percentile_approx("delivery_minutes", 0.99), 2
).alias("p99_delivery_min"),
round(
avg(when(col("is_on_time"), 1).otherwise(0)) * 100, 2
).alias("on_time_rate_pct"),
round(avg("order_total"), 2).alias("avg_order_value"),
) \
.orderBy(col("on_time_rate_pct").desc_nulls_last())
# Write results
dasher_metrics.write \
.mode("overwrite") \
.partitionBy("city") \
.option("compression", "snappy") \
.parquet("gs://doordash-analytics/metrics/dasher_90day/")
spark.stop()
Key concepts they want you to mention:
- AQE (Adaptive Query Execution) — Enables dynamic optimization at runtime. spark.sql.adaptive.enabled = true is essential for large joins.
- Broadcast joins — dim_dashers is small enough to broadcast, eliminating a shuffle.
- Partition pruning — Reading only the 90-day window from a date-partitioned parquet table.
- Shuffle partition tuning — Default is 200; for 2 TB/day × 90 days, you might need 400–800.
- Snappy compression — Good balance of speed and size for parquet output.
Follow-up: “The job is OOM-ing on the executor. How do you debug?”
- Check the Spark UI for skewed partitions — one dasher with millions of deliveries could blow up a single executor
- Use repartition or salting on the join key if skew is confirmed
- Increase executor memory (--executor-memory 8g → 16g)
- Check if AQE is actually coalescing partitions post-shuffle
- Look for nested data types (arrays/maps) inflating task memory
Round 4: Data Modeling & Domain Knowledge
This is the “culture fit + domain” round. They’ll dig into how you think about the business of food delivery through a data lens.
Typical questions:
“How would you design a dataset to measure ‘delivery quality’ across the platform?”
My answer framework:
Delivery Quality = f(Accuracy, Speed, Condition)
Metric Source Granularity
───────────────────────── ───────────────────── ───────────────
On-time rate orders.delivered_at Per order
ETA accuracy (MAE) delivery_events Per order
Food temperature (sensor) iot_devices Per order
Order accuracy (items match) user_feedback_events Per order
Customer rating reviews Per order
Dasher rating reviews Per dasher
Repeat order rate user_orders Per user/week
The key insight: delivery quality isn’t one metric — it’s a composite index. DoorDash’s actual internal score is called the “Delivery Experience Score” and it weights multiple signals.
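A hedged sketch of how such a composite could be assembled — the order_quality_metrics table, the column names, and the weights are all illustrative assumptions, not DoorDash’s actual formula:
-- Hypothetical inputs and weights; the real internal score differs
SELECT
  delivery_zone,
  ROUND(
    0.35 * AVG(CAST(is_on_time AS INT64))
    + 0.25 * AVG(1 - LEAST(eta_abs_error_min / 30, 1))   -- ETA accuracy, capped at 30 min
    + 0.25 * AVG(CAST(items_correct AS INT64))
    + 0.15 * AVG(customer_rating / 5)
  , 3) AS delivery_quality_index
FROM analytics.order_quality_metrics
WHERE delivered_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY delivery_zone;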
“A product manager says delivery times increased 15% in San Francisco last week. How do you investigate?”
This is a root cause analysis question. My framework:
- Validate the signal — Is it the full dataset or a sampled dashboard? Check the raw orders table directly.
- Segment — Break down by hour of day, restaurant category, delivery zone, and new vs. returning dashers (see the sketch below).
- Check for data issues first — Did a tracking pipeline break? Is delivered_at being set incorrectly?
- Check external factors — Weather data, SF transit events, holiday periods, restaurant closures.
- Isolate the dimension — If it’s only certain zones, it’s likely dasher supply. If it’s all zones, it might be restaurant prep time.
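For the segmentation step, the first query I would run is a week-over-week comparison by zone. The zone_id column on restaurants is assumed here, as it was in the fact-table load earlier:
WITH sf_orders AS (
  SELECT
    r.zone_id,
    DATE(o.delivered_at) AS delivery_date,
    TIMESTAMP_DIFF(o.delivered_at, o.created_at, MINUTE) AS delivery_minutes
  FROM orders o
  JOIN restaurants r ON o.restaurant_id = r.restaurant_id
  WHERE o.status = 'delivered'
    AND r.city = 'San Francisco'
    AND o.delivered_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
)
SELECT
  zone_id,
  AVG(IF(delivery_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY), delivery_minutes, NULL)) AS this_week_avg,
  AVG(IF(delivery_date <  DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY), delivery_minutes, NULL)) AS last_week_avg
FROM sf_orders
GROUP BY zone_id
ORDER BY this_week_avg DESC;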
“How would you model restaurant availability as a slowly changing dimension?”
This tests your knowledge of SCD Type 2:
-- SCD Type 2: Restaurant availability
CREATE TABLE dim_restaurant_availability (
  restaurant_id   STRING,
  effective_from  TIMESTAMP,
  effective_to    TIMESTAMP,  -- NULL = current
  is_active       BOOLEAN,
  delivery_radius FLOAT64,
  min_order       FLOAT64,
  max_prep_time   INT64,
  version         INT64
);

-- Step 1 (nightly): close out current rows whose state changed,
-- and insert brand-new restaurants
MERGE dim_restaurant_availability T
USING staging.restaurant_daily_state S
ON T.restaurant_id = S.restaurant_id
   AND T.effective_to IS NULL
WHEN MATCHED AND T.is_active != S.is_active
  THEN UPDATE SET
    effective_to = CURRENT_TIMESTAMP()
WHEN NOT MATCHED
  THEN INSERT (
    restaurant_id, effective_from, is_active,
    delivery_radius, min_order, max_prep_time, version
  ) VALUES (
    S.restaurant_id, CURRENT_TIMESTAMP(), S.is_active,
    S.delivery_radius, S.min_order, S.max_prep_time, 1
  );

-- Step 2: open a new current row (version + 1) for restaurants whose previous
-- row was just closed — a single MERGE can’t both update and insert for the
-- same matched key
INSERT INTO dim_restaurant_availability (
  restaurant_id, effective_from, is_active,
  delivery_radius, min_order, max_prep_time, version
)
SELECT
  S.restaurant_id, CURRENT_TIMESTAMP(), S.is_active,
  S.delivery_radius, S.min_order, S.max_prep_time,
  MAX(T.version) + 1
FROM staging.restaurant_daily_state S
JOIN dim_restaurant_availability T USING (restaurant_id)
GROUP BY S.restaurant_id, S.is_active, S.delivery_radius, S.min_order, S.max_prep_time
HAVING LOGICAL_AND(T.effective_to IS NOT NULL);
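The payoff of Type 2 is point-in-time lookups. To resolve what a restaurant’s availability looked like when an order was placed, join on the effective window — a sketch using the tables above:
-- Pick the dimension row whose validity window covers the order timestamp
SELECT
  o.order_id,
  a.is_active,
  a.delivery_radius,
  a.max_prep_time
FROM orders o
JOIN dim_restaurant_availability a
  ON a.restaurant_id = o.restaurant_id
 AND a.effective_from <= o.created_at
 AND (a.effective_to IS NULL OR o.created_at < a.effective_to)
WHERE DATE(o.created_at) = '2026-05-17';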
Interview Summary: What Actually Matters
After analyzing dozens of candidate experiences, here’s what I’ve found:
The top three things that get people offers:
- You write SQL that handles edge cases. Not just the happy path. NULLs, division by zero, timezone conversions, and data quality issues — interviewers listen for “what if the data is messy” thinking.
- You understand the pipeline, not just the query. A great candidate explains how their SQL query fits into the bigger picture: where the data comes from, how it’s stored, who consumes it, and what happens when the pipeline breaks at 3 AM.
- You connect data to business impact. “P95 delivery time matters because it drives customer retention” is 10x more impressive than “I can calculate percentiles.”
The three biggest reasons candidates fail:
- Can’t explain their past projects. If you can’t walk through the architecture of a pipeline you built — data sources, transformations, scheduling, monitoring — they’ll doubt your experience.
- No opinion on trade-offs. “Batch vs. streaming? It depends.” — but then they can’t say what it depends on. Have opinions. “We use batch for daily reports because latency isn’t critical, but streaming for real-time ETA because customers are waiting.”
- Treats SQL as a test to pass, not a tool. The best candidates treat the SQL round like a real engineering task — they ask clarifying questions about the schema, discuss assumptions, and propose how to optimize for scale.
Recommended Reading & Resources
DoorDash Engineering Blog
- Building a Real-Time Data Platform at DoorDash — Their engineering blog covers data infrastructure extensively
- DoorDash’s Approach to Data Quality — Specifically their data quality framework
SQL Practice
- LeetCode Database Problems — Focus on medium/hard problems with window functions
- BigQuery Documentation — Especially the SQL function reference and partitioning guide
- SQLZoo — Good for quick warm-up before interviews
Spark & Distributed Systems
- Spark: The Definitive Guide — O’Reilly’s comprehensive reference
- Apache Spark Official Documentation — Especially the SQL programming guide
- Data Engineering with Spark — Databricks’ practical tutorials
Data Modeling
- The Data Warehouse Toolkit by Kimball — Still the gold standard for star schema design
- Fundamentals of Data Engineering by Joe Reis and Matt Housley — Modern take on pipeline design
Python ETL
- Google Cloud Dataflow Documentation — Streaming and batch pipeline patterns
- Apache Beam Programming Guide — SDK-agnostic pipeline design
Final Thoughts
The DoorDash Data Engineer interview is tough — but it’s tough in a fair way. They’re not asking trick questions. They’re giving you realistic problems that mirror what you’d do on day one: optimize a slow query on 500M-row tables, design an ETL pipeline that won’t silently corrupt data, and think critically about how delivery metrics impact the business.
If you can write clean SQL under pressure, build production-grade pipelines, and connect data decisions to real business outcomes — you’re ready.
Good luck. And remember: in food delivery, every minute matters — both in the product and in your interview.