atlassian · data-engineer · data-engineering · dev-tools · sql

Atlassian Data Engineer Interview Walkthrough 2026: A Product Analytics Deep Dive

Walkthrough of a Data Engineer interview at Atlassian — SQL for feature adoption analysis, Python analytics pipelines for Jira and Confluence telemetry, and real architectural diagrams from product analytics at scale.

Sam · 15 min read

I sat down for the Data Engineer interview at Atlassian with a mix of excitement and nerves. I’d been using Jira and Confluence for years in my own teams, but interviewing for the product analytics side meant thinking about these tools completely differently — not as a user, but as an engineer building the data infrastructure that tells Atlassian how millions of teams use their products every day.

The interview was structured around Atlassian’s actual data stack and real product analytics challenges. Below I’ll walk through each phase, the problems I was given, my solutions, and what the interviewers were really looking for.


Phase 1: Introduction and Product Sense

The interviewer opened with a straightforward question:

“You use Jira every day. If you could measure one thing about how teams use Jira that would tell you whether a team is healthy, what would it be and why?”

I paused for a moment and thought about the product from both a user and an engineering perspective. My answer:

Cycle time distribution and resolution rate. Specifically, I’d look at the median and P90 of time_to_resolve broken down by project. A healthy team has a tight distribution — issues move through statuses predictably. A bloated P90 means work is getting stuck, and that’s a signal that the team’s process (or their Jira workflow) needs attention.
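To make that concrete, here is roughly the query I had in mind, assuming a hypothetical issues table with project_key, created_at, and resolved_at columns (Postgres-flavored syntax; this table is not part of the schema I was given later):

SELECT
    project_key,
    PERCENTILE_CONT(0.5) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (resolved_at - created_at)) / 86400.0
    ) AS median_days_to_resolve,
    PERCENTILE_CONT(0.9) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (resolved_at - created_at)) / 86400.0
    ) AS p90_days_to_resolve,
    COUNT(*) FILTER (WHERE resolved_at IS NOT NULL) * 1.0 / COUNT(*) AS resolution_rate
FROM issues
WHERE created_at >= NOW() - INTERVAL '90 days'
GROUP BY project_key
ORDER BY p90_days_to_resolve DESC;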

The interviewer nodded and pushed deeper:

“Now imagine you’re building the dashboard that surfaces this. What’s in your data model?”

This is where the real interview started.


Phase 2: SQL — Feature Adoption Analysis

I was given a simplified schema representing Atlassian’s product analytics warehouse and asked to build a query that measures feature adoption for a newly released Jira feature — let’s say “Advanced Roadmaps” (formerly Portfolio for Jira).

Here’s the schema I was shown:

-- events: raw product event stream
events (
  event_id        BIGINT,
  event_type      VARCHAR,        -- 'feature_used', 'page_view', 'button_click'
  feature_name    VARCHAR,        -- 'advanced_roadmaps', 'sprint_planning', etc.
  tenant_id       BIGINT,         -- Jira Cloud tenant (customer)
  user_id         BIGINT,
  project_key     VARCHAR,        -- e.g., 'PROJ', 'CRM'
  timestamp       TIMESTAMP,
  properties      JSON            -- additional event metadata
)

-- tenants: customer-level metadata
tenants (
  tenant_id       BIGINT,
  product         VARCHAR,        -- 'jira-software', 'confluence', 'jira-service-mgmt'
  plan            VARCHAR,        -- 'free', 'standard', 'premium', 'enterprise'
  active_users    INT,
  created_at      TIMESTAMP
)

-- projects: Jira projects within tenants
projects (
  project_key     VARCHAR,
  tenant_id       BIGINT,
  created_at      TIMESTAMP,
  is_active       BOOLEAN
)

The Question

“Write a query that calculates the adoption rate of ‘Advanced Roadmaps’ among Premium and Enterprise tenants, segmented by tenant size (number of active users), for the last 30 days. Adoption means at least 5 distinct users interacted with the feature.”

My Solution

WITH feature_events AS (
    -- Filter to the target feature and time window
    SELECT
        e.tenant_id,
        e.user_id,
        -- Count interactions per user in the window
        COUNT(*) AS interaction_count
    FROM events e
    WHERE
        e.event_type = 'feature_used'
        AND e.feature_name = 'advanced_roadmaps'
        AND e.timestamp >= NOW() - INTERVAL '30 days'
    GROUP BY e.tenant_id, e.user_id
),

adopting_users AS (
    -- A user is "adopted" if they used the feature at least once in the window
    SELECT DISTINCT
        tenant_id,
        user_id
    FROM feature_events
),

tenant_adoption AS (
    -- Count adopting users per tenant
    SELECT
        a.tenant_id,
        COUNT(DISTINCT a.user_id) AS adopting_users
    FROM adopting_users a
    GROUP BY a.tenant_id
    HAVING COUNT(DISTINCT a.user_id) >= 5   -- adoption threshold
),

tenant_cohorts AS (
    -- Join tenant metadata and bucket by size
    SELECT
        t.tenant_id,
        t.plan,
        t.active_users,
        CASE
            WHEN t.active_users <= 10        THEN '0-10'
            WHEN t.active_users <= 50        THEN '11-50'
            WHEN t.active_users <= 200       THEN '51-200'
            WHEN t.active_users <= 1000      THEN '201-1000'
            ELSE '1000+'
        END AS size_bucket
    FROM tenants t
    WHERE t.plan IN ('premium', 'enterprise')
),

adoption_rates AS (
    SELECT
        tc.size_bucket,
        tc.plan,
        COUNT(DISTINCT tc.tenant_id) AS total_tenants,
        COUNT(DISTINCT ta.tenant_id) AS adopting_tenants,
        ROUND(
            100.0 * COUNT(DISTINCT ta.tenant_id)
            / NULLIF(COUNT(DISTINCT tc.tenant_id), 0),
            2
        ) AS adoption_pct
    FROM tenant_cohorts tc
    LEFT JOIN tenant_adoption ta
        ON tc.tenant_id = ta.tenant_id
    GROUP BY tc.size_bucket, tc.plan
)

SELECT
    size_bucket,
    plan,
    total_tenants,
    adopting_tenants,
    adoption_pct,
    -- Rank within each bucket to see which plan adopts faster
    RANK() OVER (
        PARTITION BY size_bucket
        ORDER BY adoption_pct DESC
    ) AS plan_rank
FROM adoption_rates
ORDER BY
    CASE size_bucket
        WHEN '0-10'     THEN 1
        WHEN '11-50'    THEN 2
        WHEN '51-200'   THEN 3
        WHEN '201-1000' THEN 4
        WHEN '1000+'    THEN 5
    END,
    plan_rank;

What the Interviewer Was Testing

The interviewer pointed out several things I got right:

  • HAVING clause for the adoption threshold — filtering after aggregation, not before
  • NULLIF in the division — preventing division by zero if a size bucket has no tenants
  • LEFT JOIN from tenants to adoption — preserving non-adopting tenants so the denominator is correct
  • Window function for ranking — adding business insight on top of the raw numbers

Then came the follow-up:

“How would you optimize this for a table with 50 billion rows and a 30-day rolling window?”

My answer: partition pruning and materialized summary tables.

-- Partition events by timestamp (e.g., daily partitions)
-- The WHERE clause on timestamp automatically prunes old partitions
-- Additionally, pre-aggregate at the tenant-user-day level:

CREATE TABLE IF NOT EXISTS daily_feature_adoption AS
SELECT
    DATE(timestamp) AS event_date,
    feature_name,
    tenant_id,
    user_id,
    COUNT(*) AS interaction_count
FROM events
GROUP BY DATE(timestamp), feature_name, tenant_id, user_id;

-- Then the main query reads from the daily table (~1000x fewer rows)

The interviewer also asked about handling backfill scenarios, where event data is delayed or arrives out of order. I mentioned idempotent upserts into the materialized table (INSERT ... ON CONFLICT ... DO UPDATE in Postgres terms, or a MERGE statement in a warehouse like Snowflake), which fits Atlassian's Snowflake-and-dbt stack.
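As a rough sketch, the upsert pattern looks like this (Snowflake-style MERGE; staged_daily_feature_adoption is a hypothetical staging table holding the re-aggregated rows for the late-arriving days):

MERGE INTO daily_feature_adoption AS target
USING staged_daily_feature_adoption AS source
    ON  target.event_date   = source.event_date
    AND target.feature_name = source.feature_name
    AND target.tenant_id    = source.tenant_id
    AND target.user_id      = source.user_id
WHEN MATCHED THEN UPDATE SET
    interaction_count = source.interaction_count
WHEN NOT MATCHED THEN INSERT
    (event_date, feature_name, tenant_id, user_id, interaction_count)
VALUES
    (source.event_date, source.feature_name, source.tenant_id,
     source.user_id, source.interaction_count);

Re-running the backfill for the same day is then a no-op rather than a double count.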


Phase 3: Python — Analytics Pipeline

Next, I was given a coding exercise: Build a Python module that processes raw Jira event data and computes a weekly feature engagement score.

The scenario: Atlassian receives event streams from Jira Cloud instances. Events arrive via Kafka and land in S3 as Parquet files. The analytics team needs a daily batch job that computes engagement metrics per feature, per tenant, and writes the results to an analytics warehouse.

The Requirements

  1. Read Parquet files from S3 (simulated with a local path for the interview)
  2. Deduplicate events (same event_id can appear multiple times due to retries)
  3. Compute a weekly engagement score per (tenant_id, feature_name):
engagement_score = (unique_users * 0.4) +
                   (total_interactions / max_interactions * 0.3) +
                   (days_active / 7 * 0.3)
  4. Output a DataFrame and write it back to Parquet

My Solution

import pyarrow.parquet as pq
import pandas as pd
import numpy as np
from pathlib import Path
from typing import Optional
import logging

logger = logging.getLogger(__name__)


class FeatureEngagementPipeline:
    """
    Computes weekly feature engagement scores from raw Jira/Confluence
    product event Parquet files.

    Engagement score formula:
      score = (unique_users * 0.4)
            + (total_interactions / max_interactions * 0.3)
            + (days_active / 7 * 0.3)

    All components are normalized to [0, 1] before weighting.
    """

    WEIGHT_UNIQUE_USERS = 0.4
    WEIGHT_INTERACTIONS = 0.3
    WEIGHT_DAYS_ACTIVE = 0.3

    def __init__(
        self,
        input_dir: str | Path,
        output_dir: str | Path,
        days_lookback: int = 7,
    ):
        self.input_dir = Path(input_dir)
        self.output_dir = Path(output_dir)
        self.days_lookback = days_lookback

    def load_events(self) -> pd.DataFrame:
        """Load and concatenate all Parquet files from the input directory."""
        parquet_files = sorted(self.input_dir.glob("*.parquet"))
        if not parquet_files:
            raise FileNotFoundError(
                f"No Parquet files found in {self.input_dir}"
            )

        logger.info("Loading %d Parquet files", len(parquet_files))
        dfs = [pq.read_table(f).to_pandas() for f in parquet_files]
        events = pd.concat(dfs, ignore_index=True)
        logger.info("Loaded %d total events", len(events))
        return events

    def deduplicate(self, events: pd.DataFrame) -> pd.DataFrame:
        """Remove duplicate events based on event_id, keeping the first occurrence."""
        before = len(events)
        events = events.drop_duplicates(subset=["event_id"], keep="first")
        removed = before - len(events)
        logger.info(
            "Deduplicated: %d total -> %d unique (removed %d)",
            before, len(events), removed,
        )
        return events

    def filter_time_window(
        self, events: pd.DataFrame
    ) -> pd.DataFrame:
        """Keep only events within the lookback window."""
        cutoff = pd.Timestamp.now() - pd.Timedelta(days=self.days_lookback)
        events = events[events["timestamp"] >= cutoff].copy()
        logger.info("Filtered to %d events in last %d days", len(events), self.days_lookback)
        return events

    def compute_engagement_scores(
        self, events: pd.DataFrame
    ) -> pd.DataFrame:
        """
        Compute engagement score per (tenant_id, feature_name).

        Returns a DataFrame with columns:
          tenant_id, feature_name, unique_users, total_interactions,
          days_active, engagement_score, score_date
        """
        # Aggregate per tenant + feature
        agg = events.groupby(["tenant_id", "feature_name"]).agg(
            unique_users=("user_id", "nunique"),
            total_interactions=("event_id", "count"),
            days_active=("timestamp", lambda x: x.dt.date.nunique()),
        ).reset_index()

        # Normalize each component to [0, 1]
        max_users = agg["unique_users"].max() or 1
        max_interactions = agg["total_interactions"].max() or 1
        max_days = max(self.days_lookback, 1)  # days_active can't exceed the window, so this keeps the ratio in [0, 1]

        agg["norm_users"] = agg["unique_users"] / max_users
        agg["norm_interactions"] = agg["total_interactions"] / max_interactions
        agg["norm_days"] = agg["days_active"] / max_days

        # Weighted score
        agg["engagement_score"] = (
            agg["norm_users"] * self.WEIGHT_UNIQUE_USERS
            + agg["norm_interactions"] * self.WEIGHT_INTERACTIONS
            + agg["norm_days"] * self.WEIGHT_DAYS_ACTIVE
        )

        agg["score_date"] = pd.Timestamp.now().date()

        # Round for readability
        agg["engagement_score"] = agg["engagement_score"].round(4)

        return agg

    def write_output(self, scores: pd.DataFrame) -> str:
        """Write engagement scores to a Parquet file in the output directory."""
        self.output_dir.mkdir(parents=True, exist_ok=True)
        output_file = self.output_dir / f"engagement_{pd.Timestamp.now().date()}.parquet"

        # Select output columns
        output_cols = [
            "tenant_id", "feature_name", "unique_users",
            "total_interactions", "days_active",
            "engagement_score", "score_date",
        ]
        scores[output_cols].to_parquet(output_file, index=False)
        logger.info("Wrote %d rows to %s", len(scores), output_file)
        return str(output_file)

    def run(self) -> pd.DataFrame:
        """Execute the full pipeline: load -> dedup -> filter -> score -> write."""
        events = self.load_events()
        events = self.deduplicate(events)
        events = self.filter_time_window(events)
        scores = self.compute_engagement_scores(events)
        self.write_output(scores)
        return scores


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    pipeline = FeatureEngagementPipeline(
        input_dir="./data/events/",
        output_dir="./data/output/",
        days_lookback=7,
    )

    result = pipeline.run()
    print(f"\nTop 10 most engaged features:")
    print(result.nlargest(10, "engagement_score").to_string(index=False))
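A quick way to sanity-check the dedup step is a small pytest-style test. A minimal sketch, assuming the class above lives in a module named feature_engagement (hypothetical name):

import pandas as pd

from feature_engagement import FeatureEngagementPipeline  # hypothetical module name


def test_deduplicate_keeps_first_occurrence(tmp_path):
    pipeline = FeatureEngagementPipeline(input_dir=tmp_path, output_dir=tmp_path)
    events = pd.DataFrame({
        "event_id": [1, 1, 2],
        "tenant_id": [10, 10, 10],
        "feature_name": ["advanced_roadmaps"] * 3,
        "user_id": [100, 100, 101],
        "timestamp": pd.to_datetime(["2026-05-01", "2026-05-01", "2026-05-02"]),
    })

    deduped = pipeline.deduplicate(events)

    # The retried event_id 1 collapses to a single row; unique events survive.
    assert len(deduped) == 2
    assert sorted(deduped["event_id"]) == [1, 2]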

Follow-Up: Scaling the Pipeline

The interviewer then asked:

“This works for a single run, but at Atlassian we process billions of events daily. How do you scale this?”

I outlined three layers:

1. Streaming-first ingestion: Raw events go into Kafka → Spark Structured Streaming (or Flink) → S3 partitioned by date/tenant_id. This replaces the batch Parquet load with partitioned, schema-enforced storage (a PySpark sketch follows after this list).

2. Incremental computation: Instead of recomputing everything, maintain a materialized view that aggregates per (tenant_id, feature_name, date). Each daily run only processes that day’s partition and merges into the rolling 7-day window.

3. Orchestration and monitoring: Use Airflow or Dagster to schedule the pipeline with dependencies — e.g., don’t start engagement scoring until the event ingestion job reports status=complete. Add data quality checks: alert if event volume drops more than 20% from the previous day.
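Here is a minimal PySpark sketch of layer 1, with made-up broker addresses, topic names, and S3 paths; the real ingestion jobs are certainly more involved:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, LongType, StringType, TimestampType

spark = SparkSession.builder.appName("product-event-ingest").getOrCreate()

event_schema = StructType([
    StructField("event_id", LongType()),
    StructField("event_type", StringType()),
    StructField("feature_name", StringType()),
    StructField("tenant_id", LongType()),
    StructField("user_id", LongType()),
    StructField("timestamp", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # hypothetical brokers
    .option("subscribe", "product_events_jira")
    .load()
)

events = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_date", F.to_date("timestamp"))
    .withWatermark("timestamp", "1 hour")               # bound the dedup state
    .dropDuplicates(["event_id"])                       # retry-safe ingestion
)

(
    events.writeStream.format("parquet")
    .option("path", "s3://analytics-lake/events/")      # hypothetical bucket
    .option("checkpointLocation", "s3://analytics-lake/checkpoints/events/")
    .partitionBy("event_date", "tenant_id")
    .trigger(processingTime="1 minute")
    .start()
)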

The interviewer was particularly interested in the data quality angle. I mentioned using Great Expectations or dbt tests to validate that event_id is unique, tenant_id is not null, and feature_name matches an allowed enum list.
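Concretely, those checks boil down to assertion queries of the kind dbt's built-in unique, not_null, and accepted_values tests generate. Each should return zero rows, and the orchestrator fails the run otherwise (the allowed feature list below is made up):

-- event_id must be unique in the landed events
SELECT event_id
FROM events
GROUP BY event_id
HAVING COUNT(*) > 1;

-- tenant_id must never be null
SELECT *
FROM events
WHERE tenant_id IS NULL;

-- feature_name must come from the allowed enum list (values are illustrative)
SELECT DISTINCT feature_name
FROM events
WHERE feature_name NOT IN ('advanced_roadmaps', 'sprint_planning', 'backlog', 'boards');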


Phase 4: System Design — Product Analytics Architecture

The final technical round asked me to design a product analytics pipeline for Jira and Confluence that supports:

  • Real-time dashboards for product managers (hourly latency)
  • Historical trend analysis (daily rollups going back 3+ years)
  • Multi-tenant data isolation (critical — customer A must never see customer B’s data)
  • GDPR-compliant data deletion (right to be forgotten)

I sketched this architecture:

                          ┌─────────────────────────────────────────────────┐
                          │              Atlassian Product Analytics        │
                          └─────────────────────────────────────────────────┘

┌──────────────┐     ┌──────────────┐     ┌─────────────────────────────┐
│  Jira Cloud  │     │ Confluence   │     │  Jira Service Management    │
│  (SaaS)      │     │  Cloud       │     │  (SaaS)                     │
└──────┬───────┘     └──────┬───────┘     └────────────┬────────────────┘
       │                    │                          │
       │  event stream      │  event stream            │  event stream
       ▼                    ▼                          ▼
┌──────────────────────────────────────────────────────────────────────┐
│                        Kafka (Event Ingestion)                       │
│  topic: product_events_jira   │   topic: product_events_confluence  │
└──────────────────────────┬──────────────────────────────────────────┘

              ┌────────────┴────────────┐
              ▼                         ▼
   ┌─────────────────┐        ┌──────────────────────┐
   │  Spark Streaming │        │   Schema Registry    │
   │  (enrich + dedup)│        │  (Avro schema evolve)│
   └────────┬────────┘        └──────────────────────┘


   ┌─────────────────────────────────┐
   │         S3 Data Lake            │
   │  /events/year=2026/month=05/    │
   │  /events/year=2026/month=06/    │
   │  Partitioned: date, tenant_id   │
   │  Format: Parquet (Snappy)       │
   └────────┬────────────────────────┘

     ┌──────┴──────┐
     ▼             ▼
┌──────────┐  ┌──────────────┐
│ Snowflake│  │   dbt        │
│ (WH)     │  │ (transforms) │
└────┬─────┘  └──────┬───────┘
     │               │
     ▼               ▼
┌────────────┐  ┌──────────────┐
│  Looker    │  │   Airflow    │
│ (dashboards│  │ (orchestrate)│
│  for PMs)  │  │ + monitoring │
└────────────┘  └──────────────┘

┌──────────────────────────────────────────────────────────────────┐
│                       Data Governance Layer                      │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────────┐ │
│  │ Row-level sec- │  │ GDPR deletion  │  │ Data quality gates │ │
│  │ urity (tenant) │  │ pipeline       │  │ (Great Expect.)    │ │
│  └────────────────┘  └────────────────┘  └────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Multi-Tenant Isolation

I emphasized that every single query — whether in dbt, Snowflake, or Looker — must include a tenant_id filter. At Atlassian, this is enforced through:

  • Snowflake Row Access Policies (RAP) — database-level row filtering so a tenant’s data is physically inaccessible to queries from another tenant’s context (sketched after this list)
  • Tenant-scoped database roles — each analytics user gets a role scoped to their tenant’s data
  • dbt pre-hooks that inject {{ var("tenant_id_filter") }} into every model
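A minimal sketch of how a row access policy enforces this, assuming a hypothetical role-to-tenant mapping table (I don't know what Atlassian's actual policy definitions look like):

-- Hypothetical mapping of warehouse roles to the tenants they may read
CREATE TABLE IF NOT EXISTS governance.role_tenant_map (
    role_name  VARCHAR,
    tenant_id  BIGINT
);

-- Rows are visible only when the current role is mapped to that tenant
CREATE OR REPLACE ROW ACCESS POLICY governance.tenant_isolation
AS (tenant_id BIGINT) RETURNS BOOLEAN ->
    EXISTS (
        SELECT 1
        FROM governance.role_tenant_map m
        WHERE m.role_name = CURRENT_ROLE()
          AND m.tenant_id = tenant_id
    );

ALTER TABLE events ADD ROW ACCESS POLICY governance.tenant_isolation ON (tenant_id);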

GDPR Right to Be Forgotten

When a user requests deletion, the pipeline must:

  1. Identify all partitions containing events for that user_id
  2. Rewrite Parquet files excluding those events (since Parquet is immutable, this means writing a new file)
  3. Update materialized views — recompute aggregations for affected (tenant_id, feature_name, date) tuples
  4. Log the deletion for audit compliance

This is why the architecture separates raw events (append-only, but rewriteable for GDPR) from aggregated tables (which are recomputed on demand).
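Step 2 is the fiddly part. A minimal sketch of scrubbing one user from a single partition file with pyarrow, with simplified paths and swap logic:

from pathlib import Path

import pyarrow.compute as pc
import pyarrow.parquet as pq


def scrub_user_from_partition(parquet_path: str | Path, user_id: int) -> int:
    """Rewrite a Parquet file with all rows for `user_id` removed.

    Parquet files are immutable, so "deleting" means writing a
    replacement file and swapping it in. Returns the rows dropped.
    """
    parquet_path = Path(parquet_path)
    table = pq.read_table(parquet_path)

    keep_mask = pc.not_equal(table["user_id"], user_id)
    scrubbed = table.filter(keep_mask)
    dropped = table.num_rows - scrubbed.num_rows

    if dropped:
        tmp_path = parquet_path.with_suffix(".tmp.parquet")
        pq.write_table(scrubbed, tmp_path)
        tmp_path.replace(parquet_path)  # atomic locally; S3 needs copy + delete

    return dropped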


Phase 5: Behavioral and Cultural Fit

Atlassian puts a lot of weight on its company values, particularly the ones about openness and playing as a team, and the behavioral round reflected that. It focused on:

Question: “Tell me about a time you had to push back on a product or engineering decision based on data.”

I shared an experience where I noticed our analytics pipeline was dropping 12% of events due to a schema change that wasn’t properly versioned. The product team was making decisions based on incomplete data. I had to:

  1. Quantify the impact — build a report showing the discrepancy between raw Kafka counts and warehouse counts
  2. Propose a fix — implement schema versioning in the ingestion pipeline with backward-compatible field additions
  3. Retroactively correct — reprocess the affected time window so historical dashboards were accurate

The interviewer wanted to see that I could translate data issues into business impact, not just technical problems.

Question: “How do you handle it when two stakeholders want conflicting metrics from the same dataset?”

My approach: don’t pick a side, expose the full picture. If Product wants “DAU based on login” and Engineering wants “DAU based on API call”, I build both, but I also build a third metric — “DAU based on meaningful action” (e.g., created an issue, edited a page) — that often reveals what neither stakeholder was measuring correctly.

This is what Atlassian calls “data empathy” — understanding what each stakeholder thinks they’re measuring, and surfacing what they actually need to measure.
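Concretely, the three definitions can sit side by side in one query so nobody has to argue about which one is "real" (the event_type values here are illustrative, not Atlassian's actual taxonomy):

SELECT
    DATE(timestamp) AS activity_date,
    COUNT(DISTINCT CASE WHEN event_type = 'login'
                        THEN user_id END) AS dau_login,
    COUNT(DISTINCT CASE WHEN event_type = 'api_call'
                        THEN user_id END) AS dau_api,
    COUNT(DISTINCT CASE WHEN event_type IN ('issue_created', 'page_edited')
                        THEN user_id END) AS dau_meaningful_action
FROM events
WHERE timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE(timestamp)
ORDER BY activity_date;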


Interview Summary

Here’s a structured recap of the entire interview:

Round 1 — Product Sense (45 min)

  • Topic: Identifying health signals from product usage data
  • Key skill: Translating product intuition into measurable metrics
  • What they valued: Concrete thinking about what “healthy” means in context

Round 2 — SQL (60 min)

  • Topic: Feature adoption analysis with window functions, CTEs, and aggregation
  • Key skill: Correct joins, handling edge cases (division by zero, empty buckets)
  • Bonus: Partition pruning and materialized views for scale

Round 3 — Python Pipeline (60 min)

  • Topic: Building a batch analytics pipeline with deduplication, time-windowing, and scoring
  • Key skill: Clean, modular code with logging and error handling
  • Bonus: Scaling discussion (Kafka, Spark, incremental computation)

Round 4 — System Design (45 min)

  • Topic: End-to-end product analytics architecture
  • Key skill: Balancing real-time needs with batch processing, multi-tenant isolation
  • Bonus: GDPR compliance thinking (data deletion pipeline)

Round 5 — Behavioral (30 min)

  • Topic: Data-driven decision making, stakeholder management
  • Key skill: Communicating data issues in business terms

Total time: ~4.5 hours (spread across 2 days)

What Made the Difference

Three things stood out to me as differentiators:

  1. Thinking in tenants, not just users. Atlassian’s multi-tenant SaaS model means every data model, query, and dashboard has an extra dimension that most companies don’t worry about. Showing awareness of this upfront was a clear signal.

  2. Connecting data quality to business decisions. It’s not enough to say “there are nulls.” You need to say “these nulls mean 8% of our adoption numbers are wrong, which means the product team is misallocating sprint capacity.”

  3. Practical architecture over buzzwords. When drawing the pipeline, I focused on why each component existed and what failure mode it addressed, not just listing technologies. The interviewer could tell I’d actually built things at this scale.


If you’re preparing for a Data Engineer role at Atlassian or a similar product analytics position, here’s what I’d study:

Product Analytics Fundamentals:

  • Lean Analytics by Alistair Croll and Benjamin Yoskovitz — the book that defines how product metrics actually drive decisions
  • Amplitude’s blog on “North Star Metrics” — understand how to identify the one metric that matters most for each product
  • Mixpanel’s guide to “Activation” metrics — critical for SaaS products like Jira


Final Thoughts

The Atlassian Data Engineer interview is less about writing the fastest query and more about demonstrating that you understand the product, the users, and the business impact of the data you’re moving. The SQL and coding rounds are solidly medium-hard, but the differentiator is always the follow-up — the questions about scale, quality, governance, and what the numbers actually mean.

If you can show that you think about data the way a product manager thinks about features — with empathy for the end user, awareness of the business context, and rigor in measurement — you’ll be in a strong position.

Good luck. And remember: at Atlassian, the best data engineers are the ones who make the product team’s dashboards actually useful, not just technically impressive.

