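Splunk Data Engineer Interview Walkthrough 2026: SPL Log Analysis + SIEM Security Event Pipeline, a Full Retrospective

Company: Splunk | Role: Data Engineer (L4) | Format: Phone Screen + Virtual Onsite (4 rounds) | Result: Pass → Offer

In January 2026 I applied for Splunk's Data Engineer role through an internal referral. The whole process took about 3 weeks.

What sets Splunk's DE interview apart: SPL (Splunk Processing Language) is the core language of the interview. Splunk is a log analysis and security monitoring platform with its own query language, so SQL and Spark alone are not enough; you are also expected to write SPL for security event correlation.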

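Phone Screen: SPL Security Event Analysis

Question: Detect a brute-force attack with SPL

Given login event logs, use SPL to find the IP addresses with 10 or more failed logins within a 5-minute window.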

// Security event correlation: detect brute-force attacks
// Find IPs with 10 or more failed logins in a 5-minute window
// (lines starting with // are reader notes, not part of the search)

index=security sourcetype=login_events status=failed
| eval event_time = _time
| bin _time span=5m
| stats count AS failed_attempts, values(username) AS usernames, max(event_time) AS last_attempt
    BY src_ip _time
| where failed_attempts >= 10
| eval threat_level = if(failed_attempts >= 50, "critical",
                     if(failed_attempts >= 25, "high", "medium"))
| lookup threat_intel_by_ip src_ip OUTPUT threat_type, confidence
| where isnotnull(threat_type) OR threat_level == "critical"
| table _time, src_ip, failed_attempts, usernames, threat_level, threat_type, confidence
| sort 0 -failed_attempts
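
To make the windowing logic concrete, here is a minimal Python sketch of the same idea: snap each failed login to the start of its 5-minute window, count per (IP, window), and keep the buckets at or above the threshold. The event tuple shape and the sample data are assumptions made for illustration, not a Splunk format.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60   # same span as a 5-minute time bucket
THRESHOLD = 10            # flag a source IP at 10 or more failures per window


def brute_force_suspects(events, window=WINDOW_SECONDS, threshold=THRESHOLD):
    """Count failed logins per (src_ip, window start); keep buckets >= threshold.

    `events` is an iterable of (epoch_seconds, src_ip, status) tuples.
    This shape is hypothetical and only serves the sketch.
    """
    counts = Counter()
    for ts, src_ip, status in events:
        if status != "failed":
            continue
        bucket_start = ts - (ts % window)   # snap to the start of the window
        counts[(src_ip, bucket_start)] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


# Synthetic sample: 12 failures from one IP in a single window, 3 from another.
sample = [(1_700_000_100 + i, "10.0.0.5", "failed") for i in range(12)]
sample += [(1_700_000_100 + i, "10.0.0.9", "failed") for i in range(3)]
sample.append((1_700_000_150, "10.0.0.9", "success"))

suspects = brute_force_suspects(sample)
```

One trade-off worth mentioning in an interview: fixed (tumbling) windows can miss a burst that straddles a boundary, for example 5 failures at the end of one window and 5 at the start of the next.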

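Interviewer follow-up:

"This SPL query has been running for 30 seconds against 10 billion log events and still has no results. How would you optimize it?"

My answer: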

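// Optimization 1: use tstats against an accelerated data model
// It reads precomputed summaries instead of raw events, often 10-100x faster than a plain stats
// Fields in a data model search must carry the dataset prefix (Authentication.)
| tstats count AS failed_attempts values(Authentication.username) AS usernames
    FROM datamodel=Security.Authentication
    WHERE Authentication.action="failure"
    BY Authentication.src_ip _time span=5m
| rename Authentication.src_ip AS src_ip
| where failed_attempts >= 10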

// Optimization 2: use a summary index
// Precompute failed-login counts on a schedule; queries then read the summary directly
index=security sourcetype=login_events status=failed
| bin _time span=5m
| stats count AS failed_attempts BY _time src_ip
| collect index=failed_logins_summary

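// Optimization 3: restrict the time range
// The subsearch has to return fields named earliest and latest to act as time bounds
index=security sourcetype=login_events status=failed
[| inputlookup time_windows.csv | head 1 | rename earliest_time AS earliest latest_time AS latest | fields earliest, latest]
| stats count BY src_ip
| where count >= 10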
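
The summary-index idea can be sketched in plain Python: aggregate the raw failures into per-window counts once, then answer later questions from that small table instead of rescanning raw events. The function names and the event shape below are hypothetical and only illustrate the pattern.

```python
from collections import defaultdict


def build_summary(events, span=300):
    """Pre-aggregate raw (epoch_seconds, src_ip) failures into per-span counts."""
    summary = defaultdict(int)
    for ts, src_ip in events:
        summary[(ts - ts % span, src_ip)] += 1
    return dict(summary)


def failures_in_range(summary, src_ip, start, end):
    """Sum the precomputed bucket counts for one IP over [start, end)."""
    return sum(
        n for (bucket, ip), n in summary.items()
        if ip == src_ip and start <= bucket < end
    )


# Synthetic raw events: five failures from one IP, one from another.
events = [
    (0, "10.0.0.5"), (10, "10.0.0.5"), (299, "10.0.0.5"),
    (300, "10.0.0.5"), (601, "10.0.0.5"),
    (5, "10.0.0.9"),
]
summary = build_summary(events)
first_ten_minutes = failures_in_range(summary, "10.0.0.5", 0, 600)
```

The cost is granularity: once events are rolled up into 5-minute buckets, questions that need finer resolution have to go back to the raw index.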

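VO Round 1: Log Parsing Pipeline Design

Question: Design a multi-format log parsing pipeline

Splunk has to ingest thousands of different log formats (Nginx, Apache, Java, syslog). Design a unified parsing pipeline.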

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import (
    coalesce, col, from_json, regexp_extract, struct, to_json, to_timestamp, when, window
)
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder \
    .appName("Splunk_Log_Parsing_Pipeline") \
    .getOrCreate()

# ====== Multi-format log parsing pipeline ======

# 1. Nginx access log parsing
# Format: 192.168.1.1 - user [10/Oct/2026:13:55:36 +0000] "GET /api/users HTTP/1.1" 200 2326 "http://example.com" "Mozilla/5.0"
nginx_pattern = r'^(\S+) \S+ (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\d+) "([^"]*)" "([^"]*)"'

def parse_nginx_log(message):
    """Return the columns extracted from an Nginx access log line."""
    return [
        regexp_extract(message, nginx_pattern, 1).alias("client_ip"),
        regexp_extract(message, nginx_pattern, 3).alias("raw_timestamp"),
        regexp_extract(message, nginx_pattern, 4).alias("method"),
        regexp_extract(message, nginx_pattern, 5).alias("path"),
        regexp_extract(message, nginx_pattern, 7).alias("status"),
        regexp_extract(message, nginx_pattern, 8).alias("bytes_sent"),
        regexp_extract(message, nginx_pattern, 9).alias("referer"),
        regexp_extract(message, nginx_pattern, 10).alias("user_agent"),
    ]

# 2. Java log parsing
# Format: 2026-05-30 10:30:00.123 [main] INFO com.example.service.UserService - User login successful: user_id=123
# The thread group accepts anything up to "]" so names like http-nio-8080-exec-1 also match.
java_pattern = r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \[([^\]]+)\] (\w+) ([\w.]+) - (.*)'

def parse_java_log(message):
    """Return the columns extracted from a Java application log line."""
    return [
        regexp_extract(message, java_pattern, 1).alias("raw_timestamp"),
        regexp_extract(message, java_pattern, 2).alias("thread"),
        regexp_extract(message, java_pattern, 3).alias("level"),
        regexp_extract(message, java_pattern, 4).alias("logger"),
        regexp_extract(message, java_pattern, 5).alias("message_content"),
    ]

# 3. Unified parsing pipeline
# Assumptions: the broker address is a placeholder, and each Kafka value is a
# JSON envelope of the form {"sourcetype": "...", "message": "..."}.
envelope_schema = StructType([
    StructField("sourcetype", StringType()),
    StructField("message", StringType()),
])

raw_logs = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "raw_logs.topic") \
    .load() \
    .select(from_json(col("value").cast("string"), envelope_schema).alias("envelope"),
            col("timestamp")) \
    .select("envelope.sourcetype", "envelope.message", "timestamp")

# Parse conditionally by sourcetype. Each format gets its own nullable struct
# column, because every branch of a single when() chain must have the same type.
parsed_logs = raw_logs \
    .withColumn("nginx",
        when(col("sourcetype") == "nginx_access",
             struct(*parse_nginx_log(col("message"))))) \
    .withColumn("java",
        when(col("sourcetype") == "java_app",
             struct(*parse_java_log(col("message"))))) \
    .withColumn("timestamp", coalesce(
        to_timestamp(col("nginx.raw_timestamp"), "dd/MMM/yyyy:HH:mm:ss Z"),
        to_timestamp(col("java.raw_timestamp"), "yyyy-MM-dd HH:mm:ss.SSS"),
        col("timestamp")))  # fall back to the Kafka record timestamp

# ====== Security event detection ======

# Flatten the Nginx fields next to the parsed event time
nginx_logs = parsed_logs \
    .filter(col("sourcetype") == "nginx_access") \
    .select("timestamp", "nginx.*")

# Detect abnormal HTTP error rates
http_errors = nginx_logs \
    .withColumn("is_error", col("status").cast("int") >= 400) \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(col("timestamp"), "5 minutes"),
        col("client_ip")
    ) \
    .agg(
        F.count("*").alias("total_requests"),
        F.sum(when(col("is_error"), 1).otherwise(0)).alias("error_count")
    ) \
    .withColumn("error_rate",
        col("error_count") * 100.0 / col("total_requests")) \
    .filter(
        (col("error_rate") > 50) &  # error rate above 50%
        (col("total_requests") > 20)  # more than 20 requests in the window
    )

# Detect brute force (failed logins)
brute_force = nginx_logs \
    .filter(
        (col("method") == "POST") &
        (col("path").rlike("/login|/auth")) &
        (col("status") == "401")
    ) \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(col("timestamp"), "5 minutes"),
        col("client_ip")
    ) \
    .agg(F.count("*").alias("failed_logins")) \
    .filter(col("failed_logins") >= 10)

# Merge alerts. Unioning two streaming aggregations needs a Spark version that
# supports multiple stateful operators; otherwise start one query per alert stream.
all_alerts = http_errors.unionByName(brute_force, allowMissingColumns=True)

# The Kafka sink expects a "value" column, so serialize each alert row as JSON
all_alerts \
    .select(to_json(struct("*")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "security_alerts.topic") \
    .option("checkpointLocation", "s3://splunk-checkpoints/alerts") \
    .start()
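
Before wiring a pattern into a streaming job, it can help to check it against the sample line from the comment with Python's `re` module. This is only a sanity check under an assumption: Spark's `regexp_extract` runs Java regular expressions, and the two engines agree for a simple pattern like this one but are not identical in general. The helper name `extract_fields` is hypothetical.

```python
import re

nginx_pattern = r'^(\S+) \S+ (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\d+) "([^"]*)" "([^"]*)"'

sample_line = (
    '192.168.1.1 - user [10/Oct/2026:13:55:36 +0000] '
    '"GET /api/users HTTP/1.1" 200 2326 "http://example.com" "Mozilla/5.0"'
)

# Capture-group numbers mirror the ones passed to regexp_extract in the pipeline.
FIELD_GROUPS = {
    "client_ip": 1, "timestamp": 3, "method": 4, "path": 5,
    "status": 7, "bytes_sent": 8, "referer": 9, "user_agent": 10,
}


def extract_fields(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = re.match(nginx_pattern, line)
    if match is None:
        return None
    return {name: match.group(idx) for name, idx in FIELD_GROUPS.items()}


fields = extract_fields(sample_line)
unparsed = extract_fields("not an access log line")
```

Checking the group numbers this way catches off-by-one mistakes (for example, group 2 is the user and group 6 is the protocol, neither of which the pipeline extracts).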

Interviewer follow-up:

"What do you do if a log format suddenly changes?"

My answer:

# Approach: schema drift detection + adaptive parsing

# 1. Monitor the parse failure rate (shown here for the Nginx pattern)
parse_failures = raw_logs \
    .filter(col("sourcetype") == "nginx_access") \
    .withColumn("parse_success",
        F.length(regexp_extract(col("message"), nginx_pattern, 1)) > 0) \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window(col("timestamp"), "5 minutes")) \
    .agg(
        F.count("*").alias("total"),
        F.sum(F.when(col("parse_success"), 1).otherwise(0)).alias("success")
    ) \
    .withColumn("failure_rate",
        (1 - col("success") / col("total")) * 100)

# Raise an alert when the parse failure rate exceeds 5%
parse_alerts = parse_failures.filter(col("failure_rate") > 5)

# 2. Adaptive parsing (multiple regular expressions)
import re

def adaptive_parse(message, patterns):
    """Try each regular expression in order and return the first match's named groups."""
    for pattern in patterns:
        match = re.match(pattern, message)
        if match:
            return match.groupdict()
    return {"raw": message, "parse_failed": True}

# Maintain a pattern library, most specific format first
nginx_patterns = [
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<proto>\S+)" (?P<status>\d+) (?P<size>\d+)',
    r'^(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)" (?P<status>\d+)'  # simplified format
]
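
The two ideas above (fall back through a pattern library, and track how often nothing matches) can be exercised together on a handful of lines. The sketch below is self-contained plain Python; the sample log lines and the `failure_rate` helper are made up for illustration.

```python
import re

# Pattern library, most specific format first (same shapes as the patterns above).
PATTERNS = [
    re.compile(
        r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>\S+)" (?P<status>\d+) (?P<size>\d+)'
    ),
    re.compile(
        r'^(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+)" (?P<status>\d+)'
    ),
]


def adaptive_parse(message, patterns=PATTERNS):
    """Return the named groups of the first matching pattern, or a failure marker."""
    for pattern in patterns:
        match = pattern.match(message)
        if match:
            return match.groupdict()
    return {"raw": message, "parse_failed": True}


def failure_rate(messages):
    """Percentage of messages that no pattern in the library could parse."""
    if not messages:
        return 0.0
    failed = sum(1 for m in messages if adaptive_parse(m).get("parse_failed"))
    return failed * 100.0 / len(messages)


lines = [
    '192.168.1.1 - user [10/Oct/2026:13:55:36 +0000] "GET /api/users HTTP/1.1" 200 2326',
    '192.168.1.2 [10/Oct/2026:13:55:37 +0000] "POST /login" 401',
    'ts=2026-10-10T13:55:38Z ip=192.168.1.3 status=500',  # unknown, drifted format
    '192.168.1.4 - - [10/Oct/2026:13:55:39 +0000] "GET / HTTP/1.1" 200 512',
]
rate = failure_rate(lines)
```

With one unparseable line out of four, the rate is 25%, well above a 5% alert threshold; in practice the unmatched raw lines would also be kept so a new pattern can be written from them.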

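VO Round 2: System Design, SIEM Data Platform

Question: Design Splunk's SIEM data platform

Splunk needs to collect security events from millions of endpoints, detect threats in real time, and provide compliance reporting.

My architecture: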

┌──────────────────────────────────────────────────────────────┐
│                    Splunk SIEM Data Platform                  │
│                                                               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│  │ Firewalls│  │ Servers  │  │ Apps     │  │ Network  │    │
│  │          │  │          │  │          │  │ Devices  │    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │
│       │              │              │              │         │
│       ▼              ▼              ▼              ▼        │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Universal Forwarders                        │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐             │  │
│  │  │ Filter   │  │ Parse    │  │ Buffer   │             │  │
│  │  │ (Drop    │  │ (Regex)  │  │ (Batch)  │             │  │
│  │  │  Debug)  │  │          │  │          │             │  │
│  │  └──────────┘  └──────────┘  └──────────┘             │  │
│  └──────────────────────┬─────────────────────────────────┘  │
│                         │                                    │
│                         ▼                                    │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Indexers Cluster                           │  │
│  │  ┌────────────┐  ┌────────────┐  ┌──────────────────┐  │  │
│  │  │ Hot Buckets│  │ Warm       │  │  Cold Buckets    │  │  │
│  │  │ (0-7 days) │  │ (7-30 days)│  │  (30-90 days)    │  │  │
│  │  └────────────┘  └────────────┘  └──────────────────┘  │  │
│  └──────────────────────────┬─────────────────────────────┘  │
│                             │                                │
│                    ┌────────┴────────┐                       │
│                    ▼                 ▼                       │
│  ┌──────────────────────┐  ┌──────────────────────┐         │
│  │  SPL Search Engine   │  │  Spark (Batch        │         │
│  │  (Real-time)         │  │   Analytics)         │         │
│  │  (tstats-accelerated)│  │                      │         │
│  └──────────────────────┘  └──────────────────────┘         │
│                             │                                │
│                             ▼                                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Threat Intelligence                         │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐             │  │
│  │  │ Brute    │  │ DDoS     │  │ Malware  │             │  │
│  │  │ Force    │  │ Detect   │  │ Detect   │             │  │
│  │  │ Detect   │  │          │  │          │             │  │
│  │  └──────────┘  └──────────┘  └──────────┘             │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘

Core SPL queries:

// Threat 1: brute-force detection
index=security sourcetype=login_events status=failed
| bin _time span=5m
| stats count AS failed_attempts, values(username) AS targeted_users
    BY src_ip _time
| where failed_attempts >= 10
| eval severity = if(failed_attempts >= 50, "critical",
                    if(failed_attempts >= 25, "high", "medium"))
| lookup geo_ip_lookup src_ip OUTPUT country, city
| table _time, src_ip, country, city, failed_attempts, targeted_users, severity

// Threat 2: anomalous data exfiltration detection
// Threshold: more than 1 GiB (1073741824 bytes) per source/destination pair per hour
index=network sourcetype=netflow
| bin _time span=1h
| stats sum(bytes_out) AS total_bytes_out, count AS connection_count
    BY src_ip dst_ip _time
| where total_bytes_out > 1073741824
| lookup asset_inventory src_ip OUTPUT hostname, department
| table _time, src_ip, hostname, department, dst_ip, total_bytes_out, connection_count

// Threat 3: lateral movement detection
index=security sourcetype=auth_events
| eval is_lateral = case(
    action == "sudo" AND src_ip != dst_ip, "lateral_movement",
    action == "ssh" AND src_ip != dst_ip, "lateral_movement",
    true(), "normal"
)
| where is_lateral == "lateral_movement"
| bin _time span=30m
| stats count AS lateral_moves, values(dst_ip) AS targeted_hosts
    BY src_ip _time
| where lateral_moves >= 5
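
As a variation on the lateral movement rule, the sketch below counts distinct destination hosts per source per 30-minute window rather than raw events, so one host hit repeatedly does not trip the rule. This is a deliberate change from the query above, not a translation of it; the event tuple shape and the sample data are assumptions.

```python
from collections import defaultdict

LATERAL_ACTIONS = {"ssh", "sudo"}


def lateral_movement_suspects(events, window=30 * 60, min_hosts=5):
    """Flag (src_ip, window start) pairs that reached at least min_hosts distinct hosts.

    `events` is an iterable of (epoch_seconds, src_ip, dst_ip, action) tuples;
    this shape is hypothetical and only serves the sketch.
    """
    hosts = defaultdict(set)
    for ts, src_ip, dst_ip, action in events:
        if action in LATERAL_ACTIONS and src_ip != dst_ip:
            hosts[(src_ip, ts - ts % window)].add(dst_ip)
    return {
        key: sorted(targets)
        for key, targets in hosts.items()
        if len(targets) >= min_hosts
    }


# One source fans out to five hosts; another hits the same host six times.
events = [(i * 60, "10.0.0.5", f"10.0.1.{i + 1}", "ssh") for i in range(5)]
events += [(i * 60, "10.0.0.6", "10.0.1.1", "ssh") for i in range(6)]

suspects = lateral_movement_suspects(events)
```

In SPL terms the equivalent change would be to aggregate with a distinct count of `dst_ip` instead of a plain `count`.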

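Interview Summary

What worked

SPL fluency: SPL is Splunk's core language, and in this interview SPL coding skill mattered more than SQL
Security domain knowledge: be familiar with common attack patterns such as brute force, DDoS, and lateral movement
Distributed search: know how Search Heads, Indexers, and Forwarders fit together

Things to watch

Log parsing experience: expect to be tested on handling a wide range of log formats
SPL optimization: tstats, summary indexes, and data model acceleration are the key topics
SIEM architecture: understand the full SIEM data flow (collection → indexing → search → alerting)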
