Voozh

I've been on both sides of the Spark interview table more times than I'd like to admit. The pattern is always the same: candidates can write a groupBy().agg() in their sleep, but the moment you ask them why their job spills to disk or where the shuffle happens in a query plan, things fall apart. Spark interview questions aren't about syntax. They're about whether you understand execution. That's what separates a senior data engineer from someone who copy-pastes PySpark from Stack Overflow.

These 12 problems are the ones I've seen surface repeatedly in senior DE loops. They cover shuffle, skew, joins, memory, caching, and the optimizer. If you can answer all 12 cold, you're ready. If you can't, now you know where to grind. datadriven.io is great for spark practice if you want reps beyond what's here.

Want to practice these for real? Solve these problems live here with a real editor and graded solutions.

1. Identify the Shuffle

The question: Given the following PySpark code, identify which transformations trigger a shuffle (stage boundary) and which do not. Explain why.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://data/events/")

# Step 1
filtered = df.filter(col("event_type") == "purchase")

# Step 2
projected = filtered.select("user_id", "amount", upper(col("region")).alias("region"))

# Step 3
grouped = projected.groupBy("region").sum("amount")

# Step 4
sorted_df = grouped.orderBy(col("sum(amount)").desc())

sorted_df.show()

Why it matters: Steps 1 and 2 are narrow transformations: filter and select each operate on one input partition with zero data movement. Steps 3 and 4 are wide transformations: groupBy redistributes rows by key across executors, and orderBy requires a global sort, both crossing stage boundaries. The trap is that candidates say "joins are wide" as a blanket rule but forget that orderBy is also wide. The follow-up interviewers push on: "How many stages does this job have?" Answer: three. One for the read+filter+select, one for the groupBy, one for the orderBy. Each shuffle is a stage boundary.

2. Tune Shuffle Partitions

The question: Your Spark job processes 50 GB of data. The default spark.sql.shuffle.partitions is 200. What's wrong, and what value would you set instead? Show the config change.

from pyspark.sql import SparkSession

spark = (
 SparkSession.builder
 .config("spark.sql.shuffle.partitions", "400")
 .getOrCreate()
)

# 50 GB / 400 partitions = 125 MB per partition (under the 128 MB target)

Why it matters: The default 200 shuffle partitions is cargo-cult tuning. It originated from empirical tests on specific cluster sizes and was never meant as a universal constant. At 200 partitions, 50 GB means 250 MB per partition, which exceeds the recommended 128 MB target and risks executor spill. The fix is simple division: target size under 128 MB, so 400 partitions puts you at 125 MB each. Retuning shuffle partitions to match actual data volume can deliver roughly 40% performance improvement without changing cluster size. Interviewers probe whether you can do this math on the spot or whether you just memorize "set it to 200."

3. Force a Broadcast Join

The question: Table orders has 500 million rows. Table regions has 5,000 rows (2 MB). Write a join that avoids shuffle entirely.

from pyspark.sql.functions import broadcast

orders = spark.read.parquet("s3://data/orders/")
regions = spark.read.parquet("s3://data/regions/")

result = orders.join(broadcast(regions), orders.region_id == regions.id, "inner")

Why it matters: The broadcast join threshold defaults to 10 MB (spark.sql.autoBroadcastJoinThreshold). Since regions is 2 MB, Spark would auto-broadcast it here. But the real interview signal is: do you know that a broadcast join is a narrow transformation despite being a join? No shuffle occurs; Spark sends the small table to every executor. Most candidates conflate "join" with "wide transformation," and this is now a standard trick question on Databricks loops. The follow-up: "What happens if you force a broadcast hint on a 2 GB table?" Answer: it collects the data to the driver and likely causes an OOM error. The hard limit for broadcast is roughly 8 GB.

4. Handle Data Skew with Salting

The question: Your join on customer_id takes 4 hours because one customer has 50 million rows while the median is 500. AQE is disabled (legacy Spark 2.4 cluster). Fix the skew.

from pyspark.sql.functions import col, lit, explode, array, rand, floor, concat

SALT_BUCKETS = 10

# Salt the large (skewed) side
orders_salted = orders.withColumn(
 "salt", floor(rand() * SALT_BUCKETS).cast("int")
).withColumn(
 "join_key", concat(col("customer_id"), lit("_"), col("salt"))
)

# Replicate the small side across all salt values
salt_range = spark.range(0, SALT_BUCKETS).withColumnRenamed("id", "salt")
customers_replicated = customers.crossJoin(salt_range).withColumn(
 "join_key", concat(col("customer_id"), lit("_"), col("salt"))
)

result = orders_salted.join(customers_replicated, "join_key", "inner")

Why it matters: A single key with 50M+ rows causes collect_list() or join aggregations to build massive data structures in one executor. Classic OOM. Salt-based skew handling (append random integer to the hot key, replicate the small side to match) is the standard production pattern tested in senior data engineering loops. The trap: candidates who jump straight to salting without first asking "is AQE enabled?" reveal they don't check what Spark already does for them. On Spark 3.2+, AQE's spark.sql.adaptive.skewJoin.enabled splits skewed partitions automatically. The salting question specifically tests whether you understand the mechanism AQE automates.

5. Predict AQE Behavior

The question: AQE is enabled (Spark 3.2+). You have a sort-merge join where the left side is 200 GB and the right side ends up being 8 MB after filters. What does AQE do at runtime, and what config controls this?

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760") # 10 MB

# AQE will convert the sort-merge join to a broadcast join at runtime
# because the right side (8 MB) is below the broadcast threshold
# after shuffle statistics reveal the actual size

Why it matters: Adaptive Query Execution dynamically reoptimizes at runtime using shuffle statistics. It does three things: coalesces small post-shuffle partitions, switches sort-merge joins to broadcast when one side proves small, and splits skewed partitions. The key insight is that Catalyst's static plan might choose sort-merge because it can't predict the right side will shrink to 8 MB after filters. AQE waits for the shuffle output, sees 8 MB, and switches strategies mid-execution. The follow-up that catches candidates: "Does AQE work with Structured Streaming?" No. AQE requires materialization points (pausing after a shuffle to analyze stats), which streaming doesn't have.

6. Read a Catalyst Plan

The question: You run df.explain(True) and see that a filter on status = 'active' appears before the scan in the optimized logical plan, even though you wrote it after the join in your code. What happened?

# Your code order:
result = (
 orders.join(customers, "customer_id")
 .filter(col("status") == "active")
)

# But explain(True) shows filter pushed down before the join
result.explain(True)

Why it matters: Catalyst's 4-stage pipeline (analysis, logical optimization, physical planning, code generation) applies predicate pushdown during logical optimization. It moves filters as early as possible to reduce data volume before expensive operations like joins. Only stage 3 (physical planning) is cost-based; stages 1, 2, and 4 are rule-based. The trap: candidates who break transformation chains with intermediate actions (like cache() or count()) prevent Catalyst from seeing the full chain, which disables these optimizations. More transformations in a single chain give Spark more opportunities to optimize, not fewer. That's counterintuitive, and it's exactly what interviewers are testing.

7. Window Function Partition Skew

The question: Your window function OOMs on one executor while 63 others sit idle. Diagnose and fix.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

# BAD: user_id "bot_account" has 90% of rows
window_spec = Window.partitionBy("user_id").orderBy("event_time")
df_ranked = df.withColumn("rn", row_number().over(window_spec))

# FIX: composite partition key spreads load
window_spec_fixed = Window.partitionBy("user_id", "event_date").orderBy("event_time")
df_ranked_fixed = df.withColumn("rn", row_number().over(window_spec_fixed))

Why it matters: Window functions don't intrinsically cause OOM. Data skew does. A partitionBy("user_id") where one user has 90% of the rows concentrates that entire partition on one executor. The window function itself is blameless. This shifts the interview from "how to fix the window" to "how to detect and rebalance upstream data." The follow-up: row_number() without a tiebreaker column in orderBy produces nondeterministic results across executor restarts. If your tiebreaker isn't unique, the ranking is undefined. Silent data bugs.

8. Cache Storage Level Trade-off

The question: You have a DataFrame used three times in your pipeline. It involves an expensive join. Choose the right storage level and explain why cache() alone isn't enough.

from pyspark import StorageLevel

expensive_df = orders.join(broadcast(regions), "region_id").filter(col("amount") > 100)

# cache() is lazy; it does nothing until an action fires
expensive_df.persist(StorageLevel.MEMORY_AND_DISK)
expensive_df.count() # triggers the actual caching

# ... use expensive_df three more times ...

expensive_df.unpersist() # release memory when done

Why it matters: cache() is a lazy transformation, not an action. Calling df.cache() does nothing until you trigger an action like count(). This is the single most common Spark caching mistake. Beyond that: RDD cache() defaults to MEMORY_ONLY, but DataFrame cache() defaults to MEMORY_AND_DISK. When a partition doesn't fit in memory, disk I/O is often faster than recomputing an expensive join from scratch. MEMORY_ONLY is only correct if you've confirmed the dataset fits entirely in executor memory. The follow-up that separates senior from staff: "When does caching hurt performance?" Answer: intermediate cache() calls break the transformation chain, preventing Catalyst from reordering expensive operations. Cache only when the DataFrame is used multiple times and the recomputation cost exceeds the serialization cost.

9. Diagnose an Executor OOM

The question: Your executor crashes with java.lang.OutOfMemoryError: Java heap space. Executor memory is set to 8 GB. The job processes 100 GB with 800 shuffle partitions. Walk through your debugging steps.

spark = (
 SparkSession.builder
 .config("spark.executor.memory", "8g")
 .config("spark.executor.memoryOverhead", "2g") # 20% for off-heap
 .config("spark.memory.fraction", "0.6") # 60% for unified memory
 .config("spark.memory.storageFraction", "0.5") # 50/50 split
 .config("spark.sql.shuffle.partitions", "800") # 125 MB target
 .config("spark.executor.cores", "5") # 5 cores per executor
 .getOrCreate()
)

Why it matters: Spark reserves a hardcoded 300 MB from executor heap for internal bookkeeping. Of the remaining 7.7 GB, 60% (4.6 GB) goes to Unified Memory, split between execution (shuffles, joins, sorts) and storage (cached data). With 5 cores per executor, each task gets roughly 920 MB of unified memory. At 125 MB per partition, that's fine for most operations, but a skewed partition or a large broadcast variable blows past it. The counterintuitive trap: "just add more memory" is backwards. A 32 GB heap with routine GC can pause for 74 seconds out of 120 seconds of execution (61% pause time). The real lever is partition count and memory fraction tuning, not heap size. Interviewers love this because it inverts junior intuitions.

10. Join Type Selection Under Constraints

The question: Table A is 300 GB. Table B is 500 MB. You have 10 executors with 8 GB each. Which join strategy, and why not the other two?

# 500 MB exceeds default 10 MB threshold, so increase it
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "536870912") # 512 MB

result = table_a.join(broadcast(table_b), "id", "inner")

Why it matters: Three options: broadcast, sort-merge, shuffle-hash. Sort-merge shuffles both sides (300 GB + 500 MB over the network). Shuffle-hash shuffles both sides and builds hash tables. Broadcast sends 500 MB to each executor (5 GB total network, no shuffle of the 300 GB side). With 8 GB per executor, 500 MB fits comfortably. The default 10 MB threshold is "extremely conservative" per production guidance; real workloads typically increase it to 100-500 MB. The follow-up that catches people: "What if Table B is 500 MB but has a type mismatch on the join key?" A string joined to an int silently produces empty results. Seasoned engineers check types before blaming join strategy.

11. Eliminate an Unnecessary Shuffle

The question: You have two window functions over the same DataFrame with the same partitionBy but different orderBy. How many shuffles happen, and can you reduce it?

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, sum as spark_sum

w1 = Window.partitionBy("dept_id").orderBy("salary")
w2 = Window.partitionBy("dept_id").orderBy("hire_date")

# Two windows, same partitionBy, different orderBy: one shuffle, two sorts
df_result = (
 df.withColumn("salary_rank", row_number().over(w1))
 .withColumn("tenure_rank", row_number().over(w2))
)

Why it matters: Spark only shuffles on partitionBy boundaries. Since both windows partition by dept_id, there's one shuffle (one repartition by dept_id) followed by two local sorts within each partition. Candidates who say "two windows = two shuffles" reveal they don't understand stage boundaries. The follow-up: if you change w2 to partitionBy("region_id"), now you get two shuffles because the partition keys differ. Reducing shuffle stages is underrated in interviews; a query with fewer stage boundaries can outperform a multi-stage query by orders of magnitude because stage synchronization and shuffle materialization are the true bottleneck, not the partition count.

12. AQE Partition Coalescing

The question: After a groupBy, your job produces 200 shuffle partitions, but only 15 have data (the rest are empty). AQE is enabled. What happens, and what config controls it?

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "134217728") # 128 MB

# AQE merges 185 empty/tiny partitions into ~15 real ones
# reducing task scheduling overhead dramatically
result = df.groupBy("country_code").agg(spark_sum("revenue"))

Why it matters: 185 empty partitions means 185 tasks that do nothing but consume scheduling overhead. AQE's partition coalescing merges small post-shuffle outputs into fewer, right-sized partitions. But here's the critical misconception: AQE can only reduce partitions, never increase them. If you start with too few partitions (say, 10 for 100 GB of data), AQE sits idle while your executors OOM on 10 GB partitions. You still need to set spark.sql.shuffle.partitions high enough for AQE to have material to coalesce. TPC-DS benchmarks showed up to 8x speedup on specific queries with AQE, but only when the initial partition count gave AQE room to work.

The tools change every 18 months. The problems don't change. Shuffle, skew, memory, upstream teams breaking contracts without telling you. These are eternal. Learn the concepts; the syntax is the easy part.

Which spark interview problem would you add to this list? I'm always curious which scenario questions are showing up in loops right now.

URL: https://dev.to/datadriven/top-12-spark-interview-problems-for-data-engineers-with-answers-4h0e

⇱ Top 12 Spark Interview Problems for Data Engineers, With Answers - DEV Community