I've watched candidates with 8 years of experience go blank when asked to define the grain of a fact table. Not because they're bad engineers; because nobody told them that data modeling is the actual filter. SQL problems test syntax. System design tests memorization. Data modeling tests whether you can think. That's why it's the section that separates senior from staff, and why interviewers keep leaning on it harder every cycle. AI can spit out a medium LeetCode solution in seconds; it still can't explain why your grain decision breaks downstream aggregates.
These 12 problems are the ones I've seen repeatedly across FAANG and late-stage startup loops. They cover star schema design, dimensional modeling tradeoffs, SCDs, late-arriving data, and the classification calls that trip up even experienced candidates.
Want to practice these for real? Solve these problems live here with a real editor and graded solutions.
1. Define the Grain of a Fact Table
The question: You're building an analytics warehouse for a ride-sharing company. Before designing any tables, state the grain of the core fact table. What does one row represent?
The answer is: one row per completed trip. Not per driver. Not per day. One row per atomic trip event, keyed by trip_id, with foreign keys to dim_driver, dim_rider, dim_pickup_location, dim_dropoff_location, and dim_date. Measures include fare_amount, tip_amount, trip_duration_seconds, trip_distance_miles.
Why it matters: Grain is the single most important decision in dimensional modeling. Candidates who jump into drawing tables without stating "one row represents X" are already drifting. Undefined grain causes silent metric inflation, duplicate rows, and join explosions that don't throw errors; they just produce wrong numbers. Interviewers test this first because everything downstream depends on it. The follow-up is always: "What happens when a trip has multiple stops?" If your grain assumed single-destination trips, you just broke your own schema.
2. Star Schema vs. Snowflake Schema
The question: Your team is building a new warehouse on Snowflake (the product). A junior engineer proposes snowflake schema (the design pattern) to save storage. Do you agree? Why or why not?
You don't agree. Star schema is the default for modern columnar warehouses. Snowflake, BigQuery, and Redshift compress denormalized dimensions so efficiently that snowflaking (normalizing dimensions into sub-tables) rarely saves meaningful storage anymore. The engineering overhead of maintaining normalized dimension hierarchies exceeds the storage cost of duplication. Star is the safe opening position in any interview.
Why it matters: Picking snowflake first signals junior thinking. The economics killed the normalization argument around 2024. Interviewers aren't testing whether you know both patterns exist; they're testing whether you can reason about the tradeoff. The follow-up: "When would you normalize a dimension?" The answer is when the dimension is enormous and changes frequently (millions of rows, daily updates), making the redundant writes expensive. That's rare.
3. Design a Fact Table for E-Commerce Orders
The question: Design a star schema for an e-commerce platform. The business needs to track orders at the line-item level for revenue analysis.
CREATE TABLE fact_order_line_item (
order_line_item_sk BIGINT PRIMARY KEY,
order_id BIGINT, , degenerate dimension
product_sk BIGINT REFERENCES dim_product,
customer_sk BIGINT REFERENCES dim_customer,
date_sk INT REFERENCES dim_date,
quantity INT,
unit_price DECIMAL(10,2),
discount_amount DECIMAL(10,2),
line_total DECIMAL(10,2)
);
Grain: one row per line item per order. order_id is a degenerate dimension; it lives in the fact table because it has no descriptive attributes worth a separate table.
Why it matters: This tests three things at once. Can you declare grain (line item, not order)? Do you know what a degenerate dimension is? And do you put the right measures in the fact table? Candidates who model at the order grain lose the ability to analyze product-level revenue without restructuring. You can always aggregate up from line items to orders; you can never disaggregate back down.
4. SCD Type 2 Implementation
The question: A customer changes their address. How do you model dim_customer to preserve the old address for historical reporting?
SCD Type 2: insert a new row with a new surrogate key, set effective_date and expiration_date on both rows, and flag the current row with is_current = TRUE. The original row stays intact; historical fact rows still join to the old address via the old surrogate key.
, Before change
, customer_sk=101, name='Jane', city='Austin', effective='2024-01-01', expiration='9999-12-31', is_current=TRUE
, After change: close old row, insert new
UPDATE dim_customer SET expiration_date = '2026-06-16', is_current = FALSE WHERE customer_sk = 101;
INSERT INTO dim_customer VALUES (102, 'Jane', 'Denver', '2026-06-17', '9999-12-31', TRUE);
Why it matters: SCD2 is a separator question. Juniors describe it from the textbook. Seniors bring up the trap: SCD2 row explosion. A dimension with 10M rows tracking frequently changing attributes can balloon to 150M rows in five years. The follow-up is always: "When would you use Type 1 instead?" Answer: when the business doesn't need the history. A corrected typo in a customer name doesn't warrant a new historical row. Type 1 overwrites are often correct, despite Type 2's prestige.
5. Late-Arriving Dimensions
The question: An order fact arrives, but the customer who placed it hasn't been loaded into dim_customer yet. What do you do?
Insert a placeholder row in dim_customer with a surrogate key and all descriptive columns set to "Unknown" or null. The fact row joins to this placeholder. When the real customer data arrives, you overwrite the placeholder via Type 1 update.
Why it matters: Late-arriving dimensions and late-arriving facts are entirely different problems, and mixing them up is an instant red flag. This tests whether you understand that the fact table can't wait; it needs a foreign key now. The alternative (dropping the fact row until the dimension arrives) loses data. The follow-up: "What if the dimension arrives with changes?" Then you might need to apply SCD2 logic to the placeholder row, which gets complex fast.
6. Bridge Tables for Many-to-Many Relationships
The question: A hospital system tracks patient diagnoses. One hospitalization can have multiple diagnoses, and one diagnosis applies to many hospitalizations. How do you model this?
You use a bridge table. Create a diagnosis_group_key that maps to a set of diagnoses in bridge_diagnosis. The fact table (fact_hospitalization) joins to diagnosis_group_key; the bridge table resolves each group to individual dim_diagnosis rows.
Why it matters: Many-to-many relationships in dimensional models are the source of the most dangerous bug in analytics: double-counting. Without a bridge table, a naive join between fact and dimension multiplies rows. Interviewers use this to test whether you understand the cardinality trap. The follow-up: "How do you handle weighting?" If a hospitalization has three diagnoses, does each get 1/3 of the revenue allocation? Bridge tables can carry a weight_factor column for exactly this.
7. Fact vs. Dimension Classification
The question: You have a column customer_lifetime_revenue. Is it a fact or a dimension attribute?
It's both, depending on usage. If you're summing it across rows, it's a fact. If you're banding it into ranges ("$0-$1K", "$1K-$10K") to filter or group by, it's a dimension attribute. Kimball calls this the aggregated-fact-as-attribute pattern.
If you would aggregate the column, it's a fact. If you would filter or group by it, it's a dimension. That's the whole test.
Why it matters: This exposes whether a candidate understands that the fact/dimension boundary isn't about data types. Numeric columns don't automatically belong in fact tables. The follow-up: "Where do you physically store it?" Usually in the dimension, banded into a descriptive range, with the raw number available as an additive fact if needed.
8. Factless Fact Tables
The question: The business wants to know which products were NOT sold in each store last month. How do you model this?
A factless fact table (coverage table). One row per store per product per month, representing eligibility. To find products not sold, you subtract the sales fact table from the coverage table.
Why it matters: Most candidates have never heard of factless fact tables. The name sounds like a contradiction. But they solve a real problem: you can't report on the absence of an event without first modeling what could have happened. Student attendance, product availability, promotional eligibility; these all use the same pattern. The follow-up: "Isn't this just a cross join?" Yes, and that's the point. The cross join defines the universe; the anti-join finds the gaps.
9. Accumulating Snapshot Fact Table
The question: Model an order fulfillment pipeline with stages: ordered, packed, shipped, delivered.
One row per order, with multiple date columns: order_date_sk, pack_date_sk, ship_date_sk, delivery_date_sk. The row gets updated as the order progresses through stages. Null dates indicate incomplete milestones.
Why it matters: This is the "advanced grain" question. Most candidates know transaction facts and periodic snapshots; accumulating snapshots trip them up because the row mutates. The fact table updates in place, which feels wrong if you've been taught that fact tables are append-only. Insurance claims, hiring workflows, procurement cycles; all use this pattern. The follow-up: "How do you handle an order that skips a stage?" That's a null in the milestone column, and your reporting logic needs to handle it.
10. Conformed Dimensions
The question: Sales and marketing each have their own dim_customer table with different definitions. What's the risk, and how do you fix it?
The risk is the CEO gets two different customer counts. Conformed dimensions are shared across fact tables and business units, with identical keys, attributes, and definitions. You build one dim_customer, owned by a central data team, and both domains join to it.
Why it matters: This tests organizational thinking, not just schema design. Split-brain dimensions are how companies end up with "which number is right?" meetings. The follow-up: "What if the two teams need different attributes?" Add them to the same dimension. A wide dimension with 50 columns that both teams trust is better than two narrow dimensions that contradict each other.
11. Normalization vs. Denormalization for Analytics
The question: When would you choose a normalized (3NF) model over a denormalized star schema in an analytics warehouse?
Almost never for the presentation layer. Denormalized schemas achieve 20 to 100x faster query performance on complex analytics workloads by eliminating joins. BigQuery benchmarks show 49% average improvement with fully denormalized tables compared to star schemas. But the staging layer should stay normalized. 3NF in staging preserves flexibility; when requirements change, you can rematerialize the presentation layer without remodeling the entire pipeline.
Why it matters: The real answer is "both, in different layers." Organizations run 3NF in source systems, normalize in staging for integrity, and denormalize in the presentation layer for speed. Candidates who pick one paradigm for the entire warehouse reveal they've never dealt with a schema migration. The follow-up: "What about high-cardinality many-to-many relationships?" Don't denormalize those. A customer/orders/products grain creates explosive row multiplication.
12. Late-Arriving Facts and Backfills
The question: Your daily pipeline processes orders by processing_date. An order from 10 days ago arrives today. How does your pipeline handle it?
Partition by event_time (when the order was placed), not processing_time (when it arrived). Keep a rolling recompute window open; reprocess the last 14 days on every run. This auto-reconciles normal late arrivals without manual intervention. For data outside the window, run an explicit backfill job.
Why it matters: Late data isn't a failure mode; it's the normal case. Most production systems expect 10 to 20% of daily volume to arrive delayed. Candidates who say "drop anything older than 7 days" have never worked on a pipeline that finance depends on. The follow-up: "What if the late fact needs to join to a dimension that has since changed (SCD2)?" You join to the dimension version that was active at event time, not processing time. That's the whole point of surrogate keys and effective dates.
, -
Data modeling questions keep showing up because they're the one thing AI can't fake for you. An LLM will produce a schema. It won't explain why that grain breaks when requirements shift, or defend the denormalization when the interviewer pushes back. If you want structured reps on these exact patterns, i used DataDriven for data modeling interview questions and it was the most efficient prep I found for this category.
Which data modeling interview question would you add to this list? I'm curious what y'all are seeing in loops right now.
For further actions, you may consider blocking this person and/or reporting abuse
