Crack SQL Interview Question: Join vs Case-When Statement
Solving SQL questions with useful procedures
In this article, we will go over a SQL question from a Facebook data science interview. Hope the procedure explained in this article would help you become more effective in writing SQL queries.
SQL Question:
Users By Average Session Time
Calculate each user's average session time. A session is defined as the time difference between a page_load and page_exit. For simplicity, assume a user has only 1 session per day and if there are multiple of the same events on that day, consider only the latest page_load and earliest page_exit. Output the user_id and their average session time.
Source: stratascratch.com
I would recommend clicking the link to practice writing queries for this exercise.
Table: facebook_web_log
Step 1: let’s check out the raw data first. This table includes a sequence of actions, such as, "page_load", "scroll_down", "scroll up", "page_exit" with timestamp for a given user_id. We’re assigned to compute the average session time for each user. Let’s clarify a few things. For example,
- How is a session defined? Based on the question, a session is defined as the time difference between a "page_load" and "page_exit". If there are multiples of same events, such as, "page_load" and "page_exit" in a given day, then consider using only the latest page_load and earliest page_exit.
- If there is only one of the two events, "page_load" and "page_exit" in a given day for a user. For example, on 4/25/2019, there is only "page_load" event for user_id 2, but no "page_exit" is found. How should we handle this kind of situation?
- Also in a given day, the latest page_load might be later than the earliest page_exit for a user. If that’s the case, how should we handle it?
- These are the kinds of questions we need to get clarification from the interviewer. For this exercise, we would exclude the records if there is only one of the two events or the latest page_load being later than the earliest page_exit.
Step 2: We need to identify the relevant information we would need to solve the problem. For example,
- We only need to use the rows with values of "page_load" and "page_exit" in the field of "action" because that’s how session time is defined.
- We would need to extract the date variable from the timestamp variable because that’s where the session time comes from.
Step 3: We will prepare the data and create the variables we’ve identified in step 2. In this step, I will use two different methods.
Method 1: JOIN two temporary tables, Load and Exit
- To create Load table including only "page_load" actions, we can use
WHERE action = 'page_load'. If there are multiple "page_load" actions in a given date for a user, we would only keep the latest one. Therefore, we usetimestamp::DATEto create the date variable and use aggregate functionMAX(timestamp)withGROUP BY user_id, timestamp::DATEto keep the latest "page_load" actions in a given date. - To create Exit table including only "page_exit" actions, we use
WHERE action = 'page_exit'. Then we use aggregate functionMIN(timestamp)withGROUP BY user_id, timestamp::DATEto keep the earliest "page_exit" actions in a given date. - Once Load and Exit tables are created using
WITHstatement, we can join the two tables together based on "user_id" and "date". We will useINNER JOINhere, because we need to have both "page_load" and "page_event" in a given day for a user.
WITH load AS(
SELECT user_id,
timestamp::DATE AS date,
MAX(timestamp) AS lastest_load
FROM facebook_web_log
WHERE action = 'page_load'
GROUP BY user_id, timestamp::DATE
),
-- Create Exit table
exit AS(
SELECT user_id,
timestamp::DATE AS date,
MIN(timestamp) AS earliest_exit
FROM facebook_web_log
WHERE action = 'page_exit'
GROUP BY user_id, timestamp::DATE
)
SELECT a.user_id,
a.date,
a.lastest_load,
b.earliest_exit
FROM LOAD a
INNER JOIN exit b
ON a.user_id = b.user_id AND
a.date = b.date
When we run the above code, we can produce a table like the following:
Method 2: Create two new columns, Load and Exit using CASE-WHEN statement
- Instead of using WHERE statement to keep "page_load" and "page_exit" actions, we can use
CASE-WHENstatement. The code,CASE WHEN action = 'page_load' THEN timestamp ELSE NULL ENDcan create a new column which only includes the timestamps for "page_load" actions and set it to NULL for other actions. We will write similar code for "page_exit" actions. - Once we have the two new columns, we can use aggregate functions
MAX()andMIN()to compute the latest timestamp for "page_load" action and the earliest timestamp for "page_exit" action for a given day for a user withGROUP BY user_id, timestamp::DATE.
SELECT user_id,
timestamp::DATE AS date,
MAX(CASE WHEN action = 'page_load' THEN timestamp ELSE NULL END) AS latest_page_load,
MIN(CASE WHEN action = 'page_exit' THEN timestamp ELSE NULL END) AS earliest_page_exit
FROM facebook_web_log
GROUP BY user_id, timestamp::DATE
When we run the above code, we can produce a table like the following. You will notice a difference from this output – we have one extra record for user_id 2 on 4/25/2019 because we used INNER JOIN to keep only the matched records in Method 1 whereas we only keep all the available records for "page_load" and "page_exit" here.
Step 4: Once the data is prepared in step 3, computing the average session time per user_id should be straightforward. We just need to use the aggregate function AVG() with GROUP BY user_id .
Final solution using Method 1: JOIN two temporary tables, Load and Exit
I include the code, WHERE b.earliest_exit ≥ a.lastest_load in the end because we would like to avoid the case, in which the latest page_load is later than the earliest page_exit.
WITH load AS(
SELECT user_id,
timestamp::DATE AS date,
MAX(timestamp) AS lastest_load
FROM facebook_web_log
WHERE action = 'page_load'
GROUP BY user_id, timestamp::DATE
),
-- Create Exit table
exit AS(
SELECT user_id,
timestamp::DATE AS date,
MIN(timestamp) AS earliest_exit
FROM facebook_web_log
WHERE action = 'page_exit'
GROUP BY user_id, timestamp::DATE
)
SELECT a.user_id,
AVG(b.earliest_exit - a.lastest_load) AS avg_session_time
FROM LOAD a
INNER JOIN exit b
ON a.user_id = b.user_id AND
a.date = b.date
WHERE b.earliest_exit >= a.lastest_load
GROUP BY a.user_id
Final solution using Method 2: Create two new columns, Load and Exit using CASE-WHEN statement
For this method, I include for code, WHERE (earliest_page_exit-latest_page_load) IS NOT NULL because we would like to remove the extra record for user_id 2 on 4/25/2019.
WITH CTE AS(
SELECT user_id,
timestamp::DATE AS date,
MAX(CASE WHEN action = 'page_load' THEN timestamp ELSE NULL END) AS latest_page_load,
MIN(CASE WHEN action = 'page_exit' THEN timestamp ELSE NULL END) AS earliest_page_exit
FROM facebook_web_log
GROUP BY user_id, timestamp::DATE
)
SELECT user_id,
AVG(earliest_page_exit-latest_page_load) AS avg_session_time
FROM CTE
WHERE (earliest_page_exit-latest_page_load) IS NOT NULL AND
earliest_page_exit >= latest_page_load
GROUP BY user_id
Answer:
If you would like to explore more SQL Interview Questions, please check out my articles:
- Comprehensive SQL Cheat Sheet
- Crack SQL Interview Question with Useful Procedure
- Crack SQL Interview Question: Subquery vs. CTE
- Crack SQL Interview Question: Join vs Case-When Statement
- Crack SQL Interview Question: Window Functions with Partition-By
- Crack SQL Interview Question: Date_Part Function
- Crack SQL Interview Questions: ROW_NUMBER, RANK and DENSE_RANK
- Crack SQL Interview Question: UNNEST, STRING_TO_ARRAY
- Crack SQL Interview Question: GENERATE_SERIES, STRING_AGG, SPLIT_PART
- Crack SQL Interview Question: Self-Join and Non-Equi Join
- Crack SQL Interview Question: ANY Operator
- Crack SQL Interview Question: Subquery
Thank you for reading !!!
If you enjoy this article and would like to Buy Me a Coffee, please click here.
You can sign up for a membership to unlock full access to my articles, and have unlimited access to everything on Medium. Please subscribe if you’d like to get an email notification whenever I post a new article.
Share This Article
Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
Write for TDS