VOOZH about

URL: https://towardsdatascience.com/crack-sql-interview-question-join-vs-case-when-statement-116d40a361f0/

⇱ Crack SQL Interview Question: Join vs Case-When Statement | Towards Data Science


Crack SQL Interview Question: Join vs Case-When Statement

Solving SQL questions with useful procedures

6 min read
👁 Photo by LinkedIn Sales Solutions on Unsplash
Photo by LinkedIn Sales Solutions on Unsplash

In this article, we will go over a SQL question from a Facebook data science interview. Hope the procedure explained in this article would help you become more effective in writing SQL queries.

SQL Question:

Users By Average Session Time
Calculate each user's average session time. A session is defined as the time difference between a page_load and page_exit. For simplicity, assume a user has only 1 session per day and if there are multiple of the same events on that day, consider only the latest page_load and earliest page_exit. Output the user_id and their average session time.
Source: stratascratch.com
I would recommend clicking the link to practice writing queries for this exercise.

Table: facebook_web_log

👁 Image by Author
Image by Author

Step 1: let’s check out the raw data first. This table includes a sequence of actions, such as, "page_load", "scroll_down", "scroll up", "page_exit" with timestamp for a given user_id. We’re assigned to compute the average session time for each user. Let’s clarify a few things. For example,

  • How is a session defined? Based on the question, a session is defined as the time difference between a "page_load" and "page_exit". If there are multiples of same events, such as, "page_load" and "page_exit" in a given day, then consider using only the latest page_load and earliest page_exit.
  • If there is only one of the two events, "page_load" and "page_exit" in a given day for a user. For example, on 4/25/2019, there is only "page_load" event for user_id 2, but no "page_exit" is found. How should we handle this kind of situation?
  • Also in a given day, the latest page_load might be later than the earliest page_exit for a user. If that’s the case, how should we handle it?
  • These are the kinds of questions we need to get clarification from the interviewer. For this exercise, we would exclude the records if there is only one of the two events or the latest page_load being later than the earliest page_exit.

Step 2: We need to identify the relevant information we would need to solve the problem. For example,

  • We only need to use the rows with values of "page_load" and "page_exit" in the field of "action" because that’s how session time is defined.
  • We would need to extract the date variable from the timestamp variable because that’s where the session time comes from.

Step 3: We will prepare the data and create the variables we’ve identified in step 2. In this step, I will use two different methods.

Method 1: JOIN two temporary tables, Load and Exit

  • To create Load table including only "page_load" actions, we can use WHERE action = 'page_load' . If there are multiple "page_load" actions in a given date for a user, we would only keep the latest one. Therefore, we use timestamp::DATE to create the date variable and use aggregate function MAX(timestamp) with GROUP BY user_id, timestamp::DATE to keep the latest "page_load" actions in a given date.
  • To create Exit table including only "page_exit" actions, we use WHERE action = 'page_exit' . Then we use aggregate function MIN(timestamp) with GROUP BY user_id, timestamp::DATE to keep the earliest "page_exit" actions in a given date.
  • Once Load and Exit tables are created using WITH statement, we can join the two tables together based on "user_id" and "date". We will use INNER JOIN here, because we need to have both "page_load" and "page_event" in a given day for a user.
WITH load AS(
SELECT user_id,
timestamp::DATE AS date,
MAX(timestamp) AS lastest_load
FROM facebook_web_log
WHERE action = 'page_load'
GROUP BY user_id, timestamp::DATE
),
-- Create Exit table
exit AS(
SELECT user_id,
timestamp::DATE AS date,
MIN(timestamp) AS earliest_exit
FROM facebook_web_log
WHERE action = 'page_exit'
GROUP BY user_id, timestamp::DATE
)
SELECT a.user_id,
a.date,
a.lastest_load,
b.earliest_exit
FROM LOAD a
INNER JOIN exit b
ON a.user_id = b.user_id AND
a.date = b.date

When we run the above code, we can produce a table like the following:

👁 Image by Author
Image by Author

Method 2: Create two new columns, Load and Exit using CASE-WHEN statement

  • Instead of using WHERE statement to keep "page_load" and "page_exit" actions, we can use CASE-WHEN statement. The code, CASE WHEN action = 'page_load' THEN timestamp ELSE NULL END can create a new column which only includes the timestamps for "page_load" actions and set it to NULL for other actions. We will write similar code for "page_exit" actions.
  • Once we have the two new columns, we can use aggregate functions MAX() and MIN() to compute the latest timestamp for "page_load" action and the earliest timestamp for "page_exit" action for a given day for a user with GROUP BY user_id, timestamp::DATE .
SELECT user_id,
timestamp::DATE AS date,
MAX(CASE WHEN action = 'page_load' THEN timestamp ELSE NULL END) AS latest_page_load,
MIN(CASE WHEN action = 'page_exit' THEN timestamp ELSE NULL END) AS earliest_page_exit
FROM facebook_web_log
GROUP BY user_id, timestamp::DATE

When we run the above code, we can produce a table like the following. You will notice a difference from this output – we have one extra record for user_id 2 on 4/25/2019 because we used INNER JOIN to keep only the matched records in Method 1 whereas we only keep all the available records for "page_load" and "page_exit" here.

👁 Image by Author
Image by Author

Step 4: Once the data is prepared in step 3, computing the average session time per user_id should be straightforward. We just need to use the aggregate function AVG() with GROUP BY user_id .

Final solution using Method 1: JOIN two temporary tables, Load and Exit

I include the code, WHERE b.earliest_exit ≥ a.lastest_load in the end because we would like to avoid the case, in which the latest page_load is later than the earliest page_exit.

WITH load AS(
SELECT user_id,
timestamp::DATE AS date,
MAX(timestamp) AS lastest_load
FROM facebook_web_log
WHERE action = 'page_load'
GROUP BY user_id, timestamp::DATE
),
-- Create Exit table
exit AS(
SELECT user_id,
timestamp::DATE AS date,
MIN(timestamp) AS earliest_exit
FROM facebook_web_log
WHERE action = 'page_exit'
GROUP BY user_id, timestamp::DATE
)
SELECT a.user_id,
AVG(b.earliest_exit - a.lastest_load) AS avg_session_time
FROM LOAD a
INNER JOIN exit b
ON a.user_id = b.user_id AND
a.date = b.date
WHERE b.earliest_exit >= a.lastest_load
GROUP BY a.user_id

Final solution using Method 2: Create two new columns, Load and Exit using CASE-WHEN statement

For this method, I include for code, WHERE (earliest_page_exit-latest_page_load) IS NOT NULL because we would like to remove the extra record for user_id 2 on 4/25/2019.

WITH CTE AS(
SELECT user_id,
timestamp::DATE AS date,
MAX(CASE WHEN action = 'page_load' THEN timestamp ELSE NULL END) AS latest_page_load,
MIN(CASE WHEN action = 'page_exit' THEN timestamp ELSE NULL END) AS earliest_page_exit
FROM facebook_web_log
GROUP BY user_id, timestamp::DATE
)
SELECT user_id,
AVG(earliest_page_exit-latest_page_load) AS avg_session_time
FROM CTE
WHERE (earliest_page_exit-latest_page_load) IS NOT NULL AND
earliest_page_exit >= latest_page_load
GROUP BY user_id

Answer:

👁 Image

If you would like to explore more SQL Interview Questions, please check out my articles:

Thank you for reading !!!

If you enjoy this article and would like to Buy Me a Coffee, please click here.

You can sign up for a membership to unlock full access to my articles, and have unlimited access to everything on Medium. Please subscribe if you’d like to get an email notification whenever I post a new article.


Written By

Aaron Zhu

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles