VOOZH about

URL: https://towardsdatascience.com/sql-for-data-analysis-subquery-vs-cte-699ef629d9eb/

⇱ Crack SQL Interview Question: Subquery vs. CTE | Towards Data Science


Crack SQL Interview Question: Subquery vs. CTE

Difference between CTE and Subquery

8 min read
👁 Photo by Mike Benna on Unsplash
Photo by Mike Benna on Unsplash

Background

SQL is an essential skill for data science professionals. Many companies store their data in a relational database system, such as MySQL, PostgreSQL, MS SQL Server, SQLite. SQL is the programming language we need to interact with them. We can write a SQL query to select, filter, transform, insert, update, and delete the underlying data in the database.

Not only SQL can be used to query a database, it can also be used for data analysis. Both Subquery and CTE (Common Table Expression) are useful tools we can use to write a complex SQL query to implement data analysis, in a similar way as other data science tools, such as Pandas in Python and dplyr in R.

In this article, I will explain the similarities and differences between Subquery and CTE.

The similarity Between Subquery and CTE

I will use a common SQL interview problem to demonstrate the similarity between Subquery and CTE.

SQL Problem:

Distance Per Dollar
You're given a dataset of Uber rides with the traveling distance ("distance_to_travel") and cost ("monetary_cost") for each ride. For each date, find the difference between the distance-per-dollar for that date and the average distance-per-dollar for that year-month. Distance-per-dollar is defined as the distance traveled divided by the cost of the ride. 
The output should include the year-month (YYYY-MM) and the average difference in distance-per-dollar for said year-month as an absolute value rounded to the 2nd decimal. You should also count both success and failed request_status as the distance and cost values are populated for all ride requests. Also, assume that all dates are unique in the dataset. Order your results by the earliest request date first.
Source: stratascratch.com
I would recommend clicking the link to practice writing queries for this exercise.
👁 Source: stratascratch.com
Source: stratascratch.com

The following are steps we need to take to solve this problem:

  • Step 1: On a given day, compute distance-per-cost and create year-month based on that day
  • Step 2: Use a window function to compute the average distance-per-cost at the year-month level
  • Step 3: Use an aggregation function to compute the average of the absolute difference between daily distance-per-cost and monthly distance-per-cost at the year-month level
  • Step 4: Order output by earliest year-month

Use Subquery

A subquery, as its name suggests, is a query inside a query. To use a subquery, we simply add parentheses and put the query inside them. The output that is created by a subquery would act as a temporary table. After we execute the whole statement, the temporary subquery would be gone automatically.

For this SQL problem, if we’re using Subquery, we will end up having multiple nested subqueries.

  • Step 1: We create an inner query (a) to compute daily distance-per-cost and create a new column, year-month using TO_CHAR .
SELECT *,
 TO_CHAR(request_date::DATE, 'YYYY-MM') AS month
 (distance_to_travel/monetary_cost) AS daily_dis_to_cost
FROM uber_request_logs
  • Step 2: We create another nested subquery (b) to compute monthly average distance-per-cost using a window function AVG(a.daily_dis_to_cost) OVER (PARTITION BY a.month) . With a window function, the output will have the same number of rows as the original data set.
SELECT 
 a.request_date,
 a.month,
 a.daily_dis_to_cost,
 AVG(a.daily_dis_to_cost) OVER(PARTITION BY a.month) AS monthly_dist_to_cost
 FROM 
 (SELECT *,
 TO_CHAR(request_date::DATE, 'YYYY-MM') AS month,
 (distance_to_travel/monetary_cost) AS daily_dis_to_cost
 FROM uber_request_logs) a
 ORDER BY request_date

Intermediate Output:

👁 (Created by Author)
(Created by Author)
  • Steps 3&4, we run the outer query to implement an aggregation function AVG(ABS(b.daily_dis_to_cost-b.monthly_dis_to_cost)) and GROUP BY b.month to compute the average of the absolute difference between daily distance-per-cost and monthly distance-per-cost at the year-month level. Lastly, we order the output by month.
SELECT 
 b.month,
 ROUND(AVG(ABS(b.daily_dis_to_cost-b.monthly_dist_to_cost))::DECIMAL, 2) AS avg_diff
FROM
 (SELECT 
 a.request_date,
 a.month,
 a.daily_dis_to_cost,
 AVG(a.daily_dis_to_cost) OVER(PARTITION BY a.month) AS monthly_dist_to_cost
 FROM 
 (SELECT *,
 TO_CHAR(request_date::DATE, 'YYYY-MM') AS month,
 (distance_to_travel/monetary_cost) AS daily_dis_to_cost
 FROM uber_request_logs) a
 ORDER BY request_date) b
GROUP BY b.month
ORDER BY b.month

Desired Output:

👁 (Created by Author)
(Created by Author)

Use CTE

A CTE (aka common table expression) is the result set that we create using WITH clause before writing the main query. We can simply use its output as a temporary table, just like a subquery. Similar to subqueries, we can also create multiple CTEs.

To solve the above SQL problem using CTE, we would create multiple CTEs (cte1 and cte2), which are equivalent to subqueries (a and b) mentioned above.

WITH cte1 AS(
 SELECT *,
 TO_CHAR(request_date::DATE, 'YYYY-MM') AS month, 
 (distance_to_travel/monetary_cost) AS daily_dis_to_cost
 FROM uber_request_logs
),
cte2 AS(
 SELECT 
 request_date,
 month,
 daily_dis_to_cost,
 AVG(daily_dis_to_cost) OVER(PARTITION BY month) AS monthly_dis_to_cost
 FROM cte1
 ORDER BY request_date
)
SELECT
 month,
 ROUND(AVG(ABS(daily_dis_to_cost-monthly_dis_to_cost))::DECIMAL, 2) AS avg_diff
FROM cte2
GROUP BY month
ORDER BY month

The results produced by subquery and CTE are exactly the same. At first, we might think there are not many differences between these two tools, except the syntax. In the next section, we will discuss the differences between the two methods.

The difference between Subquery and CTE

Advantages of Using CTE

CTE can be reusable: One advantage of using CTE is CTE is reusable by design. Instead of having to declare the same subquery in every place you need to use it, you can use CTE to define a temporary table once, then refer to it whenever you need it.

CTE can be more readable: Another advantage of CTE is CTE is more readable than Subqueries. Since CTE can be reusable, you can write less code using CTE than using a subquery. Also, people tend to follow logic and ideas easier in sequence than in a nested fashion. When you write a query, it is easier to break down a complex query into smaller pieces using CTE.

CTEs can be recursive: A CTE can run recursively, which a subquery cannot. This makes it especially well suited to tree structures, in which information in a given row is based on the information from the previous row(s). The recursion feature can be implemented with RECURSIVE and UNION ALL.

WITH RECURSIVE [cte_name] (column, ...) AS (
 [non-recursive_term]
UNION ALL
 [recursive_term])
SELECT ... FROM [cte_name];

Let’s create the Fibonacci sequence (Each number in the sequence is the sum of the two numbers that precede it.) using the recursion feature of CTE.

WITH RECURSIVE fib(f1, f2) AS ( 
 SELECT 0, 1 
 UNION ALL
 SELECT f2, (f1+f2) FROM fib ) 
SELECT f1 FROM fib LIMIT 10;

Output:

👁 Image

We’ve discussed a few differences between CTE and subquery so far. It looks like CTE has more advantages than a subquery. But in the next section, we’re going to talk about features of a subquery, that can’t be achieved using CTE.

Advantages of Using Subquery

A subquery can be used in the WHERE clause: We can use a subquery to return a value, then use it in the WHERE clause. In the following example, we would like to return employees who have a salary above the average salary. It would be easy to implement with a subquery, which computes the average salary.

SELECT
 employee_name, salary 
FROM sample
WHERE
 salary > (SELECT AVG(salary) FROM sample)

A subquery can act as a column with a single value: You can also use a subquery as a new column. The only constraint is that the subquery must return only one value. In the following example, we would like to add a new column, that contains the average salary. We can use a subquery to compute the average salary, then include it in the SELECT statement.

SELECT
 employee_name,
 salary,
 (SELECT AVG(salary) FROM sample) AS average_salary
FROM sample

A subquery can be used with CORRELATED SUBQUERY: Unlike CTE, we can use an inner subquery as a correlated subquery. That means for each record processed by an outer query, an inner query will be executed.

In the following example, we would like to return the employee with the second-highest salary. Here is what we’re going to do to solve this problem using a correlated subquery. For each employee (in the outer query (a)), we compute the number of employees (in the inner query (b)), who have higher salaries than a given employee. If there is only one other employee who has a higher salary than this given employee, we keep this employee.

SELECT 
 employee_name, salary
FROM sample a
WHERE 1 = (SELECT COUNT(DISTINCT(salary)) FROM sample b WHERE a.salary < b.salary)

But keep in mind that because the inner subquery would be evaluated every time when each row is processed by the outer query, it could be slow.

If you would like to explore more SQL Interview Questions, please check out my articles:

Thank you for reading !!!

If you enjoy this article and would like to Buy Me a Coffee, please click here.

You can sign up for a membership to unlock full access to my articles, and have unlimited access to everything on Medium. Please subscribe if you’d like to get an email notification whenever I post a new article.


Written By

Aaron Zhu

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles