Python PySpark sum() Function

Last Updated : 23 Jul, 2025

PySpark, the Python API for Apache Spark, is a powerful tool for big data processing and analytics. One of its essential functions is sum(), which is part of the pyspark.sql.functions module. This function allows us to compute the sum of a column's values in a DataFrame, enabling efficient data analysis on large datasets.

Overview of the PySpark sum() Function

The sum() function in PySpark is an aggregate function that calculates the sum of a numerical column across the rows of a DataFrame. It can be used to aggregate an entire DataFrame or, together with groupBy(), to compute one sum per group.

Syntax:

pyspark.sql.functions.sum(col)

Here, col is the column name or column expression for which we want to compute the sum.

To illustrate the use of sum(), let’s start with a simple example.

Setting Up PySpark

First, make sure PySpark is installed. If it is not, we can install it using pip:

pip install pyspark

Example 1: Basic Sum Calculation

Let’s create a simple DataFrame and compute the sum of a numerical column.

Output:

[Figure: Example of PySpark sum() function]

Explanation:

DataFrame Creation: We create a DataFrame with names and associated values.
Sum Calculation: We use the sum() function to calculate the total of the "Value" column and then collect the result.

Example 2: sum() with groupBy() in Sales Data Analysis

Let's consider a more realistic scenario: analyzing sales data for a retail store. Suppose we have a DataFrame with sales records, including the product name and the total sales amount.

Output:

[Figure: Real World Example of]

Explanation:

Group By: We use groupBy("Product") to group the sales records by product name.
Aggregation: The agg(sum("Sales").alias("Total_Sales")) computes the total sales for each product, renaming the result to "Total_Sales".

Example 3: Using sum() with Conditions

We can also compute sums conditionally, either by filtering rows with where() before aggregating or by using the when function inside the aggregation. For instance, we might want to calculate total sales only for records whose sales exceed a certain threshold.

Output:

[Figure: Using PySpark sum() with conditions]

Explanation:

Filtering: The where("Sales > 1000") call keeps only the records with sales over 1000.
Aggregation: sum() then computes the total of the "Sales" column over the filtered records.

Conclusion

The sum() function in PySpark is a fundamental tool for performing aggregations on large datasets. Whether you're calculating total values across a DataFrame or aggregating data based on groups, sum() provides a flexible and efficient way to handle numerical data.

In real-world applications, this function is used extensively in data analysis tasks such as sales reporting, financial analysis, and performance tracking. Because Spark distributes the computation across a cluster, PySpark's sum() function can aggregate very large datasets efficiently.


URL: https://www.geeksforgeeks.org/python/python-pyspark-sum-function/