PySpark - Aggregation on multiple columns

Last Updated : 19 Dec, 2021

In this article, we will discuss how to perform aggregation on multiple columns in PySpark using Python. We can do this with the groupBy() function.

Let's create a dataframe for demonstration:


In PySpark, groupBy() collects rows that share the same values in the given columns into groups, so that aggregate functions can be applied to each group.

The aggregate operations include:

count(): This will return the count of rows for each group.

dataframe.groupBy('column_name_group').count()

mean(): This will return the mean of values for each group.

dataframe.groupBy('column_name_group').mean('column_name')

max(): This will return the maximum of values for each group.

dataframe.groupBy('column_name_group').max('column_name')

min(): This will return the minimum of values for each group.

dataframe.groupBy('column_name_group').min('column_name')

sum(): This will return the sum of values for each group.

dataframe.groupBy('column_name_group').sum('column_name')

avg(): This will return the average of values for each group. It is equivalent to mean().

dataframe.groupBy('column_name_group').avg('column_name')

We can groupBy and aggregate on multiple columns at a time by using the following syntax:

dataframe.groupBy('column_name_group1', 'column_name_group2', ..., 'column_name_groupN').aggregate_operation('column_name')

Example 1: groupBy() on DEPT and NAME with the mean() function


Example 2: Aggregation on all columns



URL: https://www.geeksforgeeks.org/python/pyspark-aggregation-on-multiple-columns/