PySpark Groupby

Last Updated : 19 Dec, 2021

In this article, we discuss the groupBy() function in PySpark, the Python API for Apache Spark.

Let's create the dataframe for demonstration:

In PySpark, groupBy() collects the rows of a DataFrame that share the same value in the given column(s) into groups and returns a GroupedData object, on which aggregate functions can then be applied.

The aggregation operations include:

count(): This will return the count of rows for each group.

dataframe.groupBy('column_name_group').count()

mean(): This will return the mean of values for each group.

dataframe.groupBy('column_name_group').mean('column_name')

max(): This will return the maximum of values for each group.

dataframe.groupBy('column_name_group').max('column_name')

min(): This will return the minimum of values for each group.

dataframe.groupBy('column_name_group').min('column_name')

sum(): This will return the sum of values for each group.

dataframe.groupBy('column_name_group').sum('column_name')

avg(): This will return the average of values for each group. On grouped data, mean() is an alias for avg().

dataframe.groupBy('column_name_group').avg('column_name')

groupBy() on its own does not return a DataFrame, so it must be followed by one of these aggregate functions to produce a result.

Syntax: dataframe.groupBy('column_name_group').aggregate_operation('column_name')

Example 1: Groupby with sum()

Group by DEPT and compute the sum of FEE for each department.

Example 2: Groupby with min()

Example 3: Groupby with max()

Example 4: Groupby with avg()

Example 5: Groupby with count()

Example 6: Groupby with mean()

Applying groupby() on multiple columns

Here we pass more than one column to groupBy(), so that each distinct combination of values in those columns forms its own group.

Syntax: dataframe.groupBy('column_name_group1', 'column_name_group2', ..., 'column_name_group_n').aggregate_operation('column_name')

Example 1: Groupby on DEPT and NAME with mean()

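We can also compute several aggregates for each group in a single call by passing them to agg():

dataframe.groupBy("group_column").agg(max("column_name"), sum("column_name"), min("column_name"), mean("column_name"), count("column_name")).show()

These aggregate functions must be imported from the pyspark.sql.functions module. Importing sum, max, and min by name shadows Python's built-in functions of the same names, so the module is often imported under an alias instead.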

Example:


URL: https://www.geeksforgeeks.org/python/pyspark-groupby/
