URL: https://www.geeksforgeeks.org/python/pyspark-groupby/

PySpark Groupby

Last Updated : 19 Dec, 2021

In this article, we discuss the groupBy() function in PySpark using Python.

Let's create the dataframe for demonstration:


In PySpark, groupBy() collects rows that share the same value in the grouping column(s) into groups on a PySpark DataFrame, so that aggregate functions can be applied to each group.

The aggregation operations include:

  • count(): This will return the count of rows for each group.

dataframe.groupBy('column_name_group').count()

  • mean(): This will return the mean of values for each group. It is an alias of avg().

dataframe.groupBy('column_name_group').mean('column_name')

  • max(): This will return the maximum of values for each group.

dataframe.groupBy('column_name_group').max('column_name')

  • min(): This will return the minimum of values for each group.

dataframe.groupBy('column_name_group').min('column_name')

  • sum(): This will return the sum of values for each group.

dataframe.groupBy('column_name_group').sum('column_name')

  • avg(): This will return the average of values for each group.

dataframe.groupBy('column_name_group').avg('column_name')

On its own, groupBy() returns a GroupedData object rather than a DataFrame, so one of the aggregate functions above has to be applied to it to get a result.

Syntax: dataframe.groupBy('column_name_group').aggregate_operation('column_name')

Example 1: Groupby with sum()

Group by DEPT and compute the sum of FEE for each group.


Example 2: Groupby with min()


Example 3: Groupby with max()


Example 4: Groupby with avg()


Example 5: Groupby with count()


Example 6: Groupby with mean()


Applying groupby() on multiple columns

groupBy() also accepts several column names. In that case, each distinct combination of values in those columns forms one group.

Syntax: dataframe.groupBy('column_name_group1', 'column_name_group2', ..., 'column_name_group_n').aggregate_operation('column_name')

Example 1: Groupby with mean() on DEPT and NAME


We can also apply several aggregate functions at once after groupBy() by passing them to agg():

dataframe.groupBy("group_column").agg(max("column_name"), sum("column_name"), min("column_name"), mean("column_name"), count("column_name")).show()

These aggregate functions must be imported from the pyspark.sql.functions module.

Example:
