URL: https://www.geeksforgeeks.org/python/pyspark-aggregation-on-multiple-columns/

PySpark - Aggregation on multiple columns

Last Updated : 19 Dec, 2021

In this article, we will discuss how to perform aggregation on multiple columns in PySpark using Python. We can do this by using the groupBy() function.

Let's create a dataframe for demonstration:
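A minimal sketch of such a DataFrame follows. Only the DEPT and NAME column names appear in this article; the ID and FEE columns, the row values, and the create_demo_dataframe() helper are assumptions made for illustration.

```python
# Hypothetical demonstration data: only the DEPT and NAME column names
# come from the article; ID, FEE and every value are assumed.
COLUMNS = ["ID", "NAME", "DEPT", "FEE"]
ROWS = [
    (1, "sravan", "IT", 45000),
    (2, "ojaswi", "CS", 85000),
    (3, "rohith", "CS", 41000),
    (4, "sridevi", "IT", 56000),
    (5, "sravan", "IT", 55000),   # repeats the (DEPT, NAME) pair of row 1
    (6, "ojaswi", "ECE", 49000),
]


def create_demo_dataframe():
    """Build the demonstration DataFrame on a SparkSession."""
    # Imported lazily so the sample data loads even without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("aggregation_demo").getOrCreate()
    # A list of column names as the schema lets Spark infer the types.
    return spark.createDataFrame(ROWS, COLUMNS)
```

Calling create_demo_dataframe().show() prints the rows as a table; this needs pyspark and a working Spark installation.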


In PySpark, groupBy() collects rows that share the same values in the grouping columns into groups, so that aggregate functions can be applied to each group.

The aggregation operations include:

  • count(): This will return the count of rows for each group.

dataframe.groupBy('column_name_group').count()

  • mean(): This will return the mean of values for each group.

dataframe.groupBy('column_name_group').mean('column_name')

  • max(): This will return the maximum of values for each group.

dataframe.groupBy('column_name_group').max('column_name')

  • min(): This will return the minimum of values for each group.

dataframe.groupBy('column_name_group').min('column_name')

  • sum(): This will return the sum of values for each group.

dataframe.groupBy('column_name_group').sum('column_name')

  • avg(): This will return the average of values for each group; it is equivalent to mean().

dataframe.groupBy('column_name_group').avg('column_name')
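The operations listed above can be tried together on one grouped DataFrame. The sketch below assumes hypothetical sample rows with ID, NAME, DEPT and FEE columns (only DEPT and NAME are named in this article) and a hypothetical helper show_grouped_aggregates().

```python
# Hypothetical sample rows: ID, FEE and all values are assumed for illustration.
COLUMNS = ["ID", "NAME", "DEPT", "FEE"]
ROWS = [
    (1, "sravan", "IT", 45000),
    (2, "ojaswi", "CS", 85000),
    (3, "rohith", "CS", 41000),
    (4, "sridevi", "IT", 56000),
    (5, "sravan", "IT", 55000),
    (6, "ojaswi", "ECE", 49000),
]


def show_grouped_aggregates():
    """Print each built-in aggregate of FEE for every DEPT group."""
    # Imported lazily so the sample data loads even without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("groupby_demo").getOrCreate()
    dataframe = spark.createDataFrame(ROWS, COLUMNS)
    grouped = dataframe.groupBy("DEPT")  # one GroupedData object, reused below

    grouped.count().show()      # number of rows per department
    grouped.mean("FEE").show()  # mean FEE per department
    grouped.max("FEE").show()   # largest FEE per department
    grouped.min("FEE").show()   # smallest FEE per department
    grouped.sum("FEE").show()   # total FEE per department
    grouped.avg("FEE").show()   # same values as mean("FEE")
```

Each call returns a new DataFrame with one row per group, so the results can also be stored or joined instead of being printed with show().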

We can groupBy and aggregate on multiple columns at a time by using the following syntax:

dataframe.groupBy('column_name_group1', 'column_name_group2', ..., 'column_name_group_n').aggregate_operation('column_name')

Example 1: groupBy() on DEPT and NAME with the mean() function
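A minimal sketch of this example, assuming hypothetical sample rows and a numeric FEE column to average (the article names only DEPT and NAME); the helper mean_fee_by_dept_and_name() is also an assumption.

```python
# Hypothetical sample rows: ID, FEE and all values are assumed for illustration.
COLUMNS = ["ID", "NAME", "DEPT", "FEE"]
ROWS = [
    (1, "sravan", "IT", 45000),
    (2, "ojaswi", "CS", 85000),
    (3, "rohith", "CS", 41000),
    (4, "sridevi", "IT", 56000),
    (5, "sravan", "IT", 55000),   # same DEPT and NAME as row 1
    (6, "ojaswi", "ECE", 49000),
]


def mean_fee_by_dept_and_name():
    """Return the mean FEE for every distinct (DEPT, NAME) pair."""
    # Imported lazily so the sample data loads even without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("groupby_mean_demo").getOrCreate()
    dataframe = spark.createDataFrame(ROWS, COLUMNS)
    # Rows are grouped only when both DEPT and NAME match.
    result = dataframe.groupBy("DEPT", "NAME").mean("FEE")
    result.show()
    return result
```

With these rows, only the pair (IT, sravan) occurs twice, so that group averages two FEE values while every other group holds a single row.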


Example 2: Aggregation on all columns
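One possible reading of this example, taken here as an assumption, is to compute several aggregates in a single pass for each (DEPT, NAME) group using agg() with functions from pyspark.sql.functions. The sample rows, the FEE column, the output column aliases and the helper aggregate_fee_by_dept_and_name() are hypothetical.

```python
# Hypothetical sample rows: ID, FEE and all values are assumed for illustration.
COLUMNS = ["ID", "NAME", "DEPT", "FEE"]
ROWS = [
    (1, "sravan", "IT", 45000),
    (2, "ojaswi", "CS", 85000),
    (3, "rohith", "CS", 41000),
    (4, "sridevi", "IT", 56000),
    (5, "sravan", "IT", 55000),
    (6, "ojaswi", "ECE", 49000),
]


def aggregate_fee_by_dept_and_name():
    """Return count, mean, max, min and sum of FEE per (DEPT, NAME) group."""
    # Imported lazily so the sample data loads even without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("groupby_agg_demo").getOrCreate()
    dataframe = spark.createDataFrame(ROWS, COLUMNS)
    # agg() accepts several aggregate expressions; alias() names each result.
    result = dataframe.groupBy("DEPT", "NAME").agg(
        F.count("FEE").alias("count_fee"),
        F.mean("FEE").alias("mean_fee"),
        F.max("FEE").alias("max_fee"),
        F.min("FEE").alias("min_fee"),
        F.sum("FEE").alias("sum_fee"),
    )
    result.show()
    return result
```

Using agg() keeps all aggregates in one result DataFrame, whereas calling mean(), max() and the others separately produces one DataFrame per operation.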
