PySpark GroupBy DataFrame with Aggregation or Count

Last Updated : 23 Jun, 2025

PySpark is the Python API for Apache Spark, used for processing large datasets in a distributed environment. One common operation when working with data is grouping it by one or more columns. In PySpark this is done with the groupBy() function, which returns a grouped object on which you can count rows or compute aggregates for each group.

In this article, we will explore how to use the groupBy() function in PySpark for counting occurrences and performing various aggregation operations.

Syntax of groupBy()

DataFrame.groupBy(*cols)

Parameters:

cols: The column(s) to group by. Each argument can be a column name (string) or a Column expression; a list of these is also accepted. Calling groupBy() with no columns treats the whole DataFrame as a single group.

Returns: A GroupedData object, on which aggregation methods such as count(), sum(), avg(), max(), min(), and agg() can be called. groupby() is an alias for groupBy().

Creating a PySpark DataFrame

Before performing the groupBy() operation, let's create a simple DataFrame containing some student data, including columns like ID, NAME, DEPT, and FEE.

Output:

[Image: snapshot of the DataFrame]

PySpark groupBy with Count

To count the number of rows in each group, call count() on the grouped data. The result is a new DataFrame with one row per distinct value of the grouping column and a count column holding the number of rows in that group.

Output:

[Image: snapshot of the output]

Explanation:

groupBy('DEPT'): Groups the data by the DEPT column.
count(): Counts the number of rows for each group (department).

PySpark groupBy with Aggregation

You can apply various aggregation functions to your grouped data, such as sum(), max(), min(), mean(), etc.

Output:

[Image: snapshot of the output]

Explanation:

groupBy("DEPT"): Groups the data by the DEPT column.
agg(): Applies the aggregation functions (max, sum, min, mean, count) on the FEE column for each group.


URL: https://www.geeksforgeeks.org/python/pyspark-groupby-dataframe-with-aggregation-or-count/