How to Get the Number of Elements in a PySpark Partition

Last Updated : 23 Jul, 2025

In this article, we are going to learn how to get the number of elements in a partition using PySpark in Python.

If you have worked with PySpark data frames, you may know that Spark splits the data it loads into partitions. The default number of partitions depends on the data source and the configuration; on a local machine it is often tied to the number of available cores. You can repartition the data into as many partitions as you need. After partitioning, you may want to know how many elements each partition of the RDD or data frame holds, which you can find out using functions from the PySpark module. This article discusses how.

Prerequisite

Note: When following an article about installing PySpark, install Python instead of Scala; the rest of the steps are the same.

Modules Required:

PySpark: PySpark is the Python API for Apache Spark, an open-source distributed computing framework and set of libraries for real-time, large-scale data processing. You can install the module with the following command:

pip install pyspark

Methods to get the number of elements in a partition:

Using spark_partition_id() function
Using map() function

Method 1: Using the spark_partition_id() function

In this method, we use the spark_partition_id() function to get the number of elements in each partition of a data frame.

Stepwise Implementation:

Step 1: First of all, import the required libraries, i.e. SparkSession and spark_partition_id. SparkSession is used to create the session, while spark_partition_id returns the ID of the partition each row belongs to, which lets us count the records per partition.

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file for which you want to check the number of elements in the partition.

data_frame = spark_session.read.csv('#Path of CSV file', sep = ',', inferSchema = True, header = True)

Step 4: Finally, add the partition ID to every row using the spark_partition_id function, group by that ID, and count the rows to get the number of elements in each partition.

data_frame.withColumn("partitionId",spark_partition_id()).groupBy("partitionId").count().show()

Example 1:

In this example, we read a CSV file and obtained the number of partitions as well as the number of elements per partition using the spark_partition_id function.

Output:

[Image: output table listing each partitionId with its count]

Example 2:

In this example, we read a CSV file and obtained the number of partitions as well as the number of elements per partition using the spark_partition_id function. We then repartitioned the data and again obtained the number of partitions and the record count per partition of the repartitioned data.

Output:

Before repartitioning, the number of elements in each partition was as follows:

[Image: output table listing each partitionId with its count before repartitioning]

After repartitioning, the number of elements in each partition was as follows:

[Image: output table listing each partitionId with its count after repartitioning]

Method 2: Using the map function

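In this method, we use the map() function together with the glom() function to get the number of elements in each partition. The glom() function gathers all elements of a partition into a single list, and map(len) then returns the length of each list.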

Stepwise Implementation:

Step 1: First of all, import the required libraries, i.e. SparkSession. The SparkSession library is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Next, get the SparkContext from the session.

sc = spark_session.sparkContext

Step 4: Then, either read the CSV file whose partitions you want to inspect, or create an RDD from a dataset with the number of partitions you want. Since glom() is an RDD method, the CSV option takes the underlying RDD of the data frame.

# Option A: read a CSV file and take its underlying RDD
data_frame = spark_session.read.csv('#Path of CSV file', sep = ',', inferSchema = True, header = True).rdd

# Option B: distribute a dataset over a chosen number of partitions
num_partitions = Declare_number_of_partitions_to_be_done
data_frame = sc.parallelize(Declare_the_dataset, num_partitions)

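Step 5: Further, get the length of each partition using the glom and map functions, and use collect() to retrieve the result. The glom() function is defined on RDDs, so a data frame has to be accessed through its rdd attribute first.

# data_frame must be an RDD here; for a data frame, call data_frame.rdd.glom() instead
l = data_frame.glom().map(len).collect()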

Step 6: Finally, print the length of each partition obtained in the previous step.

print(l)

Example:

In this example, we declared a dataset and the number of partitions to split it into. Then, we applied the glom and map functions to the dataset and obtained the number of elements in each partition.

Output:

[Image: printed list with the length of each partition]
