PySpark partitionBy() method

Last Updated : 23 May, 2024

PySpark partitionBy() is a method of the DataFrameWriter class that partitions output by column values when a DataFrame is written to disk or another file system. When you call partitionBy() on a write, PySpark splits the records by the values of the partition column and stores each partition's data in its own sub-directory.

Partitioning is a way to split a large dataset into smaller datasets based on one or more partition keys. To partition on multiple columns, pass each column name as a separate argument to partitionBy().

Syntax: partitionBy(self, *cols)

Let's create a DataFrame by reading a CSV file. The dataset used is Cricket_data_set_odi.csv.


Create dataframe for demonstration:


PySpark partitionBy() with One Column:

From the above DataFrame, we will use Team as the partition key.


PySpark partitionBy() with Multiple Columns:

Passing more than one column to partitionBy() creates nested sub-directories, one level per column, in the order the columns are given.

From the above DataFrame, we use Team and Speciality as the partition keys.


Control Number of Records per Partition File:

Use the maxRecordsPerFile option to cap the number of records written to each file within a partition. This is especially helpful when your data is skewed (some partitions hold very few records while others hold very many), because a large partition is then split across several smaller files instead of one oversized file.



URL: https://www.geeksforgeeks.org/python/pyspark-partitionby-method/