Voozh

In large-scale data engineering and analytics, files are often stored in deeply partitioned directories to improve performance and manageability. This structure is common in data lakes where data is organised hierarchically by time (e.g., year, month, day).

A deeply partitioned structure enables Apache Spark to efficiently prune irrelevant partitions, resulting in faster queries and improved resource utilisation. In this article, we will explore how to read and write deeply partitioned files using Apache Spark in Java.

1. Setting Up an Apache Spark Example Project

Let’s start by including the following Maven dependencies in the project.

 <dependency>
 <groupId>org.apache.spark</groupId>
 <artifactId>spark-core_2.13</artifactId>
 <version>3.5.7</version>
 </dependency>
 <dependency>
 <groupId>org.apache.spark</groupId>
 <artifactId>spark-sql_2.13</artifactId>
 <version>3.5.7</version>
 <scope>provided</scope>
 </dependency>

These dependencies ensure that the Java program can interact with the Spark runtime, handle distributed data processing, and perform SQL-style operations on structured datasets. The spark-core library provides essential functionalities for Spark applications, while spark-sql adds support for working with DataFrames and executing SQL queries.

Note: The folder structure assumes the source data is already organised into partition folders following a standard date-time hierarchy (year, month, day, hour), similar to how partitioned datasets are typically structured in distributed storage systems.

2. Java Program

This code example represents the core logic for reading and writing deeply partitioned files using Apache Spark in Java. It demonstrates how Spark can automatically traverse a nested directory structure, extract partition metadata, and reorganise the data into a new partitioned format.

public class DeepPartitionExample {

 public static void main(String[] args) {
 // Initialize Spark Session
 SparkSession spark = SparkSession.builder()
 .appName("DeepPartitionExampleApp")
 .master("local[*]") // for local testing
 .getOrCreate();

 String inputPath = "data/sales_partitioned_source/customers";

 // Read deeply partitioned CSV files
 var inputDf = spark.read()
 .format("csv")
 .schema("CustomerId STRING, Region STRING, PurchaseAmount DOUBLE")
 .option("header", "true")
 .option("pathGlobFilter", "*.csv")
 .option("recursiveFileLookup", "true")
 .load(inputPath);

 // Extract partition columns from file paths
 var processedDf = inputDf
 .withColumn("year", regexp_extract(input_file_name(), "year=(\\d{4})", 1))
 .withColumn("month", regexp_extract(input_file_name(), "month=(\\d{2})", 1))
 .withColumn("day", regexp_extract(input_file_name(), "day=(\\d{2})", 1));

 // Write the DataFrame with partitions
 String outputPath = "data/sales_partitioned_output/customers";

 processedDf.write()
 .format("csv")
 .option("header", "true")
 .mode("overwrite")
 .partitionBy("year", "month", "day")
 .save(outputPath);

 spark.stop();
 }

}

Here’s a breakdown of each part.

Initializing the Spark Session

The program begins by creating a SparkSession, which serves as the entry point for all Spark functionalities. The .master("local[*]") configuration indicates that the program will run locally using all available CPU cores, making it suitable for local testing and development.

SparkSession spark = SparkSession.builder()
 .appName("DeepPartitionExampleApp")
 .master("local[*]")
 .getOrCreate();

Reading Deeply Partitioned Files

Next, the code reads a collection of CSV files stored under a deeply partitioned directory structure. The inputPath variable points to the base directory containing subfolders such as year=2024/month=09/day=28/. The .read() operation configures the input format as CSV and explicitly defines the schema.

The following read options are particularly important:

option("pathGlobFilter", "*.csv") ensures that only CSV files are processed.
option("recursiveFileLookup", "true") enables Spark to look through all nested subdirectories and read every CSV file under the given path.

 String inputPath = "data/sales_partitioned_source/customers";
 var inputDf = spark.read()
 .format("csv")
 .schema("CustomerId STRING, Region STRING, PurchaseAmount DOUBLE")
 .option("header", "true")
 .option("pathGlobFilter", "*.csv")
 .option("recursiveFileLookup", "true")
 .load(inputPath);

This capability makes it possible to process deeply partitioned datasets without manually enumerating each subdirectory.

Extracting Partition Columns

After reading the files, the code uses Spark SQL functions such as regexp_extract and input_file_name() to extract partition information embedded in the file paths.

 var processedDf = inputDf
 .withColumn("year", regexp_extract(input_file_name(), "year=(\\d{4})", 1))
 .withColumn("month", regexp_extract(input_file_name(), "month=(\\d{2})", 1))
 .withColumn("day", regexp_extract(input_file_name(), "day=(\\d{2})", 1));

This transformation step adds new columns to the DataFrame for each partition value. It enables Spark to use these columns in operations such as filtering, aggregation, or repartitioning, turning directory information into actual data fields that can be queried and processed.

Writing the Processed Data with New Partitions

Once we’ve extracted the year, month, and day columns, we can instruct Spark to write the data back in a partitioned layout.

 processedDf.write()
 .format("csv")
 .option("header", "true")
 .mode("overwrite")
 .partitionBy("year", "month", "day")
 .save(outputPath);

Here:

.format("csv") writes the data in CSV format.
.mode("overwrite") clears any existing output directory before writing.
.partitionBy("year", "month", "day") organizes the files in partitioned folders.

After writing the results, we call spark.stop(). This gracefully shuts down the Spark context, releasing all allocated resources. This step is important in long-running pipelines or scheduled jobs.

3. Conclusion

This article explored a Java example that demonstrates how to read and write deeply partitioned files using Apache Spark. It covered how to set up a Spark session, read files recursively from nested directories, extract partition information using regular expressions, write partitioned outputs, and organize data for efficient processing.

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!

1. JPA Mini Book

2. JVM Troubleshooting Guide

3. JUnit Tutorial for Unit Testing

4. Java Annotations Tutorial

5. Java Interview Questions

6. Spring Interview Questions

7. Android UI Design

and many more ....

I agree to the Terms and Privacy Policy

👁 Image

Thank you!

We will contact you soon.

URL: https://www.javacodegeeks.com/reading-and-writing-deeply-partitioned-files-in-apache-spark.html

⇱ Reading and Writing Deeply Partitioned Files in Apache Spark - Java Code Geeks

1. Setting Up an Apache Spark Example Project

2. Java Program

3. Conclusion

Thank you!

Omozegie Aziegbe

Related Articles

Simple REST client in Java

Spring Boot Error – Error creating a bean with name ‘dataSource’ defined in class path resource DataSourceAutoConfiguration

How to fix Exception in thread “main” java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory in Java

Mockito: Cannot instantiate @InjectMocks field: the type is an interface

100 Java Spring Interview Questions & Answers – The ULTIMATE List (PDF Download)

Spring Boot Remove Embedded Tomcat Server, Enable Jetty Server

What is SecurityContext and SecurityContextHolder in Spring Security?

How to install Apache Web Server on EC2 Instance using User data script