URL: https://www.geeksforgeeks.org/python/identify-corrupted-records-in-a-dataset-using-pyspark/

Identify corrupted records in a dataset using pyspark

Last Updated : 23 Jul, 2025

Datasets sometimes contain corrupt records, that is, records that do not follow the rules the valid records follow. For example, a record may be delimited with a pipe ("|") character while every other record is delimited with a comma (","), and the reader has been told to use a comma as the separator.
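To make the pipe-versus-comma case concrete, here is a minimal sketch using only Python's standard `csv` module (not PySpark). The sample rows are hypothetical and are not the dataset used in this article.

```python
import csv
import io

# Hypothetical sample data: the third data row uses "|" instead of ",".
raw_text = "id,name,age\n1,Alice,30\n2,Bob,25\n3|Carol|41\n4,Dave,35\n"

# Parse with a comma separator, as the reader was told to do.
reader = csv.reader(io.StringIO(raw_text), delimiter=",")
header = next(reader)

# A row whose field count differs from the header did not follow the rules.
suspect_rows = [row for row in reader if len(row) != len(header)]
print(suspect_rows)
```

The pipe-delimited line is read as a single field, so it no longer matches the three-column layout of the other rows.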

This article demonstrates the three read modes PySpark offers for identifying or discarding corrupt records:

  • PERMISSIVE
  • DROPMALFORMED
  • FAILFAST

Let's discuss the modes one by one with examples, but first make sure the environment is set up. We are going to use Google Colab; here's how to set it up:

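Step 1: Install Java and Spark in your Colab environment. Spark runs on the JVM, so a Java installation is required.

Step 2: Set up the environment variables (for example, JAVA_HOME) so that Spark can locate the Java installation.

Step 3: Initialize Spark by creating a SparkSession.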

Now that the Colab environment is set up, let's explore how each of the three modes works.

To download the csv file used in this article, click here.

PERMISSIVE

PERMISSIVE is the default mode. In this mode, nulls are inserted for the fields that could not be parsed correctly. To retain the bad records in the DataFrame, use the "columnNameOfCorruptRecord" option, which names the column that holds the raw text of each bad record.

Example

The example reads a CSV file, identifies the corrupted records, and counts the total number of records. It defines a schema for the data, reads the CSV with the relevant options (including the one for handling corrupted records), filters and displays the corrupted records, and reports the total record count.

In this example, the "_corrupt_record" column stores the raw text of each corrupted record.

Output:

[Screenshot: Permissive Mode]

DROPMALFORMED

DROPMALFORMED drops corrupted records while reading the dataset.

Example

The example reads a CSV file and drops any malformed or corrupted records. It defines a schema for the data, reads the CSV with the relevant options (including the one for dropping malformed records), displays the cleaned dataset, and reports the total record count.

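The displayed output contains 9 records, because the malformed record has been dropped. Note, however, that calling count() on 'customers_df' may still report the original total. Since Spark 2.4, the CSV reader parses only the columns a query actually needs, and a bare count needs none of them, so the malformed row is not detected during counting.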

Output:

[Screenshot: Dropmalformed Mode]

FAILFAST

FAILFAST throws an exception when it encounters a malformed record while reading the dataset.

Example

The example reads a CSV file in "FAILFAST" mode, which means the read fails with an exception if it encounters any record that does not adhere to the specified schema. It defines a schema for the data, reads the CSV with the relevant options, displays the dataset, and reports the total record count. If a malformed record is encountered, the exception is caught and the error message is printed.

Because the dataset contains one corrupt record, the read raises an exception. The advantage of FAILFAST mode is that it prevents any further processing of a dataset that contains corrupted records.

Output:

[Screenshot: Failfast Mode]

To download the Jupyter notebook for the entire code, click here.
