Data cleaning is one of the most important steps in the data analysis process. Raw data often contains missing values, inconsistent column names, duplicates, or unwanted entries. Cleaning such data manually using Pandas can become repetitive and error-prone, especially for large datasets.
PyJanitor is an open-source Python library that extends Pandas by adding convenient functions for data cleaning. It helps perform common tasks such as renaming columns, handling missing values, filtering data, and encoding categorical variables with minimal code.
The goal of PyJanitor is to make data cleaning simpler, faster, and less error-prone, especially for beginners.
PyJanitor offers a variety of features that simplify data cleaning:
You can install PyJanitor using pip:
pip install pyjanitor
The clean_names() function standardizes column names, for example by converting them to lowercase and replacing spaces with underscores.
Output
column_1 column_2
0 1 3
1 2 4
Use remove_empty() to remove rows or columns containing only missing values.
Output
A B
0 1.0 4.0
1 3.0 6.0
We can identify repeated rows with pandas' duplicated() method, which returns True for a row whose values in every column match an earlier row, and False otherwise.
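A short sketch of duplicated() on a hypothetical four-row frame in which the third row repeats the first; the column names and values are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical input: row 2 is an exact copy of row 0.
df = pd.DataFrame(
    {
        "A": [1, 2, 1, 3],
        "B": ["x", "y", "x", "z"],
    }
)

# One boolean per row; the first occurrence of a repeated row is not flagged.
flags = df.duplicated()
print(flags)
```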
Output
0 False
1 False
2 True
3 False
dtype: bool
We can convert object-typed columns to the categorical data type using the encode_categorical() function, passing the names of the columns we want to encode.
Output
A object
B object
dtype: object
A B
0 low type1
1 medium type2
2 high type1
3 medium type3
4 low type2
A category
B category
dtype: object
Explanation:
Renaming columns is common in data cleaning; PyJanitor’s clean_names standardizes names to lowercase and replaces spaces with underscores.
Output
first_name last_name age_years_
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Filtering data based on certain conditions is a common data cleaning task. PyJanitor provides the filter_string function to filter rows based on string conditions.
Output
Name Age
2 Mary 35
3 Harry 40
Example:
Output
salesmonth company1 company2 company3
0 Jan 150.0 180.0 400.0
1 Feb 200.0 250.0 500.0
2 April 400.0 500.0 675.0
Now that we have covered the main features of PyJanitor, let's look at some of its other functions.
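The fill_empty function replaces missing values in the given columns with a specified value. The example here uses the closely related impute function, which performs the same fill.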
Example:
Output
col1 col2 col3
0 1 0.0 0.0
1 2 4.0 5.0
2 3 0.0 6.0
Explanation: jn.impute(data, column_names=['col2', 'col3'], value=0): Fill missing values in columns 'col2' and 'col3' with 0.
The filter_on function lets you filter rows in a DataFrame based on a condition, much like pandas' query method, which the example here uses. It does not change the original data.
Example:
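A sketch of the query call quoted in the explanation further down. The first student's score is not shown in the text, so the value 40 is an assumption; any score below 50 would give the same result.

```python
import pandas as pd

# S1's score is hypothetical; the other rows follow the output.
data = pd.DataFrame(
    {
        "student_id": ["S1", "S2", "S3", "S4"],
        "score": [40, 75, 50, 90],
    }
)

# Keep rows with a score of at least 50; `data` itself is left unchanged.
f1 = data.query("score >= 50")
print(f1)
```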
Output
student_id score
1 S2 75
2 S3 50
3 S4 90
Explanation: f1 = data.query("score >= 50"): Filter rows where the score is greater than or equal to 50.
The rename_column function is used to change a column name in a DataFrame. The example here uses the equivalent pandas rename method.
Example:
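A sketch of the rename call quoted in the explanation further down, with input values taken from the output:

```python
import pandas as pd

data = pd.DataFrame(
    {
        "x": [10, 20, 30],
        "y": [40, 50, 60],
    }
)

# Rename column x to x_new; rename() returns a new frame, so reassign it.
data = data.rename(columns={"x": "x_new"})
print(data)
```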
Output
x_new y
0 10 40
1 20 50
2 30 60
Explanation: data = data.rename(columns={'x': 'x_new'}): Rename column x to x_new.
The add_column function is used to add a new column to a DataFrame.
Example:
Output
a b c d e
0 0 a 1 e 4
1 1 b 1 f 5
2 2 c 1 g 6
Explanation: