Data Preprocessing in Python

Last Updated : 30 Apr, 2026

Data preprocessing is the first step in any data analysis or machine learning pipeline. It involves cleaning, transforming and organizing raw data so that it is accurate, consistent and ready for modeling. Good preprocessing affects model building in several ways:

Clean and well-structured data allows models to learn meaningful patterns rather than noise.
Properly processed data prevents misleading inputs, leading to more reliable predictions.
Organized data makes it simpler to create useful inputs for the model, enhancing model performance.
Organized data supports better Exploratory Data Analysis (EDA), making patterns and trends more interpretable.

Data Preprocessing

Step-by-Step Implementation

Let's walk through the main preprocessing steps.

Step 1: Import Libraries and Load Dataset

We set up the environment with libraries such as pandas, NumPy, scikit-learn, Matplotlib and seaborn for data manipulation, numerical operations, scaling and visualization, then load the dataset for preprocessing.

The sample dataset can be downloaded from here.
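As a minimal sketch of the loading step, the snippet below reads a tiny in-memory CSV so that it runs on its own. The rows and every column name except Outcome are made up for illustration; with the real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Stand-in data so the sketch is runnable without the real file.
# Column names other than "Outcome" are hypothetical.
csv_text = """Glucose,BMI,Age,Outcome
148,33.6,50,1
85,26.6,31,0
183,23.3,32,1
89,28.1,21,0
137,43.1,33,1
"""

# With the downloaded dataset, replace the StringIO object with the file path
df = pd.read_csv(io.StringIO(csv_text))

# Preview the first rows to confirm the load worked
print(df.head())
```

Only pandas is needed here; the other libraries can be imported in the steps that actually use them.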

Output:

Dataset

Step 2: Inspect Data Structure and Check Missing Values

Check the dataset's size and data types, and identify any missing values that need handling.

df.info(): Prints concise summary including count of non-null entries and data type of each column.
df.isnull().sum(): Returns the number of missing values per column.
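The two calls above might be used as follows. This is a sketch on a small made-up frame with a few deliberately missing entries; the column names other than Outcome are hypothetical.

```python
import numpy as np
import pandas as pd

# Small made-up frame; column names other than "Outcome" are hypothetical
df = pd.DataFrame({
    "Glucose": [148, 85, np.nan, 89, 137],
    "BMI": [33.6, 26.6, 23.3, np.nan, np.nan],
    "Outcome": [1, 0, 1, 0, 1],
})

# info() prints its summary directly and returns None
df.info()

# Count missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Here Glucose has one missing value, BMI has two and Outcome has none, which is what the per-column counts report.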

Output:

Step 3: Statistical Summary and Visualizing Outliers

Get numeric summaries like mean, median, min/max and detect unusual points (outliers). Outliers can skew models if not handled.

df.describe(): Computes count, mean, standard deviation, min/max and quartiles for numerical columns; the median appears as the 50% row.
Boxplots: Visualize spread and detect outliers using matplotlib’s boxplot().
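A possible version of this step is sketched below on synthetic data, with one extreme BMI value injected so that the boxplot has something to flag. The column names and the generated values are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data; column names are hypothetical
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 100),
    "BMI": rng.normal(32, 6, 100),
})
df.loc[0, "BMI"] = 90.0  # inject one extreme value

# Numeric summary: count, mean, std, min, quartiles, max
summary = df.describe()
print(summary)

# One boxplot per column; points beyond the whiskers are potential outliers
fig, axes = plt.subplots(1, len(df.columns), figsize=(8, 4))
for ax, col in zip(axes, df.columns):
    ax.boxplot(df[col])
    ax.set_title(col)
plt.tight_layout()
plt.show()
```

Drawing each column on its own axis keeps features with different scales from flattening one another.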

Output:

Boxplot

Step 4: Remove Outliers Using the Interquartile Range (IQR) Method

Remove extreme values beyond a reasonable range to improve model robustness.

IQR = Q3 (75th percentile) – Q1 (25th percentile).
Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.
Calculate lower and upper bounds for each column separately.
Filter data points to keep only those within bounds.
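The rule above can be sketched as a small helper. The function name remove_outliers_iqr, its parameter k and the sample columns are hypothetical choices for illustration. This version computes the bounds for every column on the original data and then filters once.

```python
import pandas as pd


def remove_outliers_iqr(df, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] in every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # between() is inclusive; rows with NaN in this column are dropped too
        mask &= df[col].between(lower, upper)
    return df[mask]


# Made-up data with one obvious extreme Glucose value
df = pd.DataFrame({
    "Glucose": [90, 95, 100, 105, 110, 115, 300],
    "BMI": [22.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0],
})

df_clean = remove_outliers_iqr(df, ["Glucose", "BMI"])
print(len(df), "->", len(df_clean))  # prints: 7 -> 6
```

For Glucose the quartiles are 97.5 and 112.5, so the bounds are 75 and 135 and only the row with 300 is dropped. Missing values should be handled before this step if those rows need to be kept.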

Note: In practice, outlier removal should be applied across all relevant numerical columns to ensure consistent preprocessing.

Step 5: Correlation Analysis

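Understand relationships between the features and the target variable (Outcome). Correlation gives a quick first indication of which features are linearly related to the target, although it does not capture nonlinear effects.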

df.corr(): Computes pairwise correlation coefficients between numeric columns (Pearson by default).
A seaborn heatmap makes the correlation matrix easy to read.
Sorting correlations with corr['Outcome'].sort_values() highlights features most correlated with the target.
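A sketch of this analysis on synthetic data follows. The features Glucose and BMI and the way Outcome is generated are assumptions made only so that the example has a visible relationship to show.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: Outcome is loosely driven by Glucose, BMI is independent noise
rng = np.random.default_rng(0)
n = 200
glucose = rng.normal(120, 30, n)
bmi = rng.normal(32, 6, n)
outcome = (glucose + rng.normal(0, 20, n) > 130).astype(int)
df = pd.DataFrame({"Glucose": glucose, "BMI": bmi, "Outcome": outcome})

# Pairwise Pearson correlations between all columns
corr = df.corr()

# Annotated heatmap of the correlation matrix
plt.figure(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.tight_layout()
plt.show()

# Rank features by their correlation with the target
target_corr = corr["Outcome"].sort_values(ascending=False)
print(target_corr)
```

Outcome always tops the sorted list with a correlation of 1 with itself, so the informative entries start from the second row. If the frame contains non-numeric columns, passing numeric_only=True to df.corr() avoids errors in recent pandas versions.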

Output: