Data cleaning is the process of preparing raw data by detecting and correcting errors so it can be effectively used for analysis. It is a foundational step in data preprocessing that ensures datasets are suitable for analytical, statistical and machine learning tasks.
Raw data is often noisy, incomplete and inconsistent which can negatively impact the accuracy of the model.
Clean datasets are also important in EDA (Exploratory Data Analysis), which enhances the interpretability of data so that the right actions can be taken based on insights.
Common Data Anomalies
Data quality issues can arise from human errors, system failures or problems during data collection and integration. Some of the most common data quality challenges include:
Missing values: Incomplete records can reduce statistical power and introduce bias into analysis.
Duplicate records: Repeated entries may overrepresent certain observations resulting in skewed outcomes.
Incorrect data types: Mismatched formats, such as text stored in numeric fields can cause calculation errors and analysis failures.
Outliers and anomalies: Extremely high or low values can distort statistical measures and influence model performance.
Inconsistent formats: Variations in date formats, text casing or measurement units can create issues when merging or comparing datasets.
Spelling and typographical errors: Errors in text fields can lead to incorrect grouping, classification or interpretation of categorical data
Data Cleaning Process
1. Assess Data Quality
The first step in data cleaning is to assess the quality of your data. This involves checking for:
Missing Values: Identify any blank or null values in the dataset. Missing values can be due to various reasons such as incomplete data collection, data entry errors or data loss during transmission.
Incorrect Values: Check for values that are outside the expected range or are inconsistent with the data type.
Inconsistencies in Data Format: Verify that the data format is consistent throughout the dataset.
In the deduplicated DataFrame, the duplicate rows 1 and 6 have been removed to ensure each record is unique
3. Fix Structural Errors
Structural errors occur when data formats, naming conventions or variable types are inconsistent which can affect analysis accuracy. Correcting these issues ensures uniform and reliable data representation.
Standardize data formats to maintain consistency in dates, times and other data types across the dataset.
Correct naming inconsistencies in column names, variable names or labels to ensure clarity and uniformity.
Ensure consistent data representation such as using the same units for measurements or the same scales for ratings.
Missing data can introduce bias and reduce the reliability of analysis. Properly addressing missing values helps maintain the integrity of your dataset.
Impute missing values using statistical methods such as mean, median or mode to fill gaps.
Remove records with missing values when the missing data is extensive or cannot be accurately imputed.
The missing value in the 'Name' column (row 7) has been filled with 'Unknown' to indicate unavailable data, ensuring the dataset remains complete and consistent.
5. Normalize Data
Data normalization organizes the dataset to reduce redundancy and ensure consistency making it easier to manage and analyze.
Split data into multiple tables, with each table storing specific types of information.
Ensure consistency across the dataset to support efficient querying and accurate analysis.
Outliers are data points that deviate significantly from the rest of the dataset and can affect analysis accuracy. Properly handling them ensures more reliable insights.
Remove outliers that result from errors or are not representative of the population.
Transform extreme but valid outliers to reduce their impact on the analysis.
Step 10: Recalculate Outlier Bounds and Remove Outliers from the Updated Data
mean = df3['Age'].mean(): Calculates the average (mean) value of the Age column in the DataFrame df3.
std = df3['Age'].std(): Computes the standard deviation (spread or variability) of the Age column in df3.
lower_bound = mean - 2 * std: Defines the lower limit for acceptable Age values, set as two standard deviations below the mean.
upper_bound = mean + 2 * std: Defines the upper limit for acceptable Age values, set as two standard deviations above the mean.
df4 = df3[(df3['Age'] >= lower_bound) & (df3['Age'] <= upper_bound)]: Creates a new DataFrame df4 by selecting only rows where the Age value falls between the lower and upper bounds, effectively removing outlier ages outside this range.
Data validation and verification involve ensuring that the data is accurate and consistent by comparing it with external sources or expert knowledge.
For the machine learning prediction we separate independent and target features
Here we will consider 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', and 'Embarked' as independent features.
Survived as target variables because PassengerId will not affect the survival rate
Step 12: Data formatting
Data formatting involves converting the data into a standard format or structure that can be easily processed by the algorithms or models used for analysis. Here we will discuss commonly used data formatting techniques i.e. Scaling and Normalization.
1. Min-Max Scaling: Scaling involves transforming the values of features to a specific range. Min-Max scaling rescales the values to a specified range, typically between 0 and 1. It preserves the original distribution and ensures that the minimum value maps to 0 and the maximum value maps to 1.
2. Standardization (Z-score scaling): Standardization transforms the values to have a mean of 0 and a standard deviation of 1. It centers the data around the mean and scales it based on the standard deviation. Standardization makes the data more suitable for algorithms that assume a Gaussian distribution or require features to have zero mean and unit variance.
Understand the data: Know the source, structure and domain of the data to identify potential quality issues and determine appropriate cleaning actions.
Document the process: Keep records of decisions, methods, assumptions and rules applied during data cleaning.
Prioritize critical issues: Focus first on major quality problems that could have a systemic impact on analysis or decision-making.
Automate where possible: Use scripts or tools for repetitive cleaning tasks to improve efficiency and consistency.
Collaborate with domain experts: Engage stakeholders or domain specialists to validate that the cleaned data meets business requirements.
Monitor and maintain: Continuously track data quality and perform cleaning periodically to ensure long-term accuracy and reliability.
Advantages
Removing errors, inconsistencies, and irrelevant data helps the model learn better from the data.
Ensures the data is accurate, consistent, and free of errors.
Transforms data into a format that better represents underlying patterns and relationships.
Improves data quality, making it more reliable and accurate.
Helps identify and remove sensitive or confidential information, improving data security.
Disadvantages
It is time-consuming, especially for large and complex datasets.
It can lead to loss of important information if not handled carefully.
Requires significant time, effort, expertise, and sometimes specialized tools.
Removing too much data can contribute to underfitting.