URL: https://www.analyticsvidhya.com/blog/2020/12/tutorial-to-data-preparation-for-training-machine-learning-model/

Tutorial to Data Preparation for Training Machine Learning Model

Vidhi Last Updated : 18 Dec, 2020

This article was published as a part of the Data Science Blogathon.

Introduction

Quite often, the features (input variables) we need are not all in one file but are spread across multiple files. On top of that, we may need external information to make the appropriate joins and prepare the final data in the right format for model training. In this article, we will see why data preparation and feature engineering (specifically target variable engineering) are the most time-consuming steps of the modeling pipeline.

Let’s get started and see what the objective is.

We will first do the necessary imports.

[Image: import data preparation]

We have three datasets: station data, trip data, and weather data.

[Image: data preparation read data]
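Since the original code is only available as a screenshot, here is a minimal sketch of what the loading step might look like. The file names `station.csv`, `trip.csv` and `weather.csv`, and the helper `load_datasets`, are assumptions made for illustration; adjust them to match your copy of the data.

```python
from pathlib import Path

import pandas as pd


def load_datasets(data_dir):
    """Read the three raw files into DataFrames.

    The file names below are assumed, not taken from the article.
    """
    data_dir = Path(data_dir)
    df_station = pd.read_csv(data_dir / "station.csv")
    df_trip = pd.read_csv(data_dir / "trip.csv")
    df_weather = pd.read_csv(data_dir / "weather.csv")
    return df_station, df_trip, df_weather
```

Calling `load_datasets("data")` would return the three frames in the order station, trip, weather.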

Station Data:

Let’s see what the station data looks like:

[Image: data preparation station data head]

We have 71 stations in total across 5 cities:

[Image: unique data]
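Counts like these can be obtained with `nunique()`. The sketch below uses a tiny made-up station table, so its numbers will not match the real data; the column names `id` and `city` are assumptions.

```python
import pandas as pd

# Toy stand-in for the station table; values and column names are made up.
df_station = pd.DataFrame({
    "id": [2, 3, 4, 5],
    "name": ["Station A", "Station B", "Station C", "Station D"],
    "city": ["City X", "City X", "City Y", "City Z"],
})

n_stations = df_station["id"].nunique()    # number of distinct stations
n_cities = df_station["city"].nunique()    # number of distinct cities

# Number of stations in each city
stations_per_city = df_station.groupby("city")["id"].nunique()

print(n_stations, n_cities)
print(stations_per_city)
```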

Trip data:

We have now seen two files, and it is clear that the target variable is not given to us as is. We need to create it ourselves: the target is the net rate of change in trips at a station in a given hour. For this, we need to create start-hour and end-hour features from the start date and end date, respectively.
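Here is a minimal sketch of how such a target could be built. It assumes that the net rate is the number of arrivals minus the number of departures per station and hour; the article does not spell out the exact definition, and the column names below are assumptions too.

```python
import pandas as pd

# Toy trip records; values and column names are made up for illustration.
df_trip = pd.DataFrame({
    "start_date": ["2014-01-01 08:05", "2014-01-01 08:40", "2014-01-01 09:10"],
    "end_date": ["2014-01-01 08:20", "2014-01-01 09:05", "2014-01-01 09:30"],
    "start_station_id": [1, 1, 2],
    "end_station_id": [2, 2, 1],
})
df_trip["start_date"] = pd.to_datetime(df_trip["start_date"])
df_trip["end_date"] = pd.to_datetime(df_trip["end_date"])

KEYS = ["station_id", "date", "hour"]


def hourly_counts(df, time_col, station_col, name):
    """Count trips per station, calendar date and hour of day."""
    keys = pd.DataFrame({
        "station_id": df[station_col],
        "date": df[time_col].dt.normalize(),  # midnight of the same day
        "hour": df[time_col].dt.hour,
    })
    return keys.groupby(KEYS).size().reset_index(name=name)


departures = hourly_counts(df_trip, "start_date", "start_station_id", "departures")
arrivals = hourly_counts(df_trip, "end_date", "end_station_id", "arrivals")

# Outer merge keeps station-hours that appear on only one side; those count as 0.
df_netrate = departures.merge(arrivals, on=KEYS, how="outer")
counts = ["departures", "arrivals"]
df_netrate[counts] = df_netrate[counts].fillna(0).astype(int)

# Assumed definition: bikes arriving minus bikes leaving in that hour.
df_netrate["net_rate"] = df_netrate["arrivals"] - df_netrate["departures"]
print(df_netrate)
```

A negative `net_rate` means the station lost bikes during that hour, and a positive one means it gained bikes.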

Once we have the net_rate feature created as our target variable, we need to start bringing all the relevant attributes into one dataframe.

So, we merge the station and trip data to create an intermediate dataframe named df_station_netrate:

[Image: target variables]
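As a rough sketch of this merge, assume the hourly net-rate table is keyed by `station_id` and the station table by `id` (both names are assumptions). A left join keeps every station-hour row and attaches the station attributes to it.

```python
import pandas as pd

# Toy inputs; values and column names are made up for illustration.
df_station = pd.DataFrame({
    "id": [1, 2],
    "lat": [37.33, 37.44],
    "long": [-121.90, -122.16],
    "city": ["City X", "City Y"],
})
df_netrate = pd.DataFrame({
    "station_id": [1, 1, 2],
    "date": pd.to_datetime(["2014-01-01", "2014-01-01", "2014-01-01"]),
    "hour": [8, 9, 8],
    "net_rate": [-2, 1, 1],
})

# Each net-rate row matches at most one station, so validate the join shape.
df_station_netrate = df_netrate.merge(
    df_station,
    left_on="station_id",
    right_on="id",
    how="left",
    validate="many_to_one",
).drop(columns="id")

print(df_station_netrate.head())
```

The `validate="many_to_one"` argument makes pandas raise an error if the station table ever contains duplicate ids, which would otherwise silently duplicate rows.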

Let’s see what this intermediate dataframe ‘df_station_netrate’ looks like:

[Image: station data head]

Weather Data:

Note that we have not yet studied the weather data, so let’s see how to bring weather information into the intermediate dataframe ‘df_station_netrate’ created above.

[Image: weather analysis]

Missing values in weather data:

[Image: weather column]
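A quick way to summarise missing values per column is `isnull().sum()`. The sketch below runs on a toy weather table; the column names and values are made up.

```python
import numpy as np
import pandas as pd

# Toy weather table; values and column names are made up for illustration.
df_weather = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"]),
    "mean_temperature_f": [50.0, np.nan, 54.0],
    "events": [np.nan, "Rain", np.nan],
    "zip_code": [11111, 11111, 11111],
})

missing = df_weather.isnull().sum()                        # count per column
missing_pct = (df_weather.isnull().mean() * 100).round(1)  # share per column

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary[summary["missing"] > 0])
```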

External Data:

Note that we have zipcode in the weather data and lat-long information in the intermediate dataframe ‘df_station_netrate’, so we need a zipcode to lat-long mapping to be able to merge the weather data with the station data.

We resort to open-source data for this information: https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/export/

[Image: external data]

Now that we have the necessary mapping, we merge it with weather data:

[Image: external data]
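A minimal sketch of this step is shown below. The mapping columns `Zip`, `Latitude` and `Longitude` are assumed names, and the coordinates are made up; check the header of the file you actually download.

```python
import pandas as pd

# Toy weather rows and a toy zip-to-coordinates mapping (all values made up).
df_weather = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-01"]),
    "zip_code": [11111, 22222],
    "mean_temperature_f": [50.0, 55.0],
})
df_zip = pd.DataFrame({
    "Zip": [11111, 22222, 33333],
    "Latitude": [37.77, 37.34, 37.44],
    "Longitude": [-122.39, -121.89, -122.16],
})

# Align column names with the weather table before joining.
df_zip = df_zip.rename(
    columns={"Zip": "zip_code", "Latitude": "lat", "Longitude": "long"}
)

# The key must have the same dtype on both sides (integers here).
df_weather_geo = df_weather.merge(
    df_zip, on="zip_code", how="left", validate="many_to_one"
)
print(df_weather_geo)
```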

Final Data:

Now, we can merge this weather information with df_station_netrate, as our final training data preparation step:

[Image: final preparation]
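The article does not state which keys this final merge uses. One possible approach, sketched here purely as an assumption, is to assign each station to its nearest zipcode centroid and then join on zipcode and date. All values and column names are made up.

```python
import numpy as np
import pandas as pd

# Toy inputs (all values made up for illustration).
df_station_netrate = pd.DataFrame({
    "station_id": [1, 1, 2],
    "date": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-01"]),
    "hour": [8, 8, 9],
    "net_rate": [-2, 1, 0],
    "lat": [37.78, 37.78, 37.33],
    "long": [-122.40, -122.40, -121.89],
})
df_weather_geo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-01-01", "2014-01-02", "2014-01-01", "2014-01-02"]
    ),
    "zip_code": [11111, 11111, 22222, 22222],
    "lat": [37.77, 37.77, 37.34, 37.34],
    "long": [-122.39, -122.39, -121.89, -121.89],
    "mean_temperature_f": [50.0, 52.0, 55.0, 57.0],
})

# One row per zipcode with its centroid.
zips = (
    df_weather_geo[["zip_code", "lat", "long"]]
    .drop_duplicates("zip_code")
    .reset_index(drop=True)
)

# Squared distance in degrees from every row to every centroid.
# This is a rough proxy that is only meant for choosing among a few nearby zipcodes.
d_lat = df_station_netrate["lat"].to_numpy()[:, None] - zips["lat"].to_numpy()[None, :]
d_long = df_station_netrate["long"].to_numpy()[:, None] - zips["long"].to_numpy()[None, :]
nearest = np.argmin(d_lat ** 2 + d_long ** 2, axis=1)
df_station_netrate["zip_code"] = zips["zip_code"].to_numpy()[nearest]

# Attach the weather of the matched zipcode on the same date.
df_final = df_station_netrate.merge(
    df_weather_geo.drop(columns=["lat", "long"]),
    on=["zip_code", "date"],
    how="left",
    validate="many_to_one",
)
print(df_final)
```

For stations spread over a large area, a proper great-circle distance would be a safer choice than squared degrees.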

We shall also create date-type features as part of the feature engineering process.
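A few typical date-type features can be derived with the pandas `.dt` accessor. Which features the original notebook created is not shown, so the ones below are only examples on made-up data.

```python
import pandas as pd

# Toy frame with a datetime column (values made up for illustration).
df_final = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-03", "2014-01-04", "2014-07-06"]),
    "net_rate": [1, -2, 0],
})

df_final["year"] = df_final["date"].dt.year
df_final["month"] = df_final["date"].dt.month
df_final["day_of_week"] = df_final["date"].dt.dayofweek  # Monday=0 ... Sunday=6
df_final["is_weekend"] = (df_final["day_of_week"] >= 5).astype(int)

print(df_final)
```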

Finally, we have one flat file to train a model. So, we have seen that training data preparation and feature engineering are the most arduous tasks in creating a machine learning model.

The data and the source code are kept here.
