This article was published as a part of the Data Science Blogathon.
It often happens that the features (input variables) we need are not in a single file but spread across several. We may also need external information to join them correctly and get the final data into the right format for model training. In this article, we will see why data preparation and feature engineering (specifically, target variable engineering) are the most time-consuming steps of the modeling pipeline.
We will first do the necessary imports.
We have three datasets: station data, trip data, and weather data.
Let's see what the station data looks like:
In total, we have 71 stations across 5 cities:
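As a rough illustration of how such counts could be obtained, the sketch below uses a tiny made-up station table. The column names `id` and `city` are assumptions and may differ from the actual file, and the values are placeholders.

```python
import pandas as pd

# Toy stand-in for the station data; column names are assumed, values are made up.
station = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Station A", "Station B", "Station C", "Station D"],
    "city": ["City A", "City A", "City B", "City C"],
})

n_stations = station["id"].nunique()   # number of distinct stations
n_cities = station["city"].nunique()   # number of distinct cities

# Number of stations in each city, largest first.
stations_per_city = (
    station.groupby("city")["id"].nunique().sort_values(ascending=False)
)

print(n_stations, n_cities)
print(stations_per_city)
```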
We have now looked at two files and can see that the target variable is not given to us directly. We need to create it: the target is the net rate of change in trips at a station in a given hour. For this, we derive start-hour and end-hour features from the start and end dates, respectively.
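One way to build such a target is sketched below on a few made-up trips. The column names (`start_station_id`, `end_station_id`, `start_date`, `end_date`) are assumptions, and so is the sign convention: here the net rate is taken as arrivals minus departures per station per hour, which may not match the exact definition used in the original notebook.

```python
import pandas as pd

# Toy trip records; column names are assumed, values are made up.
trips = pd.DataFrame({
    "start_station_id": [1, 1, 2, 2],
    "end_station_id":   [2, 2, 1, 2],
    "start_date": pd.to_datetime(["2014-01-01 08:05", "2014-01-01 08:40",
                                  "2014-01-01 08:15", "2014-01-01 09:10"]),
    "end_date":   pd.to_datetime(["2014-01-01 08:20", "2014-01-01 09:05",
                                  "2014-01-01 08:30", "2014-01-01 09:25"]),
})

# Truncate each timestamp to the start of its hour.
trips["start_hour"] = trips["start_date"].dt.floor("h")
trips["end_hour"] = trips["end_date"].dt.floor("h")

# Count trips leaving and arriving at each station in each hour.
departures = (trips.groupby(["start_station_id", "start_hour"]).size()
              .rename("departures").rename_axis(["station_id", "hour"]))
arrivals = (trips.groupby(["end_station_id", "end_hour"]).size()
            .rename("arrivals").rename_axis(["station_id", "hour"]))

# Align both counts on (station, hour); a missing count means zero trips.
net = pd.concat([departures, arrivals], axis=1).fillna(0)
net["net_rate"] = net["arrivals"] - net["departures"]  # assumed sign convention
net = net.reset_index()

print(net)
```

Because every trip is counted once as a departure and once as an arrival, the net rates over all stations and hours sum to zero, which is a handy sanity check.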
Once we have the net_rate feature created as our target variable, we need to start bringing all the relevant attributes into one dataframe.
So, we merge the station and trip data to create an intermediate df named df_station_netrate:
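A minimal sketch of this kind of merge is shown below. It assumes the hourly net rates sit in a table keyed by `station_id` and that the station table uses `id` as its key; both names, and all values, are hypothetical.

```python
import pandas as pd

# Toy station attributes; column names are assumed, values are made up.
station = pd.DataFrame({
    "id": [1, 2],
    "name": ["Station A", "Station B"],
    "lat": [37.79, 37.33],
    "long": [-122.40, -121.89],
    "city": ["City A", "City B"],
})

# Toy hourly net rates per station (the engineered target).
net = pd.DataFrame({
    "station_id": [1, 2, 2],
    "hour": pd.to_datetime(["2014-01-01 08:00", "2014-01-01 08:00",
                            "2014-01-01 09:00"]),
    "net_rate": [-1, 0, 1],
})

# Attach station attributes to every (station, hour) row.
# validate="many_to_one" raises if a station id is duplicated in `station`.
df_station_netrate = (
    net.merge(station, left_on="station_id", right_on="id",
              how="left", validate="many_to_one")
       .drop(columns="id")
)

print(df_station_netrate)
```

A left join keeps every target row even if a station were missing from the station file, so unmatched stations show up as missing attributes rather than silently disappearing.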
Let's see what this intermediate df 'df_station_netrate' looks like:
Note that we have not yet looked at the weather data, so let's see how to bring weather information into the intermediate df 'df_station_netrate' created above.
Missing values in weather data:
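A quick way to summarize missing values per column is sketched here on a made-up weather table. The column names are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Toy weather table; column names are assumed, values are made up.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-02",
                            "2014-01-03", "2014-01-04"]),
    "mean_temperature_f": [52.0, np.nan, 55.0, 54.0],
    "max_gust_speed_mph": [np.nan, 20.0, np.nan, 18.0],
    "events": [None, "Rain", None, None],
})

# Count and percentage of missing values in each column.
missing = pd.DataFrame({
    "n_missing": weather.isna().sum(),
    "pct_missing": (weather.isna().mean() * 100).round(1),
})

# Keep only columns that actually have gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values(
    "n_missing", ascending=False
)

print(missing)
```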
Note that the weather data has a zipcode, while the intermediate df 'df_station_netrate' has lat-long information, so we need a zipcode to lat-long mapping to be able to merge the weather data with the station data.
We resort to open-source data for this information: https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/export/
Now that we have the necessary mapping, we merge it with weather data:
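The merge could look roughly like the sketch below. The column names (`zip_code`, `zip`, `latitude`, `longitude`) are assumptions about both files, and the coordinates are approximate, illustrative values rather than rows copied from the mapping dataset.

```python
import pandas as pd

# Toy weather rows; zip codes are often read from CSV as integers.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-01"]),
    "zip_code": [94107, 95113],
    "mean_temperature_f": [52.0, 55.0],
})

# Toy zip-to-coordinates mapping; values are approximate and illustrative.
zip_map = pd.DataFrame({
    "zip": ["94107", "95113", "94301"],
    "latitude": [37.77, 37.33, 37.44],
    "longitude": [-122.39, -121.89, -122.15],
})

# Normalize the join key to a 5-character string on both sides,
# so that integer and string zip codes (and leading zeros) match.
weather["zip_code"] = weather["zip_code"].astype(str).str.zfill(5)
zip_map["zip"] = zip_map["zip"].astype(str).str.zfill(5)

weather_geo = (
    weather.merge(zip_map, left_on="zip_code", right_on="zip", how="left")
           .drop(columns="zip")
)

print(weather_geo)
```

Casting the key to a common type before merging matters: joining an integer column against a string column either fails or matches nothing, depending on the pandas version.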
Now, we can merge this weather information with df_station_netrate, as our final training data preparation step:
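The article does not spell out the join key here, so the following is only one plausible approach, not necessarily the one used in the original notebook: assign each station the weather zipcode whose coordinates are nearest to it, then join on zipcode and calendar date. All column names and values are hypothetical.

```python
import pandas as pd

# Toy stations and weather-zip coordinates (assumed names, made-up values).
stations = pd.DataFrame({
    "station_id": [1, 2],
    "lat": [37.79, 37.33],
    "long": [-122.40, -121.89],
})
zips = pd.DataFrame({
    "zip_code": ["94107", "95113"],
    "latitude": [37.77, 37.33],
    "longitude": [-122.39, -121.89],
})

# Squared distance in degree space between every station and every zip.
# This is crude, but adequate for picking the nearest of a few distant points.
d_lat = stations["lat"].to_numpy()[:, None] - zips["latitude"].to_numpy()[None, :]
d_lon = stations["long"].to_numpy()[:, None] - zips["longitude"].to_numpy()[None, :]
nearest = (d_lat ** 2 + d_lon ** 2).argmin(axis=1)
stations["zip_code"] = zips["zip_code"].to_numpy()[nearest]

# Toy intermediate frame and daily weather.
df_station_netrate = pd.DataFrame({
    "station_id": [1, 2, 2],
    "hour": pd.to_datetime(["2014-01-01 08:00", "2014-01-01 08:00",
                            "2014-01-01 09:00"]),
    "net_rate": [-1, 0, 1],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-01"]),
    "zip_code": ["94107", "95113"],
    "mean_temperature_f": [52.0, 55.0],
})

# Attach the zip to each row, derive the calendar date, then join the weather.
train = df_station_netrate.merge(
    stations[["station_id", "zip_code"]], on="station_id", how="left"
)
train["date"] = train["hour"].dt.normalize()
train = train.merge(weather, on=["zip_code", "date"], how="left")

print(train)
```

Since the weather in this sketch is daily while the target is hourly, every hour of a day receives the same weather values for a given station.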
We shall also create date type features as part of the feature engineering process:
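Typical date-type features can be derived with the pandas `.dt` accessor, as in the sketch below. The timestamp column name `hour` and the particular set of features are assumptions for illustration.

```python
import pandas as pd

# Toy frame with one timestamp per row (assumed column name).
df = pd.DataFrame({
    "hour": pd.to_datetime(["2014-01-01 08:00", "2014-01-04 17:00"]),
})

df["year"] = df["hour"].dt.year
df["month"] = df["hour"].dt.month
df["day"] = df["hour"].dt.day
df["day_of_week"] = df["hour"].dt.dayofweek   # Monday=0, Sunday=6
df["hour_of_day"] = df["hour"].dt.hour
df["is_weekend"] = df["day_of_week"] >= 5     # Saturday or Sunday

print(df)
```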
Finally, we have one flat file to train a model on. So, we have seen that training data preparation and feature engineering are the most arduous tasks in creating a machine learning model.
The data and the source code are kept here.