VOOZH about

URL: https://www.analyticsvidhya.com/blog/2022/09/make-model-training-and-testing-easier-with-multitrain/

โ‡ฑ Make Model Training and Testing Easier with MultiTrain


India's Most Futuristic AI Conference Is Back โ€“ Bigger, Sharper, Bolder

  • d
  • :
  • h
  • :
  • m
  • :
  • s

Reading list

Make Model Training and Testing Easier with MultiTrain

SAMSON Last Updated : 12 Oct, 2024
5 min read

This article was published as a part of the Data Science Blogathon.

Introduction

For data scientists and machine learning engineers, developing and testing machine learning models may take a lot of time. For instance, you would need to write a few lines of code, wait for each model to run, and then go on to the following five models to train and test. This may be rather tedious. However, when I encountered a similar issue, I became so frustrated that I began to devise a way to make things simpler for myself. After four months of hard work โ€“ coding, and bug-fixing, Iโ€™m happy to share my solution with you.

MultiTrain is a Python module I created that allows you to train many machine learning models on the same dataset to analyze performance and choose the best models. The content of this article will show you how to utilize MultiTrain for a basic regression problem. Please visit here to learn how to use it for classification problems.

Now, code alongside me as we simplify the model training and testing of machine learning models with MultiTrain.

Importing Libraries and Dataset

Today, we will be testing which machine learning model would work best for the productivity prediction of garment employees. The dataset is available here on Kaggle for download.

To start working on our dataset, we need to import some libraries.

To import our dataset, you must ensure it is in the same path as your ipynb file, or else you will have to set a file path.

Python Code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('garments_worker_productivity.csv')
print(df.head())

Now that we have imported our dataset into our Jupyter notebook, we want to see the first five rows of the dataset. You need to use a single line of code for that.

Note: Not all columns are shown here; only a few we would be working with are present in this snapshot except the output column.

Data Preprocessing

We wouldnโ€™t be employing any primary data preprocessing techniques or EDA as our focus is on how to train and test lots of models at once using the MultiTrain library. I strongly encourage you to perform some major preprocessing techniques on your dataset, as dirty data can affect your modelโ€™s predictions. It also affects the performance of machine learning algorithms.

While you check the first five rows, you should find a column named โ€œdepartmentโ€, in which sewing is spelt as sweing. We can fix this spelling mistake with this line of code.

df["department"] = df["department"].str.replace('sweing', 'sewing')

We can see in this snapshot above that the spelling mistake is now corrected.

When you run the following lines of code, you will discover that the โ€œdepartmentโ€ column has some duplicate values that we have to take care of, and we will also need to fix that before we can start predicting.

print(f'Unique Values in Department before cleaning: {df.department.unique()}')
Output:
Unique Values in Department before cleaning: ['sewing' 'finishing ' 'finishing']

To fix this problem:

df['department'] = df.department.str.strip()
print(f'Unique Values in Department after cleaning: {df.department.unique()}')
Output:
Unique Values in Department before cleaning: ['sewing', 'finishing']

Letโ€™s replace all missing values in our dataset with an integer value of 0

for i in df.columns:
    if df[i].isnull().sum() != 0:
        df[i] = df[i].fillna(0)

Most scikit-learn algorithms canโ€™t handle categorical data, so we need to convert the columns with categorical data to numerical with label encoding. There are several other methods of encoding categorical data, e.g. ordinal encoding, target encoding, binary encoding, frequency encoding, mean encoding, and others. The listed methods have advantages and disadvantages, and itโ€™s advisable to test as many as possible to see which works best for your dataset in terms of performance.

As I mentioned previously, we would use a label encoder for this tutorial to encode our categorical columns. Firstly, we have to get a list of our categorical columns. We can do that using the following lines of code.

cat_col = []
num_col = []
for i in df.columns:
    if df[i].dtypes == object:
        cat_col.append(i)
    else:
        num_col.append(i)
#remove the target columns
num_col.remove('actual_productivity')

Now that we have gotten a list of our categorical columns inside the cat_col variable. We can now apply our label encoder to it to encode the categorical data into numerical data.

label = LabelEncoder()
    for i in cat_col:
        df[i] = label.fit_transform(df[i])

All missing values formerly indicated by NaN in column โ€˜wipโ€™ has now changed to 0 and the three categorical columns โ€“ quarter, department, and day have all been label encoded.

You may still need to fix outliers in the dataset and do some feature engineering on your own.

Model Training

Before we can begin model training, we will need to split our dataset into its training features and labels.

features = df.drop('actual_productivity', axis=1)
labels = df['actual_productivity']

Now, we would need to split the dataset into training and tests. The training sets are used to train the machine learning algorithms and the test sets are used to evaluate their performance.

train = MultiRegressor(random_state=42,
                       cores=-1,
                       verbose=True)
split = train.split(X=features,
 y=labels,
 sizeOfTest=0.2,
 randomState=42,
 normalize='StandardScaler',
 columns_to_scale=num_col,
 shuffle=True)

The normalize parameter in the split method allows you to scale your numerical columns by just passing in any scalers of your choice; the columns_to_scale parameter then receives a list of the columns youโ€™d like to scale instead of having to scale all columns automatically.

After splitting the features and labels into train, test is appended to a variable named split. This variable then holds X_train, X_test, y_train, and y_test; we would need it in the next function below.

fit = train.fit(X=features,
 y=labels,
 splitting=True,
 split_data=split)
๐Ÿ‘ Image

Run this code in your notebook to view the full model list and scores.

Visualize Model Results

For the sake of people who might prefer to view the model performance results in charts rather than a dataframe. Thereโ€™s also an option available for you to convert the dataframe into charts. All you have to do is to run the code below.

train.show(param=fit,
          t_split=True)

Conclusion

MultiTrain exists to help data scientists and machine learning engineers make their job easy. It eliminates repetition and the boring process. With just a few lines of code, you can get your model training and testing immediately.

The assessment metrics shown on the dataframe might also differ based on the problem youโ€™re attempting to solve, such as multiclass, binary, classification, regression, imbalanced datasets, or balanced datasets. You have even more freedom to work on these challenges by passing values to parameters in different methods rather than writing extensive lines of code.

After fitting the models, the results generated in the dataframe shouldnโ€™t be your final results. Since MultiTrain aims to determine the best models that work best for your particular use case, you should pick the best models and perform some hyperparameter tuning on them or even feature engineering on your dataset to further boost the performance.

The media shown in this article is not owned by Analytics Vidhya and is used at the Authorโ€™s discretion.

I am a machine learning engineer with 1 year of experience building and deploying machine learning models to production. I'm also a writer, professional photographer and an avid fan of music.

I'm currently a final year student at the University of Ilorin, Nigeria.

Login to continue reading and enjoy expert-curated content.

Free Courses

Exploratory Data Analysis with Python & GenAI

Learn EDA with Python: Transform data into insights using PandasAI & more.

Data Science Course

Build a powerful 2026-ready data science resume using AI tools.

No Code Predictive Analytics with Orange

No-code AI course for business pros with real-world ML use cases.

Adaptive Email Agents with DSPy

Build adaptive email agents with DSPy using context and smart learning.

Introduction to AI & ML

AI & ML are transforming industries. Learn their impacts in this course.

Responses From Readers

Chukwuma Jonathan

This is amazing. I tagged along and the result was amazing. Good job Shittu.

Amazing stuff. Great job Shittu.

Amazing stuff. Great job Shittu.

Flagship Programs

GenAI Pinnacle Program| GenAI Pinnacle Plus Program| AI/ML BlackBelt Program| Agentic AI Pioneer Program

Free Courses

Generative AI| DeepSeek| OpenAI Agent SDK| LLM Applications using Prompt Engineering| DeepSeek from Scratch| Stability.AI| SSM & MAMBA| RAG Systems using LlamaIndex| Building LLMs for Code| Python| Microsoft Excel| Machine Learning| Deep Learning| Mastering Multimodal RAG| Introduction to Transformer Model| Bagging & Boosting| Loan Prediction| Time Series Forecasting| Tableau| Business Analytics| Vibe Coding in Windsurf| Model Deployment using FastAPI| Building Data Analyst AI Agent| Getting started with OpenAI o3-mini| Introduction to Transformers and Attention Mechanisms

Popular Categories

AI Agents| Generative AI| Prompt Engineering| Generative AI Application| News| Technical Guides| AI Tools| Interview Preparation| Research Papers| Success Stories| Quiz| Use Cases| Listicles

Generative AI Tools and Techniques

GANs| VAEs| Transformers| StyleGAN| Pix2Pix| Autoencoders| GPT| BERT| Word2Vec| LSTM| Attention Mechanisms| Diffusion Models| LLMs| SLMs| Encoder Decoder Models| Prompt Engineering| LangChain| LlamaIndex| RAG| Fine-tuning| LangChain AI Agent| Multimodal Models| RNNs| DCGAN| ProGAN| Text-to-Image Models| DDPM| Document Question Answering| Imagen| T5 (Text-to-Text Transfer Transformer)| Seq2seq Models| WaveNet| Attention Is All You Need (Transformer Architecture) | WindSurf| Cursor

Popular GenAI Models

Llama 4| Llama 3.1| GPT 4.5| GPT 4.1| GPT 4o| o3-mini| Sora| DeepSeek R1| DeepSeek V3| Janus Pro| Veo 2| Gemini 2.5 Pro| Gemini 2.0| Gemma 3| Claude Sonnet 3.7| Claude 3.5 Sonnet| Phi 4| Phi 3.5| Mistral Small 3.1| Mistral NeMo| Mistral-7b| Bedrock| Vertex AI| Qwen QwQ 32B| Qwen 2| Qwen 2.5 VL| Qwen Chat| Grok 3

AI Development Frameworks

n8n| LangChain| Agent SDK| A2A by Google| SmolAgents| LangGraph| CrewAI| Agno| LangFlow| AutoGen| LlamaIndex| Swarm| AutoGPT

Data Science Tools and Techniques

Python| R| SQL| Jupyter Notebooks| TensorFlow| Scikit-learn| PyTorch| Tableau| Apache Spark| Matplotlib| Seaborn| Pandas| Hadoop| Docker| Git| Keras| Apache Kafka| AWS| NLP| Random Forest| Computer Vision| Data Visualization| Data Exploration| Big Data| Common Machine Learning Algorithms| Machine Learning| Google Data Science Agent
๐Ÿ‘ Av Logo White

Continue your learning for FREE

Forgot your password?
๐Ÿ‘ Av Logo White

Enter OTP sent to

Edit

Wrong OTP.

Enter the OTP

Resend OTP

Resend OTP in 45s

๐Ÿ‘ Popup Banner
๐Ÿ‘ AI Popup Banner