URL: https://www.analyticsvidhya.com/blog/2022/02/auto-sklearn-accelerate-your-machine-learning-models-with-automl/

Auto-Sklearn: Accelerate your machine learning models with AutoML

Devashree Last Updated : 15 Mar, 2022
8 min read
This article was published as a part of the Data Science Blogathon.

Introduction

AutoML is a relatively new and upcoming subset of machine learning. The main approach in AutoML is to limit the involvement of data scientists and let the tool handle all time-consuming processes in machine learning like data preprocessing, best algorithm selection, hyperparameter tuning, etc., thus saving time for setting up these ML models and speeding up their deployment. There are several AutoML tools available in the market these days.

In one of my previous blogathon articles, I had shared a comprehensive guide to AutoML with an easy AutoGluon example. This guide included a list of several AutoML tools currently available in the market. These AutoML tools can undoubtedly save a good amount of time, especially for a large and complex dataset. We will explore one such tool called ‘Auto-Sklearn’ in this article.

What is Auto-Sklearn?

Anyone familiar with machine learning knows scikit-learn, the popular Python package that provides classification and regression algorithms for building machine learning models.

Auto-Sklearn is a Python-based open-source toolkit for doing AutoML. It employs the well-known Scikit-Learn machine learning package for data processing and machine learning algorithms. It also includes a Bayesian Optimization search technique to find the best model pipeline for the given dataset quickly. In this article, we’ll look at how to utilize Auto-Sklearn for classification and regression tasks.

Let us install the Auto-Sklearn package first.

pip install auto-sklearn

(If you are using Google Colab, make sure your SciPy version is up to date; otherwise, upgrade it with the pip command below and restart the runtime.)

pip install --upgrade scipy

Now that we have installed the AutoML tool, we will import the basic packages for preprocessing the dataset and visualization.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

Classification Task

We will use the heart disease prediction dataset available on the UCI repository. For convenience, let us use the .csv version of this data from Kaggle. You can also use any classification dataset of your choice or import a toy dataset available from the sklearn library.

Dataset details: This dataset contains 303 samples and 14 attributes (the original dataset has 76 features, while the .csv version contains a subset of 14 of them).
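
If the Kaggle file is not at hand, a toy dataset from scikit-learn can serve as a stand-in. The sketch below assumes the breast cancer dataset (`load_breast_cancer`), picked here only for illustration; it is not the dataset used in the rest of this article, and the names `toy_df`, `X_toy`, and `y_toy` are made up for this example.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects; .frame holds the features plus a 'target' column
toy = load_breast_cancer(as_frame=True)
toy_df = toy.frame

# same X/y separation pattern as used for the heart disease data
X_toy = toy_df.drop(['target'], axis=1)
y_toy = toy_df['target']
print(X_toy.shape, y_toy.shape)
```

Like the heart disease data, this is a binary classification problem, so the remaining steps should carry over with only the variable names changed.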

Importing the dataset and printing first few rows

df=pd.read_csv('/content/heart.csv')
df.head()

Let us check the target variable ‘target’ in the dataset.

df['target'].value_counts()

There are only two classes (0 = healthy, 1 = heart disease), so this is a binary classification problem. The class counts are also unequal, which indicates an imbalanced dataset, so the accuracy score of the model will be less reliable. However, we will first feed the imbalanced dataset directly to the autosklearn classifier. Later, we will balance the number of samples in the two classes and test the accuracy again to see how the classifier performs.
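
To put a number on the imbalance, the relative class frequencies can be printed alongside the raw counts. A minimal sketch on made-up labels (the `labels` series below is hypothetical, not the heart disease data):

```python
import pandas as pd

# hypothetical labels: 6 samples of class 1 and 4 samples of class 0
labels = pd.Series([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], name='target')

counts = labels.value_counts()                 # raw count per class
shares = labels.value_counts(normalize=True)   # fraction of samples per class
imbalance_ratio = counts.max() / counts.min()  # majority count / minority count

print(shares)
print("imbalance ratio: %.2f" % imbalance_ratio)
```

A ratio close to 1 means the classes are nearly balanced; the further it is from 1, the less informative plain accuracy becomes.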

#creating X and y 
X=df.drop(['target'],axis=1)
y=df['target']
#split into train and test sets
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape,y_train.shape, y_test.shape
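
With imbalanced classes, a plain random split can shift the class ratio between the training and test sets. One common option, not used in this article's code, is the `stratify` argument of `train_test_split`. A sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic data: 100 samples, 60% of them in class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 60 + [0] * 40)

# stratify=y_demo keeps the class ratio approximately equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# fraction of class 1 in each split
print(y_tr.mean(), y_te.mean())
```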

Next, we will import the classification models from autosklearn using the following command.

import autosklearn.classification

Then we will create an instance of the AutoSklearnClassifier for the classification task.

automl = autosklearn.classification.AutoSklearnClassifier( time_left_for_this_task=5*60,per_run_time_limit=30,tmp_folder='/temp/autosklearn_classification_example_tmp')

Here, we set the maximum time for the whole search with the ‘time_left_for_this_task’ argument, assigning it 5*60 seconds, i.e., 5 minutes. If nothing is specified for this argument, the search will run for an hour, i.e., 60 minutes. We also limit each model evaluation to 30 seconds using the ‘per_run_time_limit’ argument.

The constructor accepts other arguments, such as n_jobs (number of parallel jobs), ensemble_size, and initial_configurations_via_metalearning, which can be used to fine-tune the classifier. By default, the search builds an ensemble of top-performing models. To reduce the risk of overfitting, we can disable the ensemble and the meta-learning initialization by setting ensemble_size=1 and initial_configurations_via_metalearning=0. We have left these out while setting up the classifier to keep the tutorial simple.
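
As a rough illustration of these settings, the sketch below collects them in a dictionary and works out how the time budget divides up. The value n_jobs=2 is a hypothetical choice, and the parameter names are assumed to match the installed auto-sklearn release (newer releases may deprecate ensemble_size, so it is worth checking the documentation for your version).

```python
# Keyword arguments mirroring the options discussed above.
# Assumption: these names match the installed auto-sklearn release.
search_kwargs = dict(
    time_left_for_this_task=5 * 60,             # total search budget in seconds
    per_run_time_limit=30,                      # budget per model evaluation in seconds
    n_jobs=2,                                   # hypothetical value: parallel jobs
    ensemble_size=1,                            # keep only the single best model
    initial_configurations_via_metalearning=0,  # no meta-learning initialization
)

# If every evaluation used its full time limit, this many would fit into the
# budget one after another; faster evaluations allow more (overhead ignored).
full_length_runs = (search_kwargs['time_left_for_this_task']
                    // search_kwargs['per_run_time_limit'])
print(full_length_runs)

# With auto-sklearn installed, the settings would be passed to the constructor:
# automl = autosklearn.classification.AutoSklearnClassifier(**search_kwargs)
```

Keeping the settings in one dictionary makes it easy to reuse the same budget for the classifier and the regressor.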

We will also provide a temporary path for the log to be saved, and we can use it to print the run details later.

Now, we will fit the classifier.

automl.fit(X_train, y_train)

The sprint_statistics() function summarizes the above search and the performance of the selected best model.

# sprint_statistics() returns a formatted string, so a plain print is enough
print(automl.sprint_statistics())

Alternatively, we can also print a leaderboard for all the models considered by the search, organized by their ranks using the following command.

print(automl.leaderboard())

The top two models selected by the classifier were Random forest and Passive_aggressive respectively.

Additionally, we can print the information about the considered models using the following command:

from pprint import pprint
pprint(automl.show_models())

Lastly, we can also print the final score of the ensemble and the confusion matrix using the following lines of code.

from sklearn.metrics import accuracy_score, confusion_matrix
# predict on the held-out test set first
y_pred = automl.predict(X_test)
# score of the final ensemble
m1_acc_score = accuracy_score(y_test, y_pred)
m1_acc_score
# confusion matrix: rows are true labels, columns are predicted labels
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True)

We can use the following command to separate healthy and unhealthy samples in the dataset.

from sklearn.utils import resample
healthy= df[df["target"]==0]
unhealthy=df[df["target"]==1]

As there are more unhealthy samples than healthy ones, we will oversample the healthy class (resampling with replacement) until both classes have the same number of samples. To adjust the skew, we can use the following commands:

up_sampled=resample(healthy, replace=True, n_samples=len(unhealthy), random_state=42)
up_sampled=pd.concat([unhealthy, up_sampled])
#check updated class counts
up_sampled['target'].value_counts()

We can also address the imbalance in the dataset with techniques such as SMOTE, ensemble learning (bagging, boosting), or the NearMiss algorithm. Additionally, we can use metrics such as F1-score, precision, and recall to evaluate the model’s performance.
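
The metrics mentioned above are all available in `sklearn.metrics`. A small sketch on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not the output of the classifier trained earlier):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# made-up labels for illustration (1 = heart disease, 0 = healthy)
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1, 0, 1]

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)
f1 = f1_score(y_true_demo, y_pred_demo)                # harmonic mean of the two

print(cm)
print("precision=%.2f recall=%.2f f1=%.2f" % (precision, recall, f1))
```

For the real model, the same calls would take `y_test` and the predictions from `automl.predict(X_test)`.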

Now that we have adjusted the skew, we will create X and y sets for classification again. Let us name them X1 and y1 to avoid confusion.

X1=up_sampled.drop(['target'],axis=1)
y1=up_sampled['target']
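
One caveat: when rows are duplicated before the train/test split, copies of the same row can end up in both sets, which may inflate the test accuracy. An alternative, not the approach taken in this article, is to split first and oversample only the training portion. A sketch on a made-up DataFrame (`toy` and its columns are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# made-up data: 30 samples of class 0 and 70 samples of class 1
toy = pd.DataFrame({'feature': range(100), 'target': [0] * 30 + [1] * 70})

# split first, so the test set contains only original, unduplicated rows
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy['target'], random_state=42)

# oversample the minority class within the training set only
minority = train_df[train_df['target'] == 0]
majority = train_df[train_df['target'] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced['target'].value_counts())
```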

We need to repeat all the steps, from setting up the classifier to printing the confusion matrix, for this new X1 and y1, storing the resulting accuracy as m2_acc_score. Complete code for this task is available on my GitHub repository.

Finally, we can compare the two accuracies for the skewed data and the adjusted data using:

model_eval = pd.DataFrame({'Model': ['skewed','adjusted'], 'Accuracy': [m1_acc_score,m2_acc_score]})
model_eval = model_eval.set_index('Model').sort_values(by='Accuracy',ascending=False)
fig = plt.figure(figsize=(12, 4))
gs = fig.add_gridspec(1, 2)
gs.update(wspace=0.8, hspace=0.8)
ax0 = fig.add_subplot(gs[0, 0])
sns.heatmap(model_eval,cmap="PiYG",annot=True,fmt=".1%", linewidths=4,cbar=False,ax=ax0)
plt.show()

From the above chart, the model accuracy has dropped slightly after oversampling, but the model is now trained on balanced classes. Although we have used quite a few additional commands for preprocessing the data and evaluating the results, running an AutoSklearn classifier requires only a single line of code. Even with skewed data, the accuracy achieved by the model is good.

Regression Task

In this section, we will use the regression models from AutoSklearn.

For this task, let us use the simple ‘flights’ dataset from the seaborn datasets library. We will load the dataset with the following command.

#loading the dataset
df = sns.load_dataset('flights')
df.head()

Dataset details: This dataset contains 144 rows and 3 columns, namely year, month, and the number of passengers.

The task here is to predict the number of passengers using the other two features.

X=df.drop(["passengers"],axis=1)
y=df["passengers"]
X.shape, y.shape

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape,y_train.shape, y_test.shape

We now use AutoSklearnRegressor for this regression task.

import autosklearn.regression
from sklearn.metrics import mean_absolute_error
# auto-sklearn's own MAE scorer (imported for reference; not used below)
from autosklearn.metrics import mean_absolute_error as auto_mean_absolute_error
# create and fit the regressor with the same time budget as the classifier
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=5*60, per_run_time_limit=30,
    tmp_folder='/temp/autosklearn_regression_example_tmp')
automl.fit(X_train, y_train)

Now, let us print the statistics of the model.

# summarize
print(automl.sprint_statistics())

From the above-printed summary, we understand that the regressor ran a total of 59 models, and the calculated performance of the final regression model was R2 of 0.985, which is quite good.

Since the regressor has optimized the R2 metric by default, let us print the mean absolute error to evaluate the performance of the model better.

# evaluate the best model
y_pred = automl.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print("MAE: %.3f" % mae)

Given the R2 value achieved by the model and the small size of the example dataset used for this task, the mean absolute error is acceptable.
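
To make the two metrics concrete, the sketch below computes MAE and R2 by hand on four made-up passenger counts and checks them against scikit-learn. The arrays are hypothetical and unrelated to the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# made-up actual and predicted passenger counts
passengers_true = np.array([100.0, 150.0, 200.0, 250.0])
passengers_pred = np.array([110.0, 140.0, 210.0, 230.0])

# MAE: average absolute difference between actual and predicted values
mae_manual = np.mean(np.abs(passengers_true - passengers_pred))

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((passengers_true - passengers_pred) ** 2)
ss_tot = np.sum((passengers_true - passengers_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MAE: %.3f" % mae_manual)
print("R2: %.3f" % r2_manual)
print(np.isclose(mae_manual, mean_absolute_error(passengers_true, passengers_pred)))
print(np.isclose(r2_manual, r2_score(passengers_true, passengers_pred)))
```

MAE is expressed in the units of the target (passengers here), while R2 is unitless, which is why the two are easier to interpret together.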

We can also plot the predicted values against the actual values using matplotlib as shown below.

plt.figure(figsize=(8,6))
plt.scatter(y_test, y_pred, c='blue', label='Predictions')
p1 = max(max(y_pred), max(y_test))
p2 = min(min(y_pred), min(y_test))
# the red diagonal marks where the predicted value equals the actual value
plt.plot([p1, p2], [p1, p2], 'r-', label='Perfect prediction')
plt.xlabel('Actual', fontsize=10)
plt.ylabel('Predicted', fontsize=10)
plt.legend()
plt.axis('equal')
plt.show()

Overall, the MAE value is small and the model achieved a high validation score (R2 of 0.985), indicating that the model performs well.

Saving the trained models

The above-trained models for classification and regression can be saved using the Python packages pickle and joblib. These saved models can then be used to make predictions directly on new data. We can save the models as:

1. Using Pickle

import pickle
# save the fitted model (here, the automl object trained above)
filename = 'final_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(automl, f)

Here, the ‘wb’ argument means that we are writing the file to disk in binary mode. Further, we can load this saved model as:

#load the model
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, y_test)
print(result)

Here, the ‘rb’ argument indicates that we are reading the file in binary mode.

2. Using JobLib

Similarly, we can save the trained models with joblib using the following commands.

import joblib
# save the fitted model (here, the automl object trained above)
filename = 'final_model.sav'
joblib.dump(automl, filename)

We can also reload these saved models later for predictions on new data.

# load the model from disk
load_model = joblib.load(filename)
result = load_model.score(X_test, y_test)
print(result)
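
A quick way to gain confidence in a saved model is to check that it makes the same predictions after reloading. The sketch below does this with a small scikit-learn `LinearRegression` as a stand-in; the assumption is that a fitted auto-sklearn model round-trips the same way, and the file names and data are made up for illustration.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# small stand-in model fitted on made-up data
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib_path = os.path.join(tmp_dir, 'demo_model.joblib')

    # pickle round trip; the with-blocks close the files automatically
    with open(pickle_path, 'wb') as f:
        pickle.dump(demo_model, f)
    with open(pickle_path, 'rb') as f:
        from_pickle = pickle.load(f)

    # joblib round trip
    joblib.dump(demo_model, joblib_path)
    from_joblib = joblib.load(joblib_path)

# the reloaded models should reproduce the original predictions
original = demo_model.predict(X_demo)
same_pickle = np.allclose(original, from_pickle.predict(X_demo))
same_joblib = np.allclose(original, from_joblib.predict(X_demo))
print(same_pickle, same_joblib)
```

Pickled files should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.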

Conclusion

In this article, we saw the application of Auto-Sklearn to both classification and regression tasks. For both tasks, we did not need to specify a particular algorithm. Instead, the tool itself iterated through several built-in algorithms and achieved good results (high accuracy in the classification task and a low mean absolute error in the regression task). Thus, AutoSklearn can be a valuable tool to build better machine learning models with a few lines of code. The complete tutorial for this article is available on my GitHub repository.

Author Bio

Devashree has an M.Eng degree in Information Technology from Germany and a Data Science background. As an Engineer, she enjoys working with numbers and uncovering hidden insights in diverse datasets from different sectors for creating beautiful visualizations to solve interesting real-world machine learning problems.

In her spare time, she loves to cook, read & write, discover new Python-Machine Learning libraries or participate in coding competitions.

You can follow her on LinkedIn, GitHub, Kaggle, Medium, Twitter.
