URL: https://thenewstack.io/tutorial-speed-ml-training-with-the-intel-oneapi-ai-analytics-toolkit/

2022-03-04 11:32:42
Tutorial: Speed ML Training with the Intel oneAPI AI Analytics Toolkit

The objective of this guide is to highlight how Modin and Scikit-learn extensions are drop-in replacements for the stock Pandas and Scikit-learn libraries.
Mar 4th, 2022 11:32am by Janakiram MSV

In the last post, I introduced Intel Distribution of Modin and Intel Extension for Scikit-learn, integral parts of the Intel oneAPI AI Analytics Toolkit, and the overall Intel AI Software suite.

Let’s take a closer look at Modin and Scikit-learn extensions through this tutorial. The objective of this guide is to highlight how Modin and Scikit-learn extensions are drop-in replacements for the stock Pandas and Scikit-learn libraries. You can try this tutorial either in Intel DevCloud or on your workstation.

For this tutorial, I provisioned an e2-standard-4 VM on Google Compute Engine with 4 vCPUs and 16GB RAM based on the Intel Broadwell platform. It comes with Python 3.8 preinstalled, which I used as the runtime for this project.

We will train a model to detect fraudulent transactions based on the Fraud Transaction Detection dataset from Kaggle. It’s a ~500MB CSV file with over 6 million rows of data, making it an ideal candidate for Modin. This gives us a chance to compare the load times of Modin vs. Pandas. Before starting the project, download the dataset and copy it to the training environment.


The training algorithm is based on Nearest Neighbors, a technique that underlies both classification and regression models. Note that Scikit-learn's NearestNeighbors class, which we use here, is an unsupervised neighbor-search estimator. We will train the model twice, once with stock Scikit-learn and once with Intel Extension for Scikit-learn, to compare training speed.
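To make the nearest-neighbor idea concrete, here is a minimal pure-Python sketch of a majority vote among the k closest points. It is a toy illustration only, not how Scikit-learn or the Intel extension implement the search; the function name knn_predict and the sample points are made up for this example.

```python
from collections import Counter
from math import dist  # available since Python 3.8


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # The most common label among the k neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up 2-D points: label 0 = legitimate, 1 = fraudulent (illustrative only).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 9.0), (8.5, 9.5), (9.0, 8.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.1, 1.0)))  # 0
print(knn_predict(points, labels, (8.4, 9.1)))  # 1
```

This brute-force version compares the query against every training point, which is why optimized neighbor searches matter on a dataset with millions of rows.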

Step 1: Configuring the Environment

Let’s start by installing pip and the required modules.

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

Now, install Intel Distribution of Modin, Intel Extension for Scikit-learn, and Jupyter.

pip install scikit-learn-intelex
pip install modin[all]
pip install jupyter
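Before moving on, it can help to confirm that the packages were actually installed. The sketch below uses only the standard library; the helper name installed_versions is hypothetical, and the distribution names are assumed to match the ones passed to pip above.

```python
from importlib import metadata

# Distribution names as passed to pip above.
PACKAGES = ["scikit-learn-intelex", "modin", "jupyter"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```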

Launch Jupyter Notebook and access it from the browser.

jupyter notebook --ip=0.0.0.0 --port=80

Step 2: Loading the Dataset and Measuring Performance

With the CSV file uploaded to your training environment, let’s load it into Modin and Pandas.

csv='PS_20174392719_1491204439457_log.csv'

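# Baseline: stock Pandas
import pandas as pd
%timeit pd.read_csv(csv)

# Modin: select the Dask engine before importing modin.pandas
import os
os.environ["MODIN_ENGINE"] = "dask"
from distributed import Client
client = Client()
# From here on, pd refers to Modin rather than stock Pandas
import modin.pandas as pd
%timeit pd.read_csv(csv)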

As we load the dataset, we also measure the time taken by prefixing each read_csv call with the %timeit magic function.
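%timeit is only available in IPython and Jupyter. In a plain Python script, a comparable measurement can be sketched with time.perf_counter. The helper best_of below is a hypothetical name, and the summation is just a stand-in workload; in this tutorial you would pass something like lambda: pd.read_csv(csv) instead.

```python
import time


def best_of(func, repeats=3):
    """Run func() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; replace with the call you want to time.
elapsed = best_of(lambda: sum(range(1_000_000)))
print(f"best of 3: {elapsed:.4f} s")
```

Taking the fastest of several runs reduces noise from other processes on the machine, though it is not identical to the statistics %timeit reports.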

In my environment, Pandas took ~12 seconds while Modin loaded the same dataset in ~6 seconds.


In this test, Intel Distribution of Modin loaded the dataset about twice as fast as Pandas. With larger datasets, Modin can deliver even more significant performance improvements.

Step 3: Preparing and Preprocessing the Dataset

Irrespective of how we loaded the dataset, we need to prepare and preprocess it before it can be used for training.

First, we will drop the columns that are not relevant to training.

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

df=pd.read_csv(csv)
df = df.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)

The type column in the dataset has five categories:
● CASH-IN
● CASH-OUT
● DEBIT
● PAYMENT
● TRANSFER

Let’s encode them into integers.

df['type'] = df['type'].astype('category')
type_encode = LabelEncoder()
df['type'] = type_encode.fit_transform(df.type)

Finally, we will one-hot encode the type values into separate indicator columns, append those columns to the dataset, and drop the original type column.

type_one_hot = OneHotEncoder()
type_one_hot_encode = type_one_hot.fit_transform(df.type.values.reshape(-1,1)).toarray()
ohe_variable = pd.DataFrame(type_one_hot_encode, columns = ["type_"+str(int(i)) for i in range(type_one_hot_encode.shape[1])])
df = pd.concat([df, ohe_variable], axis=1)
df = df.drop('type', axis = 1)
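To see what the two encoding steps produce, here is a small sketch that runs them on a made-up six-row sample rather than the real dataset. It uses stock Pandas and Scikit-learn; the sample values and the amount column are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Tiny made-up sample with the five transaction types (not the real dataset).
sample = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH-OUT", "DEBIT", "CASH-IN", "PAYMENT"],
    "amount": [10.0, 250.0, 90.0, 5.0, 40.0, 12.5],
})

# Step 1: label-encode; LabelEncoder assigns integers in sorted class order.
encoder = LabelEncoder()
sample["type"] = encoder.fit_transform(sample["type"])
print(list(encoder.classes_))

# Step 2: one-hot encode the integer codes into one 0/1 column per category.
one_hot = OneHotEncoder().fit_transform(sample[["type"]]).toarray()
columns = [f"type_{i}" for i in range(one_hot.shape[1])]
encoded = pd.concat(
    [sample.drop(columns="type"), pd.DataFrame(one_hot, columns=columns)],
    axis=1,
)
print(encoded)
```

Each row ends up with exactly one of the type_0 to type_4 columns set to 1, so the model does not read an ordering into the integer codes.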

Since some of the values in the dataset are null, we will perform data imputation by replacing them with zeros.

df = df.fillna(0)

The dataset is now ready for training.


Step 4: Training the Model and Measuring the Performance

Before kicking off the training process, let’s separate the features and labels and then split the data into train and test datasets.

features = df.drop('isFraud', axis = 1).values
target = df['isFraud'].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42, stratify = target)

This reserves 30% of the data for testing and uses the remaining 70% for training.
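The stratify argument matters for an imbalanced label such as isFraud, because it keeps the share of positive rows about the same in both splits. The sketch below checks this on synthetic labels; the 5% positive rate and the toy_ variable names are assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 50 positive rows out of 1,000 (illustrative only).
toy_features = np.arange(1000).reshape(-1, 1)
toy_target = np.array([1] * 50 + [0] * 950)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_target, test_size=0.3, random_state=42, stratify=toy_target
)

# With stratify set, both splits keep roughly the original 5% positive rate.
print(len(toy_X_train), len(toy_X_test))
print(toy_y_train.mean(), toy_y_test.mean())
```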

First, let’s train the model with Scikit-learn and measure the performance.

from sklearn.neighbors import NearestNeighbors
knn_classifier = NearestNeighbors(n_neighbors=3)
%timeit knn_classifier.fit(X_train, y_train)

Once it is done, we will repeat the step with Intel Extension for Scikit-learn. Notice that we are explicitly loading the sklearnex module and importing NearestNeighbors.

from sklearnex.neighbors import NearestNeighbors
knn_classifier = NearestNeighbors(n_neighbors=3)
%timeit knn_classifier.fit(X_train, y_train)
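As an alternative to importing from sklearnex directly, the extension also provides patch_sklearn(), which swaps in the optimized implementations behind the regular sklearn imports. The sketch below wraps it in a hypothetical helper, enable_intel_patch, that falls back to stock Scikit-learn when the extension is not installed.

```python
def enable_intel_patch():
    """Try to patch scikit-learn with Intel's optimized implementations."""
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        # scikit-learn-intelex is not installed; stock scikit-learn stays in use.
        return False
    patch_sklearn()
    return True


patched = enable_intel_patch()
print("Intel Extension for Scikit-learn active:", patched)
```

Call the helper before importing any estimators, so that a later import such as from sklearn.neighbors import NearestNeighbors picks up the patched version.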


In my environment, stock scikit-learn took 23.8 seconds while Intel Extension for Scikit-learn finished training in only 5.72 seconds, a speedup of over 4X. Though the results may vary on your machine, it is evident that Intel Extension for Scikit-learn is significantly faster than stock Scikit-learn. It accelerates training on general-purpose x86 CPUs without the need for expensive AI accelerators such as GPUs and FPGAs.
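The speedup figures quoted in this tutorial are simply the baseline time divided by the optimized time. A short sketch of that arithmetic, using the timings reported above (the function name speedup is just for illustration):

```python
def speedup(baseline_seconds, optimized_seconds):
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds


# Timings reported in this tutorial.
print(f"Modin vs. Pandas read_csv: {speedup(12, 6):.2f}x")       # 2.00x
print(f"sklearnex vs. stock fit:   {speedup(23.8, 5.72):.2f}x")  # 4.16x
```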

Janakiram MSV (Jani) is a practicing architect, research analyst, and advisor to Silicon Valley startups. He focuses on the convergence of modern infrastructure powered by cloud-native technology and machine intelligence driven by generative AI. Before becoming an entrepreneur, he spent...