URL: https://thenewstack.io/tutorial-use-the-amazon-sagemaker-python-sdk-to-train-automl-models-with-autopilot/

Tutorial: Use the Amazon SageMaker Python SDK to Train AutoML Models with Autopilot



In this tutorial, we will take a closer look at the Python SDK to script an end-to-end workflow to train and deploy a model. We will use batch inferencing and store the output in an Amazon S3 bucket.
Feb 28th, 2020 8:23am by Janakiram MSV

In the last tutorial, we saw how to use Amazon SageMaker Studio to create models through Autopilot.

In this installment, we will take a closer look at the Python SDK to script an end-to-end workflow to train and deploy a model. We will use batch inferencing and store the output in an Amazon S3 bucket.

The walkthrough is based on the same dataset and problem type discussed in the previous tutorial.

Follow the steps in the previous tutorial to configure and set up the environment for Autopilot. Then launch a new Jupyter notebook to run the Python code that uses the SDK.

import sagemaker
import boto3
from sagemaker import get_execution_role

region = boto3.Session().region_name

session = sagemaker.Session()
bucket = session.default_bucket()
print(bucket)
prefix = 'sagemaker/termdepo'

role = get_execution_role()

sm = boto3.Session().client(service_name='sagemaker',region_name=region)

This step initializes the environment and returns the default S3 bucket associated with SageMaker.

!wget -N https://datahub.io/machine-learning/bank-marketing/r/bank-marketing.csv
local_data_path = 'bank-marketing.csv'

This downloads the bank marketing dataset from datahub.io.

import pandas as pd

data = pd.read_csv(local_data_path)
pd.set_option('display.max_columns', 500) 
pd.set_option('display.max_rows', 10) 
data

This loads the dataset into a pandas DataFrame and displays it in a grid so that we can verify it.


# Hold out 20% of the rows for testing; a fixed seed keeps the split reproducible
train_data = data.sample(frac=0.8, random_state=200)
test_data = data.drop(train_data.index)
# The test set is used for inference, so remove the label column
test_data = test_data.drop(columns=['Class'])

# Autopilot needs the header row to locate the target column
train_file = 'train_data.csv'
train_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print('Train data uploaded to: ' + train_data_s3_path)

# The batch transform input is written without a header row
test_file = 'test_data.csv'
test_data.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print('Test data uploaded to: ' + test_data_s3_path)

We split the dataset 80/20 into training and test sets, drop the label column from the test set, and upload both files to the S3 bucket.
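The hold-out logic can also be sketched without pandas. The snippet below is a minimal illustration under stated assumptions, not the code used in this tutorial: `split_rows` and `drop_label` are hypothetical helpers, and the rows are made-up stand-ins for the real records. It shows the two properties the cell above relies on: the training and test sets do not overlap, and a fixed seed makes the split reproducible.

```python
import random


def split_rows(rows, train_frac=0.8, seed=200):
    """Return (train, test) lists; hypothetical helper mirroring sample() + drop()."""
    rng = random.Random(seed)  # fixed seed gives the same split on every run
    indices = list(range(len(rows)))
    train_idx = set(rng.sample(indices, int(len(rows) * train_frac)))
    train = [rows[i] for i in indices if i in train_idx]
    test = [rows[i] for i in indices if i not in train_idx]
    return train, test


def drop_label(rows, label="Class"):
    """Remove the label column, as is done for the test set before inference."""
    return [{k: v for k, v in row.items() if k != label} for row in rows]


# Made-up rows standing in for the real dataset.
rows = [{"V1": i, "Class": i % 2 + 1} for i in range(10)]
train, test = split_rows(rows)
test_features = drop_label(test)
print(len(train), len(test))
```

Calling `split_rows` twice with the same seed returns the same partition, which is what `random_state=200` provides in the pandas version.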

Now that the dataset is ready, we will define the input, output, and job configuration of an Autopilot experiment.

input_data_config = [{
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://{}/{}/train'.format(bucket, prefix)
        }
    },
    'TargetAttributeName': 'Class'
}]

job_config = {
    'CompletionCriteria': {
        'MaxRuntimePerTrainingJobInSeconds': 600,
        'MaxAutoMLJobRuntimeInSeconds': 3600
    },
}

output_data_config = {
    'S3OutputPath': 's3://{}/{}/output'.format(bucket, prefix)
}

problem_type = 'BinaryClassification'
job_objective = {'MetricName': 'F1'}

This cell contains the most critical parameters for an Autopilot experiment. It specifies where the training dataset is located, which column holds the label, where the final artifacts will be uploaded, and the completion criteria for the job (a time limit per training job and for the experiment as a whole). It also sets the problem type and the metric used to evaluate the performance of the model.
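To reuse the same configuration for other datasets, the three dictionaries can be wrapped in a small function. The following is a sketch, assuming the same key names as the cell above; `build_autopilot_config` is a hypothetical helper, `example-bucket` is a placeholder, and the check that the per-training-job limit does not exceed the overall limit is our own sanity check rather than something the tutorial's code performs.

```python
def build_autopilot_config(bucket, prefix, target,
                           max_job_seconds=3600, max_training_seconds=600):
    """Assemble the input, job, and output config dicts (hypothetical helper)."""
    # Our own sanity check: a single training job should fit inside the whole job.
    if max_training_seconds > max_job_seconds:
        raise ValueError("per-training-job limit exceeds the overall job limit")

    base = "s3://{}/{}".format(bucket, prefix)
    input_data_config = [{
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": base + "/train"}
        },
        "TargetAttributeName": target,
    }]
    job_config = {
        "CompletionCriteria": {
            "MaxRuntimePerTrainingJobInSeconds": max_training_seconds,
            "MaxAutoMLJobRuntimeInSeconds": max_job_seconds,
        }
    }
    output_data_config = {"S3OutputPath": base + "/output"}
    return input_data_config, job_config, output_data_config


# "example-bucket" is a placeholder; in the notebook, pass the bucket variable.
inputs, job_cfg, outputs = build_autopilot_config(
    "example-bucket", "sagemaker/termdepo", "Class")
print(inputs[0]["DataSource"]["S3DataSource"]["S3Uri"])
print(outputs["S3OutputPath"])
```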

from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'termdepo' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      AutoMLJobConfig=job_config,
                      AutoMLJobObjective=job_objective,
                      ProblemType=problem_type,
                      RoleArn=role)

With the configuration in place, this call creates the AutoML job. The timestamp suffix keeps the job name unique across runs.
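Job names are easy to get wrong, so it can help to validate them before calling the API. The sketch below makes two assumptions that should be confirmed against the CreateAutoMLJob API reference: that names are limited to 32 characters, and that they may contain only letters, digits, and interior hyphens. `make_job_name` is a hypothetical helper, and unlike the cell above it puts a hyphen between the base name and the timestamp.

```python
import re
from time import gmtime, strftime

# Assumed constraints; confirm against the CreateAutoMLJob API reference.
MAX_NAME_LENGTH = 32
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def make_job_name(base, when=None):
    """Build '<base>-<day>-<hour>-<minute>-<second>' and validate it (hypothetical helper)."""
    when = gmtime() if when is None else when
    name = base + "-" + strftime("%d-%H-%M-%S", when)
    if len(name) > MAX_NAME_LENGTH or not NAME_PATTERN.match(name):
        raise ValueError("invalid job name: " + name)
    return name


# gmtime(0) is the Unix epoch, used here only to get a repeatable example.
print(make_job_name("termdepo", gmtime(0)))
```

Because the suffix starts with the day of the month and omits the month and year, two jobs started at the same day-of-month and time in different months would get the same name. Adding `%m` to the format string avoids that at the cost of three more characters.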

print('JobStatus - Secondary Status')
print('------------------------------')

describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
job_run_status = describe_response['AutoMLJobStatus']

while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response['AutoMLJobStatus']

    print(describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
    sleep(30)

This cell will continue to print the status of the job every 30 seconds.
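The same polling pattern is used again later for the transform job, so it can be factored into one function. This is a minimal sketch rather than part of the SDK: `wait_for_status` is a hypothetical helper, the timeout is an addition the original loop does not have, and the `describe` argument is replaced here with a stand-in that replays canned responses so the snippet runs without AWS credentials.

```python
import time

TERMINAL_STATES = ("Failed", "Completed", "Stopped")


def wait_for_status(describe, status_key, poll_seconds=30, timeout_seconds=7200,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll describe() until status_key reaches a terminal state (hypothetical helper)."""
    deadline = clock() + timeout_seconds
    while True:
        status = describe()[status_key]
        print(status)
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                "job still {} after {} seconds".format(status, timeout_seconds))
        sleep(poll_seconds)


# Stand-in for the real describe call: replays three canned responses.
_responses = iter([
    {"AutoMLJobStatus": "InProgress"},
    {"AutoMLJobStatus": "InProgress"},
    {"AutoMLJobStatus": "Completed"},
])
final = wait_for_status(lambda: next(_responses), "AutoMLJobStatus",
                        sleep=lambda seconds: None)
```

In the notebook, the stand-in would be replaced with `lambda: sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)`, and the default `sleep` would be left in place.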


Once the job is complete, we can retrieve the data exploration notebook, candidate definition notebook, and the name of the candidate with the best model.

job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)

job_candidate_notebook = job['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']
job_data_notebook = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation']
job_best_candidate = job['BestCandidate']
job_best_candidate_name = job_best_candidate['CandidateName']

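print(job_candidate_notebook)
print(job_data_notebook)
print(job_best_candidate_name)

Run the copy commands in a separate cell, since the %%sh cell magic must be the first line of its cell.

%%sh -s $job_candidate_notebook $job_data_notebook
aws s3 cp $1 .
aws s3 cp $2 .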

This will download the Jupyter notebooks from the S3 bucket to the local environment.
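The notebook locations are S3 URIs, so they can also be downloaded from Python instead of the shell. The sketch below only covers the parsing step and runs offline; `split_s3_uri` is a hypothetical helper, and the URI shown is a made-up example rather than a real artifact path.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up URI for illustration; in the notebook, pass job_candidate_notebook.
bucket_name, key = split_s3_uri(
    "s3://example-bucket/sagemaker/termdepo/output/notebooks/Example.ipynb")
print(bucket_name)
print(key)
```

With the bucket and key in hand, `boto3.resource('s3').Bucket(bucket_name).download_file(key, local_name)` fetches the file, where `local_name` is whatever local filename we choose.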

In the next few steps, we will create the model from the best candidate, deploy it and perform batch inferencing.

model_name = 'automl-termdepo-model-' + timestamp_suffix

model = sm.create_model(Containers=job_best_candidate['InferenceContainers'],
                        ModelName=model_name,
                        ExecutionRoleArn=role)

print('Model ARN corresponding to the best candidate is : {}'.format(model['ModelArn']))

To perform batch inferencing, we create a transform job that reads the test dataset from the S3 bucket, sends it to the model line by line, and writes the predictions back to S3.

transform_job_name = 'automl-termdepo-transform-' + timestamp_suffix

transform_input = {
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': test_data_s3_path
        }
    },
    'ContentType': 'text/csv',
    'CompressionType': 'None',
    'SplitType': 'Line'
}

transform_output = {
    'S3OutputPath': 's3://{}/{}/inference-results'.format(bucket, prefix),
}

transform_resources = {
    'InstanceType': 'ml.m4.xlarge',
    'InstanceCount': 1
}

sm.create_transform_job(TransformJobName=transform_job_name,
                        ModelName=model_name,
                        TransformInput=transform_input,
                        TransformOutput=transform_output,
                        TransformResources=transform_resources)

Wait until the job status shows Completed.

print('JobStatus')
print('----------')

describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
job_run_status = describe_response['TransformJobStatus']
print(job_run_status)

while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
    job_run_status = describe_response['TransformJobStatus']
    print(job_run_status)
    sleep(30)


We can now download and print the output from the inferencing job.

# Batch transform names each output file after its input file, with '.out' appended
s3_output_key = '{}/inference-results/test_data.csv.out'.format(prefix)
local_inference_results_path = 'inference_results.csv'

s3 = boto3.resource('s3')
inference_results_bucket = s3.Bucket(session.default_bucket())

inference_results_bucket.download_file(s3_output_key, local_inference_results_path)

# The output has one prediction per line and no header row
data = pd.read_csv(local_inference_results_path, sep=';', header=None)
pd.set_option('display.max_rows', 10)
data
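The output key above is hard-coded, but it follows from the input file name and the output prefix. As a small sketch, assuming the `.out` naming convention used in the cell above, it could be derived instead; `transform_output_key` is a hypothetical helper.

```python
import posixpath


def transform_output_key(input_key, output_prefix):
    """Derive the result key for one input file (hypothetical helper)."""
    filename = posixpath.basename(input_key)
    # Assumes results are written as '<input file name>.out' under the output prefix.
    return posixpath.join(output_prefix, filename + ".out")


result_key = transform_output_key(
    "sagemaker/termdepo/test/test_data.csv",
    "sagemaker/termdepo/inference-results")
print(result_key)
```

Using `posixpath` rather than `os.path` keeps the forward slashes that S3 keys expect, regardless of the operating system the notebook runs on.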


This step concludes the tutorial on using the Python SDK to train and deploy models with SageMaker Autopilot.

Janakiram MSV’s Webinar series, “Machine Intelligence and Modern Infrastructure (MI2)” offers informative and insightful sessions covering cutting-edge technologies. Sign up for the upcoming MI2 webinar at http://mi2.live.

Janakiram MSV (Jani) is a practicing architect, research analyst, and advisor to Silicon Valley startups. He focuses on the convergence of modern infrastructure powered by cloud-native technology and machine intelligence driven by generative AI. Before becoming an entrepreneur, he spent...
Amazon Web Services is a sponsor of The New Stack.
TNS owner Insight Partners is an investor in: Class.