URL: https://thenewstack.io/explore-amazon-sagemaker-serverless-inference-for-deploying-ml-models/

Explore Amazon SageMaker Serverless Inference for Deploying ML Models - The New Stack


Explore Amazon SageMaker Serverless Inference for Deploying ML Models

Dec 21st, 2021 3:00am by Janakiram MSV
Feature image by congerdesign from Pixabay.
Palo Alto Networks sponsored this post.

Launched at the company’s re:Invent 2021 user conference earlier this month, Amazon Web Services’ Amazon SageMaker Serverless Inference is a new inference option for deploying machine learning models without configuring and managing the compute infrastructure. It brings some of the attributes of serverless computing, such as scale-to-zero and consumption-based pricing, to model hosting.

With serverless inference, SageMaker decides to launch additional instances based on the concurrency and the utilization of existing compute resources. The fundamental difference between the other mechanisms and serverless inference is how the compute infrastructure is provisioned, scaled, and managed. You don’t even need to choose an instance type or define the minimum and maximum capacity.

Amazon SageMaker Serverless Inference joins existing deployment mechanisms, including real-time inference, elastic inference, and asynchronous inference.

The Workflow of Deploying Models in SageMaker

At a high level, there are four steps involved in deploying models in SageMaker. Let’s take a look at them.

Figure: SageMaker architecture

1) Creating a Model – Whether you trained the model within SageMaker or brought an external pre-trained model, the first step is to register it with the platform. Amazon SageMaker expects the model artifact to be stored in an S3 bucket. The artifact is a tarball of a TensorFlow Saved Model, Keras HDF5, PyTorch .pth file, or an ONNX model. The artifact is then merged with a container image with the pre-configured inference code. SageMaker provides containers for built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. When creating a model, the tarball is uncompressed, and the model artifacts are copied to the /opt/ml/model directory, which is expected by the inference code. This container image becomes the fundamental unit of deployment for inference.
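As a rough illustration of this first step, the sketch below assembles the arguments for the CreateModel call in Boto3. The model name, image URI, bucket, and role ARN are placeholders invented for the example, not values from this article. The request builder is kept separate from the AWS call so it can be inspected without credentials.

```python
def build_create_model_request(model_name, image_uri, model_data_url, role_arn):
    """Return keyword arguments for SageMaker's CreateModel API."""
    # SageMaker expects the model tarball to live in S3.
    if not model_data_url.startswith("s3://"):
        raise ValueError("model artifact must be an S3 URI such as s3://bucket/model.tar.gz")
    return {
        "ModelName": model_name,
        "PrimaryContainer": {
            "Image": image_uri,               # container image with the inference code
            "ModelDataUrl": model_data_url,   # tarball that is unpacked into /opt/ml/model
        },
        "ExecutionRoleArn": role_arn,
    }


def register_model(request):
    """Send the request to SageMaker (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    return boto3.client("sagemaker").create_model(**request)


# All values below are hypothetical placeholders.
create_model_request = build_create_model_request(
    model_name="demo-model",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<framework-image>:<tag>",
    model_data_url="s3://example-bucket/model/model.tar.gz",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
```

Calling register_model(create_model_request) would perform the actual registration once the placeholders are replaced with real resources.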

2) Defining the Endpoint Configuration – Once the model is registered with SageMaker, the next step is to associate it with the hosting environment defined through the endpoint configuration. It acts as the blueprint for the endpoint, which may optionally support auto-scaling. Think of the SageMaker endpoint configuration as the launch configuration of Amazon EC2 auto-scaling groups. An endpoint configuration identifies the model and the associated infrastructure, including the model variant; the GPU accelerator type, such as ml.eia1.medium or ml.eia2.xlarge; the instance type, such as ml.t2.medium or ml.c5.4xlarge; and the initial number of instances.
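A conventional (instance-based) endpoint configuration might look roughly like the following in Boto3. The configuration name, model name, and variant name are assumptions made for the example, and the instance and accelerator types are simply the ones mentioned above; whether a given pairing is supported should be checked against the AWS documentation.

```python
def build_realtime_endpoint_config(config_name, model_name, instance_type,
                                   initial_instance_count=1, accelerator_type=None):
    """Return keyword arguments for SageMaker's CreateEndpointConfig API."""
    variant = {
        "VariantName": "AllTraffic",   # hypothetical variant name
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": initial_instance_count,
    }
    # The accelerator is optional, so it is only included when requested.
    if accelerator_type is not None:
        variant["AcceleratorType"] = accelerator_type
    return {"EndpointConfigName": config_name, "ProductionVariants": [variant]}


def create_endpoint_config(config):
    """Send the configuration to SageMaker (requires boto3 and AWS credentials)."""
    import boto3

    return boto3.client("sagemaker").create_endpoint_config(**config)


realtime_config = build_realtime_endpoint_config(
    config_name="demo-realtime-config",   # hypothetical
    model_name="demo-model",              # hypothetical, must match a registered model
    instance_type="ml.c5.4xlarge",
    accelerator_type="ml.eia2.xlarge",
)
```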


3) Creating an Endpoint – Whereas the previous step associates the model with the compute resources (container and instance type), this step creates the actual HTTP(S) endpoint used for invoking the model. Creating an endpoint is as simple as assigning an identifier and pointing it to the endpoint configuration defined in the previous step.
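In Boto3 terms, this step could be sketched as below. The endpoint and configuration names are hypothetical, and the waiter is included only because endpoint creation is asynchronous; this is one possible way to block until the endpoint is ready, not the only one.

```python
def build_create_endpoint_request(endpoint_name, config_name):
    """Return keyword arguments for SageMaker's CreateEndpoint API."""
    return {"EndpointName": endpoint_name, "EndpointConfigName": config_name}


def create_endpoint_and_wait(request):
    """Create the endpoint and block until it is in service (requires boto3 and AWS credentials)."""
    import boto3

    client = boto3.client("sagemaker")
    client.create_endpoint(**request)
    # Endpoint creation is asynchronous; the waiter polls until the endpoint is in service.
    client.get_waiter("endpoint_in_service").wait(EndpointName=request["EndpointName"])
    return client.describe_endpoint(EndpointName=request["EndpointName"])["EndpointStatus"]


# Both names are hypothetical placeholders.
endpoint_request = build_create_endpoint_request("demo-endpoint", "demo-realtime-config")
```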

4) Invoking an Endpoint – Once the endpoint is published, it can be invoked using the Python SDK or the AWS CLI. It can also be easily integrated with AWS Lambda and Amazon API Gateway to expose it as a standard REST API for clients to consume.
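A minimal invocation sketch using the Boto3 runtime client follows. The endpoint name is hypothetical, and the JSON payload shape ({"instances": [...]}) is an assumption borrowed from TensorFlow Serving's REST format; the actual request and response formats depend on the inference container behind the endpoint.

```python
import json


def build_invoke_request(endpoint_name, instances):
    """Return keyword arguments for the SageMaker runtime InvokeEndpoint API."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        # Assumed payload shape; adjust to whatever the inference container expects.
        "Body": json.dumps({"instances": instances}).encode("utf-8"),
    }


def invoke(request):
    """Call the endpoint and decode a JSON response (requires boto3 and AWS credentials)."""
    import boto3

    response = boto3.client("sagemaker-runtime").invoke_endpoint(**request)
    return json.loads(response["Body"].read())


# "demo-endpoint" and the input values are hypothetical.
invoke_request = build_invoke_request("demo-endpoint", [[1.0, 2.0, 3.0]])
```

The same invoke function could sit inside an AWS Lambda handler fronted by Amazon API Gateway, which is the integration path mentioned above.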

Changes with Serverless Inference

Luckily, the workflow doesn’t change when switching between the conventional real-time inference endpoint and the new serverless inference endpoint. The key difference comes in the second step of the workflow, where we define the endpoint configuration.

Instead of manually selecting the instance type, we let SageMaker pick the compute resources for us, based on the memory size specified in the endpoint configuration. A serverless endpoint supports memory sizes from 1024 MB (1 GB) to 6144 MB (6 GB) in 1 GB increments: 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. Serverless inference auto-assigns compute resources proportional to the memory you select, so larger memory sizes result in more vCPUs assigned to the container.

You can choose the endpoint’s memory size based on the model size. The rule of thumb is that the memory size should be at least as large as your model size.
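That rule of thumb can be expressed as a small helper that picks the smallest supported memory size that is at least as large as the model. The function name is made up for this sketch, and it deliberately encodes only the rule stated above; in practice a model usually needs extra headroom for the framework and request data.

```python
# Memory sizes supported by serverless endpoints, as listed in this article.
ALLOWED_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def choose_memory_size_mb(model_size_mb):
    """Return the smallest supported memory size that is >= the model size."""
    if model_size_mb <= 0:
        raise ValueError("model size must be positive")
    for size in ALLOWED_MEMORY_SIZES_MB:
        if size >= model_size_mb:
            return size
    raise ValueError(
        f"a {model_size_mb} MB model exceeds the {ALLOWED_MEMORY_SIZES_MB[-1]} MB maximum"
    )


small = choose_memory_size_mb(500)    # fits in the 1024 MB minimum
medium = choose_memory_size_mb(2500)  # rounds up to 3072 MB
```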

The other parameter that significantly affects compute resource allocation is concurrency. Serverless endpoints have a quota on how many invocations can be processed concurrently. If the endpoint is invoked again before it has finished processing the first request, it handles the second request concurrently.
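Putting memory size and concurrency together, a serverless endpoint configuration could be sketched in Boto3 as follows. The only structural difference from an instance-based configuration is that the production variant carries a ServerlessConfig block instead of an instance type and count. The names and the concurrency value of 5 are illustrative assumptions, and the account quota for concurrency is not checked here.

```python
ALLOWED_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def build_serverless_endpoint_config(config_name, model_name, memory_size_mb, max_concurrency):
    """Return keyword arguments for CreateEndpointConfig with a serverless variant."""
    if memory_size_mb not in ALLOWED_MEMORY_SIZES_MB:
        raise ValueError(f"memory size must be one of {ALLOWED_MEMORY_SIZES_MB}")
    if max_concurrency < 1:
        raise ValueError("max concurrency must be at least 1")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",  # hypothetical variant name
                "ModelName": model_name,
                # No InstanceType or InitialInstanceCount: SageMaker picks the compute.
                "ServerlessConfig": {
                    "MemorySizeInMB": memory_size_mb,
                    "MaxConcurrency": max_concurrency,
                },
            }
        ],
    }


def create_endpoint_config(config):
    """Send the configuration to SageMaker (requires boto3 and AWS credentials)."""
    import boto3

    return boto3.client("sagemaker").create_endpoint_config(**config)


serverless_config = build_serverless_endpoint_config(
    config_name="demo-serverless-config",  # hypothetical
    model_name="demo-model",               # hypothetical
    memory_size_mb=2048,
    max_concurrency=5,                     # illustrative value only
)
```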

Like other serverless environments, SageMaker serverless inference endpoints are subject to cold-start latency. Because compute resources are provisioned on demand, an endpoint that has not received traffic for a while has to spin up compute resources when new requests suddenly arrive, and those requests experience a cold start. A cold start can also occur when the number of concurrent requests exceeds the concurrency the endpoint is currently serving. The cold start time depends on the model size, how long it takes to download the model, and the start-up time of the container with the inference code.
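One informal way to observe this effect is to time the first request after an idle period against the requests that follow it. The helper below is a generic sketch with made-up function names; it times any callable, so the endpoint invocation is replaced here by a stand-in function. A slow first call only suggests a cold start and is not a precise measurement of it.

```python
import statistics
import time


def measure_latencies(call, attempts=5):
    """Time consecutive calls to `call` and return the latencies in seconds."""
    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def cold_start_overhead(latencies):
    """Return the first-call latency minus the median of the remaining calls."""
    if len(latencies) < 2:
        raise ValueError("need at least two measurements")
    return latencies[0] - statistics.median(latencies[1:])


def fake_invoke():
    """Stand-in for a real endpoint invocation."""
    time.sleep(0.001)


timings = measure_latencies(fake_invoke, attempts=3)
```

With a real endpoint, the callable would wrap the runtime invocation, and the measurement would start after the endpoint has been left idle for a while.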

During the preview, SageMaker serverless inference endpoints cannot be created through the Amazon SageMaker Python SDK. However, you can use the AWS SDK for Python (Boto3) from Jupyter notebooks to automate the creation of endpoints.

SageMaker serverless inference is available in preview in US East (Northern Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Asia Pacific (Tokyo), and Asia Pacific (Sydney).

In the next part of this series, we will look at the steps involved in publishing a SageMaker serverless inference endpoint for a TensorFlow model. Tune in tomorrow for the next installment.

Janakiram MSV (Jani) is a practicing architect, research analyst, and advisor to Silicon Valley startups. He focuses on the convergence of modern infrastructure powered by cloud-native technology and machine intelligence driven by generative AI. Before becoming an entrepreneur, he spent...
Amazon Web Services is a sponsor of The New Stack.
TNS owner Insight Partners is an investor in: Pragma, Docker.