VOOZH about

URL: https://thenewstack.io/running-llama-3-2-on-aws-lambda/

⇱ Running Llama 3.2 on AWS Lambda - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2024-11-04 08:30:27
Running Llama 3.2 on AWS Lambda
sponsor-nitric,sponsored-post-contributed,
AI / API Management / Large Language Models

Running Llama 3.2 on AWS Lambda

A step-by-step guide to deploying the lightweight AI model using Hugging Face and Nitric to manage infrastructure, such as API routes and deployments.
Nov 4th, 2024 8:30am by Jye Cusch
👁 Featued image for: Running Llama 3.2 on AWS Lambda
Image from Rad Radu on Shutterstock.
Nitric sponsored this post.
Llama 3.2 1B is a lightweight AI model that makes it interesting for serverless applications since it can be run relatively quickly without requiring GPU acceleration. We’ll use models from Hugging Face and Nitric to demonstrate using it and manage the surrounding infrastructure, such as API routes and deployments.

Prerequisites

Project Setup

Let’s start by creating a new project using Nitric’s Python starter template. Next, let’s install the base dependencies, then add the extra dependencies we need specifically for loading the language model.

Choose a Llama Model

Llama 3.2 is available in different sizes and configurations, each with its own trade-offs in terms of performance, accuracy and resource requirements. For serverless applications without GPU acceleration, such as AWS Lambda, it’s important to choose a model that is lightweight and efficient to ensure it runs within the constraints of that environment. We’ll use a quantized version of the lightweight Llama 1B model, specifically Llama-3.2-1B-Instruct-Q4_K_M.gguf. If you’re not familiar with quantization, it’s a technique that reduces a model’s size and resource requirements, which, in our case, makes it suitable for serverless applications but may affect the accuracy of the model. The LM Studio team provides several quantized versions of Llama 3.2 1B on Hugging Face. Consider trying different versions to find one that best fits your needs, such as Q5_K_M, which is slightly larger but of higher quality. Let’s download the chosen model and save it in a `models` directory in your project. Download link for Llama-3.2-1B-Instruct-Q4_K_M.gguf: Your folder structure should look like this:

Create a Service to Run the Model

Next, we’ll use Nitric to create an HTTP API that allows you to send prompts to the Llama model and receive the output in a response. The API will return the raw output from the model, but you can adjust this as you see fit. Replace the contents of `services/api.py` with the following code, which loads the Llama model and implements the prompt functionality. Take a little time to understand the code. It defines an API with a single endpoint /prompt that accepts a POST request with a prompt in the body. The `process_prompt` function sends the prompt to the Llama model and returns the response.

OK, Let’s Run This Thing!

Now that you have an API defined, we can test it locally. The Python starter template uses `python3.11-bookworm-slim` as its basic container image, which doesn’t have the right dependencies to load the Llama model; let’s update the Dockerfile to use `python3.11-bookworm` (the non-slim version) instead. Update line 2:
FROM python:3.11-bookworm
Update line 19:
FROMpython:3.11-bookworm
Now we can run our services locally: nitric run `nitric run` will start your application in a container that includes the dependencies to use `llama_cpp`. If you’d rather use `nitric start` you’ll need to install dependencies for `llama-cpp-python` such as CMake and LLVM. Once it starts, you can test it with the Nitric Dashboard. You can find the URL to the dashboard in the terminal running the Nitric CLI. By default it’s http://localhost:49152. Add a prompt to the body of the request and send it to the `/prompt` endpoint. 👁 Dashboard

Deploying to AWS

When you’re ready to deploy the project, we can create a new Nitric stack file that will target AWS: nitric stack new dev aws Update the stack file `nitric.dev.yaml` with the appropriate AWS region and memory allocation to handle the model. Since we’ll use Nitric’s default Pulumi AWS Provider, make sure you’re set up to deploy using it. You can find more information on how to set up the AWS Provider in the Nitric AWS Provider documentation. If you’d like to deploy with Terraform or to another cloud provider, that’s also possible. You can find more information about how Nitric can deploy to other platforms in the Nitric Providers documentation. You can then deploy using the following command: nitric up Take note of the API endpoint URL that is output after the deployment is complete. If you’re done with the project later, tear it down with `nitric down`.

Testing on AWS

To test the service, you can use any API testing tool you like, such as cURL, Postman, etc. Here’s an example using cURL: curl -X POST {your endpoint URL here}/prompt -d "Hello, how are you?"

Example Response

The response will include the results, plus other metadata. The output can be found in the `choices` array.

Summary

As you’ve seen in the code example, we’ve set up a fairly basic prompt structure, but you can expand on this to include more complex prompts, including system prompts that help restrict/guide the model’s responses or even more complex interactions with the model. Also, in this example, we expose the model directly as an API, but this limits the response time to 30 seconds on AWS with API Gateway. In future guides, we’ll show how you can go beyond simple one-time responses to more complex interactions, such as maintaining context between requests. We can also include Websockets and streamed responses to provide a better user experience for larger responses.
Nitric is the cloud-aware framework that enhances developer productivity and ops confidence, uniting backend and infrastructure code to build and ship cloud apps fast. Devs build your application, Platform determines the right infrastructure and Nitric automates provisioning that works for both.
Learn More
The latest from Nitric
Hear more from our sponsor
TRENDING STORIES
Jye Cusch, co-founder of Nitric, is an experienced software engineer with a history in FinTech innovation. Jye previously led engineering teams at prominent financial services firms including Temenos and Avoka Technologies. His hands-on experience with major banks and credit unions,...
Read more from Jye Cusch
Nitric sponsored this post.
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: Postman.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.