URL: https://thenewstack.io/how-to-get-started-running-small-language-models-at-the-edge/

2024-07-24 06:00:49
How To Get Started Running Small Language Models at the Edge
tutorial,
AI / Edge Computing / Software Development

How to set up Ollama on the Jetson Orin Developer Kit — a key step in configuring federated language models spanning the cloud and the edge.
Jul 24th, 2024 6:00am by Janakiram MSV
In my previous article, I introduced the idea of federated language models that take advantage of large language models (LLMs) running in the cloud and small language models (SLMs) running at the edge.

My goal is to run an SLM at the edge that can respond to user queries based on the context that the local tools provide. One of the ideal candidates for this use case is the Jetson Orin Developer Kit from Nvidia, which runs SLMs like Microsoft Phi-3.

In this tutorial, I will walk you through the steps involved in configuring Ollama, a lightweight model server, on the Jetson Orin Developer Kit, which takes advantage of GPU acceleration to speed up the inference of Phi-3. This is one of the key steps in configuring federated language models spanning the cloud and the edge.

What Is the Nvidia Jetson AGX Orin Developer Kit?

The Nvidia Jetson AGX Orin Developer Kit represents a significant leap forward in edge AI and robotics computing. The kit includes a high-performance Jetson AGX Orin module, capable of delivering up to 275 TOPS of AI performance, which is up to eight times the performance of its predecessor, the Jetson AGX Xavier. The developer kit is designed to emulate the performance and power characteristics of all Jetson Orin modules, making it a versatile tool for developers working on advanced robotics and edge AI applications across various industries.

At the heart of the developer kit is the Jetson AGX Orin module, featuring an Nvidia Ampere architecture GPU with 2048 CUDA cores and 64 tensor cores, alongside a 12-core Arm Cortex-A78AE CPU. The kit comes with a reference carrier board that exposes numerous standard hardware interfaces, enabling rapid prototyping and development. With options for 32GB or 64GB of memory, support for multiple concurrent AI inference pipelines, and power configurations ranging from 15W to 50W, the Jetson AGX Orin Developer Kit provides developers with a flexible and powerful platform for creating cutting-edge AI solutions in fields such as manufacturing, logistics, healthcare, and smart cities.

See also: our previous tutorial on running real-time object detection with Jetson Orin.

For this scenario, I am using the Jetson AGX Orin Developer Kit with 32GB of RAM and 64GB of eMMC storage. It runs JetPack 6.0, the latest version at the time of writing, which comes with various tools, including the CUDA runtime.

For this tutorial, the most important components of JetPack are Docker and the Nvidia Container Toolkit, which together let containers access the GPU.

Running Ollama on the Jetson AGX Orin Developer Kit

Ollama is a developer-friendly LLM infrastructure modeled after Docker. It’s already optimized to run on Jetson devices.

Similar to Docker, Ollama has two components: the server and the client. We will first install the client, which comes with a CLI that can talk to the inference engine.

wget https://github.com/ollama/ollama/releases/download/v0.2.8/ollama-linux-arm64

chmod +x ./ollama-linux-arm64

sudo mv ollama-linux-arm64 /usr/local/bin/ollama

The above commands download and install the Ollama client.

Verify the installation with the command below:

ollama --version

Now, we will run the Ollama inference server through a Docker container. This avoids any issues you may encounter while accessing the GPU.

docker run -d \
  --runtime nvidia \
  --name ollama \
  --network=host \
  -v ~/models:/models \
  -e OLLAMA_MODELS=/models \
  dustynv/ollama:r36.2.0 ollama serve

This command launches the Ollama server on the host network, so the client can talk to the engine directly. The server listens on port 11434 and exposes an OpenAI-compatible REST endpoint.
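
To make the endpoint concrete, here is a minimal sketch of the HTTP request that an OpenAI-style client would send to the server. It assumes the standard OpenAI chat completions path (/v1/chat/completions) on the port mentioned above. The helper names build_chat_request and send, and the host value, are hypothetical, and the model name is only a placeholder until a model has been pulled.

```python
import json
import urllib.request


def build_chat_request(host: str, prompt: str, model: str, port: int = 11434) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical host; replace it with the address of your Jetson device.
request = build_chat_request("127.0.0.1", "Say hello.", model="phi3:mini")
print(request.get_method(), request.full_url)
```

Calling send(request) only succeeds once the server is running and the named model has been pulled; until then, the sketch just shows the shape of the request.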

Running the command ollama ps shows an empty list, since no model has been downloaded or loaded yet.

Serving Microsoft Phi-3 SLM on Ollama

Microsoft’s Phi-3 represents a significant advancement in small language models (SLMs), offering impressive capabilities in a compact package. The Phi-3 family includes models ranging from 3.8 billion to 14 billion parameters, with the Phi-3-mini (3.8B) already available and larger versions like Phi-3-small (7B) and Phi-3-medium (14B) coming soon.

The Phi-3 models are designed for efficiency and accessibility, making them suitable for deployment on resource-constrained edge devices and smartphones. They feature a transformer decoder architecture with a default context length of 4K tokens, with a long context version (Phi-3-mini-128K) extending to 128K tokens.
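
A rough back-of-the-envelope calculation helps show why these parameter counts matter on an edge device. The sketch below estimates the memory taken by the weights alone. The bit widths are illustrative assumptions rather than a statement of how any particular build is quantized, and the estimate ignores the KV cache and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Parameter counts come from the text; the bit widths are illustrative assumptions.
for name, params in [("Phi-3-mini", 3.8e9), ("Phi-3-medium", 14e9)]:
    for bits in (16, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions, the weights of the 3.8B model need roughly 7.6 GB at 16 bits per weight and roughly 1.9 GB at 4 bits.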

For this tutorial, we will run Phi-3-mini with the default 4K context length.

With the Ollama container running and the client installed, we can pull the model with the command below:

ollama pull phi3:mini

Check the model with the command ollama ls.
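
The same check can be done programmatically. The sketch below assumes Ollama's native GET /api/tags endpoint, which lists the locally available models. The helper names and the sample payload are hypothetical and for illustration only; the payload is not captured from a real server.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def fetch_tags(host: str = "127.0.0.1", port: int = 11434, timeout: float = 10.0) -> dict:
    """Query the Ollama server for its locally available models."""
    url = f"http://{host}:{port}/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Illustrative payload only, trimmed to the single field used here.
sample = {"models": [{"name": "phi3:mini"}]}
print("phi3:mini" in model_names(sample))
```

On the Jetson itself, model_names(fetch_tags()) should include phi3:mini once the pull has completed.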

Accessing Phi-3 from a Jupyter Notebook

Since Ollama exposes an OpenAI-compatible API endpoint, we can use the standard OpenAI Python client to interact with the model.

pip install openai

Try the code snippet below, replacing YOUR_JETSON_IP in the URL with the IP address of your Jetson Orin.

from openai import OpenAI

# Replace YOUR_JETSON_IP with the IP address of the Jetson device.
OLLAMA_URL = "http://YOUR_JETSON_IP:11434/v1/"

# Ollama ignores the API key, but the OpenAI client requires a non-empty value.
client = OpenAI(
    base_url=OLLAMA_URL,
    api_key="ollama",
)

prompt = "When was Mahatma Gandhi born? Answer in the most concise form."
model = "phi3:mini"

response = client.chat.completions.create(
    model=model,
    max_tokens=50,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
)

print(response.choices[0].message.content.strip())

On the Jetson device, you can monitor GPU utilization with the jtop command while the model is generating a response.
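
Alongside jtop, a rough throughput figure can be computed on the client side. This is a minimal sketch, assuming the server populates the usage field of the chat completion response. The helper names tokens_per_second and measure are hypothetical, and the elapsed time includes network latency and prompt processing (and model loading on the first call), so treat the result as a coarse indicator only.

```python
import time


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Convert a token count and a wall-clock duration into tokens per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure(client, model: str, prompt: str, max_tokens: int = 50) -> float:
    """Time one chat completion made with an OpenAI client and return tokens per second."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    # Assumption: the server reports generated tokens in response.usage.
    return tokens_per_second(response.usage.completion_tokens, elapsed)
```

With the client object from the snippet above, measure(client, "phi3:mini", prompt) returns a single number that can be compared across runs while watching jtop.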

This tutorial covered the essential steps required to run Microsoft Phi-3 SLM on an Nvidia Jetson Orin edge device. In the next part of the series, we will continue building the federated LM application by leveraging this model. Stay tuned.

Janakiram MSV (Jani) is a practicing architect, research analyst, and advisor to Silicon Valley startups. He focuses on the convergence of modern infrastructure powered by cloud-native technology and machine intelligence driven by generative AI. Before becoming an entrepreneur, he spent...
TNS owner Insight Partners is an investor in: Docker, OpenAI.