URL: https://thenewstack.io/build-a-rag-app-with-nvidia-nim-apis-and-a-vector-database/

Build a RAG App With Nvidia NIM APIs and a Vector Database - The New Stack


2024-08-19 07:20:29
AI / Databases / Large Language Models

Build a RAG App With Nvidia NIM APIs and a Vector Database

A tutorial on how to use NIM APIs to build a simple RAG application, using Zilliz, a version of the popular Milvus vector database.
Aug 19th, 2024 7:20am by Janakiram MSV

The Nvidia NIM platform allows developers to perform inference on generative AI models. In this article, we will explore how to consume the NIM APIs to build a simple RAG application. For the vector database, we will use Zilliz, the hosted, commercial version of the popular Milvus vector database.

We will use meta/llama3-8b-instruct as the LLM, nvidia/nv-embedqa-e5-v5 as the text embeddings model, and Zilliz to perform semantic search.

While this tutorial focuses on cloud-based APIs, the next part of this series will run the same LLM, embeddings model and vector database as containers.

The advantage of using NIM is that the cloud APIs are 100% compatible with the self-hosted containers, so the same code can target containers running locally on a GPU machine, where inference also benefits from GPU acceleration.

Let’s get started building the application.

Step 1: Create an API Key for NIM

Visit the NIM catalog and sign up with your email address to create an API key.

Search for meta/llama3-8b-instruct and click on “Build with this NIM” to create an API key.

Copy the API key and save it in a safe location.

Step 2: Create an Instance of Free Zilliz Cluster

Sign up for Zilliz Cloud and create a cluster. The account comes with $100 in credits, which is sufficient for experimenting with this tutorial.

Make sure you copy the endpoint URI and the API key of your cluster.

Step 3: Create an Environment Configuration File

Create a .env file with the endpoint URIs and API keys. Keeping them in one place means that when we switch to local endpoints, we only need to update this file. Ensure that the values match the ones you saved in the previous two steps.

LLM_URI="https://integrate.api.nvidia.com/v1"
EMBED_URI="https://integrate.api.nvidia.com/v1"
VECTORDB_URI="YOUR_ZILLIZ_CLUSTER_URI"
NIM_API_KEY="YOUR_NIM_API_KEY"
ZILLIZ_API_KEY="YOUR_ZILLIZ_API_KEY"
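A missing or placeholder value in this file usually shows up later as an authentication or connection error. The following is a minimal sketch of a check that can run once the variables are loaded. The helper name missing_settings is hypothetical and not part of any library, and the "YOUR_" prefix test assumes the placeholder values shown above.

```python
import os

# Variable names taken from the .env file above.
REQUIRED_SETTINGS = [
    "LLM_URI",
    "EMBED_URI",
    "VECTORDB_URI",
    "NIM_API_KEY",
    "ZILLIZ_API_KEY",
]


def missing_settings(env=None):
    """Return the required settings that are unset, empty, or still placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_SETTINGS:
        value = (env.get(name) or "").strip()
        # Assumption: untouched placeholders start with "YOUR_", as in the file above.
        if not value or value.startswith("YOUR_"):
            missing.append(name)
    return missing
```

Calling missing_settings() after load_dotenv() returns an empty list when everything is filled in, so it can guard the client setup with a simple if statement.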

Step 4: Create the RAG Application

Launch a Jupyter Notebook and install the required Python modules.

!pip install pymilvus
!pip install openai
!pip install python-dotenv

Let’s start by importing the modules.

from pymilvus import MilvusClient
from pymilvus import connections
from openai import OpenAI
from dotenv import load_dotenv
import os
import ast

Load the environment variables and initialize the clients for LLM, embeddings and the vector database.

load_dotenv()

LLM_URI=os.getenv("LLM_URI")
EMBED_URI=os.getenv("EMBED_URI")
VECTORDB_URI=os.getenv("VECTORDB_URI")

NIM_API_KEY=os.getenv("NIM_API_KEY")
ZILLIZ_API_KEY=os.getenv("ZILLIZ_API_KEY")

llm_client = OpenAI(
    api_key=NIM_API_KEY,
    base_url=LLM_URI
)

embedding_client = OpenAI(
    api_key=NIM_API_KEY,
    base_url=EMBED_URI
)

vectordb_client = MilvusClient(
    uri=VECTORDB_URI,
    token=ZILLIZ_API_KEY
)

The next step is to create the collection in the Zilliz cluster.

if vectordb_client.has_collection(collection_name="india_facts"):
    vectordb_client.drop_collection(collection_name="india_facts")

vectordb_client.create_collection(
    collection_name="india_facts",
    dimension=1024,
)

We set the dimension to 1,024 based on the vector size returned by the embeddings model.
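If the collection dimension and the embedding size disagree, the insert fails, so it can help to check the vectors before sending them. Here is a minimal sketch of such a check. The helper name check_dimensions and the constant EXPECTED_DIM are hypothetical and only illustrate the idea.

```python
EXPECTED_DIM = 1024  # must match the dimension passed to create_collection


def check_dimensions(vectors, expected=EXPECTED_DIM):
    """Raise ValueError if any vector's length differs from the collection dimension."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(
                f"vector {index} has {len(vector)} dimensions, expected {expected}"
            )
    # Return the number of vectors checked, which is handy for logging.
    return len(vectors)
```

Once the embeddings are available, calling check_dimensions on them before the insert turns a mismatch into a clear error message.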

Let’s create a list of strings, convert them into embedding vectors and ingest them into the database.

docs = [
 "India is the seventh-largest country by land area in the world.",
 "The Indus Valley Civilization, one of the world's oldest, originated in India around 3300 BCE.",
 "The game of chess, originally called 'Chaturanga,' was invented in India during the Gupta Empire.",
 "India is home to the world's largest democracy, with over 900 million eligible voters.",
 "The Indian mathematician Aryabhata was the first to explain the concept of zero in the 5th century.",
 "India has the second-largest population in the world, with over 1.4 billion people.",
 "The Kumbh Mela, held every 12 years, is the largest religious gathering in the world, attracting millions of devotees.",
 "India is the birthplace of four major world religions: Hinduism, Buddhism, Jainism, and Sikhism.",
 "The Indian Space Research Organisation (ISRO) successfully sent a spacecraft to Mars on its first attempt in 2014.",
 "India's Varanasi is considered one of the world's oldest continuously inhabited cities, with a history dating back over 3,000 years."
]

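def embed(docs, input_type="query"):
    # input_type is "passage" when indexing documents and "query" when embedding a search query
    response = embedding_client.embeddings.create(
        input=docs,
        model="nvidia/nv-embedqa-e5-v5",
        encoding_format="float",
        extra_body={"input_type": input_type, "truncate": "NONE"}
    )
    vectors = [embedding_data.embedding for embedding_data in response.data]
    return vectors

# Documents stored for retrieval are embedded as passages
vectors = embed(docs, input_type="passage")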

data = [
    {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
    for i in range(len(vectors))
]

vectordb_client.insert(collection_name="india_facts", data=data)
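The records above pair each document with its vector by list position, so the two lists have to stay the same length and in the same order. A minimal sketch of that pairing with an explicit length check follows. The function name build_records is hypothetical, and the field names simply mirror the ones used above.

```python
def build_records(docs, vectors, subject="history"):
    """Pair each document with its vector, using the list position as the id."""
    if len(docs) != len(vectors):
        raise ValueError(
            f"got {len(docs)} documents but {len(vectors)} vectors"
        )
    return [
        {"id": i, "vector": vector, "text": text, "subject": subject}
        for i, (text, vector) in enumerate(zip(docs, vectors))
    ]
```

The result has the same shape as the data list above, so it could be passed to the insert call in its place.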

We will then create a helper function to retrieve the context from the vector database.

def retrieve(query):
    # embed() returns one vector per input string
    query_vectors = embed([query])

    search_results = vectordb_client.search(
        collection_name="india_facts",
        data=query_vectors,
        limit=3,
        output_fields=["text", "subject"]
    )

    all_texts = []
    for item in search_results:
        # Defensive parsing in case an item is a string representation of the hits
        try:
            evaluated_item = ast.literal_eval(item) if isinstance(item, str) else item
        except (ValueError, SyntaxError):
            evaluated_item = item

        if isinstance(evaluated_item, list):
            all_texts.extend(
                subitem['entity']['text']
                for subitem in evaluated_item
                if isinstance(subitem, dict) and 'entity' in subitem and 'text' in subitem['entity']
            )
        elif isinstance(evaluated_item, dict) and 'entity' in evaluated_item and 'text' in evaluated_item['entity']:
            all_texts.append(evaluated_item['entity']['text'])

    return " ".join(all_texts)

This retrieves the top three matching documents, joins the text of each with a space and returns the result as a single string.
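To make the search step less of a black box, here is a toy, in-memory version of a top-k similarity search. It is a sketch that assumes cosine similarity as the metric; the metric and index actually used by the collection may differ, and a real vector database does not scan every record this way. The function names and the two-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k=3):
    """Return the texts of the k records most similar to the query vector."""
    ranked = sorted(
        records,
        key=lambda record: cosine_similarity(query_vector, record["vector"]),
        reverse=True,
    )
    return [record["text"] for record in ranked[:k]]


# Toy 2-dimensional vectors, made up for illustration only.
records = [
    {"text": "east", "vector": [1.0, 0.0]},
    {"text": "north", "vector": [0.0, 1.0]},
    {"text": "northeast", "vector": [1.0, 1.0]},
    {"text": "west", "vector": [-1.0, 0.0]},
]
print(top_k([0.9, 0.1], records, k=2))  # ['east', 'northeast']
```

The query vector points mostly east, so the two closest records by angle are "east" and "northeast", which mirrors how the question's embedding pulls the nearest document embeddings from the collection.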

With the retriever step in place, it’s time to create another helper function to generate the answer from the LLM.

def generate(context, question):
    prompt = f'''
    Based on the context: {context}

    Please answer the question: {question}
    '''
    system_prompt = '''
    You are a helpful assistant that answers questions based on the given context.\n
    Don't add anything to the response. \n
    If you cannot find the answer within the context, say I do not know.
    '''
    completion = llm_client.chat.completions.create(
        model="meta/llama3-8b-instruct",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        top_p=1,
        max_tokens=1024
    )
    return completion.choices[0].message.content
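When an answer looks wrong, it helps to see exactly what was sent to the model. One way to do that, sketched below, is to assemble the messages in a separate function that makes no API call. The names build_messages and SYSTEM_PROMPT are hypothetical, and the wording is adapted from the prompts above rather than copied character for character.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant that answers questions based on the given context. "
    "Don't add anything to the response. "
    "If you cannot find the answer within the context, say I do not know."
)


def build_messages(context, question):
    """Return the chat messages for a grounded question, without calling the LLM."""
    user_prompt = (
        f"Based on the context: {context}\n\n"
        f"Please answer the question: {question}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
```

The returned list has the same structure as the messages argument used above, so it can be printed for inspection or passed to the chat completions call.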

Finally, we wrap these two functions in another function called chat, which first retrieves the context and then sends it to the LLM along with the user's original question.

def chat(prompt):
    context = retrieve(prompt)
    response = generate(context, prompt)
    return response
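Because chat only composes a retriever and a generator, the same flow can be exercised offline by passing in stand-in functions. The sketch below assumes nothing beyond plain Python. The names make_chat, fake_retrieve and fake_generate are hypothetical, and the stand-ins return canned strings instead of calling NIM or Zilliz.

```python
def make_chat(retrieve_fn, generate_fn):
    """Build a chat function from a retriever and a generator."""
    def chat(prompt):
        # Same two steps as above: fetch context, then ask the generator.
        context = retrieve_fn(prompt)
        return generate_fn(context, prompt)
    return chat


# Stand-ins for retrieve and generate, made up for illustration only.
def fake_retrieve(query):
    return "India is the seventh-largest country by land area in the world."


def fake_generate(context, question):
    return f"[context: {context}] [question: {question}]"


offline_chat = make_chat(fake_retrieve, fake_generate)
print(offline_chat("How large is India?"))
```

Swapping the stand-ins for the real retrieve and generate functions gives the same behavior as the chat function above, which also makes it easy to point the pipeline at local endpoints later.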

When we invoke the function, the response from the LLM is derived from the context that the vector database has retrieved.

In the next part of this series, we will run all the components of this RAG application locally on a GPU-accelerated machine. Stay tuned.

Janakiram MSV (Jani) is a practicing architect, research analyst, and advisor to Silicon Valley startups. He focuses on the convergence of modern infrastructure powered by cloud-native technology and machine intelligence driven by generative AI. Before becoming an entrepreneur, he spent...
TNS owner Insight Partners is an investor in: OpenAI.