URL: https://thenewstack.io/tutorial-build-a-qa-bot-for-academy-awards-based-on-chatgpt/

Tutorial: Build a Q&A Bot for Academy Awards Based on ChatGPT

This tutorial walks you through a practical example of using Retrieval Augmented Generation with GPT 3.5 to answer questions based on a custom dataset.
Jul 21st, 2023 6:36am by Janakiram MSV
Feature image by Mirko Fabian from Pixabay.

In a previous article, I introduced the concept of Retrieval Augmented Generation (RAG), which is used to provide context to Large Language Models (LLMs) to improve the accuracy of the response.

This tutorial walks you through a practical example of using RAG with GPT 3.5 to answer questions based on a custom dataset. Since the training cutoff for GPT 3.5 is 2021, it cannot answer questions based on recent events. We will use a dataset related to Oscar awards to implement RAG and have GPT 3.5 respond to questions about the 95th Academy Awards, which took place in March 2023.

This tutorial assumes that you have an active account with OpenAI and have populated the OPENAI_API_KEY environment variable with your API key.

Step 1 – Preparing the Dataset

Download the Oscar Award dataset from Kaggle and move the CSV file to a subdirectory named data. The dataset has all the categories, nominations, and winners of the Academy Awards from 1927 to 2023. I renamed the CSV file to oscars.csv.

Start by importing the Pandas library and loading the dataset:

import pandas as pd

# Load the Kaggle CSV that was renamed to oscars.csv
df = pd.read_csv('./data/oscars.csv')
df.head()

The dataset is well-structured, with column headers and rows that represent the details of each category, including the name of the actor/technician, the film, and whether the nomination was won or lost.

Since we are most interested in the 2023 awards, let’s filter those rows into a new Pandas dataframe. At the same time, we will convert the category to lowercase and drop the rows where the film value is blank. This helps us design the contextual prompts sent to GPT 3.5.

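# Keep only the 2023 ceremony; the explicit copy keeps later column assignments unambiguous
df = df.loc[df['year_ceremony'] == 2023].copy()
# Rows without a film cannot be turned into a complete sentence
df = df.dropna(subset=['film'])
df['category'] = df['category'].str.lower()
df.head()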

With the filtered and cleansed dataset, let’s add a new column to the data frame that has an entire sentence representing a nomination. This complete sentence, when sent to GPT 3.5, enables it to find the facts within the context.

df['text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' to win the award'
df.loc[df['winner'] == False, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' but did not win'
df.head()['text']

Notice how we concatenate the values to generate a complete sentence. For example, the column ‘text’ in the first two rows of the data frame has the below values:

Austin Butler got nominated under the category, actor in a leading role, for the film Elvis but did not win

Colin Farrell got nominated under the category, actor in a leading role, for the film The Banshees of Inisherin but did not win
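
The filtering and sentence-building steps above can be tried without the Kaggle file. The following is a minimal sketch on a hypothetical four-row frame. The function name build_sentences and the placeholder rows (Nominee A, Film A, and so on) are illustrative assumptions, not part of the original notebook; only the Austin Butler row comes from the example above.

```python
import pandas as pd

# Hypothetical toy rows that mimic the columns used in this tutorial.
toy = pd.DataFrame(
    {
        "year_ceremony": [2023, 2023, 2022, 2023],
        "category": [
            "ACTOR IN A LEADING ROLE",
            "EXAMPLE CATEGORY",
            "EXAMPLE CATEGORY",
            "EXAMPLE CATEGORY",
        ],
        "name": ["Austin Butler", "Nominee A", "Nominee B", "Nominee C"],
        "film": ["Elvis", "Film A", "Film B", None],
        "winner": [False, True, True, True],
    }
)


def build_sentences(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Filter one ceremony year and add a 'text' column with one sentence per row."""
    out = df.loc[df["year_ceremony"] == year].dropna(subset=["film"]).copy()
    out["category"] = out["category"].str.lower()
    prefix = (
        out["name"]
        + " got nominated under the category, "
        + out["category"]
        + ", for the film "
        + out["film"]
    )
    # Pick the sentence ending from the boolean winner column.
    suffix = out["winner"].map({True: " to win the award", False: " but did not win"})
    out["text"] = prefix + suffix
    return out


sentences = build_sentences(toy, 2023)
for line in sentences["text"]:
    print(line)
```

The 2022 row and the row with a missing film are dropped, so only two sentences are printed. Mapping the winner column to a suffix gives the same text as the two-step assignment in the tutorial while building each sentence once.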

Step 2 – Generate the Word Embeddings for the Dataset

Now that we have the text constructed from the dataset, let’s convert it into word embeddings. This is a crucial step, as the vectors generated by the embedding model will help us perform a semantic search to retrieve the sentences from the dataset that have similar meanings.

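import openai  # this code targets the pre-1.0 openai package, which exposes openai.Embedding

def text_embedding(text: str) -> list[float]:
    # Request one embedding vector for a single piece of text
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return response["data"][0]["embedding"]

# One API call per row; each vector is stored next to its source sentence
df = df.assign(embedding=(df["text"].apply(lambda x: text_embedding(x))))
df.head()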

In the above step, we set the embedding model to text-embedding-ada-002 and use a lambda function to add a new column called embedding to the data frame. Each embedding maps directly to the text in the same row.
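
Because the step above sends one request per row, a larger dataset may benefit from batching. The following is a rough sketch of one way to do that, not the author's method: embed_in_batches and fake_embed_batch are hypothetical names, and the backend is injected as a plain callable so the chunking logic can be checked offline. With the pre-1.0 openai package, the callable could presumably wrap openai.Embedding.create with a list of strings as input, but that wiring is an assumption and is not shown here.

```python
from typing import Callable, Sequence

Vector = list[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    batch_size: int = 100,
) -> list[Vector]:
    """Embed texts chunk by chunk, preserving the input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: list[Vector] = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        result = embed_batch(chunk)
        # Guard against a backend that drops or duplicates items.
        if len(result) != len(chunk):
            raise RuntimeError("backend returned a wrong number of vectors")
        vectors.extend(result)
    return vectors


# Stand-in backend for offline testing; it is NOT a real embedding model.
calls: list[int] = []


def fake_embed_batch(chunk: list[str]) -> list[Vector]:
    calls.append(len(chunk))
    # Two made-up features: character count and number of spaces.
    return [[float(len(text)), float(text.count(" "))] for text in chunk]


demo_vectors = embed_in_batches(
    ["a b", "hello", "x y z", "q"], fake_embed_batch, batch_size=3
)
print(calls)         # [3, 1]
print(demo_vectors)  # [[3.0, 1.0], [5.0, 0.0], [5.0, 2.0], [1.0, 0.0]]
```

With a real backend in place, the returned list could be assigned to the embedding column in one go, since it keeps the same order as the input sentences.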

Step 3 – Performing a Search to Retrieve Similar Text

With the embeddings generated per row, we can now use a simple technique called cosine similarity, which measures the angle between two vectors, to compare how close two pieces of text are in meaning.
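
To make the measure concrete, here is a small standard-library sketch of cosine similarity, the dot product of two vectors divided by the product of their lengths. The function name cosine_similarity and the two-dimensional vectors are illustrative assumptions; real ada-002 embeddings are much longer, and the tutorial computes the same quantity as 1 - spatial.distance.cosine(x, y).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, perpendicular, and opposite direction.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

The three results are approximately 1, 0, and -1. Only the direction of the vectors matters, not their length, which is why [1, 2] and [2, 4] count as a perfect match.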

Let’s import the modules needed for this step.

import tiktoken
from scipy import spatial 

We will create a helper function to perform a cosine similarity search. It converts the query into an embedding and then compares it with each embedding available in the data frame. It returns the matching text along with a score that ranks the similarity. The top_n parameter defines how many sentences are returned.

def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    # Embed the query with the same model used for the dataset rows
    query_embedding_response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]

    # Score every row against the query (cosine similarity by default)
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for _, row in df.iterrows()
    ]

    # Highest similarity first, then keep the top_n results
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return list(strings[:top_n]), list(relatednesses[:top_n])

Let’s test this function by sending the keyword “Lady Gaga.” The goal is to get the top three sentences from the data frame that are most closely related to the keyword.

strings, relatednesses = strings_ranked_by_relatedness("Lady Gaga", df, top_n=3)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)  # display() is provided by Jupyter; use print() in a plain script

Obviously, the first value, with a score of 0.821, comes closest to the search. We can now inject that into our prompt to augment the context.
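
The ranking logic can also be exercised without an API key. The sketch below assumes hand-made two-dimensional vectors in place of real embeddings; rank_by_relatedness, the sentence labels, and the vectors are all hypothetical. Unlike the helper above, which would fail when unpacking the zip result for an empty data frame, this version simply returns an empty list.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank_by_relatedness(
    query_vec: list[float],
    rows: list[tuple[str, list[float]]],
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_n (text, score) pairs, most related first."""
    scored = [(text, cosine(query_vec, emb)) for text, emb in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up unit vectors standing in for stored embeddings.
toy_rows = [
    ("sentence A", [1.0, 0.0]),
    ("sentence B", [0.6, 0.8]),
    ("sentence C", [0.0, 1.0]),
    ("sentence D", [-1.0, 0.0]),
]

top_two = rank_by_relatedness([0.8, 0.6], toy_rows, top_n=2)
for text, score in top_two:
    print(f"{score:.3f} {text}")
```

Here sentence B ranks first (about 0.96) and sentence A second (about 0.80), because their vectors point most nearly in the same direction as the query vector.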

Step 4 – Construct the Prompt based on RAG

We need to make sure that the prompt doesn’t exceed the supported context length of the model, which is 4K tokens for GPT 3.5. The function below counts the tokens in a piece of text so that we can check the prompt against that limit.

def num_tokens(text: str) -> int:
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    return len(encoding.encode(text))

Let’s create helper functions that make it easy to create the prompt by performing the similarity search in the data frame while respecting the token size.

def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below content related to the 95th Oscar awards to answer the subsequent question. If the answer cannot be found in the content, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_row = f'\n\nOscar database section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_row + question)
            > token_budget
        ):
            break
        else:
            message += next_row
    return message + question
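
The budget check can be illustrated without tiktoken or an API key. In this sketch the names pack_context and count_words are hypothetical, and a whitespace word count stands in for a real tokenizer, so the numbers are only illustrative. The loop mirrors the one above: sections are appended in ranked order until the next one would push the prompt past the budget.

```python
from typing import Callable


def pack_context(
    introduction: str,
    sections: list[str],
    question: str,
    token_budget: int,
    count_tokens: Callable[[str], int],
) -> str:
    """Append sections in order until the next one would exceed the budget."""
    message = introduction
    for section in sections:
        candidate = message + "\n\n" + section
        if count_tokens(candidate + question) > token_budget:
            break
        message = candidate
    return message + question


def count_words(text: str) -> int:
    # Crude stand-in for a tokenizer: counts whitespace-separated words.
    return len(text.split())


prompt = pack_context(
    introduction="Use the content below.",  # 4 words
    sections=["one two three", "four five six", "seven eight nine"],
    question="\n\nQuestion: who won?",  # 3 words
    token_budget=13,
    count_tokens=count_words,
)
print(prompt)
print(count_words(prompt))  # 13
```

The first two sections fit exactly within the 13-word budget and the third is left out. Note that the introduction and question are always included, so a very small budget can still be exceeded by those two parts alone.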

Based on the context that the previous function generated, we will then create a function that calls the OpenAI API.

def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = "gpt-3.5-turbo",
    token_budget: int = 4096 - 500,  # assumed default: 4K context minus room for the reply
    print_message: bool = False,
) -> str:
    # Build the RAG prompt from the most related sentences
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about 95th Oscar awards."},
        {"role": "user", "content": message},
    ]
    # temperature=0 keeps the answer as deterministic as possible
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message

Finally, it’s time to ask GPT 3.5 a question about the 95th Academy Awards.

print(ask('What was the nomination from Lady Gaga for the 95th Oscars?'))

Let’s try one more query.

The bot seems to work well even though the model didn’t have knowledge of the recent event.

You can find the entire code below:

In the next part of this tutorial, we will explore how to use a vector database to store, search, and retrieve word embeddings. Stay tuned.

Janakiram MSV (Jani) is a practicing architect, research analyst, and advisor to Silicon Valley startups. He focuses on the convergence of modern infrastructure powered by cloud-native technology and machine intelligence driven by generative AI. Before becoming an entrepreneur, he spent...
TNS owner Insight Partners is an investor in: Pragma, OpenAI.