![]() |
VOOZH | about |
While Large Language Models (LLMs) can answer questions on many topics, they may not correctly answer about the topics that are not included in the training data such as recent events, or deep web (i.e. data that is not indexed by search engines). Another missing piece is not getting exact source of the answer, making verification challenging. This is where Retrieval Augmented Generation (RAG) can be useful. RAG combines the generative capabilities of LLMs with information retrieval from external sources of data and also can cite the exact sources of its answers, greatly improving verifiability and reliability. In this article we will enhance RAG with Retrieval Augmented Fine-tuning.
In RAG, we split the data into chunks, find the top-K most similar chunks to the query, and present those chunks of content to the LLM to generate an answer. However, those top-K chunks can contain a mix of relevant and irrelevant content for the given query. The LLM should be able to find the relevant content for the query among those chunks given to it to generate the answer. So, if we can finetune the LLM for this specific task of generating an answer given both relevant and irrelevant content in the prompt, it can improve the accuracy of RAG.
As shown in the above picture, generating an answer for a query based on only the training data is like a “closed book” exam. An “Open book” exam is where the answer is generated using external data, which RAG.
In a new method, we train the LLM about how to effectively use the external data. This method significantly improved the RAG performance (https://arxiv.org/pdf/2403.10131.pdf)
Let us now dive deeper on how to prepare the data for Fine-tuning the LLM:
Let us now learn the implementation of Retrieval Augmented Fine-tuning. Initially we start with installing the required libraries using the following commands:
Then you can import the RAFTDataset:
from llama_index.packs.raft_dataset import RAFTDatasetPack
For the data preparation process for Q/A generation, the RAFTDatasetPack is configured with the following parameters:
The SemanticNodeParser operates by dissecting the data at the sentence level, initially dividing the text into smaller segments or ‘chunks’. Here’s how the process unfolds:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import os
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
llm = OpenAI(model="gpt-3.5-turbo")
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
!wget --user-agent "Mozilla" "https://raw.githubusercontent.com/run-llama/llama_index/main
/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -O './paul_graham_essay.txt'
# create RAFT Dataset object
raft_dataset = RAFTDatasetPack(file_path="./paul_graham_essay.txt",
llm = llm, embed_model=embed_model,
num_questions_per_chunk=1, num_distract_docs=2, chunk_size=1024,
default_breakpoint_percentile_threshold=99)
# create the dataset
dataset = raft_dataset.run()
# create the dataset
dataset = raft_dataset.run()
# save the dataset in jsonl format
output_path = './raft_dataset'
dataset.to_json(output_path + ".jsonl")
with open('./raft_dataset.jsonl', 'r') as json_file:
dataset = list(json_file)
# We can access the dataset with the following
json.loads(dataset[0]).keys()
# output
# dict_keys(['id', 'type', 'question', 'context', 'oracle_context', 'cot_answer', 'instruction'])
json.loads(dataset[0])['question']
# output
# 'What were the two main things the author worked on before college?'
The adoption of RAG alongside Large Language Models significantly mitigates their limitations by enabling accurate, verifiable responses to queries beyond their initial training scope. Fine-tuning LLMs with specifically prepared datasets and leveraging preprocessing techniques like the Semantic Splitter Node Parser enhances model performance. This approach marks a significant step forward in the evolution of AI applications, highlighting the importance of innovation in artificial intelligence for more reliable and sophisticated solutions.
A. RAG is a technique that enhances Large Language Models (LLMs) by incorporating external data sources into their answering process. This allows LLMs to provide more accurate, verifiable, and up-to-date answers. Especially for queries about topics not included in their original training data.
A. LLMs’ ability to prioritize relevant information significantly improves when fine-tuned with a specific dataset, including both relevant and irrelevant contexts. This process leads to the generation of more precise and contextually accurate responses to complex queries.
A.The RAFT Dataset specifically designs for fine-tuning LLMs in a RAG setup. It includes a meticulously prepared dataset with questions, oracle contexts for correct answers, and distractor contexts to challenge the model. The setup teaches the LLM to efficiently utilize external data for precise and dependable responses, utilizing the RAG model’s strengths.
I am working as an Associate Data Scientist at Analytics Vidhya, a platform dedicated to building the Data Science ecosystem. My interests lie in the fields of Natural Language Processing (NLP), Deep Learning, and AI Agents.
GPT-4 vs. Llama 3.1 – Which Model is Better?
Llama-3.1-Storm-8B: The 8B LLM Powerhouse Surpa...
A Comprehensive Guide to Building Agentic RAG S...
Top 10 Machine Learning Algorithms in 2026
45 Questions to Test a Data Scientist on Basics...
90+ Python Interview Questions and Answers (202...
8 Easy Ways to Access ChatGPT for Free
Prompt Engineering: Definition, Examples, Tips ...
What is LangChain?
What is Retrieval-Augmented Generation (RAG)?
Edit
Resend OTP
Resend OTP in 45s