VOOZH about

URL: https://thenewstack.io/overcoming-the-limits-of-rag-with-colbert/

⇱ Overcoming the Limits of RAG with ColBERT - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2024-03-06 03:00:57
Overcoming the Limits of RAG with ColBERT
sponsor-datastax,sponsored-post-contributed,
AI / Data / Large Language Models

Overcoming the Limits of RAG with ColBERT

ColBERT is a new way of scoring passage relevance using a BERT language model that substantially solves the problems with dense passage retrieval.
Mar 6th, 2024 3:00am by Jonathan Ellis
👁 Featued image for: Overcoming the Limits of RAG with ColBERT
Image from Everett Collection on Shutterstock
DataStax sponsored this post.

Retrieval augmented generation (RAG) is by now a standard part of generative artificial intelligence (AI) applications. Supplementing your application prompt with relevant context retrieved from a vector database can dramatically increase accuracy and reduce hallucinations. This means that increasing relevance in vector search results has a direct correlation to the quality of your RAG application.

There are two reasons RAG remains popular and increasingly relevant even as large language models (LLMs) increase their context window:

  1. LLM response time and price both increase linearly with context length.
  2. LLMs still struggle with both retrieval and reasoning across massive contexts.

But RAG isn’t a magic wand. In particular, the most common design, dense passage retrieval (DPR), represents both queries and passages as a single embedding vector and uses straightforward cosine similarity to score relevance. This means DPR relies heavily on the embeddings model having the breadth of training to recognize all the relevant search terms.

Unfortunately, off-the-shelf models struggle with unusual terms, including names, that are not commonly in their training data. DPR also tends to be hypersensitive to chunking strategy, which can cause a relevant passage to be missed if it’s surrounded by a lot of irrelevant information. All of this creates a burden on the application developer to “get it right the first time,” since a mistake usually results in the need to rebuild the index from scratch.

Solving DPR’s Challenges with ColBERT

ColBERT is a new way of scoring passage relevance using a BERT language model that substantially solves the problems with DPR. This diagram from the first ColBERT paper shows why it’s so exciting:

👁 Image

This compares the performance of ColBERT with other state-of-the-art solutions for the MS-MARCO dataset. (MS-MARCO is a set of Bing queries for which Microsoft scored the most relevant passages by hand. It’s one of the better retrieval benchmarks.) Lower and to the right is better.

In short, ColBERT handily outperforms the field of mostly significantly more complex solutions at the cost of a small increase in latency.

To test this, I created a demo and indexed over 1,000 Wikipedia articles with both ada002 DPR and ColBERT. I found that ColBERT delivers significantly better results on unusual search terms.

The following screenshot shows that DPR fails to recognize the unusual name of William H. Herndon, an associate of Abraham Lincoln, while ColBERT finds the reference in the Springfield article. Also note that ColBERT’s No. 2 result is for a different William, while none of DPR’s results are relevant.

👁 Image

ColBERT is often described in dense machine learning jargon, but it’s actually very straightforward. I’ll show how to implement ColBERT retrieval and scoring on DataStax Astra DB with only a few lines of Python and Cassandra Query Language (CQL).

The Big Idea

Instead of traditional, single-vector-based DPR that turns passages into a single “embedding” vector, ColBERT generates a contextually influenced vector for each token in the passages. ColBERT similarly generates vectors for each token in the query.

(Tokenization refers to breaking up input into fractions of words before processing by an LLM. Andrej Karpathy, a founding member of the OpenAI team, just released an outstanding video on how this works.)

Then, the score of each document is the sum of the maximum similarity of each query embedding to any of the document embeddings:

(@ is the PyTorch operator for dot product and is the most common measure of vector similarity.)

That’s it — you can implement ColBERT scoring in four lines of Python! Now you understand ColBERT better than 99% of the people posting about it on X (formerly known as Twitter).

The rest of the ColBERT papers deal with:

  1. How do you fine-tune the BERT model to generate the best embeddings for a given data set?
  2. How do you limit the set of documents for which you compute the (relatively expensive) score shown here?

The first question is optional and out of scope for this writeup. I’ll use the pretrained ColBERT checkpoint. But the second is straightforward to do with a vector database like DataStax Astra DB.

ColBERT on Astra DB

There is a popular Python all-in-one library for ColBERT called RAGatouille; however, it assumes a static dataset. One of the powerful features of RAG applications is responding to dynamically changing data in real time. So instead, I’m going to use Astra’s vector index to narrow the set of documents I need to score down to the best candidates for each subvector.

There are two steps when adding ColBERT to a RAG application: ingestion and retrieval.

Ingestion

Because each document chunk will have multiple embeddings associated with it, I’ll need two tables:

After installing the ColBERT library (pip install colbert-ai) and downloading the pretrained BERT checkpoint, I can load documents into these tables:

(I like to encapsulate my DB logic in a dedicated module; you can access the full source in my GitHub repository.)

Retrieval

Then retrieval looks like this:

Here is the query being executed for the most-relevant-documents part (db.query_colbert_ann_stmt):

Beyond the Basics: RAGStack

This article and the linked repository briefly introduce how ColBERT works. You can implement this today with your own data and see immediate results. As with everything in AI, best practices are changing daily, and new techniques are constantly emerging.

To make keeping up with the state of the art easier, DataStax is rolling this and other enhancements into RAGStack, our production-ready RAG library leveraging LangChain and LlamaIndex. Our goal is to provide developers with a consistent library for RAG applications that puts them in control of the step-up to new functionality. Instead of having to keep up with the myriad changes in techniques and libraries, you have a single stream, so you can focus on building your application. You can use RAGStack today to incorporate best practices for LangChain and LlamaIndex out of the box; advances like ColBERT will come to RAGstack in upcoming releases.

DataStax, an IBM company, provides the real-time vector data tools that Gen AI apps need, with seamless integration with developers’ stacks of choice.
Learn More
The latest from DataStax
TRENDING STORIES
Jonathan Ellis is co-founder and CTO of DataStax. He served as Apache Cassandra Project Chair for six years.
Read more from Jonathan Ellis
DataStax sponsored this post.
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: OpenAI.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.