SemanticKernel.Rankers.BM25 1.3.5

.NET 8.0

BM25 Ranker

A robust C# library for reranking search results using the classic BM25 algorithm with advanced natural language processing, leveraging the Catalyst NLP library.

Introduction

This project provides a flexible C# implementation of BM25, a ranking function widely used by search engines, enhanced with natural language processing.
With this library, you can rerank search results or candidate passages using sophisticated tokenization, lemmatization, stop word removal, and multi-language support through the Catalyst NLP library.

Why BM25 with NLP?

Traditional BM25 relies on exact token overlap between query and document. Raw text, however, is noisy:

- Text contains punctuation, stop words, and mixed case.
- The same word appears in different forms ("running" vs "run", "cars" vs "car"), which exact matching treats as unrelated tokens.
- Different languages require different processing approaches.

By incorporating advanced NLP preprocessing:

- The reranker uses lemmatization to normalize word forms (running → run).
- Automatic language detection ensures proper processing for multilingual content.
- Stop words are filtered out to focus on meaningful terms.
- Part-of-speech tagging helps identify important content words.

NLP preprocessing enhances the precision and effectiveness of traditional BM25 scoring.
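
The effect of normalization on token overlap can be illustrated with a small sketch. It does not use Catalyst: the stop word set, the lemma table, and the `Normalize` and `RawTokens` helpers are toy stand-ins invented for this example, assuming that the real pipeline performs an equivalent (and far more complete) transformation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-ins for what an NLP pipeline provides; NOT Catalyst's data or API.
var stopWords = new HashSet<string> { "the", "a", "is", "are", "in" };
var lemmas = new Dictionary<string, string> { ["running"] = "run", ["cars"] = "car" };

// Lowercase, replace non-letters with spaces, drop stop words, map words to lemmas.
List<string> Normalize(string text) =>
    new string(text.ToLowerInvariant().Select(c => char.IsLetter(c) ? c : ' ').ToArray())
        .Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .Where(t => !stopWords.Contains(t))
        .Select(t => lemmas.TryGetValue(t, out var lemma) ? lemma : t)
        .ToList();

// Naive whitespace split with no normalization at all.
List<string> RawTokens(string text) =>
    text.Split(' ', StringSplitOptions.RemoveEmptyEntries).ToList();

var queryText = "run car";
var docText = "The cars are running.";

int rawOverlap = RawTokens(queryText).Intersect(RawTokens(docText)).Count();
int normalizedOverlap = Normalize(queryText).Intersect(Normalize(docText)).Count();

Console.WriteLine($"Raw overlap: {rawOverlap}, normalized overlap: {normalizedOverlap}");
// prints "Raw overlap: 0, normalized overlap: 2"
```

Without normalization the query shares no tokens with the document, so an exact-match BM25 score would be zero; after normalization both query terms match.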

Features

- BM25 core algorithm: Tunable through the k1 and b parameters.
- Advanced NLP processing: Powered by the Catalyst library for tokenization and linguistic analysis.
- Multi-language support: Automatic language detection with support for English, French, German, and more.
- Intelligent preprocessing: Lemmatization, stop word removal, and part-of-speech filtering.
- Asynchronous processing: Tokenization and scoring are exposed as async APIs that stream results.
- Easy to extend: Customizable parameters and configurable language models.

Getting Started

Prerequisites

.NET 8.0+

Installation

Install the package via NuGet Package Manager or via the .NET CLI:

dotnet add package SemanticKernel.Rankers.BM25

Usage Example

using SemanticKernel.Rankers.BM25;

// Sample documents to rank
var documents = new List<string>
{
    "The quick brown fox jumps over the lazy dog.",
    "A brown dog jumps over another dog.",
    "The quick brown fox.",
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing helps computers understand human language."
};

// Create BM25 reranker
var bm25 = new BM25Reranker();

// Method 1: Basic scoring - get all document scores
Console.WriteLine("=== Basic Scoring ===");
await foreach (var (document, score) in bm25.ScoreAsync("quick brown fox", documents.ToAsyncEnumerable()))
{
    Console.WriteLine($"Score: {score:F4} | Document: \"{document}\"");
}

// Method 2: Top-N ranking - get only the best results
Console.WriteLine("\n=== Top-N Ranking ===");
await foreach (var (document, score) in bm25.RankAsync("quick brown fox", documents.ToAsyncEnumerable(), topN: 3))
{
    Console.WriteLine($"Score: {score:F4} | Document: \"{document}\"");
}

// Method 3: Optimized approach with pre-computed corpus statistics
Console.WriteLine("\n=== Optimized with Corpus Statistics ===");
var corpusStats = await bm25.ComputeCorpusStatisticsAsync(documents.ToAsyncEnumerable());
var optimizedBm25 = new BM25Reranker(corpusStats);

await foreach (var (document, score) in optimizedBm25.ScoreAsync("machine learning", documents.ToAsyncEnumerable()))
{
    Console.WriteLine($"Score: {score:F4} | Document: \"{document}\"");
}

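// Extension method to expose an in-memory collection as IAsyncEnumerable.
// In a top-level program, type declarations like this one must come after all statements.
public static class ListExtensions
{
    public static async IAsyncEnumerable<T> ToAsyncEnumerable<T>(this IEnumerable<T> source)
    {
        foreach (var item in source)
        {
            yield return item;
            await Task.Yield(); // Yield control so the iterator actually runs asynchronously
        }
    }
}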

How It Works

1. Document Preprocessing: Each document is processed through the Catalyst NLP pipeline:
   - Automatic language detection
   - Tokenization into individual words
   - Lemmatization to normalize word forms
   - Stop word removal
   - Part-of-speech filtering (removes punctuation and symbols)
2. Index Building: The system builds an inverted index with:
   - Document frequency (DF) for each term
   - Document lengths and average document length
   - Preprocessed token lists for efficient scoring
3. Query Processing: Query text undergoes the same NLP preprocessing as documents.
4. BM25 Scoring: For each document, the BM25 score is calculated using:
   - Term frequency (TF) in the document
   - Inverse document frequency (IDF)
   - Document length normalization
   - Tunable parameters k1 and b
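
The index-building and scoring steps above can be sketched in plain C#. This is a minimal illustration, not the library's implementation: the corpus is already tokenized by hand (standing in for the Catalyst pipeline), the `Bm25` helper is a hypothetical name, and the IDF form `ln(1 + (N - n + 0.5) / (n + 0.5))` is one common variant that this package may or may not use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hand-tokenized corpus; the real library would derive these tokens via NLP preprocessing.
var corpus = new List<string[]>
{
    new[] { "quick", "brown", "fox", "jump", "lazy", "dog" },
    new[] { "brown", "dog", "jump", "dog" },
    new[] { "quick", "brown", "fox" },
};
var query = new[] { "quick", "fox" };

// Index building: document frequency per term and average document length.
var df = new Dictionary<string, int>();
foreach (var doc in corpus)
    foreach (var term in doc.Distinct())
        df[term] = df.TryGetValue(term, out var count) ? count + 1 : 1;

double avgLength = corpus.Average(d => d.Length);

// BM25 score of one document for the given query terms (assumed IDF variant, see lead-in).
double Bm25(string[] queryTerms, string[] document, double k1 = 1.5, double b = 0.75)
{
    double score = 0;
    foreach (var term in queryTerms.Distinct())
    {
        int tf = document.Count(t => t == term);
        if (tf == 0 || !df.TryGetValue(term, out int n)) continue;

        double idf = Math.Log(1 + (corpus.Count - n + 0.5) / (n + 0.5));
        double lengthNorm = k1 * (1 - b + b * document.Length / avgLength);
        score += idf * tf * (k1 + 1) / (tf + lengthNorm);
    }
    return score;
}

var scores = corpus.Select(d => Bm25(query, d)).ToList();
for (int i = 0; i < scores.Count; i++)
    Console.WriteLine($"Document {i}: {scores[i]:F4}");
```

In this toy corpus the second document contains neither query term and scores zero, while the short third document outranks the longer first one because length normalization penalizes the extra tokens.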

Customization

BM25 Parameters

You can customize the BM25 algorithm behavior by passing parameters to the scoring methods:

// k1 and b are passed per call, not to the constructor
var bm25 = new BM25Reranker();

// Use custom parameters in scoring
await foreach (var (document, score) in bm25.ScoreAsync("query", documents.ToAsyncEnumerable(), k1: 2.0, b: 0.5))
{
    Console.WriteLine($"Score: {score:F4} | Document: \"{document}\"");
}

// Or in ranking with top-N
await foreach (var (document, score) in bm25.RankAsync("query", documents.ToAsyncEnumerable(), topN: 5, k1: 2.0, b: 0.5))
{
    Console.WriteLine($"Score: {score:F4} | Document: \"{document}\"");
}

k1 (default: 1.5): Controls term frequency saturation. Higher values give more weight to repeated terms.
b (default: 0.75): Controls document length normalization. 0 = no normalization, 1 = full normalization.
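
For reference, the role of the two parameters is easiest to see in the standard BM25 formula, written here in LaTeX. The notation is an assumption of this note: f(t, d) is the frequency of term t in document d, |d| is the document length in tokens, and avgdl is the average document length in the corpus. The source does not state which IDF variant the library uses, so IDF(t) is left abstract.

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1 \left( 1 - b + b \, \frac{|d|}{\mathrm{avgdl}} \right)}
```

With b = 0 the length term reduces to k_1, so document length has no effect; with b = 1 it becomes k_1 |d| / avgdl, so long documents are penalized in full. As k_1 approaches 0 the fraction approaches 1 for any matching term, and larger k_1 lets repeated occurrences keep adding to the score before it saturates.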

Language Support

The library automatically detects each document's language and applies the appropriate NLP model. You can also restrict detection to a specific set of languages:

using Catalyst;

// Create reranker with specific language support
var supportedLanguages = new HashSet<Language> { Language.English, Language.French, Language.German };
var bm25 = new BM25Reranker(supportedLanguages: supportedLanguages);

// Or combine with corpus statistics
var corpusStats = await new BM25Reranker().ComputeCorpusStatisticsAsync(documents.ToAsyncEnumerable());
var optimizedBm25 = new BM25Reranker(corpusStats, supportedLanguages);

Supported languages include:

- English
- French
- German
- Additional languages supported by Catalyst

Performance Optimization

The library includes several performance optimizations:

Caching: The reranker automatically caches tokenization results for better performance with repeated queries or documents:

var bm25 = new BM25Reranker();

// First run - cache miss
await foreach (var result in bm25.ScoreAsync("query", documents.ToAsyncEnumerable())) { }

// Second run - cache hit (tokenization is reused, so this is typically faster)
await foreach (var result in bm25.ScoreAsync("query", documents.ToAsyncEnumerable())) { }

// Clear cache when needed
BM25Reranker.ClearCache();
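
The idea behind this kind of caching can be sketched as simple memoization. The following is an illustration only, not the library's internal design: the `cache` dictionary, the whitespace `Tokenize` stand-in, and `TokenizeCached` are hypothetical names, and the real cache may be keyed or bounded differently.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical cache keyed by the raw text; not the library's internal structure.
var cache = new ConcurrentDictionary<string, string[]>();
int tokenizerCalls = 0;

// Stand-in for an expensive NLP tokenization step; counts how often it really runs.
string[] Tokenize(string text)
{
    tokenizerCalls++;
    return text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
}

// Returns the cached tokens when the same text has been seen before.
string[] TokenizeCached(string text) => cache.GetOrAdd(text, t => Tokenize(t));

TokenizeCached("quick brown fox"); // cache miss: tokenizer runs
TokenizeCached("quick brown fox"); // cache hit: tokenizer is skipped
int callsAfterRepeat = tokenizerCalls;

cache.Clear();                     // analogous in spirit to clearing the reranker cache
TokenizeCached("quick brown fox"); // cache miss again
int callsAfterClear = tokenizerCalls;

Console.WriteLine($"Tokenizer calls after repeat: {callsAfterRepeat}, after clear: {callsAfterClear}");
// prints "Tokenizer calls after repeat: 1, after clear: 2"
```

A cache like this trades memory for speed, which is why an explicit way to clear it matters for long-running processes that see many distinct documents.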

Corpus Statistics: For multiple queries on the same document set, pre-compute corpus statistics:

var bm25 = new BM25Reranker();
var corpusStats = await bm25.ComputeCorpusStatisticsAsync(documents.ToAsyncEnumerable());
var optimizedBm25 = new BM25Reranker(corpusStats);

// Later queries reuse the precomputed statistics instead of recomputing them
await foreach (var result in optimizedBm25.ScoreAsync("query1", documents.ToAsyncEnumerable())) { }
await foreach (var result in optimizedBm25.ScoreAsync("query2", documents.ToAsyncEnumerable())) { }

License

This project is licensed under the MIT License - see the LICENSE file for details.

Target frameworks (.NET)

Compatible: net8.0
Computed: net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows

net8.0
- Catalyst (>= 1.0.54164)
- Catalyst.Models.English (>= 1.0.30952)
- Catalyst.Models.French (>= 1.0.30952)
- Catalyst.Models.German (>= 1.0.30952)
- Microsoft.SemanticKernel.Abstractions (>= 1.65.0)
- SemanticKernel.Rankers.Abstractions (>= 1.3.5)
- System.Numerics.Tensors (>= 9.0.9)

NuGet packages (2)

Showing the top 2 NuGet packages that depend on SemanticKernel.Rankers.BM25:

Package	Description
SemanticKernel.Agents.DatabaseAgent	Microsoft's Semantic Kernel NL2SQL agent for databases. This agent can be used to generate SQL queries from natural language prompts.
SemanticKernel.Rankers.Pipelines	A flexible pipeline implementation for RAG systems that chains multiple rankers in cascade.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
1.3.5	1,238	10/3/2025
1.3.4	260	9/27/2025
1.3.1	2,209	9/3/2025
1.3.0	1,401	9/2/2025
1.2.0	234	9/2/2025
1.1.0	231	9/1/2025
1.0.0	260	9/1/2025

URL: https://www.nuget.org/packages/SemanticKernel.Rankers.BM25/
