dotnet add package SemanticKernel.Rankers.BM25 --version 1.3.5
NuGet\Install-Package SemanticKernel.Rankers.BM25 -Version 1.3.5
<PackageReference Include="SemanticKernel.Rankers.BM25" Version="1.3.5" />
Directory.Packages.props:
<PackageVersion Include="SemanticKernel.Rankers.BM25" Version="1.3.5" />
Project file:
<PackageReference Include="SemanticKernel.Rankers.BM25" />
paket add SemanticKernel.Rankers.BM25 --version 1.3.5
#r "nuget: SemanticKernel.Rankers.BM25, 1.3.5"
#:package SemanticKernel.Rankers.BM25@1.3.5
Install as a Cake Addin:
#addin nuget:?package=SemanticKernel.Rankers.BM25&version=1.3.5
Install as a Cake Tool:
#tool nuget:?package=SemanticKernel.Rankers.BM25&version=1.3.5
A robust C# library for reranking search results using the classic BM25 algorithm with advanced natural language processing, leveraging the Catalyst NLP library.
This project provides a flexible C# implementation of BM25, a ranking function widely used by search engines, combined with natural language preprocessing.
With this library, you can rerank search results or candidate passages using sophisticated tokenization, lemmatization, stop word removal, and multi-language support through the Catalyst NLP library.
Traditional BM25 relies on exact token overlap between query and document. However, raw text processing can be noisy:
By incorporating advanced NLP preprocessing:
NLP preprocessing enhances the precision and effectiveness of traditional BM25 scoring.
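To make the limitation of exact token overlap concrete, here is a minimal sketch of a naive tokenizer that only lowercases, strips punctuation, and drops a few stop words. `NaiveTokenizer` and its stop word list are hypothetical and written only for illustration; they are not part of this library, which uses Catalyst for preprocessing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not the library's Catalyst pipeline.
public static class NaiveTokenizer
{
    // A tiny, assumed stop word list; real lists are much larger and language-specific.
    private static readonly HashSet<string> StopWords =
        new(StringComparer.OrdinalIgnoreCase) { "the", "a", "an", "is", "of", "over" };

    private static readonly char[] Separators =
        { ' ', '\t', '\n', '.', ',', ';', ':', '!', '?', '"' };

    // Splits on whitespace and punctuation, lowercases, and removes stop words.
    public static List<string> Tokenize(string text)
    {
        return text
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => t.ToLowerInvariant())
            .Where(t => !StopWords.Contains(t))
            .ToList();
    }
}
```

With this tokenizer, the document token "jumps" and a query token "jump" remain different strings, so they contribute nothing to an overlap-based score. A lemmatizer would typically reduce both to the same base form, which is the kind of mismatch NLP preprocessing is meant to address.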
k1, b parameters).
dotnet add package SemanticKernel.Rankers.BM25
using SemanticKernel.Rankers.BM25;

// Sample documents to rank
var documents = new List<string>
{
    "The quick brown fox jumps over the lazy dog.",
    "A brown dog jumps over another dog.",
    "The quick brown fox.",
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing helps computers understand human language."
};

// Create BM25 reranker
var bm25 = new BM25Reranker();

// Method 1: Basic scoring - get all document scores
Console.WriteLine("=== Basic Scoring ===");
await foreach (var (document, score) in bm25.ScoreAsync("quick brown fox", documents.ToAsyncEnumerable()))
{
    Console.WriteLine($"Score: {score:F4} | Document: \"{document}\"");
}

// Method 2: Top-N ranking - get only the best results
Console.WriteLine("\n=== Top-N Ranking ===");
await foreach (var (document, score) in bm25.RankAsync("quick brown fox", documents.ToAsyncEnumerable(), topN: 3))
{
    Console.WriteLine($"Score: {score:F4} | Document: \"{document}\"");
}

// Method 3: Optimized approach with pre-computed corpus statistics
Console.WriteLine("\n=== Optimized with Corpus Statistics ===");
var corpusStats = await bm25.ComputeCorpusStatisticsAsync(documents.ToAsyncEnumerable());
var optimizedBm25 = new BM25Reranker(corpusStats);
await foreach (var (document, score) in optimizedBm25.ScoreAsync("machine learning", documents.ToAsyncEnumerable()))
{
    Console.WriteLine($"Score: {score:F4} | Document: \"{document}\"");
}

// Extension method to convert any IEnumerable<T> (such as a List<T>) to IAsyncEnumerable<T>.
// Type declarations must come after the top-level statements above.
public static class ListExtensions
{
    public static async IAsyncEnumerable<T> ToAsyncEnumerable<T>(this IEnumerable<T> source)
    {
        foreach (var item in source)
        {
            yield return item;
            await Task.Yield(); // Simulate async behavior
        }
    }
}
Document Preprocessing: Each document is processed through the Catalyst NLP pipeline:
Index Building: The system builds an inverted index with:
Query Processing: Query text undergoes the same NLP preprocessing as documents
BM25 Scoring: For each document, calculates the BM25 score using:
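For reference, the standard Okapi BM25 score of a document $D$ for a query $Q$ is usually written as below. The source does not state which IDF variant this library uses, so the smoothed, non-negative IDF shown here is an assumption.

```latex
\[
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\,
\frac{f(q_i, D)\,(k_1 + 1)}
     {f(q_i, D) + k_1\left(1 - b + b\,\dfrac{|D|}{\mathrm{avgdl}}\right)}
\]
\[
\mathrm{IDF}(q_i) = \ln\!\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)
\]
```

Here $f(q_i, D)$ is the frequency of term $q_i$ in $D$, $|D|$ is the document length in tokens, $\mathrm{avgdl}$ is the average document length in the corpus, $N$ is the number of documents, and $n(q_i)$ is the number of documents containing $q_i$. The parameter $k_1$ controls term-frequency saturation and $b$ controls length normalization.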
You can customize the BM25 algorithm behavior by passing parameters to the scoring methods:
// Create the reranker; k1 and b are supplied per call, not in the constructor
var bm25 = new BM25Reranker();

// Use custom parameters in scoring
await foreach (var (document, score) in bm25.ScoreAsync("query", documents.ToAsyncEnumerable(), k1: 2.0, b: 0.5))
{
    Console.WriteLine($"Score: {score:F4} | Document: \"{document}\"");
}

// Or in ranking with top-N
await foreach (var (document, score) in bm25.RankAsync("query", documents.ToAsyncEnumerable(), topN: 5, k1: 2.0, b: 0.5))
{
    Console.WriteLine($"Score: {score:F4} | Document: \"{document}\"");
}
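To build intuition for what `k1` and `b` change, the following sketch computes the term-frequency component of the standard Okapi BM25 formula. `Bm25Math.TermWeight` is a hypothetical helper written for this illustration; it is not part of the library's API, and the library's internal computation may differ.

```csharp
using System;

// Hypothetical helper for illustration; not part of SemanticKernel.Rankers.BM25.
public static class Bm25Math
{
    // Term-frequency component of Okapi BM25 (without the IDF factor).
    // k1 controls how quickly repeated occurrences saturate;
    // b controls how strongly long documents are penalized (0 = no length normalization).
    public static double TermWeight(double tf, double docLength, double avgDocLength, double k1, double b)
    {
        double lengthNorm = 1.0 - b + b * (docLength / avgDocLength);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }
}
```

For a document of average length, the weight reduces to `tf * (k1 + 1) / (tf + k1)`, which approaches `k1 + 1` as `tf` grows, so a larger `k1` lets repeated terms keep adding to the score for longer. With `b: 0.0` the document length has no effect at all.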
The library automatically detects each document's language and applies the appropriate NLP models. You can optionally restrict the set of supported languages:
using Catalyst;
// Create reranker with specific language support
var supportedLanguages = new HashSet<Language> { Language.English, Language.French, Language.German };
var bm25 = new BM25Reranker(supportedLanguages: supportedLanguages);
// Or combine with corpus statistics
var corpusStats = await new BM25Reranker().ComputeCorpusStatisticsAsync(documents.ToAsyncEnumerable());
var optimizedBm25 = new BM25Reranker(corpusStats, supportedLanguages);
Supported languages include:
The library includes several performance optimizations:
Caching: The reranker automatically caches tokenization results for better performance with repeated queries or documents:
var bm25 = new BM25Reranker();
// First run - cache miss
await foreach (var result in bm25.ScoreAsync("query", documents.ToAsyncEnumerable())) { }
// Second run - cache hit (much faster)
await foreach (var result in bm25.ScoreAsync("query", documents.ToAsyncEnumerable())) { }
// Clear cache when needed
BM25Reranker.ClearCache();
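The caching idea above can be sketched as simple memoization keyed by the input text. `TokenCache` below is a hypothetical class written for this illustration; how the library actually stores or keys its cache is not described in the source.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical memoization sketch; not the library's actual cache.
public static class TokenCache
{
    private static readonly ConcurrentDictionary<string, string[]> Cache = new();

    // Returns the cached tokens for this text, or runs the tokenizer and stores the result.
    // Under concurrent access the tokenizer may run more than once for the same key,
    // but only one result is kept.
    public static string[] GetOrTokenize(string text, Func<string, string[]> tokenize)
        => Cache.GetOrAdd(text, tokenize);

    // Drops all cached entries, for example after changing tokenization settings.
    public static void Clear() => Cache.Clear();
}
```

A static cache like this is shared by every caller in the process, which is why an explicit clear operation is useful when memory use or stale entries become a concern.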
Corpus Statistics: For multiple queries on the same document set, pre-compute corpus statistics:
var bm25 = new BM25Reranker();
var corpusStats = await bm25.ComputeCorpusStatisticsAsync(documents.ToAsyncEnumerable());
var optimizedBm25 = new BM25Reranker(corpusStats);
// Now all queries will be faster
await foreach (var result in optimizedBm25.ScoreAsync("query1", documents.ToAsyncEnumerable())) { }
await foreach (var result in optimizedBm25.ScoreAsync("query2", documents.ToAsyncEnumerable())) { }
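As a rough picture of why pre-computing helps, BM25 needs three corpus-level quantities: the document count, the average document length, and the number of documents containing each term. The sketch below computes them once from already-tokenized documents. `CorpusStatsSketch` and `CorpusStatsBuilder` are hypothetical names for illustration; the actual shape of the object returned by `ComputeCorpusStatisticsAsync` is not described in the source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container; the library's real corpus statistics type may differ.
public sealed record CorpusStatsSketch(
    int DocumentCount,
    double AverageLength,
    IReadOnlyDictionary<string, int> DocumentFrequency);

public static class CorpusStatsBuilder
{
    // Single pass over tokenized documents, collecting the quantities BM25 needs.
    public static CorpusStatsSketch Build(IEnumerable<IReadOnlyList<string>> tokenizedDocs)
    {
        int count = 0;
        long totalLength = 0;
        var documentFrequency = new Dictionary<string, int>();

        foreach (var tokens in tokenizedDocs)
        {
            count++;
            totalLength += tokens.Count;

            // Count each term at most once per document.
            foreach (var term in tokens.Distinct())
            {
                documentFrequency.TryGetValue(term, out var seen);
                documentFrequency[term] = seen + 1;
            }
        }

        double averageLength = count == 0 ? 0.0 : (double)totalLength / count;
        return new CorpusStatsSketch(count, averageLength, documentFrequency);
    }
}
```

Because none of these values depend on the query, they can be computed once and reused for every query against the same document set.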
This project is licensed under the MIT License; see the LICENSE file for details.
| Product | Versions (compatible and additional computed target framework versions) |
|---|---|
| .NET | net8.0 is compatible. Computed: net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows. |
Showing the top 2 NuGet packages that depend on SemanticKernel.Rankers.BM25:
| Package | Downloads |
|---|---|
| SemanticKernel.Agents.DatabaseAgent: Microsoft's Semantic Kernel NL2SQL agent for databases. This agent can be used to generate SQL queries from natural language prompts. | |
| SemanticKernel.Rankers.Pipelines: A flexible pipeline implementation for RAG systems that chains multiple rankers in cascade. | |
This package is not used by any popular GitHub repositories.