Last indexed: 18 May 2026 (ecd184)

Reranking

The LLamaReranker class computes relevance scores between a query (the input) and a set of documents. Unlike text generation or standard embedding generation, reranking evaluates how well a specific document answers or relates to a given query. In Retrieval-Augmented Generation (RAG) pipelines it is typically used to refine the results returned by a vector search.
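To make the RAG refinement step concrete, the sketch below reorders vector-search candidates by their relevance scores and keeps the best few. It is a language-agnostic illustration written in Python, not LLamaSharp code; the function name rerank and the top_k parameter are hypothetical, and the scores are assumed to have been produced already by a reranker.

```python
from typing import List, Sequence, Tuple


def rerank(
    documents: Sequence[str],
    scores: Sequence[float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Return the top_k (document, score) pairs, highest score first.

    Hypothetical helper: `scores[i]` is assumed to be the relevance
    score a reranker assigned to `documents[i]`.
    """
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    # Sort pairs by score, descending; ties keep their original order.
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

In a pipeline of this shape, the vector search supplies a broad candidate set and only the reordered top_k documents are passed on to generation.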

Overview and Purpose

LLamaReranker relies on models trained for cross-encoding, where the query and document are processed together to produce a single scalar score representing their relationship (LLama/LLamaReranker.cs:12-21).

Key characteristics include:

Pooling Type: Reranking requires the model and context to be configured with LLamaPoolingType.Rank (LLama/LLamaReranker.cs:40-41).
Non-Causal Models: It is primarily intended for non-causal models, where BatchSize must equal UBatchSize (LLama/LLamaReranker.cs:36-37).
Score Normalization: Scores can be returned as raw logits or normalized with a sigmoid function to the range (0, 1) (LLama/LLamaReranker.cs:145).
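The score normalization mentioned above is the standard logistic sigmoid, sigmoid(x) = 1 / (1 + e^(-x)). The following is a minimal Python sketch of that mapping, not the library's code; the names sigmoid and normalize_scores are hypothetical. It uses a numerically stable form so that large negative logits do not overflow.

```python
import math
from typing import Iterable, List


def sigmoid(logit: float) -> float:
    """Map a raw logit to a relevance score via the logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    # For negative inputs, use the equivalent form e^x / (1 + e^x),
    # which avoids overflow in math.exp for very negative logits.
    e = math.exp(logit)
    return e / (1.0 + e)


def normalize_scores(logits: Iterable[float]) -> List[float]:
    """Apply the sigmoid to every raw score, preserving order."""
    return [sigmoid(x) for x in logits]
```

Because the sigmoid is monotonic, normalizing changes the scale of the scores but not the ranking of the documents. Raw logits are enough when only the ordering matters; normalized scores are more convenient when a fixed threshold is applied.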

Technical Implementation

The reranker tokenizes the query and each document, concatenates the two token sequences, and runs a single forward pass, either encoding or decoding depending on the model architecture (LLama/LLamaReranker.cs:65-72).

Data Flow: Document Scoring

Diagram: Reranking Data Flow. It shows how text inputs are transformed into relevance scores within the LLamaReranker infrastructure.

Sources: LLama/LLamaReranker.cs:62-91, LLama/LLamaReranker.cs:119-141, LLama/LLamaReranker.cs:160-180

Key Classes and Functions

LLamaReranker

The primary entry point for reranking operations. It manages a LLamaContext internally and ensures the native handle is configured for embedding output via SetEmbeddings(true) (LLama/LLamaReranker.cs:15-44).

GetRelevanceScores: Scores a list of documents against a single input string. If the combined tokens exceed ContextSize, the work is split into multiple batches automatically (LLama/LLamaReranker.cs:62-91).
GetRelevanceScoreWithTokenCount: A helper method that scores a single document and also returns the total token count used (query plus document), which is useful for usage tracking (LLama/LLamaReranker.cs:103-146).
CalcRelevanceScores: An internal asynchronous method that runs the model (either EncodeAsync or DecodeAsync) and retrieves the resulting rank embeddings from sequence zero (LLama/LLamaReranker.cs:148-185).
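The automatic batching described for GetRelevanceScores can be pictured as greedy packing: query-document pairs are added to the current batch until the next pair would push the token total past the context size, at which point the batch is scored and a new one is started. The sketch below is an illustrative Python version of that idea, not the LLamaSharp implementation. The name pack_batches and its parameters are hypothetical, and token counts are assumed to be known up front.

```python
from typing import List


def pack_batches(
    query_len: int,
    doc_lens: List[int],
    context_size: int,
) -> List[List[int]]:
    """Group document indices into batches that fit within context_size.

    Each document costs query_len + doc_len tokens, because the query
    is concatenated with every document before scoring.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, doc_len in enumerate(doc_lens):
        pair_len = query_len + doc_len
        # Flush the pending batch if this pair would overflow it.
        if current and used + pair_len > context_size:
            batches.append(current)
            current, used = [], 0
        current.append(idx)
        used += pair_len
    if current:
        batches.append(current)
    return batches
```

For example, with a 4-token query, three 10-token documents, and a context size of 30, the first two pairs (28 tokens) share a batch and the third starts a new one. This sketch places an oversized pair in a batch of its own and does not truncate it; how the library treats that case should be checked against the source.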

Code Entity Mapping

Diagram: Code Entity Space, Reranking Subsystem. It maps the high-level reranking logic to the specific code entities in the LLamaSharp library.

Sources: LLama/LLamaReranker.cs:15-44, LLama/LLamaReranker.cs:141, LLama.Unittest/LLamaRerankerTests.cs:18-25

Usage Example

To use the reranker, the model must be loaded with the Rank pooling type.

Sources: LLama.Unittest/LLamaRerankerTests.cs:18-46, LLama/LLamaReranker.cs:40-41

Important Constraints

Model Support: The model must support producing embeddings. Encoder-decoder models are currently not supported for ranking in this implementation (LLama/LLamaReranker.cs:38-39).
Batching: For non-causal models (common among rerankers), BatchSize and UBatchSize in IContextParams must be identical (LLama/LLamaReranker.cs:36-37).
Memory Management: The reranker explicitly clears the KV cache with Context.NativeHandle.MemoryClear() between scoring operations so that document evaluations stay independent of one another (LLama/LLamaReranker.cs:113, LLama/LLamaReranker.cs:143, LLama/LLamaReranker.cs:154).
Context Size: The combined tokens of the query and documents in a single batch must fit within the configured ContextSize. When the next query-document pair would not fit, the batch-clearing logic scores and clears the pending batch before continuing; the same logic applies when a single document combined with the query exceeds the context size (LLama/LLamaReranker.cs:74-83).

URL: https://deepwiki.com/SciSharp/LLamaSharp/5.5-reranking
