RSRank: Learning Relevance from Representational Shifts

Archit Gupta¹ Sai Sundaresan¹ Debabrata Mahapatra¹

¹Adobe Research, India

Corresponding author: dmahapatra@adobe.com

Abstract

As enterprises deploy RAG-based systems to provide grounded responses to user queries, reranking has become a critical component for the final filtering step that separates relevant from distracting or irrelevant documents. Existing rerankers often rely on heuristic thresholds to achieve optimal filtering. Moreover, for relevance scoring, state-of-the-art methods use a language model’s logit signals, which are designed for next-token prediction, not for assessing relevance. To address these limitations, we identify a principled signal for relevance: the representational shift (RS) induced in a query’s internal state when conditioned on a document. We observe that the alignment between (a) RS induced by a candidate document and (b) RS induced by an oracle document-set provides a robust indicator of relevance. Building on this insight, we introduce a lightweight training framework that learns projections mapping RS to calibrated relevance scores. Our training objectives naturally filter irrelevant content at a zero threshold, reducing dependence on heuristic tuning. Across diverse retrieval datasets, our method delivers gains over SOTA rerankers.

1 Introduction

1.1 Role of Rerankers in Enterprise Systems

Information retrieval (IR) systems form foundational infrastructure for enterprises that serve users at scale. Search engines, knowledge bases, and AI assistants increasingly depend on their ability to identify relevant information from a large corpus for a user query. A fundamental tradeoff that governs the design of these systems is efficiency vs. accuracy. Traditional methods like BM25 (Robertson and Zaragoza, 2009) and modern dense embedding approaches (Khattab and Zaharia, 2020; Nogueira et al., 2020) are efficient, as documents are encoded once and indexed for fast lookup, but suffer from an information bottleneck, being unable to represent query-specific information when constrained to a single document representation (Luan et al., 2021).

The introduction of reranking through a two-stage retrieval pipeline (Wang et al., 2011) addresses this representational limitation. The first stage retrieves a broad candidate set using efficient methods (accepting some loss in precision), and the second stage applies more expensive models to rerank these candidates. Rerankers jointly represent query and document tokens, capturing semantic relationships that independent encodings miss, improving retrieval quality (Nogueira and Cho, 2020). More recently, LLM-based rerankers have extended this paradigm, achieving SOTA results (Zhang et al., 2025). This two-stage architecture is now standard in enterprise systems (Liu et al., 2017; Microsoft Research, 2021).

1.2 Reranking in Retrieval Systems

Retrieval systems (Lewis et al., 2020) have widely adopted rerankers, with vector database platforms providing native reranker support (Pinecone, 2024b; Weaviate, 2024), and cloud providers offering reranking through APIs (Amazon Web Services, 2024; Google Cloud, 2025). In these systems, retrieved documents enter the LM’s context window, and irrelevant documents degrade response quality (Liu et al., 2024; Wu et al., 2024), increase latency, and raise API costs (Pinecone, 2024a). Accurate selection is therefore critical, and the tradeoff between accuracy and efficiency becomes even more consequential in this setting.

In practice, the standard approach is to retrieve a broad candidate set (e.g., top-100) via embedding search, rerank, and then select a subset for the LM’s context. The selection step is typically performed via fixed top-k selection or score thresholding.

1.3 Limitations of Existing Rerankers

Neither selection strategy adequately addresses the efficiency–accuracy tradeoff. A fixed top-k selection ignores that queries differ in the number of relevant documents, so it often selects too many or too few documents per query. Determining a score cutoff is also difficult because rerankers are not calibrated for absolute relevance; optimal values vary across domains and even across queries. We quantify these inefficiencies empirically in Sec. 2.

These limitations stem, in part, from how rerankers derive their signal. Current approaches rely on signals tuned for next-token prediction (internal states, attention maps, logits), rather than relevance assessment (Zhang et al., 2025; Chen et al., 2024, 2025a). The resulting scores are effective for ranking—ordering documents by relevance—but poorly calibrated for selection—deciding which documents are relevant.

1.4 Toward a Calibrated Relevance Signal

The limitations above motivate a search for a different relevance signal, one that is inherently calibrated rather than repurposed from the next-token prediction objective. We observe that relevance fundamentally concerns how a document changes the model’s internal representation of a query: a relevant document should shift that representation in a characteristic way. This observation leads us to formalize and study representational shifts (RS), the change in the model’s representation of the query induced by a document in context. In Sec. 4, we show that the geometry of RS encodes relevance information that, when transformed through a learned projection, yields scores calibrated towards a natural decision boundary.

1.5 Contributions

Our key contributions are as follows:
1. We highlight the threshold inconsistency problem in current SOTA rerankers and provide a means to quantify its impact.
2. We identify representational shifts as a relevance signal: changes in the query’s value vectors induced by conditioning on a document in context.
3. We introduce a lightweight learning framework that maps the representational shift space to calibrated scores, yielding a consistent decision boundary across datasets.
4. We demonstrate competitive performance across six diverse retrieval datasets, achieving 2.0- and 7.2-point gains in Recall@5 and F1 at the natural threshold, respectively, relative to baselines, while using only 2.3M trained parameters on top of frozen LLM representations.

2 Threshold Inconsistency in Rerankers

We evaluate Qwen3-Reranker-8B (Zhang et al., 2025) on six retrieval datasets (Sec. 5) and analyze the resulting score distributions to demonstrate the threshold inconsistency problem.

2.1 Optimal Thresholds Vary Across Datasets

For each dataset, we compute the optimal threshold, i.e., the threshold achieving the highest mean F1 across up to 500 queries. We plot this against the per-dataset score range after applying a global min-max normalization (mapping the score range to [0, 1]) so that models with different native scales can be compared directly. To quantify dataset-level calibration we report two metrics: Bias, the absolute offset between the optimal threshold and the model’s natural decision boundary, and Variance, the spread of per-dataset optimal thresholds around their mean. Fig. 1 shows the results for Qwen3-Reranker-8B: bias = 0.379, variance = 0.023. This reveals that the natural threshold consistently overshoots the true decision boundary, while the optimal threshold varies substantially across domains. Consequently, effective deployment requires labeled data for calibration, limiting out-of-the-box performance.
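The threshold analysis above can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used in the paper: the function names, the threshold grid, and the toy scores are hypothetical, and it assumes that Bias is the mean absolute offset of the per-dataset optimal thresholds from the natural boundary and that Variance is the population variance of those thresholds.

```python
from statistics import mean, pvariance


def f1_at_threshold(scores, labels, tau):
    """F1 for one query when every document with score > tau is selected."""
    tp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= tau and y == 1)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def optimal_threshold(queries, grid):
    """Threshold on `grid` with the highest mean F1 over (scores, labels) pairs."""
    return max(grid, key=lambda tau: mean(f1_at_threshold(s, y, tau) for s, y in queries))


def bias_and_variance(optimal_thresholds, natural_threshold):
    """Assumed definitions: mean absolute offset and population variance."""
    bias = mean(abs(t - natural_threshold) for t in optimal_thresholds)
    return bias, pvariance(optimal_thresholds)


# Toy data (hypothetical): two queries with min-max normalized scores in [0, 1].
toy_queries = [([0.9, 0.4, 0.1], [1, 1, 0]), ([0.8, 0.3, 0.2], [1, 0, 0])]
toy_grid = [i / 10 for i in range(10)]
best_tau = optimal_threshold(toy_queries, toy_grid)
```

Running `optimal_threshold` once per dataset and passing the resulting list to `bias_and_variance` gives the two dataset-level calibration numbers under these assumed definitions.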

2.2 Fixed Thresholds Hurt Individual Queries

Even within a single dataset, the optimal threshold varies substantially across queries. Fig. 2 illustrates this effect on HotpotQA by showing the fraction of queries for which the F1 score obtained using the dataset-level optimal threshold falls short of the per-query optimal F1. For Qwen3-Reranker-8B, 63% of queries incur an F1 loss greater than 0.1, and 30% incur a loss greater than 0.3.

Table 1: Paired t-test: per-query optimal F1 (Q-Opt) vs. dataset-optimal threshold F1 (D-Opt) for Qwen3-Reranker-8B. N is the number of queries and t is the test statistic.

Dataset	N	Q-Opt	D-Opt	Gap	t
2WikiMQA	500	73.0	55.0	18.1	24.22
Fever	500	100.0	99.5	0.5	2.73
FiQA	500	99.1	95.0	4.0	8.05
HotpotQA	500	79.7	61.0	18.8	23.94
MuSiQue	500	77.7	61.0	16.7	22.93
NFCorpus	323	82.1	64.9	17.2	15.40

To quantify performance loss attributable specifically to poor calibration rather than reranking quality, we conduct a paired t-test comparing dataset-level optimal F1 scores with per-query optimal F1 scores. Table 1 reports the results for Qwen3-Reranker-8B across datasets. In all cases, the difference between the two scores is statistically significant, indicating that a large portion of the observed performance gap arises from calibration error rather than limitations in ranking ability.
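For reference, the paired t-statistic underlying this comparison can be computed as follows. This is a minimal sketch with hypothetical per-query F1 values; it computes only the test statistic (the p-value would additionally require the t-distribution with n − 1 degrees of freedom) and is not the analysis script used for Table 1.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(per_query_optimal_f1, dataset_optimal_f1):
    """t = mean(diff) / (sample_std(diff) / sqrt(n)) over per-query differences."""
    diffs = [a - b for a, b in zip(per_query_optimal_f1, dataset_optimal_f1)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Hypothetical F1 values for four queries.
q_opt = [1.0, 0.8, 1.0, 0.9]   # F1 at each query's own best threshold
d_opt = [0.6, 0.8, 0.5, 0.7]   # F1 at the single dataset-level threshold
t_stat = paired_t_statistic(q_opt, d_opt)
```

Because the per-query optimum can never be lower than the dataset-level F1 for the same query, every difference is non-negative, and the statistic measures how consistently the fixed threshold falls short.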

Figure 1: Optimal threshold for Qwen3-Reranker-8B for F1 across datasets. The score axis shows the range of scores (globally normalized); the optimal threshold is indicated by the red dot.

Figure 2: F1 gap CDF for Qwen3-Reranker-8B on HotpotQA. The x-axis shows the F1 gap between the dataset-level optimal threshold and the per-query optimal threshold; the y-axis shows the fraction of queries exceeding that gap. 63% of queries lose more than 0.1 F1 from using a fixed threshold.

3 Related Work

Ranking paradigms.

Reranking methods can be broadly categorized into three paradigms. Pointwise methods score each query–document pair independently, enabling efficient threshold-based selection (Nogueira et al., 2020; Ma et al., 2024; Zhang et al., 2025). Listwise methods condition on the entire candidate set and directly optimize or generate ranked permutations (Pradeep et al., 2023; Sun et al., 2023). Setwise methods iteratively identify the most relevant document from subsets of candidates (Chen et al., 2025b; Zhuang et al., 2024). Since listwise and setwise approaches incur substantially higher computational and latency overheads for large candidate sets, we primarily compare against pointwise rerankers.

Model architectures.

Reranking models fall into three architectural families. Cross-encoders jointly encode the query and document for relevance prediction (Nogueira et al., 2020; Pradeep et al., 2021; Khattab and Zaharia, 2020). Open-source LLMs have been adapted for pointwise, listwise, and setwise reranking (Ma et al., 2024; Pradeep et al., 2023; Zhang et al., 2025; Meng et al., 2025; Sun et al., 2026; BehnamGhader et al., 2026). Closed-source LLMs are often used as zero-shot rerankers via prompting (Sun et al., 2023). Across these families, relevance is typically inferred from language modeling objectives rather than explicit relevance supervision. We primarily compare against cross-encoder and open-source LLM rerankers of equivalent scale.

Threshold Calibration.

Prior work studies calibration techniques for making reranker scores comparable across queries and suitable for threshold-based decisions. Methods based on Platt scaling and related calibration approaches (Platt, 1999; Posokhov et al., 2025; Ren et al., 2025; Yu et al., 2025) convert raw scores into calibrated confidence estimates, but still require task-specific calibration and externally defined thresholds. Other approaches derive statistically grounded thresholds (Li et al., 2022) or use predictive uncertainty for selective acceptance (Yoon and Sael, 2025; Yoon et al., 2025), yet they also depend on auxiliary decision rules. RSRank learns a consistent relevance boundary directly during training, while remaining compatible with post-hoc calibration and thresholding techniques.

Probing LLM Representations.

Recent work shows that intermediate LLM representations encode task-relevant signals. Intermediate layers improve embedding quality (Skean et al., 2025), contrastive layer analysis enhances factuality (Zhang et al., 2024), hidden states support document attribution (Phukan et al., 2024), and attention patterns have been used for reranking (Chen et al., 2025a). These findings highlight the importance of leveraging model internals beyond next-token prediction. In contrast to prior work on hidden states or attention weights, our approach uses value vector shifts to capture how a document updates the model’s internal representation of a query, directly aligning the representation with the relevance decision.

Intrinsic Geometry in LLMs.

Prior work shows that neural representations are highly anisotropic, often forming a cone-shaped geometry in embedding space (Ait-Saada and Nadif, 2023), a property observed consistently across layers and architectures (Razzhigaev et al., 2024; Skean et al., 2025). Rather than being a training artifact, this geometry has been argued to encode meaningful semantic and structural information (Godey et al., 2024; Kudrjashov et al., 2025). Although some methods attempt to suppress anisotropy through normalization or whitening, recent evidence suggests that doing so can degrade generation and downstream performance, implying that anisotropy itself carries useful signals (Godey et al., 2024; Kudrjashov et al., 2025). These observations motivate our approach: we leverage the anisotropic structure of RS to extract a relevance-discriminative signal.

4 Methodology

4.1 Finite Difference as Representational Shift

We consider a decoder-only Transformer with $L$ layers and $H$ attention heads per layer. Let $\mathcal{V}$ denote the vocabulary and let a document and query be token sequences $d \in \mathcal{V}^{m}$ and $q \in \mathcal{V}^{n}$, respectively. We study how the document prefix alters internal representations of query tokens during prefill.

Pre-attention value vectors.

Fix a layer $\ell$ and head $h$ with head dimension $d_h$. Let $x^{(\ell)}_t(s)$ denote the residual stream vector entering layer $\ell$ at position $t$ for input sequence $s$, and let $W_V^{(\ell,h)}$ be the value projection matrix for head $h$. The pre-attention value vector at position $t$ is

$$v^{(\ell,h)}_t(s) = W_V^{(\ell,h)}\, x^{(\ell)}_t(s) \in \mathbb{R}^{d_h}. \tag{1}$$

In our reranking setup, we focus on value vectors of the query tokens in the concatenated input $[d; q]$.

Controlling for prefix length.

Prepending a document changes both (i) the content available for attention and (ii) the absolute positions of query tokens. Since our goal is to isolate the effect of prefix content alone, we compare the document prefix to a length-matched null prefix $\varnothing_m$ that carries minimal semantic content (e.g., padding or benign filler tokens), yielding a controlled finite-difference feature. Specifically, for each query token position $t \in \{1, \dots, n\}$, we define the document-induced delta of the value vector at head $(\ell, h)$ as

$$\Delta v^{(\ell,h)}_t(d, q) = v^{(\ell,h)}_{m+t}([d; q]) - v^{(\ell,h)}_{m+t}([\varnothing_m; q]), \tag{2}$$

a standard construction in discrete/finite-difference calculus (Appendix A). However, specifically for value-vector-based signals, we can simplify this construction to Eq. (3) when the base model uses RoPE (Su et al., 2024), which encodes relative positions. Under RoPE, the QK attention between query tokens remains the same regardless of how far the query is shifted, and since the value vectors themselves are not subject to RoPE, they are not affected by their absolute position. The null-prefix term can therefore be replaced by a query-only forward pass:

$$\Delta v^{(\ell,h)}_t(d, q) = v^{(\ell,h)}_{m+t}([d; q]) - v^{(\ell,h)}_{t}(q). \tag{3}$$

Representational shift tensor.

Let $\mathcal{S}$ be a selected set of layers and heads. Collecting the per-head deltas across query token positions $t = 1, \dots, n$, we define the representational shift tensor as

$$\Delta(d, q) = \big[\Delta v^{(\ell,h)}_t(d, q)\big]_{(\ell,h) \in \mathcal{S},\; t \in \{1, \dots, n\}} \in \mathbb{R}^{|\mathcal{S}| \times n \times d_h}, \tag{4}$$

where $d_h$ is the head dimension. Intuitively, $\Delta(d, q)$ captures how the document prefix re-contextualizes each query token in value space, at a resolution indexed by $(\ell, h, t)$. Given a set of documents $D$, we write $\Delta(D, q)$ for the shift induced by conditioning on all documents in $D$.
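As an illustration of the simplified finite difference in Eq. (3), the sketch below computes the shift for a single layer and head from two sets of pre-attention value vectors. It assumes the value vectors have already been extracted from the model (for example via forward hooks, which are not shown); the function name and the toy numbers are hypothetical.

```python
def representational_shift(values_doc_query, values_query_only):
    """Per-token delta for one (layer, head).

    values_doc_query:  value vectors for every position of [document; query]
    values_query_only: value vectors for the query-only forward pass
    Returns one delta vector per query token.
    """
    n = len(values_query_only)            # number of query tokens
    m = len(values_doc_query) - n         # number of document (prefix) tokens
    if m < 0:
        raise ValueError("concatenated input is shorter than the query")
    query_part = values_doc_query[m:]     # query positions in the concatenated input
    return [
        [with_doc - without_doc for with_doc, without_doc in zip(v_doc, v_null)]
        for v_doc, v_null in zip(query_part, values_query_only)
    ]


# Toy example: one document token, two query tokens, head dimension 2.
with_document = [[9.0, 9.0], [1.0, 2.0], [3.0, 5.0]]
query_only = [[0.5, 1.0], [1.0, 1.0]]
shift = representational_shift(with_document, query_only)
```

Stacking these per-head results over the selected layers and heads yields a tensor with one delta vector per (layer, head, query token).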

4.2 Representational Shift Models Relevance

Notation.

For a query $q$ with candidate document set $D$, let $D^+ \subseteq D$ denote the set of relevant documents and $D^- = D \setminus D^+$ the irrelevant documents.

Oracle shift.

We define the oracle shift $\Delta(D^+, q)$ as the representational shift induced by conditioning on all relevant documents simultaneously. This shift encodes the aggregate effect that the complete relevant context has on the model’s internal representations of the query.

Oracle-similarity ranking.

For each candidate document $d \in D$, we compute the cosine similarity between its individual shift $\Delta(d, q)$ and the oracle shift $\Delta(D^+, q)$, and rank documents accordingly. On the 2WikiMQA validation set, this oracle-similarity ranking achieves R@5 = 89.5, P@5 = 42.8, and F1@5 = 57.9, demonstrating that alignment with the oracle shift is effective in separating relevant and irrelevant documents, and that the geometry of the shift space encodes relevance structure.
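The oracle-similarity ranking can be sketched as below. The example assumes each shift has been flattened into a single vector before the cosine similarity is taken, which is one plausible reading of the description; the document identifiers and vectors are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def rank_by_oracle_similarity(candidate_shifts, oracle_shift):
    """Sort document ids by similarity of their shift to the oracle shift."""
    return sorted(
        candidate_shifts,
        key=lambda doc_id: cosine(candidate_shifts[doc_id], oracle_shift),
        reverse=True,
    )


# Toy flattened shifts for three candidate documents.
toy_shifts = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [-1.0, 0.0]}
toy_oracle = [1.0, 0.1]
ranking = rank_by_oracle_similarity(toy_shifts, toy_oracle)
```

This ranking needs the oracle shift and is therefore only usable as a diagnostic when relevance labels are available.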

From oracle similarity to learned projection.

The oracle shift is unavailable at inference time. However, the experiment above suggests that if we can learn a projection $W$ that transforms the RS space such that the projected oracle shift $W\Delta(D^+, q)$ falls in a chosen orthant, specifically aligning with the all-ones vector $\mathbf{1}$ across queries, then the oracle similarity reduces to $\cos(W\Delta(d, q), \mathbf{1})$, which is computable at inference. This motivates our learning framework.

4.3 Learning Calibrated Projections

4.3.1 Projection Matrix

We learn a projection matrix $W$ with projection dimension $k$. For each layer $\ell$ and head $h$, the submatrix $W^{(\ell,h)} \in \mathbb{R}^{k \times d_h}$ projects the $d_h$-dimensional shift into a $k$-dimensional space.

Scoring Function.

Given a candidate document $d$, we compute its relevance score as:

(5)

(6)

where $\mathbf{1}$ denotes the all-ones vector in $\mathbb{R}^k$. Intuitively, $W$ learns to extract a “relevance direction” from representational shift vectors: documents whose projected shifts have high cosine similarity with $\mathbf{1}$ receive high scores.
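A minimal sketch of this kind of scoring, for a single head, is given below. How shifts are aggregated across query tokens, heads, and layers is not recoverable from the text here, so the mean pooling over query tokens is an assumption made purely for illustration, as are the function names and toy values.

```python
from math import sqrt


def project(weight_rows, vector):
    """Apply a k x d_h projection (given as k rows) to a d_h-dimensional vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight_rows]


def relevance_score(weight_rows, token_shifts):
    """Cosine similarity between the projected shift and the all-ones vector.

    Assumption: the per-token shift vectors are mean-pooled before projection.
    """
    n = len(token_shifts)
    pooled = [sum(column) / n for column in zip(*token_shifts)]
    z = project(weight_rows, pooled)
    norm = sqrt(sum(a * a for a in z))
    k = len(z)
    # cos(z, 1) = sum(z) / (||z|| * sqrt(k))
    return 0.0 if norm == 0 else sum(z) / (norm * sqrt(k))


identity = [[1.0, 0.0], [0.0, 1.0]]
aligned = relevance_score(identity, [[1.0, 1.0], [3.0, 3.0]])
opposed = relevance_score(identity, [[-1.0, -1.0]])
```

Because the score is a cosine similarity, it lies in [-1, 1], and its sign indicates on which side of the zero boundary a document falls.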

4.3.2 Training Objectives

Our training objective consists of five terms designed to achieve calibrated separation of relevant and irrelevant documents at a fixed threshold.

Calibration Loss.

The core objective pushes relevant documents to have positive scores and irrelevant documents to have negative scores:

(7)

where $\mathrm{ReLU}(\cdot)$ denotes the ReLU function and $D^+$, $D^-$ are the relevant and irrelevant document sets as defined in Sec. 4.2. This loss creates a natural decision boundary at a score of zero.
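One hinge-style loss consistent with this description is sketched below. The per-class averaging is an assumption (the exact normalization is not recoverable from the text), and the function names and toy scores are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def calibration_loss(relevant_scores, irrelevant_scores):
    """Penalize relevant documents scored below 0 and irrelevant ones scored above 0.

    Assumption: each class contributes the mean of its per-document penalties.
    """
    relevant_term = sum(relu(-s) for s in relevant_scores) / len(relevant_scores)
    irrelevant_term = sum(relu(s) for s in irrelevant_scores) / len(irrelevant_scores)
    return relevant_term + irrelevant_term


mixed = calibration_loss([0.5, -0.2], [-0.4, 0.1])
separated = calibration_loss([0.5, 0.2], [-0.4, -0.1])
```

The loss is zero exactly when every relevant score is non-negative and every irrelevant score is non-positive, which is what places the decision boundary at zero.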

Margin Loss.

To ensure robust separation, we enforce a margin between classes:

(8)

The first two terms push relevant scores above a positive margin and irrelevant scores below a negative margin. The third term ensures that the lowest-scoring relevant document exceeds the highest-scoring irrelevant document by at least a fixed margin. The margin value is kept fixed in all experiments.
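The three margin terms can be sketched as follows. This is an illustration under stated assumptions: a single shared margin is used for all three terms, the first two terms are averaged per class, and the margin value in the toy call is arbitrary rather than the value used in the experiments.

```python
from statistics import mean


def relu(x):
    return max(0.0, x)


def margin_loss(relevant_scores, irrelevant_scores, margin):
    """Three hinge terms: relevant above +margin, irrelevant below -margin,
    and a gap of at least `margin` between the worst relevant and best irrelevant."""
    above = mean(relu(margin - s) for s in relevant_scores)
    below = mean(relu(margin + s) for s in irrelevant_scores)
    gap = relu(margin - (min(relevant_scores) - max(irrelevant_scores)))
    return above + below + gap


# Hypothetical scores and margin.
toy_margin_loss = margin_loss([0.6, 0.3], [-0.5, -0.1], margin=0.2)
```

In the toy call only the irrelevant document scored at -0.1 violates its margin, so it alone contributes to the loss.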

Orthogonality Regularization.

To prevent dimension collapse in the projection, we regularize each per-head projection $W^{(\ell,h)}$ to have orthonormal rows:

(9)

This ensures all dimensions are effectively utilized and prevents the projection from degenerating to a lower-rank mapping.
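A common way to encourage orthonormal rows is to penalize the squared Frobenius distance between the Gram matrix of the rows and the identity. The sketch below shows that penalty for one per-head projection; whether the paper uses exactly this form is an assumption.

```python
def orthogonality_penalty(weight_rows):
    """Squared Frobenius norm of (W W^T - I) for a matrix given as a list of rows."""
    k = len(weight_rows)
    total = 0.0
    for i in range(k):
        for j in range(k):
            gram_ij = sum(a * b for a, b in zip(weight_rows[i], weight_rows[j]))
            target = 1.0 if i == j else 0.0
            total += (gram_ij - target) ** 2
    return total


orthonormal = orthogonality_penalty([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
collapsed = orthogonality_penalty([[1.0, 0.0], [1.0, 0.0]])
```

The penalty vanishes for orthonormal rows and grows when rows duplicate each other, which is the collapse this regularizer is meant to discourage.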

Oracle Alignment.

We provide explicit supervision on the direction of the oracle shift projection:

(10)

where $\Delta(D^+, q)$ is the oracle representational shift induced by the full set of relevant documents (Sec. 4.2). This anchors the “ideal relevance direction” to the all-ones vector.

Magnitude Constraint.

For training stability, we bound the Frobenius norm of the projection matrix $W$:

(11)

Total Objective.

The complete training loss is:

(12)

5 Experimental Setup

5.1 Baseline

We compare against Qwen3-Reranker-8B (Zhang et al., 2025), a state-of-the-art LLM reranker built on Qwen3-8B (Team, 2025) that produces calibrated binary relevance scores. For fairness, RSRank uses frozen representations from the same backbone. We focus on Qwen3-Reranker-8B because it substantially outperforms earlier rerankers such as BGE-Reranker (Chen et al., 2024) and Jina Reranker v2 (JinaAI, 2025). We additionally compare against MonoT5 (Nogueira et al., 2020), a standard cross-encoder baseline, and LLM2Vec-Gen (BehnamGhader et al., 2026) (Qwen3-8B) to isolate gains beyond representation quality alone.

5.2 Datasets

We evaluate on six retrieval datasets spanning multi-hop reasoning and domain-specific retrieval.

Multi-hop QA.

2WikiMultihopQA (Ho et al., 2020) focuses on compositional reasoning across pairs of Wikipedia articles with sentence-level supporting facts. HotpotQA (Yang et al., 2018) emphasizes multi-hop reasoning in a distractor setting, where each question is paired with 2 supporting and 8 TF-IDF-retrieved paragraphs. MuSiQue (Trivedi et al., 2022) increases reasoning complexity by composing multiple single-hop questions into multi-hop questions and includes adversarial unanswerable examples.

Domain-Specific Retrieval.

FiQA (Maia et al., 2018) contains opinion-based financial question answering data. FEVER (Thorne et al., 2018) is a fact verification dataset where claims must be supported or refuted using evidence from Wikipedia. NFCorpus (Boteva et al., 2016) is a biomedical retrieval dataset linking natural-language nutrition and medical queries to scientific documents.

Document granularity.

The multi-hop QA datasets (2WikiMQA, HotpotQA) operate at sentence-level granularity, while BEIR datasets and MuSiQue use paragraph-level documents. In practice, RAG pipelines operate on chunked passages, which is the primary setting we target. Our method is trained and evaluated jointly on these mixed chunk lengths, demonstrating generalisation across the passage sizes typical of chunked retrieval. Reranking over longer-form documents (e.g., entire articles) is an interesting direction but falls outside the scope of this work.

5.3 Training

Training protocol.

RS features are pre-computed by running a forward pass per query-document pair. We train the projection matrix (2.3M parameters; Sec. 4.3) on just 2000 samples from the datasets listed above (excluding Fever, which serves as a zero-shot test). Layer 0 is excluded from training since its value vectors are raw token embeddings rather than contextualized representations. Complete training hyperparameters are provided in Appendix B.4.

Training cost.

RSRank is cheaper to train than a fully fine-tuned reranker such as Qwen3-Reranker-8B because it operates on frozen LLM representations and trains only an independent projection matrix. The expensive LLM forward pass is therefore run only once per query-document pair, and backpropagation never updates the LLM’s parameters.

5.4 Evaluation

Evaluation protocol.

For each dataset, we evaluate on up to 500 sampled queries from the test split. For multi-hop datasets, each query comes with its original set of gold and distractor documents. For BEIR datasets, we pair each query with its relevant documents plus 15 randomly sampled negatives from the corpus.

Metrics.

We report the following metrics:

1. NDCG@5: Normalized Discounted Cumulative Gain at rank 5, measuring ranking quality.
2. Recall@5: Fraction of relevant documents appearing in the top 5 ranked positions, measuring retrieval coverage.
3. F1@τ: F1 score computed by thresholding scores at each method’s natural decision boundary τ. This metric measures how well a method separates relevant from irrelevant documents without dataset-specific tuning, and is our primary metric for evaluating calibration quality.
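For concreteness, the two ranking metrics can be computed from a list of binary relevance labels in ranked order, as sketched below. The sketch uses the standard binary-gain NDCG with a log2 discount; the exact gain and discount convention used in the paper is not stated, so treat that choice as an assumption.

```python
from math import log2


def recall_at_k(ranked_labels, k=5):
    """Fraction of all relevant documents that appear in the top k positions."""
    total_relevant = sum(ranked_labels)
    return 0.0 if total_relevant == 0 else sum(ranked_labels[:k]) / total_relevant


def ndcg_at_k(ranked_labels, k=5):
    """Binary-gain NDCG@k with a log2(rank + 1) discount."""
    dcg = sum(y / log2(i + 2) for i, y in enumerate(ranked_labels[:k]))
    ideal = sorted(ranked_labels, reverse=True)[:k]
    idcg = sum(y / log2(i + 2) for i, y in enumerate(ideal))
    return 0.0 if idcg == 0 else dcg / idcg


# Toy ranking: 1 = relevant, 0 = irrelevant, best-scored document first.
toy_ranking = [1, 0, 1, 0, 0, 1]
toy_recall = recall_at_k(toy_ranking)
toy_ndcg = ndcg_at_k(toy_ranking)
```

Neither metric depends on the absolute score values, only on the order, which is why a separate threshold-based F1 is needed to assess calibration.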

Inference cost.

RSRank requires one additional query-only forward pass relative to baselines to compute the null-prefix baseline, followed by a lightweight projection and cosine similarity step. Because the query-only pass is amortized across all documents for a query, the effective per-document overhead is negligible. As shown in Table 2, the full RSRank pipeline adds only 15.4 ms (+0.88%) over standard query+document forwards on an A100-80GB GPU.

Table 2: Time to rerank 100 docs on an A100-80GB GPU (mean ± std over 30 runs) for Qwen3-8B. 100 fwd: query+document prompts, batched. 101 fwd: plus one query-only prompt. 101 fwd + scoring: full RSRank pipeline.

Stage	Prompt length	Time (ms)
100 fwd	—
101 fwd	()
101 fwd + scoring	same	()

6 Results

Table 3: Results across six retrieval datasets. Best result per column in bold; ties within 1 point are co-bolded. F1@τ uses each method’s default decision boundary τ without dataset-specific tuning.

NDCG@5
Method	2WikiMQA	HotpotQA	MuSiQue	FiQA	Fever	NFCorpus	Avg
MonoT5-base (0.2B)	61.8	68.6	66.7	96.5	99.9	80.0	78.9
MonoT5-3B	64.4	73.7	73.5	98.5	100.0	84.9	82.5
LLM2Vec-Gen (Qwen3-8B)	55.7	60.0	62.9	97.0	99.7	83.6	76.5
Qwen3-Reranker-8B	71.6	81.1	78.8	99.4	100.0	86.9	86.3
RSRank (Ours)	80.2	79.6	84.0	97.5	99.2	83.4	87.3
Recall@5
MonoT5-base (0.2B)	63.0	71.6	68.6	94.4	99.8	31.0	71.4
MonoT5-3B	66.0	76.4	76.7	96.3	99.9	34.6	75.0
LLM2Vec-Gen (Qwen3-8B)	60.5	63.9	69.8	95.1	99.8	35.1	70.7
Qwen3-Reranker-8B	74.0	84.3	84.0	96.7	99.8	35.8	79.1
RSRank (Ours)	85.1	83.0	88.8	95.6	99.7	34.3	81.1
F1@τ (natural threshold)
MonoT5-base	45.9	47.3	49.0	53.9	91.3	11.5	49.8
MonoT5-3B	48.1	53.3	55.4	62.2	91.5	9.1	53.3
LLM2Vec-Gen (Qwen3-8B)	16.7	11.9	18.2	23.8	13.1	61.5	24.2
Qwen3-Reranker-8B	51.9	60.6	57.1	80.6	98.2	13.5	60.3
RSRank (Ours)	60.5	51.8	61.2	85.9	83.4	62.2	67.5

6.1 Ranking Quality

As shown in Table 3, RSRank achieves the best average NDCG@5 (87.3) and Recall@5 (81.1) across all six datasets. The largest gains appear on the multi-hop benchmarks 2WikiMQA and MuSiQue, where RSRank outperforms Qwen3-Reranker-8B by 8.6 and 5.2pp in NDCG@5, respectively. On HotpotQA and the BEIR benchmarks, performance is largely comparable, with Qwen3-Reranker-8B holding a slight edge on HotpotQA and NFCorpus. Both methods achieve near-perfect results on FEVER and FiQA, while NFCorpus remains challenging for both. These results demonstrate that representational shifts provide a competitive alternative to fully trained rerankers.

6.2 Selection Quality

RSRank, evaluated at its designed threshold of 0, achieves the highest average F1 score (67.5), outperforming Qwen3-Reranker-8B at its default threshold (60.3) by 7.2pp. The largest gap appears on NFCorpus, where Qwen3-Reranker-8B is severely miscalibrated: its default threshold yields an F1 of only 13.5, despite the dataset-optimal threshold lying near zero. RSRank also shows consistent gains on 2WikiMQA, MuSiQue, and FiQA, suggesting that representational shifts provide a more robust relevance signal under fixed-threshold evaluation.

Figure 3: Optimal threshold for RSRank for best mean F1 across datasets. The score axis shows the range of scores (globally normalized) given by the reranker; the optimal threshold is indicated by the red dot. Optimal threshold Bias: 0.0221, Variance: 0.0005.

6.3 Threshold Stabilization Across Datasets

RSRank produces substantially more consistent thresholds across datasets than Qwen3-Reranker-8B, reducing threshold bias from 0.379 to 0.022 (17× lower) and variance from 0.023 to 0.0005 (47× lower). Fig. 3 shows that RS training aligns per-dataset optimal thresholds under a shared global normalization, improving calibration robustness across datasets.

7 Analysis

We conduct ablation studies on the 2WikiMQA validation set to understand which design choices drive RSRank’s performance. Ablations on architectural choices and loss components, as well as comparisons with analytical baselines, are presented in Appendix B.

7.1 Separability of Representational Shift

Fig. 4 visualizes RS vectors via UMAP for 100 queries (3115 irrelevant, 243 relevant documents). Raw shifts (left) show relevant and irrelevant documents thoroughly intermixed: the shift signal alone does not linearly separate the classes. After the learned projection (right), relevant documents consolidate into a distinct cluster, confirming that the projection extracts a relevance-discriminative subspace.

Figure 4: UMAP visualization of representational shifts on 2WikiMQA. Left: Raw shifts, where relevant (red) and irrelevant (blue) documents overlap. Right: After the learned projection, relevant documents form a clearly separated cluster.

7.2 Sample Efficiency

The stability of the RS space allows the learned projection to converge with remarkably few examples. Table 4 shows that just 50 samples achieve 89.2% R@5 (within 5pp of the 800-sample model), and performance saturates around 400 samples.

Table 4: Sample efficiency on 2WikiMQA (10 epochs, Qwen3-8B). Performance saturates by 400 samples.

Samples	0	50	100	200	400	800
R@5	28.7	89.2	90.7	91.1	93.8	93.9
F1@0	13.2	74.5	76.9	77.7	80.8	82.1

7.3 Per-Query Calibration

As established in Sec. 2, threshold calibration operates at two levels: dataset-level and query-level. Sec. 6.3 showed that RSRank effectively addresses dataset-level calibration. We now examine the query-level picture.

Fig. 5 decomposes each model’s F1 into two components: the dataset-optimal F1 and the additional headroom to per-query optimal F1. RSRank achieves a higher per-query optimal F1 than Qwen3-Reranker-8B on average, indicating stronger underlying ranking quality. The headroom from dataset-optimal to query-optimal is larger for RSRank (+15.9 vs. +12.6 on average). These results indicate that RSRank provides a strong foundation with better dataset-level calibration. The superior per-query ranking of RSRank indicates that future work on per-query calibration can further improve the performance.

Figure 5: Dataset-optimal F1 and headroom to per-query optimal F1 for Qwen3-Reranker-8B and RSRank. RSRank achieves higher per-query optimal F1 on average (86.3 vs. 85.3), indicating better ranking quality.

8 Conclusion

We present RSRank, a reranking method that uses representational shifts (RS) of value vectors to produce relevance scores calibrated at a natural decision boundary. Our motivation identifies two levels of threshold calibration (dataset-level and query-level) and shows how existing rerankers suffer at both. RSRank addresses dataset-level calibration by learning a lightweight projection (2.3M params) that maps RS to scores, reducing dataset-level threshold bias by 17× and variance by 47× compared to Qwen3-Reranker-8B. Across six diverse retrieval benchmarks, RSRank achieves the highest average NDCG@5, Recall@5, and F1 at its natural threshold, outperforming the SOTA Qwen3-Reranker-8B. A headroom analysis further shows that RSRank attains higher per-query optimal F1 than the baseline, confirming stronger underlying ranking quality.

Future Work.

The remaining headroom from dataset to per-query optimal F1 (+15.9pp on average) shows that query-level threshold selection could unlock further gains without retraining. Another direction is end-to-end evaluation within a full RAG pipeline, integrating RSRank with upstream embedding retrieval and downstream generation to measure its impact on final answer quality.

Limitations

Random negatives vs. hard negatives.

Our BEIR evaluation pairs queries with randomly sampled corpus negatives rather than hard negatives from a retrieval stage. Random negatives could be topically unrelated and easier for any model to distinguish. While this reflects a realistic RAG context filtering scenario—where a first-stage retriever has already narrowed the candidate set—it overestimates absolute performance compared to full-corpus retrieval benchmarks. Evaluation with hard negatives from a BM25 or dense retrieval first stage is needed to assess robustness in more adversarial settings.

Per-query calibration.

While RSRank’s zero-threshold design provides better cross-dataset stability than baselines (Secs. 2 and 6.3), query-level calibration remains imperfect. On HotpotQA, the gap between F1 at the zero threshold and F1 at the per-query optimal threshold indicates that many individual queries would benefit from a query-specific threshold. Improving per-query calibration is an important direction for future work.

Ethical Considerations

We used AI assistants for support tasks such as improving writing clarity, grammar, and formatting. All technical content, experimental design, analyses, and conclusions were developed and verified by the authors.

References

M. Ait-Saada and M. Nadif (2023) Is anisotropy truly harmful? A case study on text clustering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 1194–1203.
Amazon Web Services (2024) Amazon Bedrock now supports Rerank API to improve accuracy of RAG applications. Accessed: 2025-01-26.
P. BehnamGhader, V. Adlakha, F. D. Schmidt, N. Chapados, M. Mosbach, and S. Reddy (2026) LLM2Vec-gen: generative embeddings from large language models. arXiv:2603.10913.
V. Boteva, D. Gholipour, A. Sokolov, and S. Riezler (2016) A full-text learning to rank dataset for medical information retrieval. In Proceedings of the 38th European Conference on Information Retrieval (ECIR), Padova, Italy.
J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu (2024) M3-embedding: multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 2318–2335.
S. Chen, B. J. Gutierrez, and Y. Su (2025a) Attention in large language models yields efficient zero-shot re-rankers. In The Thirteenth International Conference on Learning Representations.
Y. Chen, Q. Liu, Y. Zhang, W. Sun, X. Ma, W. Yang, D. Shi, J. Mao, and D. Yin (2025b) TourRank: utilizing large language models for documents ranking with a tournament-inspired strategy. In Proceedings of the ACM Web Conference 2025, WWW ’25.
N. Godey, É. Clergerie, and B. Sagot (2024) Anisotropy is inherent to self-attention in transformers. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Y. Graham and M. Purver (Eds.), St. Julian’s, Malta, pp. 35–48.
Google Cloud (2025) Boost your search and RAG agents with Vertex AI’s new state-of-the-art Ranking API. Accessed: 2025-01-26.
X. Ho, A. Duong Nguyen, S. Sugawara, and A. Aizawa (2020) Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, D. Scott, N. Bel, and C. Zong (Eds.), Barcelona, Spain (Online), pp. 6609–6625.
JinaAI (2025) Jinaai/jina-reranker-v2-base-multilingual · hugging face. Note: [Online; accessed 2026-02-09] External Links: Link Cited by: §5.1.
O. Khattab and M. Zaharia (2020) ColBERT: efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, New York, NY, USA, pp. 39–48. External Links: ISBN 9781450380164, Link, Document Cited by: §1.1, §3.
S. Kudrjashov, O. Karpik, and E. Klyshinsky (2025) Shrink the longest: improving latent space isotropy with simplicial geometry. In Analysis of Images, Social Networks and Texts, A. Panchenko, D. Gubanov, M. Khachay, A. Kutuzov, N. Loukachevitch, A. Kuznetsov, I. Nikishina, M. Panov, P. M. Pardalos, A. V. Savchenko, E. Tsymbalov, E. Tutubalina, A. Kasieva, and D. I. Ignatov (Eds.), Cham, pp. 120–130. External Links: ISBN 978-3-031-88036-0 Cited by: §3.
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA. External Links: ISBN 9781713829546 Cited by: §1.2.
M. Li, X. Zhang, J. Xin, H. Zhang, and J. Lin (2022) Certified error control of candidate set pruning for two-stage relevance ranking. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang (Eds.), Abu Dhabi, United Arab Emirates, pp. 333–345. External Links: Link, Document Cited by: §3.
N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024) Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, pp. 157–173. External Links: Link, Document Cited by: §1.2.
S. Liu, F. Xiao, W. Ou, and L. Si (2017) Cascade ranking for operational e-commerce search. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, New York, NY, USA, pp. 1557–1565. External Links: ISBN 9781450348874, Link, Document Cited by: §1.1.
Y. Luan, J. Eisenstein, K. Toutanova, and M. Collins (2021) Sparse, dense, and attentional representations for text retrieval. Transactions of the Association for Computational Linguistics 9, pp. 329–345. External Links: Link, Document Cited by: §1.1.
X. Ma, L. Wang, N. Yang, F. Wei, and J. Lin (2024) Fine-tuning llama for multi-stage text retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2421–2425. External Links: Link Cited by: §3, §3.
M. Maia, S. Handschuh, A. Freitas, B. Davis, R. McDermott, M. Zarrouk, and A. Balahur (2018) WWW’18 open challenge: financial opinion mining and question answering. In Companion Proceedings of the The Web Conference 2018, WWW ’18, Republic and Canton of Geneva, CHE, pp. 1941–1942. External Links: ISBN 9781450356404, Link, Document Cited by: §5.2.
S. Meng, J. Liu, Y. Chen, S. Mao, P. Cai, G. Yan, B. Shi, and D. Wang (2025) From ranking to selection: a simple but efficient dynamic passage selector for retrieval augmented generation. CoRR abs/2508.09497. External Links: Link Cited by: §3.
Microsoft Research (2021) The science behind semantic search: how AI from Bing is powering Azure Cognitive Search. External Links: Link Cited by: §1.1.
R. Nogueira and K. Cho (2020) Passage re-ranking with bert. External Links: 1901.04085, Link Cited by: §1.1.
R. Nogueira, Z. Jiang, R. Pradeep, and J. Lin (2020) Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 708–718. External Links: Link, Document Cited by: §1.1, §3, §3, §5.1.
A. Phukan, S. Somasundaram, A. Saxena, K. Goswami, and B. V. Srinivasan (2024) Peering into the mind of language models: an approach for attribution in contextual question answering. In Findings of the Association for Computational Linguistics: ACL 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 11481–11495. External Links: Link, Document Cited by: §3.
Pinecone (2024a) Introducing reranking to pinecone inference. Note: https://www.pinecone.io/blog/introducing-reranking-to-pinecone-inference/ Accessed: 2026-02-06 Cited by: §1.2.
Pinecone (2024b) Rerankers and two-stage retrieval. Note: https://pinecone.io/learn/series/rag/rerankers Accessed: 2025-01-26 Cited by: §1.2.
J. C. Platt (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), pp. 61–74. Cited by: §3.
P. Posokhov, S. Masliukhin, S. Stepan, D. Tirskikh, and O. Makhnytkina (2025) Relevance scores calibration for ranked list truncation via TMP adapter. In Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 7728–7734. External Links: Link, Document, ISBN 979-8-89176-256-5 Cited by: §3.
R. Pradeep, R. F. Nogueira, and J. Lin (2021) The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models. CoRR abs/2101.05667. External Links: Link Cited by: §3.
R. Pradeep, S. Sharifymoghaddam, and J. Lin (2023) RankZephyr: effective and robust zero-shot listwise reranking is a breeze!. arXiv preprint arXiv:2312.02724. Cited by: §3, §3.
A. Razzhigaev, M. Mikhalchuk, E. Goncharova, I. Oseledets, D. Dimitrov, and A. Kuznetsov (2024) The shape of learning: anisotropy and intrinsic dimensions in transformer-based models. In Findings of the Association for Computational Linguistics: EACL 2024, Y. Graham and M. Purver (Eds.), St. Julian’s, Malta, pp. 868–874. External Links: Link Cited by: §3.
R. Ren, Y. Wang, K. Zhou, W. X. Zhao, W. Wang, J. Liu, J. Wen, and T. Chua (2025) Self-calibrated listwise reranking with large language models. In Proceedings of the ACM on Web Conference 2025, WWW ’25, New York, NY, USA, pp. 3692–3701. External Links: ISBN 9798400712746, Link, Document Cited by: §3.
S. Robertson and H. Zaragoza (2009) The probabilistic relevance framework: bm25 and beyond. Found. Trends Inf. Retr. 3 (4), pp. 333–389. External Links: ISSN 1554-0669, Link, Document Cited by: §1.1.
O. Skean, M. R. Arefin, D. Zhao, N. N. Patel, J. Naghiyev, Y. LeCun, and R. Shwartz-Ziv (2025) Layer by layer: uncovering hidden representations in language models. In Forty-second International Conference on Machine Learning, External Links: Link Cited by: §3, §3.
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu (2024) RoFormer: enhanced transformer with rotary position embedding. Neurocomput. 568 (C). External Links: ISSN 0925-2312, Link, Document Cited by: §4.1.
J. Sun, P. Jiang, S. Wang, J. Fan, H. Wang, S. Ouyang, M. Zhong, Y. Jiao, C. Huang, X. Xu, P. Han, P. Li, J. Huang, G. Liu, H. Ji, and J. Han (2026) Rethinking the reranker: boundary-aware evidence selection for robust retrieval-augmented generation. External Links: 2602.03689, Link Cited by: §3.
W. Sun, L. Yan, X. Ma, S. Wang, P. Ren, Z. Chen, D. Yin, and Z. Ren (2023) Is ChatGPT good at search? investigating large language models as re-ranking agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali (Eds.), Singapore, pp. 14918–14937. External Links: Link, Document Cited by: §3, §3.
Qwen Team (2025) Qwen3 technical report. External Links: 2505.09388, Link Cited by: §5.1.
J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), M. Walker, H. Ji, and A. Stent (Eds.), New Orleans, Louisiana, pp. 809–819. External Links: Link, Document Cited by: §5.2.
H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal (2022) MuSiQue: multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics 10, pp. 539–554. External Links: Link, Document Cited by: §5.2.
L. Wang, J. Lin, and D. Metzler (2011) A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, New York, NY, USA, pp. 105–114. External Links: ISBN 9781450307574, Link, Document Cited by: §1.1.
Weaviate (2024) Cohere reranker models with Weaviate. Note: https://weaviate.io/developers/weaviate/model-providers/cohere/reranker Accessed: 2025-01-26 Cited by: §1.2.
S. Wu, J. Xie, J. Chen, T. Zhu, K. Zhang, and Y. Xiao (2024) How easily do irrelevant inputs skew the responses of large language models?. In First Conference on Language Modeling, External Links: Link Cited by: §1.2.
Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), Brussels, Belgium, pp. 2369–2380. External Links: Link, Document Cited by: §5.2.
J. Yoon and L. Sael (2025) Document re-ranking with evidential neural networks. IEEE Access 13, pp. 161964–161972. External Links: Document Cited by: §3.
S. Yoon, G. Kim, G. Cho, and S. Hwang (2025) AcuRank: uncertainty-aware adaptive computation for listwise reranking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §3.
P. Yu, D. Cohen, H. Lamba, J. R. Tetreault, and A. Jaimes (2025) Explain then rank: scale calibration of neural rankers using natural language explanations from LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 22716–22730. External Links: Link, Document, ISBN 979-8-89176-256-5 Cited by: §3.
J. Zhang, D. Juan, C. Rashtchian, C. Ferng, H. Jiang, and Y. Chen (2024) SLED: self logits evolution decoding for improving factuality in large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §3.
Y. Zhang, M. Li, D. Long, X. Zhang, H. Lin, B. Yang, P. Xie, A. Yang, D. Liu, J. Lin, F. Huang, and J. Zhou (2025) Qwen3 embedding: advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176. Cited by: §1.1, §1.3, §2, §3, §3, §5.1.
S. Zhuang, H. Zhuang, B. Koopman, and G. Zuccon (2024) A setwise approach for effective and highly efficient zero-shot ranking with large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, New York, NY, USA, pp. 38–47. External Links: ISBN 9798400704314, Link, Document Cited by: §3.

Appendix A Discrete Difference Calculus on Prefix Space

This appendix provides a formal viewpoint for the feature construction in Sec. 4.1. The main paper uses the first-order, length-matched difference features (Equations (2)–(4)). The material below justifies terminology such as “finite difference” and clarifies how deltas behave under sequences of prefix edits.

A.1 Shift Operators on Prefixes

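Let $\mathcal{V}$ be the vocabulary and $\mathcal{V}^{*}$ the set of all finite token sequences. We view $\mathcal{V}^{*}$ as a rooted directed graph (a $|\mathcal{V}|$-ary tree) whose vertices are prefixes and whose edges correspond to appending a token: $p \to p \oplus v$ for any $p \in \mathcal{V}^{*}$ and $v \in \mathcal{V}$.

For a function $f : \mathcal{V}^{*} \to \mathbb{R}^{m}$, define the shift operator $S_v$ (append-$v$) by

$$(S_v f)(p) = f(p \oplus v). \tag{13}$$

The associated forward difference operator is

$$\Delta_v f = S_v f - f, \qquad (\Delta_v f)(p) = f(p \oplus v) - f(p). \tag{14}$$

Equation (14) is the standard discrete/finite-difference construction “difference = shift minus identity” in a non-numeric domain.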
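As a concrete illustration of the two operators, the sketch below implements them for functions on token tuples. The names `shift`, `forward_difference`, and the example function `distinct_count` are hypothetical and chosen only for this illustration; the example function returns a scalar for simplicity, whereas the features in the main paper are vector-valued.

```python
from typing import Callable, Tuple

Prefix = Tuple[str, ...]
PrefixFn = Callable[[Prefix], float]


def shift(f: PrefixFn, token: str) -> PrefixFn:
    """Shift operator: evaluate f on the prefix with `token` appended."""
    return lambda p: f(p + (token,))


def forward_difference(f: PrefixFn, token: str) -> PrefixFn:
    """Forward difference: shifted function minus the original function."""
    shifted = shift(f, token)
    return lambda p: shifted(p) - f(p)


def distinct_count(p: Prefix) -> float:
    """Hypothetical prefix function: number of distinct tokens in the prefix."""
    return float(len(set(p)))


prefix = ("a", "b")
# Appending a token already present leaves the count unchanged.
print(forward_difference(distinct_count, "a")(prefix))
# Appending a new token increases the count by one.
print(forward_difference(distinct_count, "c")(prefix))
```

The difference depends on both the appended token and the prefix it is appended to, which is why the operator is indexed by the token and evaluated at a vertex of the prefix graph.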

A.2 Telescoping Identity

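Finite differences compose along a path in the prefix graph. Let $s = (s_1, \dots, s_n)$ be a suffix and $s_{<i} = (s_1, \dots, s_{i-1})$. Then for any $p \in \mathcal{V}^{*}$,

$$f(p \oplus s) - f(p) = \sum_{i=1}^{n} (\Delta_{s_i} f)(p \oplus s_{<i}). \tag{15}$$

Equation (15) is the discrete analogue of the fundamental theorem of calculus: the total change equals the sum of incremental changes along a path.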
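For intuition, the identity can be written out for a two-token suffix. The notation here is assumed for the illustration: $\oplus$ denotes appending, $s_1, s_2$ are the two suffix tokens, and $\Delta_{v}$ is the forward difference for token $v$. The intermediate term $f(p \oplus s_1)$ is added and subtracted, which is the whole mechanism behind the telescoping sum.

```latex
\begin{align*}
f(p \oplus s_1 \oplus s_2) - f(p)
  &= \bigl[f(p \oplus s_1) - f(p)\bigr]
   + \bigl[f(p \oplus s_1 \oplus s_2) - f(p \oplus s_1)\bigr] \\
  &= (\Delta_{s_1} f)(p) + (\Delta_{s_2} f)(p \oplus s_1).
\end{align*}
```

Each bracket is a single-token difference evaluated at a different vertex of the path from $p$ to $p \oplus s_1 \oplus s_2$.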

A.3 Instantiation for Decoder-Only Transformers

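Fix layer $\ell$ and head $h$. For an input sequence $x$, let $\mathbf{v}^{(\ell,h)}_{t}(x)$ denote the pre-attention value vector at position $t$. For a fixed query $q$ and document $d$, define the document-conditioned value vector for query position $j$:

$$f^{(\ell,h)}_{j}(d) = \mathbf{v}^{(\ell,h)}_{|d|+j}(d \oplus q). \tag{16}$$

The main paper’s controlled finite difference is exactly the first-order difference between $d$ and a length-matched null prefix $n_d$ with $|n_d| = |d|$:

$$\delta^{(\ell,h)}_{j}(d) = f^{(\ell,h)}_{j}(d) - f^{(\ell,h)}_{j}(n_d), \tag{17}$$

and the representational shift tensor collects these deltas across $(\ell, h, j)$.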
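The construction can be sketched with a toy stand-in for a single head. Everything below is an assumption made for illustration: `value_vectors` replaces a real transformer layer with a causal running mean of fixed embeddings, `NULL_TOKEN` is a hypothetical filler token, and `representational_shift` is not the extraction code used in our experiments. The sketch only shows the length-matched subtraction at query positions.

```python
import math
from typing import List, Sequence

D = 4           # toy value dimension (hypothetical)
NULL_TOKEN = 0  # hypothetical filler token used to build the null prefix


def embed(token: int) -> List[float]:
    """Deterministic toy embedding of an integer token id."""
    return [math.sin((token + 1) * (j + 1)) for j in range(D)]


def value_vectors(tokens: Sequence[int]) -> List[List[float]]:
    """Toy per-position value vectors: causal running mean of embeddings.

    A real decoder-only model would apply a layer's value projection to
    hidden states from the preceding layers; only causality is mimicked here.
    """
    out: List[List[float]] = []
    running = [0.0] * D
    for t, tok in enumerate(tokens, start=1):
        running = [r + x for r, x in zip(running, embed(tok))]
        out.append([r / t for r in running])
    return out


def representational_shift(doc: Sequence[int], query: Sequence[int]) -> List[List[float]]:
    """Value vectors at query positions with the document minus with a null prefix."""
    null_prefix = [NULL_TOKEN] * len(doc)  # same length, so query positions line up
    with_doc = value_vectors(list(doc) + list(query))[len(doc):]
    with_null = value_vectors(null_prefix + list(query))[len(doc):]
    return [[a - b for a, b in zip(u, v)] for u, v in zip(with_doc, with_null)]


shift_rel = representational_shift(doc=[5, 7, 9], query=[2, 3])
print(len(shift_rel), len(shift_rel[0]))  # one delta vector per query position
```

Because the null prefix has the same length as the document, each query token sits at the same position in both runs, so the difference isolates the effect of the document content rather than of a change in sequence length.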

Appendix B Ablations

We conducted ablation studies on our techniques to identify the most effective variations and determine which configurations yield the best results.

B.1 Representation and Optimization Methods

Here we analyze alternative representation and optimization methods. Table 5 compares our learned approach on shifts against: (a) learning on raw representations without subtraction to see the effect of “shifting”, and (b) three closed-form projections on shifts to compare analytical methods against our learned optimization.

Shift vs. raw representations.

The shift is critical for calibration: while the direct approach achieves comparable R@5 (98.3 vs. 97.8), its F1@0 drops by 6.4pp because raw representations lack the centering that makes zero a natural threshold.

Closed-form baselines.

We consider three analytical solutions:

1. Oracle alignment: a projection computed in closed form from the mean shift.
2. Separation: solves a ridge regression with separate targets for relevant and irrelevant shifts.
3. Combined: adds an oracle constraint to the objective, yielding a covariance-weighted projection.

All three preserve reasonable ranking (R@5 up to 90.8) but fail at calibration (F1@0 of at most 30.8). These methods optimize alignment and separation targets in projection space, but scoring uses cosine similarity, which introduces a non-linear normalization. As a result, pushing irrelevant projections toward zero does not yield negative scores; it yields near-zero-norm vectors with noisy cosine similarities. Our learned objective optimizes directly on the scores (cosine similarities), explicitly pushing relevant scores above zero and irrelevant scores below zero with a margin.
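The normalization effect can be checked numerically. The sketch below is a toy demonstration with a hypothetical three-dimensional `anchor` direction and synthetic noise, not our scoring code: it shows that cosine similarity ignores the norm of a vector, so shrinking a projection does not move its score toward a negative value, and that tiny-norm noise vectors receive scores whose sign is close to a coin flip.

```python
import math
import random
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


anchor = [1.0, 0.0, 0.0]  # hypothetical reference direction

# Scale invariance: shrinking a vector by 1e-9 leaves its score unchanged.
big = cosine([2.0, 1.0, 0.0], anchor)
tiny = cosine([2e-9, 1e-9, 0.0], anchor)

# Projections regressed toward zero: only small residual noise remains,
# and the sign of the resulting score is essentially arbitrary.
random.seed(0)
noise_scores = [
    cosine([random.gauss(0.0, 1e-6) for _ in range(3)], anchor)
    for _ in range(1000)
]
frac_positive = sum(s > 0 for s in noise_scores) / len(noise_scores)
print(big, tiny, frac_positive)
```

A zero threshold on such scores therefore cannot separate irrelevant documents, which is consistent with optimizing the scores themselves rather than the projections.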

Table 5: Shift vs. direct representations on 2WikiMQA. The shift is essential for threshold calibration (F1@0).

Method	R@5	F1@0
Learned on shifts (ours)	97.8	82.2
Learned on raw repr. (no shift)	98.3	75.8
Closed-form oracle	75.3	17.9
Closed-form separation	90.8	23.1
Closed-form combined	86.8	30.8

B.2 Choice of Internal Signal

We compare three internal LLM signals: value vectors (our choice), key vectors, and hidden states. Table 6 shows that value vectors achieve the best performance, closely followed by key vectors. Hidden states perform worst, suggesting that the head-level decomposition provides useful structure for relevance detection.

Table 6: Extraction type ablation on 2WikiMQA. Value vectors provide the best signal.

Signal	R@5	F1@5
Value vectors (ours)	98.3	66.3
Key vectors	97.8	66.0
Hidden states	96.5	65.0

B.3 Loss Components Ablation

Table 7 shows the effect of removing each loss component. The calibration loss is the most critical: removing it drops F1@0 by 8.3pp, as the model loses its ability to anchor the decision boundary at zero. Removing the margin loss causes a moderate 1.9pp F1@0 drop by weakening class separation. Orthogonality regularization and oracle alignment have minimal individual impact.

Table 7: Loss component ablation on 2WikiMQA validation set. Δ is relative to the full model.

Configuration	R@5	F1@0	ΔR@5	ΔF1@0
Full model	94.0	80.2	-	-
w/o Calibration loss	88.2	71.9	−5.8	−8.3
w/o Margin loss	92.9	78.3	−1.1	−1.9
w/o Ortho. reg.	93.4	79.9	−0.6	−0.3
w/o Oracle alignment	93.3	80.0	−0.7	−0.2
w/o Norm constraint	93.6	79.4	−0.4	−0.8
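To make the role of a margin around zero concrete, the sketch below uses a symmetric hinge on the scores. This is an assumed form chosen for illustration, not the exact loss used in our training; the function name `margin_loss` and the margin value 0.1 are hypothetical.

```python
from typing import Sequence


def margin_loss(scores: Sequence[float], labels: Sequence[int], margin: float = 0.1) -> float:
    """Mean hinge penalty: relevant scores should exceed +margin,
    irrelevant scores should fall below -margin (assumed form)."""
    total = 0.0
    for s, y in zip(scores, labels):
        sign = 1.0 if y == 1 else -1.0
        total += max(0.0, margin - sign * s)
    return total / len(scores)


# Scores on the correct side of zero by more than the margin incur no penalty.
print(margin_loss([0.5, -0.5], [1, 0]))
# A relevant score inside the margin and an irrelevant score above zero are both penalized.
print(margin_loss([0.05, 0.2], [1, 0]))
```

Under this form the penalty vanishes only when both classes are separated by a band around zero, which is the behavior the zero-threshold design relies on.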

B.4 Training Details

Table 8: Training hyperparameters used across all experiments.

Hyperparameter	Value
Optimizer	AdamW
Initial learning rate	
Learning rate schedule	Cosine annealing
Final learning rate	
Weight decay	
Training precision	Mixed precision
Maximum epochs	
Early stopping patience	
Margin	
Projection dimension	
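For reference, a cosine-annealed learning rate between an initial and a final value can be written as below. This is the standard cosine annealing formula without restarts; the function name and the numeric values in the usage lines are placeholders for illustration and are not the settings used in our experiments.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int, lr_init: float, lr_final: float) -> float:
    """Learning rate at `step`, decaying from lr_init to lr_final along a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))


# Placeholder values, chosen only to show the shape of the schedule.
LR_INIT, LR_FINAL, TOTAL = 1e-3, 1e-5, 100
schedule = [cosine_annealed_lr(t, TOTAL, LR_INIT, LR_FINAL) for t in range(TOTAL + 1)]
print(schedule[0], schedule[TOTAL // 2], schedule[-1])
```

The schedule starts at the initial rate, passes through the midpoint of the two rates halfway through training, and ends at the final rate.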

URL: https://arxiv.org/html/2606.17468v1