Reranking & Hybrid Retrieval

Interactive

Learn why two-stage retrieval and keyword+vector fusion improve relevance in real-world RAG systems.

Try the interactive tools (1)
Difficulty intermediate
Read time 10 min
reranking hybrid-retrieval bm25 vector-search rag cross-encoder retrieval
Updated February 27, 2026

What Is Reranking & Hybrid Retrieval?

Think of retrieval like hiring from a huge pile of resumes.

  • First pass: a recruiter quickly scans thousands of resumes and picks a short list that looks promising.
  • Second pass: a hiring manager reads only that short list carefully and picks the true top candidates.

That is the core idea of two-stage retrieval.

In many AI systems, the first pass is fast search over a large corpus. It often gets you “pretty relevant” results. But when your question is nuanced, “pretty relevant” is not enough. You need a second pass that understands the query-document pair at a deeper level. That second pass is reranking.

Now add hybrid retrieval: instead of relying on only keyword search or only vector search, you combine both. Keyword search (like BM25) is great for exact terms, product codes, and rare names. Vector search is great for semantic similarity and paraphrases. Together, they usually outperform either one alone.

Technical definition:

  • Hybrid retrieval combines lexical and semantic retrieval signals to produce a better candidate set.
  • Reranking applies a stronger relevance model to that candidate set and reorders it so the final top-k is actually useful.
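As a minimal sketch of why the two channels complement each other, the toy scoring below uses invented documents, simple token overlap as a stand-in for BM25, and hand-assigned "topic vectors" as a stand-in for learned embeddings. Lexical overlap rewards exact token matches, while the semantic score still finds a document that shares no query tokens at all:

```python
import math

docs = {
    "rotation-policy": "api key rotation policy and grace period",
    "rollover-playbook": "credential rollover playbook for tokens",
}

def lexical_score(query, text):
    """Count query tokens that appear verbatim in the document."""
    return len(set(query.split()) & set(text.split()))

# Hypothetical 2-d topic vectors standing in for learned embeddings:
# axes are roughly (key rotation, credential lifecycle).
vectors = {
    "query": [0.9, 0.8],
    "rotation-policy": [1.0, 0.2],
    "rollover-playbook": [0.1, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

query = "api key rotation"
for doc_id, text in docs.items():
    print(doc_id,
          "lexical:", lexical_score(query, text),
          "semantic: %.2f" % cosine(vectors["query"], vectors[doc_id]))
```

The rollover playbook scores zero lexically but stays semantically close to the query, which is exactly the gap hybrid retrieval closes.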

Why Does It Matter?

If retrieval is weak, your assistant fails even when the generation model is strong. RAG systems do not hallucinate only because of model weakness; they also hallucinate when retrieval supplies weak context.

Reranking and hybrid retrieval matter because they improve:

  • Answer quality: Better top documents mean better-grounded answers.
  • Precision at top-k: The first few chunks matter most because context windows are limited.
  • Robustness: Hybrid methods handle both exact lookups and fuzzy semantic questions.
  • Trust: Users see fewer wrong citations and fewer “almost right” responses.

In production, this often has a measurable effect on business metrics: higher answer acceptance, fewer support escalations, and less manual correction.

How It Works

A practical pipeline is:

  1. Build two retrieval channels

    • Lexical index (BM25 or equivalent) over tokenized text.
    • Vector index over embeddings.
  2. Run both channels for each query

    • Lexical retrieval returns documents with exact token overlap.
    • Vector retrieval returns semantically similar chunks.
  3. Fuse the candidate lists

    A common method is Reciprocal Rank Fusion (RRF). Intuition: a document accumulates score from every list it appears in, so candidates that rank well in several lists rise to the top, while a strong position in a single list still earns credit.

    A simple version is:

    RRF(d) = sum_i 1 / (k + rank_i(d))

    where rank_i(d) is the rank of document d in list i, and k is a smoothing constant (often around 60).

  4. Take a wider candidate set

    For example, keep top 50 or top 100 fused candidates. This keeps recall high before expensive scoring.

  5. Rerank with a stronger model

    Use a cross-encoder or instruction-tuned reranker that reads [query, document] jointly and outputs a relevance score.

    • Bi-encoders (embedding search) are fast because query and docs are encoded separately.
    • Cross-encoders are slower but more accurate because they compare tokens across query and doc directly.
  6. Select final context

    Keep top N chunks after reranking, then apply context assembly rules (deduplicate, diversify sources, respect metadata filters).

  7. Pass to generation model

    Now the LLM receives fewer but stronger chunks.
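The fusion step above can be sketched in a few lines. This is a minimal RRF implementation over illustrative document IDs, not tied to any particular search library:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Combine ranked lists of doc IDs into one fused ranking via RRF."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative channel outputs (doc IDs are made up for this sketch).
bm25_top = ["rotation-policy", "migration-exception", "sandbox-onboarding"]
vector_top = ["migration-exception", "rotation-policy", "rollover-playbook"]

fused = rrf_fuse([bm25_top, vector_top])
print(fused)  # docs ranked in both lists float to the top
```

Documents appearing in both channels ("rotation-policy", "migration-exception") outscore documents seen by only one channel, which is the consensus effect RRF is designed to capture.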

Simple example

Query: “How do I rotate API keys without downtime?”

  • Keyword search finds docs with exact phrase “API key rotation” and “zero downtime”.
  • Vector search also finds “credential rollover” and “grace-period token migration” docs.
  • Fusion combines both sets.
  • Reranker pushes documents that specifically discuss migration sequence and rollback checks to the top.

Result: the final context is not just related to security; it is specifically relevant to the safe rotation procedure.
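To make the reranking step in this example concrete: a real system would use a trained cross-encoder, but the hand-weighted phrase features below (all invented for illustration) show how scoring the query-document pair jointly can prefer the procedure-specific document over one that merely mentions API keys:

```python
# Illustrative stand-in for a learned cross-encoder. The phrase list
# and weights are invented; a production reranker would be a trained model.
def rerank_score(query: str, doc: str) -> float:
    doc_l = doc.lower()
    q_tokens = set(query.lower().replace("?", "").split())
    d_tokens = set(doc_l.replace(":", " ").replace(",", " ").split())
    score = float(len(q_tokens & d_tokens))  # plain topical overlap
    # Reward phrases that signal the *procedure* the query asks about.
    for phrase, weight in [("migration sequence", 2.0),
                           ("rollback", 1.5),
                           ("zero downtime", 1.5)]:
        if phrase in doc_l:
            score += weight
    return score

query = "How do I rotate API keys without downtime?"
candidates = {
    "rotation-policy": "API key rotation: migration sequence, "
                       "rollback checks, zero downtime window",
    "sandbox-onboarding": "How to create an API key in the sandbox",
}
ranked = sorted(candidates,
                key=lambda d: rerank_score(query, candidates[d]),
                reverse=True)
print(ranked[0])  # the procedure-specific doc wins
```

Both documents overlap the query about equally on raw tokens; only joint scoring of the pair surfaces the one that actually answers the question.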

Key Terminology

  • Lexical retrieval (BM25): Search that relies on token overlap and term statistics.
  • Vector retrieval: Search in embedding space for semantically similar content.
  • Hybrid retrieval: Combining lexical and vector signals in one retrieval stack.
  • Reranker: A stronger model that reorders candidates by deeper relevance.
  • RRF (Reciprocal Rank Fusion): Rank-based fusion method that combines multiple ranked lists.

Real-World Applications

  • Customer support copilots: Blend exact policy lookups with semantic FAQ matching, then rerank for issue-specific passages.
  • Enterprise search: Combine strict keyword constraints (legal terms, IDs) with semantic retrieval across wiki pages.
  • E-commerce search: Match product codes exactly while still understanding intent like “lightweight trail shoes for wet weather.”
  • Code assistants: Retrieve by symbol names and function signatures, then rerank by actual implementation relevance.

Common Misconceptions

  1. “Vector search makes keyword search obsolete.” Not true. Exact matching remains critical for identifiers, formulas, product SKUs, and compliance language.

  2. “Reranking is optional polish.” In many systems it is a major quality lever, especially when the top few retrieved chunks determine answer quality.

  3. “Hybrid retrieval is always too slow.” With practical candidate limits and batched reranking, latency is often acceptable for significant relevance gains.

Further Reading

  • Cormack, Clarke, and Buettcher (2009), Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.
  • Robertson and Zaragoza (2009), The Probabilistic Relevance Framework: BM25 and Beyond.
  • Nogueira and Cho (2019), Passage Re-ranking with BERT.

Read the article first

These tools reinforce the concepts above — you'll get more out of them after reading through the article.

Interactive: Hybrid Retrieval Workbench

Blend lexical and semantic retrieval, inspect candidate fusion, then apply reranking to compare quality, latency, and rerank token cost.

Guided Scenario

Need the production policy window and safe migration steps for rotating API keys.

Goal: Balance exact policy terms with operational migration guidance in one query.

BM25 / vector blend: 50 / 50

Left favors BM25. Right favors vector similarity.

Candidate pool size: 8

Pool is the merged set before reranking.

Final top-k: 3

Affects Recall@k and nDCG@k directly.

Reranker: ON

Uses transparent feature scoring over fused candidates.

Recall@3: 75% (3 of 4 relevant docs found.)

nDCG@3: 1.000 (captures ranking quality using graded relevance, 0-3.)

Estimated latency: 105 ms (lexical + vector + fusion + optional rerank cost model.)

Rerank token cost: 358 tok (query + candidate tokens scored by reranker.)

Channel Retrieval

BM25 Top 5

  • #1 (bm25 15.47): Production API Key Rotation Policy [Relevant]
  • #2 (bm25 4.94): Customer Migration Exception Procedure [Relevant]
  • #3 (bm25 4.63): Sandbox API Key Onboarding [Distractor]
  • #4 (bm25 3.13): Rotation Audit Evidence Checklist [Partial]
  • #5 (bm25 0.00): Credential Rollover Playbook [Relevant]

Vector Top 5

  • #1 (vec 0.99): Customer Migration Exception Procedure [Relevant]
  • #2 (vec 0.97): Production API Key Rotation Policy [Relevant]
  • #3 (vec 0.97): Auth Incident Triage Runbook [Distractor]
  • #4 (vec 0.97): Credential Rollover Playbook [Relevant]
  • #5 (vec 0.96): Token Lifecycle Standard [Distractor]

Hybrid Fusion

Union from both channels, ranked by fusion score: 0.50*bm25Norm + 0.50*vectorNorm

  • #1 (hybrid 0.993): Production API Key Rotation Policy [BM25 + Vector, Relevant]
  • #2 (hybrid 0.660): Customer Migration Exception Procedure [BM25 + Vector, Relevant]
  • #3 (hybrid 0.583): Rotation Audit Evidence Checklist [BM25 + Vector, Partial]
  • #4 (hybrid 0.494): Sandbox API Key Onboarding [BM25 + Vector, Distractor]
  • #5 (hybrid 0.492): Auth Incident Triage Runbook [BM25 + Vector, Distractor]
  • #6 (hybrid 0.490): Credential Rollover Playbook [BM25 + Vector, Relevant]
  • #7 (hybrid 0.485): Token Lifecycle Standard [BM25 + Vector, Distractor]
  • #8 (hybrid 0.259): Legacy Freeze Communication [Vector only, Distractor]

Reranked Final

Final order uses proxy reranker score.

  • #1 (score 0.856): Production API Key Rotation Policy [Relevant, lift -0.14]
    intent 0.36 | exact 1.00 | metadata 1.00 | contradiction -0.00
  • #2 (score 0.575): Credential Rollover Playbook [Relevant, lift +0.08]
    intent 0.18 | exact 1.00 | metadata 1.00 | contradiction -0.00
  • #3 (score 0.479): Customer Migration Exception Procedure [Relevant, lift -0.18]
    intent 0.27 | exact 0.00 | metadata 1.00 | contradiction -0.00

Comparison modes

Mode              Recall@3   nDCG@3   Latency   Rerank cost
BM25 only         50%        0.856    28 ms     0 tok
Vector only       50%        0.714    35 ms     0 tok
Hybrid            75%        0.904    69 ms     0 tok
Hybrid + rerank   75%        1.000    105 ms    358 tok
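Recall@k and nDCG@k from the table can be computed in a few lines. The relevance grades below are hypothetical, and the ideal ranking is taken over the retrieved list only (a common simplification), so these numbers illustrate the formulas rather than reproduce the workbench's exact grading:

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def dcg_at_k(grades, k):
    """Discounted cumulative gain over graded relevance scores."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_k(grades, k):
    """DCG normalized by the ideal ordering of the same grades."""
    ideal = sorted(grades, reverse=True)
    return dcg_at_k(grades, k) / dcg_at_k(ideal, k)

# Hypothetical ranked output and judgment set (4 relevant docs total).
ranked = ["policy", "playbook", "migration", "audit"]
relevant = {"policy", "playbook", "migration", "checklist"}
print(recall_at_k(ranked, relevant, 3))   # 0.75 -> 3 of 4 found

grades = [3, 3, 3, 1]  # graded relevance (0-3) of the ranked list
print(round(ndcg_at_k(grades, 3), 3))     # 1.0: top-3 already ideal
```

This also shows why a perfect nDCG@3 of 1.000 (as in the Hybrid + rerank row) simply means the top 3 are in the best achievable order, not that every relevant document was retrieved.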

Interpretation

  • BM25 and vector recall are tied here; ranking quality will depend mostly on fusion and reranking.
  • Reranking improved nDCG@3 from 0.904 to 1.000 (9.6 points), but added 36 ms of latency.
  • A larger candidate pool improves recall headroom but increases rerank compute and token cost.
  • The blend is balanced between lexical precision and semantic coverage.
