· Sovont · 3 min read

Reranking Is Not Optional

RAG & Knowledge Systems

Your retrieval pipeline returns 20 chunks. Your LLM sees 5. What happens in the gap between those two numbers is either thoughtful engineering or a coin flip.

Most teams ship the coin flip.

The Dirty Secret of Top-K Retrieval

Vector search gives you semantic similarity scores. They are not relevance scores. They are not quality scores. They are distance metrics in a high-dimensional embedding space, and they correlate with relevance in ways that are useful but not reliable.

When you return the top 5 chunks by cosine similarity and hand them directly to the LLM, you’re betting that the embedding model and your chunking strategy together produce a ranking that actually serves the user’s query. Sometimes they do. Often they don’t.

The failure modes:

  • Redundancy. Chunks 1, 2, and 3 all say the same thing with slightly different wording. With five chunks of similar size, you just spent 60% of your context budget on one idea.
  • Off-topic precision. The chunk is semantically close to the query but answers a subtly different question. The LLM confidently uses it anyway.
  • Recency blindness. Your vector index doesn’t know that the document from 2022 is superseded by the one from 2025. Both score similarly. Only one is correct.
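
The redundancy failure is the easiest of the three to measure, and you can do it before adding any new model. Here is a minimal sketch, assuming you already hold the embedding vectors of your top chunks as plain lists of floats. The names `cosine` and `redundant_pairs` and the 0.95 threshold are illustrative choices, not part of any library.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def redundant_pairs(embeddings, threshold=0.95):
    """Return index pairs of chunks whose embeddings are near-duplicates.

    The threshold is an assumed starting point; tune it on your own data.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(embeddings), 2)
        if cosine(a, b) >= threshold
    ]


# Toy vectors: the first three point almost the same way.
top_chunks = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.98, 0.0, 0.08],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
pairs = redundant_pairs(top_chunks)
```

If most of your top 5 show up in `pairs` for typical queries, the LLM is reading the same idea several times over.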

What a Reranker Actually Does

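A cross-encoder reranker takes each candidate chunk and the original query together, computes a joint relevance score, and re-orders the list. Unlike a bi-encoder (which embeds query and document separately), a cross-encoder reads both in one pass, so it can model the relationship between them directly. The tradeoff is that nothing can be precomputed: every query-chunk pair needs its own model call, which is why you rerank 20 candidates instead of your whole index.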

The result: a ranked list that reflects actual relevance to this specific query, not just geometric proximity in embedding space.
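
The re-ordering step itself is small. The sketch below assumes a hypothetical `score_pair` callable that stands in for the cross-encoder. To keep the example runnable without a model download, it uses a toy word-overlap scorer, which is not a cross-encoder and only shows the shape of the interface. In a real pipeline you would plug in a model that scores (query, chunk) pairs, such as a Sentence Transformers `CrossEncoder`.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    chunks: List[str],
    score_pair: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every (query, chunk) pair jointly and sort best-first."""
    scored = [(chunk, score_pair(query, chunk)) for chunk in chunks]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_overlap_score(query: str, chunk: str) -> float:
    """Placeholder scorer: fraction of query words found in the chunk.

    This is NOT a cross-encoder; it only stands in for one so the
    example runs without downloading a model.
    """
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words)


candidates = [
    "Our refund policy changed in 2022.",
    "To reset your password open account settings.",
    "Password rules require twelve characters.",
]
ranked = rerank("how do I reset my password", candidates, toy_overlap_score)
best_chunk, best_score = ranked[0]
```

The point of the signature is that the scorer sees the query and the chunk at the same time, which a precomputed embedding never does.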

Tools like Cohere Rerank, Jina Reranker, or open-source cross-encoders from Sentence Transformers will add latency, usually 50–200ms depending on candidate count and model size. That latency is almost always worth it.

Where Teams Skip It (And Pay Later)

Reranking gets cut when teams are optimizing for demo speed. The retrieval pipeline is fast, the answers look plausible, and nobody’s measuring whether the right chunks are actually being used.

Then the product ships, edge cases accumulate, and support tickets start coming in for questions that have good answers in your knowledge base — answers that kept getting ranked 8th.

The fix is cheap. The neglect isn’t.

Do It Right

The minimum viable setup:

  1. Retrieve top 20 candidates by vector similarity
  2. Rerank with a cross-encoder against the original query
  3. Pass top 5 (or top N by token budget) to the LLM
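
The three steps above can be sketched as one function. This is a minimal sketch under stated assumptions: `candidates` is whatever your vector store already returned for step 1, and `score_pair` and `count_tokens` are hypothetical callables you would back with a real cross-encoder and your model's tokenizer. The toy scorer and word counter are only there so the example runs on its own.

```python
from typing import Callable, List, Sequence


def select_context(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],
    count_tokens: Callable[[str], int],
    max_chunks: int = 5,
    token_budget: int = 2000,
) -> List[str]:
    """Rerank retrieved candidates, then keep the best ones that fit."""
    # Step 2: re-order the candidates by joint (query, chunk) score.
    ranked = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)

    # Step 3: take chunks in rank order until either limit is reached.
    selected: List[str] = []
    used = 0
    for chunk in ranked:
        if len(selected) == max_chunks:
            break
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += cost
    return selected


def toy_score(query: str, chunk: str) -> float:
    """Stand-in scorer (shared word count), not a real cross-encoder."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def word_count(text: str) -> int:
    """Stand-in token counter; a real tokenizer will give different numbers."""
    return len(text.split())


# Step 1 is assumed done: 20 candidates, with the useful ones buried at 8th and 13th.
retrieved = [f"unrelated note number {i}" for i in range(20)]
retrieved[7] = "rotate the api key from the dashboard"
retrieved[12] = "an api key expires after ninety days"

context = select_context("rotate api key", retrieved, toy_score, word_count)
```

Stopping at the first chunk that overflows the budget keeps strict rank order. Skipping it and trying shorter, lower-ranked chunks is an equally reasonable choice if your chunk sizes vary a lot.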

If you want to go further: filter by metadata before retrieval, use hybrid search (sparse + dense) before reranking, and evaluate each stage separately. But start with the reranker. It’s the highest-leverage improvement most RAG pipelines are missing.
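
Evaluating each stage separately can start with a handful of labeled queries. The sketch below assumes you have, for each test query, the IDs of chunks known to be relevant, plus the ranked IDs coming out of retrieval and out of the reranker. The chunk IDs and function names are hypothetical.

```python
from typing import Optional, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the first k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def first_relevant_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> Optional[int]:
    """1-based position of the first relevant chunk, or None if absent."""
    for position, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return position
    return None


# Hypothetical labeled example: one query, one known-good chunk.
relevant = {"kb-42"}
retrieved = ["kb-7", "kb-3", "kb-19", "kb-8", "kb-1", "kb-5", "kb-11", "kb-42"]
reranked = ["kb-42", "kb-3", "kb-7", "kb-19", "kb-8", "kb-1", "kb-5", "kb-11"]

found_by_retrieval = recall_at_k(retrieved, relevant, k=8)   # did retrieval find it at all?
seen_without_rerank = recall_at_k(retrieved, relevant, k=5)  # would the LLM have seen it?
seen_with_rerank = recall_at_k(reranked, relevant, k=5)
rank_before = first_relevant_rank(retrieved, relevant)
rank_after = first_relevant_rank(reranked, relevant)
```

Low recall over the full candidate list is a retrieval problem. Good recall there but poor recall in the top 5 is a ranking problem, and that is the one a reranker fixes.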

Your retrieval step finds the candidates. Your reranker picks the winners. If you’re skipping step two, you’re not doing RAG — you’re doing search with extra steps.