The Query Rewriter You're Not Using
Most RAG systems retrieve against the user's raw query. That's the problem. Query rewriting is the highest-leverage improvement most teams skip entirely.
Most RAG pipelines retrieve against whatever the user typed. Word for word, character for character, straight into the vector store.
That’s a mistake. Not a fatal one — just a quietly expensive one that compounds every time a user phrases something slightly differently from how your documents were written.
The gap between how users ask and how documents answer is real.
A user asks: “How do I cancel my subscription?” Your documentation says: “Account termination and offboarding.” Vector similarity will partially bridge that gap. But partially isn’t good enough when your retrieval quality directly determines whether the answer is correct, hallucinated, or uselessly vague.
The raw query is optimized for conversation, not retrieval. It often contains filler (“how do I”, “what’s the best way to”), implicit context the user carries in their head, and natural language phrasing that doesn’t match technical documentation language. You’re asking your embedding model to paper over a fundamental mismatch.
Query rewriting closes that gap before retrieval even starts.
What query rewriting actually looks like:
The simplest version: pass the user’s query to an LLM with a prompt that says “rewrite this as a standalone, information-dense search query.” Strip the conversational wrapper. Make the intent explicit. Add relevant terms that might appear in documents.
“How do I cancel my subscription?” becomes “subscription cancellation account termination steps.” That tends to retrieve better, because the rewrite now shares vocabulary with the documentation instead of with the conversation.
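A minimal sketch of that rewrite step, assuming your model is reachable through a plain `llm(prompt) -> str` callable. The prompt wording, the `rewrite_query` name, and the `fake_llm` stand-in are all illustrative, not any particular library's API.

```python
from typing import Callable

# ASSUMPTION: prompt wording is illustrative; tune it for your own documents.
REWRITE_PROMPT = (
    "Rewrite the user's question as a standalone, information-dense search "
    "query. Drop conversational filler, make the intent explicit, and add "
    "terms likely to appear in the documentation. Return only the query.\n\n"
    "Question: {question}\nSearch query:"
)


def rewrite_query(question: str, llm: Callable[[str], str]) -> str:
    """Return a retrieval-oriented rewrite, falling back to the raw question."""
    rewritten = llm(REWRITE_PROMPT.format(question=question)).strip()
    # If the model returns nothing usable, retrieve on the original text.
    return rewritten or question


# Stand-in for a real model call so the sketch runs without any API.
def fake_llm(prompt: str) -> str:
    return "subscription cancellation account termination steps"


print(rewrite_query("How do I cancel my subscription?", fake_llm))
```

The fallback matters: if the rewrite call fails or returns an empty string, you want to degrade to the raw query, not to no retrieval at all.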
But you can go further:
Multi-query expansion. Generate 3–5 distinct reformulations of the same query. Retrieve against all of them, deduplicate, rerank. This covers different phrasings without needing perfect alignment on any single one.
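One way to sketch the retrieve, deduplicate, rerank loop. This version merges results with reciprocal rank fusion, which is a swapped-in choice for the "rerank" step and not the only option. The `search` callable, `multi_query_retrieve`, and the toy index are assumptions for illustration; in practice `search` would wrap your vector store and the reformulations would come from an LLM.

```python
from collections import defaultdict
from typing import Callable, Iterable


def multi_query_retrieve(
    queries: Iterable[str],
    search: Callable[[str], list[str]],
    top_k: int = 5,
    rrf_k: int = 60,
) -> list[str]:
    """Retrieve for every reformulation and merge with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for query in queries:
        for rank, doc_id in enumerate(search(query), start=1):
            # Accumulating per doc_id deduplicates hits across queries.
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ranked[:top_k]


# Toy stand-in for a vector store: each query maps to a ranked list of doc ids.
TOY_INDEX = {
    "how do i cancel my subscription": ["billing-faq", "offboarding"],
    "subscription cancellation steps": ["offboarding", "billing-faq"],
    "account termination and offboarding": ["offboarding", "data-export"],
}


def toy_search(query: str) -> list[str]:
    return TOY_INDEX.get(query.lower(), [])


results = multi_query_retrieve(TOY_INDEX.keys(), toy_search)
print(results)
```

A document that shows up near the top for several reformulations outranks one that wins a single phrasing, which is the behavior you want when no single query lines up perfectly.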
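HyDE (Hypothetical Document Embeddings). Generate a hypothetical document that would answer the question, then embed that document and retrieve against it. You’re searching document-space, not query-space. It sounds counterintuitive. It works because a plausible answer, even one with wrong details, tends to sit closer in embedding space to the real answers than a short question does. The generated text is only used for the search; it never reaches the user.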
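A toy sketch of the HyDE flow. To stay runnable without a model, it assumes a bag-of-words `toy_embed` in place of a real embedding model and a canned `fake_llm` in place of a real generation call; `hyde_retrieve`, `HYDE_PROMPT`, and `CORPUS` are hypothetical names for illustration.

```python
import math
from collections import Counter
from typing import Callable

# ASSUMPTION: prompt wording is illustrative only.
HYDE_PROMPT = "Write a short documentation passage that answers: {question}"


def toy_embed(text: str) -> dict[str, float]:
    # Toy bag-of-words vector; a real system would call an embedding model.
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_retrieve(
    question: str,
    llm: Callable[[str], str],
    embed: Callable[[str], dict[str, float]],
    corpus: dict[str, str],
    top_k: int = 3,
) -> list[str]:
    """Embed a generated answer instead of the question, then rank documents."""
    hypothetical = llm(HYDE_PROMPT.format(question=question))
    query_vec = embed(hypothetical)  # the hypothetical document, not the question
    ranked = sorted(
        corpus,
        key=lambda doc_id: cosine(query_vec, embed(corpus[doc_id])),
        reverse=True,
    )
    return ranked[:top_k]


CORPUS = {
    "pricing": "plans pricing and billing tiers",
    "offboarding": "account termination and offboarding steps",
}


def fake_llm(prompt: str) -> str:
    # Canned output standing in for a real generation call.
    return "to terminate your account open settings and follow the offboarding steps"


question = "How do I cancel my subscription?"
print(hyde_retrieve(question, fake_llm, toy_embed, CORPUS))
```

In this toy setup the raw question shares no words with either document, while the hypothetical answer overlaps heavily with the offboarding page. A real embedding model is far less brittle than word overlap, but the direction of the effect is the same.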
Contextual compression. This one sits on the other side of the retrieval call, but it serves the same goal. After retrieval, pass the returned chunks through a step that asks: “what part of this chunk is actually relevant to the query?” Strip the noise before sending to the LLM. Better context, lower token spend, and less irrelevant text for the model to hallucinate from.
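A rough sketch of the compression step, assuming relevance is judged one sentence at a time by an `is_relevant(query, sentence)` callable. In a real pipeline that judgment would likely be an LLM or a small reranker; the `keyword_overlap` heuristic here is only a stand-in so the example runs, and all names are hypothetical.

```python
import re
from typing import Callable


def compress_chunk(
    query: str,
    chunk: str,
    is_relevant: Callable[[str, str], bool],
) -> str:
    """Keep only the sentences of a retrieved chunk judged relevant to the query."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if s and is_relevant(query, s)]
    return " ".join(kept)


def keyword_overlap(query: str, sentence: str) -> bool:
    # Stand-in relevance check: at least two words shared with the query.
    query_words = set(re.findall(r"\w+", query.lower()))
    sentence_words = set(re.findall(r"\w+", sentence.lower()))
    return len(query_words & sentence_words) >= 2


chunk = (
    "Welcome to the help center. "
    "To start subscription cancellation, open Settings. "
    "Cancellation steps take effect at the end of the billing period."
)
compressed = compress_chunk("subscription cancellation steps", chunk, keyword_overlap)
print(compressed)
```

Only the compressed text goes into the final prompt. If nothing survives the filter, that is a useful signal too: the chunk probably should not have been retrieved.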
Why teams skip this.
Because it adds latency and complexity. You’re adding an LLM call before the retrieval step, which means more tokens, more milliseconds, more things to monitor. It feels expensive for an optimization most users won’t notice.
Except they notice. They notice when the search returns garbage and the answer is wrong. They notice when the same question phrased two different ways returns two different answers. They don’t know why — they just know your product doesn’t work.
Query rewriting is a pre-retrieval investment that pays off in retrieval accuracy, answer quality, and user trust. The latency is real but bounded. The accuracy improvement compounds across every query your system handles.
Your retrieval pipeline isn’t just the vector search.
It starts the moment the user submits a query. Everything before the embedding call — normalization, rewriting, expansion, context injection — shapes what you retrieve. Most teams treat that stage as a passthrough.
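To make that stage concrete, here is a small sketch that treats everything before the embedding call as a chain of query-to-query functions. The stage names (`normalize`, `inject_context`) and the bracketed context format are assumptions for illustration; a rewriting or expansion stage would slot into the same list.

```python
from typing import Callable

Stage = Callable[[str], str]


def normalize(query: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(query.split())


def inject_context(product: str) -> Stage:
    """Build a stage that appends context the user carries in their head."""
    def stage(query: str) -> str:
        # ASSUMPTION: the bracketed suffix is an arbitrary illustrative format.
        return f"{query} [context: {product}]"
    return stage


def build_pre_retrieval(stages: list[Stage]) -> Stage:
    """Compose stages, in order, into one function applied before embedding."""
    def run(query: str) -> str:
        for stage in stages:
            query = stage(query)
        return query
    return run


pipeline = build_pre_retrieval([normalize, inject_context("Acme Billing")])
print(pipeline("  how do I   cancel? "))
```

An empty stage list gives you the passthrough most systems ship with today, which makes it easy to add stages one at a time and measure each one.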
Stop treating the raw query as the input. Treat it as the starting point.
Rewrite first. Retrieve second. The gap you’re closing isn’t a UX polish item — it’s the delta between a system that works and one that almost works.
Almost works doesn’t ship.