The Metadata You Forgot to Index
You built a RAG system that retrieves semantically. You forgot to build the one that retrieves precisely. Metadata filtering isn't an optimization — it's the difference between a search engine and a lucky guess.
Most RAG systems are semantic search systems wearing a chatbot costume. You embed documents, embed queries, find nearest neighbors, pass chunks to the LLM. It works — until the user asks something time-sensitive, department-specific, or scoped to a particular product version.
Then it falls apart. Not because the similarity search failed. Because it didn’t know what it was supposed to exclude.
Semantic similarity doesn’t know about scope.
When a user asks “what’s our refund policy?” your vector search will happily return the most semantically similar chunks — which might include the policy from 2022, the draft nobody approved, and the policy for a product that no longer exists. They’re all semantically close. They’re all wrong answers.
This is the metadata problem. Embeddings capture meaning. They don’t capture date, source, document type, department, version, access tier, or any other structured attribute that determines whether a chunk is actually relevant to this user’s actual question.
You can’t embed your way out of that. You need metadata — and you need it indexed.
What metadata filtering actually looks like.
Every chunk in your vector store should carry structured attributes alongside its embedding. At minimum:
- Date / version — when was this created or last updated?
- Source type — policy doc, support article, internal memo, contract?
- Scope — product line, team, region, customer tier?
- Status — published, draft, deprecated, superseded?
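As a rough sketch of what "structured attributes alongside the embedding" can look like, here is one possible shape for a chunk record. Every name here (`ChunkMetadata`, `Chunk`, `to_payload`, the `scope.` key prefix) is an illustrative assumption, not any particular vector store's schema.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Status(Enum):
    PUBLISHED = "published"
    DRAFT = "draft"
    DEPRECATED = "deprecated"
    SUPERSEDED = "superseded"


@dataclass
class ChunkMetadata:
    updated_on: date        # date: when the source was last updated
    version: str            # version label, e.g. "v3"
    source_type: str        # e.g. "policy", "support_article", "memo"
    scope: dict[str, str]   # e.g. {"product": "widget", "region": "EU"}
    status: Status

    def to_payload(self) -> dict[str, str]:
        """Flatten to simple key/value pairs, the form most stores can filter on."""
        payload = {
            "updated_on": self.updated_on.isoformat(),
            "version": self.version,
            "source_type": self.source_type,
            "status": self.status.value,
        }
        # Hypothetical convention: prefix scope keys so they cannot collide.
        payload.update({f"scope.{k}": v for k, v in self.scope.items()})
        return payload


@dataclass
class Chunk:
    chunk_id: str
    text: str
    embedding: list[float]
    metadata: ChunkMetadata
```

Keeping status as an enum rather than free text is a deliberate choice in this sketch: a filter on "published" only works if nobody wrote "Published" or "live" at ingestion time.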
At query time, you apply the filter before or alongside semantic search, not after. Filter after a top-k search and you may discard most of what you retrieved, leaving the LLM with too few eligible chunks. You’re not looking for the most semantically similar chunk in the entire corpus. You’re looking for the most semantically similar chunk within the set of chunks that are actually eligible to answer this question.
Most mainstream vector databases support this. Almost no one sets it up on day one.
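To make the idea concrete, here is a minimal brute-force sketch of pre-filtered search in plain Python. A real vector database would apply the filter inside its index rather than looping over a list. The chunk ids, the toy two-dimensional embeddings, and the `filtered_search` helper are all made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, chunks, filters, k=3):
    """Rank only the chunks whose metadata matches every filter exactly."""
    eligible = [
        c for c in chunks
        if all(c["metadata"].get(key) == val for key, val in filters.items())
    ]
    ranked = sorted(
        eligible, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True
    )
    return ranked[:k]


# Three hypothetical "refund policy" chunks that all sit close to the query.
CHUNKS = [
    {"id": "refund-2022", "embedding": [0.9, 0.1],
     "metadata": {"status": "superseded", "product": "widget"}},
    {"id": "refund-draft", "embedding": [0.88, 0.12],
     "metadata": {"status": "draft", "product": "widget"}},
    {"id": "refund-current", "embedding": [0.8, 0.2],
     "metadata": {"status": "published", "product": "widget"}},
]

query = [1.0, 0.0]
unfiltered = filtered_search(query, CHUNKS, filters={}, k=1)
filtered = filtered_search(query, CHUNKS, filters={"status": "published"}, k=1)

print(unfiltered[0]["id"])  # refund-2022: the closest vector, and the wrong answer
print(filtered[0]["id"])    # refund-current: the closest eligible vector
```

With no filter, the superseded 2022 policy wins on similarity alone. With the status filter applied first, the only candidate left is the one that is allowed to answer.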
The extraction problem people ignore.
You can’t filter on metadata you don’t have. Which means the work starts before ingestion.
Metadata has to be extracted — from filenames, from document headers, from creation timestamps, from folder paths, from whatever naming conventions your content team uses and never documented. Sometimes you extract it automatically. Sometimes you run an LLM pass over the document to infer it. Sometimes you ask humans to tag it.
None of those are free. All of them are worth it.
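The cheapest of those options is usually parsing what the path already tells you. Below is a small sketch under a purely hypothetical naming convention, `<department>/<source_type>/<slug>_v<N>_<YYYY-MM-DD>[_draft].<ext>`. Your content team's conventions will differ, so treat the regex and the `extract_metadata` name as placeholders.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Assumed convention (not a standard): slug_v<version>_<ISO date>[_draft]
FILENAME_RE = re.compile(
    r"^(?P<slug>.+?)_v(?P<version>\d+)_(?P<date>\d{4}-\d{2}-\d{2})(?P<draft>_draft)?$"
)


def extract_metadata(path: str) -> dict:
    """Derive filterable attributes from a document path, flagging what we can't parse."""
    p = PurePosixPath(path)
    meta = {"source_path": path}

    # Folder structure: .../<department>/<source_type>/<file>
    if len(p.parts) >= 3:
        meta["department"] = p.parts[-3]
        meta["source_type"] = p.parts[-2]

    match = FILENAME_RE.match(p.stem)
    if match:
        meta["version"] = int(match.group("version"))
        meta["updated_on"] = date.fromisoformat(match.group("date"))
        meta["status"] = "draft" if match.group("draft") else "published"
    else:
        # Don't guess: route to an LLM pass or a human tagger instead.
        meta["needs_review"] = True
    return meta


print(extract_metadata("support/policy/refund-policy_v3_2024-03-01.pdf"))
print(extract_metadata("notes.txt"))
```

The `needs_review` flag matters as much as the happy path. A document that silently gets no metadata is a document that silently passes or fails every filter.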
The team that skips metadata extraction spends the next six months explaining to stakeholders why the RAG system “sometimes returns old information.” The answer is always the same: because it couldn’t tell old from new.
Metadata also powers the things you haven’t built yet.
Time-bounded retrieval. Access control (don’t return documents a user doesn’t have permission to see). Source citation that actually tells users where information came from. Analytics on which document types are getting retrieved vs. ignored. Feedback loops that let you identify stale content.
None of that works without metadata. All of that is table stakes for a production knowledge system.
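Three of those capabilities reduce to a few lines once the metadata exists. The sketch below assumes each chunk's metadata carries `allowed_groups`, `updated_on`, `title`, and `source_type` fields. Those field names, and the `is_eligible` and `cite` helpers, are hypothetical.

```python
from datetime import date


def is_eligible(meta: dict, user_groups: set, as_of: date, max_age_days: int) -> bool:
    """Access control plus time-bounded retrieval as a single eligibility check."""
    # Access control: the user must share at least one group with the document.
    if not set(meta["allowed_groups"]) & user_groups:
        return False
    # Time bound: reject anything older than the window (or dated in the future).
    age_days = (as_of - meta["updated_on"]).days
    return 0 <= age_days <= max_age_days


def cite(meta: dict) -> str:
    """Build a human-readable citation from the same metadata used for filtering."""
    return f'{meta["title"]} ({meta["source_type"]}, updated {meta["updated_on"].isoformat()})'


policy = {
    "title": "Refund Policy",
    "source_type": "policy",
    "updated_on": date(2024, 3, 1),
    "allowed_groups": ["support", "finance"],
}

today = date(2024, 6, 1)
print(is_eligible(policy, {"support"}, today, max_age_days=365))      # True
print(is_eligible(policy, {"engineering"}, today, max_age_days=365))  # False
print(is_eligible(policy, {"support"}, today, max_age_days=30))       # False
print(cite(policy))  # Refund Policy (policy, updated 2024-03-01)
```

In production you would push this check into the vector store's filter rather than run it in application code, but the fields it depends on are the same either way.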
You built the retrieval. You skipped the scaffolding.
Semantic search is powerful. It’s also blunt. Metadata is the precision layer that makes retrieval trustworthy — not just accurate in aggregate, but correct in the specific case, for the specific user, right now.
If your RAG system can’t filter by date, source, or scope, it’s not a knowledge system. It’s a very expensive autocomplete.
Tag your documents. Index the metadata. Retrieve with intent.