Services Process Blog Demo

Get in touch

hello@sovont.com
Back to blog
· Sovont · 3 min read

The Document Freshness Problem Nobody Talks About

Your RAG pipeline retrieves the right document. The problem is it was last updated eight months ago.

RAG & Knowledge Systems

The retrieval is working. Cosine similarity is pulling the right chunks. The reranker is doing its job. The model is generating a confident, well-structured answer.

And it’s wrong. Because the document it’s reading from was last updated in September.

This is the document freshness problem, and it doesn’t show up in your evals because your eval dataset was built from the same stale corpus you’re retrieving from. Everything looks internally consistent. Nothing is current.


Stale knowledge in a RAG system is worse than no knowledge. No knowledge produces hedged, uncertain answers. Stale knowledge produces confident, authoritative answers about things that are no longer true. Pricing that changed. Policies that got revised. API endpoints that were deprecated. Org charts that shifted after a reorg.

The model doesn’t know the document is out of date. You didn’t tell it. And the user reading the answer has no way to tell either.


Where freshness breaks down:

Indexing is a one-time job in most stacks. Teams build the pipeline, index the corpus, and ship. Re-indexing is a manual process that gets scheduled, then de-prioritized, then forgotten. Six months later the knowledge base is a museum.

Metadata doesn’t flow. Even when documents have last_modified timestamps, that attribute rarely makes it through the chunking pipeline into the vector store. So you can’t filter by recency. You can’t surface a freshness warning. The date just disappears.

Change detection is hard, so nobody does it. Confluence pages update silently. SharePoint doesn’t emit webhooks by default. PDFs get replaced in S3 with no versioning. The source of truth changes; your index doesn’t.


The fix isn’t glamorous. It’s operational discipline applied to your knowledge pipeline:

Propagate timestamps. Every chunk in your vector store should carry source_last_modified. Index it as a filterable field. Query it.

Set freshness thresholds per document type. Product pricing: re-index weekly. Legal policies: re-index on update event. HR handbooks: monthly. Treat freshness as a quality dimension, not an afterthought.

Surface staleness to the model. When a chunk is older than your threshold, include the metadata in the context. Let the model hedge appropriately. “According to documentation last updated March 2025…” is better than a confident hallucination in present tense.

Build a re-indexing pipeline, not a re-indexing script. Scripts get run once. Pipelines get monitored, retried, and alerted on.


Your retrieval system is only as current as the last time you fed it. If you don’t know when that was, that’s the answer to why your users stopped trusting it.

Freshness isn’t a feature. It’s a precondition for correctness.