Why RAG freshness matters more than other staleness
A stale search index just shows yesterday's result, and the visitor recovers as soon as they land on the current page. A stale RAG index is worse: the agent states an old price or a discontinued product in chat, and the visitor never sees the live page to correct it.
The cost is lopsided. A wrong answer damages trust far faster than a missing one, and a single confident mistake can undo a run of correct replies.
The 2026 freshness windows
Different kinds of pages change at different rates, so the recrawl pace should match:
- Pricing pages: at least every 24h, ideally webhook-driven (near-instant)
- Product pages with inventory: 12-24h or webhook
- Policy pages (terms, privacy, returns): 7 days
- Blog content (top 20% of traffic): 24h
- Blog content (long tail): 7 days
- Homepage / landing: 6h or webhook
- FAQ pages: 24h or webhook
A good recrawl schedule sorts pages into these categories and re-checks each one at the right pace.
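As a rough illustration of that sorting step, here is a minimal Python sketch. The intervals come from the list above (using the tighter bound where a range is given). The category names, the path prefixes in `classify`, and the weekly default for unmatched pages are all assumptions made for this example, not part of any particular product.

```python
from datetime import datetime, timedelta, timezone

# Intervals from the freshness windows above; category names are illustrative.
RECRAWL_INTERVALS = {
    "pricing": timedelta(hours=24),
    "product": timedelta(hours=12),   # tighter end of the 12-24h range
    "policy": timedelta(days=7),
    "blog_top": timedelta(hours=24),
    "blog_long_tail": timedelta(days=7),
    "landing": timedelta(hours=6),
    "faq": timedelta(hours=24),
}


def classify(url_path: str, top_blog_paths: frozenset = frozenset()) -> str:
    """Hypothetical path-prefix classifier; a real site needs its own rules."""
    if url_path in ("", "/"):
        return "landing"
    if url_path.startswith("/pricing"):
        return "pricing"
    if url_path.startswith("/products/"):
        return "product"
    if url_path.startswith(("/terms", "/privacy", "/returns")):
        return "policy"
    if url_path.startswith("/faq"):
        return "faq"
    if url_path in top_blog_paths:
        return "blog_top"
    # Assumption: anything unmatched falls back to the weekly long-tail pace.
    return "blog_long_tail"


def is_due(url_path: str, last_crawled: datetime, now: datetime,
           top_blog_paths: frozenset = frozenset()) -> bool:
    """True when the page's category interval has elapsed since the last crawl."""
    interval = RECRAWL_INTERVALS[classify(url_path, top_blog_paths)]
    return now - last_crawled >= interval
```

A scheduler could call `is_due` for every sitemap URL on each tick and enqueue only the pages that return `True`. Which blog posts count as the top 20% of traffic would have to come from analytics, passed in here as `top_blog_paths`.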
The two ways to invalidate the index
A solid setup uses both:
- Push (webhook-driven). The site fires a webhook whenever something changes that should update the index (a Shopify product edit, a WordPress save, a deploy hook), and the affected URL is re-embedded within seconds to minutes.
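- Pull (scheduled). A crawler walks the sitemap on a schedule, compares each page's ETag or Last-Modified header against the value stored in the index, and re-embeds anything that changed. If changes land at random points in the cycle, the average lag is about half the crawl interval, so a 24h crawl leaves a changed page stale for roughly 12h on average.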
Running both together catches far more than either alone: webhooks handle the urgent changes, and scheduled crawls sweep up whatever slips past.
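One way the two paths might feed a single re-embed queue is sketched below. It assumes the crawler has already fetched response headers for each sitemap URL and that the index stores the validators it last saw. `IndexEntry`, `has_changed`, and `urls_to_reembed` are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Mapping, Optional


@dataclass
class IndexEntry:
    """What the index remembers about a page (hypothetical shape)."""
    url: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def has_changed(entry: IndexEntry, headers: Mapping[str, str]) -> bool:
    """Pull path: compare response validators with the stored ones."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    etag, last_modified = h.get("etag"), h.get("last-modified")
    if etag is not None and entry.etag is not None:
        return etag != entry.etag
    if last_modified is not None and entry.last_modified is not None:
        return last_modified != entry.last_modified
    # Assumption: with no comparable validator, re-embed rather than risk staleness.
    return True


def urls_to_reembed(webhook_urls: Iterable[str],
                    crawled_headers: Mapping[str, Mapping[str, str]],
                    index: Dict[str, IndexEntry]) -> List[str]:
    """Merge both paths: webhook URLs first (urgent), then crawl-detected changes."""
    queue = list(dict.fromkeys(webhook_urls))  # deduplicate, keep arrival order
    seen = set(queue)
    for url, headers in crawled_headers.items():
        entry = index.get(url)
        if url not in seen and (entry is None or has_changed(entry, headers)):
            queue.append(url)
            seen.add(url)
    return queue
```

Putting webhook URLs at the front reflects the split described above: push handles the urgent changes, and the scheduled pass only adds pages the webhooks did not already report.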
Checking citations at answer time
Even with both paths, the odd stale entry gets through. The safeguard is to validate each cited link just before the answer goes out: any URL that returns a 404 is dropped, the index entry is invalidated, and the answer is regenerated from the remaining valid sources.
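The check described above could look roughly like this in Python. The status lookup, the invalidation hook, and the regeneration step are passed in as callables because their real implementations depend on the stack. All three parameter names, and `finalize_answer` itself, are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def finalize_answer(draft: str,
                    cited_urls: Sequence[str],
                    fetch_status: Callable[[str], Optional[int]],
                    invalidate: Callable[[str], None],
                    regenerate: Callable[[List[str]], str]) -> str:
    """Validate cited links just before sending; regenerate if any were dropped."""
    valid: List[str] = []
    for url in cited_urls:
        if fetch_status(url) == 404:
            invalidate(url)          # remove the stale entry from the index
        else:
            valid.append(url)
    if len(valid) == len(cited_urls):
        return draft                 # every citation still resolves
    return regenerate(valid)         # rebuild from the remaining valid sources
```

In practice `fetch_status` might issue a HEAD request with a short timeout so the check adds little latency. This sketch only treats 404 as stale, matching the text. How to handle other responses, such as 410 or a timeout, is left open here.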
How it differs from related concepts
- Crawl frequency: how often source pages are re-fetched. Freshness is the broader goal; crawling is one of its mechanisms.
- Index rebuild: rebuilding the whole index from scratch. Freshness is incremental; a rebuild is the heavy-handed option.
- Embedding-model migration: switching to a different model. A separate concern; freshness assumes the model stays put.
- Citation grounding: showing sources in the answer. Freshness is what keeps those sources current.
Related terms
- RAG over website — the kind of corpus freshness applies to
- Retrieval-grounded chat — the architecture freshness lives inside
- Citation grounding — how cited URLs are validated at answer time
- Onsite agent category — the agent category that keeps its index fresh
See also
- RAG freshness: recrawl and invalidation — the implementation deep dive
- URL discovery and prioritisation — what the recrawl schedule draws from
- Multi-tenant RAG — per-tenant freshness scoping
- RAG chunking strategy — chunking interacts with re-embedding cost
- RAG-grounded chat: 2026 architecture — the full retrieval and generation stack
First defined: May 30, 2026.