RAG Freshness — Glossary

Keeping the index in sync

Figure: a website page changes its price on one side while the retrieval index updates to match on the other, with a clock showing the short window between them.

When a page changes, the index has to catch up before the agent answers.

Why RAG freshness matters more than other staleness

A stale search index just shows yesterday's result, and the visitor recovers as soon as they land on the current page. A stale RAG index is worse: the agent states an old price or a discontinued product in chat, and the visitor never sees the live page to correct it.

The cost is lopsided. A wrong answer damages trust far faster than a missing one, and a single confident mistake can undo a run of correct replies.

The 2026 freshness windows

Different kinds of pages change at different rates, so the recrawl pace should match:

Pricing pages: at least every 24h, ideally webhook-driven (near-instant)
Product pages with inventory: 12-24h or webhook
Policy pages (terms, privacy, returns): 7 days
Blog content (top 20% of traffic): 24h
Blog content (long tail): 7 days
Homepage / landing: 6h or webhook
FAQ pages: 24h or webhook

A good recrawl schedule sorts pages into these categories and re-checks each one at the right pace.
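The sorting step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the path prefixes, the category names, and the helper functions `categorize` and `is_due` are all hypothetical, and the window for product pages uses the upper end of the 12-24h range.

```python
from datetime import datetime, timedelta

# Recrawl windows taken from the list above. Category names are illustrative.
RECRAWL_WINDOWS = {
    "pricing": timedelta(hours=24),
    "product": timedelta(hours=24),   # assumption: upper end of the 12-24h range
    "policy": timedelta(days=7),
    "blog_top": timedelta(hours=24),
    "blog_long_tail": timedelta(days=7),
    "landing": timedelta(hours=6),
    "faq": timedelta(hours=24),
}


def categorize(path: str, is_top_traffic: bool = False) -> str:
    """Map a URL path to a category using hypothetical path-prefix rules."""
    if path in ("", "/") or path.startswith("/landing"):
        return "landing"
    if path.startswith("/pricing"):
        return "pricing"
    if path.startswith("/products"):
        return "product"
    if path.startswith(("/terms", "/privacy", "/returns")):
        return "policy"
    if path.startswith("/faq"):
        return "faq"
    if path.startswith("/blog"):
        return "blog_top" if is_top_traffic else "blog_long_tail"
    # Assumption: pages that match no rule get the slowest window.
    return "blog_long_tail"


def is_due(path: str, last_crawled: datetime, now: datetime,
           is_top_traffic: bool = False) -> bool:
    """Return True when the page's recrawl window has elapsed."""
    window = RECRAWL_WINDOWS[categorize(path, is_top_traffic)]
    return now - last_crawled >= window
```

A scheduler would call `is_due` for each sitemap URL on every tick and queue only the pages whose window has elapsed; pages covered by a webhook can still keep a scheduled window as a backstop.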

The two ways to invalidate the index

A solid setup uses both:

Push (webhook-driven). The site fires a webhook whenever something changes that should update the index (a Shopify product edit, a WordPress save, a deploy hook), and the affected URL is re-embedded within seconds to minutes.
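Pull (scheduled). A crawler walks the sitemap on a schedule, compares ETag or Last-Modified headers against the index, and re-embeds anything that changed. For a change that lands at a random point in the cycle, the average lag is about half the crawl interval, so a 24h crawl leaves a change stale for roughly 12h on average.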
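The header comparison on the pull side might look like the sketch below. It assumes the index stores the ETag and Last-Modified values seen at the last crawl; `IndexEntry`, `conditional_headers`, and `needs_reembed` are hypothetical names. `If-None-Match`, `If-Modified-Since`, and the 304 status are standard HTTP. The actual fetch is left out so the logic stays testable without a network.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass
class IndexEntry:
    """Validators stored alongside an indexed URL (hypothetical shape)."""
    url: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: IndexEntry) -> Dict[str, str]:
    """Build request headers that let the server answer 304 Not Modified."""
    headers: Dict[str, str] = {}
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def needs_reembed(entry: IndexEntry, status: int,
                  response_headers: Mapping[str, str]) -> bool:
    """Decide whether a crawled page should be re-embedded.

    Assumes response_headers uses canonical header casing.
    """
    if status == 304:
        return False  # server confirmed nothing changed
    new_etag = response_headers.get("ETag")
    if new_etag is not None and entry.etag is not None:
        return new_etag != entry.etag
    new_modified = response_headers.get("Last-Modified")
    if new_modified is not None and entry.last_modified is not None:
        return new_modified != entry.last_modified
    # No validator to compare against: treat the page as changed.
    return True
```

Treating a page with no validators as changed costs extra embedding work but avoids silently keeping stale content; a content hash is a common fallback when a site sends neither header.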

Running both together catches far more than either alone: webhooks handle the urgent changes, and scheduled crawls sweep up whatever slips past.
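One way to run both paths together is to feed them into a single re-embedding queue that deduplicates by URL and serves webhook-triggered work first. The sketch below is an assumption about how such a queue could be built, not a description of any particular product; `ReindexQueue`, `PUSH`, and `PULL` are hypothetical names.

```python
import heapq
from typing import Dict, List, Optional, Tuple

PUSH, PULL = 0, 1  # lower number means higher priority


class ReindexQueue:
    """Shared queue for webhook (push) and crawler (pull) re-embed requests."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._best: Dict[str, int] = {}  # url -> best priority currently queued
        self._counter = 0                # tie-breaker that keeps FIFO order

    def add(self, url: str, source: int) -> bool:
        """Queue a URL; return False if an equal or better request is pending."""
        if url in self._best and self._best[url] <= source:
            return False
        self._best[url] = source
        heapq.heappush(self._heap, (source, self._counter, url))
        self._counter += 1
        return True

    def pop(self) -> Optional[str]:
        """Return the next URL to re-embed, or None when the queue is empty."""
        while self._heap:
            source, _, url = heapq.heappop(self._heap)
            # Skip entries superseded by a higher-priority request.
            if self._best.get(url) == source:
                del self._best[url]
                return url
        return None
```

With this shape, a webhook for a pricing page jumps ahead of the scheduled sweep, and a crawler that later notices the same change does not trigger a second embedding while the first is still pending.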

Checking citations at answer time

Even with both paths, the odd stale entry gets through. The safeguard is to validate each cited link just before the answer goes out: any URL that returns a 404 is dropped, the index entry is invalidated, and the answer is regenerated from the remaining valid sources.
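The answer-time check can be sketched as follows. The status lookup, the index invalidation, and the answer generator are passed in as callables because the source does not specify them; `validate_citations` and `finalize_answer` are hypothetical names, and only the 404 rule described above is implemented.

```python
from typing import Callable, Iterable, List, Tuple


def validate_citations(
    urls: Iterable[str],
    fetch_status: Callable[[str], int],
    invalidate: Callable[[str], None],
) -> Tuple[List[str], List[str]]:
    """Split cited URLs into (valid, dropped), invalidating each dropped one."""
    valid: List[str] = []
    dropped: List[str] = []
    for url in urls:
        if fetch_status(url) == 404:
            invalidate(url)      # remove the stale entry from the index
            dropped.append(url)
        else:
            valid.append(url)
    return valid, dropped


def finalize_answer(
    draft: str,
    urls: Iterable[str],
    fetch_status: Callable[[str], int],
    invalidate: Callable[[str], None],
    generate: Callable[[List[str]], str],
) -> str:
    """Send the draft if every citation is live, otherwise regenerate."""
    valid, dropped = validate_citations(urls, fetch_status, invalidate)
    if dropped:
        return generate(valid)   # rebuild the answer from the remaining sources
    return draft
```

In practice `fetch_status` would be a lightweight HEAD or GET request with a short timeout, since this check sits on the critical path of every answer.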

How it differs from related concepts

Crawl frequency: how often source pages are re-fetched. Freshness is the broader goal; crawling is one of its mechanisms.
Index rebuild: rebuilding the whole index from scratch. Freshness is incremental; a rebuild is the heavy-handed option.
Embedding-model migration: switching to a different model. A separate concern; freshness assumes the model stays put.
Citation grounding: showing sources in the answer. Freshness is what keeps those sources current.

Related terms

RAG over website — the kind of corpus freshness applies to
Retrieval-grounded chat — the architecture freshness lives inside
Citation grounding — how cited URLs are validated at answer time
Onsite agent category — the agent category that keeps its index fresh

