RAG over Website — Glossary

Answering from the site

Figure: a website's pages, FAQ, and product catalog feed a search index that an onsite chat agent draws on to answer a shopper's question.

The agent answers from the site's own pages instead of generic knowledge.

Why RAG over a website is its own problem

RAG over a website differs from RAG over docs, code, or a help center mainly in the shape of the content and in how fast that content changes:

RAG over docs (Glean, Mendable, Inkeep) indexes developer documentation, tuned for code snippets and API references. It re-crawls weekly, with average chunks of 80-250 tokens.
RAG over a knowledge base (Intercom Fin, Zendesk AI) indexes a help center to deflect tickets. It re-crawls weekly, with 200-500 token chunks.
RAG over a website indexes a marketing or commerce site, a mix of long-form articles and short product cards. Price and inventory pages need daily re-crawls, other content can be re-crawled weekly, and chunks run small and vary by section.

To handle that, a website RAG pipeline has to:

Discover the URLs the site exposes (sitemap.xml, robots.txt, <link rel> tags)
Crawl in priority order (homepage, pricing, key products first; blog last)
Chunk by section (paragraphs for blog, cards for products, accordions for FAQ)
Embed with a model suited to short marketing copy
Re-rank higher-stakes queries like pricing and policy
Refresh on a cadence that matches how often the site actually changes
Cite the source page so the visitor can check it
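
The first two steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the sitemap has already been fetched as a string, and the path prefixes and tier numbers in PRIORITY_TIERS are hypothetical stand-ins for whatever a given site treats as high priority.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by sitemap.xml files that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical path-prefix tiers; a lower number is crawled earlier.
PRIORITY_TIERS = [("/pricing", 0), ("/products", 1), ("/faq", 2), ("/blog", 3)]
DEFAULT_TIER = 2


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def crawl_tier(url: str) -> int:
    """Map a URL to its crawl tier; the homepage is always tier 0."""
    path = urlparse(url).path or "/"
    if path == "/":
        return 0
    for prefix, tier in PRIORITY_TIERS:
        if path.startswith(prefix):
            return tier
    return DEFAULT_TIER


def crawl_order(urls: list[str]) -> list[str]:
    """Sort URLs by tier, then alphabetically so the order is deterministic."""
    return sorted(urls, key=lambda u: (crawl_tier(u), u))
```

Calling crawl_order(urls_from_sitemap(xml_text)) yields the homepage and pricing pages first and blog posts last, which matches the priority order described above.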

Failure modes specific to website RAG

The pattern is well understood; the trouble is in the details.

Stale catalog. The agent quotes a price that changed yesterday. The fix is a priority recrawl on price and inventory pages every 24h.
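
One way to express that recrawl rule is a small scheduler that picks the pages whose age exceeds their interval. This is a sketch under assumptions: the page kinds ("price", "inventory", "content") and the tuple layout are hypothetical, and the 7-day interval for other content is taken from the weekly cadence mentioned earlier.

```python
from datetime import datetime, timedelta

# Assumed cadences: 24h for price and inventory pages, weekly for everything else.
RECRAWL_INTERVAL = {
    "price": timedelta(hours=24),
    "inventory": timedelta(hours=24),
    "content": timedelta(days=7),
}


def due_for_recrawl(pages, now: datetime) -> list[str]:
    """Return URLs that are due for a recrawl.

    pages is an iterable of (url, kind, last_crawled) tuples.
    Pages with shorter intervals come first; within the same
    interval, the most overdue page comes first.
    """
    due = []
    for url, kind, last_crawled in pages:
        interval = RECRAWL_INTERVAL.get(kind, RECRAWL_INTERVAL["content"])
        overdue = (now - last_crawled) - interval
        if overdue >= timedelta(0):
            due.append((interval, -overdue, url))
    due.sort()
    return [url for _, _, url in due]
```

Sorting by interval first means a price page that is a few hours overdue still outranks a blog post that is a day overdue, which is the point of a priority recrawl.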

Thin chunks. Marketing copy is dense, so default chunking can leave a whole blog post as one or two chunks, which makes retrieval too coarse. The fix is to split by paragraph or by H2/H3 boundary.
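
Splitting at H2/H3 boundaries might look like the sketch below, which uses only Python's standard html.parser module. The class and function names are illustrative, and the sketch makes no attempt to strip navigation, scripts, or styles, which a real crawler would need to do first.

```python
from html.parser import HTMLParser


class HeadingChunker(HTMLParser):
    """Collect page text into chunks, starting a new chunk at each h2 or h3."""

    def __init__(self):
        super().__init__()
        self.chunks = []          # list of {"heading": str, "text": str}
        self._heading = ""        # heading of the chunk being built
        self._parts = []          # text fragments of the chunk being built
        self._in_heading = False

    def flush(self):
        """Close the current chunk, skipping it if it has no body text."""
        text = " ".join(" ".join(self._parts).split())
        if text:
            self.chunks.append({"heading": self._heading.strip(), "text": text})
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self.flush()
            self._heading = ""
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading += data
        else:
            self._parts.append(data)


def chunk_by_heading(html: str) -> list[dict]:
    """Return one chunk per h2/h3 section, plus any text before the first one."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.chunks
```

Keeping the heading alongside each chunk gives the retriever a short label for the section and gives the agent something readable to cite.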

Branded-language collision. "Yokaify" matches every Yokaify page, which doesn't help you find the right one. The fix is hybrid retrieval (BM25 plus vector) so rare phrases still surface.
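
The text above does not say how the BM25 and vector results should be combined. One common choice is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method any particular product uses. The two input rankings are assumed to come from an existing BM25 index and an existing vector index, and k=60 is the value conventionally used for RRF.

```python
def reciprocal_rank_fusion(rankings, k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores the sum of 1 / (k + rank) over every list it
    appears in, with ranks starting at 1. Ties are broken by document ID
    so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from the two retrievers.
bm25_ranking = ["returns-policy", "home", "about"]
vector_ranking = ["home", "pricing", "returns-policy"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

A page that ranks well in both lists rises to the top, while a page that only the keyword side finds, for example one containing a rare phrase, still makes it into the fused list.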

A cited page that no longer exists. The site changed, but a stale entry still points at a deleted URL, and the agent confidently cites a 404. The fix is to invalidate entries on URL deletion via a sitemap diff.
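
A sitemap diff of this kind can be sketched with plain set operations. The index here is a hypothetical dict mapping chunk IDs to source URLs; a real vector store would expose its own delete call. A URL missing from the sitemap is not proof that the page is gone, so a production version would probably confirm with an HTTP request before purging.

```python
def sitemap_diff(previous_urls, current_urls) -> dict:
    """Compare two sitemap snapshots and report added and removed URLs."""
    previous, current = set(previous_urls), set(current_urls)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


def invalidate_deleted(index: dict, previous_urls, current_urls) -> list:
    """Delete index entries whose source URL left the sitemap.

    index maps chunk_id -> source URL and is modified in place.
    Returns the IDs of the chunks that were removed.
    """
    removed = set(sitemap_diff(previous_urls, current_urls)["removed"])
    stale_ids = [chunk_id for chunk_id, url in index.items() if url in removed]
    for chunk_id in stale_ids:
        del index[chunk_id]
    return stale_ids
```

Running this after each sitemap fetch keeps the agent from citing pages that have been taken down since the last crawl.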

How it differs from related approaches

Generic LLM: no retrieval at all; it answers from training data and invents site-specific facts.
Static FAQ chatbot: keyword-matches a fixed FAQ; brittle, with no real retrieval.
RAG over docs: the same pattern, but the corpus is developer documentation, with different chunking and refresh rules.
Tool-using agent: the model calls external APIs at chat time, such as a product-search API. A different mechanism that can be combined with RAG.

Related terms

Onsite agent category: the agent category that answers from your site's content
In-session engagement: the chat surface this powers
Behavioral intervention: the layer that decides when the agent speaks up