Glossary

RAG over Website

RAG over a website applies retrieval-augmented generation to a website's own content (pages, FAQ, policies, product catalog) so that an AI agent on that site answers with the site's facts rather than generic LLM knowledge.

Baidyanath | May 25, 2026 | 4 min read | Updated May 31, 2026
Yokaify
The agent answers from the site's own pages instead of generic knowledge.

Why RAG over a website is its own problem

It differs from RAG over docs, code, or a help center mainly in the shape of the content and how fast it changes:

  • RAG over docs (Glean, Mendable, Inkeep) indexes developer documentation, tuned for code snippets and API references. It re-crawls weekly, with average chunks of 80-250 tokens.
  • RAG over a knowledge base (Intercom Fin, Zendesk AI) indexes a help center to deflect tickets. It re-crawls weekly, with 200-500 token chunks.
  • RAG over a website indexes a marketing or commerce site, a mix of long-form articles and short product cards. Prices and inventory need daily crawls, other content can be re-crawled weekly, and chunks run small and vary by section.

To handle that, a website RAG pipeline has to:

  • Discover the URLs the site exposes (sitemap.xml, robots.txt, <link rel> tags)
  • Crawl in priority order (homepage, pricing, key products first; blog last)
  • Chunk by section (paragraphs for blog, cards for products, accordions for FAQ)
  • Embed with a model suited to short marketing copy
  • Re-rank higher-stakes queries like pricing and policy
  • Refresh on a cadence that matches how often the site actually changes
  • Cite the source page so the visitor can check it
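
The first two steps, discovery and priority ordering, can be sketched in a few lines. This is a minimal illustration, not a full crawler: it reads only a single sitemap.xml string, and the path prefixes and tier numbers in `PRIORITY_RULES` are hypothetical values that would be tuned for each site.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical path-prefix tiers: a lower number is crawled earlier.
PRIORITY_RULES = [("/pricing", 0), ("/products", 1), ("/faq", 2), ("/blog", 9)]
DEFAULT_TIER = 5


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap.xml document."""
    root = ET.fromstring(xml_text.strip())
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def crawl_tier(url: str) -> int:
    """Map a URL to a crawl tier; the homepage is always tier 0."""
    path = urlparse(url).path or "/"
    if path == "/":
        return 0
    for prefix, tier in PRIORITY_RULES:
        if path.startswith(prefix):
            return tier
    return DEFAULT_TIER


def crawl_order(urls: list[str]) -> list[str]:
    """Sort URLs by tier, then alphabetically for a stable order."""
    return sorted(urls, key=lambda u: (crawl_tier(u), u))


SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/launch</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products/widget</loc></url>
</urlset>
"""

for url in crawl_order(urls_from_sitemap(SAMPLE)):
    print(url)
```

With the sample above, the homepage and pricing page come first, the product page next, and the blog post last.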

Failure modes specific to website RAG

The pattern is well understood; the trouble is in the details.

Stale catalog. The agent quotes a price that changed yesterday. The fix is a priority recrawl on price and inventory pages every 24h.
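
One way to express that cadence is a per-page-type interval table checked on every scheduler tick. The sketch below assumes each page is already labeled with a kind; the labels `price`, `inventory`, and `content` and the function names are illustrative, not part of any particular crawler.

```python
from datetime import datetime, timedelta, timezone

# Cadence from the text: 24h for price and inventory pages, weekly otherwise.
RECRAWL_INTERVAL = {
    "price": timedelta(hours=24),
    "inventory": timedelta(hours=24),
    "content": timedelta(days=7),
}


def is_due(kind: str, last_crawled: datetime, now: datetime) -> bool:
    """True when the page's interval has elapsed since its last crawl."""
    interval = RECRAWL_INTERVAL.get(kind, RECRAWL_INTERVAL["content"])
    return now - last_crawled >= interval


def due_pages(pages: dict[str, tuple[str, datetime]], now: datetime) -> list[str]:
    """pages maps URL path -> (kind, last crawl time); returns the paths to recrawl."""
    return sorted(path for path, (kind, last) in pages.items() if is_due(kind, last, now))


now = datetime(2026, 5, 31, 12, 0, tzinfo=timezone.utc)
pages = {
    "/pricing": ("price", now - timedelta(hours=25)),
    "/products/widget": ("inventory", now - timedelta(hours=3)),
    "/blog/launch": ("content", now - timedelta(days=2)),
    "/blog/old": ("content", now - timedelta(days=8)),
}
print(due_pages(pages, now))  # ['/blog/old', '/pricing']
```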

Thin chunks. Marketing copy is dense, so default chunking can leave a whole blog post as one or two chunks, which makes retrieval too coarse. The fix is to split by paragraph or by H2/H3 boundary.
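
A rough sketch of heading-aware splitting, using only Python's standard `html.parser`. It assumes the page body uses plain `<h2>`, `<h3>`, and `<p>` tags; real pages usually need boilerplate stripped first, and the class and function names here are made up for illustration.

```python
from html.parser import HTMLParser


class SectionChunker(HTMLParser):
    """Emit one chunk per <p>, tagged with the nearest preceding <h2>/<h3>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[tuple[str, str]] = []  # (heading, paragraph text)
        self._heading = ""
        self._open_tag: str | None = None
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3", "p"):
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Collect text only while inside a heading or paragraph.
        if self._open_tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._open_tag:
            return  # inline tags such as <b> do not close the block
        text = " ".join("".join(self._buffer).split())
        if tag in ("h2", "h3"):
            self._heading = text
        elif text:
            self.chunks.append((self._heading, text))
        self._open_tag = None


def chunk_html(html: str) -> list[tuple[str, str]]:
    parser = SectionChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


PAGE = (
    "<h2>Shipping</h2><p>We ship in <b>two</b> days.</p>"
    "<p>Returns are free.</p>"
    "<h3>International</h3><p>Duties may apply.</p>"
)
for heading, text in chunk_html(PAGE):
    print(f"[{heading}] {text}")
```

Keeping the heading alongside each paragraph gives the retriever some section context even when the paragraph itself is short.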

Branded-language collision. "Yokaify" matches every Yokaify page, which doesn't help you find the right one. The fix is hybrid retrieval (BM25 plus vector) so rare phrases still surface.
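
The text does not say how the BM25 and vector results are combined. One common choice is reciprocal rank fusion (RRF), shown here as an assumption rather than the method used on any particular site. The sketch takes two already-ranked lists of page paths (the search calls that produce them are out of scope) and uses the conventional constant `k = 60`.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for the query "Yokaify pricing".
bm25_ranking = ["/pricing", "/blog/pricing-story", "/about"]
vector_ranking = ["/about", "/pricing", "/home"]

print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
# ['/pricing', '/about', '/blog/pricing-story', '/home']
```

A page that ranks well in both lists, here `/pricing`, rises above pages that only one retriever favors.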

A cited page that no longer exists. The site changed, but a stale entry still points at a deleted URL, and the agent confidently cites a 404. The fix is to invalidate entries on URL deletion via a sitemap diff.
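
A minimal sketch of that sitemap diff, assuming the index can be viewed as a mapping from chunk ID to source URL. The mapping shape and the function name are illustrative; a real vector store would expose its own delete-by-metadata call.

```python
def invalidate_deleted(
    index: dict[str, str], current_urls: list[str]
) -> tuple[dict[str, str], list[str]]:
    """Drop every chunk whose source URL is missing from the latest sitemap.

    index maps chunk ID -> source URL. Returns (surviving index, removed URLs).
    """
    live = set(current_urls)
    removed = sorted({url for url in index.values() if url not in live})
    kept = {chunk_id: url for chunk_id, url in index.items() if url in live}
    return kept, removed


index = {
    "c1": "https://example.com/pricing",
    "c2": "https://example.com/old-promo",
    "c3": "https://example.com/old-promo",
    "c4": "https://example.com/faq",
}
latest_sitemap = ["https://example.com/pricing", "https://example.com/faq"]

kept, removed = invalidate_deleted(index, latest_sitemap)
print(removed)       # ['https://example.com/old-promo']
print(sorted(kept))  # ['c1', 'c4']
```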

Neighboring approaches, for contrast:

  • Generic LLM: no retrieval at all; it answers from training data and invents site-specific facts.
  • Static FAQ chatbot: keyword-matches a fixed FAQ; brittle, with no real retrieval.
  • RAG over docs: the same pattern, but the corpus is developer documentation, with different chunking and refresh rules.
  • Tool-using agent: the model calls external APIs at chat time, such as a product-search API. A different mechanism that can be combined with RAG.

First defined: May 24, 2026.