Why RAG over a website is its own problem
RAG over a website differs from RAG over docs, code, or a help center mainly in the shape of the content and how fast it changes:
- RAG over docs (Glean, Mendable, Inkeep) indexes developer documentation, tuned for code snippets and API references. It re-crawls weekly, with average chunks of 80-250 tokens.
- RAG over a knowledge base (Intercom Fin, Zendesk AI) indexes a help center to deflect tickets. It re-crawls weekly, with 200-500 token chunks.
- RAG over a website indexes a marketing or commerce site, a mix of long-form articles and short product cards. Prices and inventory need daily crawls, other content can wait for a weekly crawl, and chunks run small and vary by section.
To handle that, a website RAG pipeline has to:
- Discover the URLs the site exposes (sitemap.xml, robots.txt, <link rel> tags)
- Crawl in priority order (homepage, pricing, key products first; blog last)
- Chunk by section (paragraphs for blog, cards for products, accordions for FAQ)
- Embed with a model suited to short marketing copy
- Re-rank higher-stakes queries like pricing and policy
- Refresh on a cadence that matches how often the site actually changes
- Cite the source page so the visitor can check it
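The first two steps in the list above (discover, then crawl in priority order) can be sketched roughly as follows. This is a minimal illustration, not a full crawler: it assumes the sitemap XML has already been fetched, and the path prefixes and tier numbers in PRIORITY_RULES are hypothetical placeholders that a real site would replace with its own.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical priority tiers (assumption): lower number = crawl earlier.
PRIORITY_RULES = [
    ("/pricing", 0),
    ("/products", 1),
    ("/blog", 3),
]
DEFAULT_PRIORITY = 2


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Extract every <loc> URL from a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def crawl_priority(url: str) -> int:
    """Map a URL to a priority tier based on its path prefix."""
    path = urlparse(url).path or "/"
    if path == "/":
        return 0  # homepage goes first
    for prefix, priority in PRIORITY_RULES:
        if path.startswith(prefix):
            return priority
    return DEFAULT_PRIORITY


def crawl_order(urls: list[str]) -> list[str]:
    """Sort URLs by tier, then alphabetically so the order is stable."""
    return sorted(urls, key=lambda u: (crawl_priority(u), u))
```

Sorting by a (tier, URL) pair keeps the ordering deterministic between runs, which makes crawl logs easier to compare.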
Failure modes specific to website RAG
The pattern is well understood; the trouble is in the details.
Stale catalog. The agent quotes a price that changed yesterday. The fix is a priority recrawl on price and inventory pages every 24h.
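One way to express the recrawl rule is a per-kind interval table. The sketch below assumes each page has already been labeled "price" or "content" and that its last crawl time is stored; the 24-hour interval comes from the text above, the 7-day interval mirrors the weekly cadence mentioned earlier, and the data shape is hypothetical.

```python
from datetime import datetime, timedelta

# Intervals per page kind (the labels themselves are an assumption).
RECRAWL_INTERVALS = {
    "price": timedelta(hours=24),
    "content": timedelta(days=7),
}


def is_due(kind: str, last_crawled: datetime, now: datetime) -> bool:
    """True when the page's interval has elapsed since the last crawl."""
    return now - last_crawled >= RECRAWL_INTERVALS[kind]


def pages_due(pages: dict[str, tuple[str, datetime]], now: datetime) -> list[str]:
    """Return the URLs due for recrawl, price pages first.

    `pages` maps URL -> (kind, last_crawled).
    """
    due = [url for url, (kind, last) in pages.items() if is_due(kind, last, now)]
    return sorted(due, key=lambda u: (pages[u][0] != "price", u))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the scheduling logic easy to test.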
Thin chunks. Marketing copy is dense, so a default fixed-size chunker can leave a whole blog post as one or two chunks, and retrieval becomes too coarse to pick out the relevant passage. The fix is to split by paragraph or at H2/H3 boundaries.
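A minimal sketch of splitting at H2/H3 boundaries and paragraphs, assuming the page has already been converted to plain text in which H2 and H3 headings appear as lines starting with "## " and "### " and paragraphs are separated by blank lines. That input format is an assumption for illustration; a real pipeline would more likely walk the HTML directly.

```python
import re

# Matches "## Heading" or "### Heading", but not "#" or "####".
HEADING_RE = re.compile(r"^(#{2,3})\s+(.*)$")


def chunk_by_section(text: str) -> list[dict]:
    """Emit one chunk per paragraph, tagged with the nearest H2/H3 heading."""
    chunks: list[dict] = []
    heading = ""
    paragraph: list[str] = []

    def flush() -> None:
        # Close the paragraph being collected, if any.
        if paragraph:
            chunks.append({"heading": heading, "text": " ".join(paragraph)})
            paragraph.clear()

    for line in text.splitlines():
        stripped = line.strip()
        match = HEADING_RE.match(stripped)
        if match:
            flush()  # a heading ends the previous paragraph
            heading = match.group(2)
        elif not stripped:
            flush()  # a blank line ends the paragraph
        else:
            paragraph.append(stripped)
    flush()
    return chunks
```

Carrying the heading along with each chunk gives the retriever some section context even when the paragraph itself is short.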
Branded-language collision. "Yokaify" matches every Yokaify page, which doesn't help you find the right one. The fix is hybrid retrieval (BM25 plus vector) so rare phrases still surface.
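The text does not say how the BM25 and vector results are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used here. It takes ranked lists of document IDs that are assumed to come from separate BM25 and vector searches, and k=60 is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores the sum of 1 / (k + rank) over every list it
    appears in, so a page ranked well by either retriever can surface.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, with hypothetical result lists `bm25 = ["refund-policy", "pricing", "home"]` and `vector = ["home", "refund-policy", "about"]`, calling `reciprocal_rank_fusion([bm25, vector])` puts `"refund-policy"` first, because it ranks near the top of both lists.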
A cited page that no longer exists. The site changed, but a stale entry still points at a deleted URL, and the agent confidently cites a 404. The fix is to invalidate entries on URL deletion via a sitemap diff.
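The sitemap diff can be sketched as a set difference between two crawls. The index shape used here (chunk ID mapped to source URL) is a hypothetical simplification; a real vector store would delete by a metadata filter instead.

```python
def sitemap_diff(previous: list[str], current: list[str]) -> dict[str, list[str]]:
    """Compare two sitemap URL lists and report what changed."""
    prev_set, curr_set = set(previous), set(current)
    return {
        "added": sorted(curr_set - prev_set),
        "removed": sorted(prev_set - curr_set),
    }


def invalidate_removed(index: dict[str, str], removed_urls: list[str]) -> dict[str, str]:
    """Return a copy of the index without chunks whose source URL was removed.

    `index` maps chunk ID -> source URL (assumed shape).
    """
    removed = set(removed_urls)
    return {cid: url for cid, url in index.items() if url not in removed}
```

Running the diff on every crawl and invalidating before answering means a deleted page drops out of retrieval, so it can no longer be cited.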
How it differs from related approaches
- Generic LLM: no retrieval at all; it answers from training data and invents site-specific facts.
- Static FAQ chatbot: keyword-matches a fixed FAQ; brittle, with no real retrieval.
- RAG over docs: the same pattern, but the corpus is developer documentation, with different chunking and refresh rules.
- Tool-using agent: the model calls external APIs at chat time, such as a product-search API. A different mechanism that can be combined with RAG.
Related terms
- Onsite agent category — the agent category that answers from your site's content
- In-session engagement — the chat surface this powers
- Behavioral intervention — the layer that decides when the agent speaks up
See also
- RAG-grounded chat: the 2026 architecture for accurate site agents
- Next.js chatbot 2026 architecture guide — RAG implementation in Next.js
- The onsite-agent category reference — where RAG over website fits in the stack
First defined: May 24, 2026.