Benchmarking Three RAG Architectures
We tested three RAG architectures on 500 real user queries across four domains: legal documents, technical documentation, financial reports, and customer support logs.
The architectures
Naive RAG
Embed query → cosine similarity → top-k chunks → stuff into prompt.
def naive_rag(query: str, k: int = 5) -> str:
embedding = embed(query)
chunks = vector_store.search(embedding, top_k=k)
context = "\n".join(c.text for c in chunks)
return llm.generate(f"Context: {context}\n\nQuestion: {query}")Hybrid search
Combine dense retrieval (embeddings) with sparse retrieval (BM25). Re-rank the union with a cross-encoder.
def hybrid_rag(query: str, k: int = 5) -> str:
dense_results = vector_store.search(embed(query), top_k=k * 2)
sparse_results = bm25_index.search(query, top_k=k * 2)
combined = deduplicate(dense_results + sparse_results)
reranked = cross_encoder.rerank(query, combined)[:k]
context = "\n".join(c.text for c in reranked)
return llm.generate(f"Context: {context}\n\nQuestion: {query}")Agentic RAG
The retriever is an agent that decides what to search for, evaluates results, and iterates.
async def agentic_rag(query: str, max_iterations: int = 3) -> str:
plan = await planner.decompose(query)
context_parts = []
for sub_query in plan.sub_queries:
results = await hybrid_search(sub_query)
evaluation = await evaluator.check_relevance(sub_query, results)
if evaluation.score < 0.7:
refined = await planner.refine_query(sub_query, evaluation)
results = await hybrid_search(refined)
context_parts.extend(results[:3])
return await synthesizer.generate(query, context_parts)Results
| Metric | Naive RAG | Hybrid | Agentic |
|---|---|---|---|
| Accuracy (legal) | 62% | 78% | 89% |
| Accuracy (tech docs) | 71% | 83% | 87% |
| Accuracy (financial) | 58% | 74% | 91% |
| Accuracy (support) | 68% | 79% | 82% |
| Mean accuracy | 65% | 79% | 87% |
| Latency (p50) | 1.2s | 2.1s | 4.8s |
| Latency (p95) | 2.4s | 3.8s | 12.3s |
| Cost per query | $0.003 | $0.008 | $0.024 |
Key findings
The accuracy gap between naive and hybrid is 14 points — almost free performance by adding BM25 and a cross-encoder. The gap between hybrid and agentic is 8 points — but at 3x the cost and 2.3x the latency.
- Legal documents showed the largest improvement with agentic RAG (+27% over naive). Legal queries require precise terminology matching that benefits from query decomposition and iterative refinement.
- Customer support showed the smallest improvement (+14% over naive). Support queries are typically simple and direct — the agent's iteration doesn't add much.
- Financial reports had the most variance in agentic RAG (some queries took 15+ seconds). The agent would sometimes enter loops trying to find data that didn't exist in the corpus.
The right architecture depends on the domain. Don't assume agentic RAG is always better — it's 8x more expensive than naive RAG for 22 percentage points of accuracy.
The hidden cost: evaluation
These numbers took three weeks to produce. Not because the systems were hard to build — each took 1-2 days. The evaluation was the bottleneck:
- Defining "correct" for 500 queries across 4 domains
- Building domain-specific evaluation rubrics
- Running human evaluation (3 annotators per query)
- Reconciling disagreements
- Re-running after each architecture change
The evaluation infrastructure is now more complex than any of the RAG systems. This is normal and expected.