Benchmarking Three RAG Architectures

We tested three RAG architectures on 500 real user queries across four domains: legal documents, technical documentation, financial reports, and customer support logs.

The architectures

Naive RAG

Embed query → cosine similarity → top-k chunks → stuff into prompt.

def naive_rag(query: str, k: int = 5) -> str:
    embedding = embed(query)
    chunks = vector_store.search(embedding, top_k=k)
    context = "\n".join(c.text for c in chunks)
    return llm.generate(f"Context: {context}\n\nQuestion: {query}")

Hybrid RAG

Combine dense retrieval (embeddings) with sparse retrieval (BM25). Re-rank the union with a cross-encoder.

def hybrid_rag(query: str, k: int = 5) -> str:
    dense_results = vector_store.search(embed(query), top_k=k * 2)
    sparse_results = bm25_index.search(query, top_k=k * 2)

    combined = deduplicate(dense_results + sparse_results)
    reranked = cross_encoder.rerank(query, combined)[:k]

    context = "\n".join(c.text for c in reranked)
    return llm.generate(f"Context: {context}\n\nQuestion: {query}")
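
The deduplicate helper in the snippet above is left undefined. A minimal sketch, assuming each retrieved chunk carries a stable chunk_id (the attribute name and Chunk class are assumptions, not the benchmark's actual types):

```python
from typing import List

class Chunk:
    """Minimal stand-in for a retrieved chunk (hypothetical)."""
    def __init__(self, chunk_id: str, text: str):
        self.chunk_id = chunk_id
        self.text = text

def deduplicate(results: List[Chunk]) -> List[Chunk]:
    # Keep the first occurrence of each chunk id, preserving input
    # order so dense hits stay ahead of their sparse duplicates.
    seen = set()
    unique = []
    for chunk in results:
        if chunk.chunk_id not in seen:
            seen.add(chunk.chunk_id)
            unique.append(chunk)
    return unique
```

First-seen order matters here: since dense results are concatenated before sparse ones, a chunk found by both retrievers keeps its dense-ranked position going into the re-ranker.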

Agentic RAG

The retriever is an agent that decides what to search for, evaluates results, and iterates.

async def agentic_rag(query: str, max_iterations: int = 3) -> str:
    plan = await planner.decompose(query)
    context_parts = []

    for sub_query in plan.sub_queries:
        results = await hybrid_search(sub_query)
        evaluation = await evaluator.check_relevance(sub_query, results)

        if evaluation.score < 0.7:  # below the relevance threshold: refine the query once and retry
            refined = await planner.refine_query(sub_query, evaluation)
            results = await hybrid_search(refined)

        context_parts.extend(results[:3])

    return await synthesizer.generate(query, context_parts)

Results

Metric                 Naive RAG   Hybrid   Agentic
Accuracy (legal)           62%       78%      89%
Accuracy (tech docs)       71%       83%      87%
Accuracy (financial)       58%       74%      91%
Accuracy (support)         68%       79%      82%
Mean accuracy              65%       79%      87%
Latency (p50)             1.2s      2.1s     4.8s
Latency (p95)             2.4s      3.8s    12.3s
Cost per query          $0.003    $0.008   $0.024

Key findings

The mean accuracy gap between naive and hybrid is 14 percentage points: nearly free performance from adding BM25 and a cross-encoder. The gap between hybrid and agentic is another 8 points, but at 3x the cost and 2.3x the p50 latency.
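
The tradeoff can be made concrete with the table's numbers. A back-of-envelope calculation of the marginal cost per extra point of mean accuracy when stepping up an architecture:

```python
# Mean accuracy and cost-per-query figures from the results table.
systems = {
    "naive":   {"accuracy": 65, "cost": 0.003},
    "hybrid":  {"accuracy": 79, "cost": 0.008},
    "agentic": {"accuracy": 87, "cost": 0.024},
}

def marginal_cost_per_point(cheaper: str, pricier: str) -> float:
    # Extra dollars per query divided by extra percentage points
    # of mean accuracy gained by the upgrade.
    d_acc = systems[pricier]["accuracy"] - systems[cheaper]["accuracy"]
    d_cost = systems[pricier]["cost"] - systems[cheaper]["cost"]
    return d_cost / d_acc

print(marginal_cost_per_point("naive", "hybrid"))    # ~$0.00036 per point
print(marginal_cost_per_point("hybrid", "agentic"))  # $0.002 per point
```

By this measure, each accuracy point bought by the agentic upgrade costs roughly 5-6x more than a point bought by the hybrid upgrade.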

  • Legal documents showed the largest improvement with agentic RAG (+27 percentage points over naive). Legal queries require precise terminology matching, which benefits from query decomposition and iterative refinement.
  • Customer support showed the smallest improvement (+14 points over naive). Support queries are typically simple and direct, so the agent's iteration adds little.
  • Financial reports had the highest latency variance under agentic RAG (some queries took 15+ seconds): the agent would sometimes loop trying to find data that didn't exist in the corpus.
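
The looping failure on financial queries suggests giving the agent a hard refinement budget with a give-up path. A minimal sketch, using callable stand-ins for the search, scoring, and refinement components (the names and max_refinements default are assumptions; the 0.7 threshold matches the agentic snippet above):

```python
def bounded_retrieve(sub_query, search, score, refine,
                     max_refinements=2, threshold=0.7):
    # Retrieve once, then allow at most max_refinements query rewrites.
    results = search(sub_query)
    for _ in range(max_refinements):
        if score(sub_query, results) >= threshold:
            return results
        sub_query = refine(sub_query)
        results = search(sub_query)
    # Budget exhausted: return best-effort results rather than
    # iterating forever on data that may not exist in the corpus.
    return results
```

The point is that "not found" must be a terminal state: when the corpus genuinely lacks the answer, no amount of query rewriting helps, and the synthesizer should see whatever partial context exists instead of waiting on an unbounded loop.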

The right architecture depends on the domain. Don't assume agentic RAG is always better: it costs 8x more per query than naive RAG for a 22-percentage-point gain in mean accuracy.

The hidden cost: evaluation

These numbers took three weeks to produce. Not because the systems were hard to build — each took 1-2 days. The evaluation was the bottleneck:

  • Defining "correct" for 500 queries across 4 domains
  • Building domain-specific evaluation rubrics
  • Running human evaluation (3 annotators per query)
  • Reconciling disagreements
  • Re-running after each architecture change
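
The reconciliation step can be partially automated. A minimal sketch of majority-voting three annotators' labels per query and flagging non-unanimous queries for a human pass (label values and data shape are assumptions, not the actual tooling):

```python
from collections import Counter

def reconcile(labels_per_query):
    # labels_per_query: {query_id: ["correct", "incorrect", ...]}
    # Majority vote resolves each query; any disagreement at all
    # sends the query to a human review queue.
    resolved, needs_review = {}, []
    for query_id, labels in labels_per_query.items():
        winner, votes = Counter(labels).most_common(1)[0]
        resolved[query_id] = winner
        if votes < len(labels):
            needs_review.append(query_id)
    return resolved, needs_review
```

With three annotators and binary labels a majority always exists, so the review queue holds only the 2-1 splits, which is where rubric ambiguities tend to surface.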

The evaluation infrastructure is now more complex than any of the RAG systems. This is normal and expected.