The 16 Types of RAG: A Practical Field Guide

Table of Contents

  1. Why So Many Flavors of RAG?
  2. The Common Vocabulary
  3. Standard (Naive) RAG
  4. Hybrid RAG
  5. HyDE RAG
  6. Contextual Retrieval RAG
  7. Recursive / Multi-Step RAG
  8. Self-RAG
  9. Modular RAG
  10. Memory-Augmented RAG
  11. GraphRAG
  12. Knowledge-Enhanced RAG
  13. Agentic RAG
  14. Multi-Modal RAG
  15. Multi-Model RAG
  16. Federated RAG
  17. Streaming RAG
  18. ODQA RAG
  19. Domain-Specific RAG
  20. Side-by-Side Comparison
  21. Decision Guide: Which RAG Should You Pick?

1. Why So Many Flavors of RAG?

The original RAG paper (Lewis et al., 2020) described one thing: encode a query, retrieve nearby document vectors, condition the generator on them. That is it. Six years later "RAG" is an umbrella term covering dozens of architectures, because the original recipe breaks down in production for predictable reasons:

  • Vector similarity is bad at exact identifiers, codes, and names.
  • A short query is a poor proxy for what the user actually wants retrieved.
  • One retrieval step cannot answer a question that needs three.
  • Static indexes go stale the moment your domain changes.
  • Different data shapes (logs, tables, images, graphs) need different retrievers.
  • LLMs hallucinate even with retrieved context if nothing checks the answer.

Each "type of RAG" is a named answer to one (or several) of those failure modes. They are not mutually exclusive. A serious production system is usually three or four of them at once: e.g. Hybrid + Contextual + Self-RAG over a Domain-Specific corpus.

A note on the list. The 16 names below are the ones that recur most in 2024–2026 literature, vendor blogs, and the survey paper by Gao et al. (Retrieval-Augmented Generation for Large Language Models: A Survey, 2023). They do not all live at the same level of abstraction — some are pipeline shapes, some are retrieval techniques, some are deployment topologies. The article calls that out where it matters.


2. The Common Vocabulary

Before diving in, the moving parts every RAG variant rearranges:

┌─────────────────────────────────────────────────────────────┐
│  RAG building blocks                                        │
│                                                             │
│  • Indexer        → reads source data, chunks, embeds       │
│  • Index/store    → vector DB, BM25, SQL, graph, KV cache   │
│  • Retriever      → finds candidate chunks for a query      │
│  • Reranker       → re-scores candidates with a stronger    │
│                     (often cross-encoder) model             │
│  • Compressor     → trims/summarises before the LLM         │
│  • Generator      → the LLM that writes the answer          │
│  • Evaluator      → judges answer quality / groundedness    │
│  • Memory         → persists facts/preferences across turns │
│  • Router         → decides which retriever/tool to call    │
│  • Tools          → APIs, SQL, code, browsers               │
└─────────────────────────────────────────────────────────────┘

Every variant in this article is a particular wiring of these blocks.


3. Standard (Naive) RAG

Definition

The original 2020 recipe. Embed the user's query, do a top-k cosine search against a vector store, paste the retrieved chunks into the LLM prompt, generate an answer.

Components

Block        Typical choice
─────        ──────────────
Indexer      Fixed-size chunker (~500 tokens, 50-token overlap)
Embeddings   text-embedding-3-small / bge-small-en
Index        Pinecone / Qdrant / pgvector / Chroma
Retriever    Cosine top-k (k = 4–8)
Generator    GPT-4o / Claude / Llama

Flow

Documents ──▶ Chunk ──▶ Embed ──▶ Vector store
                                        │
User query ──▶ Embed ──▶ top-k search ──┘
                                        │
                                        ▼
                          [System] + [chunks] + [query]
                                        │
                                        ▼
                                       LLM
                                        │
                                        ▼
                                     Answer
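The flow above fits in a few dozen lines. This is a minimal sketch, not a production recipe: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the index is an in-memory list rather than a vector store, and all function names are illustrative assumptions. The final LLM call is left out; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter

def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Fixed-size chunker with overlap; words stand in for tokens."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Toy embedding (assumption): bag-of-words counts, not a neural model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_k(query: str, index: list[tuple[str, Counter]], k: int = 4) -> list[str]:
    """Rank every stored chunk by cosine similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """[System] + [chunks] + [query], as in the diagram."""
    context = "\n---\n".join(chunks)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are issued within 14 days of a return being received.",
    "Shipping takes 3 to 5 business days within the EU.",
    "Support is available by email on weekdays.",
]
index = [(c, embed(c)) for d in docs for c in chunk(d)]
hits = top_k("How long do refunds take?", index, k=1)
prompt = build_prompt("How long do refunds take?", hits)
```

Swapping `embed` for a real model and `index` for a vector store changes the plumbing but not the shape.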

Where it fits

  • "Chat with my PDFs" prototypes
  • Internal wiki Q&A where the corpus is small and homogeneous
  • Anywhere you need a baseline before measuring whether something better is worth the cost

Where to avoid it

  • Queries that contain exact identifiers (INV-8472, error codes, SKUs) — vector similarity is weak here; reach for Hybrid.
  • Multi-hop questions ("Who paid the vendor that supplied the item that failed inspection?") — needs Recursive or GraphRAG.
  • Domains where "the answer" is spread across dozens of chunks — Standard RAG retrieves locally, never globally.
  • Anything where wrong answers are expensive — there is no checking step.

Pros / cons vs. the rest

  • ✅ Cheapest and fastest to build. Good baseline for measuring everything else.
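  • ❌ Brittle on real corpora. In Anthropic's Contextual Retrieval evaluation (Sept 2024), an embeddings-only baseline failed to surface the relevant chunk in its top 20 results for 5.7% of queries. That is with a generous k; at the k = 4–8 typical of Standard RAG, the miss rate can only be higher.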

4. Hybrid RAG

Definition

Run two or more retrievers in parallel — almost always dense (vector / embedding) and sparse (BM25 / keyword) — and fuse their results. Often extended with structured retrieval (SQL), metadata filters, and a reranker.

Components

            Query
              │
   ┌──────────┼──────────┐
   ▼          ▼          ▼
 Vector     BM25       SQL / metadata
 search    (sparse)    filter
   │          │          │
   └──────────┼──────────┘
              ▼
    Result fusion (RRF / weighted)
              │
              ▼
          Reranker
              │
              ▼
             LLM

RRF (Reciprocal Rank Fusion) is the most common merge — it scores each doc by Σ 1/(k + rank_i) across retrievers, no weight tuning needed.
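As a sketch of the formula above: each retriever contributes 1/(k + rank) for every document it returns, and the contributions are summed. The function name and the document ids are illustrative; k = 60 is the constant commonly used as a default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc ids. Ranks are 1-based; higher score is better."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]    # order returned by vector search
sparse = ["doc_c", "doc_a", "doc_d"]   # order returned by BM25
fused = reciprocal_rank_fusion([dense, sparse])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that both retrievers rank highly (`doc_a`, `doc_c`) rise above documents only one retriever found, without any score normalisation between the two systems.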

Flow

  1. Same query goes to dense and sparse indexes simultaneously.
  2. Each returns its own top-N.
  3. RRF or weighted-sum produces a unified top-K.
  4. Cross-encoder reranker (e.g. Cohere Rerank, BGE-Reranker) reorders.
  5. Top 3–5 go to the LLM.

Where it fits

  • Any corpus that contains a mix of natural language and identifiers, code, error messages, product codes, model numbers.
  • Customer support, e-commerce search, log analysis.
  • Anything previously served well by Elasticsearch — don't throw BM25 away when you add vectors.

Where to avoid it

  • Trivially small corpora (< few thousand chunks) where the engineering overhead isn't worth it.
  • Pure semantic-similarity tasks (e.g. "find me articles with the same vibe as this one") where keywords add noise.

Pros / cons

  • ✅ Reliably beats vector-only RAG on almost every public benchmark (BEIR, MTEB retrieval). Microsoft, Anthropic, and Cohere all publish numbers showing 10–30% recall improvement from adding BM25.
  • ❌ Two indexes to keep in sync. Slightly higher latency.

5. HyDE RAG

Definition

Hypothetical Document Embeddings (Gao et al., 2022). Before searching, ask an LLM to write a fake answer to the query, then embed that fake answer and use its vector to search. The hypothetical document is discarded — only retrieval uses it.

Components

Query ──▶ LLM ──▶ Hypothetical document
                          │
                          ▼
                    Embedding model
                          │
                          ▼
                    Vector search
                          │
                          ▼
                  Real retrieved chunks
                          │
                          ▼
                         LLM (final answer
                              uses real chunks,
                              not the hypothetical)

Why it works

Embeddings are trained on document-shaped text. Short, ambiguous queries ("queue issue laravel") sit in a part of the embedding space that nothing in your index occupies. The hypothetical answer ("Laravel queue failures are typically caused by Horizon worker timeouts, Redis connection drops, or…") lands in document space, near actual matching content.
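A small sketch makes the mechanism concrete. Everything here is a stand-in: `embed` is a toy bag-of-words vector rather than a trained embedding model, and `canned_llm` returns a fixed string where a real system would call an LLM. With this toy embedding the raw query shares no terms with any document, while the hypothetical answer does.

```python
import math
import re
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    """Toy embedding (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def hyde_search(query: str, corpus: list[str],
                write_hypothetical: Callable[[str], str], k: int = 2) -> list[str]:
    """Search with the embedding of a hypothetical answer instead of the query."""
    hypothetical = write_hypothetical(query)   # used for retrieval only, then discarded
    probe = embed(hypothetical)
    ranked = sorted(corpus, key=lambda doc: cosine(probe, embed(doc)), reverse=True)
    return ranked[:k]

def canned_llm(query: str) -> str:
    """Stub for the LLM call that drafts the hypothetical document."""
    return ("Laravel queue failures are typically caused by Horizon worker "
            "timeouts or Redis connection drops.")

corpus = [
    "Horizon worker timeouts: raise the timeout value when jobs are killed mid-run.",
    "Blade templates support components and slots.",
    "Redis connection drops cause queued jobs to be retried.",
]
results = hyde_search("queue issue laravel", corpus, canned_llm, k=2)
raw_scores = [cosine(embed("queue issue laravel"), embed(doc)) for doc in corpus]
```

The final answer would still be generated from `results`, never from the hypothetical text itself.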

Where it fits

  • Short or vague queries.
  • Zero-shot retrieval against a corpus you can't fine-tune embeddings on.
  • Cross-domain retrieval where the query language and the document language differ in style (a developer's Slack question vs. formal docs).

Where to avoid it

  • Queries that contain exact entities — HyDE will hallucinate context around the entity and may pull retrieval away from the actual record.
  • Latency-sensitive paths — you've added a generation call before retrieval.
  • Anything you'll bill per token at scale — every query now costs a hypothetical generation.

Pros / cons

  • ✅ Cheap engineering win, no model training needed. Original paper shows it competing with fine-tuned dense retrievers.
  • ❌ Bias from the hypothetical: if the LLM's first guess is in the wrong direction, retrieval follows it there. Pair with a reranker.

6. Contextual Retrieval RAG

Definition

There are two things people mean by this term — be explicit about which one:

  1. Anthropic's "Contextual Retrieval" (Sept 2024). Before embedding each chunk, prepend a short LLM-generated summary explaining where in the source document the chunk lives. Index the augmented chunk.
  2. Generic "context-aware retrieval" — incorporate conversation history, user profile, current task state into the retrieval query.

The Anthropic technique is the one with a reproducible benchmark.

The Anthropic recipe

For each chunk in the corpus:
    context = LLM("Document: {whole_doc}
                    Chunk: {chunk}
                    Give 50–100 tokens locating this chunk
                    in the document.")
    augmented_chunk = context + "\n\n" + chunk
    embed(augmented_chunk) ──▶ vector store
    bm25_index(augmented_chunk)

At query time the retrieval is otherwise standard (Hybrid is recommended).
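The indexing loop above could look roughly like this in Python. The prompt wording, the record layout, and `stub_llm` are assumptions for illustration; a real pipeline would call an actual model (with the whole document cached) and then pass `indexed_text` to both the embedder and the BM25 index.

```python
from typing import Callable

# Hypothetical prompt template; the exact wording is an assumption.
PROMPT = (
    "<document>\n{doc}\n</document>\n"
    "Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write a short context (50-100 tokens) locating this chunk in the document."
)

def contextualise(doc: str, chunks: list[str], llm: Callable[[str], str]) -> list[dict]:
    """Build one record per chunk with LLM-written context prepended."""
    records = []
    for i, chunk in enumerate(chunks):
        context = llm(PROMPT.format(doc=doc, chunk=chunk)).strip()
        records.append({
            "chunk_id": i,
            "original": chunk,                         # what the generator sees and cites
            "indexed_text": f"{context}\n\n{chunk}",   # what gets embedded and BM25-indexed
        })
    return records

def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call."""
    return "From ACME Corp's Q2 2023 financial report, revenue section."

doc = "ACME Corp Q2 2023 report. The company's revenue grew by 3% over the previous quarter."
chunks = ["The company's revenue grew by 3% over the previous quarter."]
records = contextualise(doc, chunks, stub_llm)
```

Keeping `original` separate from `indexed_text` means the generated context helps retrieval without being quoted back to the user as if it were source text.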

Flow (generic context-aware variant)

User turn N
      │
      ▼
┌──────────────────┐
│ Context builder  │ ← conversation history
│                  │ ← user profile
│                  │ ← active task / workflow state
└──────────────────┘
      │
      ▼
Query rewriter (resolves "it", "that", "the one I mentioned")
      │
      ▼
Retrieval (dense + sparse)
      │
      ▼
LLM

Where it fits

  • Long technical documents where chunks are meaningless without context (legal contracts, API specs, scientific papers).
  • Multi-turn chatbots where users use pronouns and ellipsis ("how do I fix it?").
  • Anthropic reports a ~35% reduction in top-20 retrieval failure rate from contextual embeddings alone, ~49% when combined with contextual BM25, and ~67% with a reranker on top.

Where to avoid it

  • Already-self-contained chunks (FAQ entries, product cards) — adding context wastes tokens and may hurt.
  • Single-turn search APIs where there is no conversation history.
  • Very large corpora where re-embedding everything with LLM-generated context is expensive (Anthropic suggests prompt caching to mitigate).

Pros / cons

  • ✅ One of the best price/quality wins of 2024–2025. Implementation is ~50 lines of glue code.
  • ❌ Indexing cost is now proportional to LLM-token-cost-per-chunk, not embedding-cost-per-chunk. For 1M chunks that's significant.

7. Recursive / Multi-Step RAG

Definition

Run multiple retrieve-then-reason cycles, where each cycle's output decides what to retrieve next. Sometimes called iterative retrieval, multi-hop RAG, or in research papers IRCoT (Trivedi et al., 2022) and Self-Ask (Press et al., 2022).

Components

                    User question
                          │
                          ▼
              ┌───────────────────────┐
              │ Decompose / plan      │
              └───────────────────────┘
                          │
              ┌───────────┴────────────┐
              ▼                        ▼
        Sub-question 1           Sub-question 2
              │                        │
              ▼                        ▼
          Retrieval                 Retrieval
              │                        │
              ▼                        ▼
          Partial answer 1        Partial answer 2
              │                        │
              └───────────┬────────────┘
                          ▼
              Synthesise / reason
                          │
            ┌─────────────┴────────────┐
            ▼                          ▼
   "Need more info?" ──Yes──▶ Loop with new sub-question
            │
            No
            ▼
       Final answer
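The loop in the diagram reduces to a small control structure. In this sketch `retrieve` and `next_query` are rule-based stubs standing in for a real retriever and an LLM planner, and the fact table is made up for the example. The part worth copying is the hard hop limit.

```python
from typing import Callable, Optional

def multi_step_answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    next_query: Callable[[str, list[str]], Optional[str]],
    max_hops: int = 4,
) -> tuple[list[str], int]:
    """Retrieve-then-reason loop. `next_query` returns the next sub-question,
    or None once the evidence is judged sufficient."""
    evidence: list[str] = []
    query: Optional[str] = question
    hops = 0
    while query is not None and hops < max_hops:
        evidence.extend(p for p in retrieve(query) if p not in evidence)
        hops += 1
        query = next_query(question, evidence)
    return evidence, hops

# Toy corpus and stubs (assumptions for illustration only).
FACTS = {
    "Minecraft": "Minecraft is made by Mojang.",
    "Mojang": "Mojang was acquired by Microsoft.",
}

def retrieve(query: str) -> list[str]:
    return [fact for key, fact in FACTS.items() if key in query]

def next_query(question: str, evidence: list[str]) -> Optional[str]:
    text = " ".join(evidence)
    if "Microsoft" in text:
        return None                      # enough to answer
    if "Mojang" in text:
        return "Who acquired Mojang?"    # follow-up hop
    return None

evidence, hops = multi_step_answer("Who owns the maker of Minecraft?", retrieve, next_query)
```

The first hop only surfaces the Mojang fact; the second hop is possible because the first result named the entity to look up next.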

Where it fits

  • Multi-hop questions: "What was the second-largest acquisition by the company that acquired the maker of Minecraft?"
  • Investigations / root-cause analysis where the next thing to look at depends on what you just learned.
  • Research-style synthesis across many sources.

Where to avoid it

  • Latency-sensitive UX (each hop is at least one LLM call + one retrieval; 4 hops is 4× the cost and time of Standard RAG).
  • Single-fact lookups — you'll often hallucinate sub-questions when none are needed.
  • Without a hard hop-limit and a reflection step, the loop can drift into irrelevant territory.

Pros / cons

  • ✅ Only practical way to answer genuine multi-hop questions without a graph index.
  • ❌ Cost and latency scale linearly with hops. Drift is common — pair with Self-RAG style critique to bound it.

8. Self-RAG

Definition

Self-Reflective RAG (Asai et al., 2023). The model is taught (or prompted) to emit reflection tokens that decide:

  • Should I retrieve at all? ([Retrieve] / [No Retrieve])
  • Is this retrieved passage relevant? ([Relevant] / [Irrelevant])
  • Is my draft answer supported by the passage? ([Supported] / [Partial] / [Not supported])
  • How useful is the answer to the user? ([Useful=5..1])

If the answer to any of these is bad, retrieve again, rewrite, or refuse.

Components

   Query
     │
     ▼
 Should I retrieve?
     │
  No │── (model answers from parametric knowledge)
     │
  Yes
     ▼
  Retrieve
     │
     ▼
 Per-passage relevance critique
     │
     ▼
  Generate draft
     │
     ▼
 Groundedness critique  ── "Not supported" ──▶ retrieve / rewrite
     │
     ▼
 Utility critique
     │
     ▼
   Answer
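The prompted approximation of this flow is mostly control logic around a few yes/no judgements. In the sketch below every judgement (`needs_retrieval`, `is_relevant`, `is_supported`) is a keyword-based stub where a real system would ask a model, and "retrieve again" is implemented as widening k, which is only one of several possible retry strategies.

```python
from typing import Callable

REFUSAL = "I could not find support for an answer in the available sources."

def self_rag(query: str, *, needs_retrieval: Callable[[str], bool],
             retrieve: Callable[[str, int], list[str]],
             is_relevant: Callable[[str, str], bool],
             generate: Callable[[str, list[str]], str],
             is_supported: Callable[[str, list[str]], bool],
             max_attempts: int = 2) -> str:
    if not needs_retrieval(query):             # the [No Retrieve] branch
        return generate(query, [])
    k = 2
    for _ in range(max_attempts):
        passages = [p for p in retrieve(query, k) if is_relevant(query, p)]
        if passages:
            draft = generate(query, passages)
            if is_supported(draft, passages):  # groundedness critique
                return draft
        k *= 2                                 # retrieve again, wider
    return REFUSAL                             # refuse rather than guess

# Stubs (assumptions): a fixed-order corpus and keyword-based critics.
CORPUS = ["Our office is in Berlin.", "Parking is free on weekends.",
          "Refunds are issued within 14 days."]

def needs_retrieval(q): return "our" in q.lower().split()
def retrieve(q, k): return CORPUS[:k]
def is_relevant(q, p): return any(w in p.lower() for w in q.lower().split() if len(w) > 5)
def generate(q, passages): return passages[0] if passages else "(answered from parametric knowledge)"
def is_supported(draft, passages): return draft in passages

stubs = dict(needs_retrieval=needs_retrieval, retrieve=retrieve, is_relevant=is_relevant,
             generate=generate, is_supported=is_supported)
grounded = self_rag("What's our refund policy?", **stubs)
direct = self_rag("What's 2+2?", **stubs)
refused = self_rag("What's our vacation policy?", **stubs)
```

The three calls exercise the three exits: a grounded answer after a second, wider retrieval; a direct answer with no retrieval; and a refusal when nothing relevant turns up.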

Where it fits

  • Production systems where hallucinations are expensive (legal, medical, financial advice).
  • Mixed query streams — some questions need retrieval ("what's our refund policy?"), some don't ("what's 2+2?"). Self-RAG lets the system not pay retrieval cost when it's pointless.
  • Anywhere you want the model to refuse rather than hallucinate.

Where to avoid it

  • Tiny QA bots over a tiny corpus — the critique steps are overhead.
  • Closed, API-only model stacks where you can't fine-tune for reflection tokens (you can do a prompted approximation, but it is weaker than the trained version).

Pros / cons

  • ✅ Empirically reduces hallucination rate substantially. Pairs naturally with Recursive RAG.
  • ❌ More LLM calls per answer. The trained reflection-token version requires a fine-tuned model; the prompted version is more brittle.

9. Modular RAG

Definition

Less a specific architecture, more a design philosophy: the RAG pipeline is built as a graph of swappable, single-responsibility modules — query rewriter, router, retriever(s), reranker, compressor, generator, evaluator — orchestrated rather than hard-coded. The Gao et al. survey (2023) coined the term to contrast with "Naive RAG" and "Advanced RAG".

Reference shape

        Query
          │
          ▼
   ┌──────────────┐
   │ Rewrite /    │
   │ decompose    │
   └──────────────┘
          │
          ▼
   ┌──────────────┐
   │   Router     │ ── intent / domain / data-source decision
   └──────────────┘
          │
   ┌──────┴───────┐
   ▼              ▼
 Vector DB    SQL / Graph / API
   │              │
   └──────┬───────┘
          ▼
    Reranker
          │
          ▼
    Compressor
          │
          ▼
       LLM
          │
          ▼
    Evaluator
          │
          ▼
      Answer
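One minimal way to get the "swappable modules" property without a framework is to treat each stage as a function from state to state and the pipeline as a named list. This is a sketch under that assumption; the stage names and the toy stages are hypothetical, and the per-stage trace is the simplest possible stand-in for real tracing.

```python
from typing import Callable

Stage = Callable[[dict], dict]

def run_pipeline(stages: list[tuple[str, Stage]], query: str) -> dict:
    """Run stages in order, recording which ones executed."""
    state = {"query": query, "trace": []}
    for name, stage in stages:
        state = stage(state)
        state["trace"].append(name)
    return state

def rewrite(state: dict) -> dict:
    return {**state, "query": state["query"].strip().lower()}

def vector_retriever(state: dict) -> dict:
    # Stub retriever: a real one would query an index.
    return {**state, "candidates": ["chunk about refunds", "chunk about shipping"]}

def keyword_reranker(state: dict) -> dict:
    words = state["query"].split()
    ranked = sorted(state["candidates"], key=lambda c: -sum(w in c for w in words))
    return {**state, "candidates": ranked}

baseline = [("rewrite", rewrite), ("retrieve", vector_retriever)]
with_rerank = baseline + [("rerank", keyword_reranker)]   # add a module, touch nothing else

result = run_pipeline(with_rerank, "  Shipping times ")
```

Adding, removing, or replacing a stage is a one-line change to the list, which is the property the frameworks above formalise.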

Where it fits

  • Production systems that are going to keep evolving — you can swap the embedding model, change the reranker, add a SQL retriever, without rewriting the pipeline.
  • Multi-tenant platforms where different tenants need different routing logic.
  • Frameworks: LlamaIndex (QueryPipeline, RouterQueryEngine), LangChain LCEL, Haystack 2.x, DSPy.

Where to avoid it

  • A weekend prototype — modularity is overhead until you have a reason for it.
  • Teams without observability tooling — modular pipelines are hard to debug without per-stage tracing (LangSmith, Arize Phoenix, OpenTelemetry).

Pros / cons

  • ✅ Future-proof. Lets you adopt every other technique in this article incrementally.
  • ❌ Easy to over-engineer — many "modular" pipelines are slower and worse than a 50-line baseline.

10. Memory-Augmented RAG

Definition

RAG plus a persistent memory layer that is updated by the system itself, not just by an indexing pipeline. Memory is treated as a first-class retrieval source on equal footing with documents.

Memory taxonomy

Memory type          What it stores                               Typical store
───────────          ──────────────                               ─────────────
Short-term           Current session / conversation buffer        RAM, Redis
Long-term semantic   Distilled facts ("user prefers Postgres")    Vector DB
Long-term episodic   Past interactions / events                   Vector DB, append-only log
Procedural           Workflows / playbooks the agent has learned  Structured store

Flow

Turn N
   │
   ▼
┌────────────────────────────┐
│  Retrieve memory + docs    │
│   ┌──────────┐ ┌─────────┐ │
│   │  memory  │ │ vector  │ │
│   │  store   │ │  store  │ │
│   └──────────┘ └─────────┘ │
└────────────────────────────┘
   │
   ▼
  LLM generates answer
   │
   ▼
┌────────────────────────────┐
│  Memory writer:            │
│   • extract durable facts  │
│   • dedupe                 │
│   • importance-score       │
│   • write back             │
└────────────────────────────┘

The memory writer is the part most teams skip and most papers focus on. Without it you're just keeping chat history, not building memory. Frameworks: mem0, Letta (formerly MemGPT), Zep.
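A memory writer can start very small. The sketch below assumes the extraction step has already happened (in practice an LLM would turn the turn's transcript into candidate facts with importance scores); it shows only the dedupe, threshold, and write-back parts. `MemoryStore` and its fields are hypothetical names, not the API of any of the frameworks mentioned.

```python
import re
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    # normalised fact text -> highest importance seen so far
    facts: dict[str, float] = field(default_factory=dict)

    def write(self, candidates: list[tuple[str, float]],
              min_importance: float = 0.5) -> list[str]:
        """Persist durable facts; return the keys that were newly written."""
        written = []
        for text, importance in candidates:
            if importance < min_importance:
                continue                                   # not worth remembering
            key = re.sub(r"\s+", " ", text.strip().lower()).rstrip(".")
            if key in self.facts:                          # dedupe on normalised text
                self.facts[key] = max(self.facts[key], importance)
                continue
            self.facts[key] = importance
            written.append(key)
        return written

store = MemoryStore()
new_facts = store.write([
    ("User prefers Postgres.", 0.9),
    ("user prefers  postgres", 0.7),   # same fact, different surface form
    ("User said hello", 0.1),          # below the importance threshold
])
```

Exact-match dedupe on normalised text is the crudest option; real systems usually compare embeddings, and they also need the deletion path and TTL discussed below.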

Where it fits

  • Long-running assistants (months/years of interaction with the same user).
  • Personalisation where re-asking the user every session is unacceptable.
  • Agents that need to remember decisions and outcomes ("last time we tried X, it failed").

Where to avoid it

  • Stateless one-shot APIs.
  • Anywhere with strict data-deletion / GDPR constraints unless you build the deletion path first — memory systems silently accumulate PII.
  • Domains where stale memories are dangerous (medical dosing, current pricing) without a TTL strategy.

Pros / cons

  • ✅ Massive UX improvement for repeat users. Foundational for serious agents.
  • ❌ Easy to corrupt — bad facts written once contaminate future answers. Needs an importance/confidence model and a way to forget.

11. GraphRAG

Definition

Retrieval over a knowledge graph instead of (or in addition to) a flat vector index. The graph is typically extracted from text by an LLM during indexing — entities become nodes, relationships become edges. Microsoft's open-source graphrag library (2024) is the reference implementation.

Indexing phase

Documents
    │
    ▼
LLM extraction pass: entities + relations
    │
    ▼
Build graph
    │
    ▼
Community detection (Leiden / Louvain)
    │
    ▼
LLM summarises each community
    │
    ▼
Graph + community summaries persisted

Query modes

Mode            Mechanism                                      Best for
────            ─────────                                      ────────
Local search    Anchor on matched entities, traverse 1–3 hops  "Tell me about vendor X"
Global search   Map-reduce over community summaries            "What are the recurring themes across the corpus?"
DRIFT search    Hybrid of local and global                     Mixed-granularity questions
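Local search is, at its core, a bounded graph traversal from an anchor entity. The sketch below shows that core only, as a breadth-first walk over (source, relation, target) triples; it is not how the graphrag library implements it, and the entities and relations are made up. A real local search would also pull in the text chunks and community summaries attached to the visited nodes.

```python
from collections import deque

Edge = tuple[str, str, str]  # (source, relation, target)

def local_search(edges: list[Edge], anchor: str, max_hops: int = 2) -> list[Edge]:
    """Collect every edge reachable within max_hops of the anchor entity."""
    neighbours: dict[str, list[Edge]] = {}
    for edge in edges:
        neighbours.setdefault(edge[0], []).append(edge)
        neighbours.setdefault(edge[2], []).append(edge)

    seen_nodes, seen_edges, found = {anchor}, set(), []
    queue = deque([(anchor, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue                      # hop budget used up on this path
        for edge in neighbours.get(node, []):
            if edge in seen_edges:
                continue
            seen_edges.add(edge)
            found.append(edge)
            other = edge[2] if edge[0] == node else edge[0]
            if other not in seen_nodes:
                seen_nodes.add(other)
                queue.append((other, depth + 1))
    return found

edges = [
    ("Vendor X", "supplied", "Item 42"),
    ("Item 42", "failed", "Inspection 7"),
    ("Inspection 7", "performed_by", "Lab Q"),
    ("Vendor Y", "supplied", "Item 99"),
]
two_hops = local_search(edges, "Vendor X", max_hops=2)
one_hop = local_search(edges, "Vendor X", max_hops=1)
```

The returned edges are the inspectable "graph path" that would be serialised into the prompt as context.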

Where it fits

  • Corpora dense in named entities and relationships (legal, biomedical, intelligence analysis, large support ticket archives).
  • Questions that include "across all", "common patterns", "who is connected to whom".
  • When you need explainable answers — graph paths are directly inspectable.

Where to avoid it

  • Small corpora — the LLM extraction pass is expensive (often 10×–50× more tokens than embedding) and overkill.
  • Highly dynamic corpora — re-indexing is slow; the graph drifts from reality.
  • When you already have structured relationships in a database — extracting them again from text is a worse copy. Use Knowledge-Enhanced RAG instead.

Pros / cons

  • ✅ The only RAG variant that handles "global" / corpus-wide questions well.
  • ❌ Indexing can take hours and cost serious money for large corpora. Entity resolution is statistical and imperfect ("J. Smith" vs. "John Smith" vs. "Smith, J.").

Covered in detail (with comparison to Context Graph) in From Plain LLM to Context Graph.


12. Knowledge-Enhanced RAG

Definition

RAG augmented with pre-existing structured knowledge — knowledge bases, ontologies, taxonomies, business rules, ER diagrams — rather than knowledge extracted from text. The graph or schema is authored, not statistically inferred.

How it differs from GraphRAG

GraphRAG                       Knowledge-Enhanced RAG
────────                       ──────────────────────
Graph extracted from text      Graph already exists
LLM is the extractor           Domain experts / app are
                                the source of truth
Probabilistic edges            Exact, typed edges
Re-indexing on doc change      Live updates from systems
Best for: text corpora         Best for: structured domains

Components

  • A canonical schema (entities, relations, constraints).
  • Live ingestion from operational systems (DB CDC, event bus).
  • A retriever that can mix vector search over text and graph queries (Cypher / SPARQL / SQL).
  • Optionally a Text2Cypher / Text2SQL layer so the LLM can ask the graph directly.

Where it fits

  • Enterprises that already have a curated KB (Wikidata mirror, internal CMDB, product catalog with relations, drug-interaction database).
  • Compliance / audit contexts where edges must be exact and explainable.
  • Anywhere "structure" is a stable artifact of your business.

Where to avoid it

  • You don't have an ontology and won't build one — KE-RAG without a schema is just a brittle GraphRAG.
  • Fast-moving startup domains where the schema would change weekly.

Pros / cons

  • ✅ Highest precision in its niche. Edges are trustworthy because humans defined them.
  • ❌ Up-front modelling cost. Schema drift is its own ongoing tax.

13. Agentic RAG

Definition

RAG where retrieval is just one tool among many in an agent loop. The LLM decides — at each step — whether to retrieve, call an API, run code, query SQL, search the web, or answer directly. Often multiple agents collaborate (router + worker + critic).

Reference loop

        ┌────────────────────────────────────┐
        │            Agent loop              │
        │                                    │
   ┌────┴────┐                               │
   │  Plan   │  ← user goal / scratchpad     │
   └────┬────┘                               │
        │                                    │
        ▼                                    │
   ┌─────────┐                               │
   │  Pick   │ ── retrieve_docs              │
   │  tool   │ ── search_web                 │
   │         │ ── query_sql                  │
   │         │ ── call_api                   │
   │         │ ── run_code                   │
   │         │ ── answer                     │
   └────┬────┘                               │
        │                                    │
        ▼                                    │
   ┌─────────┐                               │
   │ Execute │                               │
   └────┬────┘                               │
        │                                    │
        ▼                                    │
   ┌─────────┐                               │
   │ Observe │                               │
   └────┬────┘                               │
        │                                    │
        ▼                                    │
   Done? ── No ───────────────────▶──────────┘
        │
       Yes
        ▼
     Answer
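Stripped of the model, the loop above is a dispatcher with a step budget. In this sketch the `policy` callable stands in for the LLM's plan-and-pick-tool decision and is scripted by hand; the tool names mirror the diagram but the implementations are stubs. None of this reflects a particular framework's API.

```python
from typing import Callable

Tool = Callable[[str], str]
Policy = Callable[[str, list[str]], tuple[str, str]]  # -> (tool_name, tool_input)

def run_agent(goal: str, tools: dict[str, Tool], policy: Policy, max_steps: int = 5) -> str:
    scratchpad: list[str] = []
    for _ in range(max_steps):
        tool_name, tool_input = policy(goal, scratchpad)      # plan + pick tool
        if tool_name == "answer":
            return tool_input
        if tool_name not in tools:
            scratchpad.append(f"error: unknown tool {tool_name!r}")
            continue
        observation = tools[tool_name](tool_input)            # execute
        scratchpad.append(f"{tool_name}({tool_input!r}) -> {observation}")  # observe
    return "Stopped: step budget exhausted."

def scripted_policy(goal: str, scratchpad: list[str]) -> tuple[str, str]:
    """Hand-written stand-in for the LLM: retrieve once, then answer."""
    if not scratchpad:
        return "retrieve_docs", goal
    return "answer", scratchpad[-1].split(" -> ", 1)[1]

def always_retrieve(goal: str, scratchpad: list[str]) -> tuple[str, str]:
    """A misbehaving policy that never decides it is done."""
    return "retrieve_docs", goal

tools = {"retrieve_docs": lambda q: "Refund window is 14 days."}
answer = run_agent("What is the refund window?", tools, scripted_policy)
runaway = run_agent("What is the refund window?", tools, always_retrieve, max_steps=3)
```

The `max_steps` bound is what turns a runaway loop into a bounded failure.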

Components

  • A planner / orchestrator (often the LLM itself with structured outputs).
  • A tool registry with explicit JSON-schema interfaces.
  • Short-term scratchpad memory; usually long-term memory (Memory-Augmented RAG).
  • Optionally a critic agent that reviews the worker's output.
  • Frameworks: LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, DSPy.

Where it fits

  • Tasks that require taking action, not just answering — sending emails, opening tickets, executing trades.
  • Complex investigations where the next step genuinely depends on the last result.
  • Multi-source workflows that span systems an ordinary RAG retriever can't reach.

Where to avoid it

  • Anything a single retrieval call could answer — you'll pay 5–10× the latency and tokens for a worse answer.
  • Untrusted or high-stakes environments without sandboxing — agents that can call tools can also do damage.
  • Production paths where determinism matters — agent traces are non-trivial to reproduce.

Pros / cons

  • ✅ The most general framing. Almost everything else in this article can be a tool inside an Agentic RAG.
  • ❌ Hardest to debug, evaluate, and bound. Loops, runaway costs, and "agent does something stupid" are all real failure modes.

14. Multi-Modal RAG

Definition

RAG over non-text modalities — images, scanned PDFs, tables, charts, audio, video, code — alongside or instead of text. Retrieval, generation, or both are multi-modal.

Important: "Multi-Modal" (multiple data types) and "Multi-Model" (multiple AI models) are different things. This section is about Multi-Modal. The next section covers Multi-Model.

Patterns

Pattern A — Caption-and-embed (cheap)
─────────────────────────────────────
Image ─▶ VLM caption ─▶ text embedding ─▶ vector store
(retrieval is text-against-text; image retrieved by its caption)

Pattern B — Native multi-modal embedding
────────────────────────────────────────
Image ─┐
       ├─▶ multi-modal embedder (CLIP, SigLIP, JinaCLIP) ─▶ vector store
Text  ─┘
(text query embeds into the same space as image)

Pattern C — ColPali / vision-page retrieval
───────────────────────────────────────────
PDF page ─▶ vision encoder per page ─▶ patch-level embeddings
Query    ─▶ text encoder ─▶ scored against patches (late interaction)
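The "late interaction" scoring in Pattern C is the MaxSim operation: for each query-token vector take its best match among the page's patch vectors, then sum those maxima. A sketch with hand-made 2-dimensional vectors (real encoders emit far larger ones, and the numbers here are invented for illustration):

```python
def maxsim_score(query_vecs: list[list[float]], patch_vecs: list[list[float]]) -> float:
    """Sum over query tokens of the highest dot product against any patch."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]            # two query-token embeddings
pages = {
    "page_a": [[0.9, 0.1], [0.2, 0.8]],     # each token has a strong matching patch
    "page_b": [[0.5, 0.5], [0.6, 0.1]],     # only lukewarm matches
}
ranking = sorted(pages, key=lambda name: maxsim_score(query, pages[name]), reverse=True)
```

Because every query token is matched independently, a page scores well when each part of the query finds support somewhere on the page, which a single pooled page vector cannot express.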

Components

  • A vision-language model (Claude 3.5 Sonnet / GPT-4o / Gemini / LLaVA / Qwen-VL).
  • A multi-modal embedder (CLIP family, SigLIP, JinaCLIP, ColPali, Cohere Embed v3 multimodal).
  • An object/table/chart-aware extractor (Unstructured.io, AWS Textract, Azure Document Intelligence, LayoutLMv3).
  • Storage that doesn't lose the original asset — vector for retrieval, blob (S3) for the source image.

Where it fits

  • Document-heavy domains: invoices, contracts, scientific papers with figures, slide decks.
  • Product search with image queries.
  • Video understanding (frame sampling + ASR transcript indexing).
  • OCR-free page retrieval: ColPali (Faysse et al., 2024) is the headline result, retrieving PDF pages directly from their rendered images.

Where to avoid it

  • Plain-text corpora — multi-modal embeddings are weaker than text-only ones for pure text retrieval.
  • Cost-sensitive paths — VLM inference is dramatically more expensive than text LLM inference.
  • Anything where layout doesn't matter — caption-and-embed is enough; don't reach for ColPali.

Pros / cons

  • ✅ Unlocks corpora that were previously invisible to RAG (scanned docs, diagrams, screenshots).
  • ❌ Tooling is younger; quality varies wildly across modalities. Eval is harder — text-RAG metrics don't transfer.

15. Multi-Model RAG

Definition

A single RAG pipeline that uses several specialised models for different jobs: a small fast embedder, a strong reranker, a cheap generator for easy turns, an expensive reasoner for hard turns, optionally a separate judge. A router picks per-request.

Components

Query
  │
  ▼
Classifier / router ── easy ──▶ small fast LLM ─▶ Answer
  │
 hard
  │
  ▼
Embedder (e.g. bge-small)
  │
  ▼
Vector + BM25 retrieval
  │
  ▼
Reranker (e.g. bge-reranker-v2 / Cohere Rerank)
  │
  ▼
Strong LLM (Claude Opus / GPT-4o / Gemini Pro)
  │
  ▼
Judge LLM (small) ── score / fallback
  │
  ▼
Answer
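One possible wiring of the router and judge, as a sketch: easy queries try the cheap model first, and the judge's score decides whether to keep that draft or escalate to the strong model. This is a variation on the diagram (the judge gates the cheap path), and every callable here is a stub; in practice `is_hard` and `judge` would themselves be models.

```python
from typing import Callable

def answer_with_routing(
    query: str,
    is_hard: Callable[[str], bool],
    cheap: Callable[[str], str],
    strong: Callable[[str], str],
    judge: Callable[[str, str], float],
    min_score: float = 0.7,
) -> tuple[str, str]:
    """Return (model_used, answer). Escalate when the cheap draft scores low."""
    if not is_hard(query):
        draft = cheap(query)
        if judge(query, draft) >= min_score:
            return "cheap", draft
    return "strong", strong(query)

# Stubs (assumptions for illustration).
def is_hard(q: str) -> bool:
    return len(q.split()) > 8

def cheap(q: str) -> str:
    return "" if "edge case" in q else f"cheap answer to: {q}"

def strong(q: str) -> str:
    return f"strong answer to: {q}"

def judge(q: str, a: str) -> float:
    return 1.0 if a else 0.0

easy = answer_with_routing("What is 2+2?", is_hard, cheap, strong, judge)
hard = answer_with_routing(
    "Compare the indemnification clauses across all three vendor contracts please",
    is_hard, cheap, strong, judge)
escalated = answer_with_routing("edge case", is_hard, cheap, strong, judge)
```

Logging which branch each request took is what makes the router's error rate measurable later.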

Where it fits

  • Anywhere cost matters at scale — running every query through your most expensive model is wasteful.
  • Latency-tiered UX: instant for trivial queries, slow-but-better for hard ones.
  • Ensembles: have two LLMs answer, a judge picks. Used in evaluation pipelines.

Where to avoid it

  • Small traffic volumes — the routing complexity is overhead.
  • Teams without an evaluation harness to measure when the cheap model is "good enough" — without it, the router is a guess.

Pros / cons

  • ✅ Often the single biggest cost lever in a production RAG stack.
  • ❌ More moving parts to monitor. The router itself is a model that can be wrong.

16. Federated RAG

Definition

Retrieval is distributed across multiple independent data sources that you do not (or cannot) centralise into one index. A federation layer dispatches the query, gathers results, normalises and re-ranks them.

Components

                              Query
                                │
                                ▼
                     ┌─────────────────────┐
                     │  Query planner      │  ← decides which sources to hit
                     └──────────┬──────────┘
                                │
     ┌──────────┬──────────┬────┴─────┬──────────┬──────────┐
     ▼          ▼          ▼          ▼          ▼          ▼
 Postgres     Slack     GitHub   Confluence   Vector    Internal
   / ERP                                        DB        APIs
     │          │          │          │          │          │
     └──────────┴──────────┴────┬─────┴──────────┴──────────┘
                                ▼
                        Result normaliser
                                │
                                ▼
                         Global reranker
                                │
                                ▼
                               LLM

Where it fits

  • Enterprises with strict data residency / compliance requirements where copying data into a central vector DB is not allowed (healthcare, finance, government).
  • Real-time data sources where staleness is unacceptable (live tickets, current inventory).
  • Federations across independent organisations (research consortia, supply chains).

Where to avoid it

  • Greenfield projects where you can just centralise — federation is significantly more work.
  • Latency-sensitive UX — you're as slow as your slowest source unless you build careful timeouts.
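A sketch of the "careful timeouts" idea using only the standard library: query every source in parallel, wait for a fixed budget, and answer from whatever came back. The source names and connector functions are stubs. A timed-out call keeps running in its worker thread because threads cannot be killed, so real connectors should also enforce their own network timeouts.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable

def federated_search(query: str, sources: dict[str, Callable[[str], list[str]]],
                     timeout_s: float = 0.5) -> dict[str, list[str]]:
    """Fan out to all sources; keep results that arrive within the budget."""
    pool = ThreadPoolExecutor(max_workers=max(len(sources), 1))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    done, _late = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False, cancel_futures=True)   # do not block on stragglers
    results: dict[str, list[str]] = {}
    for fut in done:
        if fut.exception() is None:                  # skip sources that errored
            results[futures[fut]] = fut.result()
    return results

def wiki(q: str) -> list[str]:
    return [f"wiki hit for {q!r}"]

def slow_erp(q: str) -> list[str]:
    time.sleep(1.0)                                  # simulates a slow backend
    return ["erp row"]

def broken_chat(q: str) -> list[str]:
    raise RuntimeError("auth expired")

results = federated_search(
    "refund policy",
    {"wiki": wiki, "erp": slow_erp, "chat": broken_chat},
    timeout_s=0.2,
)
```

Dropping slow or failing sources silently is a design choice; many systems instead surface "source X did not respond" to the user so a partial answer is not mistaken for a complete one.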

Pros / cons

  • ✅ Honours existing access control. No big copy-job. Always-fresh.
  • ❌ Auth, rate limits, and per-source ranking are all ongoing problems. Cross-source ranking has no canonical solution.

17. Streaming RAG

Definition

Two distinct meanings, both real, often confused:

  1. Streaming generation — token-by-token streaming of the LLM's output to the user (this is table stakes today, not really "a kind of RAG").
  2. Streaming retrieval / streaming indexing — the index is updated continuously from an event stream (Kafka, Kinesis, Redis Streams) and queries can be answered against the latest state, not a nightly snapshot. The retriever may also stream results progressively to the generator.

This section is about (2).

Components

Event sources (Kafka / Kinesis / CDC)
        │
        ▼
   Stream processor (Flink / Spark / custom)
        │
        ├──▶ Embed new docs
        ├──▶ Update vector index incrementally
        ├──▶ Update BM25 / tantivy index
        └──▶ Update graph / KB
                │
                ▼
        Live index
                │
Query ─────────▶ Retriever ─▶ Streaming generator ─▶ Tokens
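The essential property is that each event mutates the index in place, so the next query sees it. A toy sketch with an in-memory inverted index standing in for the vector/BM25 stores; the event shape (`op`, `id`, `text`) is an assumption, and a real consumer would read these events from Kafka or a CDC feed.

```python
import re
from collections import defaultdict

class LiveIndex:
    """In-memory keyword index updated one event at a time."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}
        self.postings: dict[str, set[str]] = defaultdict(set)

    @staticmethod
    def _terms(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def apply(self, event: dict) -> None:
        doc_id = event["id"]
        # Remove the previous version first so stale terms stop matching.
        for term in self._terms(self.docs.pop(doc_id, "")):
            self.postings[term].discard(doc_id)
        if event["op"] == "upsert":
            self.docs[doc_id] = event["text"]
            for term in self._terms(event["text"]):
                self.postings[term].add(doc_id)

    def search(self, query: str) -> list[str]:
        terms = self._terms(query)
        if not terms:
            return []
        return sorted(set.union(*(self.postings.get(t, set()) for t in terms)))

index = LiveIndex()
index.apply({"op": "upsert", "id": "t1", "text": "Checkout latency spike in eu-west"})
index.apply({"op": "upsert", "id": "t2", "text": "Disk full on db-3"})
before = index.search("latency")
index.apply({"op": "upsert", "id": "t1", "text": "Checkout recovered"})
index.apply({"op": "delete", "id": "t2"})
after = index.search("latency")
```

Updates and deletes are the hard part at scale: an append-only stream is easy to index, while in-place changes are what trigger the rebuild and merge costs mentioned below.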

Where it fits

  • Live monitoring and observability ("what is happening in production right now").
  • News / social feeds where freshness is the product.
  • Markets, fraud detection, IoT telemetry.

Where to avoid it

  • Stable corpora that change daily or weekly — batch reindexing is simpler and cheaper.
  • Teams without streaming infrastructure already in place — building Kafka just for this is a tax.

Pros / cons

  • ✅ Sub-minute freshness. Indispensable for "ops copilot" use cases.
  • ❌ Streaming pipelines are operationally expensive. Index consistency under high write rates is its own problem (HNSW rebuilds, BM25 segment merges).

18. ODQA RAG

Definition

Open-Domain Question Answering is older than RAG itself (DrQA, 2017; DPR, 2020). The original setup: retrieve from a huge, broad corpus (Wikipedia, the web), then have a reader model extract or generate the answer. Modern ODQA is RAG with the LLM as the reader.

Components

Stage        Classical (pre-LLM)     Modern
─────────    ───────────────────     ─────────────────────────────
Retriever    BM25, DPR               Hybrid + reranker
Reader       BERT span extractor     LLM with citations
Source       Wikipedia               Wikipedia + web search + APIs

Flow

Open question ("Why did the Roman Empire collapse?")
        │
        ▼
   Hybrid retrieval over a *broad* corpus
        │
        ▼
   Multi-hop reasoning (often Recursive RAG)
        │
        ▼
   LLM generates with citations
        │
        ▼
       Answer with sources
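
The "LLM generates with citations" step usually starts with a prompt that numbers the retrieved passages so the model has something to cite. A minimal sketch follows; the passage shape (`url`, `text`), the function name, and the instruction wording are all assumptions, and the example URLs are placeholders.

```python
def build_cited_prompt(question, passages):
    """Number each passage so the model can cite it as [n]."""
    sources = "\n".join(
        f"[{i}] ({p['url']}) {p['text']}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite every claim with its source number in square brackets.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_cited_prompt(
    "Why did the Roman Empire collapse?",
    [
        {"url": "https://example.org/a", "text": "Placeholder passage about fiscal strain."},
        {"url": "https://example.org/b", "text": "Placeholder passage about military pressure."},
    ],
)
```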

Where it fits

  • Consumer-facing answer engines (Perplexity, You.com, Google's AI Overviews).
  • Any product whose value is "ask anything, get a grounded answer with sources".

Where to avoid it

  • Internal / enterprise use cases: your data is not in the broad corpus, and the broad corpus is noise relative to your data. Use Domain-Specific RAG.
  • Anything that needs a guaranteed, auditable answer from a defined source set — open-domain retrieval is the opposite of bounded.

Pros / cons

  • ✅ Maximally general. Pairs perfectly with web search tools.
  • ❌ Quality is bounded by the worst page on the internet. Citation accuracy is famously imperfect — verify, don't trust.
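
"Verify, don't trust" can begin with a cheap structural check before any semantic one. The sketch below assumes answers cite sources as `[n]` markers placed before the sentence's closing punctuation, and that `sources` is the list of passages actually retrieved. The function name is hypothetical. It catches citations that point at nothing and sentences with no citation at all; it does not check whether the cited passage supports the claim.

```python
import re

def check_citations(answer, sources):
    """Return (unknown, uncited): citation numbers that match no retrieved
    source, and sentences that carry no citation at all."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    unknown, uncited = set(), []
    for sentence in sentences:
        cited = {int(n) for n in re.findall(r"\[(\d+)\]", sentence)}
        if not cited:
            uncited.append(sentence)
        # Valid citation numbers are 1..len(sources).
        unknown |= {n for n in cited if not 1 <= n <= len(sources)}
    return sorted(unknown), uncited

unknown, uncited = check_citations(
    "The Western Empire fell in 476 [1]. Fiscal strain played a role [3]. It was inevitable.",
    ["passage one", "passage two"],
)
```

Here `unknown` flags `[3]` because only two sources were retrieved, and `uncited` flags the final sentence.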

19. Domain-Specific RAG

Definition

The opposite of ODQA: retrieval and generation tuned for one specific domain (legal, medical, code, finance, a single company's knowledge). Tuning happens at every layer — chunking, embeddings, reranker, prompts, evaluation, sometimes the LLM itself.

Customisation surface

┌──────────────────────────────────────────────────────────────┐
│  Generic RAG          Domain-Specific RAG                    │
│  ──────────           ──────────────────                     │
│                                                              │
│  Generic chunker      Domain-aware chunker                   │
│                       (clause-level for legal,               │
│                        function-level for code,              │
│                        line-item-level for invoices)         │
│                                                              │
│  Generic embeddings   Domain-tuned embeddings                │
│                       (BGE-M3 fine-tuned, Voyage-law-2,      │
│                        Voyage-code-3, BiomedBERT)            │
│                                                              │
│  Generic prompts      Domain prompts with terminology,       │
│                       formatting rules, refusal policies     │
│                                                              │
│  Generic eval         Domain eval set with                   │
│                       expert-graded ground truth             │
│                                                              │
│  Generic LLM          Domain-finetuned LLM (optional)        │
└──────────────────────────────────────────────────────────────┘
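
As one concrete instance of the "domain-aware chunker" row, function-level chunking for Python code can be sketched with the standard library's `ast` module. The function name and the chunk dictionary shape are assumptions for illustration; other languages would need their own parser.

```python
import ast

def chunk_python_by_function(source):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                # Exact source text of the definition, from `def`/`class` to its last line.
                "text": ast.get_source_segment(source, node),
            })
    return chunks

sample = (
    "import os\n"
    "\n"
    "def load(path):\n"
    "    return open(path).read()\n"
    "\n"
    "class Parser:\n"
    "    def parse(self, text):\n"
    "        return text.split()\n"
)
chunks = chunk_python_by_function(sample)
```

Each chunk is a complete definition, so a retrieved hit never starts halfway through a function the way a fixed-size window can.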

Where it fits

  • Anywhere accuracy beats breadth — legal contract review, medical decision support, code copilots, financial filings analysis.
  • Internal company assistants where the corpus and terminology are stable.

Where to avoid it

  • Cross-domain consumer products — you'll hard-code yourself out of half your users.
  • Domains too small to justify fine-tuning: a generic model with good prompts often lands within roughly 5% of a tuned one, so measure the gap before investing.

Pros / cons

  • ✅ Highest measurable quality on the domain it's tuned for. The default for serious enterprise RAG.
  • ❌ Investment. Eval set, fine-tuning, ongoing curation. Cross-domain transfer is poor.
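
The eval set listed in the investment above is what makes "highest measurable quality" checkable. A minimal sketch of a retrieval metric over such a set follows, assuming each entry pairs a query with the ids of the documents an expert marked relevant. `recall_at_k` is a standard metric; `evaluate` and the toy retriever are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)

def evaluate(retriever, eval_set, k=5):
    """Mean recall@k of `retriever` over (query, relevant_ids) pairs."""
    scores = [recall_at_k(retriever(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores)

# Toy retriever and eval set, for illustration only.
canned = {"q1": ["a", "x", "y"], "q2": ["z", "c"]}
eval_set = [("q1", ["a", "b"]), ("q2", ["c"])]
score = evaluate(lambda q: canned[q], eval_set, k=2)
```

Running the same eval set against a generic and a domain-tuned retriever turns "is the tuning worth it" into a number.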

20. Side-by-Side Comparison

┌────────────────────────┬────────┬─────────┬──────────┬────────────┐
│ Variant                │ Index  │ Latency │ $ / query│  Hardness  │
├────────────────────────┼────────┼─────────┼──────────┼────────────┤
│ Standard RAG           │ Cheap  │ Low     │ Low      │ Easy       │
│ Hybrid RAG             │ Med    │ Low+    │ Low      │ Easy       │
│ HyDE RAG               │ Cheap  │ Med     │ Med      │ Easy       │
│ Contextual Retrieval   │ HIGH   │ Low     │ Low      │ Medium     │
│ Recursive / Multi-step │ Cheap  │ HIGH    │ HIGH     │ Medium     │
│ Self-RAG               │ Cheap  │ Med     │ Med      │ Medium-Hard│
│ Modular RAG            │ Varies │ Varies  │ Varies   │ Medium     │
│ Memory-Augmented       │ Med    │ Med     │ Med      │ Medium-Hard│
│ GraphRAG               │ HIGH   │ Med     │ Med      │ Hard       │
│ Knowledge-Enhanced RAG │ Med    │ Low-Med │ Low-Med  │ Hard       │
│ Agentic RAG            │ Cheap  │ HIGH    │ HIGH     │ Hard       │
│ Multi-Modal RAG        │ Med    │ Med-Hi  │ Med-Hi   │ Medium-Hard│
│ Multi-Model RAG        │ Cheap  │ Low     │ LOW      │ Medium     │
│ Federated RAG          │ N/A    │ Med-Hi  │ Med      │ Hard       │
│ Streaming RAG          │ HIGH   │ Low     │ Med      │ Hard       │
│ ODQA RAG               │ HUGE   │ Med-Hi  │ Med-Hi   │ Medium     │
│ Domain-Specific RAG    │ Med    │ Low     │ Low      │ Hard       │
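└────────────────────────┴────────┴─────────┴──────────┴────────────┘

Index is the cost of building and maintaining the index; Latency and $ / query are per request; the last column is implementation difficulty.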

What each variant primarily fixes

Variant                       Primary failure mode it fixes
─────────────────────────     ─────────────────────────────
Hybrid                        Bad retrieval on identifiers / codes
HyDE                          Bad retrieval on short / vague queries
Contextual Retrieval          Bad retrieval on chunks-without-context
Recursive                     Multi-hop questions
Self-RAG                      Hallucinations / blind retrieval
Modular                       Inflexibility / lock-in
Memory-Augmented              No cross-session continuity
GraphRAG                      No relational / global reasoning
Knowledge-Enhanced            Authored graph / KB ignored by text-only retrieval
Agentic                       No actions, only answers
Multi-Modal                   Non-text data is invisible
Multi-Model                   Cost / latency at scale
Federated                     Can't centralise the data
Streaming                     Stale index
ODQA                          Bounded corpus
Domain-Specific               Generic-quality answers in a specialist domain

21. Decision Guide: Which RAG Should You Pick?

START
  │
  ▼
Is this a prototype to prove RAG is worth doing at all?
  │
 Yes ─────▶ Standard RAG. Build it in an afternoon. Measure.
  │
  No
  │
  ▼
Are queries failing on exact codes / names / IDs?
  │
 Yes ─────▶ Add Hybrid (BM25 + vector + RRF + reranker)
  │
  ▼
Are queries short / vague?
  │
 Yes ─────▶ Add HyDE
  │
  ▼
Are chunks meaningless without document-level context?
  │
 Yes ─────▶ Add Contextual Retrieval (Anthropic recipe)
  │
  ▼
Do users ask multi-hop questions?
  │
 Yes ─────▶ Add Recursive RAG (cap hops, add Self-RAG critique)
  │
  ▼
Are hallucinations expensive (legal / medical / finance)?
  │
 Yes ─────▶ Add Self-RAG style critique + grounding checks
  │
  ▼
Is the answer often "across all the documents"?
  │
 Yes ─────▶ Use GraphRAG (extracted graph) or
            Knowledge-Enhanced RAG (authored graph)
  │
  ▼
Do users come back across sessions and expect continuity?
  │
 Yes ─────▶ Add Memory-Augmented RAG (mem0 / Letta / Zep)
  │
  ▼
Does the system need to take actions, not just answer?
  │
 Yes ─────▶ Wrap everything above in Agentic RAG
            (LangGraph / CrewAI / OpenAI Agents)
  │
  ▼
Do you have images / scans / video / charts in scope?
  │
 Yes ─────▶ Add Multi-Modal RAG (ColPali for PDFs,
            CLIP family for images, ASR + chunk for audio)
  │
  ▼
Are you running at a scale where running every query through
your strongest model is wasteful?
  │
 Yes ─────▶ Add Multi-Model RAG with a router + judge
  │
  ▼
Does data live in many systems you cannot centralise?
  │
 Yes ─────▶ Federated RAG
  │
  ▼
Does the answer need to reflect events from the last few minutes?
  │
 Yes ─────▶ Streaming RAG (Kafka / CDC → live index)
  │
  ▼
Is your product "ask anything, get a sourced answer"?
  │
 Yes ─────▶ ODQA RAG (web search + reranker + citations)
  │
  ▼
Is your product specialist (legal / medical / code / your company)?
  │
 Yes ─────▶ Domain-Specific RAG
            (domain embeddings, domain eval set,
             optionally domain-finetuned LLM)
  │
  ▼
Always: wire all of the above as a Modular RAG so you can
        replace any block as the field evolves.
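
The flowchart above is an ordered checklist rather than a branching tree: every "Yes" adds a variant, and the walk continues. That shape can be sketched as a rule table. The flag names are hypothetical labels for the questions in the chart, not terms from any library.

```python
# (symptom flag, variant to add), in the same order as the flowchart.
RULES = [
    ("exact_id_failures", "Hybrid"),
    ("short_vague_queries", "HyDE"),
    ("chunks_need_context", "Contextual Retrieval"),
    ("multi_hop", "Recursive"),
    ("costly_hallucinations", "Self-RAG"),
    ("global_questions", "GraphRAG or Knowledge-Enhanced"),
    ("cross_session_users", "Memory-Augmented"),
    ("needs_actions", "Agentic"),
    ("non_text_data", "Multi-Modal"),
    ("scale_cost_pressure", "Multi-Model"),
    ("cannot_centralise", "Federated"),
    ("needs_minute_freshness", "Streaming"),
    ("ask_anything_product", "ODQA"),
    ("specialist_domain", "Domain-Specific"),
]

def recommend(symptoms, prototype=False):
    """Return the variants to combine, given the set of observed symptoms."""
    if prototype:
        return ["Standard"]  # prove RAG is worth doing before adding anything
    picks = [variant for flag, variant in RULES if flag in symptoms]
    return picks + ["Modular"]  # always wire the result as swappable blocks

stack = recommend({"multi_hop", "exact_id_failures"})
```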

A realistic stack for a serious 2026 production system

Most production RAG you will ship looks something like:

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  Modular pipeline (LangGraph / LlamaIndex)               │
│      │                                                   │
│      ├─ Domain-Specific corpus + eval set                │
│      ├─ Contextual Retrieval at index time               │
│      ├─ Hybrid (BM25 + dense) + reranker                 │
│      ├─ HyDE on short queries                            │
│      ├─ Recursive hops capped at 3                       │
│      ├─ Self-RAG critique before final answer            │
│      ├─ Memory-Augmented for repeat users                │
│      ├─ Multi-Model router (cheap → strong → judge)      │
│      └─ Optionally Agentic wrapper for tool calls        │
│                                                          │
└──────────────────────────────────────────────────────────┘

That is nine or ten of the "16 types of RAG" composed into one system. The point of knowing all sixteen is to know which knob to turn when a specific failure mode shows up in your eval set, not to pick one and call it your architecture.
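
To make "composed into one system" concrete, here is a minimal sketch of a modular pipeline in plain Python, without LangGraph or LlamaIndex. Every stage is a placeholder: the corpus, the keyword match standing in for hybrid retrieval, the string suffix standing in for HyDE, and the containment check standing in for a Self-RAG critique are all assumptions chosen so the example runs on its own.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    query: str
    passages: list = field(default_factory=list)
    answer: str = ""
    trace: list = field(default_factory=list)

def run_pipeline(stages, state):
    """Run each stage in order; any stage can be swapped without touching the rest."""
    for stage in stages:
        state = stage(state)
        state.trace.append(stage.__name__)
    return state

def hyde_if_short(state):
    # Placeholder: a real HyDE stage would ask an LLM for a hypothetical answer.
    if len(state.query.split()) < 4:
        state.query = f"{state.query} (expanded hypothetical answer)"
    return state

def hybrid_retrieve(state):
    # Placeholder corpus and keyword match; swap in BM25 + dense + reranker here.
    corpus = ["refund policy: 30 days", "shipping takes 5 days"]
    terms = set(state.query.lower().split())
    state.passages = [p for p in corpus if terms & set(p.replace(":", "").split())]
    return state

def generate(state):
    # Placeholder for the LLM call.
    state.answer = state.passages[0] if state.passages else "I don't know."
    return state

def critique(state):
    # Placeholder grounding check: refuse if the answer is not in the retrieved text.
    if state.answer not in state.passages:
        state.answer = "I don't know."
    return state

STAGES = [hyde_if_short, hybrid_retrieve, generate, critique]
result = run_pipeline(STAGES, State("refund policy"))
```

The `trace` list records which stages ran, which helps when an eval failure has to be pinned on one block.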


Related reading: From Plain LLM to Context Graph goes deeper on the Standard → RAG → GraphRAG → Context Graph progression with code examples.