Retrieval-augmented generation is the most common AI architecture shipped in 2026 and the most often misimplemented. A team picks a vector database, a default chunk size, and a default top-k, then blames the language model when answers come back vague or wrong. In our experience the model is rarely the bottleneck. The retrieval layer is.
Retrieval is the bottleneck.
A RAG system is a two-stage pipeline. Stage one retrieves documents. Stage two generates an answer over them. If stage one returns the wrong documents, no model can recover. The model can only hallucinate confidently over what it was handed. The failure is silent because the answer still sounds plausible. The cure is to treat retrieval as a first-class system with its own design and its own eval.
Chunking, shape over size.
The default mistake is to chunk by character count. A long document is sliced every 1,000 characters with a small overlap, and the chunks are embedded. The embedding model then sees half a paragraph that ends mid-sentence, the retriever pulls it, and the generator tries to reason from a fragment. Better practice in 2026 is shape-aware chunking:
- Document structure first. Split on headings, paragraphs, list items, and table rows. The chunks line up with the document's own semantic units.
- Contextual prefixes. Prepend the page title and section path to each chunk before embedding. A chunk from page 47 of a contract that says “the supplier shall indemnify” means nothing without “Section 12 · Indemnification.”
- Two-tier indexing. Embed both the chunk itself and a summary of the chunk. Match queries against summaries for breadth, against chunks for precision.
- Token-aware overlap. Overlap is a hedge against splitting mid-sentence; size it in tokens, not characters, and never set it below 10% of the chunk size.
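The practices above can be sketched in a few lines. This is a minimal illustration, not a production chunker: `count_tokens` uses whitespace splitting as a stand-in for the embedding model's real tokenizer, and the function name, the `title | section path` prefix format, and the default sizes are all assumptions made for the example. Two-tier indexing is left out because it needs a summarizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for the embedding model's tokenizer.
    return len(text.split())


def chunk_section(title: str, section_path: str, paragraphs: list[str],
                  max_tokens: int = 300, overlap_ratio: float = 0.1) -> list[str]:
    """Pack whole paragraphs into chunks, each prefixed with title and section path."""
    prefix = f"{title} | {section_path}\n"          # contextual prefix (hypothetical format)
    overlap_tokens = max(1, int(max_tokens * overlap_ratio))
    chunks: list[str] = []
    current: list[str] = []   # paragraphs collected for the chunk being built
    carry = ""                # overlap text carried over from the previous chunk
    current_tokens = 0

    def render() -> str:
        return "\n".join(([carry] if carry else []) + current)

    for para in paragraphs:
        n = count_tokens(para)
        # Close the chunk at a paragraph boundary, never mid-paragraph.
        if current and current_tokens + n > max_tokens:
            body = render()
            chunks.append(prefix + body)
            # Carry the last few tokens forward as overlap, sized in tokens.
            carry = " ".join(body.split()[-overlap_tokens:])
            current, current_tokens = [], count_tokens(carry)
        current.append(para)
        current_tokens += n
    if current:
        chunks.append(prefix + render())
    return chunks
```

In this sketch the prefix is not counted against the token budget, and a single paragraph longer than `max_tokens` becomes its own oversized chunk; a real chunker would fall back to sentence-level splitting for that case.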
Retrieval, hybrid by default.
Vector search is good at meaning. Keyword search is good at identifiers, codes, and rare names. A query that says “What does Article 7.3 say about late payment?” contains a semantic intent (late payment) and a literal identifier (Article 7.3). Pure vector search will surface documents that talk about late payment in general; pure keyword search will surface Article 7.3 in any context.
The pattern that wins in production is hybrid retrieval: run vector and keyword search in parallel (BM25 is still the default keyword scorer in 2026), fuse the two result lists with reciprocal rank fusion, then send the merged top-k to the next stage. In our experience, hybrid retrieval often lifts precision on real workflows by double-digit points over either component alone.
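Reciprocal rank fusion itself is small enough to show in full. The sketch below assumes each retriever has already returned an ordered list of document IDs; the function name and the example IDs are illustrative, and `k=60` is the smoothing constant commonly used with RRF rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60,
                           top_n: int = 10) -> list[str]:
    """Fuse several rankings: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


vector_hits = ["a", "b", "c"]    # hypothetical output of the vector retriever
keyword_hits = ["c", "a", "d"]   # hypothetical output of the BM25 retriever
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because RRF works on ranks rather than raw scores, the vector and keyword scores never need to be normalized onto a common scale, which is a large part of why it is a convenient default.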
Reranking, the cheap accuracy lift.
The retriever returns, say, the top forty candidates fast. A reranker then scores those forty against the query for actual relevance and returns the top eight. The reranker is a small cross-encoder model or a thin LLM call. It runs on a short candidate list, so it is fast and cheap, and the accuracy lift is typically large. Most RAG systems we audit do not have a reranker, and most of them should.
Where a reranker shines: queries with multiple constraints (recent, in language X, about topic Y), queries with subtle intent the embedding model missed, and any workflow where the cost of returning the wrong document is high.
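The second stage can be sketched as a function that takes any relevance scorer. In the example below, `overlap_score` is a deliberately crude token-overlap stand-in so the code runs without a model; in a real system that callable would wrap a cross-encoder or an LLM call. All names here are hypothetical.

```python
from typing import Callable


def rerank(query: str, candidates: list[tuple[str, str]],
           score_fn: Callable[[str, str], float], top_n: int = 8) -> list[str]:
    """Score each (doc_id, text) candidate against the query and keep the best top_n."""
    scored = [(score_fn(query, text), doc_id) for doc_id, text in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, stable on ties
    return [doc_id for _, doc_id in scored[:top_n]]


def overlap_score(query: str, text: str) -> float:
    # Placeholder scorer (assumption): fraction of query tokens found in the text.
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


candidates = [
    ("d1", "late payment penalties apply"),
    ("d2", "article 7.3 late payment interest"),
    ("d3", "supplier shall indemnify"),
]
best = rerank("article 7.3 late payment", candidates, overlap_score, top_n=2)
```

Keeping the scorer behind a plain callable means the reranker can be swapped or A/B tested without touching the retrieval code.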
Eval the retrieval, not just the generator.
Almost no team in 2026 evaluates the retrieval step independently. The eval rubric scores the final answer, the model gets credit or blame, and the retrieval layer is invisible in the scoreboard. The correction is a separate retrieval eval with its own fixtures and its own metrics:
- Recall at k. Did the top-k results include the relevant chunk at all? Anything below 100% recall means the generator was sometimes asked to answer without the source.
- Reranker precision. Of the top-3 returned by the reranker, how many are actually relevant?
- Chunk faithfulness. When the generator cites the chunk, does the chunk actually support the cited claim?
- Cost and latency per query. Hybrid retrieval and reranking add cost. Track and bound it.
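The first two metrics are simple to compute once a fixture set exists. The sketch below assumes a hypothetical fixture shape: a mapping from query ID to the ranked chunk IDs the system returned, and a mapping from query ID to the set of chunk IDs a human labeled as relevant.

```python
def recall_at_k(results: dict[str, list[str]], relevant: dict[str, set[str]],
                k: int) -> float:
    """Share of queries whose top-k results contain at least one relevant chunk."""
    if not results:
        return 0.0
    hits = sum(
        1 for qid, ranked in results.items()
        if set(ranked[:k]) & relevant.get(qid, set())
    )
    return hits / len(results)


def precision_at_n(ranked: list[str], relevant_ids: set[str], n: int = 3) -> float:
    """Share of the top-n reranked results that are labeled relevant."""
    if n <= 0:
        return 0.0
    # Divides by n, so returning fewer than n results is penalized.
    return sum(1 for doc_id in ranked[:n] if doc_id in relevant_ids) / n


results = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}   # illustrative fixtures
relevant = {"q1": {"b"}, "q2": {"w"}}
recall = recall_at_k(results, relevant, k=3)
```

Chunk faithfulness usually needs a human or model judge rather than set arithmetic, so it is not covered here.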
See The Morvion Eval Spec for the harness shape these eval dimensions slot into.
Common failures we see.
- Top-k too small. The team set k=5 because the tutorial said so. The workflow needs k=20 at retrieval, reranked down to 5. The generator is starved.
- Stale index. The embedding model was bumped but the index was not rebuilt. Old and new vectors live in different spaces, and retrieval silently degrades.
- Metadata blindness. The system retrieves on similarity alone, ignoring the customer tenant, the language, or the document date. Wrong-tenant documents leak. Wrong-language drafts surface.
- Single-stage retrieval. No reranker. The top results are similar to the query but not the best match for the actual question.
- No query rewriting. The user typed a sloppy sentence. The system embeds it directly. A cheap LLM rewrite would have lifted recall substantially.
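Two of these failures, the stale index and metadata blindness, can be guarded against with a filter that runs before similarity scoring. The sketch below is one possible shape, assuming each stored chunk carries the metadata fields shown; the `ChunkMeta` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ChunkMeta:
    chunk_id: str
    tenant_id: str
    language: str
    doc_date: date
    embedding_model: str  # name of the model that produced the stored vector


def eligible_chunks(chunks: list[ChunkMeta], tenant_id: str, language: str,
                    query_model: str, not_before: Optional[date] = None) -> list[str]:
    """Return IDs of chunks that are safe to score for this query."""
    kept = []
    for c in chunks:
        if c.embedding_model != query_model:
            continue  # stale vector: embedded in a different space than the query
        if c.tenant_id != tenant_id or c.language != language:
            continue  # wrong tenant or wrong language must never surface
        if not_before is not None and c.doc_date < not_before:
            continue  # older than the caller's date cutoff
        kept.append(c.chunk_id)
    return kept
```

Most vector stores can apply this kind of filter inside the query itself, which is preferable at scale; the point of the sketch is only that the filter is applied before ranking, not after.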
If the team can show you the generation eval but not the retrieval eval, the system has half a scoreboard. Most failures live in the half that is not measured.
Common questions.
Should every project use a vector database? Not always. A small corpus (under a few thousand records) can live in memory or in Postgres with pgvector. The vector database overhead is justified once scale or write velocity demands it.
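To make the in-memory option concrete: for a corpus of a few thousand vectors, a brute-force cosine scan is often enough. The sketch below uses only the standard library and assumes the index is a plain dict from document ID to embedding; the names and the toy two-dimensional vectors are illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec: list[float], index: dict[str, list[float]],
           top_k: int = 5) -> list[str]:
    """Brute-force nearest neighbors: score every vector, return the top_k IDs."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]),
                    reverse=True)
    return ranked[:top_k]


index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}  # toy embeddings
nearest = search([1.0, 0.1], index, top_k=2)
```

The scan is linear in corpus size, so it stops being attractive as the corpus grows; that is the point at which an approximate index or a dedicated vector store starts to earn its overhead.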
What about long-context models? Can we skip RAG? Sometimes. If the entire relevant corpus fits in the context window and the cost is acceptable, paste it in. For most production corpora, retrieval is still cheaper, faster, and more auditable than long-context-everything.
How big should chunks be? Workflow-dependent. Code and structured documents favor smaller chunks (200-400 tokens). Long-form prose favors larger chunks (600-1000 tokens). The eval scoreboard decides.
For one-paragraph definitions of the underlying terms, see the glossary entries on retrieval-augmented generation, vector search, semantic search, and embedding model.



