RAG in Production: Retrieval, Chunking, and Eval That Actually Hold Up

Published May 28, 2026 | alphabench Engineering

Retrieval-augmented generation is easy to prototype and hard to get right. A weekend project that pipes documents into a vector store and queries it will produce plausible answers. Putting that in front of users who depend on the answers being correct is a different discipline entirely.

This is what we've learned building RAG systems that hold up - where the failure mode isn't "it sometimes gives a vague answer" but "a person made a decision based on what it said."

The Naive Pipeline and Why It Fails

The starter-kit RAG pipeline goes like this: split documents into fixed-size chunks, embed them, and store them; at query time, embed the question, retrieve the top-k nearest chunks, and stuff them into the prompt. It works in a demo because demos use easy questions over clean documents.

It fails in production for predictable reasons. Fixed-size chunking splits ideas in half. Pure vector similarity misses exact-match terms like product codes and names. Top-k retrieval returns near-duplicates that crowd out the one chunk that actually mattered. And nobody can tell whether retrieval is the problem or generation is, because nothing is measured.
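
To make the first failure concrete, here is a toy sketch of the naive pipeline. It is an illustration only: the `embed` function is a bag-of-words stand-in for a real embedding model, and every name in it is hypothetical.

```python
import math
from collections import Counter


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into windows of `size` words, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, chunks: list[str], k: int) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


doc = "Refunds are available within 30 days. After 30 days only store credit is offered."
chunks = fixed_size_chunks(doc, size=8)
best = top_k("store credit", chunks, k=1)
```

With an 8-word window, the second sentence is cut in half: the chunk retrieved for "store credit" no longer contains the "After 30" condition that gives it meaning.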

In our experience, most "the LLM hallucinated" complaints are actually retrieval failures. The model answered faithfully from the wrong context.

Chunking Is a Modeling Decision

Chunking is where most RAG quality is won or lost, and fixed-size splitting is rarely the right answer.

We chunk along the document's actual structure - sections, headings, list items - so a chunk is a coherent unit of meaning rather than an arbitrary 500-token window. We attach metadata to every chunk: source, section title, date, and document type, so retrieval can filter as well as rank. And we keep a small overlap between chunks so an idea that spans a boundary isn't lost.
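
A minimal sketch of structure-aware chunking with metadata and overlap follows. It assumes the source document marks headings with lines starting with "# " and separates paragraphs with blank lines; the `Chunk` class, the function name, and the `max_paragraphs` and `overlap` parameters are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_structure(doc: str, source: str, doc_type: str, date: str,
                       max_paragraphs: int = 2, overlap: int = 1) -> list[Chunk]:
    """Split on '# ' heading lines, then window paragraphs within each section."""
    if not 0 <= overlap < max_paragraphs:
        raise ValueError("overlap must be >= 0 and smaller than max_paragraphs")

    # Pass 1: group lines under the most recent heading.
    sections: list[tuple[str, list[str]]] = []
    title, lines = "(untitled)", []
    for line in doc.splitlines():
        if line.startswith("# "):
            if lines:
                sections.append((title, lines))
            title, lines = line[2:].strip(), []
        else:
            lines.append(line)
    if lines:
        sections.append((title, lines))

    # Pass 2: window paragraphs inside each section, never across sections.
    chunks: list[Chunk] = []
    step = max_paragraphs - overlap
    for title, lines in sections:
        paragraphs = [p.strip() for p in "\n".join(lines).split("\n\n") if p.strip()]
        for start in range(0, len(paragraphs), step):
            window = paragraphs[start:start + max_paragraphs]
            chunks.append(Chunk(
                text="\n\n".join(window),
                metadata={"source": source, "section": title,
                          "date": date, "doc_type": doc_type},
            ))
            if start + max_paragraphs >= len(paragraphs):
                break  # the last window already reached the end of the section
    return chunks


doc = ("# Returns\nItems can be returned within 30 days.\n\n"
       "Refunds go to the original payment method.\n\n"
       "Opened software is excluded.\n"
       "# Shipping\nOrders ship within two business days.")
chunks = chunk_by_structure(doc, source="policy.md", doc_type="policy", date="2026-05-28")
```

Because windows never cross a heading, the overlap only bridges paragraphs inside one section, and the metadata lets a retriever filter by section, date, or document type before ranking.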

For documents where context matters - a clause that only makes sense within its parent section - we store the chunk for retrieval but expand to the surrounding context when building the prompt. The thing you search over and the thing you send to the model don't have to be identical.
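
One way to sketch this "search small, send large" idea: index the small chunks, keep a mapping from each chunk to its parent section, and swap in the parent text when assembling the prompt. The identifiers and dictionary layout below are assumptions made for illustration.

```python
def expand_to_parents(hit_ids: list[str],
                      chunk_parent: dict[str, str],
                      parents: dict[str, str]) -> list[str]:
    """Map retrieved chunk ids to parent section texts, keeping rank order
    and sending each parent only once."""
    seen: set[str] = set()
    context: list[str] = []
    for chunk_id in hit_ids:
        parent_id = chunk_parent[chunk_id]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context


# Hypothetical index: two clauses share a parent section.
chunk_parent = {"clause-4a": "sec-4", "clause-4b": "sec-4", "clause-7a": "sec-7"}
parents = {"sec-4": "Section 4: Termination ...", "sec-7": "Section 7: Liability ..."}

context = expand_to_parents(["clause-4b", "clause-7a", "clause-4a"], chunk_parent, parents)
```

Deduplicating on the parent id matters here: two hits from the same section would otherwise send the same surrounding text to the model twice.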

Retrieval: Hybrid, Then Rerank

Vector search alone is a mistake for most real corpora. Dense embeddings are great at semantic similarity and bad at exact terms - the part numbers, names, and acronyms that users actually search for.

We run hybrid search: a dense vector query for semantic match plus a sparse keyword query (BM25) for exact terms, combined into one ranked list. This single change fixes a large share of "it couldn't find the obvious answer" failures.
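
The text above does not say how the two result lists are combined. One common choice is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method used in this article. It needs only the rank positions from each retriever, so dense and BM25 scores never have to be put on the same scale. The chunk ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids: each list contributes 1 / (k + rank)
    to an id's score. k=60 is a conventional default, not a tuned value."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["c1", "c2", "c3"]    # from the vector query
sparse_hits = ["c4", "c2", "c1"]   # from the keyword (BM25) query
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Chunks that appear in both lists rise to the top, while a chunk found only by the keyword query (here `c4`, for instance an exact part-number match) still makes it into the candidate set.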

Then we rerank. The first retrieval pass optimizes for recall - cast a wide net. A cross-encoder reranker then reorders that candidate set for precision, scoring each chunk against the actual question. Retrieve broadly, rerank tightly, send only the top few to the model.
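
The second stage can be sketched as a function that takes any pairwise scorer. In a real system the scorer would be a cross-encoder model; the `overlap_score` function below is only a toy stand-in so the example runs without a model, and all names are hypothetical.

```python
from typing import Callable


def rerank(question: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Score every (question, chunk) pair and keep the top_n chunks."""
    ranked = sorted(candidates, key=lambda c: score_pair(question, c), reverse=True)
    return ranked[:top_n]


def overlap_score(question: str, chunk: str) -> float:
    """Toy scorer: fraction of question terms that appear in the chunk.
    A stand-in for a cross-encoder, NOT a real relevance model."""
    q_terms = set(question.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


candidates = [
    "How to reset a user password",
    "Admin console overview",
    "Steps to reset the admin password",
]
top = rerank("reset admin password", candidates, overlap_score, top_n=2)
```

Keeping the scorer pluggable means the wide first pass and the precise second pass can be tuned or replaced independently.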

Evaluation Is the Whole Game

Without evaluation, every change to a RAG system is a guess. We build the eval set before tuning anything.

We separate retrieval metrics from generation metrics, because they fail independently. For retrieval, we measure whether the right chunk made it into the context at all - if it didn't, no amount of prompt tuning will save the answer. For generation, we measure faithfulness (did the answer stick to the retrieved context?) and correctness against a labeled set of question-answer pairs drawn from real usage.
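
The retrieval-side check described above is often reported as hit rate at k: the fraction of eval questions for which at least one labeled relevant chunk appears in the top k results. A minimal sketch, assuming a hypothetical `EvalCase` record with hand-labeled relevant chunk ids:

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    relevant_chunk_ids: set[str]      # labeled by hand
    retrieved_chunk_ids: list[str]    # ranked output of the retriever


def hit_rate_at_k(cases: list[EvalCase], k: int) -> float:
    """Fraction of cases where a relevant chunk appears in the top k."""
    if not cases:
        return 0.0
    hits = sum(
        1 for case in cases
        if case.relevant_chunk_ids & set(case.retrieved_chunk_ids[:k])
    )
    return hits / len(cases)


cases = [
    EvalCase("q1", {"c1"}, ["c1", "c9", "c4"]),   # hit at rank 1
    EvalCase("q2", {"c2"}, ["c7", "c8", "c2"]),   # hit only at rank 3
    EvalCase("q3", {"c3"}, ["c5", "c6", "c7"]),   # miss
]
rate_at_1 = hit_rate_at_k(cases, k=1)
rate_at_3 = hit_rate_at_k(cases, k=3)
```

Comparing the metric at different values of k shows whether the right chunk is missing entirely or merely ranked too low to reach the prompt.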

This separation is what makes the system debuggable. When an answer is wrong, the metrics tell you immediately whether retrieval missed or generation drifted - so you fix the actual problem instead of cycling through prompt edits.
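
As a sketch of that triage, assume each eval case has already produced three boolean results: whether the relevant chunk was retrieved, whether the answer was faithful to the context, and whether it matched the labeled answer. The function name and the category labels are hypothetical.

```python
def diagnose(retrieval_hit: bool, faithful: bool, correct: bool) -> str:
    """Attribute a result to the stage that most likely needs attention."""
    if not retrieval_hit:
        # The right chunk never reached the prompt; fix retrieval first.
        return "retrieval_miss"
    if not faithful:
        # The right context was there, but the answer strayed from it.
        return "generation_drift"
    if not correct:
        # Right context, faithful answer, still wrong: review labels or context.
        return "needs_review"
    return "ok"


results = [
    diagnose(retrieval_hit=False, faithful=True, correct=False),
    diagnose(retrieval_hit=True, faithful=False, correct=False),
    diagnose(retrieval_hit=True, faithful=True, correct=True),
]
```

Checking retrieval first reflects the ordering in the text: if the right chunk is missing, generation metrics for that case say little about the prompt or the model.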

RAG, Agentic Search, or Fine-Tuning?

RAG isn't always the answer. We reach for it when the knowledge is large, changes often, and needs to be cited. When questions require multi-step lookups - find X, then use it to find Y - agentic search with retrieval as a tool often beats single-shot RAG. And when you need the model to internalize a style or format rather than facts, fine-tuning is the better lever. The interesting systems usually combine them.

The Short Version

Chunk along structure, retrieve hybrid then rerank, and measure retrieval and generation separately. Do those three things and you're past where most RAG systems stall.

If you're building a retrieval system that has to be right, our AI Agent Development and LLM Automation Consulting work covers exactly this - retrieval pipelines you can evaluate and trust.

A RAG system you can't evaluate is a RAG system you can't improve. Build the eval set first.
