The RAG Architecture Decisions That Actually Matter

Retrieval-Augmented Generation (RAG) has become the default architecture for production LLM applications. Every vendor, every framework, every blog post has a RAG tutorial. The barrier to building a prototype that looks impressive is essentially zero.

The gap between a demo that works on three PDFs and a production system that handles thousands of documents with consistent quality is where the real engineering happens. And that gap is almost entirely about retrieval quality — not generation.

The Chunking Decision

Chunking is the most consequential decision in any RAG system, and it is almost always made too quickly. The default approach — split by character count with some overlap — works for demos and fails for production.

The core tension is simple: small chunks improve precision but lose context. Large chunks preserve context but introduce noise. The right answer depends on what you are retrieving and what you are asking.

**Our recommendation:** Start with semantic chunking that respects document structure (paragraphs, sections, lists). Use a chunk size of 512-1024 tokens with 10-20% overlap, then measure and iterate. The optimal chunk size for legal contracts differs from the one for technical documentation, which differs again from the one for chat logs. There is no universal answer.
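As a starting point, structure-aware chunking with overlap can be sketched as below. This is a minimal illustration, not a reference implementation: `chunk_by_paragraph` and `count_tokens` are hypothetical names, paragraphs are assumed to be separated by blank lines, and whitespace-separated words stand in for real tokenizer tokens.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def chunk_by_paragraph(text: str, max_tokens: int = 512,
                       overlap_ratio: float = 0.15) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words.

    Each new chunk starts with the last `overlap` words of the previous one.
    """
    if max_tokens < 1 or not 0 <= overlap_ratio < 1:
        raise ValueError("need max_tokens >= 1 and 0 <= overlap_ratio < 1")
    overlap = int(max_tokens * overlap_ratio)
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []  # words of the chunk being built

    for para in paragraphs:
        words = para.split()
        # Close the chunk at a paragraph boundary if this paragraph would overflow it.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
        # Fallback: content that still does not fit is split mid-paragraph.
        while len(current) > max_tokens:
            chunks.append(" ".join(current[:max_tokens]))
            current = current[max_tokens - overlap:]

    if current:
        chunks.append(" ".join(current))
    return chunks


doc = "a1 a2 a3 a4\n\nb1 b2 b3 b4\n\nc1 c2 c3 c4"
chunks = chunk_by_paragraph(doc, max_tokens=8, overlap_ratio=0.25)
```

In a real pipeline the word count would be replaced by the tokenizer of the embedding model in use, and the paragraph splitter by whatever structure the documents expose (headings, list items, clauses).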

Embedding Model Selection

The embedding model determines what "similar" means in your system. Most teams default to OpenAI's text-embedding-3-small or the older text-embedding-ada-002, both excellent general-purpose models. But "general-purpose" means they are optimised for the average use case, and your use case is not average.

**When to use a general-purpose model:** Your domain is broad, your queries are diverse, and you have little or no labelled data.

**When to fine-tune or use a domain-specific model:** Your corpus is narrow (legal documents, medical records, code), your queries follow predictable patterns, and you have at least a few hundred labelled examples.

We've seen a 15-25% improvement in retrieval precision simply by switching from a general-purpose to a domain-adapted embedding model. In a RAG system, that improvement cascades: better retrieval means better context, which means better generation.
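One way to make this choice empirically is to score each candidate model on the same labelled (query, relevant document) pairs. The sketch below assumes nothing about any particular embedding API: `hit_rate_at_k` takes any function that maps text to a vector, and `bag_of_words_embedder` is a toy stand-in used only so the example runs without a model.

```python
import math
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hit_rate_at_k(embed: Embedder, corpus: dict[str, str],
                  labelled: list[tuple[str, str]], k: int = 5) -> float:
    """Fraction of labelled queries whose relevant doc id appears in the top k."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for query, relevant_id in labelled:
        q = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        hits += relevant_id in ranked[:k]
    return hits / len(labelled)


def bag_of_words_embedder(vocab: list[str]) -> Embedder:
    # Toy embedder (hypothetical): counts occurrences of each vocabulary word.
    return lambda text: [text.lower().split().count(w) for w in vocab]


corpus = {"d1": "indemnification clause liability cap",
          "d2": "termination notice period"}
labelled = [("liability cap", "d1"), ("notice period", "d2")]

candidate_a = bag_of_words_embedder(["liability", "cap", "notice", "period"])
candidate_b = bag_of_words_embedder(["clause", "termination"])
score_a = hit_rate_at_k(candidate_a, corpus, labelled, k=1)
score_b = hit_rate_at_k(candidate_b, corpus, labelled, k=1)
```

With real models, `embed` would wrap the provider or library call, and the document vectors would be computed once and cached rather than rebuilt per evaluation.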

The Reranker Layer

Once chunking and embeddings are reasonable, the biggest remaining quality lever in a production RAG system is usually the reranker. A reranker takes the top-K results from your initial embedding search and re-scores them with a more expensive cross-encoder model that reads the query and each document together, rather than comparing two independently computed vectors.

The difference between embedding-only search and embedding + reranker is dramatic. In our benchmarks, a reranker improves top-1 accuracy by 20-35% across diverse domains. The latency cost (typically 50-200ms per query) is almost always worth it.
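The two-stage shape described above is small in code. In this sketch, `rerank` is a hypothetical helper and `score_pair` is a placeholder for whatever cross-encoder you deploy; the word-overlap scorer is a toy assumption so the example runs without a model, and is not a substitute for a real cross-encoder.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Re-score first-stage candidates jointly with the query and keep the best top_n."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_overlap_score(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder: share of query words found in the document.
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


# Pretend these came back from the embedding search, in first-stage order.
first_stage = [
    "reset your password via email",
    "billing cycles explained",
    "how to reset a forgotten password",
]
best = rerank("reset forgotten password", first_stage, toy_overlap_score, top_n=2)
```

Because the scorer runs once per candidate, the size of the first-stage candidate list is the main knob for trading latency against recall.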

Hybrid Search: When and Why

Pure semantic search (embedding-based) misses exact matches that keyword search catches — part numbers, proper names, specific phrases. Pure keyword search misses conceptual matches that semantic search handles naturally.

Hybrid search combines both: run semantic and keyword searches in parallel, then merge and rank the results. This is our default recommendation for production systems. The implementation complexity is modest, and the robustness gain is substantial.
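The text above does not prescribe a merge strategy. One common choice is reciprocal rank fusion (RRF), which needs only the rank positions from each search and so avoids comparing raw scores across systems. A minimal sketch, assuming each search returns an ordered list of document ids and using the commonly cited constant k = 60:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids; each list adds 1 / (k + rank) per doc."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of the two searches for the same query.
semantic_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Documents that rank well in both lists rise to the top, while a document found by only one search still survives into the merged list, which is the robustness property hybrid search is after.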

Monitoring Retrieval Quality

Most teams monitor LLM response quality (toxicity, helpfulness) but not retrieval quality. This is backwards. If the retrieved context is bad, the generation cannot save it.

We recommend tracking:

- **Precision@K:** What fraction of retrieved documents are relevant?

- **MRR (Mean Reciprocal Rank):** How early in the results does the first relevant document appear?

- **Context Win Rate:** In A/B tests, how often does the retrieved context improve the final answer vs. a baseline?

Without these metrics, you are flying blind. With them, you can systematically improve each component of the retrieval pipeline and measure the impact.
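The three metrics listed above are cheap to compute once you have relevance labels. The sketch below uses hypothetical function names and assumes each evaluation run is a ranked list of document ids plus the set of ids judged relevant; for win rate, it assumes each A/B comparison has already been judged as a win or not.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved doc ids that are relevant."""
    if k <= 0:
        return 0.0
    return sum(doc_id in relevant for doc_id in retrieved[:k]) / k


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1 / rank of the first relevant doc per query (0 if none is found)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


def context_win_rate(wins: list[bool]) -> float:
    """Share of A/B comparisons where the answer with retrieved context beat the baseline."""
    return sum(wins) / len(wins) if wins else 0.0


runs = [
    (["a", "b", "c"], {"b"}),       # first relevant doc at rank 2
    (["x", "y", "z"], {"x", "z"}),  # first relevant doc at rank 1
]
p_at_3 = precision_at_k(runs[0][0], runs[0][1], k=3)
mrr = mean_reciprocal_rank(runs)
win_rate = context_win_rate([True, True, False, True])
```

Tracking these per pipeline version makes it possible to attribute a change in answer quality to a specific retrieval change rather than to the prompt or the model.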

RAG is table stakes. The winners will be the teams that obsess over retrieval quality — not the teams with the best prompt templates or the largest context windows. The architecture decisions above are where the real leverage lives. Make them carefully.