
The demo worked. That was the problem.
I’d built a document assistant over an internal knowledge base — a few hundred pages of support articles, runbooks, the usual. In the demo it answered everything I threw at it. Crisp, sourced, fast. People nodded. We shipped it.
Two weeks later the complaints started. The assistant was confidently wrong. Not gibberish — worse. It would answer a question about refund eligibility by quoting the section on shipping delays, and it would do it in the same authoritative tone it used when it was right. Users couldn’t tell the difference. Honestly, at first, neither could I.
So I did what most people do. I assumed the model was the weak link. I swapped it for a bigger one. I rewrote the system prompt four times. I added “only answer using the provided context” in increasingly stern language. The answers got slightly more polite and exactly as wrong.
The model was never the problem. It was doing precisely what I asked — generating a fluent answer from the context it was handed. The context was garbage. I just hadn’t looked.
The part the tutorials skip
Every RAG tutorial teaches the same four steps. Embed your documents. Store the vectors. Retrieve the top matches. Hand them to the model. All correct. That is also why so many RAG systems fall apart in production: the tutorial ends exactly where the hard part starts.
Here’s the uncomfortable number: naive RAG pipelines return the wrong context roughly 40% of the time. And when teams finally trace their failures, around 80% of them live in the ingestion and retrieval layer, not in the model. You can burn a month tuning prompts and swapping LLMs while your retriever quietly hands back the wrong paragraph on roughly two queries out of every five.
One idea should reframe how you debug the entire system: retrieval quality determines answer quality. The model is downstream of everything. If the right information never makes it into the context window, no model on earth can recover it. It’ll just be wrong more eloquently.
Once that clicked, the six things I’d been doing wrong became obvious.
1. You’re debugging the model when retrieval is broken
This is the first mistake and the most expensive, because it points you in the wrong direction for weeks.
When a RAG system gives a bad answer, the instinct is to stare at the generation step. It’s the visible part. It’s where the words come out. But a bad answer has only two possible causes, and just one of them is the model: either the retriever fed it the wrong context, or it had the right context and reasoned poorly. The first is far more common.
Before you touch the prompt or the model, look at what was actually retrieved for the failing query. Print the chunks. Read them. Most of the time you’ll find the answer wasn’t in there at all, and the model was inventing a plausible response from irrelevant text — which is exactly what it’s built to do when you give it nothing useful.
Debug retrieval first. Always. You cannot fix a generation problem that is actually a retrieval problem, and you’ll waste enormous effort trying.
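Here is roughly what “print the chunks” can look like. It’s a minimal sketch, not a real integration: `inspect_retrieval`, `fake_retrieve`, and the `(text, score)` return shape are all names and assumptions I’ve made up for illustration, so swap in whatever your retriever actually returns.

```python
from typing import Callable, List, Tuple

# (chunk_text, similarity_score) pairs, best match first. Assumed shape.
Hits = List[Tuple[str, float]]


def inspect_retrieval(query: str, retrieve: Callable[[str, int], Hits],
                      expected_phrase: str, k: int = 5) -> bool:
    """Print the top-k chunks and report whether any contains expected_phrase."""
    found = False
    for rank, (text, score) in enumerate(retrieve(query, k), start=1):
        hit = expected_phrase.lower() in text.lower()
        found = found or hit
        flag = "*" if hit else " "
        print(f"{flag} #{rank} score={score:.3f} {text[:100]!r}")
    print(f"expected phrase in retrieved context: {found}")
    return found


# Hypothetical stand-in for a real retriever, hard-coded for the demo.
def fake_retrieve(query: str, k: int) -> Hits:
    hits = [
        ("Shipping delays of more than 10 days qualify for a credit.", 0.82),
        ("Our warehouses ship Monday to Friday.", 0.79),
    ]
    return hits[:k]


answer_was_retrieved = inspect_retrieval(
    "Am I eligible for a refund?", fake_retrieve, expected_phrase="refund")
```

If that flag comes back false for a failing query, the model never saw the answer, and no amount of prompt work will put it there.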
2. Fixed-size chunking is shredding your context
You split your documents into 500-token chunks with a little overlap, because that’s what the tutorial did. It’s the most common chunking strategy on the planet and it’s quietly wrecking your retrieval.
Fixed-size chunking doesn’t care about meaning. It cuts mid-sentence, mid-table, mid-thought. A boundary lands in the middle of the one paragraph that answers the user’s question, so half the answer sits in chunk 14 and half in chunk 15, and your top-k grabs neither cleanly. The embedding for a chunk that begins halfway through an idea is a blurry average of two unrelated topics — and a blurry embedding matches nothing well.
The numbers here are brutal. One clinical-decision study compared adaptive chunking against fixed-size chunking on the same corpus: 87% accuracy for adaptive versus 13% for fixed-size. Same documents, same model, same everything except how the text was split. Chunking quality constrains retrieval accuracy more than your choice of embedding model does, which is the exact opposite of where most people spend their tuning hours.
Chunk on structure instead. Respect headers, paragraphs, list items, table rows. Better yet, use a small-to-big approach: index small, precise chunks for matching, but when one hits, send the larger parent section to the model so it gets coherent context instead of a fragment.
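A small sketch of the small-to-big idea, under some loud assumptions: headers are lines starting with `#`, paragraphs are separated by blank lines, and `Chunk`, `chunk_by_structure`, and `expand_to_parent` are hypothetical names of mine. Real documents need a real parser; this only shows the shape.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Chunk:
    text: str       # small paragraph-level chunk, used for matching
    parent_id: int  # index of the section this paragraph came from


def chunk_by_structure(doc: str) -> Tuple[List[Chunk], Dict[int, str]]:
    """Split on headers into parent sections, then on blank lines into children."""
    sections: List[List[str]] = []
    for line in doc.strip().splitlines():
        # Assumption: a header is any line starting with '#'.
        if line.startswith("#") or not sections:
            sections.append([])
        sections[-1].append(line)

    chunks: List[Chunk] = []
    parents: Dict[int, str] = {}
    for pid, lines in enumerate(sections):
        body = "\n".join(lines).strip()
        parents[pid] = body  # the parent keeps its header for context
        for para in body.split("\n\n"):
            # Drop header lines so children contain only paragraph text.
            text = "\n".join(
                l for l in para.splitlines() if not l.startswith("#")).strip()
            if text:
                chunks.append(Chunk(text, pid))
    return chunks, parents


def expand_to_parent(hit: Chunk, parents: Dict[int, str]) -> str:
    """Given a matched small chunk, return the larger section to send to the model."""
    return parents[hit.parent_id]


DOC = """# Refunds

Refunds are available within 30 days.

Items must be unused.

# Shipping

Orders ship in 2 days."""

chunks, parents = chunk_by_structure(DOC)
```

You embed and search over `chunks`, but what goes into the prompt is `expand_to_parent(hit, parents)`, so a match on “Items must be unused.” brings the whole Refunds section with it.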
3. You skipped the reranker
This is the highest-return change you’re probably not making, and it takes about an hour to add.
Vector search ranks chunks by cosine similarity — how close two embeddings sit in vector space. That correlates with relevance. It does not equal it. A chunk can be semantically near your query and still be answering a completely different question. Cosine similarity gives you candidates, not answers.
A reranker fixes this. It’s a cross-encoder: instead of embedding the query and each document separately and comparing the results, it reads both together, as a pair, and scores how well the document actually answers the query. It’s the difference between “these two texts are about similar things” and “this text answers this question.”
The reason you don’t just use a cross-encoder for everything is speed: it has to run the model once per query-document pair, which is far too slow across millions of documents. So you chain them. Fast vector search pulls 20 to 50 candidates, then the reranker reorders them carefully and you keep the top few. Adding one typically buys a 10 to 30% improvement in precision for about 50 to 100 milliseconds of extra latency. There is almost no other single change in RAG with that return on effort.
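The two-stage shape is simple enough to sketch. Everything named here is hypothetical: `retrieve_then_rerank` is my own helper, `toy_first_stage` stands in for your vector search, and `toy_score` is a crude word-overlap function standing in for a real cross-encoder, which it very much is not. In practice `score_pair` would wrap whatever reranking model you use.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    first_stage: Callable[[str, int], List[str]],
    score_pair: Callable[[str, str], float],
    candidates: int = 30,
    keep: int = 3,
) -> List[Tuple[str, float]]:
    """Pull a wide candidate pool cheaply, then rescore each (query, doc) pair."""
    pool = first_stage(query, candidates)
    scored = [(doc, score_pair(query, doc)) for doc in pool]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


# Hypothetical first stage: returns documents in "vector search" order.
def toy_first_stage(query: str, n: int) -> List[str]:
    return [
        "Shipping delays over 10 days earn a credit.",
        "Refund eligibility: unused items returned within 30 days.",
        "Warehouse hours are 9 to 5.",
    ][:n]


# Toy stand-in for a cross-encoder: fraction of query words found in the doc.
def toy_score(query: str, doc: str) -> float:
    q = set(query.lower().split())
    d = set(doc.lower().replace(":", " ").replace(".", " ").split())
    return len(q & d) / len(q)


top = retrieve_then_rerank(
    "refund eligibility for unused items", toy_first_stage, toy_score, keep=2)
```

The first stage ranked the shipping paragraph on top. After rescoring each pair, the refund paragraph moves to first place, which is the whole point of the second stage.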
4. Pure vector search can’t read
Semantic search is supposed to be the whole point of RAG, so it feels backwards to say pure vector search has a blind spot. It does, and it’s a wide one.
Vector search matches meaning, which means it can miss exact terms entirely. Even ordinary paraphrase isn’t guaranteed: a user asks “How do I cancel my subscription?” and your document is titled “Account Termination Policy.” To a human those are obviously the same thing. To an embedding model they can land far enough apart that the right document never enters the top-k. The failure is far more reliable with product codes, error numbers, function names, acronyms, and proper nouns: anything where the literal token matters and there’s no neat semantic synonym.
The fix is hybrid search. Run dense vector search and a sparse keyword search like BM25 side by side, then merge the results. Vector search catches paraphrase and meaning. BM25 catches the exact SKU-4471 or ERR_CONN_REFUSED the user typed verbatim. Together they cover each other’s failures. Hybrid plus a reranker is the configuration most production teams settle on, because it has the best quality-to-cost ratio of anything short of full agentic retrieval.
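“Merge the results” hides a choice. One common option is reciprocal rank fusion, which only looks at each document’s rank in each list, so you never have to reconcile BM25 scores with cosine similarities. The sketch below is one way to do it, not the only one; the document IDs are made up, and `k=60` is a commonly used default rather than anything tuned.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for "How do I cancel my subscription?"
dense_hits = ["billing-faq", "account-termination", "pricing"]
sparse_hits = ["account-termination", "err-codes", "billing-faq"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that show up in both lists rise to the top, and a document only one retriever found still survives into the candidate pool for the reranker.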
5. You’re stuffing the context window and calling it “more context”
More context feels safer. If you’re not sure which chunks matter, retrieve ten and let the model sort it out — it has a huge context window now anyway. Why not use it.
Because the model doesn’t reliably sort it out. Everything you hand it competes for its attention. Retrieve ten chunks when two are relevant and you’ve diluted the signal with eight chunks of noise. The model reads all ten, has no dependable way to tell which two you would have picked, and tends to produce a muddier answer than it would have from the two good ones alone. Long context windows didn’t repeal this. They just let you make the mistake at a bigger scale, and pay more for the privilege.
This is why mistakes 2 through 4 matter so much — they all exist to get fewer, better chunks into the prompt. Precision beats volume. Retrieve less, but retrieve right. If your reranker is doing its job, the top three chunks should carry the answer, and everything past that is mostly tax: on cost, on latency, and on the answer itself.
6. You have no idea if it’s working
This is the one that separates teams who improve their RAG from teams who just keep changing things and hoping.
Ask most people how good their retrieval is and you get a vibe. “Seems decent.” “It got the last few right.” That’s not a measurement, and you can’t improve what you don’t measure. It’s worse than that, actually — every change you make to chunking or retrieval is invisible without a baseline. You nudge the chunk size, it feels a bit better, you keep it, and you’ve quietly made things worse on all the queries you didn’t happen to test.
Set up evaluation early, before you start tuning anything. Frameworks like RAGAS score the things that actually matter: context precision (did you retrieve relevant chunks), context recall (did you retrieve all the relevant ones), faithfulness (is the answer grounded in what you retrieved), and answer relevance (does it address the question). Build a set of real questions with known good answers, run it after every change, and watch the numbers instead of the mood. The first time you do this, you’ll find your retrieval is worse than you thought. That’s the point.
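You don’t need a framework to get a first baseline. The sketch below computes deliberately simplified, set-based versions of context precision and context recall over hand-labeled chunk IDs. This is not how RAGAS computes its metrics, and `evaluate`, `FAKE_INDEX`, and the test set are hypothetical names and data of mine, but it is enough to tell whether a change helped or hurt.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple


def context_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Share of retrieved chunks that are labeled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)


def context_recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Share of labeled-relevant chunks that were retrieved."""
    if not relevant:
        return 1.0
    return len(set(retrieved) & relevant) / len(relevant)


def evaluate(test_set: List[Tuple[str, Set[str]]],
             retrieve: Callable[[str, int], List[str]],
             k: int = 3) -> Dict[str, float]:
    """Average both metrics over (question, relevant_chunk_ids) pairs."""
    if not test_set:
        raise ValueError("test_set is empty")
    precisions, recalls = [], []
    for question, relevant in test_set:
        retrieved = retrieve(question, k)
        precisions.append(context_precision(retrieved, relevant))
        recalls.append(context_recall(retrieved, relevant))
    n = len(test_set)
    return {"context_precision": sum(precisions) / n,
            "context_recall": sum(recalls) / n}


# Hypothetical retriever output and hand-labeled relevant chunk IDs.
FAKE_INDEX = {
    "How do refunds work?": ["a", "b", "c"],
    "How do I cancel?": ["e", "f", "g"],
}
test_set = [("How do refunds work?", {"a", "d"}),
            ("How do I cancel?", {"e", "f"})]

report = evaluate(test_set, lambda q, k: FAKE_INDEX[q][:k])
```

Run it before you change anything, save the numbers, and run it again after every change to chunking or retrieval. Faithfulness and answer relevance need a judge of some kind, which is where a framework starts to earn its place.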
The opposite mistake: building too much
There’s one final mistake, and it comes from reading articles like this one.
You learn about GraphRAG, agentic retrieval, self-correcting loops, query decomposition, parametric injection — the 2026 RAG menu is long and genuinely impressive — and you decide your system needs all of it. So you build a multi-agent retrieval pipeline to answer questions that a hybrid search and a reranker would have handled at a tenth of the cost and latency.
Match the pipeline to the query. If the answer lives in a single chunk, naive RAG with a reranker is fine. If it needs facts from two or three documents, hybrid plus rerank plus a little query rewriting is the sweet spot — and that’s where most applications should live. The heavy machinery earns its keep only when your evaluation proves the simpler setup isn’t enough: knowledge graphs for relationship-heavy reasoning, agentic loops for genuine multi-hop questions. Reach for them before that, and you’ve bought complexity, latency, and more things that can break, in exchange for accuracy you didn’t need.
Start with the simplest thing that works. Measure it. Add complexity only when the numbers force your hand.
I learned all of this the slow way — debugging the wrong layer for two weeks while users quietly lost trust in something I’d shipped. The model was never the problem. It almost never is. Go look at what your retriever is actually handing it. You’ll probably be surprised, and then you’ll know exactly what to fix.
