Why RAG?
Large language models are impressive — until you ask them something they don't know. GPT-4 has a knowledge cutoff, knows nothing about your internal documents, and will confidently hallucinate facts it hasn't seen. Retrieval-Augmented Generation fixes this by giving the model relevant context at query time, pulled from your own data.
The idea is simple: instead of fine-tuning an LLM on your documents (expensive, slow, requires retraining whenever data changes), you retrieve the most relevant chunks at runtime and inject them into the prompt. The model answers based on what you give it, not what it memorised during training.
I built my first proper RAG system for the AI Workflow System project — a document assistant that lets you upload PDFs and ask questions about them. Here's what I learned.
The Architecture
At a high level, RAG has two phases:
- Indexing — load documents, split them into chunks, embed each chunk, store in a vector database
- Retrieval + Generation — embed the user's question, find the most similar chunks, pass them to the LLM as context
```
User Query
    │
    ▼
[Embed Query] ──▶ [Vector DB] ──▶ [Top-k Chunks]
                                        │
                                        ▼
                              [LLM + Context] ──▶ Answer
```
Simple on paper. The complexity is in the details.
Step 1: Loading and Chunking Documents
The first problem is that LLMs have a context window limit. You can't feed a 200-page PDF into the prompt. You have to split it into chunks and only send the relevant ones.
```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("document.pdf")
pages = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(pages)
```
Chunk size matters more than you think. I started with chunk_size=2000 and got poor retrieval — the chunks were too broad and diluted the signal. Dropping to 800–1000 characters with 200 characters of overlap significantly improved answer quality.
The overlap ensures that sentences crossing chunk boundaries aren't lost. Without it, important context that spans two chunks gets cut in half.
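The mechanics are easy to see without LangChain. Here's a minimal character-level splitter with overlap — a simplified stand-in for RecursiveCharacterTextSplitter (it ignores separators and just slices at fixed offsets):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks; each chunk repeats the last
    `overlap` characters of the previous one, so context that straddles
    a boundary appears whole in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With chunk_size=4 and overlap=2, the string "abcdefghij" becomes ["abcd", "cdef", "efgh", "ghij"] — every 2-character boundary region is covered twice.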
Step 2: Embeddings
Embeddings convert text into vectors — numerical representations that capture semantic meaning. Similar meaning = similar vectors = similar position in vector space.
```python
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
```
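"Similar position in vector space" is usually measured with cosine similarity — the cosine of the angle between two vectors. A sketch with made-up 3-dimensional vectors (real embeddings from text-embedding-3-small have 1,536 dimensions, and the values below are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- real ones come from the embedding model
cat = [0.9, 0.3, 0.1]
kitten = [0.8, 0.4, 0.1]
invoice = [0.1, 0.2, 0.9]

# cat vs kitten scores higher than cat vs invoice:
# semantically close texts point in similar directions
```

Retrieval is then just "find the k stored vectors with the highest cosine similarity to the query vector".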
I tested text-embedding-3-small vs text-embedding-3-large. For most document QA tasks, small is fast and cheap enough. The quality difference matters more when you have very domain-specific language — legal, medical, technical specs — where large earns its cost.
One thing that tripped me up: you must use the same embedding model at both indexing time and query time. Mixing models produces garbage retrieval because the vector spaces are incompatible.
Step 3: Vector Store
Embeddings need somewhere to live. I used FAISS for local development — it's fast, runs in-memory, no server required.
```python
from langchain.vectorstores import FAISS

vectorstore = FAISS.from_documents(chunks, embeddings)

# Persist to disk
vectorstore.save_local("faiss_index")

# Load later
vectorstore = FAISS.load_local("faiss_index", embeddings)
```
For production with large document sets, consider Pinecone or pgvector (Postgres extension). I moved to pgvector on Supabase for the hosted version — it keeps everything in one database rather than managing a separate vector service.
Step 4: The Retrieval Chain
This is where LangChain shines. A RetrievalQA chain wires the vector store retriever to the LLM automatically:
```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = inject all chunks directly into the prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

result = qa_chain({"query": "What are the main findings?"})
print(result["result"])
print(result["source_documents"])
```
The k=4 parameter controls how many chunks are retrieved. More chunks = more context = better answers, but also higher token cost and risk of hitting context limits. I settled on 4–6 for most use cases.
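A rough way to reason about that cost, assuming the common approximation of ~4 characters per token for English text (the exact ratio depends on the tokeniser):

```python
def context_token_estimate(k: int, chunk_size_chars: int = 1000) -> int:
    """Rough token count for k retrieved chunks, at ~4 chars per token."""
    return k * chunk_size_chars // 4

# k=4 chunks of 1,000 characters is roughly 1,000 tokens of context,
# before the question and system prompt are added on top
print(context_token_estimate(4))
```

Doubling k to 8 roughly doubles the per-query context cost, which adds up fast on a busy endpoint.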
What I Got Wrong (and Fixed)
1. Not cleaning the documents
Raw PDFs are messy — headers, footers, page numbers, watermarks all end up in the chunks. A simple cleanup step before splitting made retrieval noticeably better:
```python
import re

def clean_text(text: str) -> str:
    text = re.sub(r'\n{3,}', '\n\n', text)       # collapse excessive newlines
    text = re.sub(r'Page \d+ of \d+', '', text)  # remove "Page X of Y" footers
    return text.strip()
```
2. No conversation memory
The first version answered each question in isolation. If you asked "Who wrote this?" and then "What did they conclude?", the second question had no context from the first.
LangChain's ConversationalRetrievalChain fixes this with a memory buffer:
```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    memory=memory,
)
```
3. Blindly trusting retrieval
Sometimes the retrieved chunks are irrelevant — the question has no good answer in the document. Without guardrails, the LLM invents one.
I added a simple system prompt instructing it to say "I don't know" when the context doesn't support an answer:
```python
system_prompt = """You are a helpful assistant answering questions about a document.
Use ONLY the provided context to answer. If the context does not contain enough
information to answer the question, say "I don't have enough information in the
document to answer that." Do not make up information."""
```
This alone eliminated most hallucinations in my testing.
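The prompt guard can also be backed by a retrieval-side check: most vector stores can return similarity scores alongside documents, and if even the best score is weak you can refuse before the LLM ever runs. A sketch of the idea working on plain (text, score) pairs — the 0.75 threshold is an assumption you'd tune per corpus and per similarity metric:

```python
NO_ANSWER = "I don't have enough information in the document to answer that."

def guarded_context(scored_chunks: list[tuple[str, float]],
                    threshold: float = 0.75) -> str:
    """Return joined context for the LLM, or a refusal string if no
    retrieved chunk scores above the threshold.
    scored_chunks: (chunk_text, similarity) pairs, higher = more similar."""
    relevant = [text for text, score in scored_chunks if score >= threshold]
    if not relevant:
        return NO_ANSWER
    return "\n\n".join(relevant)
```

Belt and braces: the score check stops obviously-irrelevant context from being sent at all, and the system prompt handles the subtler cases where plausible-looking chunks still don't contain the answer.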
Results
After all these iterations, the system handles multi-turn Q&A over PDFs reliably. Accuracy on questions with clear answers in the document is high. The failure cases are mostly questions that require reasoning across widely separated sections — a known limitation of chunk-based retrieval that more advanced techniques like HyDE or re-ranking can help with.
What's Next
A few things I want to explore:
- Hybrid search — combining vector similarity with BM25 keyword search usually outperforms either alone
- Re-ranking — using a cross-encoder to re-score the top-k results before passing to the LLM
- Agentic RAG — letting the model decide when to retrieve and what to search for, rather than always retrieving on every query
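For hybrid search, one common way to combine the two rankings is reciprocal rank fusion (RRF): each document earns 1/(c + rank) from every list it appears in, which sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales. A minimal sketch over ranked ID lists (c=60 is the constant from the original RRF paper):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Fuse several ranked ID lists: each doc's score is the sum of
    1/(c + rank) over every list it appears in, then sort by score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]  # hypothetical vector-search ranking
bm25_hits = ["doc1", "doc9", "doc3"]    # hypothetical keyword-search ranking

print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# doc1 and doc3 appear in both lists, so they rank above doc7 and doc9
```

Documents that both retrievers agree on float to the top, which is exactly the behaviour you want from hybrid search.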
RAG is still a fast-moving space. What's considered best practice changes every few months. But the core ideas — chunk, embed, retrieve, generate — stay stable, and understanding them deeply makes it much easier to follow the new developments as they arrive.