Why RAG?
Large language models are impressive — until you ask them something they don't know. GPT-4 has a knowledge cutoff, knows nothing about your internal documents, and will confidently hallucinate facts it hasn't seen. Retrieval-Augmented Generation fixes this by giving the model relevant context at query time, pulled from your own data.
The idea is simple: instead of fine-tuning an LLM on your documents (expensive, slow, requires retraining whenever data changes), you retrieve the most relevant chunks at runtime and inject them into the prompt. The model answers based on what you give it, not what it memorised during training.
I built my first proper RAG system for the AI Workflow System project — a document assistant that lets you upload PDFs and ask questions about them. Here's what I learned.
The Architecture
At a high level, RAG has two phases:
- Indexing — load documents, split them into chunks, embed each chunk, store in a vector database
- Retrieval + Generation — embed the user's question, find the most similar chunks, pass them to the LLM as context
```
User Query
    │
    ▼
[Embed Query] ──▶ [Vector DB] ──▶ [Top-k Chunks]
                                        │
                                        ▼
                              [LLM + Context] ──▶ Answer
```
Simple on paper. The complexity is in the details.
Step 1: Loading and Chunking Documents
The first problem is that LLMs have a context window limit. You can't feed a 200-page PDF into the prompt. You have to split it into chunks and only send the relevant ones.
```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("document.pdf")
pages = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(pages)
```
Chunk size matters more than you think. I started with chunk_size=2000 and got poor retrieval — the chunks were too broad and diluted the signal. Dropping to 800–1000 characters with 200 characters of overlap significantly improved answer quality.
The overlap ensures that sentences crossing chunk boundaries aren't lost. Without it, important context that spans two chunks gets cut in half.
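The mechanics are easy to see without LangChain. Here's a minimal character-level splitter with overlap — a simplified stand-in for RecursiveCharacterTextSplitter (it ignores separators and just slices at fixed offsets):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks; each chunk repeats the last
    `overlap` characters of the previous one, so context that straddles
    a boundary appears whole in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With chunk_size=4 and overlap=2, the string "abcdefghij" becomes ["abcd", "cdef", "efgh", "ghij"] — every 2-character boundary region is covered twice.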
Step 2: Embeddings
Embeddings convert text into vectors — numerical representations that capture semantic meaning. Similar meaning = similar vectors = similar position in vector space.
```python
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
```
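"Similar position in vector space" is usually measured with cosine similarity — the cosine of the angle between two vectors. A sketch with made-up 3-dimensional vectors (real embeddings from text-embedding-3-small have 1,536 dimensions, and the values below are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- real ones come from the embedding model
cat = [0.9, 0.3, 0.1]
kitten = [0.8, 0.4, 0.1]
invoice = [0.1, 0.2, 0.9]

# cat vs kitten scores higher than cat vs invoice:
# semantically close texts point in similar directions
```

Retrieval is then just "find the k stored vectors with the highest cosine similarity to the query vector".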
I tested text-embedding-3-small vs text-embedding-3-large. For most document QA tasks, small is fast and cheap enough. The quality difference matters more when you have very domain-specific language — legal, medical, technical specs — where large earns its cost.
One thing that tripped me up: you must use the same embedding model at both indexing time and query time. Mixing models produces garbage retrieval because the vector spaces are incompatible.
Step 3: Vector Store
Embeddings need somewhere to live. I used FAISS for local development — it's fast, runs in-memory, no server required.
```python
from langchain.vectorstores import FAISS

vectorstore = FAISS.from_documents(chunks, embeddings)

# Persist to disk
vectorstore.save_local("faiss_index")

# Load later
vectorstore = FAISS.load_local("faiss_index", embeddings)
```
For production with large document sets, consider Pinecone or pgvector (Postgres extension). I moved to pgvector on Supabase for the hosted version — it keeps everything in one database rather than managing a separate vector service.
Step 4: The Retrieval Chain
This is where LangChain shines. A RetrievalQA chain wires the vector store retriever to the LLM automatically:
```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = inject all chunks directly into the prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

result = qa_chain({"query": "What are the main findings?"})
print(result["result"])
print(result["source_documents"])
```
The k=4 parameter controls how many chunks are retrieved. More chunks = more context = better answers, but also higher token cost and risk of hitting context limits. I settled on 4–6 for most use cases.
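A rough way to reason about that cost, assuming the common approximation of ~4 characters per token for English text (the exact ratio depends on the tokeniser):

```python
def context_token_estimate(k: int, chunk_size_chars: int = 1000) -> int:
    """Rough token count for k retrieved chunks, at ~4 chars per token."""
    return k * chunk_size_chars // 4

# k=4 chunks of 1,000 characters is roughly 1,000 tokens of context,
# before the question and system prompt are added on top
print(context_token_estimate(4))
```

Doubling k to 8 roughly doubles the per-query context cost, which adds up fast on a busy endpoint.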
What I Got Wrong (and Fixed)
1. Not cleaning the documents
Raw PDFs are messy — headers, footers, page numbers, watermarks all end up in the chunks. A simple cleanup step before splitting made retrieval noticeably better:
```python
import re

def clean_text(text: str) -> str:
    text = re.sub(r'\n{3,}', '\n\n', text)       # collapse excessive newlines
    text = re.sub(r'Page \d+ of \d+', '', text)  # remove "Page X of Y" footers
    return text.strip()
```
2. No conversation memory
The first version answered each question in isolation. If you asked "Who wrote this?" and then "What did they conclude?", the second question had no context from the first.
LangChain's ConversationalRetrievalChain fixes this with a memory buffer:
```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    memory=memory,
)
```
3. Blindly trusting retrieval
Sometimes the retrieved chunks are irrelevant — the question has no good answer in the document. Without guardrails, the LLM invents one.
I added a simple system prompt instructing it to say "I don't know" when the context doesn't support an answer:
```python
system_prompt = """You are a helpful assistant answering questions about a document.
Use ONLY the provided context to answer. If the context does not contain enough
information to answer the question, say "I don't have enough information in the
document to answer that." Do not make up information."""
```
This alone eliminated most hallucinations in my testing.
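The prompt guard can also be backed by a retrieval-side check: most vector stores can return similarity scores alongside documents, and if even the best score is weak you can refuse before the LLM ever runs. A sketch of the idea working on plain (text, score) pairs — the 0.75 threshold is an assumption you'd tune per corpus and per similarity metric:

```python
NO_ANSWER = "I don't have enough information in the document to answer that."

def guarded_context(scored_chunks: list[tuple[str, float]],
                    threshold: float = 0.75) -> str:
    """Return joined context for the LLM, or a refusal string if no
    retrieved chunk scores above the threshold.
    scored_chunks: (chunk_text, similarity) pairs, higher = more similar."""
    relevant = [text for text, score in scored_chunks if score >= threshold]
    if not relevant:
        return NO_ANSWER
    return "\n\n".join(relevant)
```

Belt and braces: the score check stops obviously-irrelevant context from being sent at all, and the system prompt handles the subtler cases where plausible-looking chunks still don't contain the answer.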
Results
After all these iterations, the system handles multi-turn Q&A over PDFs reliably. Accuracy on questions with clear answers in the document is high. The failure cases are mostly questions that require reasoning across widely separated sections — a known limitation of chunk-based retrieval that more advanced techniques like HyDE or re-ranking can help with.
What's Next
A few things I want to explore:
- Hybrid search — combining vector similarity with BM25 keyword search usually outperforms either alone
- Re-ranking — using a cross-encoder to re-score the top-k results before passing to the LLM
- Agentic RAG — letting the model decide when to retrieve and what to search for, rather than always retrieving on every query
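For hybrid search, one common way to combine the two rankings is reciprocal rank fusion (RRF): each document earns 1/(c + rank) from every list it appears in, which sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales. A minimal sketch over ranked ID lists (c=60 is the constant from the original RRF paper):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Fuse several ranked ID lists: each doc's score is the sum of
    1/(c + rank) over every list it appears in, then sort by score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]  # hypothetical vector-search ranking
bm25_hits = ["doc1", "doc9", "doc3"]    # hypothetical keyword-search ranking

print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# doc1 and doc3 appear in both lists, so they rank above doc7 and doc9
```

Documents that both retrievers agree on float to the top, which is exactly the behaviour you want from hybrid search.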
RAG is still a fast-moving space. What's considered best practice changes every few months. But the core ideas — chunk, embed, retrieve, generate — stay stable, and understanding them deeply makes it much easier to follow the new developments as they arrive.