## What is RAG?
Retrieval-Augmented Generation (RAG) combines the power of large language models with external knowledge retrieval. Instead of relying solely on the LLM's training data, RAG fetches relevant context from your own documents.
## The RAG Pipeline

```text
Documents → Chunking → Embeddings → Vector Store
                                        ↓
User Query → Embedding → Similarity Search
                                        ↓
Retrieved Context + Query → LLM → Response
```
## Step 1: Document Processing

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents
loader = PyPDFLoader("document.pdf")
pages = loader.load()

# Split into overlapping chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(pages)
```
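
It's worth spot-checking the output before indexing anything; a minimal check, assuming the load above succeeded:

```python
# Confirm the split produced a sensible number of chunks and preview one
print(f"{len(chunks)} chunks")
print(chunks[0].page_content[:200])
```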
## Step 2: Creating Embeddings

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Embed each chunk and index the vectors in a local FAISS store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
```
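
At this point the "Similarity Search" step from the diagram can be exercised directly, before any LLM is involved. A quick sketch; the query string is just an illustration:

```python
# Return the 4 chunks whose embeddings are closest to the query's embedding
docs = vectorstore.similarity_search("What does the document say about X?", k=4)
for doc in docs:
    print(doc.metadata.get("page"), doc.page_content[:100])
```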
## Step 3: Building the RAG Chain

```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Wire the retriever and the LLM into a question-answering chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(
        search_kwargs={"k": 4}  # fetch the top 4 chunks per query
    ),
)

response = qa_chain.run("What does the document say about X?")
```
## Production Optimizations

### 1. Hybrid Search

Combine keyword (BM25) and semantic search so exact-term matches aren't missed:

```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Keyword retriever over the raw chunks
bm25 = BM25Retriever.from_documents(chunks)
# Semantic retriever backed by the FAISS index
semantic = vectorstore.as_retriever()

# Merge both result lists, weighting semantic matches more heavily
ensemble = EnsembleRetriever(
    retrievers=[bm25, semantic],
    weights=[0.3, 0.7],
)
```
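
The ensemble behaves like any other retriever, which makes it easy to compare its output against the purely semantic one; the query string here is illustrative:

```python
# Results from both retrievers are merged and scored by the configured weights
docs = ensemble.get_relevant_documents("What does the document say about X?")
```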
### 2. Reranking

Use a cross-encoder to re-score the retrieved chunks for a better final ranking:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Cohere's cross-encoder re-scores each (query, chunk) pair
reranker = CohereRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=ensemble,
)
```
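
The reranking retriever slots straight into the RetrievalQA chain from Step 3; a sketch, assuming a Cohere API key is configured in the environment:

```python
# Same chain as before, but retrieval now goes through hybrid search + reranking
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=compression_retriever,
)
response = qa_chain.run("What does the document say about X?")
```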
### 3. Caching

Cache LLM responses and embeddings to cut latency and cost. For responses:

```python
import langchain
from langchain.cache import SQLiteCache

# Store LLM responses in a local SQLite database
langchain.llm_cache = SQLiteCache(database_path=".cache.db")
```
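
For embeddings, LangChain's CacheBackedEmbeddings can wrap the embedder so re-indexing unchanged documents doesn't re-call the embedding API. A sketch using a local file store; the namespace string is arbitrary:

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

store = LocalFileStore("./embedding_cache/")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    embeddings,  # the OpenAIEmbeddings instance from Step 2
    store,
    namespace="openai-embeddings",  # separates caches for different models
)

# Re-embedding the same chunks now reads from disk instead of calling the API
vectorstore = FAISS.from_documents(chunks, cached_embeddings)
```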
Common Pitfalls
- Chunk size too large - Loses specificity
- No overlap - Loses context at boundaries
- Wrong embedding model - Use domain-specific when available
- No evaluation - Always measure retrieval quality
## Evaluation Metrics

- **Retrieval Precision**: are the retrieved chunks relevant to the query? (A minimal check is sketched below.)
- **Answer Correctness**: is the generated answer factually accurate?
- **Faithfulness**: is the answer grounded in the retrieved context rather than hallucinated?
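
Retrieval precision is the easiest place to start: a small hand-labeled set of queries and the pages that should answer them, plus a precision@k loop. The labeled data below is purely illustrative:

```python
# Hypothetical labels: query -> set of PDF pages considered relevant
labeled = {
    "What does the document say about X?": {3, 4},
    "How is Y configured?": {10},
}

k = 4
precisions = []
for query, relevant_pages in labeled.items():
    retrieved = ensemble.get_relevant_documents(query)[:k]
    hits = sum(1 for doc in retrieved if doc.metadata.get("page") in relevant_pages)
    precisions.append(hits / k)

print(f"Mean precision@{k}: {sum(precisions) / len(precisions):.2f}")
```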
## Conclusion
RAG is powerful but requires careful tuning. Start simple, measure everything, and iterate based on real user feedback.