
Building RAG Applications: From Zero to Production

Jahanzaib Tayyab
October 5, 2024
10 min read
RAG · LangChain · AI · Tutorial · Vector Database

What is RAG?

Retrieval-Augmented Generation (RAG) combines the power of large language models with external knowledge retrieval. Instead of relying solely on the LLM's training data, RAG fetches relevant context from your own documents and feeds it to the model alongside the query.

The RAG Pipeline

Documents → Chunking → Embeddings → Vector Store
                                         ↓
User Query → Embedding → Similarity Search
                                         ↓
                              Retrieved Context + Query → LLM → Response

Step 1: Document Processing

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF; each page becomes a Document (requires `pip install pypdf`)
loader = PyPDFLoader("document.pdf")
pages = loader.load()

# Split into overlapping chunks; sizes are measured in characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # ~1,000 characters per chunk
    chunk_overlap=200,  # overlap preserves context across chunk boundaries
)
chunks = splitter.split_documents(pages)
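
A quick sanity check after splitting catches obvious misconfiguration before you pay for embeddings:

print(f"{len(pages)} pages -> {len(chunks)} chunks")
print(chunks[0].page_content[:200])  # eyeball the first chunk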

Step 2: Creating Embeddings

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Embed every chunk and build an in-memory FAISS index
# (requires OPENAI_API_KEY in the environment and `pip install faiss-cpu`)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
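
Re-embedding on every restart is slow and burns API credits, so persist the index to disk. A minimal sketch (on recent LangChain versions, load_local also needs allow_dangerous_deserialization=True):

# Save once after indexing
vectorstore.save_local("faiss_index")

# Reload later without re-embedding
vectorstore = FAISS.load_local("faiss_index", embeddings)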

Step 3: Building the RAG Chain

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(
        search_kwargs={"k": 4}  # retrieve the 4 most similar chunks
    ),
)

response = qa_chain.run("What does the document say about X?")
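
For attribution and debugging, it helps to return the retrieved chunks alongside the answer; RetrievalQA supports this via return_source_documents (a sketch reusing the vectorstore from Step 2):

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

result = qa_chain({"query": "What does the document say about X?"})
print(result["result"])  # the generated answer
for doc in result["source_documents"]:  # the chunks that grounded it
    print(doc.metadata.get("source"), doc.page_content[:100])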

Production Optimizations

1. Hybrid Search

Dense embeddings can miss exact terms like IDs and acronyms, so combine semantic and keyword search:

from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Keyword (BM25) retriever over the same chunks (requires `pip install rank_bm25`)
bm25 = BM25Retriever.from_documents(chunks)
semantic = vectorstore.as_retriever()

# Blend the ranked lists: 30% keyword, 70% semantic
ensemble = EnsembleRetriever(
    retrievers=[bm25, semantic],
    weights=[0.3, 0.7]
)

2. Reranking

Retrievers are tuned for recall; a cross-encoder reranker rescores the candidates for precision:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Cross-encoder reranker (requires COHERE_API_KEY and `pip install cohere`)
reranker = CohereRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,  # rescores and filters the candidates
    base_retriever=ensemble    # the hybrid retriever from the previous step
)
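
Usage is the same as any other retriever (the query here is a placeholder):

docs = compression_retriever.get_relevant_documents(
    "What does the document say about X?"
)
for doc in docs:
    print(doc.page_content[:120])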

3. Caching

Cache embeddings and responses:

from langchain.cache import SQLiteCache
import langchain

# Cache identical LLM calls in SQLite so repeat queries are free and instant
# (newer versions: from langchain.globals import set_llm_cache)
langchain.llm_cache = SQLiteCache(database_path=".cache.db")
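
That covers responses; embeddings can be cached too, so re-indexing doesn't re-embed unchanged chunks. A sketch using CacheBackedEmbeddings with a local file store:

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

store = LocalFileStore("./embedding_cache")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    embeddings,                  # the OpenAIEmbeddings instance from Step 2
    store,
    namespace=embeddings.model,  # key by model name so caches don't collide
)
vectorstore = FAISS.from_documents(chunks, cached_embeddings)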

Common Pitfalls

  1. Chunk size too large - Loses specificity
  2. No overlap - Loses context at boundaries
  3. Wrong embedding model - Use domain-specific when available
  4. No evaluation - Always measure retrieval quality

Evaluation Metrics

  • Retrieval Precision: Are retrieved docs relevant?
  • Answer Correctness: Is the answer accurate?
  • Faithfulness: Does the answer match the context?
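
A minimal sketch of the first metric, precision@k, over a small hand-labeled query set (the labels and chunk IDs here are hypothetical; frameworks like Ragas automate faithfulness and correctness scoring):

# Hypothetical labels: query -> IDs of chunks known to be relevant
labeled = {
    "What does the document say about X?": {"chunk-12", "chunk-13"},
}

def precision_at_k(retriever, labeled, k=4):
    """Average fraction of retrieved chunks that are actually relevant."""
    scores = []
    for query, relevant in labeled.items():
        docs = retriever.get_relevant_documents(query)[:k]
        retrieved = {doc.metadata.get("id") for doc in docs}
        scores.append(len(retrieved & relevant) / k)
    return sum(scores) / len(scores)

print(f"precision@4: {precision_at_k(compression_retriever, labeled):.2f}")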

Conclusion

RAG is powerful but requires careful tuning. Start simple, measure everything, and iterate based on real user feedback.


Jahanzaib Tayyab

Full Stack Developer & AI Engineer

Passionate about building scalable applications and exploring the frontiers of AI. Writing about web development, cloud architecture, and lessons learned from shipping software.