Building a Production RAG Pipeline with LangChain and Pinecone
Move beyond demos. Learn how to build, scale, and evaluate a production-ready RAG pipeline using LangChain and Pinecone with real-world patterns.
You built a RAG demo. It worked. Then you plugged in real data — and everything broke.
That’s the part most tutorials skip. A RAG pipeline with LangChain and Pinecone looks clean in a notebook, but once you feed it messy PDFs, long docs, or real user queries, things start falling apart — irrelevant chunks, slow retrieval, and hallucinated answers.
I’ve debugged this more times than I’d like to admit. And the pattern is always the same: it’s not the LLM that fails — it’s the pipeline.
📊 Stat: According to Pinecone (2024), roughly 31% of AI applications now rely on RAG to improve accuracy.
But production-grade systems are still rare, because most implementations stop at “it works” instead of “it scales and stays correct.”
This guide is about building one that actually holds up in production.
Architecting a Production RAG Pipeline
A working RAG demo hides architectural problems that will break under real data.
In production, you should think in two separate pipelines:
- Offline pipeline (indexing): Load → Chunk → Embed → Store in Pinecone
- Online pipeline (query-time): Query → Retrieve → Generate → Respond
The system flow looks like this:

```
User Query
    ↓
Embedding Model
    ↓
Pinecone Vector Search
    ↓
Top-K Documents
    ↓
LLM via LangChain
    ↓
Final Answer
```

LangChain is your orchestration layer — it wires components together. Pinecone is your retrieval engine — it makes semantic search fast.
💡 Tip: If you remember only one thing from this guide: RAG is a data system, not an LLM feature.
Most performance issues come from how data is stored and retrieved — not from which model you use.
This becomes even more important when you integrate it into a real backend (like a FastAPI service or Next.js app), where latency and consistency matter.
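The online pipeline can be composed as a plain function that later drops straight into a FastAPI route. A minimal sketch — `retrieve_fn` and `generate_fn` are hypothetical stand-ins for your retriever and LLM call:

```python
def answer(query: str, retrieve_fn, generate_fn, k: int = 5) -> dict:
    """Online pipeline: retrieve top-k chunks, then generate a grounded answer."""
    chunks = retrieve_fn(query, k)          # e.g. wraps retriever.invoke(query)
    context = "\n\n".join(chunks)           # keep context assembly explicit
    return {"answer": generate_fn(query, context), "sources": chunks}
```

Keeping the pipeline a pure function makes it trivial to swap components and to unit-test with stubs before wiring in real services.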
Building the Knowledge Base (Chunking is Everything)
Bad chunking silently kills your RAG system.
You can have the best model and still get terrible answers if your chunks are poorly structured. From experience, chunk sizes between 250–500 tokens work best for most documentation-style content.
Step 1 — Load Documents
```python
from langchain_community.document_loaders import TextLoader, PyPDFLoader

def load_documents(paths: list[str]):
    docs = []
    for path in paths:
        if path.endswith(".txt"):
            docs.extend(TextLoader(path).load())
        elif path.endswith(".pdf"):
            docs.extend(PyPDFLoader(path).load())
    return docs
```

This part is straightforward — but consistency matters. Normalize everything early (encoding, formatting, etc.).
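"Normalize early" can be as simple as one pure function run over every loaded document — a minimal sketch using only the standard library:

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Standardize unicode form, newlines, and whitespace before chunking."""
    text = unicodedata.normalize("NFC", text)             # one canonical unicode form
    text = text.replace("\r\n", "\n").replace("\r", "\n") # one newline convention
    # collapse runs of spaces/tabs inside each line, drop trailing blanks
    return "\n".join(" ".join(line.split()) for line in text.split("\n")).strip()
```

Running this before chunking means identical content always produces identical chunks — and identical embeddings.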
Step 2 — Smart Chunking
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=75
)
chunks = splitter.split_documents(documents)
# Output: List[Document]
```

The overlap is critical. Without it, you’ll lose context across boundaries.
⚠️ Warning: Common mistake — avoid this.

```python
chunk_overlap = 0  # ❌ breaks semantic continuity
```

This leads to incomplete answers because important context is split across chunks.
Step 3 — Embeddings
```python
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

# embed_documents batches requests — faster and cheaper than
# calling embed_query once per chunk
embeddings = embedding_model.embed_documents(
    [doc.page_content for doc in chunks]
)
```

💡 Tip: Always attach metadata early.
```python
doc.metadata = {
    "source": "docs/api.md",
    "section": "authentication"
}
```

You’ll use metadata later for filtering — skipping this now is painful to fix later.
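Attaching metadata in bulk right after splitting keeps every chunk traceable. A small duck-typed helper — it works on LangChain `Document` objects or anything else with a `.metadata` dict:

```python
def attach_metadata(chunks, source: str):
    """Stamp each chunk with its source and position for later filtering/tracing."""
    for i, doc in enumerate(chunks):
        # chunk_id lets you trace an answer back to the exact chunk later
        doc.metadata.update({"source": source, "chunk_id": i})
    return chunks
```

The `chunk_id` field is an assumption of this sketch, but some positional identifier pays off the first time you need to debug a bad retrieval.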
Embeddings + Pinecone Index Design
Your embedding model and index configuration define what your system can retrieve.
📊 Stat: Approximate nearest neighbor (ANN) research shows these indexes deliver 100–1000× faster search than exact search, with minimal accuracy loss.
This is why vector databases like Pinecone are essential at scale.
Initialize Pinecone
```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
index_name = "rag-index"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

index = pc.Index(index_name)
```

Make sure your dimension matches your embedding model — this is a common failure point.
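A cheap guard at ingestion time catches a dimension mismatch before Pinecone rejects the upsert — a minimal sketch, assuming text-embedding-3-small's 1536 dimensions:

```python
def check_dimensions(vectors, expected: int = 1536):
    """Fail fast if any embedding doesn't match the index dimension."""
    for v in vectors:
        if len(v) != expected:
            raise ValueError(f"Got {len(v)}-dim vector, index expects {expected}")
```

Run it once per batch; the cost is negligible next to the embedding calls themselves.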
Batch Upserts
```python
def upsert_batches(chunks, embeddings, batch_size=100):
    for i in range(0, len(chunks), batch_size):
        batch = []
        for j in range(batch_size):
            if i + j >= len(chunks):
                break
            doc = chunks[i + j]
            batch.append({
                "id": f"doc-{i+j}",
                "values": embeddings[i + j],
                "metadata": doc.metadata
            })
        index.upsert(vectors=batch)
```

⚠️ Warning: Small upserts will destroy ingestion performance. Always batch.
Namespace Strategy
```python
index.upsert(vectors=batch, namespace="docs-v1")
```

Namespaces help you:
- isolate tenants
- version data
- roll out updates safely
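Versioned namespaces enable blue/green-style rollouts: re-index into the next version while queries still hit the current one, validate, then flip traffic. A tiny helper for the version bump — the `docs-vN` naming scheme is this sketch's assumption:

```python
def next_namespace(current: str) -> str:
    """'docs-v1' -> 'docs-v2': re-index into the new namespace, validate,
    then point query traffic at it and delete the old one."""
    base, sep, version = current.rpartition("-v")
    return f"{base}{sep}{int(version) + 1}"
```

Because namespaces are isolated, a botched re-index never corrupts the version serving live traffic.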
Retrieval Layer — Where Most RAG Pipelines Fail
Retrieval quality matters more than model quality in a RAG pipeline with LangChain and Pinecone.
This is where most systems break. You can upgrade your model all day — if retrieval is wrong, your answers will still be wrong.
Create Retriever
```python
from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore(
    index=index,
    embedding=embedding_model
)

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5}
)
```

Keep k small. More context doesn’t mean better answers.
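When your top-5 chunks are near-duplicates, maximal marginal relevance (MMR) helps — LangChain exposes it via `as_retriever(search_type="mmr")`. To see what it actually does, here is a minimal standalone sketch of the algorithm with plain lists and cosine similarity:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Pick k candidate indices balancing relevance to the query (lambda_mult)
    against redundancy with already-selected results (1 - lambda_mult)."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        best, best_score = None, float("-inf")
        for i in remaining:
            relevance = cosine(query_vec, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a low `lambda_mult`, the second pick skips a near-duplicate of the first result in favor of a more diverse chunk — exactly the behavior you want when k is small.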
RetrievalQA Chain
```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

response = qa_chain.invoke({"query": "What is the refund policy?"})
print(response["result"])
# Output:
# "Our refund policy allows returns within 30 days..."
```

⚠️ Warning: Common mistake.
```python
k = 20  # ❌ too many chunks = noisy context
```

More documents introduce noise and reduce answer quality.
Prompt Control (Underrated)
```python
from langchain.prompts import PromptTemplate

template = """
You are a helpful assistant.
Use ONLY the context below.
If the answer is not in the context, say "I don't know".

Context:
{context}

Question:
{question}
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)
```

This reduces hallucination significantly. One detail that's easy to miss: defining the prompt isn't enough — pass `chain_type_kwargs={"prompt": prompt}` to `RetrievalQA.from_chain_type`, or the chain keeps using its default prompt.
Scaling a RAG Pipeline in Production
Scaling RAG is a data systems problem, not an LLM problem.
Systems typically start degrading around 1M+ documents if you don’t optimize retrieval and indexing.
Metadata Filtering
```python
query_embedding = embedding_model.embed_query("pricing plans")

results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"section": {"$eq": "pricing"}}
)
```

Filtering reduces search space and improves both speed and relevance.
Async Ingestion
```python
import asyncio

async def async_upsert(batch):
    # The Pinecone client is synchronous — run it in a worker thread
    # so multiple batches actually overlap instead of blocking the event loop
    await asyncio.to_thread(index.upsert, vectors=batch)

async def ingest(batches):
    await asyncio.gather(*(async_upsert(b) for b in batches))

asyncio.run(ingest(batches))
```

Parallel ingestion becomes critical as your dataset grows.
💡 Tip: Split your data:
- Hot index → frequently accessed
- Cold index → archive
This improves latency and reduces cost.
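A hot/cold split can be as simple as routing by document age at ingestion and merging query results by score. A sketch — the namespace names and the 90-day cutoff are assumptions, not a Pinecone feature:

```python
HOT_MAX_AGE_DAYS = 90  # assumption: "hot" means accessed/updated in the last 90 days

def pick_namespace(doc_age_days: float) -> str:
    """Ingestion side: recent docs go to the hot index, the rest to cold storage."""
    return "docs-hot" if doc_age_days <= HOT_MAX_AGE_DAYS else "docs-cold"

def merge_results(hot, cold, k=5):
    """Query side: cosine scores are comparable across indexes, so a simple
    score-sorted merge of (id, score) matches works."""
    return sorted(hot + cold, key=lambda m: m[1], reverse=True)[:k]
```

In practice most queries can hit only the hot index, with the cold index consulted on demand — that's where the latency and cost savings come from.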
Evaluation, Monitoring, and Real-World Pitfalls
If you’re not evaluating your RAG system, you’re guessing.
Most teams don’t measure:
- retrieval relevance
- answer faithfulness
- hallucination rate
That’s why systems feel unreliable.
Example Evaluation
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# metrics are passed as metric objects, not strings
results = evaluate(
    dataset=my_dataset,
    metrics=[faithfulness, answer_relevancy]
)
print(results)
```

⚠️ Warning: Hard truth.
High retrieval accuracy does NOT guarantee good answers.
The LLM still depends on:
- clean context
- clear instructions
- proper formatting
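One practical lever for "clean context" is building the context string deliberately instead of concatenating raw chunks. A minimal sketch — `docs` here is an assumed list of `(text, source)` pairs from your retriever:

```python
def build_context(docs, max_chars: int = 4000):
    """Dedupe retrieved chunks, label each with its source, cap total length."""
    seen, parts, total = set(), [], 0
    for text, source in docs:
        if text in seen:          # duplicate chunks add noise, not information
            continue
        seen.add(text)
        block = f"[{source}]\n{text}"
        if total + len(block) > max_chars:
            break                 # respect the context budget
        parts.append(block)
        total += len(block)
    return "\n\n".join(parts)
```

Source labels in the context also make it easy to ask the LLM to cite which document an answer came from.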
Deployment and Cost Trade-offs
RAG pipelines fail in production because of infrastructure, not models.
📊 Stat: According to Pinecone (2024), ~55% of organizations now run AI systems in production.
This means reliability and cost matter more than experimentation.
Docker Setup
```dockerfile
FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Cost Reality
You’re paying for:
- embeddings (per token)
- vector storage
- query operations
💡 Tip: Cache aggressively:
- repeated queries
- embeddings
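For embedding caching, an in-process LRU goes a long way before you reach for Redis. A sketch — `embed_fn` is any callable mapping text to a vector:

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize=10_000):
    """Wrap an embedding call so repeated texts never hit the API twice."""
    @lru_cache(maxsize=maxsize)
    def cached(text: str):
        # tuples are hashable, so results can be cached;
        # convert back to list before sending to Pinecone
        return tuple(embed_fn(text))
    return cached
```

Since the same query strings recur constantly in production, even a small cache cuts a meaningful slice of per-token embedding cost.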
Security
- Never expose API keys
- Use environment variables
- Rotate keys regularly
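A small startup guard ties these rules together: read keys from the environment and fail fast if one is missing, instead of surfacing an auth error mid-request. The commented usage line is hypothetical:

```python
import os

def require_env(name: str) -> str:
    """Return the named environment variable, or fail loudly at startup."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# pc = Pinecone(api_key=require_env("PINECONE_API_KEY"))
```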
Conclusion
A production RAG system isn’t about plugging LangChain into Pinecone and calling it done.
It’s about:
- chunking correctly
- retrieving precisely
- scaling intelligently
- evaluating continuously
If you get those right, the LLM becomes the easiest part.
And that’s the real takeaway — building a RAG pipeline with LangChain and Pinecone is less about AI magic, and more about engineering discipline.
Written by
M. Yousuf
Full-Stack Developer learning ML, DL & Agentic AI. Student at GIAIC, building production-ready applications with Next.js, FastAPI, and modern AI tools.