BuildnScale
RAG · LangChain · Pinecone · FastAPI · LLM · Software Engineering

Building a Production RAG Pipeline with LangChain and Pinecone

Move beyond demos. Learn how to build, scale, and evaluate a production-ready RAG pipeline using LangChain and Pinecone with real-world patterns.

M. Yousuf
Mar 26, 2026 · 14 min read

You built a RAG demo. It worked. Then you plugged in real data — and everything broke.

That’s the part most tutorials skip. A RAG pipeline with LangChain and Pinecone looks clean in a notebook, but once you feed it messy PDFs, long docs, or real user queries, things start falling apart — irrelevant chunks, slow retrieval, and hallucinated answers.

I’ve debugged this more times than I’d like to admit. And the pattern is always the same: it’s not the LLM that fails — it’s the pipeline.

📊 Stat: According to Pinecone (2024), roughly 31% of AI applications now rely on RAG to improve accuracy.

But production-grade systems are still rare, because most implementations stop at “it works” instead of “it scales and stays correct.”

This guide is about building one that actually holds up in production.

Architecting a Production RAG Pipeline

A working RAG demo hides architectural problems that will break under real data.

In production, you should think in two separate pipelines:

  • Offline pipeline (indexing): Load → Chunk → Embed → Store in Pinecone
  • Online pipeline (query-time): Query → Retrieve → Generate → Respond

The system flow looks like this:

User Query → Embedding Model → Pinecone Vector Search → Top-K Documents → LLM via LangChain → Final Answer

LangChain is your orchestration layer — it wires components together. Pinecone is your retrieval engine — it makes semantic search fast.

💡 Tip: If you only take one thing: RAG is a data system, not an LLM feature.

Most performance issues come from how data is stored and retrieved — not from which model you use.

This becomes even more important when you integrate it into a real backend (like a FastAPI service or Next.js app), where latency and consistency matter.

Building the Knowledge Base (Chunking is Everything)

Bad chunking silently kills your RAG system.

You can have the best model and still get terrible answers if your chunks are poorly structured. From experience, chunk sizes between 250–500 tokens work best for most documentation-style content.

Step 1 — Load Documents

from langchain_community.document_loaders import TextLoader, PyPDFLoader
 
def load_documents(paths: list[str]):
    docs = []
    for path in paths:
        if path.endswith(".txt"):
            docs.extend(TextLoader(path).load())
        elif path.endswith(".pdf"):
            docs.extend(PyPDFLoader(path).load())
    return docs

This part is straightforward — but consistency matters. Normalize everything early (encoding, formatting, etc.).
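A minimal normalization pass might look like this. The specific rules here (NFKC, whitespace collapsing) are illustrative, not exhaustive — tune them to your corpus:

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Normalize unicode forms, collapse whitespace runs, trim edges."""
    text = unicodedata.normalize("NFKC", text)   # unify unicode variants (incl. non-breaking spaces)
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)       # cap consecutive blank lines
    return text.strip()
```

Run this over every document before chunking so that identical content always produces identical chunks and embeddings.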

Step 2 — Smart Chunking

from langchain_text_splitters import RecursiveCharacterTextSplitter
 
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=75
)
 
chunks = splitter.split_documents(documents)
 
# Output:
# List[Document]

The overlap is critical. Without it, you’ll lose context across boundaries.

⚠️ Warning: Common mistake — avoid this.

chunk_overlap = 0  # ❌ breaks semantic continuity

This leads to incomplete answers because important context is split across chunks.

Step 3 — Embeddings

from langchain_openai import OpenAIEmbeddings
 
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-small"
)
 
embeddings = embedding_model.embed_documents(
    [doc.page_content for doc in chunks]
)

💡 Tip: Always attach metadata early.

doc.metadata = {
    "source": "docs/api.md",
    "section": "authentication"
}

You’ll use metadata later for filtering — skipping this now is painful to fix later.

Embeddings + Pinecone Index Design

Your embedding model and index configuration define what your system can retrieve.

📊 Stat: Approximate nearest neighbor (ANN) indexes deliver roughly 100–1000× faster search than exact nearest-neighbor search, with minimal accuracy loss.

This is why vector databases like Pinecone are essential at scale.

Initialize Pinecone

from pinecone import Pinecone, ServerlessSpec
 
pc = Pinecone(api_key="YOUR_API_KEY")
 
index_name = "rag-index"
 
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )
 
index = pc.Index(index_name)

Make sure your dimension matches your embedding model — this is a common failure point.
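A cheap guard catches this at ingestion time instead of at query time. This helper is illustrative (the name and default are not from any SDK); 1536 matches `text-embedding-3-small`:

```python
def check_dimension(vector, expected_dim: int = 1536):
    """Fail fast if an embedding doesn't match the index dimension."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"Embedding has {len(vector)} dims, index expects {expected_dim}. "
            "Did the embedding model change?"
        )
    return vector
```

Call it on each vector (or a sample) before upserting; a dimension mismatch otherwise surfaces as an opaque API error deep in ingestion.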

Batch Upserts

def upsert_batches(chunks, embeddings, batch_size=100):
    for i in range(0, len(chunks), batch_size):
        batch = [
            {
                "id": f"doc-{i + j}",
                "values": embeddings[i + j],
                "metadata": doc.metadata,
            }
            for j, doc in enumerate(chunks[i:i + batch_size])
        ]
        index.upsert(vectors=batch)

⚠️ Warning: Small upserts will destroy ingestion performance. Always batch.

Namespace Strategy

index.upsert(vectors=batch, namespace="docs-v1")

Namespaces help you:

  • isolate tenants
  • version data
  • roll out updates safely

Retrieval Layer — Where Most RAG Pipelines Fail

Retrieval quality matters more than model quality in a RAG pipeline with LangChain and Pinecone.

This is where most systems break. You can upgrade your model all day — if retrieval is wrong, your answers will still be wrong.

Create Retriever

from langchain_pinecone import PineconeVectorStore
 
vectorstore = PineconeVectorStore(
    index=index,
    embedding=embedding_model
)
 
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5}
)

Keep k small. More context doesn’t mean better answers.

RetrievalQA Chain

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
 
llm = ChatOpenAI(model="gpt-4o-mini")
 
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)
 
response = qa_chain.invoke({"query": "What is the refund policy?"})
 
print(response["result"])
 
# Output:
# "Our refund policy allows returns within 30 days..."

⚠️ Warning: Common mistake.

k = 20  # ❌ too many chunks = noisy context

More documents introduce noise and reduce answer quality.

Prompt Control (Underrated)

from langchain_core.prompts import PromptTemplate
 
template = """
You are a helpful assistant.
 
Use ONLY the context below.
If the answer is not in the context, say "I don't know".
 
Context:
{context}
 
Question:
{question}
"""
 
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)
 
# Wire the prompt into the chain so it is actually applied:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

This reduces hallucination significantly.

Scaling a RAG Pipeline in Production

Scaling RAG is a data systems problem, not an LLM problem.

Systems typically start degrading around 1M+ documents if you don’t optimize retrieval and indexing.

Metadata Filtering

query_embedding = embedding_model.embed_query("pricing plans")
 
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"section": {"$eq": "pricing"}}
)

Filtering reduces search space and improves both speed and relevance.

Async Ingestion

import asyncio
 
async def async_upsert(batch):
    # index.upsert is blocking; run it in a thread so batches overlap
    await asyncio.to_thread(index.upsert, vectors=batch)
 
await asyncio.gather(*[async_upsert(b) for b in batches])

Parallel ingestion becomes critical as your dataset grows.

💡 Tip: Split your data:

  • Hot index → frequently accessed
  • Cold index → archive

This improves latency and reduces cost.
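The routing itself can be trivial. Here is one hedged sketch: route by a metadata field, with the "hot" sections configured explicitly (the function, field names, and sections are all illustrative — the indexes would be two separate Pinecone index handles):

```python
def pick_index(query_meta: dict, hot_index, cold_index,
               hot_sections=frozenset({"docs", "pricing"})):
    """Route a query to the hot index for frequently accessed sections,
    otherwise fall back to the cold (archive) index."""
    if query_meta.get("section") in hot_sections:
        return hot_index
    return cold_index
```

The decision logic can grow later (recency, tenant tier, access counts); the important part is that it lives in one place instead of being scattered across query code.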

Evaluation, Monitoring, and Real-World Pitfalls

If you’re not evaluating your RAG system, you’re guessing.

Most teams don’t measure:

  • retrieval relevance
  • answer faithfulness
  • hallucination rate

That’s why systems feel unreliable.

Example Evaluation

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
 
results = evaluate(
    dataset=my_dataset,
    metrics=[faithfulness, answer_relevancy]
)
 
print(results)

⚠️ Warning: Hard truth.

High retrieval accuracy does NOT guarantee good answers.

The LLM still depends on:

  • clean context
  • clear instructions
  • proper formatting

Deployment and Cost Trade-offs

RAG pipelines fail in production because of infrastructure, not models.

📊 Stat: According to Pinecone (2024), ~55% of organizations now run AI systems in production.

This means reliability and cost matter more than experimentation.

Docker Setup

FROM python:3.12-slim
 
WORKDIR /app
 
COPY requirements.txt .
RUN pip install -r requirements.txt
 
COPY . .
 
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Cost Reality

You’re paying for:

  • embeddings (per token)
  • vector storage
  • query operations

💡 Tip: Cache aggressively:

  • repeated queries
  • embeddings

Security

  • Never expose API keys
  • Use environment variables
  • Rotate keys regularly
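In practice that means reading keys from the environment and failing loudly when one is missing, rather than hardcoding `"YOUR_API_KEY"` as in the snippets above. A small helper (the function name is illustrative):

```python
import os

def get_api_key(name: str = "PINECONE_API_KEY") -> str:
    """Read an API key from the environment; fail loudly if it is absent."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Missing environment variable: {name}")
    return key
```

A loud failure at startup beats a confusing authentication error mid-request, and keeping the lookup in one function makes key rotation a deploy-time concern only.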

Conclusion

A production RAG system isn’t about plugging LangChain into Pinecone and calling it done.

It’s about:

  • chunking correctly
  • retrieving precisely
  • scaling intelligently
  • evaluating continuously

If you get those right, the LLM becomes the easiest part.

And that’s the real takeaway — building a RAG pipeline with LangChain and Pinecone is less about AI magic, and more about engineering discipline.


Written by

M. Yousuf

Full-Stack Developer learning ML, DL & Agentic AI. Student at GIAIC, building production-ready applications with Next.js, FastAPI, and modern AI tools.
