Building a RAG System: From Zero to Production in 2 Weeks

Apr 8, 2025 · 12 min read · By Onesnzeros Team

A step-by-step walkthrough of how we built a production-grade Retrieval-Augmented Generation system for a client.

Our client — a legal services firm in Pune — had 15 years of contract templates, case notes, and compliance documents locked inside PDFs and Word files. Associates spent 2–3 hours daily just searching for the right precedent. We built them a RAG (Retrieval-Augmented Generation) system in two weeks. Here's exactly how we did it.

What is RAG and Why Does It Matter?

Large Language Models like GPT-4 and Claude are trained on general knowledge up to a cutoff date. They don't know about your internal documents, your specific products, or your proprietary processes. RAG solves this by retrieving relevant chunks of your own documents at query time and feeding them to the LLM as context. The result: an AI that answers questions about your business accurately — not generically.
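To make "feeding them to the LLM as context" concrete, here is a minimal sketch of prompt assembly. It is an illustration, not the production code: the function name build_prompt, the chunk keys "source" and "text", and the instruction wording are all assumptions made for this example.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved chunks.

    Each chunk is a dict with the hypothetical keys "source" and "text".
    """
    context_lines = []
    for i, chunk in enumerate(chunks, start=1):
        # Number each chunk so an answer can point back to its source.
        context_lines.append(f"[{i}] ({chunk['source']}) {chunk['text']}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # Hypothetical chunk, standing in for a real retrieval result.
    example_chunks = [
        {"source": "nda_template.docx",
         "text": "The confidentiality period is three years."},
    ]
    print(build_prompt("How long does confidentiality last?", example_chunks))
```

The resulting string is what gets sent to the LLM in place of the bare question, so the model answers from the retrieved text rather than from its general training data.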

Our Architecture

Document ingestion pipeline: PDF, DOCX, and text files → chunked text
Embedding model: OpenAI text-embedding-3-small for vector representations
Vector store: Supabase pgvector for storage and similarity search
Retrieval layer: top-5 most relevant chunks per query
LLM: Claude Sonnet for generation with retrieved context
Frontend: Next.js with streaming responses for real-time output
Auth: Supabase Auth with row-level security on document access
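The retrieval layer boils down to "score every chunk against the query embedding and keep the best five". In this stack that ranking runs inside Postgres via pgvector; the sketch below is an in-memory stand-in in plain Python so the logic is easy to see. The function names and the (chunk_id, vector) row shape are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          rows: list[tuple[str, list[float]]],
          k: int = 5) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real embeddings have many more dimensions.
    rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
    print(top_k([1.0, 0.0], rows, k=2))
```

A brute-force scan like this is fine for a demo; at production scale the same ranking is delegated to the database's vector index.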

Week 1: Building the Foundation

1. Set up the document ingestion pipeline: handle PDF, DOCX, and scanned documents (OCR via Tesseract)
2. Implement the chunking strategy: 512 tokens with a 64-token overlap to preserve context across boundaries
3. Build the embedding pipeline: batch-process all documents through OpenAI's embedding API
4. Set up Supabase with the pgvector extension and create vector similarity search functions
5. Build the retrieval API: given a query, return the top-k most relevant document chunks
6. Create a basic chat interface and connect it end-to-end with streaming
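The chunking step (512 tokens with a 64-token overlap) can be sketched as a sliding window. This is a minimal illustration under one stated assumption: the input is already a list of tokens. In practice the tokens would come from the embedding model's tokenizer; the usage example falls back to whitespace splitting purely as a stand-in. The function name chunk_tokens is hypothetical.

```python
def chunk_tokens(tokens: list, chunk_size: int = 512, overlap: int = 64) -> list[list]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # How far the window advances each time.
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # This window already reached the end of the document.
    return chunks


if __name__ == "__main__":
    # Whitespace split is only a stand-in for a real tokenizer.
    text = "word " * 1000
    chunks = chunk_tokens(text.split())
    print(len(chunks), [len(c) for c in chunks])
```

With the default settings each window starts 448 tokens after the previous one, so the last 64 tokens of one chunk are repeated at the start of the next. That repetition is what keeps a sentence straddling a boundary retrievable from either side.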

Week 2: Production Hardening

1. Implement hybrid search: combine vector similarity with keyword (BM25) search for better recall
2. Add a re-ranking layer to sort retrieved chunks by true relevance before sending them to the LLM
3. Build document source citations: every AI answer shows which document it came from
4. Implement access control: staff can search only the documents they're authorised to see
5. Add query caching for frequently asked questions to reduce API costs
6. Set up monitoring: track query latency, retrieval quality, and LLM token usage
7. Load test with 50 concurrent users and optimise database indexes
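Hybrid search needs some way to merge the vector ranking and the keyword ranking into one list. The post does not say which fusion method was used; one common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of the shipped system. The function name and the constant k=60 (a conventional default for RRF) are illustrative choices.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in,
    where rank starts at 1 for the top result.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]   # Hypothetical vector search results.
    keyword_hits = ["b", "d", "a"]  # Hypothetical BM25 results.
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF works on ranks only, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. Documents that appear in both lists naturally rise to the top.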

Key Challenges We Faced

Scanned PDFs with poor quality — solved by preprocessing with image enhancement before OCR
Chunking strategy: chunks that are too small lose context, while chunks that are too large hurt relevance and can exceed context limits; we settled on 512 tokens with a 64-token overlap
Hallucination when retrieved context was insufficient — solved with confidence thresholds and graceful 'I don't know' responses
Latency — vector search + LLM inference adds up; streaming responses masked the wait time significantly
Embedding costs — batch processing during ingestion is cheap, but re-embedding updated documents needs careful management
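The confidence-threshold idea from the hallucination item can be sketched as a gate in front of the LLM call: if no retrieved chunk scores high enough, skip generation and return a fallback. Everything specific here is an assumption for illustration, including the 0.75 threshold, the fallback wording, the function names, and the generate callable that stands in for the real LLM call.

```python
from typing import Callable, Optional

FALLBACK = "I don't know. I couldn't find this in the documents I have access to."


def select_context(scored_chunks: list[tuple[str, float]],
                   min_score: float = 0.75,
                   min_chunks: int = 1) -> Optional[list[tuple[str, float]]]:
    """Keep chunks at or above min_score; return None if too few remain."""
    kept = [(text, score) for text, score in scored_chunks if score >= min_score]
    return kept if len(kept) >= min_chunks else None


def answer(question: str,
           scored_chunks: list[tuple[str, float]],
           generate: Callable[[str, list[str]], str]) -> str:
    """Call the LLM only when retrieval looks confident enough."""
    context = select_context(scored_chunks)
    if context is None:
        return FALLBACK  # Refuse gracefully instead of guessing.
    return generate(question, [text for text, _ in context])


if __name__ == "__main__":
    # Stand-in for the LLM call: just reports how many chunks it received.
    fake_llm = lambda q, ctx: f"answer based on {len(ctx)} chunk(s)"
    print(answer("What is the notice period?", [("clause 4", 0.82)], fake_llm))
    print(answer("Who won the match?", [("clause 4", 0.21)], fake_llm))
```

The right threshold depends on the embedding model and the corpus, so it is worth tuning against a set of real queries rather than picking a number up front.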

“The firm's associates now find the right document in under 30 seconds on average, compared to 2–3 hours before. That's roughly 500 hours of associate time saved per month across the team.”

Should You Build or Buy?

Off-the-shelf RAG products (Notion AI, SharePoint Copilot) work well for generic use cases. But if you have specific access control requirements, proprietary document formats, or need to integrate with your existing systems, building custom gives you the control you need. The two-week timeline is achievable with the right team — the components are well-understood, the tooling has matured, and the hard problems have known solutions.

