How RAG Development Works: A Step-by-Step Explanation
Understand how retrieval-augmented generation (RAG) works: document ingestion, chunking, embeddings, vector search, and LLM answer generation explained.

The Process, Step by Step
1. Document ingestion and preprocessing. Source documents are collected and converted to plain text. PDFs get parsed (pypdf, pdfplumber, or more robust options like Unstructured.io for complex layouts), HTML gets stripped of navigation and boilerplate (readability-lxml, trafilatura), Word documents get extracted (python-docx, mammoth). This step is messier than it sounds. Real-world documents have tables, footnotes, headers that repeat across pages, multi-column layouts, embedded images, and formatting that loses meaning in plain text. The preprocessing step addresses these issues so the content that enters the pipeline is clean. Budget one to two weeks for this phase for a typical library of 500 to 2,000 documents; teams that treat it as a one-day job almost always redo it later.
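One of those chores, stripping the headers that repeat across pages, can be sketched as a small pure function (a minimal illustration of the idea, not a full parser):

```python
import re
from collections import Counter

def strip_repeated_headers(pages: list[str], min_repeats: int = 3) -> str:
    """Drop lines that recur near the top of many pages (running headers),
    then collapse excess blank lines. A sketch of one preprocessing step."""
    first_lines = Counter()
    for page in pages:
        for line in page.splitlines()[:2]:  # headers live at the top of a page
            first_lines[line.strip()] += 1
    boilerplate = {line for line, n in first_lines.items() if n >= min_repeats and line}
    kept = []
    for page in pages:
        for line in page.splitlines():
            if line.strip() not in boilerplate:
                kept.append(line)
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```

The same frequency trick works for footers by scanning the last lines of each page instead.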
2. Chunking. Each document is split into smaller pieces, called chunks, typically 300 to 600 tokens each (roughly 200 to 400 words). The chunking strategy matters more than most teams realize. Fixed-size chunking splits text every N tokens regardless of sentence or paragraph boundaries, which is fast but can cut important context in half. Semantic chunking splits at logical boundaries (paragraph breaks, section headers, markdown structure) and keeps related content together. Hierarchical chunking creates both small and large chunks and uses the small ones for retrieval and the large ones for context injection. The right strategy depends on your document structure. The wrong chunking strategy means the retrieval step consistently returns partial, decontextualized answers, which is the single most common cause of RAG systems that "mostly work" but degrade user trust.
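The paragraph-boundary strategy can be sketched like this; word counts stand in for token counts here, so treat it as an illustration rather than a production chunker:

```python
def chunk_by_paragraph(text: str, max_words: int = 400) -> list[str]:
    """Semantic-ish chunking: split on paragraph boundaries, then pack
    whole paragraphs into chunks of at most ~max_words words, so a
    paragraph is never cut mid-thought."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for p in paragraphs:
        words = len(p.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A production version would count tokens with the embedding model's tokenizer and split on section headers as well as blank lines.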
3. Embedding generation. Each chunk is sent through an embedding model, which converts the text into a vector: a long list of numbers (1536 dimensions for text-embedding-3-small, 3072 for text-embedding-3-large) that represents the chunk's semantic meaning in a high-dimensional space. Chunks about similar topics end up with similar vectors. This is the mechanism that makes semantic search possible. The embedding model is fixed at this step. Every query at runtime must use the same embedding model, or the similarity math breaks. Cost-wise, embedding a 1,000-document library of roughly 5 million tokens runs about $0.10 on text-embedding-3-small and $0.65 on text-embedding-3-large; the ingestion bill is almost never the binding constraint.
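The cost arithmetic can be wrapped in a small estimator. The per-million-token rates below are the ones implied by the figures in this step; check current provider pricing before budgeting:

```python
# Illustrative per-million-token rates implied by the figures above;
# verify against current provider pricing.
EMBEDDING_RATES = {
    "text-embedding-3-small": 0.02,  # USD per 1M tokens
    "text-embedding-3-large": 0.13,
}

def embedding_cost(total_tokens: int, model: str) -> float:
    """Estimated one-time cost in USD to embed a corpus of total_tokens."""
    return total_tokens / 1_000_000 * EMBEDDING_RATES[model]
```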
4. Vector database storage. The vectors are stored in a vector database alongside the original chunk text and metadata (source document name, page number, section header, date, author, permission tags, document type). The metadata is what allows the system to tell you where an answer came from and to filter results. Without metadata, you get answers with no attribution, which users rightfully do not trust. Without permission tags, you cannot enforce document-level access control, which is a compliance problem for any system touching HR, legal, or customer data. Pinecone, Weaviate, Qdrant, and pgvector all support filtered vector search; make sure your implementation uses it.
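The permission-tag filtering described here looks roughly like the following, shown client-side for illustration; in production the filter is pushed into the vector database query itself so unauthorized chunks never leave the store:

```python
def visible_chunks(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Document-level access control: a chunk is visible only if the user
    shares at least one permission tag with it. Sketch of the filter a
    vector DB applies server-side during filtered search."""
    return [c for c in chunks if user_groups & set(c["metadata"]["permissions"])]
```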
5. Query processing at runtime. When a user submits a question, the query goes through the same embedding model used during ingestion. The resulting vector is compared against all stored chunk vectors using a similarity algorithm (cosine similarity is standard, though dot product and Euclidean distance are also used depending on the embedding model). The top K most similar chunks are retrieved, typically 3 to 10 depending on context window size and the complexity of the expected answer. More sophisticated implementations layer in hybrid search (combining semantic similarity with keyword-based BM25 search) and a reranking step (a smaller cross-encoder model like Cohere Rerank or bge-reranker that reorders the top 50 results to surface the best 5). Reranking typically costs pennies per query and improves retrieval precision by 15 to 25% on real queries.
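A brute-force version of the similarity step, for illustration only; real vector databases use approximate indexes (HNSW and similar) rather than scanning every vector:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], chunk_vecs: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the IDs of the k chunks most similar to the query vector."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]
```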
6. Context injection and generation. The retrieved chunks are inserted into the LLM's context window alongside the user's question and a system prompt that instructs the model to answer using only the provided context. The prompt matters as much as the retrieval. A production-grade prompt explicitly tells the model: cite your sources by chunk ID, say "I do not have that information in the provided documents" when context is insufficient, refuse to speculate beyond what is in the context, and answer in a specific tone matching the brand voice. The model then generates a response. This retrieved context injection is the core mechanism that distinguishes RAG from a plain chatbot. Without it, the model falls back on its training data, which may be outdated or simply wrong for your specific content.
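A minimal version of that prompt assembly; the exact wording below is illustrative and should be tuned against your own eval set:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt: system instructions, retrieved chunks
    labeled by ID for citation, then the user's question."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer using ONLY the context below. Cite chunk IDs in brackets. "
        "If the context is insufficient, reply exactly: "
        '"I do not have that information in the provided documents."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```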
7. Response with attribution. The final response includes the answer and citations pointing back to the source documents and specific chunks. Attribution is not optional in production systems. It is the audit mechanism that lets users verify answers and lets operators identify when retrieval is pulling the wrong content. A well-implemented UI renders citations as clickable links that jump the user directly to the source document and the highlighted passage. This is the difference between a RAG system users trust and one they quietly stop using.
Where Things Go Wrong
Bad chunking loses context across chunk boundaries. If a policy document describes a rule in one paragraph and the exception in the next, and your chunking split them into separate chunks, retrieval might return the rule without the exception. Users get incomplete answers that are technically grounded in your documents but misleading in practice. A common fix is chunk overlap (typically 10 to 20% of chunk size) so adjacent chunks share some content, but this is a patch on a larger problem. The real fix is chunking strategies that respect document structure: split on section headers, preserve list items together, keep table rows in the same chunk as their headers. Chunking strategy requires testing with real query examples, not assumptions alone.
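The overlap patch reduces to a sliding window; `overlap_ratio` here is the 10-to-20% figure mentioned above:

```python
def overlapping_chunks(words: list[str], size: int = 400, overlap_ratio: float = 0.15) -> list[str]:
    """Fixed-size chunking where adjacent chunks share ~overlap_ratio of
    their content, so a rule and its exception are more likely to land
    in the same chunk. A patch, as noted, not a substitute for
    structure-aware splitting."""
    step = max(1, int(size * (1 - overlap_ratio)))
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step) if words[i:i + size]]
```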
Retrieval misses when the embedding model does not match query style. Embedding models are trained on specific data distributions. A model trained on general web text may not perform well on highly technical domain-specific queries, or on queries in languages other than English, or on queries using acronyms your documents spell out in full. If users ask "what is the SLA on P1 tickets?" but your documentation says "service level agreement for priority-one incidents," the semantic similarity may be lower than expected. Fixes include domain-specific embedding models (BioBERT for medical, FinBERT for financial, or fine-tuned embeddings on your own content), query rewriting at the preprocessing step (an LLM call that expands acronyms and paraphrases the query before embedding), or hybrid search that catches exact keyword matches the semantic search misses.
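The cheapest form of query rewriting is an acronym table mined from your own documents; the table below is hypothetical, and an LLM rewriting call is the heavier-weight alternative:

```python
# Hypothetical acronym table; in practice this is mined from your docs
# or replaced by an LLM call that paraphrases the query before embedding.
ACRONYMS = {"SLA": "service level agreement", "P1": "priority-one"}

def expand_query(query: str) -> str:
    """Expand known acronyms so the query's wording matches the documents'.
    The original term is kept, so exact-match (BM25) search still fires."""
    out = query
    for short, long in ACRONYMS.items():
        if short in out:
            out = out.replace(short, f"{short} ({long})")
    return out
```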
Hallucination when retrieved documents are insufficient. When the system retrieves chunks that are tangentially related but do not actually answer the question, the LLM tries to synthesize an answer from insufficient evidence. The result is a confident-sounding answer that is partially fabricated. This is the most dangerous failure mode because it looks like success until someone verifies. Mitigation requires: explicit instructions to the model to say "I do not know" when the context is insufficient, minimum similarity thresholds (we typically set 0.7 to 0.75 on cosine similarity for production systems) that reject low-confidence retrievals, confidence scoring on the final answer, and user feedback mechanisms that flag wrong answers for review.
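The similarity-threshold gate might look like this; the default sits in the 0.70-to-0.75 band mentioned above, but the right value depends on your embedding model, since score distributions differ across models:

```python
REFUSAL = "I do not have that information in the provided documents."

def gate_retrieval(scored_chunks: list[tuple[float, str]], threshold: float = 0.72):
    """Reject low-confidence retrievals before they reach the LLM.
    Returns the chunks that clear the threshold, or None, meaning:
    answer with REFUSAL and skip the generation call entirely."""
    kept = [text for score, text in scored_chunks if score >= threshold]
    return kept if kept else None
```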
No strategy for stale documents. Your knowledge base changes. Policy documents get updated. Product specs change. Pricing tiers shift. A RAG system ingested once and never updated serves answers from old documents. One incident we diagnosed involved a support chatbot still quoting a return policy that had been updated seven months earlier because nobody had wired the RAG ingestion pipeline to the document management system. Production RAG systems need a re-ingestion pipeline that detects changed documents (webhook from Google Drive, SharePoint, Notion, or Confluence, or a daily scan with content hashing), removes the old chunks, generates new embeddings, and updates the vector database, ideally automatically when source documents change.
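The daily-scan variant of change detection reduces to content hashing, sketched here:

```python
import hashlib

def changed_docs(current: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Daily-scan change detection: hash each document's current content
    and compare against the hash recorded at last ingestion. Returns the
    doc IDs that need re-chunking and re-embedding (new docs included)."""
    stale = []
    for doc_id, text in current.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != h:
            stale.append(doc_id)
    return stale
```

Deleted documents need the inverse check (stored IDs missing from the current scan) so their chunks get removed from the vector database.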
No evaluation framework for regressions. LLM providers update their models. Your team updates prompts. New documents are added that conflict with old ones. Any of these can quietly degrade answer quality. Without a regression test suite (Ragas, TruLens, LangSmith evals, or a custom harness running 100 to 500 canonical queries with known-correct answers), drift is invisible until a user complaint surfaces it. Budget for weekly automated evals and a dashboard that tracks accuracy over time.
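A minimal substring-based harness shows the shape of such a suite; tools like Ragas score semantic overlap instead of exact substrings, which this sketch does not attempt:

```python
def run_eval(rag_answer, cases: list[dict]) -> float:
    """Minimal regression harness: each case pairs a question with phrases
    the answer must contain. Returns accuracy over the suite; run it on
    every prompt, model, or corpus change and track the score over time."""
    passed = 0
    for case in cases:
        answer = rag_answer(case["question"]).lower()
        if all(p.lower() in answer for p in case["must_contain"]):
            passed += 1
    return passed / len(cases)
```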
What the Output Looks Like
A completed RAG system delivers:
- A question-answering interface (chatbot widget on your website, Slack app, API endpoint, or search bar embedded in an internal tool) connected to your document library.
- Source citations on every answer with clickable links to the original content.
- An ingestion pipeline for adding new documents (ideally automated via your document management system).
- A monitoring dashboard showing query volume, retrieval confidence scores, cost per query, and flagged low-confidence answers.
- A testing suite that validates the system against a set of known correct answers.
The system is not static after launch. Expect an iteration cycle in the first 60 to 90 days where the team tunes chunking, adjusts the prompt, and closes gaps in the source documents that real user queries expose. Plan for 10 to 20% of the original build budget to be spent in this phase; it is the difference between a system that peaks at launch and one that compounds in value over time.
How to Evaluate Your Options
Before committing to a custom RAG build, pressure-test the decision against three alternatives. First, could a well-configured off-the-shelf tool solve this? Intercom Fin, Zendesk AI, Glean, and Sana Labs all offer document-grounded answering with reasonable out-of-the-box performance, and the annual cost of $15,000 to $80,000 may beat the total cost of a custom build for simple use cases. Second, could you solve the retrieval problem without an LLM at all? A modern search tool (Algolia, Elastic, or Typesense) with good metadata and a clean UI answers many "where is the information" problems without the hallucination risk. Third, is this a small enough content library that a well-prompted LLM with the full content in its context (now possible with Claude's 200,000-token context window) outperforms a complex retrieval pipeline? For libraries under 300 pages, sometimes yes.
If you clear those filters, the custom-build evaluation comes down to vendor quality and scope discipline. Ask any prospective builder to walk through a prior RAG implementation in detail: which embedding model and why, chunk size and overlap strategy, reranking or not, evaluation metrics and current performance, how ingestion is automated, how access control is enforced, and what the monthly operating cost looks like at your projected query volume. Vague answers are a flag. Good answers from a builder who has seen the failure modes are the signal you want.
How Long It Takes
- Week 1: Content audit, document preprocessing, and chunking strategy definition.
- Week 2: Embedding pipeline build, vector database setup, and initial ingestion.
- Week 3: Retrieval tuning, reranking integration, generation prompt development, and accuracy testing.
- Week 4: Interface integration, attribution implementation, monitoring dashboard, and user acceptance testing.
A well-scoped RAG system against a clean, organized document library takes 3 to 4 weeks. Systems with many document sources, inconsistent formatting, or complex multi-document synthesis requirements take 6 to 8 weeks. Enterprise deployments with SOC 2 requirements, multi-region hosting, and integrations into SSO and document management systems take 10 to 16 weeks.
Frequently Asked Questions
### Is RAG the same as fine-tuning?
No, and the distinction matters for cost and use case. Fine-tuning trains the model itself on new data, which changes the model's weights permanently. It is expensive (often $5,000 to $50,000 per fine-tuning run depending on model and data volume), requires significant data preparation (thousands of high-quality labeled examples), and does not work well for content that changes frequently because every update requires another training run. RAG does not touch the model. It retrieves relevant content at query time and injects it into the model's context. For most business knowledge base use cases, RAG is the right approach. Fine-tuning is better for style adaptation, domain-specific reasoning patterns, or high-volume narrow tasks where the per-query cost savings of a smaller fine-tuned model compound meaningfully.
### How accurate is the system?
Accuracy depends heavily on document quality, chunking strategy, reranking, prompt design, and the difficulty of the queries. On clean, well-structured content with clear factual questions, production RAG systems routinely achieve 85 to 95% accuracy on a benchmark test set. On ambiguous queries, sparse documentation, or questions that require synthesizing many documents, accuracy drops to 65 to 80%. Testing against a representative set of 100 or more real questions before deployment gives you an honest baseline, and running the same eval monthly after launch catches drift.
### Can the system handle documents in different formats?
Yes. Modern document parsers handle PDF, Word, HTML, Markdown, plain text, and structured data like CSV and JSON. PDFs with complex tables or scanned image content require additional processing (OCR for scanned pages using AWS Textract, Google Document AI, or open-source Tesseract; table extraction libraries like Camelot or Unstructured's table parser for complex tables). The preprocessing step addresses format diversity, but plan for format-specific quirks requiring custom handling. Scanned low-quality PDFs are the single most common source of ingestion headaches and often require manual content cleanup.
### How do we keep the knowledge base current?
Through an automated re-ingestion pipeline. When a source document changes, the pipeline detects the change (webhook from the document system, or a daily scan with content hashing), removes the old chunks from the vector database, re-processes the updated document, generates new embeddings, and stores them. For most organizations, connecting the ingestion pipeline to the document management system they already use (SharePoint, Google Drive, Notion, Confluence, Box, Dropbox) provides the trigger mechanism for automatic updates. Without this automation, the system will drift out of date within weeks of launch.
### What does a production RAG system cost to operate?
Operating costs break into four components. Vector database hosting ranges from $100 per month (pgvector on an existing PostgreSQL instance or a small Qdrant deployment) to $2,000 per month (Pinecone at scale or large Weaviate clusters). LLM API costs for generation typically run $0.003 to $0.02 per query on Claude or GPT-4 class models, so 10,000 queries per month lands between $30 and $200. Embedding costs for re-ingesting changed content are usually under $50 per month. Monitoring and eval infrastructure (Langfuse, LangSmith, Datadog) runs $100 to $500 per month. Add human maintenance time at roughly 4 to 12 hours per month and the typical mid-market deployment costs $1,200 to $3,500 per month all-in to operate after launch.
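Those components roll up into a back-of-envelope model; every default below is an illustrative mid-range assumption, not a quote:

```python
def monthly_operating_cost(queries: int, cost_per_query: float = 0.01,
                           vector_db: float = 300.0, embeddings: float = 50.0,
                           monitoring: float = 250.0,
                           maintenance_hours: float = 8.0,
                           hourly_rate: float = 120.0) -> float:
    """Sum the four infrastructure components plus human maintenance time.
    All defaults are illustrative mid-range figures for a mid-market
    deployment; substitute your own."""
    return (queries * cost_per_query + vector_db + embeddings
            + monitoring + maintenance_hours * hourly_rate)
```

At 10,000 queries per month the defaults land within the $1,200-to-$3,500 range quoted above.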
### How does a RAG system connect to the rest of our marketing and website stack?
RAG systems usually present through a user interface: a chat widget on the website, a support tool integration, a Slack or Teams app, or an internal admin surface. The quality of that interface determines whether users actually adopt the system, so coordination with the team responsible for your website design and UI/UX work matters. The content powering the RAG system often overlaps with content that serves SEO goals, and organizations that maintain a single high-quality content library for both purposes see better results than those running parallel content operations. Hosting decisions also ripple through; a RAG system embedded on a marketing site needs to fit within the architecture your web hosting and maintenance provider supports.
Ready to put this into action?
We help businesses implement the strategies in these guides. Talk to our team.