
RAG Systems in Production: Beyond the Tutorial

Every RAG tutorial follows the same script: embed some documents, store them in a vector database, retrieve the top-k results, pass them to an LLM, and marvel at the response. It takes about 45 minutes to set up. It takes about 45 days to realize it doesn't work in production.

I built an enterprise RAG system that handles 45+ domain-specific knowledge collections, achieves 86% retrieval accuracy at 22ms average latency, and serves as the knowledge backbone for our AI agent fleet. Getting there required throwing out most of what the tutorials teach you and rebuilding from production constraints up.

The Gap Between Tutorial RAG and Production RAG

Tutorial RAG operates in a controlled environment: clean documents, homogeneous data, simple queries, and forgiving accuracy requirements. Production RAG operates in chaos: documents in 15 different formats, data spread across departments with different terminology for the same concepts, queries that range from "what's the leave policy" to complex multi-hop reasoning, and accuracy requirements where wrong answers are worse than no answers.

The first version of our RAG system — built following tutorial patterns — achieved about 52% retrieval accuracy. Half the time, the system returned relevant information. The other half, it confidently presented irrelevant content or hallucinated connections between unrelated documents. That's not a knowledge system; that's a liability.

Architecture Decisions: Why Qdrant + Neo4j Hybrid

The single biggest architectural decision was moving from pure vector search to a hybrid Qdrant + Neo4j architecture. Here's why:

Vector search (Qdrant) is excellent at semantic similarity — finding documents that mean similar things to the query. But enterprise knowledge isn't just about similarity. It's about relationships. "Who approved this purchase order?" isn't a similarity question — it's a graph traversal question. "What policies apply to this department for this type of expense?" requires understanding hierarchical relationships between organizational entities.

Neo4j handles the relationship layer. We model organizational hierarchies, document ownership chains, process dependencies, and cross-reference relationships as graph edges. When a query comes in, the system determines whether it needs semantic retrieval (Qdrant), relationship traversal (Neo4j), or both — and merges the results.

💡 Architecture Pattern

Qdrant for "what is similar to this?" — Neo4j for "what is connected to this?" — Redis for "have we answered this before?" The three layers together handle 90%+ of enterprise knowledge queries.
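The routing idea above can be sketched in a few lines. The backend calls here are stand-ins for real Qdrant and Neo4j queries, and the relationship cues are illustrative, not the production classifier:

```python
# Sketch: decide whether a query needs semantic search, graph
# traversal, or both, then merge the results. Backends are passed
# in as callables returning (doc_id, score) pairs.

RELATIONSHIP_CUES = {"who approved", "reports to", "owns", "applies to", "depends on"}

def route(query: str) -> set:
    """Return the set of backends a query should hit."""
    q = query.lower()
    routes = {"vector"}  # semantic retrieval is the default
    if any(cue in q for cue in RELATIONSHIP_CUES):
        routes.add("graph")
    return routes

def answer(query, vector_search, graph_search):
    routes = route(query)
    hits = []
    if "vector" in routes:
        hits.extend(vector_search(query))
    if "graph" in routes:
        hits.extend(graph_search(query))
    # Merge by document id, keeping each document's best score
    seen, merged = set(), []
    for doc_id, score in sorted(hits, key=lambda h: -h[1]):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append((doc_id, score))
    return merged
```

In production the route decision would come from a trained classifier rather than keyword cues, but the shape — classify, fan out, merge — is the same.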

Building 45+ Domain-Specific Collections

One of the earliest mistakes was treating all enterprise knowledge as a single collection. When you dump finance policies, technical documentation, HR guidelines, and sales playbooks into one vector space, the embedding model can't distinguish between them effectively. A query about "returns" retrieves a mix of product returns (sales), tax returns (finance), and return types (programming documentation).

The solution was domain-specific collections. Each department, each knowledge domain, gets its own Qdrant collection with its own embedding configuration. We currently maintain 45+ collections spanning:

  • Operations: SOPs, process documentation, equipment manuals, safety protocols
  • Finance: Accounting policies, compliance requirements, audit trails, tax regulations
  • Sales: Product catalogs, pricing rules, customer communication templates, CRM data
  • Technical: API documentation, architecture diagrams, codebase analysis, deployment guides
  • HR: Leave policies, compensation structures, onboarding guides, performance criteria

A query router sits in front of all collections and classifies incoming queries by domain before dispatching to the appropriate collection(s). This single change improved retrieval accuracy from 52% to 71%. Adding the Neo4j relationship layer pushed it to 86%.
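A minimal sketch of the domain router, with keyword rules standing in for the classifier and hypothetical collection names:

```python
# Sketch: classify a query by domain before dispatching it to a
# Qdrant collection. Keyword matching illustrates the interface;
# production would use a trained classifier.

DOMAIN_KEYWORDS = {
    "hr_policies":     ["leave", "onboarding", "compensation"],
    "finance_docs":    ["tax", "audit", "invoice", "expense"],
    "sales_playbooks": ["pricing", "customer", "catalog"],
    "tech_docs":       ["deployment", "architecture", "endpoint"],
}

def route_query(query: str, default="general_kb") -> list:
    """Return the collection(s) a query should be dispatched to."""
    q = query.lower()
    matches = [coll for coll, kws in DOMAIN_KEYWORDS.items()
               if any(kw in q for kw in kws)]
    return matches or [default]
```

A query can legitimately match more than one domain — "expense policy for customer visits" touches both finance and sales — so the router returns a list and the caller searches every matched collection.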

Achieving 86% Accuracy at 22ms Latency

The accuracy improvements came from three layers of optimization stacked on top of each other:

Layer 1: Better chunking. Tutorials typically chunk documents by character count — 500 characters per chunk, 100-character overlap. That's crude. We chunk by semantic boundaries — paragraph breaks, section headers, logical thought units. For structured documents (policies, SOPs), we preserve the hierarchical structure so each chunk retains its context within the broader document.
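A toy version of boundary-based chunking, assuming paragraphs are separated by blank lines. The heading heuristic (short lines without terminal punctuation) is a simplification — real documents get their structure from the parser:

```python
# Sketch: split on paragraph boundaries and prefix each chunk with
# its most recent heading so context survives retrieval. Character
# splitting remains only as a fallback for oversized paragraphs.

def chunk_by_paragraphs(text: str, max_chars: int = 800) -> list:
    chunks, heading = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para) < 80 and not para.endswith("."):
            heading = para  # treat short, unterminated lines as headings
            continue
        body = f"{heading}\n{para}" if heading else para
        for i in range(0, len(body), max_chars):
            chunks.append(body[i:i + max_chars])
    return chunks
```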

Layer 2: Query expansion. Users don't always ask clear questions. "Where's the leave thing?" should retrieve the same documents as "What is the company leave policy for FY2025-26?" We expand ambiguous queries using a lightweight LLM call that generates 2-3 reformulations, then merge the retrieval results.
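The expand-and-merge step looks roughly like this. `reformulate` stands in for the lightweight LLM call; results are merged by document id, keeping each document's best score:

```python
# Sketch: retrieve for the original query plus its reformulations,
# then merge by document id, keeping the best score per document.

def expand_and_retrieve(query, reformulate, retrieve, n_variants=3):
    variants = [query] + reformulate(query)[:n_variants]
    best = {}
    for v in variants:
        for doc_id, score in retrieve(v):
            if score > best.get(doc_id, 0.0):
                best[doc_id] = score
    return sorted(best.items(), key=lambda kv: -kv[1])
```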

Layer 3: Re-ranking. Initial retrieval returns the top 20 candidates. A cross-encoder re-ranker scores each candidate against the original query and re-orders them by relevance. The top 5 after re-ranking are significantly more accurate than the top 5 from raw vector search.
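The re-ranking stage reduces to a score-and-sort. `score_pair` stands in for a cross-encoder forward pass (e.g. a sentence-transformers `CrossEncoder`); here it is any callable taking a query and a candidate text:

```python
# Sketch: score each retrieved candidate against the query with a
# cross-encoder-style scorer, keep the top `keep` by score.

def rerank(query, candidates, score_pair, keep=5):
    """candidates: list of (doc_id, text). Returns top `keep` by score."""
    scored = [(doc_id, score_pair(query, text)) for doc_id, text in candidates]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:keep]
```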

The 22ms latency target was non-negotiable — the system feeds into real-time AI agents that users interact with conversationally. Latency above 50ms creates perceptible pauses. We hit the target through aggressive caching (Redis), pre-computed embeddings, and keeping the Qdrant collections in-memory on dedicated hardware.

Document Ingestion with Docling

Enterprise documents are messy. PDFs with tables. Scanned images with OCR artifacts. Word documents with embedded charts. Excel files with merged cells. HTML exports from legacy systems with broken formatting.

We use Docling as the document processing layer. It handles the conversion of diverse document formats into clean, structured text while preserving layout information — table structures, list hierarchies, header levels. This is critical because chunking quality depends entirely on input quality. Garbage in, garbage retrieval out.

The ingestion pipeline runs asynchronously. New documents dropped into monitored directories are automatically processed, chunked, embedded, and indexed. The pipeline handles deduplication (same document updated multiple times), versioning (newer versions supersede older ones), and provenance tracking (every chunk links back to its source document, page, and section).
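The deduplication and versioning logic can be sketched with a content hash per source path. The in-memory dict stands in for the real vector store's metadata layer:

```python
# Sketch: skip byte-identical re-ingests via a content hash; a
# changed file at the same path gets a new version that supersedes
# the old one.

import hashlib

class IngestIndex:
    def __init__(self):
        self.by_path = {}  # source path -> (version, content_hash)

    def ingest(self, path: str, content: str) -> str:
        digest = hashlib.sha256(content.encode()).hexdigest()
        prev = self.by_path.get(path)
        if prev and prev[1] == digest:
            return "skipped"          # duplicate drop of the same file
        version = (prev[0] + 1) if prev else 1
        self.by_path[path] = (version, digest)
        return f"indexed v{version}"  # newer version supersedes older
```

Provenance tracking would add the source path, page, and section to each chunk's payload at index time, which is also what makes clean deletion of superseded versions possible.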

Caching Strategy with Redis

Not every query needs a fresh vector search. In enterprise environments, the same questions get asked repeatedly — especially by different people in the same department. Our Redis caching layer operates at three levels:

  • Query cache: Exact query matches return cached results instantly (~1ms). This handles the "same question, different user" pattern.
  • Semantic cache: Queries with embedding similarity above 0.95 to a cached query return the cached result. This handles reformulations of the same question.
  • Fragment cache: Individual document chunks are cached after retrieval, reducing repeated disk/network reads for popular documents.
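The semantic-cache lookup is the interesting level of the three. A minimal sketch, with embeddings as plain float lists (production would embed with the same model as retrieval and keep entries in Redis):

```python
# Sketch: reuse a cached answer when the new query's embedding is
# within cosine similarity 0.95 of a cached query's embedding.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_lookup(query_vec, cache, threshold=0.95):
    """cache: list of (cached_vec, cached_answer). Returns answer or None."""
    best_sim, best_answer = 0.0, None
    for vec, answer in cache:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_sim, best_answer = sim, answer
    return best_answer if best_sim >= threshold else None
```

The 0.95 threshold is a trade-off: lower it and reformulations hit more often, but you risk serving a cached answer to a question that only looks similar.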

The cache hit rate averages 34% across all collections, but for high-traffic collections like HR policies and IT documentation, it exceeds 60%. That's 60% of queries answered in under 2ms.

What Breaks in Production That Tutorials Never Mention

Embedding drift. When you update your embedding model (and you will — better models come out constantly), every existing vector in your database becomes slightly misaligned with new vectors. You either re-embed everything (expensive) or live with degraded accuracy during the transition. We handle this with versioned collections — new embeddings go into a new collection, and the system gradually migrates traffic.
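The gradual migration between versioned collections can be as simple as deterministic traffic splitting, sketched here with hypothetical collection names:

```python
# Sketch: hash the query into a stable bucket and send a configurable
# fraction of traffic to the new-embedding collection, so accuracy can
# be compared side by side before full cutover.

import hashlib

def pick_collection(query: str, new_fraction: float,
                    old="docs_v1", new="docs_v2") -> str:
    bucket = int(hashlib.md5(query.encode()).hexdigest(), 16) % 100
    return new if bucket < new_fraction * 100 else old
```

Hashing the query (rather than random sampling) keeps routing stable: the same question always lands in the same collection during the transition, so users don't see answers flip between versions.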

Stale documents. Enterprise knowledge changes. Policies update. Procedures evolve. If your RAG system serves a policy from 2023 when the 2025 version exists, that's worse than not answering. We built a document freshness scorer that penalizes old documents and a review pipeline that flags potentially outdated content.

Confidentiality boundaries. Not everyone should access every document. The finance team's compensation data shouldn't appear in a sales rep's query results. We implement collection-level access controls and query-time permission filtering. This adds complexity but is non-negotiable in enterprise environments.
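The query-time filtering logic, reduced to its core. In Qdrant this is expressed as a payload filter on the search call itself so unauthorized chunks are never retrieved; the post-filter below shows the equivalent logic with a hypothetical `allowed_groups` payload field:

```python
# Sketch: drop any retrieved chunk whose access tags don't intersect
# the caller's group memberships, before anything reaches the LLM.

def filter_by_access(results, user_groups):
    """results: list of dicts with an 'allowed_groups' payload field."""
    return [r for r in results
            if set(r["allowed_groups"]) & set(user_groups)]
```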

The "confident wrong answer" problem. RAG systems can retrieve contextually plausible but factually wrong information and present it with complete confidence. We added a confidence scoring layer that flags low-confidence retrievals for human review rather than serving them directly. Better to say "I'm not sure — let me check" than to confidently serve wrong information.
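The gate itself is simple; the thresholds below are illustrative, not the production values:

```python
# Sketch: serve the answer only if the best re-ranked score clears a
# confidence threshold; otherwise defer to human review.

def gate(ranked, min_score=0.6):
    """ranked: list of (doc_id, score), best first. Returns 'serve' or 'defer'."""
    if not ranked or ranked[0][1] < min_score:
        return "defer"  # low-confidence retrieval goes to review
    return "serve"
```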

Practical Tips for Anyone Building Production RAG

  1. Start with one domain, not all domains. Pick the department with the cleanest documents and the highest query volume. Prove the system works there before expanding.
  2. Invest in chunking quality over embedding model selection. Switching from default character-count chunking to semantic chunking improved our accuracy more than upgrading the embedding model.
  3. Build the monitoring layer from day one. Track retrieval accuracy, latency percentiles, cache hit rates, and user feedback. Without metrics, you're flying blind.
  4. Plan for document updates. Your ingestion pipeline needs to handle document versioning, deletion, and replacement gracefully. This is not an edge case — it's the normal operating condition.
  5. Test with real users, not benchmarks. Academic retrieval benchmarks don't capture how enterprise users actually phrase questions. Real user testing revealed query patterns we never anticipated.
  6. Cache aggressively. In enterprise environments, query patterns are repetitive. A good caching layer can serve 30-60% of queries without touching the vector database.

The gap between tutorial RAG and production RAG is real, but it's not insurmountable. It requires thinking about the system as an infrastructure investment, not a weekend project. Build for reliability, measure everything, and design for the messy reality of enterprise data. The 86% accuracy at 22ms didn't come from a better model — it came from better engineering.


Kunal Chaudhary Rajora

IT Manager & Enterprise Architect at Y Group

7+ years building enterprise systems that think, adapt, and deliver. Specializing in ERP architecture, agentic AI, RAG systems, and leading digital transformation in industrial environments.
