RAG Pipeline: Retrieval-Augmented Generation for Enterprise Knowledge
A language model without grounding is an oracle that makes things up. OctopusOS RAG Pipeline transforms raw user queries into grounded, cited answers through an 8-stage pipeline: query rewriting, hybrid search (keyword + vector), permission filtering, LLM re-ranking, token-budgeted context assembly with PII masking, grounded answer generation, hallucination detection, and audit digest — all wired into the Worker Loop as a first-class port.
1. Why RAG Matters for an AI OS
Large Language Models hallucinate. They generate fluent text that sounds authoritative but may be entirely fabricated. For an enterprise AI operating system, this is unacceptable — wrong answers erode trust, create compliance risk, and waste human time verifying outputs.
RAG solves this by grounding LLM responses in retrieved source documents. Every claim in the answer is traceable to a specific chunk of verified knowledge. When the knowledge base has no relevant information, the system says so instead of guessing.
OctopusOS implements RAG as a kernel-native pipeline — pure domain logic with zero IO, backed by pluggable embedding and vector store adapters at the server layer.
2. The 8-Stage Pipeline
Every RAG query flows through eight deterministic stages. Each stage is a pure function in kernel/domains/rag/, composable and independently testable.
3. Query Rewriting
The rewriter converts colloquial user input into retrieval-optimized queries. It uses a small LLM (configurable via LLM_MODEL_SMALL) with structured JSON output.
# kernel/domains/rag/query_rewriter.py
def rewrite_query(*, query, llm_provider, config, history=None) -> list[str]:
    """LLM-based: resolve pronouns from conversation history,
    expand abbreviations, add synonyms. Returns 1-3 rewritten queries."""

def rule_based_rewrite(*, query) -> list[str]:
    """Fallback: synonym expansion for common CJK patterns
    (退款 -> 退款 退钱 退费), stopword removal."""
The rewriter also classifies query intent into factual, analytical, or procedural — this metadata is carried through the entire pipeline for downstream optimization.
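The rule-based fallback can be fleshed out as a short sketch. The synonym table and stopword list here are illustrative stand-ins, not the shipped data:

```python
import re

_SYNONYMS = {"退款": ["退钱", "退费"]}  # illustrative CJK synonym table
_STOPWORDS = {"how", "do", "i", "a", "an", "the", "please"}  # illustrative

def rule_based_rewrite(*, query: str) -> list[str]:
    """Fallback rewriter: synonym expansion plus stopword removal."""
    rewrites = [query]
    # Expand any known synonym groups found in the query
    for term, alts in _SYNONYMS.items():
        if term in query:
            rewrites.append(" ".join([term, *alts]))  # 退款 -> 退款 退钱 退费
    # Stopword-stripped variant (only if it differs from the original)
    kept = [w for w in re.findall(r"\w+", query.lower()) if w not in _STOPWORDS]
    stripped = " ".join(kept)
    if stripped and stripped != query.lower():
        rewrites.append(stripped)
    return rewrites[:3]  # pipeline contract: 1-3 rewritten queries
```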
4. Hybrid Search: Keyword + Vector
Pure keyword search misses semantic similarity. Pure vector search misses exact term matches. OctopusOS combines both via Reciprocal Rank Fusion (RRF).
Reciprocal Rank Fusion
RRF merges two ranked lists without requiring score normalization:
RRF_score(d) = sum( 1 / (k + rank_i(d)) ) for each retrieval system i
where k=60 (standard constant). This naturally handles the incompatible score distributions between BM25 and cosine similarity.
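The fusion step reduces to a few lines. A minimal sketch that merges ranked lists of document IDs exactly per the formula above:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge ranked doc-id lists without
    score normalization. Each list contributes 1/(k + rank) per document."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked near the top of both lists (here "b") beats one that tops only a single list, which is the behavior that makes RRF robust to BM25/cosine scale mismatch.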
5. LLM Re-ranking
After hybrid search returns candidates (overfetched at 3x top_k), an LLM re-ranker scores each passage for query relevance on a 0.0-1.0 scale.
# kernel/domains/rag/reranker.py
def rerank_with_llm(*, query, candidates, llm_provider, top_k=5):
    """LLM evaluates up to 20 candidate passages.
    Output: JSON {"rankings": [{"chunk_id": ..., "relevance_score": 0.0-1.0}]}
    Fallback: original retrieval order preserved on LLM failure."""
Re-ranking is the highest-impact stage for answer quality — it catches semantic mismatches that embedding similarity alone cannot detect.
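A possible body for the signature above, assuming a hypothetical `llm_provider.complete(prompt=...)` interface and candidates shaped as dicts with `chunk_id` and `text` keys:

```python
import json

def rerank_with_llm(*, query, candidates, llm_provider, top_k=5):
    """Sketch: score up to 20 passages via the LLM, fall back to
    the original retrieval order on any failure."""
    try:
        raw = llm_provider.complete(  # assumed provider interface
            prompt=f"Score each passage 0.0-1.0 for relevance to: {query}\n"
            + "\n".join(f"[{c['chunk_id']}] {c['text']}" for c in candidates[:20]))
        rankings = json.loads(raw)["rankings"]
        score = {r["chunk_id"]: r["relevance_score"] for r in rankings}
        ranked = sorted(candidates,
                        key=lambda c: score.get(c["chunk_id"], 0.0),
                        reverse=True)
    except Exception:
        ranked = candidates  # fallback: preserve retrieval order
    return ranked[:top_k]
```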
6. Context Building with PII Masking
The context builder assembles the final prompt context within a strict token budget. Before any text reaches the LLM, field-level PII masking is applied.
PII Patterns Detected
| Pattern | Replacement | Example |
|---|---|---|
| Chinese ID card (18 digits) | [MASKED_ID_CARD] | 110101199001011234 |
| Chinese mobile phone (1xx-xxxx-xxxx) | [MASKED_PHONE] | 13812345678 |
| International phone (+xx …) | [MASKED_PHONE] | +86 138 1234 5678 |
| Email address | [MASKED_EMAIL] | [email protected] |
| Credit/debit card | [MASKED_CARD] | 6222 0200 1234 5678 |
| Bank account (16-19 digits) | [MASKED_BANK_ACCT] | 6217001234567890123 |
| Passport (E/G/D + 8 digits) | [MASKED_PASSPORT] | E12345678 |
| US SSN (xxx-xx-xxxx) | [MASKED_SSN] | 123-45-6789 |
Custom patterns can be injected via extra_pii_patterns at runtime for domain-specific fields (employee IDs, medical record numbers, etc.).
Token Budget
The builder uses a bilingual token estimator: ~4 chars/token for ASCII, ~2 chars/token for CJK. Chunks are accumulated greedily until the budget (default 3000 tokens) is exhausted. Each chunk gets a citation marker [1], [2], etc.
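The estimator and the greedy accumulation loop can be sketched as follows (function names are illustrative, not the kernel's actual identifiers):

```python
def estimate_tokens(text: str) -> int:
    """Bilingual heuristic: ~4 chars/token for ASCII, ~2 chars/token for CJK."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    ascii_chars = len(text) - cjk
    return ascii_chars // 4 + cjk // 2

def build_context(chunks: list[str], budget: int = 3000) -> str:
    """Greedily accumulate chunks with [n] citation markers until
    the token budget is exhausted."""
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # budget exhausted; remaining chunks are dropped
        parts.append(f"[{i}] {chunk}")
        used += cost
    return "\n\n".join(parts)
```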
7. Grounded Answer Generation
The Q&A generator enforces grounded responses through a strict system prompt:
RAG_QA_SYSTEM_PROMPT = """You are an enterprise knowledge assistant.
Answer the user's question based ONLY on the reference documents below.
Rules:
1. Only use information from the reference documents. Do not make up facts.
2. After each factual statement, cite the source using markers like [1], [2].
3. If the documents do not contain relevant information, clearly state that.
4. Answer in the same language as the user's question.
5. Be concise and accurate."""
Confidence scoring combines retrieval relevance (60% weight) and citation usage ratio (40% weight):
confidence = avg_chunk_score * 0.6 + (used_citations / total_citations) * 0.4
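Expressed as code, with guards for empty inputs (the guard behavior is an assumption, not documented):

```python
def confidence_score(chunk_scores: list[float],
                     used_citations: int,
                     total_citations: int) -> float:
    """Blend average retrieval relevance (60%) with citation usage (40%)."""
    avg = sum(chunk_scores) / len(chunk_scores) if chunk_scores else 0.0
    usage = used_citations / total_citations if total_citations else 0.0
    return avg * 0.6 + usage * 0.4
```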
8. Hallucination Detection
Post-generation, an LLM evaluator checks every claim in the answer against source documents, producing a faithfulness score.
# kernel/domains/rag/hallucination_detector.py
def detect_hallucination(*, answer, source_chunks, llm_provider):
    """Returns:
    faithfulness_score: float 0-1
    unsupported_claims: [{claim, reason}]
    grounded_claims: [{claim, source_index}]"""
When the LLM evaluator is unavailable, a rule-based fallback splits the answer into sentences and checks keyword overlap (>30% overlap threshold) against source text.
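The rule-based fallback might look like this sketch: split into sentences, then flag any sentence whose keyword overlap with the source text is at or below the 30% threshold:

```python
import re

def keyword_overlap_check(answer: str, source_text: str,
                          threshold: float = 0.3) -> list[str]:
    """Return sentences whose keyword overlap with the sources
    is <= threshold (potentially unsupported claims)."""
    source_words = set(re.findall(r"\w+", source_text.lower()))
    unsupported = []
    # Split on sentence-ending punctuation (including CJK 。)
    for sentence in re.split(r"(?<=[.!?。])\s+", answer.strip()):
        words = set(re.findall(r"\w+", sentence.lower()))
        if not words:
            continue
        overlap = len(words & source_words) / len(words)
        if overlap <= threshold:
            unsupported.append(sentence)
    return unsupported
```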
9. Supporting Services
10. Embedding and Vector Store Adapters
The server layer provides two embedding backends and two vector store backends, selected via configuration.
FAISS Vector Store
For development and single-node deployments. Each collection is stored as:
- {root}/vector_stores/{collection}/index.faiss — FAISS inner-product index
- {root}/vector_stores/{collection}/metadata.json — chunk metadata sidecar
Operations: create_collection, upsert, search (with metadata filter), delete (index rebuild).
Qdrant Vector Store
For production environments. Uses the Qdrant gRPC/REST API with cosine distance, batch upserts (100 per batch), and native metadata filtering via FieldCondition.
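The 100-per-batch upsert policy reduces to a simple batching helper; a sketch independent of the Qdrant client library:

```python
def batched(points: list, size: int = 100) -> list[list]:
    """Split an upsert payload into fixed-size batches
    (the Qdrant adapter described above uses 100 points per batch)."""
    return [points[i:i + size] for i in range(0, len(points), size)]
```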
11. Worker Loop Integration
RAG is wired into the main Worker Loop as a first-class port, with RAG-first retrieval and graceful fallback.
# kernel/runtime/_wl_memory_kb.py
def _query_kb_ref(self, *, run_id, kb_query_spec):
    # Try the RAG pipeline first if available
    rag_ref = self._query_rag(run_id=run_id, kb_query_spec=kb_query_spec)
    if rag_ref is not None:
        return rag_ref
    # Fall back to the standard knowledge port query
    kb_id = kb_query_spec.kb_id  # the spec carries the target knowledge base
    return self.knowledge_port.query(kb_id, kb_query_spec)
The Worker Loop calls _kb_query_spec() for every intent, generating a query spec with kb_id, query (from intent objective), top_k=5, and source filters. The RAG adapter bridges this to the full 8-stage pipeline.
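Based on that description, the generated spec might look like the following sketch (field names follow the prose above; the exact contract type is an assumption):

```python
def kb_query_spec(*, kb_id: str, objective: str, sources=None) -> dict:
    """Sketch of the per-intent query spec handed to the RAG adapter."""
    return {
        "kb_id": kb_id,
        "query": objective,       # derived from the intent objective
        "top_k": 5,               # default per the Worker Loop description
        "source_filters": sources or [],
    }
```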
Bootstrap Wiring
# server/shared/wiring/bootstrap.py
rag_backend = config.get("RAG_BACKEND", "faiss")  # enabled by default
rag_port = create_rag_port(
    config=config,
    llm_provider=llm_provider_port,
    knowledge_port=capability_bundle.ports.get("KnowledgePort"),
    root_dir=root_dir,
)
# Injected into KernelWorkerLoop(rag_port=rag_port)
12. Multimodal RAG
The pipeline supports multimodal inputs via multimodal_query(). When a MultimodalContext (from the Multimodal Perception layer) arrives, the fused text is extracted and fed through the standard RAG pipeline.
# kernel/domains/rag/pipeline.py
def multimodal_query(*, context, kb_id, embedding_port, vector_store, ...):
    """Extract fused_text from MultimodalContext, delegate to rag_query().
    Fallback: dominant modality content_ref."""
This means image descriptions, audio transcriptions, and document extractions can all be used as RAG queries — the user uploads a PDF and asks questions about it, grounded in the knowledge base.
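The text-selection step of that delegation can be sketched as follows, assuming attribute names (`fused_text`, `dominant_modality`, `content_ref`) taken from the signatures above:

```python
def multimodal_query_text(context) -> str:
    """Pick the query text from a MultimodalContext-like object:
    prefer fused_text, else fall back to the dominant modality's content."""
    fused = getattr(context, "fused_text", None)
    if fused:
        return fused
    dominant = getattr(context, "dominant_modality", None)
    return getattr(dominant, "content_ref", "") if dominant else ""
```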
13. Configuration Reference
| Config Key | Default | Purpose |
|---|---|---|
| RAG_BACKEND | faiss | Enable RAG and select backend mode |
| VECTOR_STORE_MODE | faiss | Vector DB: faiss or qdrant |
| EMBEDDING_MODE | local | Embedding: local (sentence-transformers) or api |
| RAG_REWRITE_ENABLED | true | LLM query rewriting |
| RAG_RERANK_ENABLED | true | LLM re-ranking |
| RAG_PII_MASKING | true | PII detection and masking in context |
| RAG_MAX_CONTEXT_TOKENS | 3000 | Token budget for context assembly |
| RAG_MAX_ANSWER_TOKENS | 1024 | Max tokens for generated answer |
| VECTOR_STORE_ROOT | /tmp/octopus-vectors | FAISS storage directory |
| EMBEDDING_MODEL | all-MiniLM-L6-v2 | Local embedding model |
| QDRANT_URL | http://localhost:6333 | Qdrant server address |
14. Architecture Overview
15. Implementation Metrics
| Metric | Count |
|---|---|
| RAG pipeline stages | 8 |
| Domain modules in domains/rag/ | 13 |
| Frozen contracts | 6 (RAGQuery, RAGContext, RAGResponse, EmbeddingResult, VectorSearchResult, HybridSearchResult) |
| PII detection patterns | 8 (ID card, phone, email, card, bank account, passport, SSN, international phone) |
| Vector store backends | 2 (FAISS, Qdrant) |
| Embedding backends | 2 (local sentence-transformers, API) |
| Supporting services | 4 (classifier, freshness monitor, feedback store, entity extractor) |
| Gate violations introduced | 0 |
16. Application Scenarios
Enterprise Knowledge Q&A
Upload product manuals, policy documents, and FAQs into the knowledge base. Users ask questions in natural language and receive grounded, cited answers — with PII automatically masked before reaching the LLM.
Customer Service Automation
Connect CRM and ticketing system data. The RAG pipeline retrieves relevant case history and product information, generates accurate responses with citations, and tracks user satisfaction through the feedback store.
Compliance and Audit
Every RAG response carries an immutable audit_digest (SHA-256) linking query, context, citations, and answer. The hallucination detector provides a faithfulness score for each response, enabling automated quality gates.
Multilingual Support
Query rewriting handles CJK synonym expansion natively. The context builder’s bilingual token estimator correctly budgets for mixed ASCII/CJK content. The Q&A generator responds in the same language as the user’s question.