RAG Pipeline: Retrieval-Augmented Generation for Enterprise Knowledge

A language model without grounding is an oracle that makes things up. OctopusOS RAG Pipeline transforms raw user queries into grounded, cited answers through an 8-stage pipeline: query rewriting, hybrid search (keyword + vector), permission filtering, LLM re-ranking, token-budgeted context assembly with PII masking, grounded answer generation, hallucination detection, and audit digest — all wired into the Worker Loop as a first-class port.


1. Why RAG Matters for an AI OS

Large Language Models hallucinate. They generate fluent text that sounds authoritative but may be entirely fabricated. For an enterprise AI operating system, this is unacceptable — wrong answers erode trust, create compliance risk, and waste human time verifying outputs.

RAG solves this by grounding LLM responses in retrieved source documents. Every claim in the answer is traceable to a specific chunk of verified knowledge. When the knowledge base has no relevant information, the system says so instead of guessing.

OctopusOS implements RAG as a kernel-native pipeline — pure domain logic with zero IO, backed by pluggable embedding and vector store adapters at the server layer.


2. The 8-Stage Pipeline

Every RAG query flows through eight deterministic stages. Each stage is a pure function in kernel/domains/rag/, composable and independently testable.

RAG Pipeline: 8 Stages

1. Query Rewrite
   LLM-based expansion: resolve pronouns, expand abbreviations, add synonyms. Rule-based fallback for CJK text.

2. Hybrid Search
   Keyword BM25 + vector cosine scored via Reciprocal Rank Fusion (RRF). Overfetch 3x for re-ranking headroom.

3. Permission Filter
   ACL check via KnowledgePort. Only chunks the user is authorized to see survive.

4. LLM Re-rank
   LLM scores each candidate passage 0.0-1.0 for query relevance. Structured JSON output.

5. Context Build
   Token budgeting (default 3000), deduplication, PII masking (ID cards, phones, emails, bank accounts).

6. Answer Generation
   Grounded Q&A with citation markers [1], [2]. System prompt enforces source-only answers.

7. Hallucination Check
   LLM faithfulness evaluator scores every claim. Rule-based keyword-overlap fallback.

8. Audit Digest
   stable_digest over query + answer + citations. Immutable evidence for compliance.
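The composability claim can be sketched in a few lines: each stage is a pure function over a shared state object, and the orchestrator just folds the state through the stage list. This is an illustrative toy (names like RagState, run_pipeline, and the placeholder stage bodies are assumptions, not the actual pipeline.py signatures):

```python
from dataclasses import dataclass, field

@dataclass
class RagState:
    """Illustrative pipeline state; the real kernel contracts differ."""
    query: str
    rewrites: list = field(default_factory=list)
    candidates: list = field(default_factory=list)
    answer: str = ""

def rewrite(state: RagState) -> RagState:
    state.rewrites = [state.query]  # stage 1 placeholder (no LLM here)
    return state

def search(state: RagState) -> RagState:
    # stage 2 placeholder: pretend each rewrite retrieves one chunk
    state.candidates = ["chunk for " + q for q in state.rewrites]
    return state

def generate(state: RagState) -> RagState:
    state.answer = f"answer grounded in {len(state.candidates)} chunk(s) [1]"
    return state

# The real pipeline composes 8 such stages; 3 suffice to show the shape.
STAGES = [rewrite, search, generate]

def run_pipeline(query: str) -> RagState:
    state = RagState(query=query)
    for stage in STAGES:  # each stage is pure and independently testable
        state = stage(state)
    return state
```

Because every stage takes and returns plain state, each one can be unit-tested in isolation and swapped for a fallback implementation without touching the orchestrator.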

3. Query Rewriting

The rewriter converts colloquial user input into retrieval-optimized queries. It uses a small LLM (configurable via LLM_MODEL_SMALL) with structured JSON output.

# kernel/domains/rag/query_rewriter.py
def rewrite_query(*, query, llm_provider, config, history=None) -> list[str]:
    """LLM-based: resolve pronouns from conversation history,
    expand abbreviations, add synonyms. Returns 1-3 rewritten queries."""

def rule_based_rewrite(*, query) -> list[str]:
    """Fallback: synonym expansion for common CJK patterns
    (退款 -> 退款 退钱 退费), stopword removal."""

The rewriter also classifies query intent into factual, analytical, or procedural — this metadata is carried through the entire pipeline for downstream optimization.
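A minimal sketch of what the rule-based fallback might look like, following the 退款 example above. The synonym table, stopword list, and the cap at 3 queries are illustrative assumptions, not the kernel's actual data:

```python
# Hypothetical synonym table for common CJK patterns
# (退款 "refund" expands to 退钱 "money back" and 退费 "fee refund").
CJK_SYNONYMS = {
    "退款": ["退款", "退钱", "退费"],
}
# Hypothetical stopwords: polite fillers like 请问 ("may I ask").
STOPWORDS = {"请问", "一下"}

def rule_based_rewrite(*, query: str) -> list[str]:
    """Fallback rewriter: stopword removal + synonym expansion."""
    tokens = [t for t in query.split() if t not in STOPWORDS]
    base = " ".join(tokens) or query
    queries = [base]
    for term, synonyms in CJK_SYNONYMS.items():
        if term in base:
            queries.append(" ".join(synonyms))
    return queries[:3]  # assumed cap: pipeline returns 1-3 queries
```

For example, a query containing 退款 yields both the cleaned original and a synonym-expanded variant, giving the keyword searcher more surface to match.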


4. Hybrid Search: Keyword + Vector

Pure keyword search misses semantic similarity. Pure vector search misses exact term matches. OctopusOS combines both via Reciprocal Rank Fusion (RRF).

Keyword Search (BM25)
  • Inverted index built from chunked documents
  • TF-IDF scoring with BM25 normalization
  • CJK-aware tokenization (jieba segmentation)
  • Default weight: 0.3

Vector Search (Cosine)
  • Query embedded via EmbeddingPort
  • FAISS IndexFlatIP (dev) or Qdrant cosine (prod)
  • Batch embedding with configurable batch_size=32
  • Default weight: 0.7

Reciprocal Rank Fusion

RRF merges two ranked lists without requiring score normalization:

RRF_score(d) = sum( 1 / (k + rank_i(d)) )   for each retrieval system i

where k=60 (standard constant). This naturally handles the incompatible score distributions between BM25 and cosine similarity.
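The formula translates directly into code. This is an unweighted sketch (the production merge may additionally fold in the 0.3/0.7 keyword/vector weights listed above; plain RRF needs no such normalization):

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked doc-id lists via Reciprocal Rank Fusion.

    score(d) = sum over systems of 1 / (k + rank(d)), ranks starting at 1.
    Only ranks matter, so BM25 and cosine scores never need to be
    put on a common scale.
    """
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked near the top of both lists ("b" below) beats one that is first in only one list:

```python
rrf_merge([["a", "b", "c"], ["b", "c", "a"]])  # -> ["b", "a", "c"]
```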


5. LLM Re-ranking

After hybrid search returns candidates (overfetched at 3x top_k), an LLM re-ranker scores each passage for query relevance on a 0.0-1.0 scale.

# kernel/domains/rag/reranker.py
def rerank_with_llm(*, query, candidates, llm_provider, top_k=5):
    """LLM evaluates up to 20 candidate passages.
    Output: JSON {"rankings": [{"chunk_id": ..., "relevance_score": 0.0-1.0}]}
    Fallback: original retrieval order preserved on LLM failure."""

Re-ranking is the highest-impact stage for answer quality — it catches semantic mismatches that embedding similarity alone cannot detect.


6. Context Building with PII Masking

The context builder assembles the final prompt context within a strict token budget. Before any text reaches the LLM, field-level PII masking is applied.

PII Patterns Detected

Pattern                              Replacement            Example
Chinese ID card (18 digits)          [MASKED_ID_CARD]       110101199001011234
Mobile phone (1xx-xxxx-xxxx)         [MASKED_PHONE]         13812345678
International phone (+xx …)          [MASKED_PHONE]         +86 138 1234 5678
Email address                        [MASKED_EMAIL]         [email protected]
Credit/debit card                    [MASKED_CARD]          6222 0200 1234 5678
Bank account (16-19 digits)          [MASKED_BANK_ACCT]     6217001234567890123
Passport (E/G/D + 8 digits)          [MASKED_PASSPORT]      E12345678
US SSN (xxx-xx-xxxx)                 [MASKED_SSN]           123-45-6789

Custom patterns can be injected via extra_pii_patterns at runtime for domain-specific fields (employee IDs, medical record numbers, etc.).
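The pattern-plus-replacement design and the extra_pii_patterns hook can be sketched as a small regex pipeline. The regexes below are simplified stand-ins for a subset of the table above, and the EMP- employee-ID pattern is a hypothetical custom injection; none of this is the kernel's production pattern set:

```python
import re

# Simplified sketch of three default patterns from the table above.
DEFAULT_PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[MASKED_SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[MASKED_EMAIL]"),
    (re.compile(r"\b1\d{10}\b"), "[MASKED_PHONE]"),  # CN mobile, 11 digits
]

def mask_pii(text: str, extra_pii_patterns=None) -> str:
    """Apply default patterns, then any runtime-injected custom ones."""
    patterns = DEFAULT_PII_PATTERNS + list(extra_pii_patterns or [])
    for pattern, replacement in patterns:
        text = pattern.sub(replacement, text)
    return text

# Hypothetical domain-specific pattern injected at runtime.
employee_id = (re.compile(r"\bEMP-\d{6}\b"), "[MASKED_EMPLOYEE_ID]")
```

Because masking runs before context assembly, no raw identifier ever reaches the LLM prompt, and custom patterns slot in without touching the defaults.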

Token Budget

The builder uses a bilingual token estimator: ~4 chars/token for ASCII, ~2 chars/token for CJK. Chunks are accumulated greedily until the budget (default 3000 tokens) is exhausted. Each chunk gets a citation marker [1], [2], etc.
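The estimator and the greedy accumulation can be sketched as follows (rounding behavior and the exact CJK character range are assumptions; the kernel's implementation may differ in detail):

```python
def estimate_tokens(text: str) -> int:
    """Bilingual heuristic: ~4 chars/token for ASCII, ~2 for CJK.
    Uses only the basic CJK Unified Ideographs block for simplicity."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return other // 4 + cjk // 2

def build_context(chunks: list[str], budget: int = 3000) -> list[str]:
    """Greedily accumulate chunks until the token budget is exhausted,
    tagging each surviving chunk with its citation marker."""
    selected, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # budget exhausted; remaining chunks are dropped
        selected.append(f"[{i}] {chunk}")
        used += cost
    return selected
```

The two-characters-per-token rate for CJK matters in practice: a 3000-token budget holds roughly half as many CJK characters as ASCII characters, so a monolingual estimator would systematically overfill mixed-language contexts.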


7. Grounded Answer Generation

The Q&A generator enforces grounded responses through a strict system prompt:

RAG_QA_SYSTEM_PROMPT = """You are an enterprise knowledge assistant.
Answer the user's question based ONLY on the reference documents below.

Rules:
1. Only use information from the reference documents. Do not make up facts.
2. After each factual statement, cite the source using markers like [1], [2].
3. If the documents do not contain relevant information, clearly state that.
4. Answer in the same language as the user's question.
5. Be concise and accurate."""

Confidence scoring combines retrieval relevance (60% weight) and citation usage ratio (40% weight):

confidence = avg_chunk_score * 0.6 + (used_citations / total_citations) * 0.4
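As a worked example of the formula (function name and edge-case handling are illustrative):

```python
def confidence(chunk_scores: list[float],
               used_citations: int,
               total_citations: int) -> float:
    """Blend retrieval relevance (60%) with citation usage (40%)."""
    avg_chunk_score = sum(chunk_scores) / len(chunk_scores) if chunk_scores else 0.0
    usage = used_citations / total_citations if total_citations else 0.0
    return avg_chunk_score * 0.6 + usage * 0.4
```

With two retrieved chunks scored 0.8 and 0.6, and only one of two available citations actually used in the answer, confidence comes out to 0.7 * 0.6 + 0.5 * 0.4 = 0.62.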

8. Hallucination Detection

Post-generation, an LLM evaluator checks every claim in the answer against source documents, producing a faithfulness score.

# kernel/domains/rag/hallucination_detector.py
def detect_hallucination(*, answer, source_chunks, llm_provider):
    """Returns:
        faithfulness_score: float 0-1
        unsupported_claims: [{claim, reason}]
        grounded_claims: [{claim, source_index}]"""

When the LLM evaluator is unavailable, a rule-based fallback splits the answer into sentences and checks keyword overlap (>30% overlap threshold) against source text.
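A sketch of that fallback, assuming simple word-set overlap and punctuation-based sentence splitting (the kernel's tokenization, especially for CJK, is likely more careful):

```python
import re

def keyword_overlap(sentence: str, source_text: str) -> float:
    """Fraction of the sentence's words that also appear in the source."""
    words = set(re.findall(r"\w+", sentence.lower()))
    if not words:
        return 0.0
    source_words = set(re.findall(r"\w+", source_text.lower()))
    return len(words & source_words) / len(words)

def rule_based_check(answer: str, source_text: str, threshold: float = 0.3):
    """Split the answer into sentences; flag those at or below the
    overlap threshold as unsupported. Returns (faithfulness, unsupported)."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？]", answer) if s.strip()]
    unsupported = [s for s in sentences
                   if keyword_overlap(s, source_text) <= threshold]
    grounded = len(sentences) - len(unsupported)
    faithfulness = grounded / len(sentences) if sentences else 1.0
    return faithfulness, unsupported
```

This is a coarse signal by design: it cannot judge paraphrase or negation the way the LLM evaluator can, but it degrades gracefully when no LLM is available and still surfaces obviously ungrounded sentences.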


9. Supporting Services

Intent Classifier
  • 8 categories: factual, analytical, procedural, complaint, request, feedback, greeting, other
  • LLM-based with structured JSON output
  • Rule-based CJK fallback (什么/怎么/为什么 patterns)
  • Sentiment analysis (positive/negative/neutral + aspects)

Freshness Monitor
  • Per-connector staleness scoring (0.0 = just synced, >1.0 = overdue)
  • Configurable threshold (default 24 hours)
  • Severity levels: info (>50%), warning (>75%), critical (>100%)
  • Error state detection for failed connectors

Feedback Store
  • JSONL-based persistent storage
  • 1-5 star ratings or thumbs (-1/0/1)
  • Aggregate statistics: average, satisfaction rate, distribution
  • Time-range and response-scoped queries

Entity Extractor
  • LLM-based extraction with structured output
  • Person, organization, location, date, product entities
  • Entity linking to knowledge base entries
  • Rule-based CJK fallback
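Of these, the feedback store is the simplest to sketch. A minimal JSONL-backed version might look like the following (class name, field names, and the 4-star satisfaction cutoff are assumptions, not the kernel's schema):

```python
import json
from pathlib import Path

class FeedbackStore:
    """Append-only JSONL store with aggregate statistics."""

    def __init__(self, path: Path):
        self.path = path

    def record(self, *, response_id: str, rating: int) -> None:
        # One JSON object per line; append keeps writes cheap and atomic-ish.
        entry = {"response_id": response_id, "rating": rating}  # 1-5 stars
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

    def stats(self) -> dict:
        ratings = [json.loads(line)["rating"]
                   for line in self.path.read_text(encoding="utf-8").splitlines()
                   if line.strip()]
        if not ratings:
            return {"count": 0, "average": 0.0, "satisfaction_rate": 0.0}
        return {
            "count": len(ratings),
            "average": sum(ratings) / len(ratings),
            # Assumed cutoff: 4 or 5 stars counts as satisfied.
            "satisfaction_rate": sum(r >= 4 for r in ratings) / len(ratings),
        }
```

JSONL suits this workload: each rating is an independent append, the file stays human-inspectable, and aggregates are cheap to recompute on read at this scale.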

10. Embedding and Vector Store Adapters

The server layer provides two embedding backends and two vector store backends, selected via configuration.

Embedding and Vector Store Factory
  • Embedding Factory: selects backend via EMBEDDING_MODE: local | api
      • Local Embedding: sentence-transformers (all-MiniLM-L6-v2)
      • API Embedding: OpenAI-compatible embedding API
  • FAISS Store: IndexFlatIP + JSON sidecar (dev/single-node)
  • Qdrant Store: cosine distance, batch upsert (production)

FAISS Vector Store

For development and single-node deployments. Each collection is stored as:

  • {root}/vector_stores/{collection}/index.faiss — FAISS inner-product index
  • {root}/vector_stores/{collection}/metadata.json — chunk metadata sidecar

Operations: create_collection, upsert, search (with metadata filter), delete (index rebuild).

Qdrant Vector Store

For production environments. Uses the Qdrant gRPC/REST API with cosine distance, batch upserts (100 per batch), and native metadata filtering via FieldCondition.


11. Worker Loop Integration

RAG is wired into the main Worker Loop as a first-class port, with RAG-first retrieval and graceful fallback.

# kernel/runtime/_wl_memory_kb.py
def _query_kb_ref(self, *, run_id, kb_query_spec):
    # Try RAG pipeline first if available
    rag_ref = self._query_rag(run_id=run_id, kb_query_spec=kb_query_spec)
    if rag_ref is not None:
        return rag_ref
    # Fallback to standard knowledge port query
    return self.knowledge_port.query(kb_id, kb_query_spec)

The Worker Loop calls _kb_query_spec() for every intent, generating a query spec with kb_id, query (from intent objective), top_k=5, and source filters. The RAG adapter bridges this to the full 8-stage pipeline.

Bootstrap Wiring

# server/shared/wiring/bootstrap.py
rag_backend = config.get("RAG_BACKEND", "faiss")  # enabled by default
rag_port = create_rag_port(
    config=config,
    llm_provider=llm_provider_port,
    knowledge_port=capability_bundle.ports.get("KnowledgePort"),
    root_dir=root_dir,
)
# Injected into KernelWorkerLoop(rag_port=rag_port)

12. Multimodal RAG

The pipeline supports multimodal inputs via multimodal_query(). When a MultimodalContext (from the Multimodal Perception layer) arrives, the fused text is extracted and fed through the standard RAG pipeline.

# kernel/domains/rag/pipeline.py
def multimodal_query(*, context, kb_id, embedding_port, vector_store, ...):
    """Extract fused_text from MultimodalContext, delegate to rag_query().
    Fallback: dominant modality content_ref."""

This means image descriptions, audio transcriptions, and document extractions can all serve as RAG queries: a user can upload a PDF and ask questions about it, with answers grounded in the knowledge base.


13. Configuration Reference

Config Key                 Default                   Purpose
RAG_BACKEND                faiss                     Enable RAG and select backend mode
VECTOR_STORE_MODE          faiss                     Vector DB: faiss or qdrant
EMBEDDING_MODE             local                     Embedding: local (sentence-transformers) or api
RAG_REWRITE_ENABLED        true                      LLM query rewriting
RAG_RERANK_ENABLED         true                      LLM re-ranking
RAG_PII_MASKING            true                      PII detection and masking in context
RAG_MAX_CONTEXT_TOKENS     3000                      Token budget for context assembly
RAG_MAX_ANSWER_TOKENS      1024                      Max tokens for generated answer
VECTOR_STORE_ROOT          /tmp/octopus-vectors      FAISS storage directory
EMBEDDING_MODEL            all-MiniLM-L6-v2          Local embedding model
QDRANT_URL                 http://localhost:6333     Qdrant server address

14. Architecture Overview

RAG Pipeline Architecture
L5: Worker Loop Integration
  • _query_kb_ref(): RAG-first with KB fallback
  • _query_rag(): RAG port bridge
  • Every intent triggers a KB/RAG query
  • KBRef output consumed by the LLM prompt builder

L4: Server Adapters
  • RagPortAdapter: bridges RAGPort to the pipeline
  • EmbeddingFactory: local or API provider
  • FaissVectorStore: dev/single-node
  • QdrantVectorStore: production

L3: RAG Domain Logic (8 Stages)
  • pipeline.py: orchestrator
  • query_rewriter: LLM + rule-based
  • reranker: LLM relevance scoring
  • context_builder: token budget + PII masking
  • qa_generator: grounded answer + citations
  • hallucination_detector: faithfulness evaluation

L2: Search Infrastructure
  • embedding_pipeline: batch embed + index + search
  • hybrid_search: keyword BM25 + vector cosine + RRF merge
  • VectorSearchResult + HybridSearchResult contracts

L1: Contracts
  • RAGQuery, RAGContext, RAGResponse (frozen)
  • EmbeddingResult, VectorSearchResult (frozen)
  • EmbeddingPort, VectorStorePort (Protocol)
  • UserFeedback (frozen)

15. Implementation Metrics

Metric                           Count
RAG pipeline stages              8
Domain modules in domains/rag/   13
Frozen contracts                 6 (RAGQuery, RAGContext, RAGResponse, EmbeddingResult, VectorSearchResult, HybridSearchResult)
PII detection patterns           8 (ID card, phone, email, card, bank account, passport, SSN, international phone)
Vector store backends            2 (FAISS, Qdrant)
Embedding backends               2 (local sentence-transformers, API)
Supporting services              4 (classifier, freshness monitor, feedback store, entity extractor)
Gate violations introduced       0

16. Application Scenarios

Enterprise Knowledge Q&A

Upload product manuals, policy documents, and FAQs into the knowledge base. Users ask questions in natural language and receive grounded, cited answers — with PII automatically masked before reaching the LLM.

Customer Service Automation

Connect CRM and ticketing system data. The RAG pipeline retrieves relevant case history and product information, generates accurate responses with citations, and tracks user satisfaction through the feedback store.

Compliance and Audit

Every RAG response carries an immutable audit_digest (SHA-256) linking query, context, citations, and answer. The hallucination detector provides a faithfulness score for each response, enabling automated quality gates.

Multilingual Support

Query rewriting handles CJK synonym expansion natively. The context builder’s bilingual token estimator correctly budgets for mixed ASCII/CJK content. The Q&A generator responds in the same language as the user’s question.
