
Crate entelix_rag


§entelix-rag

Algorithmic primitives for retrieval-augmented generation pipelines — Document with provenance + lineage, plus the DocumentLoader / TextSplitter / Chunker trait surface every RAG path composes around.

§Position

2026-era agentic RAG (Contextual Retrieval, Self-RAG, CRAG, Adaptive-RAG) is no longer a side pipeline — it’s the agent’s baseline working memory. This crate ships the algorithmic primitives (splitters, chunkers, ingestion composition) that every consumer reaches for. Concrete source connectors (S3, Notion, Confluence, GDrive, …) live in companion crates so the core surface stays small and dependency-light.

§Surface

  • Document — RAG-shaped document with Source (where it came from), Lineage (split / chunk ancestry), and entelix_memory::Namespace (multi-tenant boundary). The retrieval-side entelix_memory::Document (with similarity score) is a result shape; this is the ingestion shape.
  • DocumentLoader — async source-side trait. Streams to keep ingestion memory-bounded over arbitrarily large corpora.
  • TextSplitter — sync algorithmic primitive. Slices a Document into smaller Documents preserving Lineage.
  • Chunker — async transform over a chunk sequence. LLM-call capable (Anthropic Contextual Retrieval, HyDE, query decomposition).
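A minimal sketch of how the split side of this surface composes, using simplified stand-ins (the field names, the synchronous signature, and `FixedSizeSplitter` are illustrative only; the real `Document`, `Lineage`, and `TextSplitter` items are documented below, and the real splitter trait preserves a richer `Lineage` than a bare parent id):

```rust
// Illustrative shapes only: simplified stand-ins for the crate's
// Document / TextSplitter surface, not the real definitions.
#[derive(Clone, Debug)]
struct Document {
    id: String,
    content: String,
    parent_id: Option<String>, // stand-in for the full Lineage type
}

trait TextSplitter {
    fn split(&self, doc: &Document) -> Vec<Document>;
}

/// Fixed character-budget splitter: each child derives its id by
/// suffixing the parent id with `:<chunk_index>`, as the DocumentId
/// docs below describe.
struct FixedSizeSplitter {
    chunk_size: usize,
}

impl TextSplitter for FixedSizeSplitter {
    fn split(&self, doc: &Document) -> Vec<Document> {
        let chars: Vec<char> = doc.content.chars().collect();
        chars
            .chunks(self.chunk_size)
            .enumerate()
            .map(|(i, chunk)| Document {
                id: format!("{}:{}", doc.id, i),
                content: chunk.iter().collect(),
                parent_id: Some(doc.id.clone()),
            })
            .collect()
    }
}

fn main() {
    let doc = Document {
        id: "s3://bucket/report.md".into(),
        content: "abcdefghij".into(),
        parent_id: None,
    };
    let chunks = FixedSizeSplitter { chunk_size: 4 }.split(&doc);
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[0].id, "s3://bucket/report.md:0");
    assert_eq!(chunks[2].content, "ij");
}
```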

§What lives in companion crates

  • Source connectors: entelix-rag-s3, entelix-rag-notion, entelix-rag-confluence, entelix-rag-fs (filesystem-backed, invariant 9 exemption).
  • Vendor-accurate tokenizers: entelix-tokenizer-tiktoken, entelix-tokenizer-hf, locale-aware companions (Korean / Japanese morphology). The entelix_core::TokenCounter trait is the integration surface; this crate’s TokenCountSplitter is generic over any C: TokenCounter + ?Sized + 'static (default dyn TokenCounter) so concrete Arc<TiktokenCounter> and type-erased Arc<dyn TokenCounter> plug in interchangeably. Vendor accuracy is a counter swap, not a splitter rewrite.
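The `?Sized`-generic pattern described above can be sketched in plain Rust; here a hypothetical `WordCounter` stands in for a vendor tokenizer, and `TokenBudget` for the splitter's bound (both are illustrative names, not crate items):

```rust
use std::sync::Arc;

// Hypothetical stand-in for the entelix_core::TokenCounter trait.
trait TokenCounter {
    fn count(&self, text: &str) -> usize;
}

/// Toy counter: whitespace-delimited words. A vendor-accurate counter
/// (tiktoken, HF) would replace this without touching the splitter.
struct WordCounter;

impl TokenCounter for WordCounter {
    fn count(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}

/// Generic over `C: TokenCounter + ?Sized`, mirroring the bound above:
/// both a concrete `Arc<WordCounter>` and a type-erased
/// `Arc<dyn TokenCounter>` plug in interchangeably.
struct TokenBudget<C: TokenCounter + ?Sized> {
    counter: Arc<C>,
    max_tokens: usize,
}

impl<C: TokenCounter + ?Sized> TokenBudget<C> {
    fn fits(&self, text: &str) -> bool {
        self.counter.count(text) <= self.max_tokens
    }
}

fn main() {
    // Concrete counter parameter.
    let concrete = TokenBudget { counter: Arc::new(WordCounter), max_tokens: 3 };
    // Type-erased counter parameter: same struct, different C.
    let erased: TokenBudget<dyn TokenCounter> = TokenBudget {
        counter: Arc::new(WordCounter) as Arc<dyn TokenCounter>,
        max_tokens: 3,
    };
    assert!(concrete.fits("one two three"));
    assert!(!erased.fits("one two three four"));
}
```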

§Why algorithmic primitives only

The LangChain ecosystem’s mistake was bundling 100+ source connectors into the core surface — version churn became unmanageable. entelix-rag’s core is explicitly small (4 traits + Document + provenance types) so vendor-specific loaders ship independently and never gate the core’s release cadence. The algorithmic primitives (splitters, chunkers, ingestion composition) are universal, so they live here.

Structs§

ContextualChunker
Anthropic Contextual Retrieval chunker. Each chunk’s content is rewritten as <contextual prefix>\n\n<original chunk> where <contextual prefix> is a model-generated 50-100 token summary of how the chunk relates to its parent document.
ContextualChunkerBuilder
Builder for ContextualChunker. Construct via ContextualChunker::builder; chain config setters; finalise with Self::build.
CorrectiveRagState
State the corrective-RAG graph drives across nodes. Carries the original + current query, the rewrite history, the last retrieval batch + verdicts, the surviving correct subset, and the terminal answer.
CragConfig
Operator-tunable knobs for the corrective-RAG recipe. Construct via Self::new or Self::default; chain with_* setters.
Document
The unit a RAG pipeline moves around — content plus everything downstream needs to know about where it came from.
DocumentId
Stable identifier for a Document within its Namespace. Loaders mint these from the source’s natural id (S3 object key, Notion page id, file path); splitters derive child ids by suffixing the parent id with :<chunk_index>.
IngestError
One per-document failure recorded during ingestion. Carries the originating document id (when known) and a stage label identifying which pipeline phase failed.
IngestReport
Outcome counters and per-document failure list a single IngestionPipeline::run produces.
IngestionPipeline
End-to-end RAG ingestion pipeline. Construct via Self::builder; finalise with IngestionPipelineBuilder::build; drive with Self::run.
IngestionPipelineBuilder
Builder for IngestionPipeline. Required components (loader / splitter / embedder / store) come in via IngestionPipeline::builder; optional Chunker entries accumulate via Self::add_chunker.
Lineage
Split-history — survives every transformation. A leaf chunk’s Lineage describes which parent it came from, which split produced it, and which chunkers ran over it. Audit / debug flows reconstruct the path from a retrieval hit back to the ingestion source by walking the lineage chain (parent_id → loader’s source URI).
LlmQueryRewriter
Reference LLM-driven QueryRewriter. Asks the supplied Runnable<Vec<Message>, Message> model for a corrected query, then trims surrounding whitespace and quote marks.
LlmQueryRewriterBuilder
Builder for LlmQueryRewriter.
LlmRetrievalGrader
Reference LLM-driven RetrievalGrader. Asks the supplied Runnable<Vec<Message>, Message> model to classify relevance, then parses the reply into a GradeVerdict. Operators building on this default tune the prompt via LlmRetrievalGraderBuilder::with_instruction, or write their own grader from scratch.
LlmRetrievalGraderBuilder
Builder for LlmRetrievalGrader.
MarkdownStructureSplitter
Heading-aware markdown splitter.
RecursiveCharacterSplitter
Recursive character-budget splitter.
Source
Where a Document originated. Survives every split and chunker pass — the leaf chunk knows the source URI of the parent document and which loader produced it.
TokenCountSplitter
Recursive token-budget splitter.
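As a sketch of the audit flow the Lineage entry above describes, here is a hypothetical parent-id walk from a leaf chunk back to its loader-minted source URI (`walk_to_source` and the flattened map are illustrative; the real Lineage type records more than a parent id):

```rust
use std::collections::HashMap;

// Hypothetical flattening of a lineage chain: each chunk id maps to the
// id it was split from; the loader-minted root maps to None.
fn walk_to_source(
    lineage: &HashMap<String, Option<String>>,
    leaf: &str,
) -> Vec<String> {
    let mut path = vec![leaf.to_string()];
    let mut current = leaf.to_string();
    while let Some(Some(parent)) = lineage.get(&current) {
        path.push(parent.clone());
        current = parent.clone();
    }
    path
}

fn main() {
    let mut lineage = HashMap::new();
    // Root id is the source URI; children suffix `:<chunk_index>`.
    lineage.insert("s3://bucket/doc".to_string(), None);
    lineage.insert(
        "s3://bucket/doc:2".to_string(),
        Some("s3://bucket/doc".to_string()),
    );
    lineage.insert(
        "s3://bucket/doc:2:0".to_string(),
        Some("s3://bucket/doc:2".to_string()),
    );

    let path = walk_to_source(&lineage, "s3://bucket/doc:2:0");
    assert_eq!(
        path,
        vec!["s3://bucket/doc:2:0", "s3://bucket/doc:2", "s3://bucket/doc"]
    );
}
```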

Enums§

FailurePolicy
Per-chunk failure policy — picks how the chunker reacts when the underlying model call fails on one chunk. See module docs for the trade-off matrix.
GradeVerdict
Three-way verdict the grader emits per (query, document) pair, matching the CRAG paper’s relevance classes.

Constants§

CONTEXTUAL_CHUNKER_DEFAULT_INSTRUCTION
Default operator-facing instruction prepended to every model call. Verbatim from Anthropic’s published Contextual Retrieval recipe — lifts the model into the right framing without requiring per-corpus tuning.
CORRECTIVE_RAG_AGENT_NAME
Stable agent name surfaced on every emitted entelix_agents::AgentEvent and OTel entelix.agent.run span.
DEFAULT_CHUNK_OVERLAP_CHARS
Default overlap between consecutive chunks. ~10% of DEFAULT_CHUNK_SIZE_CHARS preserves enough trailing context for retrieval grounding without bloating the index.
DEFAULT_CHUNK_OVERLAP_TOKENS
Default overlap between consecutive chunks in tokens. ~12.5% of DEFAULT_CHUNK_SIZE_TOKENS preserves enough trailing context for retrieval grounding without bloating the index.
DEFAULT_CHUNK_SIZE_CHARS
Default chunk size in characters. ~1000 chars maps to roughly 200-300 tokens for English under cl100k_base, comfortably under every shipping vendor’s per-message ceiling.
DEFAULT_CHUNK_SIZE_TOKENS
Default chunk size in tokens. 512 matches the typical embedding context window (text-embedding-3-small and -large both cap at 8191 tokens; chunking under 512 leaves headroom for query + instruction tokens at retrieval time).
DEFAULT_GENERATOR_SYSTEM_PROMPT
Default system prompt the generator node prepends to every answer-generation call. Vendor-neutral, focused on grounded answer style.
DEFAULT_GRADER_INSTRUCTION
Default instruction prepended to every model call. Frames the task verbatim in the CRAG-paper terms so the model emits one of the three canonical labels.
DEFAULT_MARKDOWN_HEADING_LEVELS
Default ATX heading levels that open a new chunk. [1, 2, 3] splits at #, ##, ###; deeper sub-headings (####+) stay inline.
DEFAULT_MAX_REWRITE_ATTEMPTS
Default cap on rewrite-loop attempts before the recipe surrenders and generates over whatever was retrieved last. 3 is the CRAG paper’s reported sweet spot (retrieval rarely improves beyond the third rewrite).
DEFAULT_MIN_CORRECT_FRACTION
Default minimum fraction of retrieved documents that must grade GradeVerdict::Correct for the recipe to skip rewriting and proceed directly to generation. 0.5 matches the CRAG paper’s mid-confidence threshold — operators tuning for higher retrieval precision raise it; those tuning for lower model spend (fewer rewrites at the cost of weaker grounding) lower it.
DEFAULT_RECURSIVE_SEPARATORS
Default separator priority list. Paragraph break → line break → word boundary → character. The empty-string fallback guarantees termination even on pathological input (one giant unbroken token).
DEFAULT_RETRIEVAL_TOP_K
Default top-k passed into the retriever on every retrieval pass. Operator-overridable via CragConfig::with_retrieval_top_k.
DEFAULT_REWRITER_INSTRUCTION
Default instruction prepended to every model call. Verbatim matches the CRAG-paper rewriter framing — the model produces one corrected query string, no surrounding explanation.
PROVENANCE_METADATA_KEY
Reserved key on the persisted metadata map under which the pipeline stamps Source + Lineage + namespace. Carries the entelix prefix so an operator’s own metadata fields never collide. Retrieval-side consumers reach back to provenance through this nested object.
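The recursion that the separator and size defaults above drive can be sketched as follows. This simplified version omits the merge-toward-budget and overlap steps the real splitters perform, but shows the priority-order recursion and the empty-string termination guarantee (`recursive_split` is illustrative, not a crate item):

```rust
/// Simplified recursive character split: try separators in priority
/// order; if a piece still exceeds the budget, recurse with the next
/// separator; the empty-string fallback hard-cuts by character, so the
/// recursion terminates even on one giant unbroken token.
fn recursive_split(text: &str, seps: &[&str], max_chars: usize) -> Vec<String> {
    if text.chars().count() <= max_chars {
        return vec![text.to_string()];
    }
    let (sep, rest) = seps
        .split_first()
        .map(|(s, r)| (*s, r))
        .unwrap_or(("", &[]));
    if sep.is_empty() {
        // Pathological input: hard-cut into max_chars-sized pieces.
        return text
            .chars()
            .collect::<Vec<_>>()
            .chunks(max_chars)
            .map(|c| c.iter().collect())
            .collect();
    }
    text.split(sep)
        .flat_map(|piece| recursive_split(piece, rest, max_chars))
        .collect()
}

fn main() {
    // Mirrors the documented priority list: paragraph break, line break,
    // word boundary, then the empty-string character fallback.
    let seps = ["\n\n", "\n", " ", ""];
    let out = recursive_split("alpha beta\n\ngamma deltadeltadelta", &seps, 10);
    // Every chunk respects the character budget, including the unbroken
    // 15-char token, which the fallback hard-cut.
    assert!(out.iter().all(|c| c.chars().count() <= 10));
    assert_eq!(out, vec!["alpha beta", "gamma", "deltadelta", "delta"]);
}
```

Note the sketch drops separators when splitting; production splitters typically keep or re-attach them so that reassembled chunks round-trip.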

Traits§

Chunker
Async transform applied to a sequence of chunks after a TextSplitter ran. Implementations may issue LLM calls, embedding lookups, or external metadata enrichment; the ExecutionContext supplies cancellation, deadline, and any entelix_core::RunBudget caps the parent pipeline configured.
DocumentLoader
Source-side trait the ingestion pipeline pulls documents from.
QueryRewriter
Async trait the corrective-RAG recipe calls when retrieval quality requires another attempt with a different query. Implementations may be LLM-driven, heuristic (query-expansion / synonym-bag), classifier-routed, or any hybrid — the recipe takes whatever string comes back and re-runs retrieval with it.
RetrievalGrader
Async trait the corrective-RAG recipe consults for every retrieved document. Implementations may be LLM-driven (the canonical case, see LlmRetrievalGrader) or keyword / heuristic / classifier-model based — the recipe doesn’t care as long as the verdict is one of the three GradeVerdict variants.
TextSplitter
Pure-algorithm slice of a Document into smaller documents.
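To illustrate that the rewriter contract really is "any hybrid that returns a corrected query string", here is a heuristic synonym-bag implementation against a simplified, synchronous stand-in for the async QueryRewriter trait (the trait shape, `SynonymBagRewriter`, and the synonym table are all illustrative):

```rust
// Simplified, synchronous stand-in for the crate's async QueryRewriter:
// the recipe only needs a corrected query string back, so a heuristic
// query-expansion rewriter with no LLM call is a legitimate impl.
trait QueryRewriter {
    fn rewrite(&self, query: &str) -> String;
}

/// Toy query-expansion rewriter: appends a known synonym for each term
/// that appears in the query but whose synonym does not.
struct SynonymBagRewriter {
    synonyms: Vec<(&'static str, &'static str)>,
}

impl QueryRewriter for SynonymBagRewriter {
    fn rewrite(&self, query: &str) -> String {
        let mut expanded = query.to_string();
        for (term, synonym) in &self.synonyms {
            if query.contains(term) && !query.contains(synonym) {
                expanded.push(' ');
                expanded.push_str(synonym);
            }
        }
        expanded
    }
}

fn main() {
    let rewriter = SynonymBagRewriter {
        synonyms: vec![("car", "automobile"), ("fast", "quick")],
    };
    let out = rewriter.rewrite("fast car rental");
    assert_eq!(out, "fast car rental automobile quick");
}
```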

Functions§

build_corrective_rag_graph
Compile the corrective-RAG graph from operator-supplied primitives. Use this when you need to embed the graph as a node in a larger StateGraph; for a ready-to-execute agent, prefer create_corrective_rag_agent.
create_corrective_rag_agent
Build a ready-to-execute corrective-RAG Agent. Wraps build_corrective_rag_graph in the standard Agent<S> shape so the full lifecycle (AgentEvent stream, sink fan-out, observer hooks, supervisor handoff) integrates uniformly with every other recipe (create_react_agent, create_supervisor_agent, create_chat_agent).

Type Aliases§

DocumentStream
Boxed stream type alias for documents produced by a DocumentLoader. Items are Result so a partial-success stream can yield successful documents while reporting per-item errors — a single mid-walk failure does not abort the whole ingestion run.