// entelix_rag/lib.rs

//! # entelix-rag
//!
//! Algorithmic primitives for retrieval-augmented generation
//! pipelines — `Document` with provenance + lineage, plus the
//! `DocumentLoader` / `TextSplitter` / `Chunker` trait surface
//! every RAG path composes around.
//!
//! ## Position
//!
//! 2026-era agentic RAG (Contextual Retrieval, Self-RAG, CRAG,
//! Adaptive-RAG) is no longer a side pipeline — it's the agent's
//! baseline working memory. This crate ships the *algorithmic*
//! primitives (splitters, chunkers, ingestion composition) that
//! every consumer reaches for. Concrete source connectors (S3,
//! Notion, Confluence, GDrive, …) live in companion crates so the
//! core surface stays small and dependency-light.
//!
//! ## Surface
//!
//! - [`Document`] — RAG-shaped document with [`Source`] (where it
//!   came from), [`Lineage`] (split / chunk ancestry), and
//!   [`entelix_memory::Namespace`] (multi-tenant boundary). The
//!   retrieval-side [`entelix_memory::Document`] (with similarity
//!   score) is a *result* shape; this is the *ingestion* shape.
//! - [`DocumentLoader`] — async source-side trait. Streams to keep
//!   ingestion memory-bounded over arbitrarily large corpora.
//! - [`TextSplitter`] — sync algorithmic primitive. Slices a
//!   `Document` into smaller `Document`s preserving `Lineage`.
//! - [`Chunker`] — async transform over a chunk sequence. LLM-call
//!   capable (Anthropic Contextual Retrieval, HyDE, query
//!   decomposition).
//!
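//! A minimal sketch of how the ingestion shapes compose. Constructor and
//! method names below (`Document::new`, `split`, `lineage`) are illustrative
//! assumptions; the trait and type docs are authoritative:
//!
//! ```rust,ignore
//! use entelix_rag::{Document, RecursiveCharacterSplitter, TextSplitter};
//!
//! // Ingestion-shaped document: carries Source / Lineage / Namespace,
//! // not a similarity score (that is the retrieval-side shape).
//! let doc = Document::new("raw page text");
//!
//! // Sync algorithmic split; each output chunk records its ancestry.
//! let splitter = RecursiveCharacterSplitter::default();
//! let chunks: Vec<Document> = splitter.split(&doc);
//!
//! for chunk in &chunks {
//!     // Lineage ties the chunk back to the parent document it was cut from.
//!     println!("{:?}", chunk.lineage());
//! }
//! ```
//!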
//! ## What lives in companion crates
//!
//! - **Source connectors** — `entelix-rag-s3`, `entelix-rag-notion`,
//!   `entelix-rag-confluence`, `entelix-rag-fs` (filesystem-backed,
//!   invariant 9 exemption).
//! - **Vendor-accurate tokenizers** — `entelix-tokenizer-tiktoken`,
//!   `entelix-tokenizer-hf`, locale-aware companions
//!   (Korean / Japanese morphology). The
//!   [`entelix_core::TokenCounter`] trait is the integration
//!   surface; this crate's [`TokenCountSplitter`] is generic over
//!   any `C: TokenCounter + ?Sized + 'static` (default
//!   `dyn TokenCounter`) so concrete `Arc<TiktokenCounter>` and
//!   type-erased `Arc<dyn TokenCounter>` plug in interchangeably.
//!   Vendor accuracy is a counter swap, not a splitter rewrite.
//!
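//! A hedged sketch of that counter swap. Only the `TokenCounter` bound and
//! the re-exported defaults come from this crate; the `TokenCountSplitter`
//! constructor shape and the `cl100k_base` name are assumptions, and
//! `TiktokenCounter` lives in the companion tokenizer crate:
//!
//! ```rust,ignore
//! use std::sync::Arc;
//!
//! use entelix_core::TokenCounter;
//! use entelix_rag::{
//!     DEFAULT_CHUNK_OVERLAP_TOKENS, DEFAULT_CHUNK_SIZE_TOKENS, TokenCountSplitter,
//! };
//! use entelix_tokenizer_tiktoken::TiktokenCounter;
//!
//! // Type-erased counter; any vendor-accurate implementation plugs in here
//! // (the `cl100k_base` constructor name is assumed for illustration).
//! let counter: Arc<dyn TokenCounter> = Arc::new(TiktokenCounter::cl100k_base());
//!
//! // Constructor shape assumed for illustration; swapping vendors only
//! // changes the counter above, never the splitter itself.
//! let splitter: TokenCountSplitter = TokenCountSplitter::new(
//!     counter,
//!     DEFAULT_CHUNK_SIZE_TOKENS,
//!     DEFAULT_CHUNK_OVERLAP_TOKENS,
//! );
//! ```
//!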
//! ## Why algorithmic primitives only
//!
//! The LangChain ecosystem's mistake was bundling 100+ source
//! connectors into the core surface — version churn became
//! unmanageable. entelix-rag's core is explicitly small (4 traits +
//! `Document` + provenance types) so vendor-specific loaders ship
//! independently and never gate the core's release cadence. The
//! algorithmic primitives (splitters, chunkers, ingestion
//! composition) *are* universal, so they live here.

#![cfg_attr(docsrs, feature(doc_cfg))]
#![doc(html_root_url = "https://docs.rs/entelix-rag/0.5.3")]
#![deny(missing_docs)]
#![allow(
    clippy::doc_markdown,
    clippy::missing_errors_doc,
    clippy::missing_panics_doc,
    clippy::module_name_repetitions,
    clippy::too_long_first_doc_paragraph,
    // Tests use unwrap/expect liberally; the splitter modules call
    // `Regex::new(...).expect(...)` on a compile-time-constant
    // pattern (the round-trip test pins regex correctness).
    clippy::expect_used,
    clippy::indexing_slicing,
    clippy::unwrap_used
)]

mod chunker;
mod corrective;
mod document;
mod loader;
mod pipeline;
mod splitter;
pub use chunker::{
    CONTEXTUAL_CHUNKER_DEFAULT_INSTRUCTION, Chunker, ContextualChunker, ContextualChunkerBuilder,
    FailurePolicy,
};
pub use corrective::{
    CORRECTIVE_RAG_AGENT_NAME, CorrectiveRagState, CragConfig, DEFAULT_GENERATOR_SYSTEM_PROMPT,
    DEFAULT_GRADER_INSTRUCTION, DEFAULT_MAX_REWRITE_ATTEMPTS, DEFAULT_MIN_CORRECT_FRACTION,
    DEFAULT_RETRIEVAL_TOP_K, DEFAULT_REWRITER_INSTRUCTION, GradeVerdict, LlmQueryRewriter,
    LlmQueryRewriterBuilder, LlmRetrievalGrader, LlmRetrievalGraderBuilder, QueryRewriter,
    RetrievalGrader, build_corrective_rag_graph, create_corrective_rag_agent,
};
pub use document::{Document, DocumentId, Lineage, Source};
pub use loader::{DocumentLoader, DocumentStream};
pub use pipeline::{
    IngestError, IngestReport, IngestionPipeline, IngestionPipelineBuilder, PROVENANCE_METADATA_KEY,
};
pub use splitter::{
    DEFAULT_CHUNK_OVERLAP_CHARS, DEFAULT_CHUNK_OVERLAP_TOKENS, DEFAULT_CHUNK_SIZE_CHARS,
    DEFAULT_CHUNK_SIZE_TOKENS, DEFAULT_MARKDOWN_HEADING_LEVELS, DEFAULT_RECURSIVE_SEPARATORS,
    MarkdownStructureSplitter, RecursiveCharacterSplitter, TextSplitter, TokenCountSplitter,
};