// entelix_rag/lib.rs
//! # entelix-rag
//!
//! Algorithmic primitives for retrieval-augmented generation
//! pipelines — `Document` with provenance + lineage, plus the
//! `DocumentLoader` / `TextSplitter` / `Chunker` trait surface
//! every RAG path composes around.
//!
//! ## Position
//!
//! 2026-era agentic RAG (Contextual Retrieval, Self-RAG, CRAG,
//! Adaptive-RAG) is no longer a side pipeline — it's the agent's
//! baseline working memory. This crate ships the *algorithmic*
//! primitives (splitters, chunkers, ingestion composition) that
//! every consumer reaches for. Concrete source connectors (S3,
//! Notion, Confluence, GDrive, …) live in companion crates so the
//! core surface stays small and dependency-light.
//!
//! ## Surface
//!
//! - [`Document`] — RAG-shaped document with [`Source`] (where it
//!   came from), [`Lineage`] (split / chunk ancestry), and
//!   [`entelix_memory::Namespace`] (multi-tenant boundary). The
//!   retrieval-side [`entelix_memory::Document`] (with similarity
//!   score) is a *result* shape; this is the *ingestion* shape.
//! - [`DocumentLoader`] — async source-side trait. Streams to keep
//!   ingestion memory-bounded over arbitrarily large corpora.
//! - [`TextSplitter`] — sync algorithmic primitive. Slices a
//!   `Document` into smaller `Document`s preserving `Lineage`.
//! - [`Chunker`] — async transform over a chunk sequence. LLM-call
//!   capable (Anthropic Contextual Retrieval, HyDE, query
//!   decomposition).
//!
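//! ## Sketch: lineage-preserving splitting
//!
//! A minimal, self-contained illustration of the ingestion shape
//! above. The `Source`, `Lineage`, `Document`, and `split_fixed`
//! definitions here are hypothetical simplified stand-ins, not this
//! crate's actual types:
//!
//! ```rust
//! // Hypothetical simplified shapes — NOT this crate's real definitions.
//! #[derive(Clone, Debug)]
//! struct Source(String);
//!
//! #[derive(Clone, Debug, PartialEq)]
//! struct Lineage {
//!     parent_id: Option<u64>,
//!     chunk_index: Option<usize>,
//! }
//!
//! #[derive(Clone, Debug)]
//! struct Document {
//!     id: u64,
//!     content: String,
//!     source: Source,
//!     lineage: Lineage,
//! }
//!
//! // Fixed-width character splitter in the `TextSplitter` spirit:
//! // sync, one `Document` in, smaller `Document`s out, ancestry recorded.
//! fn split_fixed(doc: &Document, chunk_chars: usize, next_id: &mut u64) -> Vec<Document> {
//!     let chars: Vec<char> = doc.content.chars().collect();
//!     chars
//!         .chunks(chunk_chars)
//!         .enumerate()
//!         .map(|(i, window)| {
//!             *next_id += 1;
//!             Document {
//!                 id: *next_id,
//!                 content: window.iter().collect(),
//!                 // Provenance travels with every chunk...
//!                 source: doc.source.clone(),
//!                 // ...and lineage records the split ancestry.
//!                 lineage: Lineage { parent_id: Some(doc.id), chunk_index: Some(i) },
//!             }
//!         })
//!         .collect()
//! }
//!
//! let mut next_id = 0;
//! let parent = Document {
//!     id: 0,
//!     content: "abcdefghij".into(),
//!     source: Source("s3://bucket/report.md".into()),
//!     lineage: Lineage { parent_id: None, chunk_index: None },
//! };
//! let chunks = split_fixed(&parent, 4, &mut next_id);
//! assert_eq!(chunks.len(), 3); // "abcd", "efgh", "ij"
//! assert!(chunks.iter().all(|c| c.lineage.parent_id == Some(0)));
//! ```
//!
//! The real splitters honor the same contract: every produced chunk
//! carries its parent's provenance plus its own position in the split.
//!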
//! ## What lives in companion crates
//!
//! - **Source connectors** — `entelix-rag-s3`, `entelix-rag-notion`,
//!   `entelix-rag-confluence`, `entelix-rag-fs` (filesystem-backed,
//!   invariant 9 exemption).
//! - **Vendor-accurate tokenizers** — `entelix-tokenizer-tiktoken`,
//!   `entelix-tokenizer-hf`, locale-aware companions
//!   (Korean / Japanese morphology). The
//!   [`entelix_core::TokenCounter`] trait is the integration
//!   surface; this crate's [`TokenCountSplitter`] is generic over
//!   any `C: TokenCounter + ?Sized + 'static` (default
//!   `dyn TokenCounter`) so concrete `Arc<TiktokenCounter>` and
//!   type-erased `Arc<dyn TokenCounter>` plug in interchangeably.
//!   Vendor accuracy is a counter swap, not a splitter rewrite.
//!
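//! ## Sketch: counter-generic splitting
//!
//! How a `C: TokenCounter + ?Sized` bound lets concrete and
//! type-erased counters plug in interchangeably, sketched with a
//! hypothetical whitespace counter and a deliberately simplified
//! splitter (neither is this crate's real implementation):
//!
//! ```rust
//! use std::sync::Arc;
//!
//! // Assumed shape of the counting trait (illustrative only).
//! trait TokenCounter {
//!     fn count(&self, text: &str) -> usize;
//! }
//!
//! // Whitespace-token stand-in for a vendor-accurate counter.
//! struct WhitespaceCounter;
//! impl TokenCounter for WhitespaceCounter {
//!     fn count(&self, text: &str) -> usize {
//!         text.split_whitespace().count()
//!     }
//! }
//!
//! // `?Sized` admits both `Arc<Concrete>` and `Arc<dyn TokenCounter>`.
//! struct TokenCountSplitter<C: ?Sized> {
//!     counter: Arc<C>,
//!     max_tokens: usize,
//! }
//!
//! impl<C: TokenCounter + ?Sized> TokenCountSplitter<C> {
//!     // Greedy word packing: start a new chunk when the next word
//!     // would push the running chunk past `max_tokens`.
//!     fn split(&self, text: &str) -> Vec<String> {
//!         let mut chunks = Vec::new();
//!         let mut current = String::new();
//!         for word in text.split_whitespace() {
//!             let candidate = if current.is_empty() {
//!                 word.to_string()
//!             } else {
//!                 format!("{current} {word}")
//!             };
//!             if self.counter.count(&candidate) > self.max_tokens && !current.is_empty() {
//!                 chunks.push(std::mem::take(&mut current));
//!                 current = word.to_string();
//!             } else {
//!                 current = candidate;
//!             }
//!         }
//!         if !current.is_empty() {
//!             chunks.push(current);
//!         }
//!         chunks
//!     }
//! }
//!
//! // Concrete counter...
//! let concrete = TokenCountSplitter { counter: Arc::new(WhitespaceCounter), max_tokens: 3 };
//! // ...and type-erased counter plug in interchangeably.
//! let erased: Arc<dyn TokenCounter> = Arc::new(WhitespaceCounter);
//! let dynamic = TokenCountSplitter { counter: erased, max_tokens: 3 };
//! let text = "one two three four five";
//! assert_eq!(concrete.split(text), dynamic.split(text));
//! assert_eq!(concrete.split(text), vec!["one two three", "four five"]);
//! ```
//!
//! Swapping in a vendor tokenizer touches only the `counter` field,
//! which is the point of keeping the splitter generic.
//!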
//! ## Why algorithmic primitives only
//!
//! The LangChain ecosystem's mistake was bundling 100+ source
//! connectors into the core surface — version churn became
//! unmanageable. entelix-rag's core is explicitly small (4 traits +
//! `Document` + provenance types) so vendor-specific loaders ship
//! independently and never gate the core's release cadence. The
//! algorithmic primitives (splitters, chunkers, ingestion
//! composition) *are* universal, so they live here.

#![cfg_attr(docsrs, feature(doc_cfg))]
#![doc(html_root_url = "https://docs.rs/entelix-rag/0.5.3")]
#![deny(missing_docs)]
#![allow(
    clippy::doc_markdown,
    clippy::missing_errors_doc,
    clippy::missing_panics_doc,
    clippy::module_name_repetitions,
    clippy::too_long_first_doc_paragraph,
    // Tests use unwrap/expect liberally; the splitter modules call
    // `Regex::new(...).expect(...)` on a compile-time-constant
    // pattern (the round-trip test pins regex correctness).
    clippy::expect_used,
    clippy::indexing_slicing,
    clippy::unwrap_used
)]

mod chunker;
mod corrective;
mod document;
mod loader;
mod pipeline;
mod splitter;

pub use chunker::{
    CONTEXTUAL_CHUNKER_DEFAULT_INSTRUCTION, Chunker, ContextualChunker, ContextualChunkerBuilder,
    FailurePolicy,
};
pub use corrective::{
    CORRECTIVE_RAG_AGENT_NAME, CorrectiveRagState, CragConfig, DEFAULT_GENERATOR_SYSTEM_PROMPT,
    DEFAULT_GRADER_INSTRUCTION, DEFAULT_MAX_REWRITE_ATTEMPTS, DEFAULT_MIN_CORRECT_FRACTION,
    DEFAULT_RETRIEVAL_TOP_K, DEFAULT_REWRITER_INSTRUCTION, GradeVerdict, LlmQueryRewriter,
    LlmQueryRewriterBuilder, LlmRetrievalGrader, LlmRetrievalGraderBuilder, QueryRewriter,
    RetrievalGrader, build_corrective_rag_graph, create_corrective_rag_agent,
};
pub use document::{Document, DocumentId, Lineage, Source};
pub use loader::{DocumentLoader, DocumentStream};
pub use pipeline::{
    IngestError, IngestReport, IngestionPipeline, IngestionPipelineBuilder, PROVENANCE_METADATA_KEY,
};
pub use splitter::{
    DEFAULT_CHUNK_OVERLAP_CHARS, DEFAULT_CHUNK_OVERLAP_TOKENS, DEFAULT_CHUNK_SIZE_CHARS,
    DEFAULT_CHUNK_SIZE_TOKENS, DEFAULT_MARKDOWN_HEADING_LEVELS, DEFAULT_RECURSIVE_SEPARATORS,
    MarkdownStructureSplitter, RecursiveCharacterSplitter, TextSplitter, TokenCountSplitter,
};