//! # code-chunker
//!
//! AST-aware code chunking and late chunking for RAG pipelines.
//!
//! ## Two primitives
//!
//! ### `CodeChunker` — split source code at AST boundaries
//!
//! Tree-sitter walks the parse tree and produces chunks aligned to
//! function, class, impl, and module boundaries. When a node fits the
//! configured size budget it is kept intact; oversize nodes are split
//! recursively at structural separators. Supports Rust, Python,
//! TypeScript/JavaScript, and Go (behind the `code` feature).
//!
//! ### `LateChunkingPooler` — pool token embeddings into chunk embeddings
//!
//! Late chunking (Günther et al. 2024, arXiv:2409.04701) embeds the full
//! document first so every token attends to the rest of the document,
//! then mean-pools token embeddings inside each chunk's byte span. The
//! result is a per-chunk embedding that carries document-wide context —
//! pronouns, anaphora, and acronym definitions are no longer lost at
//! chunk boundaries.
//!
//! `LateChunkingPooler` is span-only: bring your own boundaries from any
//! source — `CodeChunker`, `text-splitter`, regex, or hand-built `Slab`s.
//!
//! ## What this crate does not do
//!
//! - **General-purpose text chunking.** Use [`text-splitter`](https://crates.io/crates/text-splitter)
//!   for fixed/sentence/recursive prose splitting; it's the de facto Rust
//! standard with broader Unicode and tokenizer support.
//! - **Format conversion (PDF, HTML, DOCX).** Input is `&str`. Use
//! [`deformat`](https://crates.io/crates/deformat) or
//! [`pdf-extract`](https://crates.io/crates/pdf-extract) upstream.
//! - **Embedding generation.** `LateChunkingPooler` consumes
//! pre-computed token embeddings; bring your own long-context model
//! (Jina v2/v3, nomic-embed-text, candle, ort).
//! - **Vector store integration.** [`Slab`] is the boundary; enable the
//! `serde` feature and wire to qdrant-client, lancedb, sqlx, etc. yourself.
//!
//! ## Quick start (code chunking)
//!
//! ```ignore
//! use code_chunker::{Chunker, CodeChunker, CodeLanguage};
//!
//! let chunker = CodeChunker::new(CodeLanguage::Rust, 1500, 0);
//! let slabs = chunker.chunk(source_code);
//! ```
//!
//! ## Quick start (late chunking)
//!
//! ```ignore
//! use code_chunker::{LateChunkingPooler, Slab};
//!
//! // Bring your own chunk boundaries (text-splitter, CodeChunker, ...).
//! let chunks: Vec<Slab> = my_chunker(&document);
//!
//! // Embed the full document with a long-context model.
//! let token_embeddings: Vec<Vec<f32>> = my_model.embed_tokens(&document);
//!
//! // Pool token embeddings into per-chunk embeddings.
//! let pooler = LateChunkingPooler::new(384);
//! let chunk_embeddings = pooler.pool(&token_embeddings, &chunks, document.len());
//! ```
// Re-exports of the crate's public API. The module paths below are assumed
// from the items documented above; adjust them to this crate's actual layout.
pub use crate::chunker::Chunker;
pub use crate::chunker::CodeChunker;
pub use crate::chunker::CodeLanguage;
pub use crate::pool::LateChunkingPooler;
pub use crate::slab::Slab;
/// A chunking strategy: text in, [`Slab`]s out.
///
/// Implementors override [`chunk_bytes`](Chunker::chunk_bytes); the default
/// [`chunk`](Chunker::chunk) method adds Unicode character offsets.
///
/// This crate only ships one public chunker — [`CodeChunker`] — but the
/// trait is public so users can wrap external chunkers (text-splitter,
/// regex, custom logic) and feed the output into [`LateChunkingPooler`].