//! Text chunking utilities.
//!
//! This module provides text chunking functionality using the `text-splitter` library.
//! It splits long text into smaller chunks while preserving semantic boundaries.
//!
//! # Features
//!
//! - **Smart splitting**: Respects word and sentence boundaries
//! - **Markdown-aware**: Preserves Markdown structure (headings, code blocks, lists)
//! - **Configurable overlap**: Overlap chunks to maintain context
//! - **Unicode support**: Handles CJK characters and emojis correctly
//! - **Batch processing**: Process multiple texts efficiently
//!
//! # Chunker Types
//!
//! - **Text**: Generic text splitter, splits on whitespace and punctuation
//! - **Markdown**: Markdown-aware splitter, preserves formatting and structure
//! - **Yaml**: YAML-aware splitter, creates one chunk per top-level key
//!
//! # Example
//!
//! ```rust
//! use kreuzberg::chunking::{chunk_text, ChunkingConfig, ChunkerType};
//!
//! # fn example() -> kreuzberg::Result<()> {
//! let config = ChunkingConfig {
//!     max_characters: 500,
//!     overlap: 50,
//!     trim: true,
//!     chunker_type: ChunkerType::Text,
//!     ..Default::default()
//! };
//!
//! let long_text = "This is a very long document...".repeat(100);
//! let result = chunk_text(&long_text, &config, None)?;
//!
//! println!("Split into {} chunks", result.chunk_count);
//! for (i, chunk) in result.chunks.iter().enumerate() {
//!     // `len()` would report bytes; count characters to match the label.
//!     println!("Chunk {}: {} chars", i + 1, chunk.content.chars().count());
//! }
//! # Ok(())
//! # }
//! ```
//!
//! # Use Cases
//!
//! - Splitting documents for LLM context windows
//! - Creating overlapping chunks for semantic search
//! - Processing large documents in batches
//! - Maintaining context across chunk boundaries
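//!
//! # Overlap sketch
//!
//! The overlap setting listed under Features can be pictured with a minimal,
//! standard-library-only sketch. This is an assumption-laden illustration, not
//! how `text-splitter` or this module works: `sliding_chunks` is a hypothetical
//! helper that cuts fixed-size character windows and ignores word, sentence,
//! and Markdown boundaries entirely.
//!
//! ```rust
//! /// Hypothetical helper (not part of this module): fixed-size character
//! /// windows where each window repeats the last `overlap` characters of
//! /// the previous one.
//! fn sliding_chunks(text: &str, max_characters: usize, overlap: usize) -> Vec<String> {
//!     // The window must advance, so the overlap has to be smaller than the window.
//!     assert!(max_characters > 0 && overlap < max_characters);
//!     // Index by `char` rather than by byte so multi-byte characters are never split.
//!     let chars: Vec<char> = text.chars().collect();
//!     let step = max_characters - overlap;
//!     let mut chunks = Vec::new();
//!     let mut start = 0;
//!     while start < chars.len() {
//!         let end = (start + max_characters).min(chars.len());
//!         chunks.push(chars[start..end].iter().collect());
//!         if end == chars.len() {
//!             break; // the final window reached the end of the text
//!         }
//!         start += step;
//!     }
//!     chunks
//! }
//! ```
//!
//! With these assumptions, `sliding_chunks("abcdefghij", 4, 1)` produces
//! `abcd`, `defg`, and `ghij`: each chunk begins with the last character of
//! the one before it.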
use once_cell::sync::OnceCell;
use std::sync::Arc;
// Module declarations
// Re-export submodule types and functions
pub use ;
pub use classify_chunk;
pub use ; // ChunkingConfig re-exported from core::config::processing
pub use ;
pub use ChunkingProcessor;
pub use ;
use crate::Result;
/// One-time initialization guard for the chunking processor registry.
///
/// Set to `()` once registration succeeds. If registration fails the cell remains
/// empty, allowing the next call to retry.
static PROCESSOR_INITIALIZED: OnceCell<()> = OnceCell::new();
/// Ensure the chunking processor is registered.
///
/// This function is called automatically when needed.
/// It is safe to call multiple times: once registration has succeeded,
/// later calls do nothing.
/// Register the chunking processor with the global registry.
///
/// This function should be called once at application startup to register
/// the chunking post-processor.
///
/// **Note:** This is called automatically on first use.
/// Explicit calling is optional.