//! Pluggable text preprocessor infrastructure.
//!
//! Preprocessors run as a sequential pipeline inside
//! [`crate::chunking::SlidingWindowChunker`] before tokenization. Each
//! preprocessor receives the text of a section and returns either
//! `Some(transformed)` or `None`. A `None` return from any stage
//! short-circuits the remaining stages and drops the entire section, so
//! no chunks are produced from it.
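//!
//! The short-circuit behavior can be sketched as follows. This is a minimal
//! illustration, not the crate's actual implementation: the local
//! `TextPreprocessor` trait is a stand-in that assumes a single `process`
//! method, and `Trim`, `DropEmpty`, and `run_pipeline` are hypothetical names
//! introduced only for this example.
//!
//! ```rust
//! // Stand-in for the crate's trait (assumed shape: one `process` method).
//! trait TextPreprocessor {
//!     fn process(&self, text: &str) -> Option<String>;
//! }
//!
//! // Hypothetical stage: trims surrounding whitespace.
//! struct Trim;
//!
//! impl TextPreprocessor for Trim {
//!     fn process(&self, text: &str) -> Option<String> {
//!         Some(text.trim().to_string())
//!     }
//! }
//!
//! // Hypothetical stage: drops sections whose text is empty.
//! struct DropEmpty;
//!
//! impl TextPreprocessor for DropEmpty {
//!     fn process(&self, text: &str) -> Option<String> {
//!         if text.is_empty() {
//!             None
//!         } else {
//!             Some(text.to_string())
//!         }
//!     }
//! }
//!
//! // Runs the stages in order, feeding each output into the next stage.
//! // The `?` operator returns `None` at the first stage that drops the section.
//! fn run_pipeline(stages: &[Box<dyn TextPreprocessor>], text: &str) -> Option<String> {
//!     let mut current = text.to_string();
//!     for stage in stages {
//!         current = stage.process(&current)?;
//!     }
//!     Some(current)
//! }
//! ```
//!
//! With the stages `[Trim, DropEmpty]`, a whitespace-only section is trimmed to
//! an empty string and then dropped. With the order reversed, the same section
//! passes `DropEmpty` untouched and comes out of `Trim` as an empty string, so
//! stage order matters.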
//!
//! ## Registration
//!
//! Preprocessors are registered on a [`crate::config::ChunkingStrategy`] via
//! [`crate::config::ChunkingStrategy::register_preprocessor`]:
//!
//! ```rust
//! use triplets_core::{ChunkingStrategy, DenoiserConfig, DenoiserPreprocessor};
//!
//! let mut strategy = ChunkingStrategy::default();
//! strategy.register_preprocessor(DenoiserPreprocessor::new(DenoiserConfig {
//!     enabled: true,
//!     max_digit_ratio: 0.35,
//!     strip_markdown: true,
//! }));
//! ```
//!
//! Multiple preprocessors run in registration order; the output of one feeds
//! the next.
/// Built-in preprocessor implementations.
/// Trait for pluggable text preprocessors.
///
/// Implement this trait to transform or filter section text before it is
/// tokenized and chunked. The pipeline is sequential: the output of each
/// stage feeds the next.
///
/// # Implementing
///
/// ```rust
/// use triplets_core::TextPreprocessor;
///
/// struct UppercasePreprocessor;
///
/// impl TextPreprocessor for UppercasePreprocessor {
///     fn process(&self, text: &str) -> Option<String> {
///         Some(text.to_uppercase())
///     }
/// }
/// ```
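///
/// A preprocessor can also act as a filter by returning `None`. The sketch
/// below assumes the trait has only the `process` method shown above;
/// `MinLengthFilter` and `min_chars` are hypothetical names, and the local
/// trait is a stand-in so the example compiles on its own.
///
/// ```rust
/// // Stand-in mirroring the trait shape used above; with the real crate,
/// // replace it with `use triplets_core::TextPreprocessor;`.
/// trait TextPreprocessor {
///     fn process(&self, text: &str) -> Option<String>;
/// }
///
/// // Hypothetical filter: drops sections shorter than `min_chars` characters.
/// struct MinLengthFilter {
///     min_chars: usize,
/// }
///
/// impl TextPreprocessor for MinLengthFilter {
///     fn process(&self, text: &str) -> Option<String> {
///         // Count Unicode scalar values rather than bytes.
///         if text.chars().count() < self.min_chars {
///             None // the section is dropped and produces no chunks
///         } else {
///             Some(text.to_string())
///         }
///     }
/// }
/// ```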