OCR denoising and markdown-table cleanup for text chunks. Pluggable text preprocessor infrastructure.
Preprocessors run as a sequential pipeline inside
crate::chunking::SlidingWindowChunker before tokenization. Each
preprocessor receives the text of a section and returns either
Some(transformed) or None. A None return from any stage
short-circuits the remainder and causes the entire section to be dropped —
no chunks are produced from it.
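The short-circuit behaviour described above can be sketched as follows. This is a minimal illustration, not the crate's implementation: the `Stage` trait, the two example stages, and `run_pipeline` are hypothetical names, and the real `TextPreprocessor` signature may differ.

```rust
// Hypothetical stand-in for the crate's preprocessor trait; the real
// signature may differ. A stage returns Some(text) to continue or None
// to drop the section.
trait Stage {
    fn process(&self, text: &str) -> Option<String>;
}

/// Example stage: collapses runs of whitespace into single spaces.
struct CollapseWhitespace;

impl Stage for CollapseWhitespace {
    fn process(&self, text: &str) -> Option<String> {
        Some(text.split_whitespace().collect::<Vec<_>>().join(" "))
    }
}

/// Example stage: rejects sections whose share of ASCII digits (among
/// non-whitespace characters) exceeds `max_digit_ratio`.
struct DigitRatioFilter {
    max_digit_ratio: f64,
}

impl Stage for DigitRatioFilter {
    fn process(&self, text: &str) -> Option<String> {
        let total = text.chars().filter(|c| !c.is_whitespace()).count();
        let digits = text.chars().filter(|c| c.is_ascii_digit()).count();
        if total > 0 && digits as f64 / total as f64 > self.max_digit_ratio {
            None
        } else {
            Some(text.to_string())
        }
    }
}

/// Runs the stages in order. `try_fold` stops at the first None, so later
/// stages never see a rejected section and the caller gets None back.
fn run_pipeline(stages: &[Box<dyn Stage>], section: &str) -> Option<String> {
    stages
        .iter()
        .try_fold(section.to_string(), |text, stage| stage.process(&text))
}
```

Under these assumptions, a caller would skip chunking entirely whenever `run_pipeline` returns `None`, which matches the "no chunks are produced" rule above.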
§Registration
Preprocessors are registered on a crate::config::ChunkingStrategy via
crate::config::ChunkingStrategy::register_preprocessor:

use triplets_core::{ChunkingStrategy, DenoiserConfig, DenoiserPreprocessor};

let mut strategy = ChunkingStrategy::default();
strategy.register_preprocessor(DenoiserPreprocessor::new(DenoiserConfig {
    enabled: true,
    max_digit_ratio: 0.35,
    strip_markdown: true,
}));

Multiple preprocessors run in registration order; the output of one feeds the next.
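Because each stage sees the previous stage's output, registration order can change whether a section survives. The sketch below illustrates this with plain function pointers; `StageFn`, `strip_digits`, `reject_digit_heavy`, and `apply_in_order` are hypothetical names chosen for the example and are not part of the crate.

```rust
// Hypothetical stage signature, for illustration only.
type StageFn = fn(&str) -> Option<String>;

/// Removes every ASCII digit from the text.
fn strip_digits(text: &str) -> Option<String> {
    Some(text.chars().filter(|c| !c.is_ascii_digit()).collect())
}

/// Drops the section when more than 35% of its non-whitespace
/// characters are ASCII digits.
fn reject_digit_heavy(text: &str) -> Option<String> {
    let total = text.chars().filter(|c| !c.is_whitespace()).count();
    let digits = text.chars().filter(|c| c.is_ascii_digit()).count();
    if total > 0 && digits as f64 / total as f64 > 0.35 {
        None
    } else {
        Some(text.to_string())
    }
}

/// Applies the stages in slice order, feeding each output to the next
/// stage and returning None as soon as any stage rejects the text.
fn apply_in_order(stages: &[StageFn], section: &str) -> Option<String> {
    let mut text = section.to_string();
    for stage in stages {
        text = stage(&text)?;
    }
    Some(text)
}
```

With the input `"id 123456 ok"`, running `strip_digits` before `reject_digit_heavy` keeps the section, while the reverse order drops it because the filter sees the digits before they are removed.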
Modules§
- backends: Built-in TextPreprocessor implementations.
Traits§
- TextPreprocessor: Trait for pluggable text preprocessors.