Module preprocessor

OCR denoising and markdown-table cleanup for text chunks. Pluggable text preprocessor infrastructure.

Preprocessors run as a sequential pipeline inside crate::chunking::SlidingWindowChunker before tokenization. Each preprocessor receives the text of a section and returns either Some(transformed) or None. A None return from any stage short-circuits the remaining stages and drops the entire section, so no chunks are produced from it.
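The short-circuit behavior can be sketched with plain functions. This is a minimal illustration, not the crate's implementation: the `Stage` alias, both stage functions, and `run_pipeline` are hypothetical names, and the real TextPreprocessor trait may have a different signature. The 0.35 threshold only echoes the `max_digit_ratio` value from the registration example; its exact semantics in DenoiserConfig are an assumption here.

```rust
// Hypothetical stage type: takes a section's text, returns Some(transformed)
// to continue or None to drop the section. Not the crate's real trait.
type Stage = fn(&str) -> Option<String>;

// Illustrative stage: collapse runs of whitespace into single spaces.
fn collapse_whitespace(text: &str) -> Option<String> {
    Some(text.split_whitespace().collect::<Vec<_>>().join(" "))
}

// Illustrative stage: drop sections that are empty or mostly digits.
// The 0.35 cutoff is an assumed reading of `max_digit_ratio`.
fn reject_digit_heavy(text: &str) -> Option<String> {
    let total = text.chars().filter(|c| !c.is_whitespace()).count();
    let digits = text.chars().filter(|c| c.is_ascii_digit()).count();
    if total == 0 || digits as f64 / total as f64 > 0.35 {
        None
    } else {
        Some(text.to_string())
    }
}

// Feed the section through each stage in order. `try_fold` stops at the
// first None, so later stages never run and the section yields nothing.
fn run_pipeline(stages: &[Stage], section: &str) -> Option<String> {
    stages
        .iter()
        .try_fold(section.to_string(), |text, stage| stage(&text))
}
```

With both stages in place, a prose section comes back cleaned, while a digit-heavy section returns None and would produce no chunks.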

§Registration

Preprocessors are registered on a crate::config::ChunkingStrategy via crate::config::ChunkingStrategy::register_preprocessor:

use triplets_core::{ChunkingStrategy, DenoiserConfig, DenoiserPreprocessor};

let mut strategy = ChunkingStrategy::default();
strategy.register_preprocessor(DenoiserPreprocessor::new(DenoiserConfig {
    enabled: true,
    max_digit_ratio: 0.35,
    strip_markdown: true,
}));

Multiple preprocessors run in registration order; the output of one feeds the next.
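Because each stage sees the previous stage's output, registration order can change the result. The sketch below uses a hypothetical `Stage` trait and `Pipeline` struct that only mimic the Some/None contract described above; they are stand-ins for illustration, not the ChunkingStrategy or TextPreprocessor API.

```rust
// Hypothetical trait mirroring the Some/None contract; the real
// TextPreprocessor signature may differ.
trait Stage {
    fn process(&self, text: &str) -> Option<String>;
}

// Illustrative stage: lowercase the whole section.
struct Lowercase;
impl Stage for Lowercase {
    fn process(&self, text: &str) -> Option<String> {
        Some(text.to_lowercase())
    }
}

// Illustrative stage: drop any section containing the marker (case-sensitive).
struct DropIfContains(&'static str);
impl Stage for DropIfContains {
    fn process(&self, text: &str) -> Option<String> {
        if text.contains(self.0) {
            None
        } else {
            Some(text.to_string())
        }
    }
}

// Hypothetical container that keeps stages in registration order.
#[derive(Default)]
struct Pipeline {
    stages: Vec<Box<dyn Stage>>,
}

impl Pipeline {
    fn register(&mut self, stage: impl Stage + 'static) {
        self.stages.push(Box::new(stage));
    }

    // Run stages in the order they were registered; `?` returns None as
    // soon as any stage drops the section.
    fn run(&self, section: &str) -> Option<String> {
        let mut text = section.to_string();
        for stage in &self.stages {
            text = stage.process(&text)?;
        }
        Some(text)
    }
}
```

Registering `Lowercase` before `DropIfContains("draft")` drops the section "DRAFT notes", because the filter sees the lowercased text. Registering them the other way round lets the section through, since the filter runs on the original uppercase text before it is lowercased.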

Modules§

backends
Built-in super::TextPreprocessor implementations.

Traits§

TextPreprocessor
Trait for pluggable text preprocessors.