The fastest semantic text chunking library, with up to 1 TB/s chunking throughput.
This crate provides three main functionalities:

- Size-based chunking ([`chunk`] module): Split text into chunks of a target size, preferring to break at delimiter boundaries.
- Delimiter splitting ([`split`] module): Split text at every delimiter occurrence, equivalent to Cython's `split_text` function.
- Token-aware merging ([`merge`] module): Merge segments based on token counts, equivalent to Cython's `_merge_splits` function.
§Examples

§Size-based chunking

```rust
use chunk::chunk;

let text = b"Hello world. How are you? I'm fine.\nThanks for asking.";

// With defaults (4KB chunks, split at \n . ?)
let chunks: Vec<&[u8]> = chunk(text).collect();

// With a custom size and delimiters
let chunks: Vec<&[u8]> = chunk(text).size(1024).delimiters(b"\n.?!").collect();

// With a multi-byte pattern (e.g., metaspace for SentencePiece tokenizers)
let metaspace = "▁".as_bytes(); // [0xE2, 0x96, 0x81]
let chunks: Vec<&[u8]> = chunk(b"Hello\xE2\x96\x81World").pattern(metaspace).collect();
```
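If, as the builder above suggests, the chunks are contiguous, non-overlapping slices of the input (an assumption, not something stated on this page), they should concatenate back to the original bytes. A minimal round-trip sketch under that assumption:

```rust
// Round-trip sketch: assumes chunks are contiguous, non-overlapping
// slices covering the whole input (not guaranteed by the docs above).
use chunk::chunk;

let text = b"Hello world. How are you? I'm fine.\nThanks for asking.";
let chunks: Vec<&[u8]> = chunk(text).size(16).collect();

// Concatenating the chunks should reproduce the input exactly.
let rejoined: Vec<u8> = chunks.concat();
assert_eq!(rejoined.as_slice(), text.as_slice());
```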
§Delimiter splitting

```rust
use chunk::{split, split_at_delimiters, IncludeDelim};

let text = b"Hello. World. Test.";

// Using the builder API
let slices = split(text).delimiters(b".").include_prev().collect_slices();
assert_eq!(slices, vec![b"Hello.".as_slice(), b" World.".as_slice(), b" Test.".as_slice()]);

// Using the function directly
let offsets = split_at_delimiters(text, b".", IncludeDelim::Prev, 0);
assert_eq!(&text[offsets[0].0..offsets[0].1], b"Hello.");
```
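For multi-byte patterns there is an offset-based counterpart, `split_at_patterns`. Its exact signature is not shown on this page, so the sketch below assumes it mirrors `split_at_delimiters`, taking a slice of byte-string patterns:

```rust
// Hypothetical sketch: the signature of split_at_patterns is assumed to
// mirror split_at_delimiters, with a slice of multi-byte patterns.
use chunk::{split_at_patterns, IncludeDelim};

let metaspace = "▁".as_bytes(); // [0xE2, 0x96, 0x81]
let text = b"Hello\xE2\x96\x81World";
let offsets = split_at_patterns(text, &[metaspace], IncludeDelim::Prev, 0);
// With IncludeDelim::Prev, the pattern should stay with the preceding segment.
assert_eq!(&text[offsets[0].0..offsets[0].1], "Hello▁".as_bytes());
```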
§Token-aware merging

```rust
use chunk::merge_splits;

// Merge text segments based on token counts
let splits = vec!["a", "b", "c", "d", "e", "f", "g"];
let token_counts = vec![1, 1, 1, 1, 1, 1, 1];
let result = merge_splits(&splits, &token_counts, 3);
assert_eq!(result.merged, vec!["abc", "def", "g"]);
assert_eq!(result.token_counts, vec![3, 3, 1]);
```
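With non-uniform token counts, the limit still applies per merged chunk. The sketch below assumes `merge_splits` packs consecutive segments greedily from left to right, which this page does not spell out:

```rust
// Sketch with non-uniform token counts. Greedy left-to-right packing is
// an assumption; only the merge_splits call shape comes from the docs.
use chunk::merge_splits;

let splits = vec!["hello ", "big ", "world ", "!"];
let token_counts = vec![2, 1, 2, 1];
let result = merge_splits(&splits, &token_counts, 3);
// Assumed result: "hello big " (2+1 tokens), then "world !" (2+1 tokens).
assert_eq!(result.merged, vec!["hello big ", "world !"]);
assert_eq!(result.token_counts, vec![3, 3]);
```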
§Structs

- `Chunker`: Splits text at delimiter boundaries.
- `FilteredIndices`: Result of filtering split indices.
- `MergeResult`: Result of a `merge_splits` operation.
- `MinimaResult`: Result of finding local minima.
- `OwnedChunker`: Owned chunker for FFI bindings (Python, WASM).
- `PatternSplitter`: A compiled multi-pattern splitter for efficient repeated splitting.
- `Splitter`: Splits text at every delimiter occurrence.
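When the same patterns are applied to many documents, `PatternSplitter` is the compiled, reusable form. Its constructor and method names are not shown on this page, so everything in this sketch beyond the type name is an assumption:

```rust
// Hypothetical sketch: `new` and `split` are assumed names; the docs
// above only say PatternSplitter is compiled once for repeated use.
use chunk::PatternSplitter;

let patterns: &[&[u8]] = &["▁".as_bytes(), b"\n"];
let splitter = PatternSplitter::new(patterns); // assumed constructor

for doc in [b"Hello\xE2\x96\x81World".as_slice(), b"line one\nline two".as_slice()] {
    let offsets = splitter.split(doc); // assumed method, offsets as (start, end)
    println!("{} segments", offsets.len());
}
```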
§Enums

- `IncludeDelim`: Where to include the delimiter in splits.
§Constants

- `DEFAULT_DELIMITERS`: Default delimiters: newline, period, question mark.
- `DEFAULT_TARGET_SIZE`: Default chunk target size (4KB).
§Functions

- `chunk`: Chunk text at delimiter boundaries.
- `filter_split_indices`: Filter split indices by percentile threshold and minimum distance.
- `find_local_minima_interpolated`: Find local minima using first and second derivatives from a Savitzky-Golay filter.
- `find_merge_indices`: Find merge indices for combining segments within token limits.
- `merge_splits`: Merge text segments based on token counts, respecting chunk size limits.
- `savgol_filter`: Apply a Savitzky-Golay filter to data.
- `split`: Builder for delimiter-based splitting with more options.
- `split_at_delimiters`: Split text at every delimiter occurrence, returning offsets.
- `split_at_patterns`: Split text at every occurrence of any multi-byte pattern.
- `windowed_cross_similarity`: Compute windowed cross-similarity for semantic chunking.
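The last four helpers line up into a semantic-chunking pipeline: score similarity between neighboring windows, smooth the curve, find dips, then keep the strong, well-separated dips as breakpoints. The sketch below is a hypothetical wiring of that pipeline; every signature, window size, and field name in it is an assumption, and only the function names and their one-line descriptions come from this page:

```rust
// Hypothetical pipeline sketch; all signatures and field names assumed.
use chunk::{
    filter_split_indices, find_local_minima_interpolated, savgol_filter,
    windowed_cross_similarity,
};

// Stand-in sentence embeddings (normally produced by an embedding model).
let embeddings: Vec<Vec<f32>> = vec![vec![0.5; 8]; 32];

// 1. Similarity of each position to its neighbors within a window.
let sims = windowed_cross_similarity(&embeddings, 3); // window size assumed

// 2. Smooth the curve before taking derivatives.
let smoothed = savgol_filter(&sims, 5, 2); // window length / poly order assumed

// 3. Dips in similarity are candidate topic boundaries.
let minima = find_local_minima_interpolated(&smoothed);

// 4. Keep only strong, well-separated boundaries.
let breakpoints = filter_split_indices(&minima.indices, 0.8, 2); // threshold / distance assumed
```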