The fastest semantic text chunking library — up to 1TB/s chunking throughput.
This crate provides three main functionalities:
- Size-based chunking (`chunk` module): Split text into chunks of a target size, preferring to break at delimiter boundaries.
- Delimiter splitting (`split` module): Split text at every delimiter occurrence, equivalent to Cython’s `split_text` function.
- Token-aware merging (`merge` module): Merge segments based on token counts, equivalent to Cython’s `_merge_splits` function.
§Examples
§Size-based chunking
```rust
use chunk::chunk;

let text = b"Hello world. How are you? I'm fine.\nThanks for asking.";

// With defaults (4KB chunks, split at \n . ?)
let chunks: Vec<&[u8]> = chunk(text).collect();

// With custom size and delimiters
let chunks: Vec<&[u8]> = chunk(text).size(1024).delimiters(b"\n.?!").collect();

// With multi-byte pattern (e.g., metaspace for SentencePiece tokenizers)
let metaspace = "▁".as_bytes(); // [0xE2, 0x96, 0x81]
let chunks: Vec<&[u8]> = chunk(b"Hello\xE2\x96\x81World").pattern(metaspace).collect();
```
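Chunks borrow from the input rather than copying it. As a quick sanity check, here is a minimal sketch, assuming chunking is lossless (every input byte lands in exactly one chunk, as the slice-based API suggests):

```rust
use chunk::chunk;

let text = b"Hello world. How are you? I'm fine.\nThanks for asking.";

// Each chunk is a sub-slice of `text`; concatenating them in order
// should reproduce the original bytes if chunking is lossless.
let chunks: Vec<&[u8]> = chunk(text).size(16).collect();
let rejoined: Vec<u8> = chunks.concat();
assert_eq!(rejoined.as_slice(), &text[..]);
```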
§Delimiter splitting

```rust
use chunk::{split, split_at_delimiters, IncludeDelim};

let text = b"Hello. World. Test.";

// Using the builder API
let slices = split(text).delimiters(b".").include_prev().collect_slices();
assert_eq!(slices, vec![b"Hello.".as_slice(), b" World.".as_slice(), b" Test.".as_slice()]);

// Using the function directly
let offsets = split_at_delimiters(text, b".", IncludeDelim::Prev, 0);
assert_eq!(&text[offsets[0].0..offsets[0].1], b"Hello.");
```
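Because `split_at_delimiters` returns byte offsets rather than slices, you can keep indices into the original buffer, which is useful for spans and annotations. A small sketch, assuming each offset is a `(start, end)` tuple as the indexing above implies:

```rust
use chunk::{split_at_delimiters, IncludeDelim};

let text = b"Hello. World. Test.";

// Offsets index into the original buffer; no bytes are copied.
let offsets = split_at_delimiters(text, b".", IncludeDelim::Prev, 0);
for &(start, end) in &offsets {
    println!("{}..{} -> {:?}", start, end, std::str::from_utf8(&text[start..end]));
}
```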
§Token-aware merging

```rust
use chunk::merge_splits;

// Merge segments based on token counts
let token_counts = vec![1, 1, 1, 1, 1, 1, 1];
let result = merge_splits(&token_counts, 3, false);
assert_eq!(result.indices, vec![3, 6, 7]); // Merge indices
assert_eq!(result.token_counts, vec![3, 3, 1]); // Merged token counts
```
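The pieces compose into a typical split-then-merge pipeline: split at sentence boundaries, count tokens per segment, then merge adjacent segments under a budget. A hedged sketch; the whitespace word count stands in for a real tokenizer, and it assumes `merge_splits` accepts `usize` counts and that indices are exclusive end positions (consistent with `[3, 6, 7]` over seven segments above):

```rust
use chunk::{merge_splits, split};

let text = b"One two. Three. Four five six. Seven.";

// 1. Split at sentence boundaries, keeping the delimiter with the
//    preceding segment.
let segments = split(text).delimiters(b".").include_prev().collect_slices();

// 2. Count "tokens" per segment; whitespace words stand in for a real
//    tokenizer here.
let token_counts: Vec<usize> = segments
    .iter()
    .map(|s| s.split(|&b| b == b' ').filter(|w| !w.is_empty()).count())
    .collect();

// 3. Merge adjacent segments so each merged chunk stays within the budget.
let merged = merge_splits(&token_counts, 4, false);

// 4. Reassemble merged chunks, treating each index as an exclusive end.
let mut start = 0;
for &end in &merged.indices {
    let piece = segments[start..end].concat();
    println!("{:?}", String::from_utf8_lossy(&piece));
    start = end;
}
```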
Structs§
- Chunker: Chunker splits text at delimiter boundaries.
- MergeResult: Result of a merge_splits operation.
- OwnedChunker: Owned chunker for FFI bindings (Python, WASM).
- Splitter: Splitter splits text at every delimiter occurrence.
Enums§
- IncludeDelim: Where to include the delimiter in splits.
Constants§
- DEFAULT_DELIMITERS: Default delimiters: newline, period, question mark.
- DEFAULT_TARGET_SIZE: Default chunk target size (4KB).
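The exported constants can be passed back into the builder. A minimal sketch, assuming `DEFAULT_TARGET_SIZE` and `DEFAULT_DELIMITERS` have the types that `size()` and `delimiters()` expect:

```rust
use chunk::{chunk, DEFAULT_DELIMITERS, DEFAULT_TARGET_SIZE};

let text = b"Hello. World?\nBye.";

// Spelling out the defaults explicitly; equivalent to plain `chunk(text)`.
let chunks: Vec<&[u8]> = chunk(text)
    .size(DEFAULT_TARGET_SIZE)
    .delimiters(DEFAULT_DELIMITERS)
    .collect();
assert!(!chunks.is_empty());
```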
Functions§
- chunk: Chunk text at delimiter boundaries.
- compute_merged_token_counts: Compute merged token counts from merge indices.
- find_merge_indices: Find merge indices for combining segments within token limits.
- merge_splits: Merge segments based on token counts, respecting chunk size limits.
- split: Builder for delimiter-based splitting with more options.
- split_at_delimiters: Split text at every delimiter occurrence, returning offsets.
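From their descriptions, `find_merge_indices` and `compute_merged_token_counts` look like the two halves of `merge_splits`. A speculative sketch of how they might compose; the exact signatures are assumptions mirroring the `merge_splits` example above, not confirmed by this page:

```rust
use chunk::{compute_merged_token_counts, find_merge_indices};

let token_counts = vec![1, 1, 1, 1, 1, 1, 1];

// Assumed signatures: first find the merge boundaries, then derive the
// merged counts from those indices.
let indices = find_merge_indices(&token_counts, 3, false);
let merged = compute_merged_token_counts(&token_counts, &indices);
assert_eq!(merged, vec![3, 3, 1]);
```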