Crate chunk

The fastest semantic text chunking library: up to 1 TB/s chunking throughput.

This crate provides three main functionalities:

  1. Size-based chunking (chunk module): Split text into chunks of a target size, preferring to break at delimiter boundaries.

  2. Delimiter splitting (split module): Split text at every delimiter occurrence, equivalent to Cython’s split_text function.

  3. Token-aware merging (merge module): Merge segments based on token counts, equivalent to Cython’s _merge_splits function.

§Examples

§Size-based chunking

use chunk::chunk;

let text = b"Hello world. How are you? I'm fine.\nThanks for asking.";

// With defaults (4KB chunks, split at \n . ?)
let chunks: Vec<&[u8]> = chunk(text).collect();

// With custom size and delimiters
let chunks: Vec<&[u8]> = chunk(text).size(1024).delimiters(b"\n.?!").collect();

// With multi-byte pattern (e.g., metaspace for SentencePiece tokenizers)
let metaspace = "▁".as_bytes(); // [0xE2, 0x96, 0x81]
let chunks: Vec<&[u8]> = chunk(b"Hello\xE2\x96\x81World").pattern(metaspace).collect();
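To make the chunking behavior concrete, here is a minimal self-contained sketch of size-based chunking with preferred delimiter breaks. The function `chunk_greedy` is hypothetical (it is not part of this crate's API) and does not reflect the crate's actual, much faster implementation; it only illustrates the contract: each chunk is at most `size` bytes, and the cut lands just after the last delimiter inside the window when one exists.

```rust
/// Illustrative sketch only: split `text` into chunks of at most `size`
/// bytes, preferring to break just after the last delimiter in the window.
fn chunk_greedy<'a>(text: &'a [u8], size: usize, delims: &[u8]) -> Vec<&'a [u8]> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let end = (start + size).min(text.len());
        let cut = if end == text.len() {
            end
        } else {
            // Search backwards for a delimiter; fall back to a hard cut.
            text[start..end]
                .iter()
                .rposition(|b| delims.contains(b))
                .map(|i| start + i + 1) // break just after the delimiter
                .unwrap_or(end)
        };
        chunks.push(&text[start..cut]);
        start = cut;
    }
    chunks
}

fn main() {
    let text = b"Hello world. How are you? I'm fine.\nThanks for asking.";
    let chunks = chunk_greedy(text, 16, b"\n.?");
    // Every chunk stays within the target size.
    assert!(chunks.iter().all(|c| c.len() <= 16));
    // Concatenating the chunks reproduces the input losslessly.
    assert_eq!(chunks.concat(), text);
}
```

Note that chunks concatenate back to the original input: chunking never drops or rewrites bytes, it only chooses cut points.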

§Delimiter splitting

use chunk::{split, split_at_delimiters, IncludeDelim};

let text = b"Hello. World. Test.";

// Using the builder API
let slices = split(text).delimiters(b".").include_prev().collect_slices();
assert_eq!(slices, vec![b"Hello.".as_slice(), b" World.".as_slice(), b" Test.".as_slice()]);

// Using the function directly
let offsets = split_at_delimiters(text, b".", IncludeDelim::Prev, 0);
assert_eq!(&text[offsets[0].0..offsets[0].1], b"Hello.");
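The offset-based output can be illustrated with a minimal standalone sketch. The function `split_offsets` below is hypothetical (not this crate's API) and mimics only the `IncludeDelim::Prev` behavior: each delimiter is attached to the piece before it, and a trailing piece without a delimiter is still emitted.

```rust
/// Illustrative sketch only: return (start, end) byte offsets for each
/// piece, attaching each delimiter to the preceding piece ("Prev").
fn split_offsets(text: &[u8], delims: &[u8]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut start = 0;
    for (i, b) in text.iter().enumerate() {
        if delims.contains(b) {
            out.push((start, i + 1)); // include the delimiter
            start = i + 1;
        }
    }
    if start < text.len() {
        out.push((start, text.len())); // trailing piece, no delimiter
    }
    out
}

fn main() {
    let text = b"Hello. World. Test.";
    let offsets = split_offsets(text, b".");
    assert_eq!(offsets, vec![(0, 6), (6, 13), (13, 19)]);
    assert_eq!(&text[offsets[0].0..offsets[0].1], b"Hello.");
}
```

Returning offsets instead of slices keeps the function allocation-light and lets callers index into the original buffer however they like.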

§Token-aware merging

use chunk::merge_splits;

// Merge segments based on token counts
let token_counts = vec![1, 1, 1, 1, 1, 1, 1];
let result = merge_splits(&token_counts, 3, false);
assert_eq!(result.indices, vec![3, 6, 7]); // Merge indices
assert_eq!(result.token_counts, vec![3, 3, 1]); // Merged token counts
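To see where the indices `[3, 6, 7]` come from, here is a minimal standalone sketch of greedy token-aware merging. The function `merge_by_tokens` is hypothetical (not this crate's API, and not necessarily its exact algorithm): it combines consecutive segments while the running token count stays within the limit, recording the exclusive end index of each merged group.

```rust
/// Illustrative sketch only: greedily merge consecutive segments while the
/// running token count stays within `limit`; return the exclusive end index
/// of each merged group alongside its total token count.
fn merge_by_tokens(token_counts: &[usize], limit: usize) -> (Vec<usize>, Vec<usize>) {
    let mut indices = Vec::new();
    let mut merged = Vec::new();
    let mut running = 0;
    for (i, &n) in token_counts.iter().enumerate() {
        if running > 0 && running + n > limit {
            // Adding this segment would exceed the limit: close the group.
            indices.push(i);
            merged.push(running);
            running = 0;
        }
        running += n;
    }
    if running > 0 {
        indices.push(token_counts.len());
        merged.push(running);
    }
    (indices, merged)
}

fn main() {
    // Seven segments of one token each, merged under a three-token limit:
    // groups end at indices 3, 6, and 7, with 3 + 3 + 1 tokens.
    let (indices, merged) = merge_by_tokens(&[1, 1, 1, 1, 1, 1, 1], 3);
    assert_eq!(indices, vec![3, 6, 7]);
    assert_eq!(merged, vec![3, 3, 1]);
}
```

Reading the result: segments `0..3` form the first chunk, `3..6` the second, and `6..7` the last, matching the `merge_splits` example above.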

Structs§

Chunker
Chunker splits text at delimiter boundaries.
MergeResult
Result of merge_splits operation.
OwnedChunker
Owned chunker for FFI bindings (Python, WASM).
Splitter
Splitter splits text at every delimiter occurrence.

Enums§

IncludeDelim
Where to include the delimiter in splits.

Constants§

DEFAULT_DELIMITERS
Default delimiters: newline, period, question mark.
DEFAULT_TARGET_SIZE
Default chunk target size (4KB).

Functions§

chunk
Chunk text at delimiter boundaries.
compute_merged_token_counts
Compute merged token counts from merge indices.
find_merge_indices
Find merge indices for combining segments within token limits.
merge_splits
Merge segments based on token counts, respecting chunk size limits.
split
Builder for delimiter-based splitting with more options.
split_at_delimiters
Split text at every delimiter occurrence, returning offsets.