Julienne is a Rust library for cutting text into range-preserving chunks.
It provides simple separator splitters, recursive and sentence-aware splitters, semantic chunking, token-window chunking, and structure-aware chunkers for Markdown, HTML/XML, and optional tree-sitter-backed code input.
Structured chunk APIs return TextChunk values whose text field is a
zero-copy slice of the original input. The offset invariant for every
structured chunk is:
&input[chunk.start_byte..chunk.end_byte] == chunk.text
Iterator APIs named chunks stream structured chunks where the algorithm can
operate incrementally. split_chunks collects those chunks, and split_text
projects them into owned strings for convenience.
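The invariant and the three API shapes described above can be illustrated without the crate itself. The following is a minimal sketch, not Julienne's implementation: the `Chunk` struct and the `sketch_chunks`, `sketch_split_chunks`, and `sketch_split_text` functions are hypothetical stand-ins that model only the `text`, `start_byte`, and `end_byte` fields named in the text, using a plain separator split.

```rust
// Hypothetical stand-in for a structured chunk; only the three fields
// mentioned in the documentation are modelled here.
#[derive(Debug, PartialEq)]
struct Chunk<'a> {
    text: &'a str, // zero-copy slice of the original input
    start_byte: usize,
    end_byte: usize,
}

// Streams chunks lazily: splits on `sep` and tracks byte offsets as it goes.
// Empty pieces are skipped, but their separators still advance the cursor.
fn sketch_chunks<'a>(input: &'a str, sep: &'a str) -> impl Iterator<Item = Chunk<'a>> + 'a {
    let mut cursor = 0;
    input.split(sep).filter_map(move |piece| {
        let start = cursor;
        cursor += piece.len() + sep.len();
        if piece.is_empty() {
            None
        } else {
            Some(Chunk { text: piece, start_byte: start, end_byte: start + piece.len() })
        }
    })
}

// Collects the streamed chunks into a Vec (offsets and borrowed text retained).
fn sketch_split_chunks<'a>(input: &'a str, sep: &'a str) -> Vec<Chunk<'a>> {
    sketch_chunks(input, sep).collect()
}

// Projects the chunks into owned strings, dropping the offsets.
fn sketch_split_text(input: &str, sep: &str) -> Vec<String> {
    sketch_chunks(input, sep).map(|c| c.text.to_owned()).collect()
}

fn main() {
    let input = "alpha\n\nb\u{e9}ta\n\ngamma";

    // The offset invariant holds for every streamed chunk,
    // including chunks that contain multi-byte characters.
    for c in sketch_chunks(input, "\n\n") {
        assert_eq!(&input[c.start_byte..c.end_byte], c.text);
    }

    assert_eq!(sketch_split_chunks(input, "\n\n").len(), 3);
    assert_eq!(sketch_split_text(input, "\n\n"), ["alpha", "b\u{e9}ta", "gamma"]);
}
```

Because each chunk borrows from `input`, the offsets can be used to map a chunk back to its exact position in the source; the owned-string projection trades that provenance for convenience.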
§Quick start
use julienne::SemchunkSplitter;
let splitter = SemchunkSplitter::new(200, 40);
let chunks = splitter.split_text("Julienne keeps chunking small, explicit, and provenance-safe.");
assert!(!chunks.is_empty());
Re-exports§
pub use character::CharacterTextSplitter;
pub use chunk::ChunkMetadata;
pub use chunk::TextChunk;
pub use chunk::TextChunkIter;
pub use error::ChunkError;
pub use recursive::RecursiveCharacterTextSplitter;
pub use semantic::SemanticChunker;
pub use semchunk::SemchunkSplitter;
pub use sentence::SentenceChunker;
pub use sizing::ByteSizer;
pub use sizing::CharSizer;
pub use sizing::ChunkConfig;
pub use sizing::ChunkSizer;
pub use sizing::FunctionSizer;
pub use sizing::WordSizer;
pub use split::KeepSeparator;
pub use structure::HtmlChunker;
pub use structure::MarkdownChunker;
pub use structure::XmlChunker;
pub use token::TokenBoundaryProvider;
pub use token::TokenChunker;
pub use token::TokenSpan;
Modules§
Traits§
Functions§
- char_len - Default length function: counts Unicode characters.
Type Aliases§
- EmbedderHandle
- EmbeddingFn
- LengthFn - A custom length function for text splitting (e.g. token counting).
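To make the difference between a character-counting default and a custom length function concrete, here is a minimal sketch. It does not use the crate's definitions: the `LenFn` alias and the `count_chars`, `count_words`, and `fits` helpers are hypothetical, and the actual signatures of `char_len` and `LengthFn` may differ.

```rust
// Hypothetical alias for a pluggable length function; the crate's real
// LengthFn definition is not shown on this page and may differ.
type LenFn = fn(&str) -> usize;

// Counts Unicode scalar values (Rust `char`s), not bytes.
fn count_chars(text: &str) -> usize {
    text.chars().count()
}

// A custom measure: whitespace-separated words, as a crude token proxy.
fn count_words(text: &str) -> usize {
    text.split_whitespace().count()
}

// Checks whether `text` fits a size budget under the chosen measure.
fn fits(text: &str, max: usize, len: LenFn) -> bool {
    len(text) <= max
}

fn main() {
    // "naive cafe" with two accented letters, each two bytes in UTF-8.
    let text = "na\u{ef}ve caf\u{e9}";

    assert_eq!(text.len(), 12); // byte length
    assert_eq!(count_chars(text), 10); // character length
    assert_eq!(count_words(text), 2); // word length

    // The same budget gives different answers under different measures.
    assert!(fits(text, 10, count_chars));
    assert!(!fits(text, 1, count_words));
}
```

The choice of measure changes where chunk boundaries fall, so a size limit is only meaningful together with the length function used to evaluate it.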