# Julienne
Julienne is a Rust library for cutting text into range-preserving chunks for retrieval, embedding, indexing, search, and context-building pipelines.
It is deliberately a chunking library, not an ingestion framework. Bring strings that you already extracted from documents, Markdown, HTML/XML, SQL, prose, or source code; Julienne returns chunks with explicit boundaries and provenance.
## Documentation
- Getting started
- Chunker guide
- Structured chunks and offsets
- Sizing and token windows
- Semantic chunking
- Structure-aware chunking
- API contracts
- Release and quality gates
## Features
- Character-based splitters: `CharacterTextSplitter`, `RecursiveCharacterTextSplitter`
- Sentence-aware splitters: `SentenceChunker`
- Semchunk-inspired recursive splitter: `SemchunkSplitter`
- Embedding-boundary splitter: `SemanticChunker`
- Structure-aware splitters: `MarkdownChunker`, `HtmlChunker`, and `CodeChunker` (behind the `code` feature)
- Pluggable length function (`LengthFn`) for character, word, or tokenizer-based sizing
- Pluggable embedding function (`EmbeddingFn`) for semantic chunking
- Zero-copy structured chunks with byte and character offsets
## Splitters

### Choosing A Chunker
Use SemchunkSplitter as the general-purpose default for natural language and
mixed prose because it applies a punctuation-aware delimiter hierarchy before
falling back to smaller units. Use RecursiveCharacterTextSplitter when you
want LangChain-style separator behavior. Use SentenceChunker when sentence
boundaries are mandatory and embeddings are not available. Use
SemanticChunker when a domain-relevant embedder can identify topic shifts; it
is only as good as the embedding signal and falls back to sentence-sized packing
when no embedder is configured. Use try_split_chunks with fallible embedders
when provider failures should be returned as ChunkError instead of panicking
through the infallible convenience API. Use MarkdownChunker or HtmlChunker when the
input format carries useful block structure. Use CodeChunker when source code
should be split by parser-recognized Rust or Python AST nodes.
### CharacterTextSplitter
LangChain-style single-separator splitting + merge with overlap.
Best for simple and predictable chunk boundaries.
### RecursiveCharacterTextSplitter
LangChain-style recursive fallback over separators (`\n\n` -> `\n` -> `" "` -> character fallback).
Best general-purpose default when text structure varies.
### SentenceChunker
Sentence-aware chunking with sentence-preserving overlap backtracking.
Supports:
- `min_characters_per_sentence`
- `min_sentences_per_chunk`
- custom sentence delimiters
### SemchunkSplitter
Semchunk-inspired recursive splitter with punctuation-aware hierarchy.
Supports:
- adaptive merge (binary-search span fitting)
- optional memoization (`memoize`)
- stricter delimiter precedence mode (`strict_mode`)
- configurable `length_fn` (including tokenizers)
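The adaptive merge above can be sketched independently of the crate: given pre-split units and a size budget, binary-search the largest prefix of units whose joined length still fits, instead of growing one unit at a time. The `fit_span` helper and its signature are illustrative, not Julienne's API.

```rust
// Sketch of binary-search span fitting: joined length is monotonic in the
// prefix count, so binary search finds the largest fitting prefix in
// O(log n) length measurements.
fn fit_span(units: &[&str], budget: usize, len_of: impl Fn(&str) -> usize) -> usize {
    let (mut lo, mut hi) = (0usize, units.len());
    while lo < hi {
        let mid = (lo + hi + 1) / 2; // candidate prefix length
        let joined = units[..mid].join(" ");
        if len_of(&joined) <= budget {
            lo = mid; // this prefix fits; try a larger one
        } else {
            hi = mid - 1; // too big; shrink
        }
    }
    lo // number of units in the largest fitting span (0 if none fit)
}
```

This is why an expensive `length_fn` (such as a tokenizer) benefits from span fitting and memoization: fewer, larger measurements replace per-unit growth.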
### SemanticChunker
Embedding-similarity boundary detection:
- sentence windows
- cosine similarity between adjacent windows
- Savitzky-Golay smoothing
- local minima boundary detection
- optional skip-window reconnection for tangential asides
Falls back to sentence-based greedy splitting if no embedder is configured.
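The boundary signal rests on cosine similarity between adjacent sentence-window embeddings. A minimal, library-independent version of that step (the function name is illustrative):

```rust
// Cosine similarity between two embedding vectors: dot product divided by
// the product of the L2 norms. A low value between adjacent windows marks
// a candidate topic boundary.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0 // degenerate embedding; treat as no similarity
    } else {
        dot / (norm_a * norm_b)
    }
}
```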
## Quick Start
```rust
use julienne::{RecursiveCharacterTextSplitter, SemanticChunker, SemchunkSplitter, SentenceChunker};

let text = "Hello world. This is a document with multiple sentences.";

// Constructor arguments (chunk size, overlap) are illustrative.
let recursive = RecursiveCharacterTextSplitter::new(200, 20);
let recursive_chunks = recursive.split_text(text);

let sentence = SentenceChunker::new(200, 20);
let sentence_chunks = sentence.split_text(text);

let semchunk = SemchunkSplitter::new(200, 20);
let semchunk_chunks = semchunk.split_text(text);

// With no embedder configured, SemanticChunker falls back to sentence packing.
let semantic = SemanticChunker::new(200, 20);
let semantic_chunks = semantic.split_text(text);
```
For a fuller walkthrough, see Getting started.
## Structured Chunks And Offsets
Use split_chunks when downstream code needs provenance:
```rust
use julienne::{RecursiveCharacterTextSplitter, TextChunk};

let text = "Intro.\n\nDetails with café.";
// Constructor arguments (chunk size, overlap) are illustrative.
let splitter = RecursiveCharacterTextSplitter::new(16, 0);
let chunks: Vec<TextChunk> = splitter.split_chunks(text);
for chunk in chunks {
    println!(
        "bytes {}..{}, chars {}..{}: {:?}",
        chunk.start_byte, chunk.end_byte, chunk.start_char, chunk.end_char, chunk.text
    );
}
```
TextChunk contains text, start_byte, end_byte, start_char, end_char,
measured_length, and optional metadata. The byte range always indexes the
original input passed to the splitter. The character offsets are counted from
the start of that same input.
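The byte/char distinction matters for any non-ASCII input: slicing the original string must use byte offsets, while char offsets give user-facing positions. A small standalone sketch (the helper name is illustrative):

```rust
// "Details with café." is 18 chars but 19 bytes in UTF-8, because 'é'
// encodes as two bytes. Returns (byte length, char length).
fn offsets_demo() -> (usize, usize) {
    let text = "Details with café.";
    (text.len(), text.chars().count())
}
```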
split_text is the owned-string convenience API. split_chunks collects
structured chunks. chunks is the iterator-style API for infallible splitters
that can expose structured output without changing the caller-facing contract.
## Token-Aware Length Example
```rust
use julienne::SemchunkSplitter;

// Sizes and the word-count closure are illustrative; any `LengthFn`,
// including a tokenizer-backed one, can be supplied.
let splitter = SemchunkSplitter::builder()
    .chunk_size(512)
    .chunk_overlap(64)
    .length_fn(|s: &str| s.split_whitespace().count())
    .build()
    .unwrap();
let chunks = splitter.split_text(text);
```
## Typed Sizing And Token Windows
```rust
// The original identifiers in this example were lost in extraction; names
// and values below are reconstructions from the surrounding text.
use julienne::SemchunkSplitter;

let splitter = SemchunkSplitter::builder()
    .chunk_size(64)
    .length_fn(|s: &str| s.split_whitespace().count())
    .build()
    .unwrap();
// `try_split_text` surfaces failures as `ChunkError` instead of panicking.
let chunks = splitter
    .try_split_text(text)
    .unwrap();
```
Optional tokenizer integrations are feature gated:
- `tiktoken-rs` enables `julienne::token::tiktoken::TiktokenBoundaryProvider`.
- `tokenizers` enables `julienne::token::huggingface::HuggingFaceBoundaryProvider`.
- `unicode-segmentation` enables `GraphemeSizer` and `UnicodeWordSizer`.
## Structure-Aware Chunking
```rust
use julienne::{HtmlChunker, MarkdownChunker};

// Constructor arguments (chunk size, overlap) and the input variables
// `markdown_text` / `html_text` are illustrative.
let markdown = MarkdownChunker::new(512, 64)
    .unwrap()
    .split_text(markdown_text);

let html = HtmlChunker::new(512, 64)
    .unwrap()
    .split_text(html_text);
```
MarkdownChunker preserves headings, paragraphs, lists, and fenced code blocks.
HtmlChunker works on already-extracted HTML/XML strings and uses block-level
tag boundaries; it does not fetch pages, sanitize markup, remove boilerplate, or
perform readability extraction.
With --features code, CodeChunker uses tree-sitter parsers for Rust and
Python and returns explicit ChunkError values for parser failures or oversized
semantic nodes.
## SemanticChunker Example
```rust
use julienne::SemanticChunker;

// All values are illustrative; `embed` stands in for your `EmbeddingFn`.
let chunker = SemanticChunker::builder()
    .chunk_size(512)
    .chunk_overlap(64)
    .window_size(3)
    .skip_window(2)
    .reconnect_similarity_threshold(0.8)
    .max_aside_length(120)
    .embedding_fn(embed)
    .build()
    .unwrap();
let chunks = chunker.split_text(text);
```
## Testing
Integration tests use fixtures under:
- `examples/summarize_article.sql`
- `examples/embeddings_from_documents/documents/pgai.md`
## Benchmarks
Code chunker benchmarks are gated behind the `code` feature. The benchmark
suite compares all splitter strategies on Markdown and SQL fixtures,
including word-count and tiktoken length profiles for semchunk.
## API Contracts
Structured chunk APIs return TextChunk with stable source offsets. Public
splitter types and builders are Send + Sync + Clone when their configured
handles are. Infallible splitters expose split_text, split_chunks, and
iterator-style chunks where available. Fallible integrations use try_*
methods and return ChunkError; this includes batch embedding and tree-sitter
code parsing.
## Scope And Non-Goals
This crate is a chunking library. Compared with Chonkie, the focus is Rust-native range-preserving chunks, explicit fallibility, and configurable splitter strategies rather than a Python-first experimentation surface. Compared with Chunkr, this crate does not try to be document intelligence infrastructure: no OCR, hosted API, document loading service, layout extraction, vector database handshake, or ingestion pipeline is included. Bring already-extracted text, Markdown, HTML/XML, or source code strings; the crate returns chunks.