# Julienne
Julienne is a Rust library for cutting text into range-preserving chunks for
retrieval, embedding, indexing, search, and context-building pipelines.
It is deliberately a chunking library, not an ingestion framework. Bring strings
that you already extracted from documents, Markdown, HTML/XML, SQL, prose, or
source code; Julienne returns chunks with explicit boundaries and provenance.
## Documentation
- [Getting started](docs/getting-started.md)
- [Chunker guide](docs/chunkers.md)
- [Structured chunks and offsets](docs/structured-chunks.md)
- [Sizing and token windows](docs/sizing-and-tokens.md)
- [Semantic chunking](docs/semantic-chunking.md)
- [Structure-aware chunking](docs/structure-aware-chunking.md)
- [API contracts](docs/api-contracts.md)
- [Release and quality gates](docs/release.md)
## Features
- Character-based splitters:
  - `CharacterTextSplitter`
  - `RecursiveCharacterTextSplitter`
- Sentence-aware splitters:
  - `SentenceChunker`
- Semchunk-inspired recursive splitter:
  - `SemchunkSplitter`
- Embedding-boundary splitter:
  - `SemanticChunker`
- Structure-aware splitters:
  - `MarkdownChunker`
  - `HtmlChunker`
  - `CodeChunker` (behind the `code` feature)
- Pluggable length function (`LengthFn`) for character, word, or tokenizer-based sizing.
- Pluggable embedding function (`EmbeddingFn`) for semantic chunking.
- Zero-copy structured chunks with byte and character offsets.
## Splitters
### Choosing A Chunker
Use `SemchunkSplitter` as the general-purpose default for natural language and
mixed prose: it applies a punctuation-aware delimiter hierarchy before falling
back to smaller units. Beyond that:
- `RecursiveCharacterTextSplitter` when you want LangChain-style separator behavior.
- `SentenceChunker` when sentence boundaries are mandatory and embeddings are not available.
- `SemanticChunker` when a domain-relevant embedder can identify topic shifts; it is only as good as the embedding signal and falls back to sentence-sized packing when no embedder is configured. With fallible embedders, use `try_split_chunks` so provider failures are returned as `ChunkError` instead of panicking through the infallible convenience API.
- `MarkdownChunker` or `HtmlChunker` when the input format carries useful block structure.
- `CodeChunker` when source code should be split by parser-recognized Rust or Python AST nodes.
### CharacterTextSplitter
LangChain-style single-separator splitting followed by merging with overlap.
Best when you want simple, predictable chunk boundaries.
### RecursiveCharacterTextSplitter
LangChain-style recursive fallback over separators (`\n\n` -> `\n` -> ` ` -> per-character fallback).
A solid general-purpose choice when text structure varies.
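The fallback idea can be sketched in plain Rust without any julienne APIs. `split_recursive` and the 20-character limit below are illustrative only, and this simplification drops the separators instead of re-attaching and merging pieces with overlap as the real splitter does:

```rust
/// Illustrative recursion: try the coarsest separator first, then recurse
/// with finer separators on any piece that is still too large.
fn split_recursive(text: &str, separators: &[&str], max_len: usize) -> Vec<String> {
    if text.chars().count() <= max_len {
        return vec![text.to_string()];
    }
    match separators.split_first() {
        Some((sep, rest)) => text
            .split(*sep)
            .filter(|piece| !piece.is_empty())
            .flat_map(|piece| split_recursive(piece, rest, max_len))
            .collect(),
        // No separators left: fall back to fixed-size character slices.
        None => text
            .chars()
            .collect::<Vec<_>>()
            .chunks(max_len)
            .map(|cs| cs.iter().collect::<String>())
            .collect(),
    }
}

fn main() {
    let text = "Para one.\n\nPara two is a bit longer and gets split up.";
    for chunk in split_recursive(text, &["\n\n", "\n", " "], 20) {
        assert!(chunk.chars().count() <= 20);
    }
}
```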
### SentenceChunker
Sentence-aware chunking with sentence-preserving overlap backtracking.
Supports:
- `min_characters_per_sentence`
- `min_sentences_per_chunk`
- custom sentence delimiters
### SemchunkSplitter
Semchunk-inspired recursive splitter with punctuation-aware hierarchy.
Supports:
- adaptive merge (binary-search span fitting)
- optional memoization (`memoize`)
- stricter delimiter precedence mode (`strict_mode`)
- configurable `length_fn` (including tokenizers)
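The adaptive-merge idea, binary-searching for the largest span that still fits, can be sketched in plain Rust. `fit_prefix` and the word-count sizer are illustrative, not julienne APIs:

```rust
/// Illustrative binary search for the largest prefix of `items` whose joined
/// text still fits under `max_len` according to `len_fn`.
fn fit_prefix(items: &[&str], max_len: usize, len_fn: impl Fn(&str) -> usize) -> usize {
    let (mut lo, mut hi) = (0usize, items.len());
    while lo < hi {
        let mid = (lo + hi + 1) / 2;
        if len_fn(&items[..mid].join(" ")) <= max_len {
            lo = mid; // the first `mid` items fit; try a larger span
        } else {
            hi = mid - 1; // too big; shrink the candidate span
        }
    }
    lo
}

fn main() {
    let sentences = ["one", "two", "three"];
    // Word-count sizing: only two single-word items fit in a 2-word budget.
    let fitted = fit_prefix(&sentences, 2, |s| s.split_whitespace().count());
    assert_eq!(fitted, 2);
}
```

Binary search matters here because a tokenizer-backed `length_fn` can be expensive; it keeps the number of length measurements logarithmic in the span size.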
### SemanticChunker
Embedding-similarity boundary detection:
- sentence windows
- cosine similarity between adjacent windows
- Savitzky-Golay smoothing
- local minima boundary detection
- optional skip-window reconnection for tangential asides
Falls back to sentence-based greedy splitting if no embedder is configured.
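The boundary-detection core can be sketched without any julienne types. `cosine`, `local_minima`, and the toy two-dimensional embeddings below are illustrative only; the real chunker also smooths the similarity curve before picking minima:

```rust
/// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Indices where similarity dips below both neighbors: boundary candidates.
fn local_minima(sims: &[f32]) -> Vec<usize> {
    (1..sims.len().saturating_sub(1))
        .filter(|&i| sims[i] < sims[i - 1] && sims[i] < sims[i + 1])
        .collect()
}

fn main() {
    // Toy 2-d embeddings for consecutive sentence windows; the similarity
    // dip between windows 1 and 2 marks a topic shift.
    let windows = [
        vec![1.0, 0.1],
        vec![0.9, 0.2],
        vec![0.1, 1.0],
        vec![0.2, 0.9],
        vec![0.1, 1.0],
    ];
    let sims: Vec<f32> = windows.windows(2).map(|w| cosine(&w[0], &w[1])).collect();
    assert_eq!(local_minima(&sims), vec![1]);
}
```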
## Quick Start
```rust
use julienne::{RecursiveCharacterTextSplitter, SentenceChunker, SemchunkSplitter, SemanticChunker};
let text = "Hello world. This is a document with multiple sentences.";
let recursive = RecursiveCharacterTextSplitter::new(500, 50);
let recursive_chunks = recursive.split_text(text);
let sentence = SentenceChunker::new(500, 50);
let sentence_chunks = sentence.split_text(text);
let semchunk = SemchunkSplitter::new(500, 50);
let semchunk_chunks = semchunk.split_text(text);
let semantic = SemanticChunker::new(500, 50);
let semantic_chunks = semantic.split_text(text);
```
For a fuller walkthrough, see [Getting started](docs/getting-started.md).
## Structured Chunks And Offsets
Use `split_chunks` when downstream code needs provenance:
```rust
use julienne::{RecursiveCharacterTextSplitter, TextChunk};
let text = "Intro.\n\nDetails with café.";
let splitter = RecursiveCharacterTextSplitter::new(80, 10);
let chunks: Vec<TextChunk<'_>> = splitter.split_chunks(text);
for chunk in chunks {
    assert_eq!(&text[chunk.start_byte..chunk.end_byte], chunk.text);
}
```
`TextChunk` contains `text`, `start_byte`, `end_byte`, `start_char`, `end_char`,
`measured_length`, and optional metadata. The byte range always indexes the
original input passed to the splitter. The character offsets are counted from
the start of that same input.
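A stdlib-only illustration of why both offset kinds exist: multi-byte characters make byte and character counts diverge, and slicing the source requires byte offsets that fall on character boundaries:

```rust
fn main() {
    let text = "café!";
    // "é" is one char but two UTF-8 bytes, so byte and char offsets diverge.
    assert_eq!(text.chars().count(), 5);
    assert_eq!(text.len(), 6);
    // Slicing the source must use byte offsets on char boundaries.
    let end_byte = "café".len(); // 5 bytes
    assert!(text.is_char_boundary(end_byte));
    assert_eq!(&text[0..end_byte], "café");
}
```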
`split_text` is the owned-string convenience API, and `split_chunks` collects
structured chunks. `chunks` is the iterator-style API on infallible splitters,
exposing structured output without changing the caller-facing contract.
## Token-Aware Length Example
```rust
use julienne::SemchunkSplitter;
fn word_len(s: &str) -> usize {
    s.split_whitespace().count()
}
let splitter = SemchunkSplitter::builder()
    .chunk_size(90)
    .chunk_overlap(15)
    .length_fn(std::sync::Arc::new(word_len))
    .build()
    .unwrap();
let chunks = splitter.split_text("Some longer text...");
```
## Typed Sizing And Token Windows
```rust
use julienne::{ChunkConfig, ChunkSizer, WordSizer, TokenBoundaryProvider, TokenChunker, TokenSpan};
let word_config = ChunkConfig::new(90, 15, WordSizer);
assert_eq!(word_config.sizer.size("one two three"), 3);
#[derive(Clone)]
struct WhitespaceTokens;
impl TokenBoundaryProvider for WhitespaceTokens {
    fn token_spans(&self, input: &str) -> Result<Vec<TokenSpan>, julienne::ChunkError> {
        let mut spans = Vec::new();
        let mut start = None;
        for (idx, ch) in input.char_indices() {
            if ch.is_whitespace() {
                if let Some(s) = start.take() {
                    spans.push(TokenSpan { start_byte: s, end_byte: idx });
                }
            } else if start.is_none() {
                start = Some(idx);
            }
        }
        if let Some(s) = start {
            spans.push(TokenSpan { start_byte: s, end_byte: input.len() });
        }
        Ok(spans)
    }
}
let token_chunks = TokenChunker::new(WhitespaceTokens, 3, 1)
    .unwrap()
    .try_split_text("one two three four five")
    .unwrap();
assert_eq!(token_chunks, vec!["one two three", "three four five"]);
```
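The expected output above follows from a sliding window whose stride is `chunk_size - chunk_overlap`. A stdlib-only sketch of that windowing (`window_tokens` is illustrative; the real `TokenChunker` slices the original text by token byte spans rather than rejoining with spaces):

```rust
/// Slide a window of `size` tokens, sharing `overlap` tokens between
/// consecutive windows, rejoining each window with single spaces.
fn window_tokens(tokens: &[&str], size: usize, overlap: usize) -> Vec<String> {
    assert!(size > overlap, "stride must be positive");
    let stride = size - overlap;
    let mut out = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + size).min(tokens.len());
        out.push(tokens[start..end].join(" "));
        if end == tokens.len() {
            break;
        }
        start += stride;
    }
    out
}

fn main() {
    let tokens: Vec<&str> = "one two three four five".split_whitespace().collect();
    assert_eq!(
        window_tokens(&tokens, 3, 1),
        vec!["one two three".to_string(), "three four five".to_string()]
    );
}
```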
Optional tokenizer integrations are feature gated:
- `tiktoken-rs` enables `julienne::token::tiktoken::TiktokenBoundaryProvider`.
- `tokenizers` enables `julienne::token::huggingface::HuggingFaceBoundaryProvider`.
- `unicode-segmentation` enables `GraphemeSizer` and `UnicodeWordSizer`.
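Why sizing is pluggable at all: the same string measures differently depending on the unit. A stdlib-only illustration (grapheme-cluster counting needs the `unicode-segmentation` feature and is omitted here):

```rust
fn main() {
    let s = "naïve café";
    assert_eq!(s.len(), 12);                     // UTF-8 bytes
    assert_eq!(s.chars().count(), 10);           // Unicode scalar values
    assert_eq!(s.split_whitespace().count(), 2); // whitespace-delimited words
}
```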
```rust
#[cfg(feature = "tiktoken-rs")]
{
    use julienne::token::tiktoken::TiktokenBoundaryProvider;
    use julienne::TokenChunker;
    let provider = TiktokenBoundaryProvider::new(tiktoken_rs::cl100k_base().unwrap());
    let chunks = TokenChunker::new(provider, 128, 16)
        .unwrap()
        .try_split_text("Token-sized chunks using tiktoken.")
        .unwrap();
}
```
```rust
#[cfg(feature = "tokenizers")]
{
    use julienne::token::huggingface::HuggingFaceBoundaryProvider;
    use julienne::TokenChunker;
    use tokenizers::Tokenizer;
    let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();
    let provider = HuggingFaceBoundaryProvider::new(tokenizer);
    let chunks = TokenChunker::new(provider, 128, 16)
        .unwrap()
        .try_split_text("Token-sized chunks using Hugging Face tokenizers.")
        .unwrap();
}
```
## Structure-Aware Chunking
```rust
use julienne::{MarkdownChunker, HtmlChunker};
let markdown = MarkdownChunker::new(500, 50)
    .unwrap()
    .split_text("# Title\n\nA paragraph.\n\n```rust\nfn main() {}\n```");
let html = HtmlChunker::new(500, 50)
    .unwrap()
    .split_text("<section><h1>Title</h1><p>Body</p></section>");
```
`MarkdownChunker` preserves headings, paragraphs, lists, and fenced code blocks.
`HtmlChunker` works on already-extracted HTML/XML strings and uses block-level
tag boundaries; it does not fetch pages, sanitize markup, remove boilerplate, or
perform readability extraction.
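The block-boundary idea can be sketched without julienne: split on blank lines, but never inside a fenced code block. `split_blocks` is illustrative only and far simpler than the real chunker (no sizing, no overlap, no list handling):

```rust
/// Split Markdown-ish text into blocks at blank lines, but never inside a
/// fenced code block, so fences stay in one piece.
fn split_blocks(text: &str) -> Vec<String> {
    let mut blocks = Vec::new();
    let mut current = String::new();
    let mut in_fence = false;
    for line in text.lines() {
        if line.trim_start().starts_with("```") {
            in_fence = !in_fence;
        }
        if line.trim().is_empty() && !in_fence {
            if !current.is_empty() {
                blocks.push(current.trim_end().to_string());
                current.clear();
            }
        } else {
            current.push_str(line);
            current.push('\n');
        }
    }
    if !current.trim().is_empty() {
        blocks.push(current.trim_end().to_string());
    }
    blocks
}

fn main() {
    let md = "# Title\n\nA paragraph.\n\n```rust\nfn main() {}\n```";
    let blocks = split_blocks(md);
    assert_eq!(blocks.len(), 3);
    assert_eq!(blocks[0], "# Title");
    assert!(blocks[2].starts_with("```rust"));
}
```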
With `--features code`, `CodeChunker` uses tree-sitter parsers for Rust and
Python and returns explicit `ChunkError` values for parser failures or oversized
semantic nodes.
## SemanticChunker Example
```rust
use julienne::SemanticChunker;
fn simple_embedding(text: &str) -> Vec<f32> {
    let lower = text.to_lowercase();
    let a = ["sql", "table", "vectorizer"].iter().map(|k| lower.matches(k).count() as f32).sum::<f32>();
    let b = ["weather", "rain", "forecast"].iter().map(|k| lower.matches(k).count() as f32).sum::<f32>();
    vec![a, b]
}
let chunker = SemanticChunker::builder()
    .chunk_size(500)
    .chunk_overlap(50)
    .window_size(3)
    .skip_window(1)
    .reconnect_similarity_threshold(0.75)
    .max_aside_length(512)
    .embedding_fn(std::sync::Arc::new(simple_embedding))
    .build()
    .unwrap();
let chunks = chunker.split_text("...");
```
## Testing
```bash
cargo test
```
Integration tests use fixtures under:
- `examples/summarize_article.sql`
- `examples/embeddings_from_documents/documents/pgai.md`
## Benchmarks
```bash
cargo bench --bench splitters_bench
```
Code chunker benchmarks are feature-gated:
```bash
cargo bench --bench splitters_bench --features code -- splitters_code
```
The benchmark compares all splitter strategies on markdown and SQL fixtures,
including word-count and `tiktoken` length profiles for semchunk.
## API Contracts
Structured chunk APIs return `TextChunk` with stable source offsets. Public
splitter types and builders are `Send + Sync + Clone` when their configured
handles are. Infallible splitters expose `split_text`, `split_chunks`, and
iterator-style `chunks` where available. Fallible integrations use `try_*`
methods and return `ChunkError`; this includes batch embedding and tree-sitter
code parsing.
## Scope And Non-Goals
This crate is a chunking library. Compared with Chonkie, the focus is Rust-native
range-preserving chunks, explicit fallibility, and configurable splitter
strategies rather than a Python-first experimentation surface. Compared with
Chunkr, this crate does not try to be document intelligence infrastructure: no
OCR, hosted API, document loading service, layout extraction, vector database
handshake, or ingestion pipeline is included. Bring already-extracted text,
Markdown, HTML/XML, or source code strings; the crate returns chunks.