# Julienne

Julienne is a Rust library for cutting text into range-preserving chunks for
retrieval, embedding, indexing, search, and context-building pipelines.

It is deliberately a chunking library, not an ingestion framework. Bring strings
that you already extracted from documents, Markdown, HTML/XML, SQL, prose, or
source code; Julienne returns chunks with explicit boundaries and provenance.

## Documentation

- [Getting started](docs/getting-started.md)
- [Chunker guide](docs/chunkers.md)
- [Structured chunks and offsets](docs/structured-chunks.md)
- [Sizing and token windows](docs/sizing-and-tokens.md)
- [Semantic chunking](docs/semantic-chunking.md)
- [Structure-aware chunking](docs/structure-aware-chunking.md)
- [API contracts](docs/api-contracts.md)
- [Release and quality gates](docs/release.md)

## Features

- Character-based splitters:
  - `CharacterTextSplitter`
  - `RecursiveCharacterTextSplitter`
- Sentence-aware splitters:
  - `SentenceChunker`
- Semchunk-inspired recursive splitter:
  - `SemchunkSplitter`
- Embedding-boundary splitter:
  - `SemanticChunker`
- Structure-aware splitters:
  - `MarkdownChunker`
  - `HtmlChunker`
  - `CodeChunker` behind the `code` feature
- Pluggable length function (`LengthFn`) for character, word, or tokenizer-based sizing.
- Pluggable embedding function (`EmbeddingFn`) for semantic chunking.
- Zero-copy structured chunks with byte and character offsets.

## Splitters

### Choosing A Chunker

- `SemchunkSplitter`: the general-purpose default for natural language and mixed
  prose; it applies a punctuation-aware delimiter hierarchy before falling back
  to smaller units.
- `RecursiveCharacterTextSplitter`: when you want LangChain-style separator
  behavior.
- `SentenceChunker`: when sentence boundaries are mandatory and embeddings are
  not available.
- `SemanticChunker`: when a domain-relevant embedder can identify topic shifts.
  It is only as good as the embedding signal and falls back to sentence-sized
  packing when no embedder is configured. With fallible embedders, use
  `try_split_chunks` so provider failures are returned as `ChunkError` instead
  of panicking through the infallible convenience API.
- `MarkdownChunker` or `HtmlChunker`: when the input format carries useful block
  structure.
- `CodeChunker`: when source code should be split by parser-recognized Rust or
  Python AST nodes.

### CharacterTextSplitter

LangChain-style splitting on a single separator, followed by merging with overlap.

Best for simple and predictable chunk boundaries.
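The split-then-merge idea can be sketched in a few lines of plain Rust. This is
an illustration of the technique, not the crate's implementation; the function
name and exact merge policy are hypothetical.

```rust
/// Split on `sep`, then greedily merge pieces into chunks of at most
/// `chunk_size` characters, carrying `overlap` trailing characters of each
/// emitted chunk into the next one.
pub fn split_and_merge(text: &str, sep: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    let sep_len = sep.chars().count();
    let mut chunks = Vec::new();
    let mut current = String::new();
    for piece in text.split(sep) {
        let would_be = current.chars().count() + sep_len + piece.chars().count();
        if !current.is_empty() && would_be > chunk_size {
            // Keep the last `overlap` characters as the seed of the next chunk.
            let tail: String = current
                .chars()
                .rev()
                .take(overlap)
                .collect::<Vec<_>>()
                .into_iter()
                .rev()
                .collect();
            chunks.push(std::mem::replace(&mut current, tail));
        }
        if !current.is_empty() {
            current.push_str(sep);
        }
        current.push_str(piece);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    // With overlap 2, each chunk repeats the tail of its predecessor.
    let chunks = split_and_merge("aa bb cc dd", " ", 5, 2);
    assert_eq!(chunks, vec!["aa bb", "bb cc", "cc dd"]);
}
```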

### RecursiveCharacterTextSplitter

LangChain-style recursive fallback over separators (`\n\n` -> `\n` -> ` ` -> char fallback).

Best general-purpose default when text structure varies.

### SentenceChunker

Sentence-aware chunking with sentence-preserving overlap backtracking.

Supports:
- `min_characters_per_sentence`
- `min_sentences_per_chunk`
- custom sentence delimiters
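A minimal sketch of delimiter-based sentence segmentation with a minimum
sentence length, in the spirit of `min_characters_per_sentence`. The helper
below is hypothetical and much simpler than the chunker itself; it only shows
how short fragments get merged into their predecessor.

```rust
/// Split `text` at delimiter characters, merging any fragment shorter than
/// `min_chars` (ignoring surrounding whitespace) into the previous sentence.
pub fn split_sentences(text: &str, delims: &[char], min_chars: usize) -> Vec<String> {
    let mut sentences: Vec<String> = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if delims.contains(&ch) {
            if current.trim().chars().count() < min_chars {
                if let Some(prev) = sentences.last_mut() {
                    // Too short to stand alone: append to the previous sentence.
                    prev.push_str(&current);
                    current.clear();
                    continue;
                }
            }
            sentences.push(std::mem::take(&mut current));
        }
    }
    if !current.trim().is_empty() {
        sentences.push(current);
    }
    sentences
}

fn main() {
    let s = split_sentences("Hello world. Ok. This is fine.", &['.'], 5);
    // "Ok." is below the minimum, so it folds into the first sentence.
    assert_eq!(s, vec!["Hello world. Ok.", " This is fine."]);
}
```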

### SemchunkSplitter

Semchunk-inspired recursive splitter with punctuation-aware hierarchy.

Supports:
- adaptive merge (binary-search span fitting)
- optional memoization (`memoize`)
- stricter delimiter precedence mode (`strict_mode`)
- configurable `length_fn` (including tokenizers)
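The "binary-search span fitting" idea can be shown in isolation: instead of
growing a chunk piece by piece and re-measuring, binary-search the largest
prefix of pieces whose joined text still fits. This is an illustrative sketch
of the technique, not the crate's internals.

```rust
/// Binary-search the largest number of leading `pieces` whose joined text
/// measures at most `chunk_size` under `len_fn`. Fewer `len_fn` calls than
/// linear growth when measurement (e.g. tokenization) is expensive.
pub fn largest_fitting_prefix(
    pieces: &[&str],
    chunk_size: usize,
    len_fn: impl Fn(&str) -> usize,
) -> usize {
    let (mut lo, mut hi) = (0usize, pieces.len());
    while lo < hi {
        // Bias the midpoint upward so the loop terminates.
        let mid = lo + (hi - lo + 1) / 2;
        if len_fn(&pieces[..mid].join(" ")) <= chunk_size {
            lo = mid; // prefix of `mid` pieces fits; try more
        } else {
            hi = mid - 1; // too big; try fewer
        }
    }
    lo
}

fn main() {
    // With a word-count length function and a budget of 2 words,
    // only the first two pieces fit.
    let n = largest_fitting_prefix(&["one", "two", "three", "four"], 2, |s| {
        s.split_whitespace().count()
    });
    assert_eq!(n, 2);
}
```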

### SemanticChunker

Embedding-similarity boundary detection:
- sentence windows
- cosine similarity between adjacent windows
- Savitzky-Golay smoothing
- local minima boundary detection
- optional skip-window reconnection for tangential asides

Falls back to sentence-based greedy splitting if no embedder is configured.
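Two of the steps above are simple enough to sketch directly: cosine similarity
between adjacent window embeddings, and picking boundary candidates at local
minima of the similarity curve. These helpers are illustrative; the crate's own
pipeline adds smoothing and reconnection on top.

```rust
/// Cosine similarity between two embedding vectors.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Indices where similarity dips below both neighbors: candidate chunk
/// boundaries, since a dip suggests a topic shift between windows.
pub fn local_minima(sims: &[f32]) -> Vec<usize> {
    (1..sims.len().saturating_sub(1))
        .filter(|&i| sims[i] < sims[i - 1] && sims[i] < sims[i + 1])
        .collect()
}

fn main() {
    assert!((cosine(&[1.0, 0.0], &[1.0, 0.0]) - 1.0).abs() < 1e-6);
    // Dips at indices 1 and 3 mark candidate boundaries.
    assert_eq!(local_minima(&[0.9, 0.2, 0.8, 0.3, 0.7]), vec![1usize, 3]);
}
```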

## Quick Start

```rust
use julienne::{RecursiveCharacterTextSplitter, SentenceChunker, SemchunkSplitter, SemanticChunker};

let text = "Hello world. This is a document with multiple sentences.";

let recursive = RecursiveCharacterTextSplitter::new(500, 50);
let recursive_chunks = recursive.split_text(text);

let sentence = SentenceChunker::new(500, 50);
let sentence_chunks = sentence.split_text(text);

let semchunk = SemchunkSplitter::new(500, 50);
let semchunk_chunks = semchunk.split_text(text);

let semantic = SemanticChunker::new(500, 50);
let semantic_chunks = semantic.split_text(text);
```

For a fuller walkthrough, see [Getting started](docs/getting-started.md).

## Structured Chunks And Offsets

Use `split_chunks` when downstream code needs provenance:

```rust
use julienne::{RecursiveCharacterTextSplitter, TextChunk};

let text = "Intro.\n\nDetails with café.";
let splitter = RecursiveCharacterTextSplitter::new(80, 10);
let chunks: Vec<TextChunk<'_>> = splitter.split_chunks(text);

for chunk in chunks {
    assert_eq!(&text[chunk.start_byte..chunk.end_byte], chunk.text);
}
```

`TextChunk` contains `text`, `start_byte`, `end_byte`, `start_char`, `end_char`,
`measured_length`, and optional metadata. The byte range always indexes the
original input passed to the splitter. The character offsets are counted from
the start of that same input.
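Carrying both offset kinds matters because byte and character indexes diverge
as soon as the input contains multibyte UTF-8, as a quick standalone check
shows:

```rust
fn main() {
    // "café!" is 5 characters but 6 bytes: 'é' occupies 2 bytes in UTF-8.
    let text = "café!";
    assert_eq!(text.chars().count(), 5);
    assert_eq!(text.len(), 6);
    // Byte offsets slice the original input directly, as in the
    // `start_byte..end_byte` round-trip above.
    assert_eq!(&text[0..5], "café");
}
```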

`split_text` is the owned-string convenience API, and `split_chunks` collects
structured chunks. On infallible splitters, `chunks` is the iterator-style API:
it yields structured chunks lazily without changing the caller-facing contract.

## Token-Aware Length Example

```rust
use julienne::SemchunkSplitter;

fn word_len(s: &str) -> usize {
    s.split_whitespace().count()
}

let splitter = SemchunkSplitter::builder()
    .chunk_size(90)
    .chunk_overlap(15)
    .length_fn(std::sync::Arc::new(word_len))
    .build()
    .unwrap();

let chunks = splitter.split_text("Some longer text...");
```

## Typed Sizing And Token Windows

```rust
use julienne::{ChunkConfig, ChunkSizer, WordSizer, TokenBoundaryProvider, TokenChunker, TokenSpan};

let word_config = ChunkConfig::new(90, 15, WordSizer);
assert_eq!(word_config.sizer.size("one two three"), 3);

#[derive(Clone)]
struct WhitespaceTokens;

impl TokenBoundaryProvider for WhitespaceTokens {
    fn token_spans(&self, input: &str) -> Result<Vec<TokenSpan>, julienne::ChunkError> {
        let mut spans = Vec::new();
        let mut start = None;
        for (idx, ch) in input.char_indices() {
            if ch.is_whitespace() {
                if let Some(s) = start.take() {
                    spans.push(TokenSpan { start_byte: s, end_byte: idx });
                }
            } else if start.is_none() {
                start = Some(idx);
            }
        }
        if let Some(s) = start {
            spans.push(TokenSpan { start_byte: s, end_byte: input.len() });
        }
        Ok(spans)
    }
}

let token_chunks = TokenChunker::new(WhitespaceTokens, 3, 1)
    .unwrap()
    .try_split_text("one two three four five")
    .unwrap();
assert_eq!(token_chunks, vec!["one two three", "three four five"]);
```

Optional tokenizer integrations are feature gated:

- `tiktoken-rs` enables `julienne::token::tiktoken::TiktokenBoundaryProvider`.
- `tokenizers` enables `julienne::token::huggingface::HuggingFaceBoundaryProvider`.
- `unicode-segmentation` enables `GraphemeSizer` and `UnicodeWordSizer`.

```rust
#[cfg(feature = "tiktoken-rs")]
{
    use julienne::token::tiktoken::TiktokenBoundaryProvider;
    use julienne::TokenChunker;

    let provider = TiktokenBoundaryProvider::new(tiktoken_rs::cl100k_base().unwrap());
    let chunks = TokenChunker::new(provider, 128, 16)
        .unwrap()
        .try_split_text("Token-sized chunks using tiktoken.")
        .unwrap();
}
```

```rust
#[cfg(feature = "tokenizers")]
{
    use julienne::token::huggingface::HuggingFaceBoundaryProvider;
    use julienne::TokenChunker;
    use tokenizers::Tokenizer;

    let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();
    let provider = HuggingFaceBoundaryProvider::new(tokenizer);
    let chunks = TokenChunker::new(provider, 128, 16)
        .unwrap()
        .try_split_text("Token-sized chunks using Hugging Face tokenizers.")
        .unwrap();
}
```

## Structure-Aware Chunking

```rust
use julienne::{MarkdownChunker, HtmlChunker};

let markdown = MarkdownChunker::new(500, 50)
    .unwrap()
    .split_text("# Title\n\nA paragraph.\n\n```rust\nfn main() {}\n```");

let html = HtmlChunker::new(500, 50)
    .unwrap()
    .split_text("<section><h1>Title</h1><p>Body</p></section>");
```

`MarkdownChunker` preserves headings, paragraphs, lists, and fenced code blocks.
`HtmlChunker` works on already-extracted HTML/XML strings and uses block-level
tag boundaries; it does not fetch pages, sanitize markup, remove boilerplate, or
perform readability extraction.

With `--features code`, `CodeChunker` uses tree-sitter parsers for Rust and
Python and returns explicit `ChunkError` values for parser failures or oversized
semantic nodes.

## SemanticChunker Example

```rust
use julienne::SemanticChunker;

fn simple_embedding(text: &str) -> Vec<f32> {
    let lower = text.to_lowercase();
    let a = ["sql", "table", "vectorizer"].iter().map(|k| lower.matches(k).count() as f32).sum::<f32>();
    let b = ["weather", "rain", "forecast"].iter().map(|k| lower.matches(k).count() as f32).sum::<f32>();
    vec![a, b]
}

let chunker = SemanticChunker::builder()
    .chunk_size(500)
    .chunk_overlap(50)
    .window_size(3)
    .skip_window(1)
    .reconnect_similarity_threshold(0.75)
    .max_aside_length(512)
    .embedding_fn(std::sync::Arc::new(simple_embedding))
    .build()
    .unwrap();

let chunks = chunker.split_text("...");
```

## Testing

```bash
cargo test
```

Integration tests use fixtures under:
- `examples/summarize_article.sql`
- `examples/embeddings_from_documents/documents/pgai.md`

## Benchmarks

```bash
cargo bench --bench splitters_bench
```

Code chunker benchmarks are feature-gated:

```bash
cargo bench --bench splitters_bench --features code -- splitters_code
```

The benchmark compares all splitter strategies on markdown and SQL fixtures,
including word-count and `tiktoken` length profiles for semchunk.

## API Contracts

Structured chunk APIs return `TextChunk` with stable source offsets. Public
splitter types and builders are `Send + Sync + Clone` when their configured
handles are. Infallible splitters expose `split_text`, `split_chunks`, and
iterator-style `chunks` where available. Fallible integrations use `try_*`
methods and return `ChunkError`; this includes batch embedding and tree-sitter
code parsing.

## Scope And Non-Goals

This crate is a chunking library. Compared with Chonkie, the focus is Rust-native
range-preserving chunks, explicit fallibility, and configurable splitter
strategies rather than a Python-first experimentation surface. Compared with
Chunkr, this crate does not try to be document intelligence infrastructure: it
includes no OCR, hosted API, document loading service, layout extraction, vector
database integration, or ingestion pipeline. Bring already-extracted text,
Markdown, HTML/XML, or source code strings; the crate returns chunks.