# Chunker Guide

Julienne exposes several chunking strategies because no single boundary model is
right for every input type.

## `CharacterTextSplitter`

Splits on one configured separator, then merges pieces up to `chunk_size` with
optional overlap.

Use it when boundaries are simple and predictable, for example newline-delimited
records or paragraphs already normalized by an upstream process.

```rust
use julienne::CharacterTextSplitter;

let splitter = CharacterTextSplitter::new("\n", 200, 20);
let chunks = splitter.split_text("alpha\nbeta\ngamma");
```
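
The split-then-merge idea can be sketched with the standard library alone. This is an illustration, not julienne's implementation: `split_and_merge` is a hypothetical helper, sizes are measured in bytes (an assumption), and overlap is carried over as whole trailing pieces.

```rust
/// Hypothetical helper, not part of julienne: split on `sep`, then greedily
/// merge pieces into chunks of at most `chunk_size` bytes, carrying up to
/// `overlap` bytes of trailing pieces into the next chunk.
/// A single piece longer than `chunk_size` is emitted whole.
fn split_and_merge(text: &str, sep: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current: Vec<&str> = Vec::new(); // pieces in the chunk being built
    let mut len = 0; // byte length of `current` once joined with `sep`

    for piece in text.split(sep).filter(|p| !p.is_empty()) {
        let joined = if current.is_empty() { piece.len() } else { len + sep.len() + piece.len() };
        if joined > chunk_size && !current.is_empty() {
            chunks.push(current.join(sep));
            // Drop leading pieces until the remainder fits the overlap budget
            // and still leaves room for the new piece.
            while !current.is_empty()
                && (len > overlap || len + sep.len() + piece.len() > chunk_size)
            {
                let removed = current.remove(0);
                len -= removed.len();
                if !current.is_empty() {
                    len -= sep.len();
                }
            }
        }
        len = if current.is_empty() { piece.len() } else { len + sep.len() + piece.len() };
        current.push(piece);
    }
    if !current.is_empty() {
        chunks.push(current.join(sep));
    }
    chunks
}
```

With a large `chunk_size` the whole input comes back as one chunk. With a small one, neighboring chunks share a trailing piece whenever it fits inside the overlap budget.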

## `RecursiveCharacterTextSplitter`

Tries a hierarchy of separators from coarse to fine, falling back to a finer
separator only when a piece is still too large.

Use it when you want LangChain-style behavior and the input may contain mixed
paragraph, line, and word boundaries.

```rust
use julienne::RecursiveCharacterTextSplitter;

let splitter = RecursiveCharacterTextSplitter::new(500, 50);
let chunks = splitter.split_text("First paragraph.\n\nSecond paragraph.");
```
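
The coarse-to-fine fallback can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not julienne's code: `recursive_split` is a hypothetical helper, sizes are counted in bytes, and the merge and overlap steps are left out so that only the separator hierarchy is visible.

```rust
/// Hypothetical helper, not julienne's implementation: split `text` with the
/// first separator, then recurse with the remaining (finer) separators on any
/// piece that is still longer than `chunk_size` bytes.
fn recursive_split(text: &str, separators: &[&str], chunk_size: usize) -> Vec<String> {
    if text.len() <= chunk_size {
        // Small enough already; skip empty pieces produced by adjacent separators.
        return if text.is_empty() { Vec::new() } else { vec![text.to_string()] };
    }
    let Some((sep, finer)) = separators.split_first() else {
        // No separators left: emit the oversized piece unchanged.
        return vec![text.to_string()];
    };
    text.split(*sep)
        .flat_map(|piece| recursive_split(piece, finer, chunk_size))
        .collect()
}
```

Calling it with the separators `["\n\n", "\n", " "]` keeps paragraphs intact when they fit and only drops down to word boundaries for pieces that do not.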

## `SentenceChunker`

Builds chunks from sentence-like units and preserves sentence boundaries where
possible.

Use it when cutting through a sentence would cost more than packing chunks
slightly differently from a separator-based splitter.

```rust
use julienne::SentenceChunker;

let splitter = SentenceChunker::new(300, 30);
let chunks = splitter.split_text("One sentence. Another sentence. A final sentence.");
```
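
Sentence-preserving packing can be sketched as follows. The sentence detection here is deliberately naive (it splits after `.`, `!`, or `?`), and both `sentences` and `pack_sentences` are hypothetical helpers rather than julienne APIs; sizes are counted in bytes and overlap is omitted.

```rust
/// Hypothetical helper: split after '.', '!' or '?' and trim whitespace.
/// Real sentence detection must also handle abbreviations, decimals, etc.
fn sentences(text: &str) -> Vec<&str> {
    text.split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}

/// Pack whole sentences into chunks of at most `chunk_size` bytes.
/// A sentence is never cut; one longer than `chunk_size` becomes its own chunk.
fn pack_sentences(text: &str, chunk_size: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for sentence in sentences(text) {
        // Start a new chunk if adding " " + sentence would exceed the limit.
        if !current.is_empty() && current.len() + 1 + sentence.len() > chunk_size {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(sentence);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

The trade-off is visible in the packing rule: a chunk may end well short of `chunk_size` because the next sentence would not fit whole.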

## `SemchunkSplitter`

Uses a punctuation-aware delimiter hierarchy and adaptive packing. This is the
recommended default for prose and mixed natural-language text.

```rust
use julienne::SemchunkSplitter;

let splitter = SemchunkSplitter::new(500, 50);
let chunks = splitter.split_text("A paragraph, with clauses; and useful punctuation.");
```

## `SemanticChunker`

Detects topic boundaries with embeddings. Configure a batch embedder for
production use. Without an embedder, it falls back to sentence-based packing.

```rust
use julienne::SemanticChunker;

let chunker = SemanticChunker::builder()
    .chunk_size(500)
    .chunk_overlap(50)
    // Toy embedder for illustration only; use a real embedder in production.
    .embedding_fn(std::sync::Arc::new(|text: &str| vec![text.len() as f32]))
    .build()
    .unwrap();

let chunks = chunker.split_text("Topic A. Topic B.");
```

Provider-backed embeddings can fail. Use `try_split_text` or
`try_split_chunks` when those failures should be returned as a `ChunkError`
rather than surfacing as a panic in a convenience API.
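
One common way to detect topic boundaries from embeddings is to compare adjacent sentence vectors and start a new chunk where similarity drops. The sketch below shows that technique in isolation; it is an assumption about the general approach, not a description of julienne's internals, and `cosine`, `topic_boundaries`, and the `threshold` parameter are hypothetical names.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Hypothetical helper: return each index `i` where a new chunk should start
/// because sentence `i` is less similar than `threshold` to sentence `i - 1`.
fn topic_boundaries(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    embeddings
        .windows(2)
        .enumerate()
        .filter(|(_, pair)| cosine(&pair[0], &pair[1]) < threshold)
        .map(|(i, _)| i + 1)
        .collect()
}
```

Given one embedding per sentence, the returned indices mark where one topic ends and the next begins; sentences between boundaries can then be packed up to the size limit.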

## Structure-Aware Chunkers

Use `MarkdownChunker`, `HtmlChunker`, `XmlChunker`, and the feature-gated
`CodeChunker` when syntax or document structure is a better boundary source than
plain punctuation.

See [Structure-aware chunking](structure-aware-chunking.md).

## `TokenChunker`

Uses an explicit token-boundary provider and returns fixed token windows with
overlap.

Use it when chunk size must be measured in tokens rather than in characters,
words, or semantic boundaries.
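
Fixed token windows with overlap reduce to a sliding window over a token sequence. The sketch below assumes the tokens have already been produced by some tokenizer; `token_windows` is a hypothetical helper, not julienne's `TokenChunker` API.

```rust
/// Hypothetical helper: return windows of up to `window` tokens where
/// consecutive windows share `overlap` tokens. The last window may be shorter.
fn token_windows<T: Clone>(tokens: &[T], window: usize, overlap: usize) -> Vec<Vec<T>> {
    assert!(window > 0 && overlap < window, "overlap must be smaller than window");
    let step = window - overlap; // how far each window advances
    let mut out = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + window).min(tokens.len());
        out.push(tokens[start..end].to_vec());
        if end == tokens.len() {
            break; // the final token has been covered
        }
        start += step;
    }
    out
}
```

Requiring `overlap < window` guarantees that each window advances by at least one token, so the loop always terminates.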

See [Sizing and token windows](sizing-and-tokens.md).