julienne 0.1.0

Range-preserving Rust text chunkers for retrieval and embedding pipelines
# Getting Started

Julienne cuts already-extracted text into chunks. It is useful when you need
stable input ranges for retrieval, embeddings, indexing, search snippets, or
context windows.

## Install

```toml
[dependencies]
julienne = "0.1"
```

Enable optional integrations only when you need them:

```toml
[dependencies]
julienne = { version = "0.1", features = ["unicode-segmentation", "tiktoken-rs"] }
```

## Basic Usage

`SemchunkSplitter` is the recommended general-purpose starting point for prose
and mixed natural-language text.

```rust
use julienne::SemchunkSplitter;

let input = "Intro paragraph. More detail follows. Final note.";
let splitter = SemchunkSplitter::new(120, 20);
let chunks = splitter.split_text(input);

assert!(!chunks.is_empty());
```

Use `split_text` when owned strings are enough. Use `split_chunks` when
downstream code needs each chunk's byte range in the original input.

```rust
use julienne::{RecursiveCharacterTextSplitter, TextChunk};

let input = "Intro.\n\nDetails with cafe.";
let splitter = RecursiveCharacterTextSplitter::new(80, 10);
let chunks: Vec<TextChunk<'_>> = splitter.split_chunks(input);

// Each chunk's byte range indexes back into `input` and yields the same text.
for chunk in chunks {
    assert_eq!(&input[chunk.start_byte..chunk.end_byte], chunk.text);
}
```
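
To see why byte ranges are worth keeping, here is a minimal sketch that uses only the standard library. `Span` is a hypothetical stand-in that mirrors the `start_byte`, `end_byte`, and `text` fields used above; it is not julienne's `TextChunk`. `paragraph_spans` and `char_offset` are likewise illustrative helpers, not part of the crate, and the blank-line split is deliberately naive rather than a description of how any julienne chunker works. The sketch shows that slicing the input by a stored byte range reproduces the chunk text even when the input contains multi-byte characters, and that a character offset can be derived when a UI needs one.

```rust
/// Hypothetical stand-in for a range-preserving chunk (not julienne's `TextChunk`).
#[derive(Debug, PartialEq)]
struct Span<'a> {
    start_byte: usize,
    end_byte: usize,
    text: &'a str,
}

/// Naive illustration: cut on blank lines ("\n\n") and record byte ranges.
fn paragraph_spans(input: &str) -> Vec<Span<'_>> {
    let mut spans = Vec::new();
    let mut start = 0;
    for part in input.split("\n\n") {
        let end = start + part.len();
        if !part.is_empty() {
            spans.push(Span { start_byte: start, end_byte: end, text: part });
        }
        start = end + 2; // skip the two-byte "\n\n" separator
    }
    spans
}

/// Convert a byte offset (which must lie on a char boundary) to a character offset.
fn char_offset(input: &str, byte_offset: usize) -> usize {
    input[..byte_offset].chars().count()
}

fn main() {
    // "é" occupies two bytes in UTF-8, so byte and character offsets diverge after it.
    let input = "Intro.\n\nDetails with café.\n\nEnd.";
    for span in paragraph_spans(input) {
        // The stored range slices back to exactly the chunk text.
        assert_eq!(&input[span.start_byte..span.end_byte], span.text);
        println!(
            "bytes {}..{} (char offset {}): {:?}",
            span.start_byte,
            span.end_byte,
            char_offset(input, span.start_byte),
            span.text
        );
    }
}
```

Storing byte offsets keeps slicing cheap and exact; character offsets are computed only where a consumer, such as a snippet highlighter, needs them.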

## Choose A Chunker

Start here:

- `SemchunkSplitter`: general-purpose prose and mixed text.
- `RecursiveCharacterTextSplitter`: predictable LangChain-style separator
  fallback.
- `SentenceChunker`: use when sentence boundaries matter more than the
  separator hierarchy.
- `SemanticChunker`: use when you have a domain-relevant embedder and want
  boundaries at topic shifts.
- `MarkdownChunker`: use when Markdown block structure matters.
- `HtmlChunker` / `XmlChunker`: use when already-extracted markup strings carry
  useful block structure.
- `CodeChunker`: use when Rust or Python source should be split by AST nodes.
- `TokenChunker`: use when fixed token windows are the desired unit.

See [Chunker guide](chunkers.md) for tradeoffs and examples.
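
If your pipeline knows which file each text came from, the list above can be read as a simple dispatch table. The sketch below is one possible reading, not part of julienne: `suggest_chunker` is a hypothetical helper, the extension mapping is an assumption, and the function returns the chunker's name only, without constructing anything from the crate.

```rust
use std::path::Path;

/// Hypothetical helper: map a file extension to the name of the chunker
/// suggested in the list above. It returns a name only and constructs nothing.
fn suggest_chunker(path: &str) -> &'static str {
    // Lowercase the extension so "page.HTML" and "page.html" are treated alike.
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());

    match ext.as_deref() {
        Some("md") | Some("markdown") => "MarkdownChunker",
        Some("html") | Some("htm") => "HtmlChunker",
        Some("xml") => "XmlChunker",
        Some("rs") | Some("py") => "CodeChunker",
        // Prose, mixed text, and anything unrecognized.
        _ => "SemchunkSplitter",
    }
}

fn main() {
    for path in ["notes.md", "page.HTML", "feed.xml", "lib.rs", "report.txt"] {
        println!("{path} -> {}", suggest_chunker(path));
    }
}
```

Extension-based dispatch covers only the structural chunkers. Choosing `SentenceChunker`, `SemanticChunker`, `TokenChunker`, or `RecursiveCharacterTextSplitter` depends on what downstream code needs, not on the file type.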

## Validate Locally

```bash
prek run --all-files
```

This command runs the project's canonical gate, which checks formatting,
feature combinations, tests, clippy, dependency policy, docs, and package
verification.