julienne 0.1.0

Range-preserving Rust text chunkers for retrieval and embedding pipelines
# Semantic Chunking

`SemanticChunker` uses embeddings to find topic-shift boundaries. It is useful
when paragraph or sentence breaks do not align with the conceptual sections you
want to retrieve.

## How It Works

The chunker:

1. Splits the input into sentence-like units.
2. Builds sliding windows over those units.
3. Embeds those windows.
4. Computes the cosine similarity between adjacent window embeddings.
5. Smooths the similarity curve.
6. Chooses local minima of the smoothed curve as candidate topic boundaries.
7. Packs the resulting semantic sections into chunks.
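
Steps 4 through 6 can be sketched with a few standalone functions. This is a minimal illustration, not julienne's implementation: the function names (`adjacent_similarities`, `smooth`, `local_minima`), the moving-average smoother, and the strict-minimum rule are all assumptions chosen for clarity.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Step 4 (sketch): similarity between each window embedding and the next.
fn adjacent_similarities(embeddings: &[Vec<f32>]) -> Vec<f32> {
    embeddings
        .windows(2)
        .map(|pair| cosine_similarity(&pair[0], &pair[1]))
        .collect()
}

/// Step 5 (sketch): centered moving average; edges use a truncated window.
fn smooth(values: &[f32], radius: usize) -> Vec<f32> {
    (0..values.len())
        .map(|i| {
            let lo = i.saturating_sub(radius);
            let hi = (i + radius + 1).min(values.len());
            values[lo..hi].iter().sum::<f32>() / (hi - lo) as f32
        })
        .collect()
}

/// Step 6 (sketch): indices of strict interior local minima.
fn local_minima(values: &[f32]) -> Vec<usize> {
    (1..values.len().saturating_sub(1))
        .filter(|&i| values[i] < values[i - 1] && values[i] < values[i + 1])
        .collect()
}
```

In this sketch, a minimum at index `i` of the smoothed curve suggests a candidate boundary between window `i` and window `i + 1`. Thresholds and minimum section lengths are omitted here.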

## Configure An Embedder

For production integrations, prefer a batch `Embedder`.

```rust
use julienne::{ChunkError, SemanticChunker};

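// Toy batch embedder for illustration: maps each input to a one-dimensional
// vector holding its byte length. Replace with a real embedding provider.
let embedder = std::sync::Arc::new(
    |inputs: &[&str]| -> Result<Vec<Vec<f32>>, ChunkError> {
        Ok(inputs.iter().map(|input| vec![input.len() as f32]).collect())
    },
);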

let chunker = SemanticChunker::builder()
    .chunk_size(500)
    .chunk_overlap(50)
    .embedder(embedder)
    .build()
    .unwrap();

let chunks = chunker.try_split_text("First topic. Second topic.").unwrap();
```

`embedding_fn` remains available as a convenience adapter for simple local
embedders:

```rust
use julienne::SemanticChunker;

let chunker = SemanticChunker::builder()
    .embedding_fn(std::sync::Arc::new(|text: &str| vec![text.len() as f32]))
    .build()
    .unwrap();
```
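
To show how a per-text function relates to the batch shape, the sketch below wraps a single-input closure as a batch closure. It is a hypothetical illustration and does not reflect how julienne implements `embedding_fn`: `batch_from_single` and `EmbedError` are made-up names, and `EmbedError` merely stands in for an error type such as `ChunkError`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in error type; this is not julienne's `ChunkError`.
#[derive(Debug, PartialEq)]
struct EmbedError(String);

/// Wraps a per-text embedding function as a batch function.
/// The wrapped function cannot fail, so the batch result is always `Ok`.
fn batch_from_single(
    single: Arc<dyn Fn(&str) -> Vec<f32> + Send + Sync>,
) -> impl Fn(&[&str]) -> Result<Vec<Vec<f32>>, EmbedError> {
    move |inputs: &[&str]| Ok(inputs.iter().map(|&text| single(text)).collect())
}
```

The wrapper calls the per-text function once per input, so it gains nothing from batching. A true batch embedder can send all inputs to the provider in one request.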

## Failure Behavior

Use `try_split_text` or `try_split_chunks` when embedding failures should be
returned as `ChunkError::EmbeddingFailure`. The infallible convenience methods
are intended for configurations whose local embedder cannot fail; they panic if
a configured fallible embedder returns an error.

If no embedder is configured, `SemanticChunker` falls back to sentence-based
packing. This fallback applies only to the no-embedder configuration. When an
embedder is configured, provider errors are not silently ignored.

## Limitations

Semantic chunking quality depends on the embedding signal. A generic or weak
embedder may miss domain boundaries or invent boundaries where the text only
changes vocabulary. Use semchunk or sentence chunking when deterministic
boundary rules matter more than topical similarity.