chunk 0.9.2

The fastest semantic text chunking library — up to 1TB/s chunking throughput
Documentation
<p align="center">
  <img src="assets/memchunk_wide.png" alt="chunk" width="500">
</p>

<h1 align="center">chunk</h1>

<p align="center">
  <em>the fastest text chunking library — up to 1 TB/s throughput</em>
</p>

<p align="center">
  <a href="https://crates.io/crates/chunk"><img src="https://img.shields.io/crates/v/chunk.svg?color=e74c3c" alt="crates.io"></a>
  <a href="https://pypi.org/project/chonkie-core"><img src="https://img.shields.io/pypi/v/chonkie-core.svg?color=e67e22" alt="PyPI"></a>
  <a href="https://www.npmjs.com/package/@chonkiejs/chunk"><img src="https://img.shields.io/npm/v/@chonkiejs/chunk.svg?color=2ecc71" alt="npm"></a>
  <a href="https://docs.rs/chunk"><img src="https://img.shields.io/docsrs/chunk?color=3498db" alt="docs.rs"></a>
  <a href="LICENSE-MIT"><img src="https://img.shields.io/badge/license-MIT%2FApache--2.0-9b59b6.svg" alt="License"></a>
</p>

---

you know how every chunking library claims to be fast? yeah, we actually meant it.

**chunk** splits text at semantic boundaries (periods, newlines, the usual suspects) and does it stupid fast. we're talking "chunk the entire english wikipedia in 120ms" fast.

want to know how? [read the blog post](https://minha.sh/posts/so,-you-want-to-chunk-really-fast) where we nerd out about SIMD instructions and lookup tables.

<p align="center">
  <img src="assets/benchmark.png" alt="Benchmark comparison" width="700">
</p>

<p align="center">
  <em>See <a href="benches/">benches/</a> for detailed benchmarks.</em>
</p>

## 📦 Installation

```bash
cargo add chunk
```

looking for [python](https://github.com/chonkie-inc/chunk/tree/main/packages/python) or [javascript](https://github.com/chonkie-inc/chunk/tree/main/packages/wasm)?

## 🚀 Usage

```rust
use chunk::chunk;

let text = b"Hello world. How are you? I'm fine.\nThanks for asking.";

// With defaults (4KB chunks, split at \n . ?)
let chunks: Vec<&[u8]> = chunk(text).collect();

// With custom size
let chunks: Vec<&[u8]> = chunk(text).size(1024).collect();

// With custom delimiters
let chunks: Vec<&[u8]> = chunk(text).delimiters(b"\n.?!").collect();

// With multi-byte pattern (e.g., metaspace ▁ for SentencePiece tokenizers)
let metaspace = "▁".as_bytes();
let chunks: Vec<&[u8]> = chunk(text).pattern(metaspace).prefix().collect();

// With consecutive pattern handling (split at START of runs, not middle)
let chunks: Vec<&[u8]> = chunk(b"word   next")
    .pattern(b" ")
    .consecutive()
    .collect();

// With forward fallback (search forward if no pattern in backward window)
let chunks: Vec<&[u8]> = chunk(text)
    .pattern(b" ")
    .forward_fallback()
    .collect();
```

## 📝 Citation

If you use chunk in your research, please cite it as follows:

```bibtex
@software{chunk2025,
  author = {Minhas, Bhavnick},
  title = {chunk: The fastest text chunking library},
  year = {2025},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/chonkie-inc/chunk}},
}
```

## 📄 License

Licensed under either of [Apache License, Version 2.0](LICENSE-APACHE) or [MIT license](LICENSE-MIT) at your option.