ChunkNorris
A simple and efficient Rust library for splitting large text into smaller batches based on character count. This is particularly useful when working with large language models (LLMs) that have input size limitations.
Features
- CharCountBatcher: Splits text into batches of a specified maximum character length.
- Easy to Use: Simple and intuitive API.
- Lightweight: No external dependencies for the core CharCountBatcher.
- Extensible: Designed to accommodate other batching strategies in the future (e.g., sentence-based, semantic unit-based).
- Well-Tested: Includes unit tests to ensure correctness.
Installation
Add chunk_norris to your Cargo.toml:
[dependencies]
chunk_norris = "0.1.0" # Replace with the latest version
Usage
CharCountBatcher
use text_batcher::{CharCountBatcher, TextBatch};
Output
Batch 1: This is an example text.
Batch 2: It will be split into
Batch 3: smaller batches.
SentenceBatcher
use ;
Output
Batch 1: This is an example text.
Batch 2: It will be split into
Batch 3: smaller batches.
Explanation
- Import: use text_batcher::{CharCountBatcher, TextBatch}; brings the necessary structs into scope.
- Create a Batcher: CharCountBatcher::new(25) creates a new CharCountBatcher instance with a maximum batch size of 25 characters.
- Batch the Text: batcher.create_batches(text) splits the input text into a Vec<TextBatch>.
- Iterate and Process: The code then iterates through the batches vector and prints the content of each TextBatch.
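The steps above can be sketched end to end. This is a minimal, self-contained illustration and not the library's actual source: the `content` field on `TextBatch`, the rule of breaking only at whitespace, and the `render` helper are all assumptions, chosen so that a 25-character limit reproduces the three batches listed in the Output section.

```rust
/// Hypothetical stand-in for the library's `TextBatch`; the real struct may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct TextBatch {
    pub content: String,
}

/// Hypothetical stand-in for the library's `CharCountBatcher`.
pub struct CharCountBatcher {
    max_chars: usize,
}

impl CharCountBatcher {
    pub fn new(max_chars: usize) -> Self {
        Self { max_chars }
    }

    /// Greedily packs whitespace-separated words into batches of at most
    /// `max_chars` characters (Unicode scalar values, not bytes).
    /// Assumption: a single word longer than the limit becomes its own
    /// oversized batch rather than being cut in the middle.
    pub fn create_batches(&self, text: &str) -> Vec<TextBatch> {
        let mut batches = Vec::new();
        let mut current = String::new();
        let mut current_len = 0usize;

        for word in text.split_whitespace() {
            let word_len = word.chars().count();
            // The extra 1 is the space that would join the word to the batch.
            let needed = if current.is_empty() {
                word_len
            } else {
                current_len + 1 + word_len
            };
            // Close the current batch if this word would overflow it.
            if needed > self.max_chars && !current.is_empty() {
                batches.push(TextBatch {
                    content: std::mem::take(&mut current),
                });
                current_len = 0;
            }
            if !current.is_empty() {
                current.push(' ');
                current_len += 1;
            }
            current.push_str(word);
            current_len += word_len;
        }
        // Flush whatever is left after the last word.
        if !current.is_empty() {
            batches.push(TextBatch { content: current });
        }
        batches
    }
}

/// Formats batches as numbered lines, e.g. "Batch 1: ...".
pub fn render(batches: &[TextBatch]) -> Vec<String> {
    batches
        .iter()
        .enumerate()
        .map(|(i, b)| format!("Batch {}: {}", i + 1, b.content))
        .collect()
}
```

Assuming the input text is "This is an example text. It will be split into smaller batches.", calling `render(&CharCountBatcher::new(25).create_batches(text))` from `main` and printing each line yields the three batches shown under Output. Lengths are counted with `chars().count()` rather than `len()`, so multi-byte characters count once each.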
Advanced Usage (Future Extensions)
The library is designed to be extensible. Although the current version only provides CharCountBatcher, you can implement the BatchingStrategy trait to create custom batching logic:
use ;
// Example: A hypothetical SentenceBatcher (not yet implemented in the library)
You could then use your custom batcher similarly to the CharCountBatcher.
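As a rough illustration of what a custom strategy might look like, the sketch below defines its own `BatchingStrategy` trait and `TextBatch` struct so that it compiles on its own. The trait's method signature, the `content` field, and the `batch_with` helper are assumptions modeled on the `create_batches` call described earlier; the library's real definitions may differ.

```rust
/// Assumed shape of the library's `TextBatch`; the real struct may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct TextBatch {
    pub content: String,
}

/// Assumed shape of the `BatchingStrategy` trait. The method name mirrors
/// `create_batches` from the usage section; the real signature may differ.
pub trait BatchingStrategy {
    fn create_batches(&self, text: &str) -> Vec<TextBatch>;
}

/// Hypothetical custom strategy: one batch per sentence.
pub struct SentenceBatcher;

impl BatchingStrategy for SentenceBatcher {
    fn create_batches(&self, text: &str) -> Vec<TextBatch> {
        // `split_inclusive` keeps the terminator at the end of each piece.
        text.split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .map(|s| TextBatch {
                content: s.to_string(),
            })
            .collect()
    }
}

/// Code written against the trait works with any strategy.
pub fn batch_with(strategy: &dyn BatchingStrategy, text: &str) -> Vec<TextBatch> {
    strategy.create_batches(text)
}
```

This splitter is deliberately naive: it treats every `.`, `!`, or `?` as a sentence end, so abbreviations such as "e.g." would be split incorrectly. A production strategy would need a more careful sentence boundary rule.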
Contributing
Contributions are welcome! If you'd like to add new batching strategies, improve the existing code, or fix any issues, please feel free to open an issue or submit a pull request.
License
This project is licensed under either the MIT License or the Apache License, Version 2.0, at your option.