pub struct Chunker { /* private fields */ }
A struct for chunking texts into segments based on a maximum number of tokens per chunk and a token counter function.
§Fields
chunk_size - The maximum number of tokens that can be in a chunk.
token_counter - A function that counts the number of tokens in a string.
splitter - The Splitter instance used to split the text.
§Example
use semchunk_rs::Chunker;
// Token counter: counts spaces plus one, i.e. a simple whitespace word count.
let chunker = Chunker::new(4, Box::new(|s: &str| s.len() - s.replace(" ", "").len() + 1));
let text = "The quick brown fox jumps over the lazy dog.";
let chunks = chunker.chunk(text);
assert_eq!(chunks, vec!["The quick brown fox", "jumps over the lazy", "dog."]);
With rust_tokenizers:
use rust_tokenizers::tokenizer::{RobertaTokenizer, Tokenizer};
use semchunk_rs::Chunker;
let tokenizer = RobertaTokenizer::from_file("data/roberta-base-vocab.json", "data/roberta-base-merges.txt", false, false)
.expect("Error loading tokenizer");
let token_counter = Box::new(move |s: &str| {
tokenizer.tokenize(s).len()
});
let chunker = Chunker::new(10, token_counter);
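The resulting chunker is then used exactly like the word-count version above (a minimal sketch; the exact chunk boundaries depend on the tokenizer's vocabulary):
let text = "The quick brown fox jumps over the lazy dog.";
let chunks = chunker.chunk(text);
assert!(!chunks.is_empty());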
Implementations§
impl Chunker
pub fn merge_splits(&self, splits: &[&str], separator: &str) -> (usize, String)
Merges the first N splits into a chunk containing at most chunk_size tokens.
§Arguments
splits - A slice of string slices representing the splits to merge.
separator - The separator used to split the text.
§Returns
A tuple containing:
- The index at which merging stopped (exclusive).
- The merged text.
§Examples
use semchunk_rs::Chunker;
let chunker = Chunker::new(4, Box::new(|s: &str| s.len() - s.replace(" ", "").len() + 1));
let splits = vec!["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"];
let separator = " ";
let (split_idx, merged) = chunker.merge_splits(&splits, separator);
assert_eq!(split_idx, 4);
assert_eq!(merged, "The quick brown fox");
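The returned index can also be used to keep merging the remaining splits (a sketch continuing the example above, assuming merge_splits behaves the same way on the sub-slice):
let (next_idx, next_chunk) = chunker.merge_splits(&splits[split_idx..], separator);
assert_eq!(next_chunk, "jumps over the lazy");
assert_eq!(next_idx, 4);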
pub fn chunk(&self, text: &str) -> Vec<String>
Chunks the given text into segments based on the maximum number of tokens per chunk.
§Arguments
text - A string slice that holds the text to be chunked.
§Examples
use semchunk_rs::Chunker;
let chunker = Chunker::new(4, Box::new(|s: &str| s.len() - s.replace(" ", "").len() + 1));
let text = "The quick brown fox jumps over the lazy dog.";
let chunks = chunker.chunk(text);
assert_eq!(chunks, vec!["The quick brown fox", "jumps over the lazy", "dog."]);
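As a quick sanity check (a sketch reusing the same whitespace-based counter), every returned chunk stays within the configured chunk_size of 4:
let token_counter = |s: &str| s.len() - s.replace(" ", "").len() + 1;
assert!(chunks.iter().all(|c| token_counter(c.as_str()) <= 4));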
Auto Trait Implementations§
impl Freeze for Chunker
impl !RefUnwindSafe for Chunker
impl !Send for Chunker
impl !Sync for Chunker
impl Unpin for Chunker
impl !UnwindSafe for Chunker
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.