Struct Chunker

pub struct Chunker { /* private fields */ }

A struct for chunking texts into segments based on a maximum number of tokens per chunk and a token counter function.

§Fields

  • chunk_size - The maximum number of tokens that can be in a chunk.
  • token_counter - A function that counts the number of tokens in a string.
  • splitter - The Splitter instance used to split the text.

§Example

use semchunk_rs::Chunker;
let chunker = Chunker::new(4, Box::new(|s: &str| s.len() - s.replace(" ", "").len() + 1));
let text = "The quick brown fox jumps over the lazy dog.";
let chunks = chunker.chunk(text);
assert_eq!(chunks, vec!["The quick brown fox", "jumps over the lazy", "dog."]);
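The token counter in the example above counts spaces and adds one, which approximates a word count. The following is a minimal standalone sketch of that counter next to a whitespace-splitting alternative, using only the standard library. The function names are hypothetical and are not part of semchunk_rs; the two counters differ on empty input and on runs of spaces.

```rust
/// Hypothetical helper mirroring the closure in the example above:
/// number of ASCII spaces plus one.
fn count_spaces_plus_one(s: &str) -> usize {
    s.len() - s.replace(" ", "").len() + 1
}

/// Hypothetical alternative: number of whitespace-separated words.
/// Returns 0 for an empty string and ignores repeated whitespace.
fn count_words(s: &str) -> usize {
    s.split_whitespace().count()
}

/// Boxes a counter into the `Box<dyn Fn(&str) -> usize>` shape
/// that `Chunker::new` takes as its second argument.
fn boxed_word_counter() -> Box<dyn Fn(&str) -> usize> {
    Box::new(count_words)
}
```

Both counters return 4 for "The quick brown fox". For an empty string, `count_spaces_plus_one` returns 1 while `count_words` returns 0.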

With rust_tokenizers:

use rust_tokenizers::tokenizer::{RobertaTokenizer, Tokenizer};
use semchunk_rs::Chunker;
let tokenizer = RobertaTokenizer::from_file(
    "data/roberta-base-vocab.json",
    "data/roberta-base-merges.txt",
    false,
    false,
)
.expect("Error loading tokenizer");
let token_counter = Box::new(move |s: &str| tokenizer.tokenize(s).len());
let chunker = Chunker::new(10, token_counter);

Implementations§

impl Chunker

pub fn new(chunk_size: usize, token_counter: Box<dyn Fn(&str) -> usize>) -> Self

Creates a new Chunker instance that uses the default Splitter instance.

§Arguments
  • chunk_size - The maximum number of tokens that can be in a chunk.
  • token_counter - A function that counts the number of tokens in a string.
§Returns

A new Chunker instance.
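Because token_counter is a boxed closure, it can capture configuration with `move`. As a hedged illustration, the sketch below builds a counter that estimates tokens from character count. The function name `make_char_counter` and the characters-per-token heuristic are assumptions for illustration only; they do not come from semchunk_rs, and a real tokenizer will give different counts.

```rust
/// Hypothetical factory: returns a counter that estimates the token count
/// as ceil(chars / chars_per_token). This is a rough heuristic, not a tokenizer.
fn make_char_counter(chars_per_token: usize) -> Box<dyn Fn(&str) -> usize> {
    // Guard against division by zero.
    let per_token = chars_per_token.max(1);
    Box::new(move |s: &str| {
        let chars = s.chars().count();
        // Ceiling division without floating point.
        (chars + per_token - 1) / per_token
    })
}
```

The returned box has the type that `new` expects for token_counter, so a value such as `make_char_counter(4)` could be passed as the second argument.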

pub fn splitter(self, splitter: Splitter) -> Self

Sets the splitter for the Chunker instance.
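The signature takes `self` by value and returns `Self`, which is the usual shape of a consuming builder method that can be chained after a constructor. The sketch below shows that pattern in isolation. `MiniChunker` and `MiniSplitter` are hypothetical stand-ins with invented fields; they say nothing about how Splitter is constructed in semchunk_rs.

```rust
/// Hypothetical stand-in for a splitter; not the semchunk_rs type.
#[derive(Debug, Default, PartialEq)]
struct MiniSplitter {
    separators: Vec<String>,
}

/// Hypothetical stand-in for a chunker; not the semchunk_rs type.
struct MiniChunker {
    chunk_size: usize,
    splitter: MiniSplitter,
}

impl MiniChunker {
    /// Starts with a default splitter.
    fn new(chunk_size: usize) -> Self {
        Self { chunk_size, splitter: MiniSplitter::default() }
    }

    /// Consumes the value, swaps in the given splitter, and returns the
    /// updated value so the call can be chained after `new`.
    fn splitter(mut self, splitter: MiniSplitter) -> Self {
        self.splitter = splitter;
        self
    }
}
```

With these stand-ins, `MiniChunker::new(4).splitter(custom)` yields a value whose splitter is `custom` and whose chunk size is still 4.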

pub fn _chunk(&self, text: &str, recursion_depth: usize) -> Vec<String>

Recursively chunks the given text into segments based on the maximum number of tokens per chunk.

§Arguments
  • text - A string slice that holds the text to be chunked.
  • recursion_depth - The current recursion depth.
§Returns

A vector of strings representing the chunks of the split text.

pub fn merge_splits(&self, splits: &[&str], separator: &str) -> (usize, String)

Merges the first N splits into a chunk that has at most chunk_size tokens.

§Arguments
  • splits - A slice of string slices representing the splits to merge.
  • separator - The separator used to split the text.
§Returns

A tuple containing:

  • The index at which merging stopped (exclusive).
  • The merged text.
§Examples
use semchunk_rs::Chunker;
let chunker = Chunker::new(4, Box::new(|s: &str| s.len() - s.replace(" ", "").len() + 1));
let splits = vec!["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"];
let separator = " ";
let (split_idx, merged) = chunker.merge_splits(&splits, separator);
assert_eq!(split_idx, 4);
assert_eq!(merged, "The quick brown fox");
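To make the contract concrete, here is a minimal standalone sketch of one way to merge a prefix of splits under a token budget. It is a simple linear scan written for illustration and is not the crate's implementation, which may use a different search strategy. The names `merge_splits_sketch` and `space_counter` are hypothetical.

```rust
/// Hypothetical counter matching the one used in the examples:
/// number of spaces plus one.
fn space_counter(s: &str) -> usize {
    s.len() - s.replace(" ", "").len() + 1
}

/// Joins the longest prefix of `splits` whose joined text has at most
/// `chunk_size` tokens. Returns the exclusive end index and the merged text.
/// If even the first split is over budget, returns `(0, String::new())`.
fn merge_splits_sketch(
    splits: &[&str],
    separator: &str,
    chunk_size: usize,
    token_counter: &dyn Fn(&str) -> usize,
) -> (usize, String) {
    let mut merged = String::new();
    let mut end = 0;
    for i in 1..=splits.len() {
        let candidate = splits[..i].join(separator);
        if token_counter(candidate.as_str()) > chunk_size {
            break; // adding split i - 1 would exceed the budget
        }
        merged = candidate;
        end = i;
    }
    (end, merged)
}
```

Called with the nine words from the example above, a single-space separator, a chunk size of 4, and `&space_counter`, the sketch returns `(4, "The quick brown fox")`.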

pub fn chunk(&self, text: &str) -> Vec<String>

Chunks the given text into segments based on the maximum number of tokens per chunk.

§Arguments
  • text - A string slice that holds the text to be chunked.
§Examples
use semchunk_rs::Chunker;

let chunker = Chunker::new(4, Box::new(|s: &str| s.len() - s.replace(" ", "").len() + 1));
let text = "The quick brown fox jumps over the lazy dog.";
let chunks = chunker.chunk(text);
assert_eq!(chunks, vec!["The quick brown fox", "jumps over the lazy", "dog."]);
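The behavior in this example can be reproduced with a much simpler greedy routine. The sketch below splits only on single spaces and packs words into chunks under the token budget. It is an illustration under stated assumptions, not the crate's algorithm: it does not recurse into oversized pieces or use a Splitter, and the name `chunk_words_sketch` is hypothetical.

```rust
/// Greedy word packing: each chunk takes as many consecutive words as fit
/// within `chunk_size` tokens. A single word that is over budget on its own
/// is emitted unchanged, so the loop always makes progress.
fn chunk_words_sketch(
    text: &str,
    chunk_size: usize,
    token_counter: &dyn Fn(&str) -> usize,
) -> Vec<String> {
    let words: Vec<&str> = text.split(' ').filter(|w| !w.is_empty()).collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        // `end` is exclusive; always take at least one word.
        let mut end = start + 1;
        while end < words.len()
            && token_counter(words[start..=end].join(" ").as_str()) <= chunk_size
        {
            end += 1;
        }
        chunks.push(words[start..end].join(" "));
        start = end;
    }
    chunks
}
```

With a chunk size of 4 and the spaces-plus-one counter from the example, the text "The quick brown fox jumps over the lazy dog." yields "The quick brown fox", "jumps over the lazy", and "dog.".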

Auto Trait Implementations§

impl Freeze for Chunker

impl !RefUnwindSafe for Chunker

impl !Send for Chunker

impl !Sync for Chunker

impl Unpin for Chunker

impl !UnwindSafe for Chunker
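The `!Send` and `!Sync` markers are consistent with `new` accepting a `Box<dyn Fn(&str) -> usize>`, a trait object that carries no `Send` or `Sync` bound. In practice this means a value holding such a box cannot be moved into or shared with another thread. One way to work within that limit, sketched here with the standard library only, is to build the boxed counter inside each thread. The helper names are hypothetical and not part of semchunk_rs.

```rust
use std::thread;

/// Hypothetical factory for a counter with the same boxed type that
/// `Chunker::new` accepts. The box itself is neither `Send` nor `Sync`.
fn make_counter() -> Box<dyn Fn(&str) -> usize> {
    Box::new(|s: &str| s.split_whitespace().count())
}

/// Counts tokens for each text on its own thread. The counter is created
/// inside the spawned closure, so the box never crosses a thread boundary;
/// only the owned `String` is moved into the thread.
fn count_in_threads(texts: Vec<String>) -> Vec<usize> {
    let handles: Vec<_> = texts
        .into_iter()
        .map(|text| {
            thread::spawn(move || {
                let counter = make_counter();
                counter(text.as_str())
            })
        })
        .collect();
    // Joining in spawn order keeps results aligned with the inputs.
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

The same idea would apply to a Chunker: construct it on the thread that uses it rather than constructing it once and sending it elsewhere.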

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.