pub struct TextChunker { /* private fields */ }
Splits text by paragraphs, newlines, sentences, spaces, and finally graphemes, and builds chunks from the splits that are within the desired token ranges.
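The splitting order described above can be pictured as a fallback through progressively finer separators. The sketch below is only an illustration, not the crate's implementation. It assumes whitespace-separated words as a stand-in for tokens, uses hypothetical helpers (`count_tokens`, `split_recursive`, `split_text`), drops the separators it splits on, omits the grapheme level (which would need a Unicode segmentation library), and does not merge small pieces back into larger chunks.

```rust
/// Separators tried in order, coarsest first: paragraph, newline,
/// sentence, space. The exact strings are assumptions for illustration.
const SEPARATORS: [&str; 4] = ["\n\n", "\n", ". ", " "];

/// Stand-in for a real tokenizer: counts whitespace-separated words.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Hypothetical helper: splits `text` on the separator at `level` and
/// recurses to the next finer separator for any piece that is still too big.
fn split_recursive(text: &str, level: usize, max_tokens: usize, out: &mut Vec<String>) {
    let trimmed = text.trim();
    if trimmed.is_empty() {
        return;
    }
    // Small enough, or no finer separator left: emit the piece as-is.
    if count_tokens(trimmed) <= max_tokens || level >= SEPARATORS.len() {
        out.push(trimmed.to_string());
        return;
    }
    for piece in trimmed.split(SEPARATORS[level]) {
        split_recursive(piece, level + 1, max_tokens, out);
    }
}

/// Entry point: returns the pieces in document order.
fn split_text(text: &str, max_tokens: usize) -> Vec<String> {
    let mut out = Vec::new();
    split_recursive(text, 0, max_tokens, &mut out);
    out
}
```

Calling `split_text(text, 3)` keeps a three-word paragraph whole and only descends to sentence or word level for paragraphs that exceed the limit.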
Implementations
impl TextChunker
pub fn new() -> Result<Self>
Creates a new instance of the TextChunker struct using the default TikToken tokenizer.
pub fn new_with_tokenizer(custom_tokenizer: &Arc<Tokenizer>) -> Self
Creates a new instance of the TextChunker struct using a custom tokenizer, for example a Hugging Face tokenizer.
pub fn max_chunk_token_size(self, max_chunk_token_size: u32) -> Self
Sets the maximum token size for the chunks. Default is 1024.
max_chunk_token_size - The maximum token size of a chunk (inclusive).
pub fn min_chunk_token_size(self, min_chunk_token_size: u32) -> Self
Sets the minimum token size for the chunks. Default is 75% of the absolute_length_max. Used solely by the [DfsTextChunker] to determine the minimum chunk size.
min_chunk_token_size - The minimum token size of a chunk.
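The way the two size settings and their defaults interact can be sketched with a small builder. This is a hedged illustration only: `SizeSettings` and `effective_min` are hypothetical names, not part of the crate, and the sketch assumes that absolute_length_max refers to the configured maximum chunk token size.

```rust
/// Hypothetical stand-in for the size settings; not the crate's type.
#[derive(Debug, Clone, Copy)]
struct SizeSettings {
    max_chunk_token_size: u32,
    /// `None` means "derive the minimum from the maximum".
    min_chunk_token_size: Option<u32>,
}

impl SizeSettings {
    /// Starts from the documented default maximum of 1024 tokens.
    fn new() -> Self {
        Self { max_chunk_token_size: 1024, min_chunk_token_size: None }
    }

    /// Consuming setter, mirroring the `self -> Self` builder style.
    fn max_chunk_token_size(mut self, value: u32) -> Self {
        self.max_chunk_token_size = value;
        self
    }

    /// Consuming setter for an explicit minimum.
    fn min_chunk_token_size(mut self, value: u32) -> Self {
        self.min_chunk_token_size = Some(value);
        self
    }

    /// The explicit minimum if one was set, otherwise 75% of the maximum
    /// (assumption: the default is computed from the maximum).
    fn effective_min(&self) -> u32 {
        self.min_chunk_token_size
            .unwrap_or((self.max_chunk_token_size as u64 * 3 / 4) as u32)
    }
}
```

Under these assumptions, `SizeSettings::new().max_chunk_token_size(400).effective_min()` yields 300, while an explicit minimum always takes precedence.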
pub fn use_dfs_semantic_splitter(self, use_dfs_semantic_splitter: bool) -> Self
The [DfsTextChunker] is faster and fully respects semantic separators. However, it produces less balanced chunk sizes and will fail if the text cannot be split.
By default the TextChunker attempts to chunk with the [DfsTextChunker] first, and if that fails, it will use the [LinearChunker].
use_dfs_semantic_splitter - Whether to use the DFS semantic splitter to attempt to build valid chunks. Default is true.
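The "try the strict strategy first, then fall back" behavior described above can be sketched as follows. None of this is the crate's code: `strict_split` and `linear_split` are hypothetical stand-ins for the [DfsTextChunker] and [LinearChunker], and whitespace-separated words stand in for tokens.

```rust
/// Stand-in for a real tokenizer: counts whitespace-separated words.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Strict strategy (hypothetical): split only on blank lines and fail
/// if any resulting paragraph is larger than `max_tokens`.
fn strict_split(text: &str, max_tokens: usize) -> Option<Vec<String>> {
    let parts: Vec<String> = text
        .split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .map(String::from)
        .collect();
    if parts.is_empty() || parts.iter().any(|p| count_tokens(p) > max_tokens) {
        None
    } else {
        Some(parts)
    }
}

/// Lenient strategy (hypothetical): pack words greedily, ignoring separators.
fn linear_split(text: &str, max_tokens: usize) -> Option<Vec<String>> {
    if max_tokens == 0 {
        return None;
    }
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() {
        return None;
    }
    Some(words.chunks(max_tokens).map(|c| c.join(" ")).collect())
}

/// Tries the strict strategy first when enabled, then falls back.
fn chunk(text: &str, max_tokens: usize, use_strict_first: bool) -> Option<Vec<String>> {
    if use_strict_first {
        if let Some(chunks) = strict_split(text, max_tokens) {
            return Some(chunks);
        }
    }
    linear_split(text, max_tokens)
}
```

In this sketch, passing `false` for `use_strict_first` skips the separator-respecting attempt entirely, which trades semantic boundaries for more evenly sized chunks.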
pub fn overlap_percent(self, overlap_percent: f32) -> Self
Sets the percentage of overlap between chunks. Default is None. The full percentage is applied forward for the first chunk and backward for the last chunk. Middle chunks split the percentage evenly between forward and backward.
overlap_percent - The percentage of overlap between chunks. Minimum is 0.01, and maximum is 0.5. Default is None.
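The forward/backward distribution described above can be worked through numerically. The sketch below rests on several assumptions that the documentation does not confirm: the percentage is taken of the maximum chunk token size, out-of-range values are clamped to the documented bounds rather than rejected, and the hypothetical `overlap_budget` function returns token counts.

```rust
/// Hypothetical helper: returns the (backward, forward) overlap budgets,
/// in tokens, for the chunk at `index` out of `chunk_count` chunks.
fn overlap_budget(
    index: usize,
    chunk_count: usize,
    max_tokens: u32,
    overlap_percent: f32,
) -> (u32, u32) {
    // Assumption: values outside the documented 0.01..=0.5 range are clamped.
    let pct = overlap_percent.clamp(0.01, 0.5);
    let total = (max_tokens as f32 * pct).floor() as u32;

    if chunk_count <= 1 {
        // A single chunk has no neighbors to overlap with.
        (0, 0)
    } else if index == 0 {
        // First chunk: the full budget extends forward.
        (0, total)
    } else if index + 1 == chunk_count {
        // Last chunk: the full budget extends backward.
        (total, 0)
    } else {
        // Middle chunks: split the budget between both directions.
        let backward = total / 2;
        (backward, total - backward)
    }
}
```

With a 100-token maximum and an overlap of 0.25, the first chunk gets 25 tokens forward, the last gets 25 backward, and a middle chunk gets 12 backward and 13 forward.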
pub fn run(&self, incoming_text: &str) -> Option<Vec<String>>
Runs the TextChunker on the incoming text and returns the chunks as a vector of strings.
incoming_text - The natural language text to chunk.
pub fn run_return_result(&self, incoming_text: &str) -> Option<ChunkerResult>
Runs the TextChunker on the incoming text and returns the chunks as a ChunkerResult.
The ChunkerResult contains the incoming text, the initial separator used, the chunks, the tokenizer, and the chunking duration. Useful for testing, benchmarking, and diagnostics.
incoming_text - The natural language text to chunk.
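To make the shape of such a diagnostic result concrete, here is a minimal sketch. `RunReport` and `run_with_report` are hypothetical names, not the crate's `ChunkerResult` or its methods. The struct mirrors the fields listed above except the tokenizer, and the chunking itself is a simple paragraph split used only so that there is something to time.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in mirroring the described fields (tokenizer omitted).
#[derive(Debug)]
struct RunReport {
    incoming_text: String,
    initial_separator: &'static str,
    chunks: Vec<String>,
    duration: Duration,
}

/// Splits on blank lines and records how long the split took.
/// Returns `None` when no non-empty chunk could be produced.
fn run_with_report(incoming_text: &str) -> Option<RunReport> {
    let start = Instant::now();
    let initial_separator = "\n\n";
    let chunks: Vec<String> = incoming_text
        .split(initial_separator)
        .map(str::trim)
        .filter(|c| !c.is_empty())
        .map(String::from)
        .collect();
    if chunks.is_empty() {
        return None;
    }
    Some(RunReport {
        incoming_text: incoming_text.to_string(),
        initial_separator,
        chunks,
        duration: start.elapsed(),
    })
}
```

Returning the input, the separator, and the timing alongside the chunks lets a test or benchmark inspect one value instead of re-deriving each piece.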
Auto Trait Implementations
impl Freeze for TextChunker
impl RefUnwindSafe for TextChunker
impl Send for TextChunker
impl Sync for TextChunker
impl Unpin for TextChunker
impl UnsafeUnpin for TextChunker
impl UnwindSafe for TextChunker
Blanket Implementations
impl<T> BorrowMut<T> for T where
T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<ST, DT> CastableFrom<ST, Initialized, Initialized> for DT
impl<ST, DT> CastableFrom<ST, Uninit, Uninit> for DT
impl<T> ErasedDestructor for T where
T: 'static,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.