Struct TextChunker

pub struct TextChunker { /* private fields */ }

Splits text by paragraphs, newlines, sentences, spaces, and finally graphemes, and builds chunks from the splits that are within the desired token ranges.
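
The separator hierarchy described above can be sketched with the standard library alone. This is an illustration, not the crate's implementation: `chunk_text`, `split_recursive`, and `count_tokens` are hypothetical names, characters stand in for tokens, single `char`s stand in for graphemes, and sentence detection is reduced to splitting on ". ".

```rust
/// Separator levels, coarsest first: paragraphs, newlines, sentences, spaces.
/// The final level (single characters) is handled inside `split_recursive`.
const SEPARATORS: [&str; 4] = ["\n\n", "\n", ". ", " "];

/// Stand-in for a tokenizer: counts characters instead of real tokens.
fn count_tokens(text: &str) -> usize {
    text.chars().count()
}

fn split_recursive(text: &str, level: usize, max_tokens: usize, out: &mut Vec<String>) {
    // A piece that already fits becomes a chunk as-is.
    if count_tokens(text) <= max_tokens {
        if !text.trim().is_empty() {
            out.push(text.trim().to_string());
        }
        return;
    }
    // Last resort: cut the piece into fixed-size runs of characters.
    if level == SEPARATORS.len() {
        let chars: Vec<char> = text.chars().collect();
        for piece in chars.chunks(max_tokens) {
            out.push(piece.iter().collect::<String>());
        }
        return;
    }
    // Greedily merge adjacent splits while the merged piece still fits.
    // Anything flushed is re-checked one level down, so oversized splits
    // get broken up by the next, finer separator.
    let sep = SEPARATORS[level];
    let mut current = String::new();
    for part in text.split(sep) {
        let candidate = if current.is_empty() {
            part.to_string()
        } else {
            format!("{current}{sep}{part}")
        };
        if count_tokens(&candidate) <= max_tokens {
            current = candidate;
        } else {
            if !current.is_empty() {
                split_recursive(&current, level + 1, max_tokens, out);
            }
            current = part.to_string();
        }
    }
    if !current.is_empty() {
        split_recursive(&current, level + 1, max_tokens, out);
    }
}

/// Splits `text` into chunks of at most `max_tokens` (here: characters).
/// Separators that fall between two chunks are dropped.
fn chunk_text(text: &str, max_tokens: usize) -> Vec<String> {
    assert!(max_tokens > 0, "max_tokens must be at least 1");
    let mut out = Vec::new();
    split_recursive(text, 0, max_tokens, &mut out);
    out
}
```

With a limit of 12, `chunk_text("alpha beta\n\ngamma delta epsilon", 12)` yields "alpha beta", "gamma delta", and "epsilon": the paragraph break is tried first, and only the oversized second paragraph is split further on spaces.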

Implementations

impl TextChunker

pub fn new() -> Result<Self>

Creates a new instance of the TextChunker struct using the default TikToken tokenizer.

pub fn new_with_tokenizer(custom_tokenizer: &Arc<Tokenizer>) -> Self

Creates a new instance of the TextChunker struct using a custom tokenizer, for example a Hugging Face tokenizer.

pub fn max_chunk_token_size(self, max_chunk_token_size: u32) -> Self

Sets the maximum token size for the chunks. Default is 1024.

  • max_chunk_token_size - The maximum token size of a chunk. Inclusive.

pub fn min_chunk_token_size(self, min_chunk_token_size: u32) -> Self

Sets the minimum token size for the chunks. Default is 75% of absolute_length_max (the maximum chunk token size). Used only by the DfsTextChunker to determine the minimum chunk size.

  • min_chunk_token_size - The minimum token size of a chunk.
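
The two size setters take `self` by value and return `Self`, so they chain. The sketch below mirrors that pattern on a stand-in type to show how the documented defaults interact. `ChunkerConfig` and `resolved_min` are hypothetical names, not part of the crate, and rounding the 75% default down to a whole token is an assumption.

```rust
/// Hypothetical stand-in for the chunker's size settings; not the crate's type.
#[derive(Debug, Clone, PartialEq)]
struct ChunkerConfig {
    max_chunk_token_size: u32,
    /// `None` means "derive the minimum from the maximum".
    min_chunk_token_size: Option<u32>,
}

impl ChunkerConfig {
    /// Starts from the documented default maximum of 1024 tokens.
    fn new() -> Self {
        Self {
            max_chunk_token_size: 1024,
            min_chunk_token_size: None,
        }
    }

    /// Consuming setter, same shape as `TextChunker::max_chunk_token_size`.
    fn max_chunk_token_size(mut self, max_chunk_token_size: u32) -> Self {
        self.max_chunk_token_size = max_chunk_token_size;
        self
    }

    /// Consuming setter, same shape as `TextChunker::min_chunk_token_size`.
    fn min_chunk_token_size(mut self, min_chunk_token_size: u32) -> Self {
        self.min_chunk_token_size = Some(min_chunk_token_size);
        self
    }

    /// The explicit minimum if one was set, otherwise 75% of the maximum
    /// (rounded down; the rounding rule is an assumption).
    fn resolved_min(&self) -> u32 {
        self.min_chunk_token_size
            .unwrap_or((u64::from(self.max_chunk_token_size) * 3 / 4) as u32)
    }
}
```

Under these assumptions, `ChunkerConfig::new().resolved_min()` is 768, and `ChunkerConfig::new().max_chunk_token_size(512).resolved_min()` is 384.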

pub fn use_dfs_semantic_splitter(self, use_dfs_semantic_splitter: bool) -> Self

The DfsTextChunker is faster and fully respects semantic separators. However, it produces less balanced chunk sizes and will fail if the text cannot be split. By default, the TextChunker attempts to chunk with the DfsTextChunker first and, if that fails, falls back to the LinearChunker.

  • use_dfs_semantic_splitter - Whether to use the DFS semantic splitter to attempt to build valid chunks. Default is true.
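
The try-then-fall-back behaviour described above can be sketched as follows. The two strategies here are deliberately simple stand-ins, not the crate's DfsTextChunker or LinearChunker: `strict_chunks` only splits on blank lines and gives up on an oversized paragraph, while `lenient_chunks` always succeeds by cutting at a fixed character count. All names and the character-based size limit are assumptions for illustration.

```rust
/// Strict strategy (stand-in for a separator-respecting chunker): splits only
/// on blank lines and fails if any paragraph is longer than `max_chars`.
fn strict_chunks(text: &str, max_chars: usize) -> Option<Vec<String>> {
    let mut chunks = Vec::new();
    for para in text.split("\n\n").map(str::trim).filter(|p| !p.is_empty()) {
        if para.chars().count() > max_chars {
            return None; // cannot fit this paragraph without breaking it
        }
        chunks.push(para.to_string());
    }
    Some(chunks)
}

/// Lenient strategy (stand-in for a fallback chunker): always succeeds by
/// cutting every `max_chars` characters, ignoring separators.
fn lenient_chunks(text: &str, max_chars: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(max_chars)
        .map(|piece| piece.iter().collect::<String>())
        .collect()
}

/// Tries the strict strategy first when `try_strict_first` is true and falls
/// back to the lenient one if it fails or is disabled.
fn chunk_with_fallback(text: &str, max_chars: usize, try_strict_first: bool) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be at least 1");
    if try_strict_first {
        if let Some(chunks) = strict_chunks(text, max_chars) {
            return chunks;
        }
    }
    lenient_chunks(text, max_chars)
}
```

The boolean plays the role of use_dfs_semantic_splitter: passing `false` skips the strict attempt entirely.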

pub fn overlap_percent(self, overlap_percent: f32) -> Self

Sets the percentage of overlap between chunks. Default is None. The full percentage is applied forward for the first chunk and backward for the last chunk. Middle chunks split the percentage evenly between forward and backward.

  • overlap_percent - The percentage of overlap between chunks. Minimum is 0.01, and maximum is 0.5. Default is None.
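
The forward and backward allocation described above can be made concrete with a small calculation. This sketch assumes the overlap budget is a fraction of a chunk's token count and that an odd budget gives the extra token to the forward side; neither detail is confirmed by the documentation, and `overlap_budget` is a hypothetical helper.

```rust
/// Returns the (backward, forward) overlap budgets, in tokens, for the chunk
/// at `index` out of `total` chunks.
///
/// Assumption: the budget is `chunk_tokens * overlap_percent`, rounded to the
/// nearest whole token.
fn overlap_budget(index: usize, total: usize, chunk_tokens: u32, overlap_percent: f32) -> (u32, u32) {
    let budget = (chunk_tokens as f32 * overlap_percent).round() as u32;
    if total <= 1 {
        // A single chunk has no neighbours to overlap with.
        (0, 0)
    } else if index == 0 {
        // First chunk: the whole budget extends forward.
        (0, budget)
    } else if index == total - 1 {
        // Last chunk: the whole budget extends backward.
        (budget, 0)
    } else {
        // Middle chunks: split the budget between both directions.
        let backward = budget / 2;
        (backward, budget - backward)
    }
}
```

For three chunks of 1024 tokens and an overlap of 0.25, the budgets are (0, 256) for the first chunk, (128, 128) for the middle chunk, and (256, 0) for the last.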

pub fn run(&self, incoming_text: &str) -> Option<Vec<String>>

Runs the TextChunker on the incoming text and returns the chunks as a vector of strings.

  • incoming_text - The natural language text to chunk.

pub fn run_return_result(&self, incoming_text: &str) -> Option<ChunkerResult>

Runs the TextChunker on the incoming text and returns the chunks as a ChunkerResult. The ChunkerResult contains the incoming text, the initial separator used, the chunks, the tokenizer, and the chunking duration. Useful for testing, benching, and diagnostics.

  • incoming_text - The natural language text to chunk.
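
To illustrate the kind of diagnostics such a result can carry, the sketch below wraps any chunking function and records the input, the chunks, and the elapsed time. `ChunkReport`, `run_with_report`, and `split_paragraphs` are hypothetical; the real ChunkerResult has its own fields (including the separator and tokenizer), which are not reproduced here.

```rust
use std::time::{Duration, Instant};

/// Hypothetical diagnostics record; a simplified stand-in for a chunker result.
#[derive(Debug)]
struct ChunkReport {
    incoming_text: String,
    chunks: Vec<String>,
    duration: Duration,
}

/// Runs `chunker` on `text`, timing the call. Returns `None` if the chunker
/// itself returns `None`, mirroring the `Option` return type of `run`.
fn run_with_report<F>(text: &str, chunker: F) -> Option<ChunkReport>
where
    F: Fn(&str) -> Option<Vec<String>>,
{
    let start = Instant::now();
    let chunks = chunker(text)?;
    Some(ChunkReport {
        incoming_text: text.to_string(),
        chunks,
        duration: start.elapsed(),
    })
}

/// Example chunker for the sketch: splits on blank lines, `None` on empty input.
fn split_paragraphs(text: &str) -> Option<Vec<String>> {
    if text.is_empty() {
        return None;
    }
    Some(text.split("\n\n").map(String::from).collect())
}
```

A caller would typically match on the returned `Option` and log `chunks.len()` and `duration` when benchmarking different settings.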

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<ST, DT> CastableFrom<ST, Initialized, Initialized> for DT
where ST: ?Sized, DT: ?Sized,

impl<ST, DT> CastableFrom<ST, Uninit, Uninit> for DT
where ST: ?Sized, DT: ?Sized,

impl<T> ErasedDestructor for T
where T: 'static,

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> MaybeSendSync for T
where T: Send + Sync,

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> PolicyExt for T
where T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Sized + Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow.

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Sized + Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow.

impl<T> Read<Exclusive, BecauseExclusive> for T
where T: ?Sized,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.