pub struct Chunker<'a> { /* private fields */ }Expand description
Chunker splits text at delimiter boundaries.
Created via chunk(), can be configured with .size() and .delimiters().
For multi-byte delimiters, use .pattern() instead.
Implementations§
Source§impl<'a> Chunker<'a>
impl<'a> Chunker<'a>
Sourcepub fn delimiters(self, delimiters: &'a [u8]) -> Self
pub fn delimiters(self, delimiters: &'a [u8]) -> Self
Set single-byte delimiters to split on.
Mutually exclusive with pattern() - last one set wins.
Sourcepub fn pattern(self, pattern: &'a [u8]) -> Self
pub fn pattern(self, pattern: &'a [u8]) -> Self
Set a multi-byte pattern to split on.
Use this for multi-byte delimiters like UTF-8 characters (e.g., metaspace ▁).
Mutually exclusive with delimiters() - last one set wins.
use chunk::chunk;
let metaspace = "▁".as_bytes(); // [0xE2, 0x96, 0x81]
let chunks: Vec<_> = chunk(b"Hello\xE2\x96\x81World\xE2\x96\x81Test")
.size(15)
.pattern(metaspace)
.prefix()
.collect();
assert_eq!(chunks[0], b"Hello");
assert_eq!(chunks[1], b"\xE2\x96\x81World\xE2\x96\x81Test");Sourcepub fn prefix(self) -> Self
pub fn prefix(self) -> Self
Put delimiter at the start of the next chunk (prefix mode).
use chunk::chunk;
let chunks: Vec<_> = chunk(b"Hello World").size(8).delimiters(b" ").prefix().collect();
assert_eq!(chunks, vec![b"Hello".as_slice(), b" World".as_slice()]);Sourcepub fn suffix(self) -> Self
pub fn suffix(self) -> Self
Put delimiter at the end of the current chunk (suffix mode, default).
use chunk::chunk;
let chunks: Vec<_> = chunk(b"Hello World").size(8).delimiters(b" ").suffix().collect();
assert_eq!(chunks, vec![b"Hello ".as_slice(), b"World".as_slice()]);Sourcepub fn consecutive(self) -> Self
pub fn consecutive(self) -> Self
Enable consecutive delimiter/pattern handling.
When splitting, ensures we split at the START of a consecutive run of the same delimiter/pattern, not in the middle. For example:
- With pattern: “word▁▁▁next” splits as [“word”][“▁▁▁next”]
- With delimiter: “word\n\n\nnext” splits as [“word”][“\n\n\nnext”]
This is useful for patterns that can merge (like BPE tokenization)
or when consecutive delimiters have semantic meaning (like \n\n
for paragraph breaks).
Works with both .pattern() and .delimiters().
use chunk::chunk;
// With pattern
let text = b"word\xE2\x96\x81\xE2\x96\x81\xE2\x96\x81next"; // word▁▁▁next
let metaspace = b"\xE2\x96\x81";
let chunks: Vec<_> = chunk(text)
.pattern(metaspace)
.size(10)
.prefix()
.consecutive()
.collect();
assert_eq!(chunks[0], b"word");
// With delimiters
let text = b"Hello\n\n\nWorld";
let chunks: Vec<_> = chunk(text)
.delimiters(b"\n")
.size(8)
.prefix()
.consecutive()
.collect();
assert_eq!(chunks[0], b"Hello");
assert_eq!(chunks[1], b"\n\n\nWorld");Sourcepub fn forward_fallback(self) -> Self
pub fn forward_fallback(self) -> Self
Enable forward fallback search.
When no delimiter/pattern is found in the backward search window, search forward from target_end instead of doing a hard split.
This ensures splits always occur at semantic boundaries when possible, even if the nearest boundary is past the target size.
Works with both .pattern() and .delimiters().
use chunk::chunk;
// With pattern
let text = b"verylongword\xE2\x96\x81short"; // verylongword▁short
let metaspace = b"\xE2\x96\x81";
let chunks: Vec<_> = chunk(text)
.pattern(metaspace)
.size(6)
.prefix()
.forward_fallback()
.collect();
// Without forward_fallback: hard split at position 6
// With forward_fallback: finds ▁ at position 12
assert_eq!(chunks[0], b"verylongword");
// With delimiters
let text = b"verylongword next";
let chunks: Vec<_> = chunk(text)
.delimiters(b" ")
.size(6)
.prefix()
.forward_fallback()
.collect();
assert_eq!(chunks[0], b"verylongword");
assert_eq!(chunks[1], b" next");Trait Implementations§
Source§impl<'a> Iterator for Chunker<'a>
impl<'a> Iterator for Chunker<'a>
Source§fn next(&mut self) -> Option<Self::Item>
fn next(&mut self) -> Option<Self::Item>
Source§fn next_chunk<const N: usize>(
&mut self,
) -> Result<[Self::Item; N], IntoIter<Self::Item, N>>where
Self: Sized,
fn next_chunk<const N: usize>(
&mut self,
) -> Result<[Self::Item; N], IntoIter<Self::Item, N>>where
Self: Sized,
iter_next_chunk)N values. Read more1.0.0 · Source§fn size_hint(&self) -> (usize, Option<usize>)
fn size_hint(&self) -> (usize, Option<usize>)
1.0.0 · Source§fn count(self) -> usizewhere
Self: Sized,
fn count(self) -> usizewhere
Self: Sized,
1.0.0 · Source§fn last(self) -> Option<Self::Item>where
Self: Sized,
fn last(self) -> Option<Self::Item>where
Self: Sized,
Source§fn advance_by(&mut self, n: usize) -> Result<(), NonZero<usize>>
fn advance_by(&mut self, n: usize) -> Result<(), NonZero<usize>>
iter_advance_by)n elements. Read more1.0.0 · Source§fn nth(&mut self, n: usize) -> Option<Self::Item>
fn nth(&mut self, n: usize) -> Option<Self::Item>
nth element of the iterator. Read more1.28.0 · Source§fn step_by(self, step: usize) -> StepBy<Self>where
Self: Sized,
fn step_by(self, step: usize) -> StepBy<Self>where
Self: Sized,
1.0.0 · Source§fn chain<U>(self, other: U) -> Chain<Self, <U as IntoIterator>::IntoIter>
fn chain<U>(self, other: U) -> Chain<Self, <U as IntoIterator>::IntoIter>
1.0.0 · Source§fn zip<U>(self, other: U) -> Zip<Self, <U as IntoIterator>::IntoIter>where
Self: Sized,
U: IntoIterator,
fn zip<U>(self, other: U) -> Zip<Self, <U as IntoIterator>::IntoIter>where
Self: Sized,
U: IntoIterator,
Source§fn intersperse(self, separator: Self::Item) -> Intersperse<Self>
fn intersperse(self, separator: Self::Item) -> Intersperse<Self>
iter_intersperse)separator between adjacent
items of the original iterator. Read moreSource§fn intersperse_with<G>(self, separator: G) -> IntersperseWith<Self, G>
fn intersperse_with<G>(self, separator: G) -> IntersperseWith<Self, G>
iter_intersperse)separator
between adjacent items of the original iterator. Read more1.0.0 · Source§fn map<B, F>(self, f: F) -> Map<Self, F>
fn map<B, F>(self, f: F) -> Map<Self, F>
1.0.0 · Source§fn filter<P>(self, predicate: P) -> Filter<Self, P>
fn filter<P>(self, predicate: P) -> Filter<Self, P>
1.0.0 · Source§fn filter_map<B, F>(self, f: F) -> FilterMap<Self, F>
fn filter_map<B, F>(self, f: F) -> FilterMap<Self, F>
1.0.0 · Source§fn enumerate(self) -> Enumerate<Self>where
Self: Sized,
fn enumerate(self) -> Enumerate<Self>where
Self: Sized,
1.0.0 · Source§fn skip_while<P>(self, predicate: P) -> SkipWhile<Self, P>
fn skip_while<P>(self, predicate: P) -> SkipWhile<Self, P>
1.0.0 · Source§fn take_while<P>(self, predicate: P) -> TakeWhile<Self, P>
fn take_while<P>(self, predicate: P) -> TakeWhile<Self, P>
1.57.0 · Source§fn map_while<B, P>(self, predicate: P) -> MapWhile<Self, P>
fn map_while<B, P>(self, predicate: P) -> MapWhile<Self, P>
1.0.0 · Source§fn skip(self, n: usize) -> Skip<Self>where
Self: Sized,
fn skip(self, n: usize) -> Skip<Self>where
Self: Sized,
n elements. Read more1.0.0 · Source§fn take(self, n: usize) -> Take<Self>where
Self: Sized,
fn take(self, n: usize) -> Take<Self>where
Self: Sized,
n elements, or fewer
if the underlying iterator ends sooner. Read more1.0.0 · Source§fn flat_map<U, F>(self, f: F) -> FlatMap<Self, U, F>
fn flat_map<U, F>(self, f: F) -> FlatMap<Self, U, F>
1.29.0 · Source§fn flatten(self) -> Flatten<Self>
fn flatten(self) -> Flatten<Self>
Source§fn map_windows<F, R, const N: usize>(self, f: F) -> MapWindows<Self, F, N>
fn map_windows<F, R, const N: usize>(self, f: F) -> MapWindows<Self, F, N>
iter_map_windows)f for each contiguous window of size N over
self and returns an iterator over the outputs of f. Like slice::windows(),
the windows during mapping overlap as well. Read more1.0.0 · Source§fn inspect<F>(self, f: F) -> Inspect<Self, F>
fn inspect<F>(self, f: F) -> Inspect<Self, F>
1.0.0 · Source§fn by_ref(&mut self) -> &mut Selfwhere
Self: Sized,
fn by_ref(&mut self) -> &mut Selfwhere
Self: Sized,
Iterator. Read moreSource§fn try_collect<B>(
&mut self,
) -> <<Self::Item as Try>::Residual as Residual<B>>::TryType
fn try_collect<B>( &mut self, ) -> <<Self::Item as Try>::Residual as Residual<B>>::TryType
iterator_try_collect)Source§fn collect_into<E>(self, collection: &mut E) -> &mut E
fn collect_into<E>(self, collection: &mut E) -> &mut E
iter_collect_into)1.0.0 · Source§fn partition<B, F>(self, f: F) -> (B, B)
fn partition<B, F>(self, f: F) -> (B, B)
Source§fn is_partitioned<P>(self, predicate: P) -> bool
fn is_partitioned<P>(self, predicate: P) -> bool
iter_is_partitioned)true precede all those that return false. Read more1.27.0 · Source§fn try_fold<B, F, R>(&mut self, init: B, f: F) -> R
fn try_fold<B, F, R>(&mut self, init: B, f: F) -> R
1.27.0 · Source§fn try_for_each<F, R>(&mut self, f: F) -> R
fn try_for_each<F, R>(&mut self, f: F) -> R
1.0.0 · Source§fn fold<B, F>(self, init: B, f: F) -> B
fn fold<B, F>(self, init: B, f: F) -> B
1.51.0 · Source§fn reduce<F>(self, f: F) -> Option<Self::Item>
fn reduce<F>(self, f: F) -> Option<Self::Item>
Source§fn try_reduce<R>(
&mut self,
f: impl FnMut(Self::Item, Self::Item) -> R,
) -> <<R as Try>::Residual as Residual<Option<<R as Try>::Output>>>::TryType
fn try_reduce<R>( &mut self, f: impl FnMut(Self::Item, Self::Item) -> R, ) -> <<R as Try>::Residual as Residual<Option<<R as Try>::Output>>>::TryType
iterator_try_reduce)1.0.0 · Source§fn all<F>(&mut self, f: F) -> bool
fn all<F>(&mut self, f: F) -> bool
1.0.0 · Source§fn any<F>(&mut self, f: F) -> bool
fn any<F>(&mut self, f: F) -> bool
1.0.0 · Source§fn find<P>(&mut self, predicate: P) -> Option<Self::Item>
fn find<P>(&mut self, predicate: P) -> Option<Self::Item>
1.30.0 · Source§fn find_map<B, F>(&mut self, f: F) -> Option<B>
fn find_map<B, F>(&mut self, f: F) -> Option<B>
Source§fn try_find<R>(
&mut self,
f: impl FnMut(&Self::Item) -> R,
) -> <<R as Try>::Residual as Residual<Option<Self::Item>>>::TryType
fn try_find<R>( &mut self, f: impl FnMut(&Self::Item) -> R, ) -> <<R as Try>::Residual as Residual<Option<Self::Item>>>::TryType
try_find)1.0.0 · Source§fn position<P>(&mut self, predicate: P) -> Option<usize>
fn position<P>(&mut self, predicate: P) -> Option<usize>
1.0.0 · Source§fn max(self) -> Option<Self::Item>
fn max(self) -> Option<Self::Item>
1.0.0 · Source§fn min(self) -> Option<Self::Item>
fn min(self) -> Option<Self::Item>
1.6.0 · Source§fn max_by_key<B, F>(self, f: F) -> Option<Self::Item>
fn max_by_key<B, F>(self, f: F) -> Option<Self::Item>
1.15.0 · Source§fn max_by<F>(self, compare: F) -> Option<Self::Item>
fn max_by<F>(self, compare: F) -> Option<Self::Item>
1.6.0 · Source§fn min_by_key<B, F>(self, f: F) -> Option<Self::Item>
fn min_by_key<B, F>(self, f: F) -> Option<Self::Item>
1.15.0 · Source§fn min_by<F>(self, compare: F) -> Option<Self::Item>
fn min_by<F>(self, compare: F) -> Option<Self::Item>
1.0.0 · Source§fn unzip<A, B, FromA, FromB>(self) -> (FromA, FromB)
fn unzip<A, B, FromA, FromB>(self) -> (FromA, FromB)
1.36.0 · Source§fn copied<'a, T>(self) -> Copied<Self>
fn copied<'a, T>(self) -> Copied<Self>
Source§fn array_chunks<const N: usize>(self) -> ArrayChunks<Self, N>where
Self: Sized,
fn array_chunks<const N: usize>(self) -> ArrayChunks<Self, N>where
Self: Sized,
iter_array_chunks)N elements of the iterator at a time. Read more1.11.0 · Source§fn product<P>(self) -> P
fn product<P>(self) -> P
Source§fn cmp_by<I, F>(self, other: I, cmp: F) -> Ordering
fn cmp_by<I, F>(self, other: I, cmp: F) -> Ordering
iter_order_by)Iterator with those
of another with respect to the specified comparison function. Read more1.5.0 · Source§fn partial_cmp<I>(self, other: I) -> Option<Ordering>
fn partial_cmp<I>(self, other: I) -> Option<Ordering>
PartialOrd elements of
this Iterator with those of another. The comparison works like short-circuit
evaluation, returning a result without comparing the remaining elements.
As soon as an order can be determined, the evaluation stops and a result is returned. Read moreSource§fn partial_cmp_by<I, F>(self, other: I, partial_cmp: F) -> Option<Ordering>where
Self: Sized,
I: IntoIterator,
F: FnMut(Self::Item, <I as IntoIterator>::Item) -> Option<Ordering>,
fn partial_cmp_by<I, F>(self, other: I, partial_cmp: F) -> Option<Ordering>where
Self: Sized,
I: IntoIterator,
F: FnMut(Self::Item, <I as IntoIterator>::Item) -> Option<Ordering>,
iter_order_by)Iterator with those
of another with respect to the specified comparison function. Read moreSource§fn eq_by<I, F>(self, other: I, eq: F) -> bool
fn eq_by<I, F>(self, other: I, eq: F) -> bool
iter_order_by)1.5.0 · Source§fn lt<I>(self, other: I) -> bool
fn lt<I>(self, other: I) -> bool
Iterator are lexicographically
less than those of another. Read more1.5.0 · Source§fn le<I>(self, other: I) -> bool
fn le<I>(self, other: I) -> bool
Iterator are lexicographically
less or equal to those of another. Read more1.5.0 · Source§fn gt<I>(self, other: I) -> bool
fn gt<I>(self, other: I) -> bool
Iterator are lexicographically
greater than those of another. Read more1.5.0 · Source§fn ge<I>(self, other: I) -> bool
fn ge<I>(self, other: I) -> bool
Iterator are lexicographically
greater than or equal to those of another. Read more