pub struct MinHashStreaming<T: Tokenizer, const H: usize> { /* private fields */ }
Available on crate feature minhash only.
Buffered streaming sketcher.
Wraps a MinHashFingerprinter and accumulates UTF-8 bytes across
update calls. finalize runs the offline algorithm on the
accumulated buffer.
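The buffer-then-finalize design can be illustrated without the crate itself. The following is a minimal sketch, not the crate's implementation: `ToyStreaming`, `toy_offline_sketch`, and the use of FNV-1a are hypothetical stand-ins, chosen only to show that feeding chunks through update and then calling finalize yields the same result as running the offline routine on the whole input.

```rust
/// Hypothetical offline routine: the minimum FNV-1a hash over
/// whitespace-separated words. A stand-in for a real MinHash signature.
fn toy_offline_sketch(text: &str) -> Option<u64> {
    text.split_whitespace()
        .map(|word| {
            let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
            for b in word.bytes() {
                h ^= u64::from(b);
                h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV-1a 64-bit prime
            }
            h
        })
        .min()
}

/// Hypothetical streamer: buffers every chunk and defers all work to finalize.
#[derive(Default)]
struct ToyStreaming {
    buf: Vec<u8>,
}

impl ToyStreaming {
    /// Appends the chunk to the internal buffer; no hashing happens here.
    fn update(&mut self, chunk: &[u8]) {
        self.buf.extend_from_slice(chunk);
    }

    /// Validates the accumulated bytes as UTF-8, then runs the offline routine once.
    fn finalize(self) -> Result<Option<u64>, std::str::Utf8Error> {
        let text = std::str::from_utf8(&self.buf)?;
        Ok(toy_offline_sketch(text))
    }
}
```

Because all work is deferred to finalize, the chunk boundaries chosen by the caller cannot affect the result; the trade-off is that memory use grows with the total input size, which is why a buffer cap exists.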
Implementations§
impl<T: Tokenizer, const H: usize> MinHashStreaming<T, H>
pub fn new(inner: MinHashFingerprinter<T, H>) -> Self
Construct a streamer wrapping inner.
Buffer cap defaults to DEFAULT_MAX_BUFFER_BYTES (16 MiB).
§Arguments
inner: the offline MinHashFingerprinter whose canonicalizer, tokenizer, and hash configuration the streamer inherits.
§Example
use txtfp::{
Canonicalizer, MinHashFingerprinter, MinHashStreaming,
ShingleTokenizer, WordTokenizer,
};
let s = MinHashStreaming::<_, 64>::new(MinHashFingerprinter::new(
Canonicalizer::default(),
ShingleTokenizer { k: 3, inner: WordTokenizer },
));
assert_eq!(s.buffered_bytes(), 0);
pub fn with_max_bytes(self, max_bytes: usize) -> Self
Override the buffer cap.
Useful for tests or constrained environments where 16 MiB is too
generous. Setting the cap below the document size causes the
next update call to return crate::Error::InvalidInput.
§Arguments
max_bytes: maximum total bytes the streamer is willing to accumulate.
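As a rough illustration of how a builder-style cap override can interact with update, here is a minimal sketch. `CappedBuffer` and `CapExceeded` are hypothetical names, and leaving the buffer unchanged on rejection is an assumption of this sketch, not documented behavior of MinHashStreaming.

```rust
/// Hypothetical error type standing in for an "input too large" error.
#[derive(Debug, PartialEq)]
struct CapExceeded;

/// Hypothetical capped buffer; names and behavior are illustrative only.
struct CappedBuffer {
    buf: Vec<u8>,
    max_bytes: usize,
}

impl CappedBuffer {
    /// Starts with a 16 MiB cap, mirroring the default described above.
    fn new() -> Self {
        Self { buf: Vec::new(), max_bytes: 16 * 1024 * 1024 }
    }

    /// Builder-style override: consumes self and returns it with the new cap.
    fn with_max_bytes(mut self, max_bytes: usize) -> Self {
        self.max_bytes = max_bytes;
        self
    }

    /// Rejects a chunk that would push the total past the cap.
    /// Assumption: the buffer is left unchanged when a chunk is rejected.
    fn update(&mut self, chunk: &[u8]) -> Result<(), CapExceeded> {
        if self.buf.len().saturating_add(chunk.len()) > self.max_bytes {
            return Err(CapExceeded);
        }
        self.buf.extend_from_slice(chunk);
        Ok(())
    }

    /// Number of bytes accepted so far.
    fn buffered_bytes(&self) -> usize {
        self.buf.len()
    }
}
```

Checking the cap before copying keeps a rejected chunk from ever being allocated into the buffer, which is the property a small cap is meant to guarantee in constrained environments.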
pub fn buffered_bytes(&self) -> usize
Total bytes accumulated so far (excluding the unfinished multi-byte UTF-8 carry).
§Returns
The size of the validated UTF-8 prefix. The streamer may also hold a few additional bytes in a transient carry buffer when an update arrives mid-codepoint; those are not counted here.
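The distinction between the validated prefix and the transient carry can be sketched with the standard library alone. This is a hedged illustration, not the crate's code: `CarryBuffer` and `InvalidUtf8` are hypothetical, and the sketch relies on `Utf8Error::error_len()` returning `None` when the input ends in the middle of a code point.

```rust
/// Hypothetical error for bytes that can never form valid UTF-8.
#[derive(Debug, PartialEq)]
struct InvalidUtf8;

/// Hypothetical buffer that keeps validated UTF-8 apart from an
/// unfinished trailing code point.
#[derive(Default)]
struct CarryBuffer {
    valid: Vec<u8>, // always valid UTF-8
    carry: Vec<u8>, // at most 3 bytes of an incomplete code point
}

impl CarryBuffer {
    fn update(&mut self, chunk: &[u8]) -> Result<(), InvalidUtf8> {
        // Re-join the previous carry with the new bytes before validating.
        let mut pending = std::mem::take(&mut self.carry);
        pending.extend_from_slice(chunk);
        match std::str::from_utf8(&pending) {
            Ok(_) => self.valid.extend_from_slice(&pending),
            Err(e) => {
                // Some(_) means a byte sequence that is invalid regardless of
                // what follows; this sketch drops the pending bytes and errors.
                if e.error_len().is_some() {
                    return Err(InvalidUtf8);
                }
                // None means the input stopped mid-code-point: keep the tail.
                let (good, tail) = pending.split_at(e.valid_up_to());
                self.valid.extend_from_slice(good);
                self.carry.extend_from_slice(tail);
            }
        }
        Ok(())
    }

    /// Counts only the validated prefix, not the carry.
    fn buffered_bytes(&self) -> usize {
        self.valid.len()
    }
}
```

For instance, feeding the two bytes of "é" (0xC3, 0xA9) in separate updates leaves the first byte in the carry, so the reported count grows only once the second byte arrives.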
Trait Implementations§
impl<T: Tokenizer, const H: usize> StreamingFingerprinter for MinHashStreaming<T, H>
type Output = MinHashSig<H>
Available on crate features minhash or simhash or lsh or tlsh only.
The fingerprint produced at end-of-stream.
fn update(&mut self, chunk: &[u8]) -> Result<()>
Available on crate features minhash or simhash or lsh or tlsh only.
Append chunk to the internal buffer.
Auto Trait Implementations§
impl<T, const H: usize> Freeze for MinHashStreaming<T, H> where
T: Freeze,
impl<T, const H: usize> RefUnwindSafe for MinHashStreaming<T, H> where
T: RefUnwindSafe,
impl<T, const H: usize> Send for MinHashStreaming<T, H>
impl<T, const H: usize> Sync for MinHashStreaming<T, H>
impl<T, const H: usize> Unpin for MinHashStreaming<T, H> where
T: Unpin,
impl<T, const H: usize> UnsafeUnpin for MinHashStreaming<T, H> where
T: UnsafeUnpin,
impl<T, const H: usize> UnwindSafe for MinHashStreaming<T, H> where
T: UnwindSafe,
Blanket Implementations§
impl<T> ArchivePointee for T
type ArchivedMetadata = ()
The archived version of the pointer metadata for this type.
fn pointer_metadata(
    _: &<T as ArchivePointee>::ArchivedMetadata,
) -> <T as Pointee>::Metadata
Converts some archived metadata to the pointer metadata for itself.
impl<T> BorrowMut<T> for T where
T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
impl<T> LayoutRaw for T
fn layout_raw(_: <T as Pointee>::Metadata) -> Result<Layout, LayoutError>
Returns the layout of the type.
impl<T, N1, N2> Niching<NichedOption<T, N1>> for N2
unsafe fn is_niched(niched: *const NichedOption<T, N1>) -> bool
Returns whether the given value has been niched.
fn resolve_niched(out: Place<NichedOption<T, N1>>)
Writes data to out indicating that a T is niched.