Struct MinHashStreaming 

pub struct MinHashStreaming<T: Tokenizer, const H: usize> { /* private fields */ }
Available on crate feature minhash only.
Buffered streaming sketcher.

Wraps a MinHashFingerprinter and accumulates UTF-8 bytes across update calls. Calling finalize runs the offline algorithm once on the accumulated buffer.
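
The buffer-then-sketch pattern described above can be illustrated with a small standalone sketch. It does not use this crate: `offline_sketch` and `BufferedSketcher` are hypothetical names, and the toy hash scheme (one seeded `DefaultHasher` per slot over whitespace tokens) is an assumption chosen only to keep the example short.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy offline sketcher (hypothetical): for each of `H` seeds, keep the
/// minimum hash over all whitespace-separated tokens.
fn offline_sketch<const H: usize>(text: &str) -> [u64; H] {
    let mut sig = [u64::MAX; H];
    for token in text.split_whitespace() {
        for (seed, slot) in sig.iter_mut().enumerate() {
            let mut hasher = DefaultHasher::new();
            (seed as u64).hash(&mut hasher);
            token.hash(&mut hasher);
            *slot = (*slot).min(hasher.finish());
        }
    }
    sig
}

/// Hypothetical buffered wrapper: `update` only appends bytes, and
/// `finalize` validates the buffer and runs the offline function once.
struct BufferedSketcher<const H: usize> {
    buf: Vec<u8>,
}

impl<const H: usize> BufferedSketcher<H> {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    fn update(&mut self, chunk: &[u8]) {
        self.buf.extend_from_slice(chunk);
    }

    fn finalize(self) -> Result<[u64; H], std::str::Utf8Error> {
        let text = std::str::from_utf8(&self.buf)?;
        Ok(offline_sketch::<H>(text))
    }
}
```

Because the sketch is computed only at the end, how the input is split into chunks should not affect the result: feeding a document in 5-byte pieces yields the same signature as calling `offline_sketch` on the whole string.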

Implementations

impl<T: Tokenizer, const H: usize> MinHashStreaming<T, H>

pub fn new(inner: MinHashFingerprinter<T, H>) -> Self

Construct a streamer wrapping inner.

Buffer cap defaults to DEFAULT_MAX_BUFFER_BYTES (16 MiB).

Arguments
  • inner: the offline MinHashFingerprinter whose canonicalizer, tokenizer, and hash configuration the streamer inherits.

Example
use txtfp::{
    Canonicalizer, MinHashFingerprinter, MinHashStreaming,
    ShingleTokenizer, WordTokenizer,
};

let s = MinHashStreaming::<_, 64>::new(MinHashFingerprinter::new(
    Canonicalizer::default(),
    ShingleTokenizer { k: 3, inner: WordTokenizer },
));
assert_eq!(s.buffered_bytes(), 0);

pub fn with_max_bytes(self, max_bytes: usize) -> Self

Override the buffer cap.

Useful for tests or constrained environments where 16 MiB is too generous. Setting the cap below the document size causes the next update call to return crate::Error::InvalidInput.

Arguments
  • max_bytes: maximum total bytes the streamer is willing to accumulate.
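
As a rough illustration of how a buffer cap like this can be enforced, here is a standalone sketch. `CappedBuffer` and `InvalidInput` are hypothetical stand-ins rather than this crate's types, and leaving the buffer untouched when a chunk is rejected is an assumed policy.

```rust
/// Hypothetical error type standing in for the crate's `Error::InvalidInput`.
#[derive(Debug, PartialEq)]
struct InvalidInput;

/// Hypothetical capped buffer; field and method names are illustrative.
struct CappedBuffer {
    buf: Vec<u8>,
    max_bytes: usize,
}

impl CappedBuffer {
    fn new() -> Self {
        // 16 MiB, matching the documented default cap.
        Self { buf: Vec::new(), max_bytes: 16 * 1024 * 1024 }
    }

    /// Builder-style override, consuming and returning `self`.
    fn with_max_bytes(mut self, max_bytes: usize) -> Self {
        self.max_bytes = max_bytes;
        self
    }

    /// Reject a chunk that would push the total past the cap
    /// (assumed policy: the buffer is left unchanged on error).
    fn update(&mut self, chunk: &[u8]) -> Result<(), InvalidInput> {
        let total = self.buf.len().checked_add(chunk.len()).ok_or(InvalidInput)?;
        if total > self.max_bytes {
            return Err(InvalidInput);
        }
        self.buf.extend_from_slice(chunk);
        Ok(())
    }
}
```

With a cap of 8 bytes, a 5-byte chunk is accepted, a following 4-byte chunk is rejected, and a 3-byte chunk still fits afterwards.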

pub fn buffered_bytes(&self) -> usize

Total bytes accumulated so far (excluding the unfinished multi-byte UTF-8 carry).

Returns

The size of the validated UTF-8 prefix. The streamer may also hold a few additional bytes in a transient carry buffer when an update arrives mid-codepoint; those are not counted here.
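
The split between a validated prefix and a transient carry can be sketched with the standard library alone. `Utf8Accumulator` below is a hypothetical type, not this crate's implementation; it relies on `Utf8Error::valid_up_to` and `Utf8Error::error_len` to tell an incomplete trailing code point apart from invalid bytes.

```rust
/// Hypothetical accumulator: `text` holds the validated UTF-8 prefix and
/// `carry` holds the trailing bytes of a code point that is not complete yet.
#[derive(Default)]
struct Utf8Accumulator {
    text: String,
    carry: Vec<u8>,
}

impl Utf8Accumulator {
    /// Append `chunk`, moving every complete code point into `text`.
    fn update(&mut self, chunk: &[u8]) -> Result<(), std::str::Utf8Error> {
        let mut pending = std::mem::take(&mut self.carry);
        pending.extend_from_slice(chunk);
        match std::str::from_utf8(&pending) {
            Ok(s) => self.text.push_str(s),
            // `error_len() == None` means the input ended mid-code-point:
            // keep the valid prefix and carry the incomplete tail forward.
            Err(e) if e.error_len().is_none() => {
                let (valid, tail) = pending.split_at(e.valid_up_to());
                // `valid` was just checked by `from_utf8`, so this cannot fail.
                self.text.push_str(std::str::from_utf8(valid).expect("validated prefix"));
                self.carry = tail.to_vec();
            }
            // Genuinely invalid bytes are reported; in this sketch the
            // pending bytes are dropped when that happens.
            Err(e) => return Err(e),
        }
        Ok(())
    }

    /// Size of the validated prefix; the carry is deliberately not counted.
    fn buffered_bytes(&self) -> usize {
        self.text.len()
    }
}
```

For example, if "é" (bytes 0xC3 0xA9) is split across two chunks, the first byte sits in the carry and is not counted until the second chunk completes the code point.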

Trait Implementations

impl<T: Tokenizer, const H: usize> StreamingFingerprinter for MinHashStreaming<T, H>

type Output = MinHashSig<H>

Available on crate features minhash or simhash or lsh or tlsh only.
The fingerprint produced at end-of-stream.

fn update(&mut self, chunk: &[u8]) -> Result<()>

Available on crate features minhash or simhash or lsh or tlsh only.
Append chunk to the internal buffer.

fn finalize(self) -> Result<Self::Output>

Available on crate features minhash or simhash or lsh or tlsh only.
Finalize the running state and produce the fingerprint. Consumes the streamer.

fn reset(&mut self)

Available on crate features minhash or simhash or lsh or tlsh only.
Drop the buffer so the same streamer can be reused without reallocating.
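
The update, finalize, and reset lifecycle can be exercised with a generic helper that drains any `std::io::Read` source. The sketch below is standalone: `Streaming` is a local stand-in that copies only the three method shapes listed above, its `io::Result` error type is an assumption, and `ByteCounter` and `fingerprint_reader` are hypothetical names used only to make the example runnable.

```rust
use std::io::{self, Read};

/// Local stand-in for the streaming trait; the error type is an assumption.
trait Streaming {
    type Output;
    fn update(&mut self, chunk: &[u8]) -> io::Result<()>;
    fn finalize(self) -> io::Result<Self::Output>;
    fn reset(&mut self);
}

/// Feed a reader to a streamer in fixed-size chunks, then finalize it.
fn fingerprint_reader<R: Read, S: Streaming>(
    mut reader: R,
    mut streamer: S,
) -> io::Result<S::Output> {
    let mut chunk = [0u8; 4096];
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            break; // end of stream
        }
        streamer.update(&chunk[..n])?;
    }
    streamer.finalize()
}

/// Toy implementation used only to exercise the helper: its "fingerprint"
/// is the number of bytes seen.
#[derive(Default)]
struct ByteCounter {
    buf: Vec<u8>,
}

impl Streaming for ByteCounter {
    type Output = usize;

    fn update(&mut self, chunk: &[u8]) -> io::Result<()> {
        self.buf.extend_from_slice(chunk);
        Ok(())
    }

    fn finalize(self) -> io::Result<usize> {
        Ok(self.buf.len())
    }

    fn reset(&mut self) {
        // `clear` keeps the allocation, so reuse does not reallocate.
        self.buf.clear();
    }
}
```

Since `finalize` takes `self` by value, a streamer cannot be reused after it; `reset` is the way to start over on the same instance before finalizing.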

Auto Trait Implementations

impl<T, const H: usize> Freeze for MinHashStreaming<T, H>
where T: Freeze,

impl<T, const H: usize> RefUnwindSafe for MinHashStreaming<T, H>
where T: RefUnwindSafe,

impl<T, const H: usize> Send for MinHashStreaming<T, H>

impl<T, const H: usize> Sync for MinHashStreaming<T, H>

impl<T, const H: usize> Unpin for MinHashStreaming<T, H>
where T: Unpin,

impl<T, const H: usize> UnsafeUnpin for MinHashStreaming<T, H>
where T: UnsafeUnpin,

impl<T, const H: usize> UnwindSafe for MinHashStreaming<T, H>
where T: UnwindSafe,
