pub struct DedupVectorStore<S, F = fn(&str) -> String> { /* private fields */ }
A VectorStore decorator that silently skips documents whose
normalised content fingerprint is already in the seen-set.
§Type parameters
S: the inner VectorStore implementation.
F: the fingerprint function, fn(&str) -> String. Defaults to normalized_fingerprint (lowercase + whitespace-collapse + FNV-1a). Provide a custom function via DedupVectorStore::with_fingerprint when you need a different dedup key (e.g. document ID, composite hash).
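The default pipeline described above (lowercase, collapse whitespace, FNV-1a) can be sketched roughly as follows. This is an illustrative stand-in, not the crate's implementation: the name `sketch_fingerprint`, the 64-bit hash width, and the hex output encoding are all assumptions, so its output is not guaranteed to match `normalized_fingerprint`.

```rust
/// Illustrative stand-in for the default fingerprint. The real
/// `normalized_fingerprint` may use a different hash width or encoding.
fn sketch_fingerprint(text: &str) -> String {
    // Lowercase, then collapse every run of whitespace into one space.
    // `split_whitespace` also drops leading and trailing whitespace.
    let normalized = text
        .to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ");

    // 64-bit FNV-1a: XOR each byte into the state, then multiply by the prime.
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for byte in normalized.bytes() {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }

    // Assumed encoding: fixed-width lowercase hex.
    format!("{hash:016x}")
}
```

Under this scheme `"Hello   World"` and `"  hello world\n"` produce the same key, which is what lets the decorator treat them as duplicates.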
§Persistence across restarts
The seen-set lives in memory and is cleared on process restart. To
survive restarts, query your storage for existing content fingerprints
on startup and pass them to DedupVectorStore::with_seen. The
normalized_fingerprint function is public so you can pre-compute
hashes from stored records.
let hashes = db.query_all("SELECT content_hash FROM facts").await?;
let store = DedupVectorStore::with_seen(inner, hashes);
Implementations§
impl<S: VectorStore> DedupVectorStore<S, fn(&str) -> String>
pub fn new(inner: S) -> Self
Wrap inner with an empty seen-set and the default
normalized_fingerprint function.
pub fn with_seen(inner: S, seen: impl IntoIterator<Item = String>) -> Self
Wrap inner and pre-populate the seen-set from seen fingerprints.
Use this when restarting a process to restore the dedup state from previously persisted fingerprints.
impl<S, F> DedupVectorStore<S, F>
pub fn with_fingerprint(inner: S, f: F) -> Self
Wrap inner with a custom fingerprint function and an empty seen-set.
The function receives the raw document text and returns a string key. Documents whose key is already in the seen-set are skipped.
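As an illustration of a custom dedup key, the sketch below fingerprints a document by an ID header instead of by its content. The `id:` first-line convention and the name `id_fingerprint` are hypothetical and chosen only for this example.

```rust
/// Hypothetical convention: a document may begin with a line such as
/// `id: doc-42`. Documents without that header fall back to their full text.
fn id_fingerprint(text: &str) -> String {
    text.lines()
        .next()
        // Keep the first line only if it carries the assumed `id:` prefix.
        .and_then(|first| first.strip_prefix("id:"))
        .map(|id| id.trim().to_string())
        // No header: use the raw text itself as the key.
        .unwrap_or_else(|| text.to_string())
}
```

A function of this shape could be passed as the `f` argument of `DedupVectorStore::with_fingerprint`, so that two revisions of the same document collapse to one entry even when their bodies differ.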
pub fn with_fingerprint_and_seen(
    inner: S,
    f: F,
    seen: impl IntoIterator<Item = String>,
) -> Self
Wrap inner with a custom fingerprint function and a pre-populated
seen-set.
pub fn contains(&self, text: &str) -> bool
Whether text is already recorded in the seen-set (using the
configured fingerprint function).
pub fn inner_mut(&mut self) -> &mut S
Mutable access to the inner store — e.g. to call store-specific
methods not on the VectorStore trait.
pub fn seen_fingerprints(&self) -> impl Iterator<Item = &str>
Iterate over all fingerprints currently held in the seen-set. Useful when persisting state to storage before shutdown.
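One way to persist the seen-set is a plain text file with one fingerprint per line. The helpers below are a minimal sketch under that assumption; `save_fingerprints` and `load_fingerprints` are hypothetical names, and the format assumes fingerprints never contain newlines.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write one fingerprint per line. Assumes fingerprints contain no newlines.
fn save_fingerprints<'a>(
    path: &Path,
    fingerprints: impl Iterator<Item = &'a str>,
) -> io::Result<()> {
    let body = fingerprints.collect::<Vec<_>>().join("\n");
    fs::write(path, body)
}

/// Read the file back into owned strings, skipping empty lines.
fn load_fingerprints(path: &Path) -> io::Result<Vec<String>> {
    let body = fs::read_to_string(path)?;
    Ok(body
        .lines()
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect())
}
```

Before shutdown the iterator returned by `seen_fingerprints()` can be handed to `save_fingerprints`; on startup the loaded `Vec<String>` has the item type that `DedupVectorStore::with_seen` accepts.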
pub fn seen_count(&self) -> usize
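Number of unique fingerprints recorded in the seen-set. Skipped duplicates do not add to this count; fingerprints pre-loaded via with_seen do, so the value can exceed the number of documents in the inner store.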
Trait Implementations§
impl<S, F> VectorStore for DedupVectorStore<S, F>
fn add_texts<'life0, 'async_trait>(
    &'life0 mut self,
    texts: Vec<String>,
    metadata: Option<Vec<HashMap<String, Value>>>,
) -> Pin<Box<dyn Future<Output = Result<Vec<String>>> + Send + 'async_trait>>
where
    Self: 'async_trait,
    'life0: 'async_trait,
Filter out texts whose fingerprint is already seen, then delegate the remainder to the inner store.
Skipped documents are represented as "dedup:skipped:{fingerprint}"
in the returned ID list, so that ids.len() == texts.len() still holds
for callers that rely on it.
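The placeholder behaviour can be illustrated with a simplified, synchronous stand-in. Everything here is an assumption made for the sketch: `add_with_dedup`, `store_one`, and `is_skipped` are hypothetical names, the real method is async and hands the remaining texts to the inner store, and treating a repeat within the same batch as a duplicate is a guess about behaviour the description does not spell out.

```rust
use std::collections::HashSet;

const SKIPPED_PREFIX: &str = "dedup:skipped:";

/// Simplified stand-in for the filtering step. `store_one` plays the role of
/// the inner store and returns the ID it assigned to the text.
fn add_with_dedup(
    seen: &mut HashSet<String>,
    fingerprint: impl Fn(&str) -> String,
    texts: Vec<String>,
    mut store_one: impl FnMut(String) -> String,
) -> Vec<String> {
    texts
        .into_iter()
        .map(|text| {
            let key = fingerprint(text.as_str());
            if seen.insert(key.clone()) {
                // First occurrence: forward the text to the inner store.
                store_one(text)
            } else {
                // Duplicate: emit a placeholder so the ID list stays aligned.
                format!("{}{}", SKIPPED_PREFIX, key)
            }
        })
        .collect()
}

/// Callers can tell real IDs from placeholders by the prefix.
fn is_skipped(id: &str) -> bool {
    id.starts_with(SKIPPED_PREFIX)
}
```

Because placeholders share the ID list with real IDs, code that later looks documents up by ID may want to filter with a check like `is_skipped` first.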
fn add_vectors<'life0, 'async_trait>(
    &'life0 mut self,
    vectors: Vec<Vec<f32>>,
    texts: Vec<String>,
    metadata: Option<Vec<HashMap<String, Value>>>,
) -> Pin<Box<dyn Future<Output = Result<Vec<String>>> + Send + 'async_trait>>
where
    Self: 'async_trait,
    'life0: 'async_trait,
Filter out pre-embedded vectors whose text fingerprint is already seen, then delegate the remainder to the inner store.
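With pre-embedded input, each vector has to be dropped together with its text so the two collections stay index-aligned. The sketch below shows one way to do that; `filter_unseen` is a hypothetical helper, and metadata is left out to keep the example dependency-free.

```rust
use std::collections::HashSet;

/// Returns the vectors and texts that survive deduplication, in matching order.
/// Fingerprints of the surviving texts are recorded in `seen` as a side effect.
fn filter_unseen(
    seen: &mut HashSet<String>,
    fingerprint: impl Fn(&str) -> String,
    vectors: Vec<Vec<f32>>,
    texts: Vec<String>,
) -> (Vec<Vec<f32>>, Vec<String>) {
    vectors
        .into_iter()
        .zip(texts)
        // `insert` returns false when the fingerprint was already present,
        // which drops the (vector, text) pair as a unit.
        .filter(|(_, text)| seen.insert(fingerprint(text.as_str())))
        .unzip()
}
```

Zipping before filtering is the design point: filtering the two vectors separately would risk pairing a text with the wrong embedding.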