txtfp — text fingerprinting SDK for Rust.
txtfp extracts compact, deterministic, byte-stable hashes from text
so you can deduplicate corpora, detect near-duplicate documents, and
retrieve semantically similar passages — the fundamental primitives
behind systems like LLM training-set dedup, RAG retrieval, and content
moderation.
The crate compiles as no_std + alloc when the std feature is
disabled, so the canonicalizer, tokenizers, and classical
fingerprinters can run on wasm32-unknown-unknown and embedded
targets. The semantic, markup, and PDF features require std.
§Quick tour
- Errors: Error (#[non_exhaustive]) plus the Result alias.
- Canonicalization: canonical::Canonicalizer and its canonical::CanonicalizerBuilder implement the default pipeline (NFKC + simple casefold + Bidi/format strip), with an optional UTS #39 confusable skeleton (security feature).
- Tokenization: tokenize::Tokenizer trait, tokenize::WordTokenizer, tokenize::GraphemeTokenizer, tokenize::ShingleTokenizer, and feature-gated CJK tokenizers.
- Classical fingerprinters: Fingerprinter (offline) and StreamingFingerprinter (incremental). Implementations: MinHashFingerprinter (minhash), SimHashFingerprinter (simhash), and LshIndex (lsh).
- Semantic embeddings: Embedding, EmbeddingProvider, semantic_similarity (semantic feature). The trait shape is parity-compatible with audiofp / imgfprint.
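To make the tokenize-then-fingerprint flow in the tour concrete, here is a minimal, std-only sketch of word shingling followed by a MinHash signature. It is not txtfp's implementation: the helper names (shingles, exact_jaccard, minhash, estimate), the FNV-1a and splitmix-style hash functions, and the choice of k = 2 are all assumptions made for illustration, and the real crate's canonicalization step (NFKC and friends) is reduced here to plain lowercasing.

```rust
use std::collections::HashSet;

/// FNV-1a (64-bit). Chosen only because it is tiny and deterministic;
/// this is an assumption of the sketch, not the hash txtfp uses.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// splitmix64-style finalizer, used to derive many hash functions from one.
fn mix(mut x: u64) -> u64 {
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^ (x >> 31)
}

/// Overlapping k-word shingles of lowercased, whitespace-split text.
/// Text shorter than k words becomes a single shingle.
fn shingles(text: &str, k: usize) -> HashSet<String> {
    assert!(k > 0, "shingle width must be positive");
    let words: Vec<String> = text.split_whitespace().map(|w| w.to_lowercase()).collect();
    if words.is_empty() {
        return HashSet::new();
    }
    if words.len() < k {
        return std::iter::once(words.join(" ")).collect();
    }
    words.windows(k).map(|w| w.join(" ")).collect()
}

/// Exact Jaccard similarity |A ∩ B| / |A ∪ B| (1.0 for two empty sets).
fn exact_jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// MinHash signature: for each of H derived hash functions, keep the
/// minimum hash value seen over all shingles in the set.
fn minhash<const H: usize>(set: &HashSet<String>) -> [u64; H] {
    let mut sig = [u64::MAX; H];
    for item in set {
        let base = fnv1a(item.as_bytes());
        for (i, slot) in sig.iter_mut().enumerate() {
            let seed = (i as u64).wrapping_mul(0x9e37_79b9_7f4a_7c15);
            let h = mix(base.wrapping_add(seed));
            *slot = (*slot).min(h);
        }
    }
    sig
}

/// Fraction of matching signature slots, an estimate of the Jaccard similarity.
fn estimate<const H: usize>(a: &[u64; H], b: &[u64; H]) -> f64 {
    let same = a.iter().zip(b.iter()).filter(|(x, y)| x == y).count();
    same as f64 / H as f64
}
```

With k = 2, the sentences "the quick brown fox jumps over the lazy dog" and "the quick brown fox leaps over the lazy dog" share 6 of 10 distinct shingles, so exact_jaccard returns 0.6, and estimate over minhash::<128> signatures should land near that value. The signature has a fixed size regardless of document length, which is what makes it practical to store and compare at corpus scale.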
§Example: deduplication
use txtfp::{
Canonicalizer, Fingerprinter, MinHashFingerprinter, ShingleTokenizer,
WordTokenizer, jaccard,
};
let canonicalizer = Canonicalizer::default();
let tokenizer = ShingleTokenizer { k: 5, inner: WordTokenizer };
let fp = MinHashFingerprinter::<_, 128>::new(canonicalizer, tokenizer);
let a = fp.fingerprint("the quick brown fox jumps over the lazy dog").unwrap();
let b = fp.fingerprint("the quick brown fox leaps over the lazy dog").unwrap();
let similarity = jaccard(&a, &b);
assert!(similarity > 0.5);
§Cargo features
See the crate’s README.md for the full feature matrix. By default:
std, minhash, simhash, lsh. Drop lsh from default features
to target wasm32-unknown-unknown with a tighter binary surface.
§Stability
Hash byte layouts (MinHashSig, SimHash64) are semver-frozen
as of v0.1.0. Each signature struct is prefixed with a u16 schema
version so on-disk fingerprints can be safely round-tripped.
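The version-prefix idea can be sketched as follows. This is an illustration only: the type DemoSig64, the constant DEMO_SCHEMA_VERSION, the little-endian encoding, and the 10-byte layout are assumptions of the sketch and are not taken from txtfp's actual MinHashSig or SimHash64 definitions.

```rust
/// Hypothetical schema version for this sketch; not txtfp's real value.
const DEMO_SCHEMA_VERSION: u16 = 1;

/// Hypothetical 64-bit signature standing in for a real fingerprint type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DemoSig64 {
    hash: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum DecodeError {
    /// Fewer bytes than the fixed layout requires.
    Truncated,
    /// The prefix names a schema this reader does not understand.
    UnsupportedVersion(u16),
}

/// Layout assumed here: 2 bytes of version (little-endian), then 8 bytes of hash.
fn encode(sig: DemoSig64) -> [u8; 10] {
    let mut out = [0u8; 10];
    out[..2].copy_from_slice(&DEMO_SCHEMA_VERSION.to_le_bytes());
    out[2..].copy_from_slice(&sig.hash.to_le_bytes());
    out
}

/// Check the version prefix before trusting the payload bytes.
fn decode(bytes: &[u8]) -> Result<DemoSig64, DecodeError> {
    if bytes.len() < 10 {
        return Err(DecodeError::Truncated);
    }
    let version = u16::from_le_bytes([bytes[0], bytes[1]]);
    if version != DEMO_SCHEMA_VERSION {
        return Err(DecodeError::UnsupportedVersion(version));
    }
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&bytes[2..10]);
    Ok(DemoSig64 { hash: u64::from_le_bytes(raw) })
}
```

Reading the version first lets a decoder reject fingerprints written under a different schema instead of silently misinterpreting their bytes, which is the property a frozen layout with a version prefix is meant to provide.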
§Provenance
txtfp mirrors the conventions of two sibling crates:
- audiofp: audio fingerprinting.
- imgfprint: image fingerprinting.
Re-exports§
- pub use markup::MarkdownOptions; (feature: markup)
- pub use markup::html_to_text; (feature: markup)
- pub use markup::markdown_to_text; (feature: markup)
- pub use markup::markdown_to_text_with; (feature: markup)
- pub use pdf::PdfOptions; (feature: pdf)
- pub use pdf::pdf_to_text; (feature: pdf)
- pub use pdf::pdf_to_text_with; (feature: pdf)
- pub use canonical::Canonicalizer;
- pub use canonical::CanonicalizerBuilder;
- pub use canonical::CaseFold;
- pub use canonical::Normalization;
- pub use canonical::canonicalize;
- pub use tokenize::GraphemeTokenizer;
- pub use tokenize::ShingleTokenizer;
- pub use tokenize::Tokenizer;
- pub use tokenize::WordTokenizer;
- pub use tokenize::CjkSegmenter; (feature: cjk)
- pub use tokenize::CjkTokenizer; (feature: cjk)
- pub use classical::Fingerprinter; (feature: minhash or simhash or lsh or tlsh)
- pub use classical::StreamingFingerprinter; (feature: minhash or simhash or lsh or tlsh)
- pub use classical::minhash::HashFamily; (feature: minhash)
- pub use classical::minhash::MinHashFingerprinter; (feature: minhash)
- pub use classical::minhash::MinHashFingerprinterBuilder; (feature: minhash)
- pub use classical::minhash::MinHashSig; (feature: minhash)
- pub use classical::minhash::MinHashStreaming; (feature: minhash)
- pub use classical::minhash::jaccard; (feature: minhash)
- pub use classical::simhash::IdfTable; (feature: simhash)
- pub use classical::simhash::SimHash64; (feature: simhash)
- pub use classical::simhash::SimHashFingerprinter; (feature: simhash)
- pub use classical::simhash::SimHashFingerprinterBuilder; (feature: simhash)
- pub use classical::simhash::SimHashStreaming; (feature: simhash)
- pub use classical::simhash::Weighting; (feature: simhash)
- pub use classical::simhash::cosine_estimate; (feature: simhash)
- pub use classical::simhash::hamming; (feature: simhash)
- pub use classical::lsh::LshIndex; (feature: lsh)
- pub use classical::lsh::LshIndexBuilder; (feature: lsh)
- pub use classical::tlsh::MIN_INPUT_BYTES as TLSH_MIN_INPUT_BYTES; (feature: tlsh)
- pub use classical::tlsh::TlshFingerprinter; (feature: tlsh)
- pub use classical::tlsh::tlsh_distance; (feature: tlsh)
- pub use semantic::ChunkMode; (feature: semantic)
- pub use semantic::ChunkingStrategy; (feature: semantic)
- pub use semantic::Embedding; (feature: semantic)
- pub use semantic::EmbeddingProvider; (feature: semantic)
- pub use semantic::LocalProvider; (feature: semantic)
- pub use semantic::LocalProviderBuilder; (feature: semantic)
- pub use semantic::Pooling; (feature: semantic)
- pub use semantic::chunk_for_model; (feature: semantic)
- pub use semantic::semantic_similarity; (feature: semantic)
Modules§
- algo: Stable algorithm identifier embedded in FingerprintMetadata::algorithm.
- canonical: Canonicalization pipeline.
- classical (feature: minhash or simhash or lsh or tlsh): Classical (non-neural) fingerprinters.
- markup (feature: markup): HTML and Markdown → plain text helpers, behind the markup feature.
- pdf (feature: pdf): PDF → plain text helper, behind the pdf feature.
- semantic (feature: semantic): Semantic embedding support.
- tokenize: Tokenizers that split canonicalized text into the token stream that feeds the classical fingerprinters.
Structs§
- FingerprintMetadata: Metadata describing how a Fingerprint was produced.
- TlshFingerprint (feature: tlsh): Wrapper around tlsh2’s 48-byte body fingerprint, kept opaque so that internal type churn in the upstream crate does not break us.
Enums§
- Error: All errors surfaced by txtfp.
- Fingerprint (feature: minhash or simhash or tlsh or semantic): Cross-variant fingerprint container.
Constants§
- FORMAT_VERSION: On-disk format version for the cross-modal fingerprint database.
- UNCOMPUTED_CONFIG_HASH: Sentinel value for FingerprintMetadata::config_hash meaning “this metadata was produced without a canonicalizer / tokenizer / algorithm-config triple in scope, so the hash is not authoritative”.
- VERSION: Crate version string, sourced from Cargo.toml.
Functions§
- config_hash: Hash a canonicalizer + tokenizer + algorithm-specific config string into the config_hash field of FingerprintMetadata.
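One way such a configuration hash could work is sketched below, purely as an illustration. The function demo_config_hash, its three string parameters, the u64 return type, the FNV-1a hash, and the sentinel value DEMO_UNCOMPUTED = 0 are all assumptions of this sketch; the actual signature of config_hash and the real value of UNCOMPUTED_CONFIG_HASH are not shown on this page.

```rust
/// Hypothetical sentinel for "no authoritative hash"; the real
/// UNCOMPUTED_CONFIG_HASH value is not documented here.
const DEMO_UNCOMPUTED: u64 = 0;

/// Feed more bytes into a running FNV-1a (64-bit) state.
fn fnv1a_update(mut state: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        state ^= u64::from(b);
        state = state.wrapping_mul(0x0000_0100_0000_01b3);
    }
    state
}

/// Hash a (canonicalizer, tokenizer, algorithm config) triple of
/// descriptor strings into one deterministic u64.
fn demo_config_hash(canonicalizer: &str, tokenizer: &str, algo_config: &str) -> u64 {
    let mut state = 0xcbf2_9ce4_8422_2325u64; // FNV-1a offset basis
    for part in [canonicalizer, tokenizer, algo_config] {
        // Length-prefix each field so ("ab", "c") and ("a", "bc") feed
        // different byte streams into the hash.
        state = fnv1a_update(state, &(part.len() as u64).to_le_bytes());
        state = fnv1a_update(state, part.as_bytes());
    }
    // Keep the sentinel reserved: a real hash never equals it.
    if state == DEMO_UNCOMPUTED {
        1
    } else {
        state
    }
}
```

Under these assumptions, two fingerprints are comparable only when their configuration hashes match, and a stored DEMO_UNCOMPUTED value signals that no such comparison can be trusted.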
Type Aliases§
- Result: Shorthand for core::result::Result<T, Error>.