Crate txtfp

txtfp — text fingerprinting SDK for Rust.

txtfp extracts compact, deterministic, byte-stable hashes from text so you can deduplicate corpora, detect near-duplicate documents, and retrieve semantically similar passages — the fundamental primitives behind systems like LLM training-set dedup, RAG retrieval, and content moderation.

With the std feature disabled, the crate builds as no_std + alloc, so the canonicalizer, tokenizers, and classical fingerprinters can run on wasm32-unknown-unknown and embedded targets. The semantic, markup, and pdf features require std.

§Quick tour

§Example: deduplication

use txtfp::{
    Canonicalizer, Fingerprinter, MinHashFingerprinter, ShingleTokenizer,
    WordTokenizer, jaccard,
};

let canonicalizer = Canonicalizer::default();
let tokenizer     = ShingleTokenizer { k: 5, inner: WordTokenizer };
let fp            = MinHashFingerprinter::<_, 128>::new(canonicalizer, tokenizer);

let a = fp.fingerprint("the quick brown fox jumps over the lazy dog").unwrap();
let b = fp.fingerprint("the quick brown fox leaps over the lazy dog").unwrap();

let similarity = jaccard(&a, &b);
assert!(similarity > 0.5);
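
To make the estimate above less of a black box, here is a minimal std-only sketch of the general MinHash technique. It is not txtfp's implementation: the helper names (word_shingles, exact_jaccard, minhash, estimate_jaccard), the seeded FNV-1a plus splitmix64 hash, and the lowercase/whitespace "canonicalization" are all assumptions chosen for illustration. It uses 2-word shingles to keep the arithmetic easy to follow.

```rust
use std::collections::HashSet;

/// Lowercase, whitespace-split word k-shingles. A simplified stand-in for a
/// canonicalizer + shingle tokenizer; the crate's real pipeline may differ.
fn word_shingles(text: &str, k: usize) -> HashSet<String> {
    let words: Vec<String> = text.split_whitespace().map(|w| w.to_lowercase()).collect();
    if k == 0 || words.len() < k {
        return HashSet::new();
    }
    words.windows(k).map(|w| w.join(" ")).collect()
}

/// Exact Jaccard similarity |A ∩ B| / |A ∪ B| (0.0 when both sets are empty).
fn exact_jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Illustrative seeded hash: FNV-1a over the bytes, then a splitmix64-style
/// finalizer applied to (hash XOR seed). Not the crate's hash family.
fn hash_with_seed(s: &str, seed: u64) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    let mut z = h ^ seed;
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// MinHash signature: for each of H seeded hash functions, keep the minimum
/// hash value seen over all shingles.
fn minhash<const H: usize>(shingles: &HashSet<String>) -> [u64; H] {
    let mut sig = [u64::MAX; H];
    for s in shingles {
        for (i, slot) in sig.iter_mut().enumerate() {
            let seed = (i as u64 + 1).wrapping_mul(0x9e37_79b9_7f4a_7c15);
            *slot = (*slot).min(hash_with_seed(s, seed));
        }
    }
    sig
}

/// Estimated Jaccard similarity: the fraction of signature slots that agree.
fn estimate_jaccard<const H: usize>(a: &[u64; H], b: &[u64; H]) -> f64 {
    let matches = a.iter().zip(b.iter()).filter(|(x, y)| x == y).count();
    matches as f64 / H as f64
}
```

For the two sentences above with k = 2, each text yields 8 bigrams, 6 of which are shared out of 10 distinct ones, so exact_jaccard returns 0.6. estimate_jaccard over minhash::<128> signatures approximates that value, and the approximation tightens as the number of slots grows.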

§Cargo features

See the crate’s README.md for the full feature matrix. The default features are std, minhash, simhash, and lsh. To target wasm32-unknown-unknown with a smaller binary, drop lsh from the default feature set.

§Stability

Hash byte layouts (MinHashSig, SimHash64) are semver-frozen as of v0.1.0. Each signature struct is prefixed with a u16 schema version so on-disk fingerprints can be safely round-tripped.
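
As a rough illustration of why a leading u16 schema version makes round-tripping safe, the sketch below encodes a 64-bit hash behind a version prefix and refuses to decode unknown versions. Everything here is hypothetical: the constant SKETCH_SCHEMA_VERSION, the DecodeError type, the 10-byte layout, and the little-endian byte order are assumptions for the example, not the crate's actual MinHashSig or SimHash64 layout.

```rust
/// Hypothetical schema version for this sketch (not a txtfp constant).
const SKETCH_SCHEMA_VERSION: u16 = 1;

#[derive(Debug, PartialEq)]
enum DecodeError {
    /// Fewer bytes than the 2-byte version plus the 8-byte body.
    TooShort,
    /// The prefix names a schema this reader does not understand.
    UnsupportedVersion(u16),
}

/// Encode as [version: u16 LE][hash: u64 LE] (byte order is an assumption).
fn encode(hash: u64) -> [u8; 10] {
    let mut out = [0u8; 10];
    out[..2].copy_from_slice(&SKETCH_SCHEMA_VERSION.to_le_bytes());
    out[2..].copy_from_slice(&hash.to_le_bytes());
    out
}

/// Decode, checking the version prefix before trusting the body.
fn decode(bytes: &[u8]) -> Result<u64, DecodeError> {
    if bytes.len() < 10 {
        return Err(DecodeError::TooShort);
    }
    let version = u16::from_le_bytes([bytes[0], bytes[1]]);
    if version != SKETCH_SCHEMA_VERSION {
        return Err(DecodeError::UnsupportedVersion(version));
    }
    let mut body = [0u8; 8];
    body.copy_from_slice(&bytes[2..10]);
    Ok(u64::from_le_bytes(body))
}
```

Because the reader checks the version first, a future layout change can bump the prefix and old readers fail loudly instead of silently misinterpreting the bytes.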

§Provenance

txtfp mirrors the conventions of two sibling crates:

  • audiofp — audio fingerprinting.
  • imgfprint — image fingerprinting.

Re-exports§

pub use markup::MarkdownOptions; [feature: markup]
pub use markup::html_to_text; [feature: markup]
pub use markup::markdown_to_text; [feature: markup]
pub use markup::markdown_to_text_with; [feature: markup]
pub use pdf::PdfOptions; [feature: pdf]
pub use pdf::pdf_to_text; [feature: pdf]
pub use pdf::pdf_to_text_with; [feature: pdf]
pub use canonical::Canonicalizer;
pub use canonical::CanonicalizerBuilder;
pub use canonical::CaseFold;
pub use canonical::Normalization;
pub use canonical::canonicalize;
pub use tokenize::GraphemeTokenizer;
pub use tokenize::ShingleTokenizer;
pub use tokenize::Tokenizer;
pub use tokenize::WordTokenizer;
pub use tokenize::CjkSegmenter; [feature: cjk]
pub use tokenize::CjkTokenizer; [feature: cjk]
pub use classical::Fingerprinter; [features: minhash or simhash or lsh or tlsh]
pub use classical::StreamingFingerprinter; [features: minhash or simhash or lsh or tlsh]
pub use classical::minhash::HashFamily; [feature: minhash]
pub use classical::minhash::MinHashFingerprinter; [feature: minhash]
pub use classical::minhash::MinHashFingerprinterBuilder; [feature: minhash]
pub use classical::minhash::MinHashSig; [feature: minhash]
pub use classical::minhash::MinHashStreaming; [feature: minhash]
pub use classical::minhash::jaccard; [feature: minhash]
pub use classical::simhash::IdfTable; [feature: simhash]
pub use classical::simhash::SimHash64; [feature: simhash]
pub use classical::simhash::SimHashFingerprinter; [feature: simhash]
pub use classical::simhash::SimHashFingerprinterBuilder; [feature: simhash]
pub use classical::simhash::SimHashStreaming; [feature: simhash]
pub use classical::simhash::Weighting; [feature: simhash]
pub use classical::simhash::cosine_estimate; [feature: simhash]
pub use classical::simhash::hamming; [feature: simhash]
pub use classical::lsh::LshIndex; [feature: lsh]
pub use classical::lsh::LshIndexBuilder; [feature: lsh]
pub use classical::tlsh::MIN_INPUT_BYTES as TLSH_MIN_INPUT_BYTES; [feature: tlsh]
pub use classical::tlsh::TlshFingerprinter; [feature: tlsh]
pub use classical::tlsh::tlsh_distance; [feature: tlsh]
pub use semantic::ChunkMode; [feature: semantic]
pub use semantic::ChunkingStrategy; [feature: semantic]
pub use semantic::Embedding; [feature: semantic]
pub use semantic::EmbeddingProvider; [feature: semantic]
pub use semantic::LocalProvider; [feature: semantic]
pub use semantic::LocalProviderBuilder; [feature: semantic]
pub use semantic::Pooling; [feature: semantic]
pub use semantic::chunk_for_model; [feature: semantic]
pub use semantic::semantic_similarity; [feature: semantic]
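
The simhash re-exports include a hamming distance and a cosine_estimate helper. Their exact signatures are not shown on this page, so the following is only a sketch of the standard relationship they are named after, using plain u64 values in place of SimHash64. The function names hamming64 and cosine_from_hamming are made up for this example.

```rust
use std::f64::consts::PI;

/// Number of bit positions in which two 64-bit hashes differ.
fn hamming64(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Classic SimHash estimate: each bit differs with probability theta / pi,
/// so theta ≈ pi * distance / 64 and the cosine similarity is cos(theta).
fn cosine_from_hamming(distance: u32) -> f64 {
    (PI * distance as f64 / 64.0).cos()
}
```

Under this model a distance of 0 maps to a cosine of 1, a distance of 32 maps to roughly 0 (unrelated inputs), and a distance of 64 maps to -1.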

Modules§

algo
Stable algorithm identifier embedded in FingerprintMetadata::algorithm.
canonical
Canonicalization pipeline.
classical [features: minhash or simhash or lsh or tlsh]
Classical (non-neural) fingerprinters.
markup [feature: markup]
HTML and Markdown → plain text helpers, behind the markup feature.
pdf [feature: pdf]
PDF → plain text helper, behind the pdf feature.
semantic [feature: semantic]
Semantic embedding support.
tokenize
Tokenizers — split canonicalized text into the token stream that feeds the classical fingerprinters.

Structs§

FingerprintMetadata
Metadata describing how a Fingerprint was produced.
TlshFingerprint [feature: tlsh]
Wrapper around tlsh2’s 48-byte body fingerprint, kept opaque so that internal type churn in the upstream crate does not break us.

Enums§

Error
All errors surfaced by txtfp.
Fingerprint [features: minhash or simhash or tlsh or semantic]
Cross-variant fingerprint container.

Constants§

FORMAT_VERSION
On-disk format version for the cross-modal fingerprint database.
UNCOMPUTED_CONFIG_HASH
Sentinel value for FingerprintMetadata::config_hash meaning “this metadata was produced without a canonicalizer / tokenizer / algorithm-config triple in scope, so the hash is not authoritative”.
VERSION
Crate version string, sourced from Cargo.toml.

Functions§

config_hash
Hash a canonicalizer + tokenizer + algorithm-specific config string into the config_hash field of FingerprintMetadata.
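
One plausible way to fold such a triple into a single value is sketched below. This is not the crate's config_hash: the function name sketch_config_hash, the u64 return type, the FNV-1a hash, the length-prefixed framing, and the example config strings are all assumptions made for illustration.

```rust
/// Fold bytes into a running FNV-1a state.
fn fnv1a_update(mut h: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Hypothetical config hash over three description strings. Each part is
/// length-prefixed so that ("ab", "c") and ("a", "bc") hash differently.
fn sketch_config_hash(canonicalizer: &str, tokenizer: &str, algorithm: &str) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for part in [canonicalizer, tokenizer, algorithm] {
        h = fnv1a_update(h, &(part.len() as u64).to_le_bytes());
        h = fnv1a_update(h, part.as_bytes());
    }
    h
}
```

The point of a hash like this is comparability: two fingerprints are only meaningfully comparable when they were produced under the same canonicalizer, tokenizer, and algorithm settings, and a matching hash is a cheap check for that.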

Type Aliases§

Result
Shorthand for core::result::Result<T, Error>.