Module codebook

Structs

Codebook
CodebookEntry
Cross-file semantic deduplication via TF-IDF codebook.

Functions

find_semantic_duplicates
Identify semantically duplicate blocks across files. IDF is computed over the full file corpus for accurate weighting.
tfidf_cosine_similarity
Cosine similarity between two documents using TF-IDF vectors. IDF is computed over the two-document corpus, which down-weights common terms such as fn, let, and return and up-weights domain-specific identifiers.
tfidf_cosine_similarity_with_corpus
TF-IDF cosine similarity with IDF computed over a larger corpus.
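
Example (illustrative sketch)

The three functions above share one idea: weight each term by TF-IDF, then compare documents by cosine similarity. The sketch below shows that idea in a self-contained form. It is not this module's implementation, and the real signatures are not shown on this page. Every name in it (tokenize, Idf, tfidf_vector, cosine, sketch_similarity, sketch_similarity_with_corpus, sketch_find_duplicates) is hypothetical, as are the tokenizer and the threshold parameter. The smoothed IDF formula ln((1 + N) / (1 + df)) + 1 is also an assumption: with an unsmoothed ln(N / df), any term shared by both documents of a two-document corpus would get weight zero and the similarity would collapse to zero.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical tokenizer: lowercase runs of alphanumerics and underscores.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Document frequencies for a corpus (assumed structure, not the crate's).
struct Idf {
    n_docs: usize,
    doc_freq: HashMap<String, usize>,
}

impl Idf {
    fn from_corpus(corpus: &[&str]) -> Self {
        let mut doc_freq: HashMap<String, usize> = HashMap::new();
        for doc in corpus {
            // Count each term at most once per document.
            let unique: HashSet<String> = tokenize(doc).into_iter().collect();
            for term in unique {
                *doc_freq.entry(term).or_insert(0) += 1;
            }
        }
        Idf { n_docs: corpus.len(), doc_freq }
    }

    /// Smoothed IDF (assumption): ln((1 + N) / (1 + df)) + 1.
    fn weight(&self, term: &str) -> f64 {
        let df = self.doc_freq.get(term).copied().unwrap_or(0);
        ((1 + self.n_docs) as f64 / (1 + df) as f64).ln() + 1.0
    }
}

/// Raw term counts scaled by IDF.
fn tfidf_vector(doc: &str, idf: &Idf) -> HashMap<String, f64> {
    let mut v: HashMap<String, f64> = HashMap::new();
    for term in tokenize(doc) {
        *v.entry(term).or_insert(0.0) += 1.0;
    }
    for (term, w) in v.iter_mut() {
        *w *= idf.weight(term);
    }
    v
}

/// Cosine similarity of two sparse vectors; 0.0 if either is empty.
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a.iter().filter_map(|(t, x)| b.get(t).map(|y| x * y)).sum();
    let norm = |v: &HashMap<String, f64>| v.values().map(|x| x * x).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}

/// IDF taken from a caller-supplied corpus.
fn sketch_similarity_with_corpus(a: &str, b: &str, corpus: &[&str]) -> f64 {
    let idf = Idf::from_corpus(corpus);
    cosine(&tfidf_vector(a, &idf), &tfidf_vector(b, &idf))
}

/// IDF taken from just the two documents being compared.
fn sketch_similarity(a: &str, b: &str) -> f64 {
    sketch_similarity_with_corpus(a, b, &[a, b])
}

/// Pairs (i, j, score) with i < j whose similarity meets the threshold.
/// IDF is computed once over all blocks.
fn sketch_find_duplicates(blocks: &[&str], threshold: f64) -> Vec<(usize, usize, f64)> {
    let idf = Idf::from_corpus(blocks);
    let vectors: Vec<_> = blocks.iter().map(|b| tfidf_vector(b, &idf)).collect();
    let mut out = Vec::new();
    for i in 0..vectors.len() {
        for j in (i + 1)..vectors.len() {
            let score = cosine(&vectors[i], &vectors[j]);
            if score >= threshold {
                out.push((i, j, score));
            }
        }
    }
    out
}
```

In this sketch, calling sketch_find_duplicates on a slice of block texts with a threshold such as 0.9 returns index pairs of near-identical blocks. The pairwise loop is quadratic in the number of blocks, which is acceptable for illustration but would likely need pruning on a large corpus.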