Module entropy

Structs

EntropyAnalysis
Per-line entropy statistics for a block of content.
EntropyResult
Result of entropy-based compression: output text, token counts, and techniques used.

Enums

CompressibilityClass
Classification of content compressibility based on the Kolmogorov proxy (gzip ratio).

Functions

analyze_entropy
Analyzes per-line BPE token entropy, counting low-entropy and high-entropy lines.
compressibility_class
Classify how compressible content is based on gzip ratio.
entropy_compress
Compresses content by removing low-entropy lines and deduplicating patterns.
entropy_compress_adaptive
Entropy compression with file-type-adaptive thresholds and event emission.
jaccard_similarity
Computes word-set Jaccard similarity between two strings (0.0–1.0).
kolmogorov_proxy
Kolmogorov complexity proxy: K(x) ≈ len(gzip(x)) / len(x). Lower values mean the content is more compressible, i.e. more redundant.
minhash_signature
Computes a MinHash signature for approximating Jaccard similarity via LSH. Uses k independent hash functions (polynomial hashing with different seeds).
minhash_similarity
Approximate Jaccard from two minhash signatures.
ngram_jaccard
N-gram Jaccard similarity — preserves word order (unlike word-set Jaccard).
normalized_token_entropy
Normalized Shannon entropy: H(X) / log₂(n) where n = number of unique symbols. Returns a value in [0, 1] where 0 = perfectly predictable, 1 = maximum entropy. This makes thresholds comparable across different alphabet sizes.
shannon_entropy
Computes Shannon entropy (bits) over character frequencies in the text.
token_entropy
Shannon entropy over BPE token IDs (o200k_base). More LLM-relevant than character entropy since LLMs process BPE tokens.
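
Example

The summaries above name the formulas but show no signatures, so the following is only a minimal sketch of three of the measures using the standard library. The names `char_entropy_sketch`, `normalized_entropy_sketch`, and `word_jaccard_sketch` are hypothetical and are not this module's API. The sketch works on characters and whitespace-separated words rather than BPE tokens, and its handling of empty input is an assumption.

```rust
use std::collections::{HashMap, HashSet};

/// Shannon entropy in bits over character frequencies:
/// H = -sum over symbols c of p(c) * log2(p(c)).
fn char_entropy_sketch(text: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for c in text.chars() {
        *counts.entry(c).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0; // assumption: empty input is treated as zero entropy
    }
    let n = total as f64;
    counts
        .values()
        .map(|&k| {
            let p = k as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Normalized entropy H / log2(n), where n is the number of distinct symbols.
/// The result lies in [0, 1].
fn normalized_entropy_sketch(text: &str) -> f64 {
    let distinct = text.chars().collect::<HashSet<char>>().len();
    if distinct < 2 {
        return 0.0; // log2(1) = 0, so a single-symbol input is fully predictable
    }
    char_entropy_sketch(text) / (distinct as f64).log2()
}

/// Word-set Jaccard similarity: |A ∩ B| / |A ∪ B| over whitespace-separated words.
fn word_jaccard_sketch(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.split_whitespace().collect();
    let sb: HashSet<&str> = b.split_whitespace().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 1.0; // assumption: two empty inputs count as identical
    }
    sa.intersection(&sb).count() as f64 / union as f64
}
```

With these definitions, `char_entropy_sketch("aabb")` evaluates to 1 bit (two equally likely symbols), `normalized_entropy_sketch("abcd")` to 1.0 (uniform distribution), and `word_jaccard_sketch("a b c", "b c d")` to 0.5 (two shared words out of four distinct ones). Swapping the character iterator for a sequence of token IDs would give a token-level variant, which is closer to what `token_entropy` describes.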