Module entropy

Structs§

EntropyAnalysis: Per-line entropy statistics for a block of content.
EntropyResult: Result of entropy-based compression: output text, token counts, and techniques used.

Enums§

CompressibilityClass: Classification of content compressibility based on a Kolmogorov complexity proxy (gzip ratio).

Functions§

analyze_entropy: Analyzes per-line BPE token entropy, counting low- and high-entropy lines.
compressibility_class: Classifies how compressible content is based on its gzip ratio.
entropy_compress: Compresses content by removing low-entropy lines and deduplicating patterns.
entropy_compress_adaptive: Entropy compression with file-type-adaptive thresholds and event emission.
jaccard_similarity: Computes word-set Jaccard similarity between two strings (0.0–1.0).
kolmogorov_proxy: Kolmogorov complexity proxy: K(x) ≈ len(gzip(x)) / len(x). Lower values = more compressible = more redundant.
minhash_signature: Minhash signature for approximate Jaccard via LSH. Uses k independent hash functions (polynomial hashing with different seeds).
minhash_similarity: Approximate Jaccard from two minhash signatures.
ngram_jaccard: N-gram Jaccard similarity. Preserves word order (unlike word-set Jaccard).
normalized_token_entropy: Normalized Shannon entropy: H(X) / log₂(n) where n = number of unique symbols. Returns a value in [0, 1] where 0 = perfectly predictable, 1 = maximum entropy. This makes thresholds comparable across different alphabet sizes.
shannon_entropy: Computes Shannon entropy (bits) over character frequencies in the text.
token_entropy: Shannon entropy over BPE token IDs (o200k_base). More LLM-relevant than character entropy since LLMs process BPE tokens.
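The formulas named in the summaries above can be illustrated with a small standalone sketch. This is not the module's implementation, and the module's real signatures are not shown on this page: the function names below (`shannon_entropy_sketch`, `normalized_entropy_sketch`, `jaccard_sketch`) are hypothetical. The sketch also assumes characters as the symbol alphabet for the normalized entropy (the module uses BPE token IDs), whitespace splitting for the word sets, and a similarity of 1.0 for two empty inputs.

```rust
use std::collections::{HashMap, HashSet};

/// Shannon entropy in bits over character frequencies: H = -sum(p * log2(p)).
/// Hypothetical helper, not the crate's `shannon_entropy`.
fn shannon_entropy_sketch(text: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for c in text.chars() {
        *counts.entry(c).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0; // empty input carries no information
    }
    let n = total as f64;
    counts
        .values()
        .map(|&count| {
            let p = count as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Normalized entropy H(X) / log2(n), where n is the number of unique symbols.
/// Assumption: symbols are characters here; returns 0.0 when n <= 1.
fn normalized_entropy_sketch(text: &str) -> f64 {
    let unique: HashSet<char> = text.chars().collect();
    if unique.len() <= 1 {
        return 0.0; // log2(1) = 0, so define the ratio as 0
    }
    shannon_entropy_sketch(text) / (unique.len() as f64).log2()
}

/// Word-set Jaccard similarity: |A ∩ B| / |A ∪ B| over whitespace-separated words.
/// Assumption: two empty inputs are treated as identical (1.0).
fn jaccard_sketch(a: &str, b: &str) -> f64 {
    let set_a: HashSet<&str> = a.split_whitespace().collect();
    let set_b: HashSet<&str> = b.split_whitespace().collect();
    if set_a.is_empty() && set_b.is_empty() {
        return 1.0;
    }
    let intersection = set_a.intersection(&set_b).count() as f64;
    let union = set_a.union(&set_b).count() as f64;
    intersection / union
}
```

For example, `shannon_entropy_sketch("aabb")` is 1.0 bit (two equally likely symbols), its normalized value is also 1.0, and `jaccard_sketch("a b c", "b c d")` is 0.5 (two shared words out of four distinct words). Because the word-set form ignores order, `"a b"` and `"b a"` score 1.0, which is the gap an n-gram variant is meant to close.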