Module entropy

EntropyAnalysis
Per-line entropy statistics for a block of content.

EntropyResult
Result of entropy-based compression: output text, token counts, and techniques used.

CompressibilityClass
Classification of content compressibility based on Kolmogorov proxy (gzip ratio).

analyze_entropy
Analyzes per-line BPE token entropy, counting low/high entropy lines.

compressibility_class
Classify how compressible content is based on gzip ratio.

entropy_compress
Compresses content by removing low-entropy lines and deduplicating patterns.

entropy_compress_adaptive
Entropy compression with file-type-adaptive thresholds and event emission.

jaccard_similarity
Computes word-set Jaccard similarity between two strings (0.0–1.0).

kolmogorov_proxy
Kolmogorov complexity proxy: K(x) ≈ len(gzip(x)) / len(x).
Lower values = more compressible = more redundant.

minhash_signature
Minhash signature for approximate Jaccard via LSH.
Uses k independent hash functions (polynomial hashing with different seeds).

minhash_similarity
Approximate Jaccard from two minhash signatures.

ngram_jaccard
N-gram Jaccard similarity; preserves word order (unlike word-set Jaccard).

normalized_token_entropy
Normalized Shannon entropy: H(X) / log₂(n) where n = number of unique symbols.
Returns a value in [0, 1] where 0 = perfectly predictable, 1 = maximum entropy.
This makes thresholds comparable across different alphabet sizes.

shannon_entropy
Computes Shannon entropy (bits) over character frequencies in the text.

token_entropy
Shannon entropy over BPE token IDs (o200k_base).
More LLM-relevant than character entropy since LLMs process BPE tokens.
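
The shannon_entropy and normalized_token_entropy entries above describe H(X) and H(X) / log₂(n). The following is a minimal sketch of both formulas, not this module's implementation. It works over characters rather than BPE tokens so that it needs no tokenizer, and the names `char_entropy_bits` and `normalized_char_entropy` are hypothetical. Returning 0.0 for inputs with fewer than two distinct symbols is an assumption made here to avoid dividing by log₂(1) = 0.

```rust
use std::collections::{HashMap, HashSet};

/// Shannon entropy in bits per character: H = -sum(p_i * log2(p_i)).
/// Hypothetical helper for illustration; not the crate's `shannon_entropy`.
pub fn char_entropy_bits(text: &str) -> f64 {
    // Count how often each character occurs.
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for c in text.chars() {
        *counts.entry(c).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    let n = total as f64;
    counts
        .values()
        .map(|&k| {
            let p = k as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Entropy divided by log2(number of unique symbols), giving a value in [0, 1].
/// Assumption: inputs with fewer than two distinct symbols yield 0.0.
pub fn normalized_char_entropy(text: &str) -> f64 {
    let unique = text.chars().collect::<HashSet<char>>().len();
    if unique < 2 {
        return 0.0;
    }
    char_entropy_bits(text) / (unique as f64).log2()
}
```

For example, `char_entropy_bits("abab")` is 1 bit (two equally likely symbols), and `normalized_char_entropy("abcd")` reaches the maximum of 1 because all four symbols are equally frequent.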
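
The difference between jaccard_similarity (word sets) and ngram_jaccard (order-preserving) can be seen in a small sketch. This is an illustration under stated assumptions, not the module's code: the names `word_jaccard_sketch` and `ngram_jaccard_sketch` are hypothetical, words are split on whitespace, and an empty union is defined as similarity 0.0.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// |A ∩ B| / |A ∪ B|; an empty union is treated as 0.0 (assumption).
fn jaccard<T: Eq + Hash>(a: &HashSet<T>, b: &HashSet<T>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Word-set Jaccard: word order and repetition are ignored.
pub fn word_jaccard_sketch(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.split_whitespace().collect();
    let sb: HashSet<&str> = b.split_whitespace().collect();
    jaccard(&sa, &sb)
}

/// N-gram Jaccard over windows of `n` consecutive words, so order matters.
pub fn ngram_jaccard_sketch(a: &str, b: &str, n: usize) -> f64 {
    fn grams<'a>(s: &'a str, n: usize) -> HashSet<Vec<&'a str>> {
        let words: Vec<&str> = s.split_whitespace().collect();
        // `windows` panics on 0, so clamp the window size to at least 1.
        words.windows(n.max(1)).map(|w| w.to_vec()).collect()
    }
    jaccard(&grams(a, n), &grams(b, n))
}
```

With these definitions, "the cat sat" and "sat cat the" have a word-set similarity of 1.0 but a bigram similarity of 0.0, since no pair of adjacent words appears in the same order in both strings.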
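
The minhash_signature and minhash_similarity entries above mention k hash functions built from polynomial hashing with different seeds. One way this could look is sketched below. The hash construction, its constants, the whitespace tokenization, and the names `poly_hash`, `minhash_signature_sketch`, and `minhash_similarity_sketch` are all assumptions for illustration; the module's actual hashing scheme may differ.

```rust
use std::collections::HashSet;

/// Polynomial hash of `word` in wrapping u64 arithmetic.
/// The base depends on the seed; the constants are arbitrary choices for this sketch.
fn poly_hash(word: &str, seed: u64) -> u64 {
    // seed * 2 + odd constant is always odd, which keeps the base invertible mod 2^64.
    let base = seed.wrapping_mul(2).wrapping_add(0x100_0000_01b3);
    word.bytes()
        .fold(seed, |h, b| h.wrapping_mul(base).wrapping_add(b as u64 + 1))
}

/// For each of `k` seeds, keep the minimum hash over the set of words.
pub fn minhash_signature_sketch(text: &str, k: usize) -> Vec<u64> {
    let words: HashSet<&str> = text.split_whitespace().collect();
    (0..k as u64)
        .map(|seed| {
            words
                .iter()
                .map(|w| poly_hash(w, seed))
                .min()
                .unwrap_or(u64::MAX) // empty input: sentinel value
        })
        .collect()
}

/// Fraction of signature positions that agree; this estimates word-set Jaccard.
pub fn minhash_similarity_sketch(a: &[u64], b: &[u64]) -> f64 {
    let len = a.len().min(b.len());
    if len == 0 {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / len as f64
}
```

Because each position stores a minimum over a set of words, the signature ignores word order and repetition, matching the word-set Jaccard it approximates. Larger values of k reduce the variance of the estimate at the cost of more hashing.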