Module entropy

Structs

EntropyAnalysis: Per-line entropy statistics for a block of content.
EntropyResult: Result of entropy-based compression: output text, token counts, and techniques used.

Enums

CompressibilityClass: Classification of content compressibility based on Kolmogorov proxy (gzip ratio).

Functions

analyze_entropy: Analyzes per-line BPE token entropy, counting low/high entropy lines.
compressibility_class: Classifies how compressible content is, based on its gzip ratio.
entropy_compress: Compresses content by removing low-entropy lines and deduplicating patterns.
entropy_compress_adaptive: Entropy compression with file-type-adaptive thresholds and event emission.
entropy_compress_task_conditioned: Task-conditioned entropy compression: lines that would normally be dropped for low entropy are kept if they contain task-relevant keywords. This is the Information Bottleneck proxy: a line is compressed away only if it is neither surprising (high H) nor task-relevant (mentions goal concepts). Falls back to pure entropy-based compression when task_keywords is empty.
jaccard_similarity: Computes word-set Jaccard similarity between two strings (0.0–1.0).
kolmogorov_proxy: Kolmogorov complexity proxy: K(x) ≈ len(gzip(x)) / len(x). Lower values = more compressible = more redundant.
minhash_signature: Minhash signature for approximate Jaccard via LSH. Uses k independent hash functions (polynomial hashing with different seeds).
minhash_similarity: Approximate Jaccard from two minhash signatures.
ngram_jaccard: N-gram Jaccard similarity — preserves word order (unlike word-set Jaccard).
normalized_token_entropy: Normalized Shannon entropy: H(X) / log₂(n) where n = number of unique symbols. Returns a value in [0, 1] where 0 = perfectly predictable, 1 = maximum entropy. This makes thresholds comparable across different alphabet sizes.
normalized_token_entropy_from_ids: Normalized Shannon entropy over encoded token IDs: H(X) / log₂(n), n = unique token count.
shannon_entropy: Computes Shannon entropy (bits) over character frequencies in the text.
token_entropy: Shannon entropy over BPE token IDs (o200k_base). More LLM-relevant than character entropy since LLMs process BPE tokens.
token_entropy_from_ids: Shannon entropy over already-encoded BPE token IDs (o200k_base).
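
The measures listed above can be illustrated with a minimal, standard-library-only sketch. It is not this module's implementation: the function names (all suffixed `_sketch`) are hypothetical, entropy is computed over characters rather than BPE token IDs, and the handling of empty or single-symbol input is an assumption made for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Shannon entropy in bits over character frequencies: H = -sum(p * log2(p)).
/// Returns 0.0 for empty input (assumed convention).
fn shannon_entropy_sketch(text: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for c in text.chars() {
        *counts.entry(c).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Normalized entropy: H / log2(n), where n is the number of distinct symbols.
/// With fewer than two distinct symbols log2(n) is 0, so this sketch
/// returns 0.0 in that case (assumed convention).
fn normalized_entropy_sketch(text: &str) -> f64 {
    let unique: HashSet<char> = text.chars().collect();
    if unique.len() < 2 {
        return 0.0;
    }
    shannon_entropy_sketch(text) / (unique.len() as f64).log2()
}

/// Word-set Jaccard: |A ∩ B| / |A ∪ B| over whitespace-separated words.
/// Returns 0.0 when both inputs have no words (assumed convention).
fn word_jaccard_sketch(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.split_whitespace().collect();
    let sb: HashSet<&str> = b.split_whitespace().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 0.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}

/// N-gram Jaccard: the same ratio over sets of n consecutive words,
/// so reordering the words changes the result. n = 0 is treated as 1.
fn ngram_jaccard_sketch(a: &str, b: &str, n: usize) -> f64 {
    fn grams(s: &str, n: usize) -> HashSet<Vec<&str>> {
        let words: Vec<&str> = s.split_whitespace().collect();
        words.windows(n.max(1)).map(|w| w.to_vec()).collect()
    }
    let (ga, gb) = (grams(a, n), grams(b, n));
    let union = ga.union(&gb).count();
    if union == 0 {
        return 0.0;
    }
    ga.intersection(&gb).count() as f64 / union as f64
}
```

In this sketch, `"aabb"` has a character entropy of 1 bit and `"abcd"` has a normalized entropy of 1.0. The strings `"a b c"` and `"c b a"` have a word-set Jaccard of 1.0 but a bigram Jaccard of 0.0, which shows why the n-gram variant is described as preserving word order.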
