Predictive Surprise Scoring — conditional entropy relative to LLM knowledge.
Instead of measuring Shannon entropy in isolation (H(X)), we measure how surprising each line is to the LLM: H(X | LLM_knowledge).
Approximation: use BPE token frequency ranks from o200k_base as a proxy for P(token | LLM). Tokens ranked near the top by frequency carry low surprise; rarely seen or out-of-vocabulary tokens carry high surprise.
Scientific basis: Cross-entropy H(P,Q) = -sum(P(x) * log Q(x)) where P is the true distribution and Q is the model’s prior.
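The rank-as-probability idea can be sketched in a few lines of Rust. This is a hypothetical illustration, not the crate's implementation: the function names `token_surprise` and `line_surprise`'s exact signature, the toy rank table, and the Zipfian assumption (P(token) ∝ 1/rank, so surprise ≈ log2(rank)) are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: per-token surprise from a frequency-rank table
/// (rank 1 = most common). Under a Zipfian assumption P(token) ∝ 1/rank,
/// surprise grows as log2(rank); rarer tokens are more surprising.
fn token_surprise(rank: Option<u32>, vocab_size: u32) -> f64 {
    match rank {
        Some(r) => (r as f64).log2(),
        // Token absent from the vocabulary: assign maximal surprise.
        None => (vocab_size as f64).log2(),
    }
}

/// Average surprise over a line's tokens — a stand-in for H(X | LLM).
fn line_surprise(tokens: &[&str], ranks: &HashMap<&str, u32>, vocab_size: u32) -> f64 {
    if tokens.is_empty() {
        return 0.0;
    }
    let total: f64 = tokens
        .iter()
        .map(|t| token_surprise(ranks.get(t).copied(), vocab_size))
        .sum();
    total / tokens.len() as f64
}

fn main() {
    // Toy rank table standing in for o200k_base frequency ranks.
    let ranks: HashMap<&str, u32> =
        [("the", 1), ("of", 2), ("entropy", 5000)].into_iter().collect();
    let common = line_surprise(&["the", "of"], &ranks, 200_000);
    let rare = line_surprise(&["entropy", "qzxv"], &ranks, 200_000);
    // Common words score far lower than a rare word plus an OOV string.
    assert!(rare > common);
    println!("common = {common:.2}, rare = {rare:.2}");
}
```

The `None` branch is what gives unknown strings (random identifiers, hashes) high surprise even when their character-level entropy looks unremarkable.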
Enums§
- SurpriseLevel - Classify how surprising a line is relative to the LLM’s expected knowledge. Uses empirically calibrated thresholds for o200k_base.
Functions§
- classify_surprise
- line_surprise - Compute the surprise score for a line of text.
- should_keep_line - Enhanced entropy filter that combines Shannon entropy with predictive surprise. Lines pass if EITHER their entropy is above threshold OR their surprise is high. This prevents dropping lines that look “low entropy” but contain rare, unique tokens.