
Module surprise


Predictive Surprise Scoring — conditional entropy relative to LLM knowledge.

Instead of measuring Shannon entropy in isolation (H(X)), we measure how surprising each line is to the LLM: H(X | LLM_knowledge).

Approximation: use BPE token frequency ranks from o200k_base as a proxy for P(token | LLM). Common tokens (low rank numbers, i.e. high frequency) carry low surprise; rare tokens (high rank numbers, or absent from the vocabulary) carry high surprise.
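The rank-to-surprise mapping can be sketched as follows. This is an illustrative stand-in, not the module's actual API: the function name, the Zipf-style P(rank) ∝ 1/rank assumption, and the example vocabulary are all hypothetical.

```rust
use std::collections::HashMap;

/// Map a token's frequency rank to a surprise score (a sketch).
/// o200k_base assigns roughly frequency-ordered IDs, so the token ID
/// can serve as a rank: low ID = common token.
fn token_surprise(rank: Option<u32>, vocab_size: u32) -> f64 {
    match rank {
        // Unknown to the vocabulary: maximal surprise.
        None => (vocab_size as f64).ln(),
        // Approximate P(token) by Zipf's law, P(rank) ∝ 1/rank,
        // so -log P(token) grows with the rank number.
        Some(r) => ((r as f64) + 1.0).ln(),
    }
}

fn main() {
    // Hypothetical ranks for illustration only.
    let vocab: HashMap<&str, u32> = HashMap::from([("the", 1), ("entropy", 40_000)]);
    let common = token_surprise(vocab.get("the").copied(), 200_000);
    let rare = token_surprise(vocab.get("entropy").copied(), 200_000);
    let unknown = token_surprise(None, 200_000);
    // Surprise is monotone in rarity: common < rare < out-of-vocabulary.
    assert!(common < rare && rare < unknown);
}
```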

Scientific basis: cross-entropy H(P, Q) = −Σₓ P(x) · log Q(x), where P is the true distribution and Q is the model’s prior.
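The formula above can be computed directly; a minimal sketch (not part of this module) in bits:

```rust
/// Cross-entropy H(P, Q) = -Σ P(x) · log2 Q(x).
/// P is the true distribution, Q the model's prior over the same symbols.
fn cross_entropy(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q)
        .filter(|(pi, _)| **pi > 0.0) // terms with P(x) = 0 contribute nothing
        .map(|(pi, qi)| -pi * qi.log2())
        .sum()
}

fn main() {
    let p = [0.5, 0.5];
    // When Q matches P exactly, cross-entropy equals the Shannon
    // entropy H(P) = 1 bit for a fair coin.
    assert!((cross_entropy(&p, &p) - 1.0).abs() < 1e-12);
    // A mismatched prior costs extra bits: H(P, Q) ≥ H(P).
    let q = [0.9, 0.1];
    assert!(cross_entropy(&p, &q) > 1.0);
}
```

The gap H(P, Q) − H(P) is the KL divergence: the extra surprise incurred by modeling the data with the wrong prior, which is exactly what the per-line score tries to capture.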

Enums§

SurpriseLevel
Classify how surprising a line is relative to the LLM’s expected knowledge. Uses empirically calibrated thresholds for o200k_base.

Functions§

classify_surprise
line_surprise
Compute the surprise score for a line of text.
should_keep_line
Enhanced entropy filter that combines Shannon entropy with predictive surprise. Lines pass if EITHER their entropy is above threshold OR their surprise is high. This prevents dropping lines that look “low entropy” but contain rare, unique tokens.