Predictive Surprise Scoring — conditional entropy relative to LLM knowledge.
Instead of measuring Shannon entropy in isolation (H(X)), we measure how surprising each line is to the LLM: H(X | LLM_knowledge).
Approximation: use BPE token frequency ranks from o200k_base as a proxy for P(token | LLM). Tokens ranked near the top by frequency carry low surprise; rarely seen or out-of-vocabulary tokens carry high surprise.
Scientific basis: Cross-entropy H(P,Q) = -sum(P(x) * log Q(x)) where P is the true distribution and Q is the model’s prior.
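The rank-as-probability idea can be sketched in a few lines of Rust. This is a hypothetical illustration, not the crate's implementation: the function names `token_surprise` and `line_surprise`'s exact signature, the toy rank table, and the Zipfian assumption (P(token) ∝ 1/rank, so surprise ≈ log2(rank)) are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: per-token surprise from a frequency-rank table
/// (rank 1 = most common). Under a Zipfian assumption P(token) ∝ 1/rank,
/// surprise grows as log2(rank); rarer tokens are more surprising.
fn token_surprise(rank: Option<u32>, vocab_size: u32) -> f64 {
    match rank {
        Some(r) => (r as f64).log2(),
        // Token absent from the vocabulary: assign maximal surprise.
        None => (vocab_size as f64).log2(),
    }
}

/// Average surprise over a line's tokens — a stand-in for H(X | LLM).
fn line_surprise(tokens: &[&str], ranks: &HashMap<&str, u32>, vocab_size: u32) -> f64 {
    if tokens.is_empty() {
        return 0.0;
    }
    let total: f64 = tokens
        .iter()
        .map(|t| token_surprise(ranks.get(t).copied(), vocab_size))
        .sum();
    total / tokens.len() as f64
}

fn main() {
    // Toy rank table standing in for o200k_base frequency ranks.
    let ranks: HashMap<&str, u32> =
        [("the", 1), ("of", 2), ("entropy", 5000)].into_iter().collect();
    let common = line_surprise(&["the", "of"], &ranks, 200_000);
    let rare = line_surprise(&["entropy", "qzxv"], &ranks, 200_000);
    // Common words score far lower than a rare word plus an OOV string.
    assert!(rare > common);
    println!("common = {common:.2}, rare = {rare:.2}");
}
```

The `None` branch is what gives unknown strings (random identifiers, hashes) high surprise even when their character-level entropy looks unremarkable.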
Enums§
- SurpriseLevel - Classify how surprising a line is relative to the LLM’s expected knowledge. Uses empirically calibrated thresholds for o200k_base.
Functions§
- classify_surprise
- line_surprise - Compute the surprise score for a line of text.
- should_keep_line - Enhanced entropy filter that combines Shannon entropy with predictive surprise. Lines pass if EITHER their entropy is above threshold OR their surprise is high. This prevents dropping lines that look “low entropy” but contain rare, unique tokens.