Module perplexity

Expand description

Character-class bigram perplexity filter for detecting adversarial suffixes.

Adversarial prompt injection attacks often append high-perplexity suffix strings that look nothing like natural language. This filter scores the suffix window of incoming prompts using character-class bigram frequencies and blocks prompts that exceed the perplexity threshold.