Character-class bigram perplexity filter for detecting adversarial suffixes.
Adversarial prompt injection attacks often append high-perplexity suffix strings that look nothing like natural language. This filter scores the suffix window of incoming prompts using character-class bigram frequencies and blocks prompts that exceed the perplexity threshold.
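The scoring idea can be sketched as follows. This is a minimal illustration, not the crate's implementation: the five character classes, the add-one smoothing, the reference text used for training, and the names `CharClass`, `train_bigram_model`, and `sketch_perplexity` are all assumptions made for the example. Each character is mapped to a class, class-to-class transition probabilities are estimated from benign text, and perplexity is the exponential of the average negative log-probability over the bigrams of the input.

```rust
// Illustrative sketch only: classes, smoothing, and names are assumptions,
// not the actual implementation behind this module.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CharClass {
    Lower,
    Upper,
    Digit,
    Space,
    Symbol, // everything else, including punctuation
}

const NUM_CLASSES: usize = 5;

fn classify(c: char) -> CharClass {
    if c.is_lowercase() {
        CharClass::Lower
    } else if c.is_uppercase() {
        CharClass::Upper
    } else if c.is_ascii_digit() {
        CharClass::Digit
    } else if c.is_whitespace() {
        CharClass::Space
    } else {
        CharClass::Symbol
    }
}

/// Estimate P(next class | previous class) from a reference text,
/// using add-one smoothing so that no transition has probability zero.
fn train_bigram_model(reference: &str) -> [[f64; NUM_CLASSES]; NUM_CLASSES] {
    let mut table = [[1.0_f64; NUM_CLASSES]; NUM_CLASSES];
    let classes: Vec<usize> = reference.chars().map(|c| classify(c) as usize).collect();
    for pair in classes.windows(2) {
        table[pair[0]][pair[1]] += 1.0;
    }
    // Normalize each row so it sums to 1.
    for row in table.iter_mut() {
        let total: f64 = row.iter().sum();
        for p in row.iter_mut() {
            *p /= total;
        }
    }
    table
}

/// Perplexity = exp(-(1/N) * sum of ln P(class_i | class_{i-1})) over N bigrams.
/// Inputs with fewer than two characters have no bigrams and score 1.0.
fn sketch_perplexity(model: &[[f64; NUM_CLASSES]; NUM_CLASSES], text: &str) -> f64 {
    let classes: Vec<usize> = text.chars().map(|c| classify(c) as usize).collect();
    if classes.len() < 2 {
        return 1.0;
    }
    let log_sum: f64 = classes.windows(2).map(|p| model[p[0]][p[1]].ln()).sum();
    (-log_sum / (classes.len() - 1) as f64).exp()
}
```

Under these assumptions, a model trained on ordinary English prose assigns high probability to lowercase-to-lowercase and lowercase-to-space transitions, so natural sentences score close to 1, while a run of mixed symbols, digits, and stray capitals scores several times higher.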
Enums
- PerplexityResult - Result of perplexity filter analysis.
Functions
- analyze_suffix - Analyze the suffix window of a prompt for adversarial content.
- bigram_perplexity - Compute character-class bigram perplexity for a string.
- symbol_ratio - Compute the ratio of symbol/punctuation characters in the text.
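As a rough illustration of how a suffix-window check and a symbol ratio might fit together, here is a hedged sketch. The signatures and variants of the real items are not shown on this page, so everything below is hypothetical: the names `SketchVerdict`, `sketch_symbol_ratio`, `suffix_window`, and `sketch_analyze_suffix`, the window size, and the ratio threshold are assumptions. For brevity the sketch decides on symbol ratio alone; a fuller version would presumably combine it with the perplexity score.

```rust
// Hypothetical sketch: names, parameters, and thresholds are assumptions,
// not the signatures of the functions listed above.

/// Stand-in verdict type; the variants of the real result enum are not shown here.
#[derive(Debug, PartialEq)]
enum SketchVerdict {
    Pass,
    Block { symbol_ratio: f64 },
}

/// Fraction of characters that are neither alphanumeric nor whitespace.
fn sketch_symbol_ratio(text: &str) -> f64 {
    let total = text.chars().count();
    if total == 0 {
        return 0.0;
    }
    let symbols = text
        .chars()
        .filter(|c| !c.is_alphanumeric() && !c.is_whitespace())
        .count();
    symbols as f64 / total as f64
}

/// Last `window` characters of `prompt`, sliced on a character boundary.
/// Returns the whole prompt if it is shorter than the window.
fn suffix_window(prompt: &str, window: usize) -> &str {
    let total = prompt.chars().count();
    let skip = total.saturating_sub(window);
    match prompt.char_indices().nth(skip) {
        Some((idx, _)) => &prompt[idx..],
        None => "",
    }
}

/// Block the prompt when the symbol ratio of its suffix window exceeds the limit.
fn sketch_analyze_suffix(prompt: &str, window: usize, max_symbol_ratio: f64) -> SketchVerdict {
    let ratio = sketch_symbol_ratio(suffix_window(prompt, window));
    if ratio > max_symbol_ratio {
        SketchVerdict::Block { symbol_ratio: ratio }
    } else {
        SketchVerdict::Pass
    }
}
```

Scoring only the tail of the prompt keeps a long benign preamble from diluting the signal of an appended suffix; the window length and threshold would need tuning against real traffic.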