Text tokenization and normalization.
Tokenization converts text into a sequence of tokens that can be matched against license rules. This module implements ScanCode-compatible tokenization.
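As a rough illustration of the idea (not the crate's actual rules), ScanCode-style tokenization can be sketched as lowercasing the input and splitting it into runs of alphanumeric characters; the real implementation also handles details such as punctuation inside license identifiers. The `simple_tokenize` helper below is hypothetical:

```rust
/// Minimal illustrative tokenizer: lowercases the input and splits it
/// into runs of alphanumeric characters. A sketch of the general idea,
/// not the crate's ScanCode-compatible tokenizer.
fn simple_tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_owned)
        .collect()
}

fn main() {
    let toks = simple_tokenize("The GNU General Public License (GPL), version 2.0");
    assert_eq!(
        toks,
        ["the", "gnu", "general", "public", "license", "gpl", "version", "2", "0"]
    );
}
```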
Functions
- count_tokens - Counts tokens in text without allocating strings.
- parse_required_phrase_spans - Parses {{…}} required phrase markers from rule text.
- tokenize - Tokenizes text to match index rules and queries.
- tokenize_as_ids - Tokenizes text and returns QueryTokens directly, avoiding string allocation.
- tokenize_with_stopwords - Tokenizes text and tracks stopwords by position.
- tokenize_without_stopwords - Tokenizes text without filtering stopwords.