Module tokenize

Text tokenization and normalization.

Tokenization converts text into a sequence of tokens that can be matched against license rules. This module implements ScanCode-compatible tokenization.
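
As a rough illustration of what turning text into matchable tokens can look like, here is a minimal sketch. It is not this module's implementation: `sketch_tokenize` is a hypothetical helper, and the rules it applies (lowercase everything, split on any non-alphanumeric character) are a simplifying assumption rather than ScanCode's exact token pattern.

```rust
/// Hypothetical helper, not this module's `tokenize`.
/// Assumption: a token is a maximal run of alphanumeric characters,
/// compared case-insensitively.
fn sketch_tokenize(text: &str) -> Vec<String> {
    // Lowercase first so that "MIT" and "mit" yield the same token.
    let lower = text.to_lowercase();
    lower
        // Any non-alphanumeric character acts as a separator.
        .split(|c: char| !c.is_alphanumeric())
        // Consecutive separators produce empty pieces; drop them.
        .filter(|piece| !piece.is_empty())
        .map(str::to_string)
        .collect()
}
```

Under these assumptions, `sketch_tokenize("Licensed under the MIT License.")` yields the five tokens `licensed`, `under`, `the`, `mit`, `license`. Normalizing case and punctuation this way is what lets a rule and a query written with different formatting produce the same token sequence.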

Functions

count_tokens: Counts the tokens in text without allocating strings.
parse_required_phrase_spans: Parses {{…}} required phrase markers from rule text.
tokenize: Tokenizes text for matching against index rules and queries.
tokenize_as_ids: Tokenizes text and returns QueryTokens directly, avoiding string allocation.
tokenize_with_stopwords: Tokenizes text and tracks stopwords by position.
tokenize_without_stopwords: Tokenizes text without filtering stopwords.
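
The required phrase markers mentioned in the list above can be illustrated with a small sketch. The signature and return type of `parse_required_phrase_spans` are not shown on this page, so everything below is an assumption: `sketch_required_phrase_spans` is a hypothetical function, a token is taken to be a run of alphanumeric characters, and each span is reported as a half-open range of token positions.

```rust
use std::ops::Range;

/// Hypothetical sketch, not this module's `parse_required_phrase_spans`.
/// Assumption: returns half-open ranges of token positions enclosed by
/// `{{` and `}}` in the rule text.
fn sketch_required_phrase_spans(rule_text: &str) -> Vec<Range<usize>> {
    let mut spans = Vec::new();
    let mut token_pos = 0usize; // position of the next token to be read
    let mut open: Option<usize> = None; // token position of an unclosed `{{`
    let mut rest = rule_text;

    while !rest.is_empty() {
        if let Some(after) = rest.strip_prefix("{{") {
            open = Some(token_pos);
            rest = after;
        } else if let Some(after) = rest.strip_prefix("}}") {
            // Close the current span; ignore stray `}}` and empty spans.
            if let Some(start) = open.take() {
                if token_pos > start {
                    spans.push(start..token_pos);
                }
            }
            rest = after;
        } else {
            // Measure the alphanumeric run at the front, in bytes.
            let word_len: usize = rest
                .chars()
                .take_while(|c| c.is_alphanumeric())
                .map(char::len_utf8)
                .sum();
            if word_len > 0 {
                token_pos += 1;
                rest = &rest[word_len..];
            } else if let Some(c) = rest.chars().next() {
                // Skip a single separator character.
                rest = &rest[c.len_utf8()..];
            }
        }
    }
    spans
}
```

With these assumptions, the rule text `Licensed under the {{MIT License}}.` produces the single span `3..5`, covering the tokens at positions 3 and 4. An unclosed `{{` is silently dropped in this sketch; a real implementation might report it as an error instead.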