Module tokenize


Text tokenization and normalization.

Tokenization converts text into a sequence of tokens that can be matched against license rules. This module implements ScanCode-compatible tokenization.
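To make the idea concrete, here is a minimal sketch of word-level tokenization. It is not this module's implementation: the function names `sketch_tokenize` and `sketch_count_tokens` are hypothetical, and the rule used (lowercase, split on non-alphanumeric characters) is an assumption that only approximates ScanCode's behaviour.

```rust
/// Hypothetical helper, not part of this module's API.
/// Assumption: a token is a maximal run of alphanumeric characters, lowercased.
fn sketch_tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|piece| !piece.is_empty())
        .map(|piece| piece.to_lowercase())
        .collect()
}

/// Same splitting rule, but only counts the pieces,
/// so no `String` is allocated for any token.
fn sketch_count_tokens(text: &str) -> usize {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|piece| !piece.is_empty())
        .count()
}
```

Under these assumptions, `sketch_tokenize("Licensed under the MIT License.")` yields the five tokens `licensed`, `under`, `the`, `mit`, `license`, and `sketch_count_tokens` returns 5 for the same input. The real functions may treat punctuation, Unicode, and stopwords differently.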

Functions

count_tokens
Count tokens in text without allocating strings.
parse_required_phrase_spans
Parse {{…}} required phrase markers from rule text.
tokenize
Tokenizes text to match index rules and queries.
tokenize_as_ids
Tokenizes text and returns QueryTokens directly, avoiding string allocation.
tokenize_with_stopwords
Tokenizes text and tracks stopwords by position.
tokenize_without_stopwords
Tokenizes text without filtering stopwords.
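
As an illustration of what parsing `{{` and `}}` required phrase markers can involve, the sketch below maps each marked phrase to an inclusive range of token positions. The function name `sketch_required_phrase_spans`, its return type, and the whitespace-based token rule are all assumptions made for this example. The actual signature and behaviour of `parse_required_phrase_spans` are not shown on this page and may differ.

```rust
/// Hypothetical helper, not part of this module's API.
/// Returns (first, last) token positions, inclusive, for each marked phrase.
/// Assumption: tokens are whitespace-separated words, and markers do not nest.
fn sketch_required_phrase_spans(rule_text: &str) -> Vec<(usize, usize)> {
    // Pad the markers with spaces so they become standalone pieces.
    let spaced = rule_text.replace("{{", " {{ ").replace("}}", " }} ");

    let mut spans = Vec::new();
    let mut open_at: Option<usize> = None; // token position where a phrase opened
    let mut position = 0usize; // index of the next word token

    for piece in spaced.split_whitespace() {
        match piece {
            "{{" => open_at = Some(position),
            "}}" => {
                // Close the phrase only if it was opened and holds at least one token.
                if let Some(start) = open_at.take() {
                    if position > start {
                        spans.push((start, position - 1));
                    }
                }
            }
            // Any other piece is an ordinary word and advances the position.
            _ => position += 1,
        }
    }
    spans
}
```

With this sketch, the rule text `Licensed under the {{MIT License}} terms` produces the single span `(3, 4)`, covering the tokens `MIT` and `License`. Empty or unclosed markers produce no span.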