pub struct LicenseIndex {
pub dictionary: TokenDictionary,
pub len_legalese: usize,
pub rid_by_hash: HashMap<[u8; 20], usize>,
pub rules_by_rid: Vec<Rule>,
pub tids_by_rid: Vec<Vec<TokenId>>,
pub rules_automaton: Automaton,
pub unknown_automaton: Automaton,
pub sets_by_rid: HashMap<usize, HashSet<TokenId>>,
pub msets_by_rid: HashMap<usize, HashMap<TokenId, usize>>,
pub high_sets_by_rid: HashMap<usize, HashSet<TokenId>>,
pub high_postings_by_rid: HashMap<usize, HashMap<TokenId, Vec<usize>>>,
pub false_positive_rids: HashSet<usize>,
pub approx_matchable_rids: HashSet<usize>,
pub licenses_by_key: HashMap<String, License>,
pub pattern_id_to_rid: Vec<usize>,
pub rid_by_spdx_key: HashMap<String, usize>,
pub unknown_spdx_rid: Option<usize>,
pub rids_by_high_tid: HashMap<TokenId, HashSet<usize>>,
}
License index containing all data structures for efficient license detection.
The LicenseIndex holds multiple index structures that enable different matching strategies: hash-based exact matching, Aho-Corasick automaton matching, set-based candidate selection, and sequence matching.
Based on the Python ScanCode Toolkit implementation at: reference/scancode-toolkit/src/licensedcode/index.py
Index Structures
The index maintains several data structures for different matching strategies:
- Hash matching: rid_by_hash for exact hash-based matches
- Automaton matching: rules_automaton and unknown_automaton for pattern matching
- Candidate selection: sets_by_rid and msets_by_rid for set-based ranking
- Sequence matching: high_postings_by_rid for high-value token position tracking
- Rule classification: false_positive_rids, approx_matchable_rids
Fields
dictionary: TokenDictionary
Token dictionary mapping token strings to integer IDs.
IDs 0 to len_legalese-1 are reserved for legalese tokens (high-value words). IDs len_legalese and above are assigned to other tokens as encountered.
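The partition above makes token classification a single integer comparison. A minimal sketch, assuming a `TokenId` alias and an illustrative helper name (not this crate's actual API):

```rust
// Illustrative sketch of the ID-partition convention: tokens below
// len_legalese are legalese (high-value), all others are low-value.
type TokenId = usize;

fn is_legalese(tid: TokenId, len_legalese: usize) -> bool {
    tid < len_legalese
}

fn main() {
    let len_legalese = 1000; // hypothetical legalese count
    assert!(is_legalese(42, len_legalese)); // high-value legalese token
    assert!(!is_legalese(5000, len_legalese)); // low-value token
}
```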
len_legalese: usize
Number of legalese tokens.
Tokens with ID < len_legalese are considered high-value legalese words. Tokens with ID >= len_legalese are considered low-value tokens.
Corresponds to Python: self.len_legalese = 0 (line 185)
rid_by_hash: HashMap<[u8; 20], usize>
Mapping from rule hash to rule ID for hash-based exact matching.
This enables fast exact matches using a hash of the rule's token IDs. Each hash maps to exactly one rule ID.
Note: The hash is a 20-byte SHA1 digest of the rule's token IDs, used directly as the key of the HashMap<[u8; 20], usize>.
Corresponds to Python: self.rid_by_hash = {} (line 216)
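The lookup shape can be sketched as follows. Assumption: the real index uses a SHA1 digest of the token-ID sequence; `toy_digest` below is a stand-in checksum that only exists to keep the example self-contained.

```rust
use std::collections::HashMap;

type TokenId = u32;

// Stand-in digest: the real index hashes a rule's token-ID sequence
// with SHA1; this toy checksum just makes the lookup shape runnable.
fn toy_digest(tids: &[TokenId]) -> [u8; 20] {
    let mut out = [0u8; 20];
    for (i, t) in tids.iter().enumerate() {
        out[i % 20] ^= (*t as u8).wrapping_add(i as u8);
    }
    out
}

fn main() {
    // Index side: the digest of a rule's token IDs maps to exactly one rid.
    let mut rid_by_hash: HashMap<[u8; 20], usize> = HashMap::new();
    rid_by_hash.insert(toy_digest(&[3, 17, 4, 9]), 7); // hypothetical rule id 7

    // Query side: identical token IDs hash to the same key -> exact match.
    assert_eq!(rid_by_hash.get(&toy_digest(&[3, 17, 4, 9])), Some(&7));
    assert_eq!(rid_by_hash.get(&toy_digest(&[3, 17, 4])), None);
}
```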
rules_by_rid: Vec<Rule>
Rules indexed by rule ID.
Maps rule IDs to Rule objects for quick lookup.
Corresponds to Python: self.rules_by_rid = [] (line 201)
tids_by_rid: Vec<Vec<TokenId>>
Token ID sequences indexed by rule ID.
Maps rule IDs to their token ID sequences.
Corresponds to Python: self.tids_by_rid = [] (line 204)
rules_automaton: Automaton
Aho-Corasick automaton built from all rule token sequences.
Supports efficient multi-pattern matching of token ID sequences. Used for exact matching of complete rules or rule fragments in query text.
Corresponds to Python: self.rules_automaton = match_aho.get_automaton() (line 219)
unknown_automaton: Automaton
Aho-Corasick automaton for unknown license detection.
Separate automaton used to detect license-like text that doesn't match any known rule. Populated with ngrams from all approx-matchable rules.
Corresponds to Python: self.unknown_automaton = match_unknown.get_automaton() (line 222)
sets_by_rid: HashMap<usize, HashSet<TokenId>>
Token ID sets per rule for candidate selection.
Maps rule IDs to sets of unique token IDs present in that rule. Used for efficient candidate selection based on token overlap.
Corresponds to Python: self.sets_by_rid = [] (line 212)
msets_by_rid: HashMap<usize, HashMap<TokenId, usize>>
Token ID multisets per rule for candidate ranking.
Maps rule IDs to multisets (bags) of token IDs with their frequencies. Used for ranking candidates by token frequency overlap.
Corresponds to Python: self.msets_by_rid = [] (line 213)
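A hypothetical sketch of frequency-overlap ranking under these assumptions: the score is the multiset-intersection size, i.e. the sum of the smaller frequency for each shared token (the function name and score are illustrative, not the crate's actual ranking).

```rust
use std::collections::HashMap;

type TokenId = u32;

// Illustrative ranking score: sum over shared tokens of the smaller
// frequency, i.e. the size of the multiset intersection.
fn multiset_overlap(
    query: &HashMap<TokenId, usize>,
    rule: &HashMap<TokenId, usize>,
) -> usize {
    query
        .iter()
        .filter_map(|(tid, &qf)| rule.get(tid).map(|&rf| qf.min(rf)))
        .sum()
}

fn main() {
    let query: HashMap<TokenId, usize> = [(1, 3), (2, 1)].into_iter().collect();
    let rule: HashMap<TokenId, usize> = [(1, 2), (3, 5)].into_iter().collect();
    // Token 1 is shared: min(3, 2) = 2; tokens 2 and 3 contribute nothing.
    assert_eq!(multiset_overlap(&query, &rule), 2);
}
```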
high_sets_by_rid: HashMap<usize, HashSet<TokenId>>
High-value token sets per rule for early candidate rejection.
Maps rule IDs to sets containing only high-value (legalese) token IDs.
This is a subset of sets_by_rid for faster intersection computation
and early rejection of candidates that won’t pass the high-token threshold.
Precomputed during index building to avoid redundant filtering at runtime.
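Early rejection reduces to one set intersection per candidate. A minimal sketch, where the threshold name and value are assumptions for illustration:

```rust
use std::collections::HashSet;

type TokenId = u32;

// Sketch of early rejection: a candidate survives only if it shares at
// least `min_shared` high-value (legalese) tokens with the query.
fn passes_high_threshold(
    query_high: &HashSet<TokenId>,
    rule_high: &HashSet<TokenId>,
    min_shared: usize,
) -> bool {
    query_high.intersection(rule_high).count() >= min_shared
}

fn main() {
    let query: HashSet<TokenId> = [1, 2, 3].into_iter().collect();
    let rule: HashSet<TokenId> = [2, 3, 4].into_iter().collect();
    assert!(passes_high_threshold(&query, &rule, 2)); // shares {2, 3}
    assert!(!passes_high_threshold(&query, &rule, 3)); // only 2 shared tokens
}
```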
high_postings_by_rid: HashMap<usize, HashMap<TokenId, Vec<usize>>>
Inverted index of high-value token positions per rule.
Maps rule IDs to a mapping from high-value token IDs to their positions within the rule. Only contains positions for tokens with IDs < len_legalese.
This structure speeds up sequence matching by allowing quick lookup of where high-value tokens appear in each rule.
Corresponds to Python: self.high_postings_by_rid = [] (line 209)
In Python: postings = {tid: array('h', [positions, ...])}
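Building one rule's postings can be sketched as a single pass over its token sequence; the function name is illustrative, but the filter (tid < len_legalese) matches the description above.

```rust
use std::collections::HashMap;

type TokenId = u32;

// Sketch: record positions only for high-value tokens
// (tid < len_legalese), mirroring the Python postings shape.
fn build_high_postings(
    tids: &[TokenId],
    len_legalese: TokenId,
) -> HashMap<TokenId, Vec<usize>> {
    let mut postings: HashMap<TokenId, Vec<usize>> = HashMap::new();
    for (pos, &tid) in tids.iter().enumerate() {
        if tid < len_legalese {
            postings.entry(tid).or_default().push(pos);
        }
    }
    postings
}

fn main() {
    let postings = build_high_postings(&[5, 900, 5, 1200], 1000);
    assert_eq!(postings.get(&5), Some(&vec![0, 2])); // token 5 at positions 0 and 2
    assert_eq!(postings.get(&1200), None); // low-value token, not indexed
}
```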
false_positive_rids: HashSet<usize>
Set of rule IDs for false positive rules.
False positive rules are used for exact matching and post-matching filtering to subtract spurious matches.
Corresponds to Python: self.false_positive_rids = set() (line 230)
approx_matchable_rids: HashSet<usize>
Set of rule IDs that can be matched approximately.
Only rules marked as approx-matchable participate in sequence matching. Other rules can only be matched exactly using the automaton.
Note: This field is kept for Python parity documentation and test usage.
The inverted index (rids_by_high_tid) now handles candidate filtering
more efficiently, making direct iteration over this set unnecessary.
Corresponds to Python: self.approx_matchable_rids = set() (line 234)
licenses_by_key: HashMap<String, License>
Mapping from ScanCode license key to License object.
Provides access to license metadata for building SPDX mappings and validating license expressions.
Corresponds to Python: get_licenses_db() in models.py
pattern_id_to_rid: Vec<usize>
Maps AhoCorasick pattern_id to rule ID (rid).
This is needed because the AhoCorasick pattern_id is just the index in the patterns iterator used to build the automaton, not the actual rule id. In Python, the automaton stores (rid, start, end) tuples as values, so the rid is retrieved from the stored value. In Rust, we maintain this mapping instead.
Corresponds to Python: automaton values contain (rid, istart, iend)
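Because pattern IDs are assigned in the order patterns were fed to the automaton builder, recovering the rule ID is a plain index into the side table. A sketch with made-up rule IDs:

```rust
// Sketch of the side table: pattern IDs follow insertion order, so
// mapping a hit back to its rule id is a direct index. Sample rids
// below are hypothetical.
fn rid_for_pattern(pattern_id_to_rid: &[usize], pattern_id: usize) -> usize {
    pattern_id_to_rid[pattern_id]
}

fn main() {
    // Suppose rules 2, 5, and 9 contributed patterns, in that order.
    let pattern_id_to_rid = vec![2, 5, 9];
    // The automaton reports a hit on pattern 1; map it back to rule 5.
    assert_eq!(rid_for_pattern(&pattern_id_to_rid, 1), 5);
}
```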
rid_by_spdx_key: HashMap<String, usize>
Mapping from SPDX license key to rule ID.
Enables direct lookup of rules by their SPDX license key, including aliases such as "GPL-2.0+" -> "gpl-2.0-plus".
Keys are stored lowercase for case-insensitive lookup.
Corresponds to Python: self.licenses_by_spdx_key in cache.py
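The lowercase-key convention means every lookup normalizes the query first. A minimal sketch, with an illustrative helper name and a hypothetical rule id:

```rust
use std::collections::HashMap;

// Sketch of the case-insensitive lookup convention: keys are stored
// lowercase, so queries are lowercased before the HashMap lookup.
fn normalize_spdx_key(key: &str) -> String {
    key.to_lowercase()
}

fn main() {
    let mut rid_by_spdx_key: HashMap<String, usize> = HashMap::new();
    rid_by_spdx_key.insert("gpl-2.0+".to_string(), 12); // hypothetical rid
    assert_eq!(
        rid_by_spdx_key.get(&normalize_spdx_key("GPL-2.0+")),
        Some(&12)
    );
}
```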
unknown_spdx_rid: Option<usize>
Rule ID for the unknown-spdx license.
Used as a fallback when an SPDX identifier is not recognized.
Corresponds to Python: get_unknown_spdx_symbol() in cache.py
rids_by_high_tid: HashMap<TokenId, HashSet<usize>>
Inverted index mapping high-value token IDs to rule IDs.
This enables fast candidate selection by only examining rules that share at least one high-value (legalese) token with the query. Without this index, candidate selection would iterate over all 37,000+ rules for every file, making license detection extremely slow.
Only contains entries for tokens with ID < len_legalese (high-value tokens). Rules not in approx_matchable_rids are excluded from this index.
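Candidate selection with this index is a union of postings, so rules sharing no legalese token with the query are never visited. A sketch under these assumptions (function name illustrative):

```rust
use std::collections::{HashMap, HashSet};

type TokenId = u32;

// Sketch: the candidate set is the union of rule-ID postings for every
// high-value token present in the query.
fn candidates(
    query_high_tids: &[TokenId],
    rids_by_high_tid: &HashMap<TokenId, HashSet<usize>>,
) -> HashSet<usize> {
    let mut out = HashSet::new();
    for tid in query_high_tids {
        if let Some(rids) = rids_by_high_tid.get(tid) {
            out.extend(rids.iter().copied());
        }
    }
    out
}

fn main() {
    let mut index: HashMap<TokenId, HashSet<usize>> = HashMap::new();
    index.insert(1, [0, 3].into_iter().collect());
    index.insert(2, [3].into_iter().collect());
    // Token 99 is absent from the index and contributes no candidates.
    let cands = candidates(&[1, 2, 99], &index);
    let expected: HashSet<usize> = [0, 3].into_iter().collect();
    assert_eq!(cands, expected);
}
```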
Implementations
impl LicenseIndex
pub fn new(dictionary: TokenDictionary) -> Self
Create a new empty license index.
This constructor initializes all index structures with empty collections. The index can be populated with rules using the indexing methods (to be implemented in future phases).
§Returns
A new LicenseIndex instance with empty index structures
pub fn with_legalese_count(legalese_count: usize) -> Self
Trait Implementations
impl Clone for LicenseIndex
fn clone(&self) -> LicenseIndex
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for LicenseIndex
Auto Trait Implementations
impl Freeze for LicenseIndex
impl RefUnwindSafe for LicenseIndex
impl Send for LicenseIndex
impl Sync for LicenseIndex
impl Unpin for LicenseIndex
impl UnsafeUnpin for LicenseIndex
impl UnwindSafe for LicenseIndex
Blanket Implementations
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T where T: Clone
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.