pub struct LicenseIndex {
pub dictionary: TokenDictionary,
pub len_legalese: usize,
pub rid_by_hash: HashMap<[u8; 20], usize>,
pub rules_by_rid: Vec<Rule>,
pub tids_by_rid: Vec<Vec<TokenId>>,
pub rules_automaton: Automaton,
pub unknown_automaton: Automaton,
pub sets_by_rid: HashMap<usize, HashSet<TokenId>>,
pub msets_by_rid: HashMap<usize, HashMap<TokenId, usize>>,
pub high_sets_by_rid: HashMap<usize, HashSet<TokenId>>,
pub high_postings_by_rid: HashMap<usize, HashMap<TokenId, Vec<usize>>>,
pub false_positive_rids: HashSet<usize>,
pub approx_matchable_rids: HashSet<usize>,
pub licenses_by_key: HashMap<String, License>,
pub pattern_id_to_rid: Vec<usize>,
pub rid_by_spdx_key: HashMap<String, usize>,
pub unknown_spdx_rid: Option<usize>,
pub rids_by_high_tid: HashMap<TokenId, HashSet<usize>>,
}
License index containing all data structures for efficient license detection.
The LicenseIndex holds multiple index structures that enable different matching strategies: hash-based exact matching, Aho-Corasick automaton matching, set-based candidate selection, and sequence matching.
Based on the Python ScanCode Toolkit implementation at: reference/scancode-toolkit/src/licensedcode/index.py
Index Structures
The index maintains several data structures for different matching strategies:
- Hash matching: rid_by_hash for exact hash-based matches
- Automaton matching: rules_automaton and unknown_automaton for pattern matching
- Candidate selection: sets_by_rid and msets_by_rid for set-based ranking
- Sequence matching: high_postings_by_rid for high-value token position tracking
- Rule classification: false_positive_rids, approx_matchable_rids
Fields
dictionary: TokenDictionary
Token dictionary mapping token strings to integer IDs.
IDs 0 to len_legalese-1 are reserved for legalese tokens (high-value words). IDs len_legalese and above are assigned to other tokens as encountered.
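The partition above makes token classification a single integer comparison. A minimal sketch, assuming a `TokenId` alias and an illustrative helper name (not this crate's actual API):

```rust
// Illustrative sketch of the ID-partition convention: tokens below
// len_legalese are legalese (high-value), all others are low-value.
type TokenId = usize;

fn is_legalese(tid: TokenId, len_legalese: usize) -> bool {
    tid < len_legalese
}

fn main() {
    let len_legalese = 1000; // hypothetical legalese count
    assert!(is_legalese(42, len_legalese)); // high-value legalese token
    assert!(!is_legalese(5000, len_legalese)); // low-value token
}
```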
len_legalese: usize
Number of legalese tokens.
Tokens with ID < len_legalese are considered high-value legalese words. Tokens with ID >= len_legalese are considered low-value tokens.
Corresponds to Python: self.len_legalese = 0 (line 185)
rid_by_hash: HashMap<[u8; 20], usize>
Mapping from rule hash to rule ID for hash-based exact matching.
This enables fast exact matches using a hash of the rule's token IDs. Each hash maps to exactly one rule ID.
Note: The hash is a 20-byte SHA1 digest of the rule's token IDs, used directly as the key of the HashMap<[u8; 20], usize>.
Corresponds to Python: self.rid_by_hash = {} (line 216)
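The lookup shape can be sketched as follows. Assumption: the real index uses a SHA1 digest of the token-ID sequence; `toy_digest` below is a stand-in checksum that only exists to keep the example self-contained.

```rust
use std::collections::HashMap;

type TokenId = u32;

// Stand-in digest: the real index hashes a rule's token-ID sequence
// with SHA1; this toy checksum just makes the lookup shape runnable.
fn toy_digest(tids: &[TokenId]) -> [u8; 20] {
    let mut out = [0u8; 20];
    for (i, t) in tids.iter().enumerate() {
        out[i % 20] ^= (*t as u8).wrapping_add(i as u8);
    }
    out
}

fn main() {
    // Index side: the digest of a rule's token IDs maps to exactly one rid.
    let mut rid_by_hash: HashMap<[u8; 20], usize> = HashMap::new();
    rid_by_hash.insert(toy_digest(&[3, 17, 4, 9]), 7); // hypothetical rule id 7

    // Query side: identical token IDs hash to the same key -> exact match.
    assert_eq!(rid_by_hash.get(&toy_digest(&[3, 17, 4, 9])), Some(&7));
    assert_eq!(rid_by_hash.get(&toy_digest(&[3, 17, 4])), None);
}
```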
rules_by_rid: Vec<Rule>
Rules indexed by rule ID.
Maps rule IDs to Rule objects for quick lookup.
Corresponds to Python: self.rules_by_rid = [] (line 201)
tids_by_rid: Vec<Vec<TokenId>>
Token ID sequences indexed by rule ID.
Maps rule IDs to their token ID sequences.
Corresponds to Python: self.tids_by_rid = [] (line 204)
rules_automaton: Automaton
Aho-Corasick automaton built from all rule token sequences.
Supports efficient multi-pattern matching of token ID sequences. Used for exact matching of complete rules or rule fragments in query text.
Corresponds to Python: self.rules_automaton = match_aho.get_automaton() (line 219)
unknown_automaton: Automaton
Aho-Corasick automaton for unknown license detection.
Separate automaton used to detect license-like text that doesn't match any known rule. Populated with ngrams from all approx-matchable rules.
Corresponds to Python: self.unknown_automaton = match_unknown.get_automaton() (line 222)
sets_by_rid: HashMap<usize, HashSet<TokenId>>
Token ID sets per rule for candidate selection.
Maps rule IDs to sets of unique token IDs present in that rule. Used for efficient candidate selection based on token overlap.
Corresponds to Python: self.sets_by_rid = [] (line 212)
msets_by_rid: HashMap<usize, HashMap<TokenId, usize>>
Token ID multisets per rule for candidate ranking.
Maps rule IDs to multisets (bags) of token IDs with their frequencies. Used for ranking candidates by token frequency overlap.
Corresponds to Python: self.msets_by_rid = [] (line 213)
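A hypothetical sketch of frequency-overlap ranking under these assumptions: the score is the multiset-intersection size, i.e. the sum of the smaller frequency for each shared token (the function name and score are illustrative, not the crate's actual ranking).

```rust
use std::collections::HashMap;

type TokenId = u32;

// Illustrative ranking score: sum over shared tokens of the smaller
// frequency, i.e. the size of the multiset intersection.
fn multiset_overlap(
    query: &HashMap<TokenId, usize>,
    rule: &HashMap<TokenId, usize>,
) -> usize {
    query
        .iter()
        .filter_map(|(tid, &qf)| rule.get(tid).map(|&rf| qf.min(rf)))
        .sum()
}

fn main() {
    let query: HashMap<TokenId, usize> = [(1, 3), (2, 1)].into_iter().collect();
    let rule: HashMap<TokenId, usize> = [(1, 2), (3, 5)].into_iter().collect();
    // Token 1 is shared: min(3, 2) = 2; tokens 2 and 3 contribute nothing.
    assert_eq!(multiset_overlap(&query, &rule), 2);
}
```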
high_sets_by_rid: HashMap<usize, HashSet<TokenId>>
High-value token sets per rule for early candidate rejection.
Maps rule IDs to sets containing only high-value (legalese) token IDs.
This is a subset of sets_by_rid for faster intersection computation
and early rejection of candidates that won’t pass the high-token threshold.
Precomputed during index building to avoid redundant filtering at runtime.
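Early rejection reduces to one set intersection per candidate. A minimal sketch, where the threshold name and value are assumptions for illustration:

```rust
use std::collections::HashSet;

type TokenId = u32;

// Sketch of early rejection: a candidate survives only if it shares at
// least `min_shared` high-value (legalese) tokens with the query.
fn passes_high_threshold(
    query_high: &HashSet<TokenId>,
    rule_high: &HashSet<TokenId>,
    min_shared: usize,
) -> bool {
    query_high.intersection(rule_high).count() >= min_shared
}

fn main() {
    let query: HashSet<TokenId> = [1, 2, 3].into_iter().collect();
    let rule: HashSet<TokenId> = [2, 3, 4].into_iter().collect();
    assert!(passes_high_threshold(&query, &rule, 2)); // shares {2, 3}
    assert!(!passes_high_threshold(&query, &rule, 3)); // only 2 shared tokens
}
```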
high_postings_by_rid: HashMap<usize, HashMap<TokenId, Vec<usize>>>
Inverted index of high-value token positions per rule.
Maps rule IDs to a mapping from high-value token IDs to their positions within the rule. Only contains positions for tokens with IDs < len_legalese.
This structure speeds up sequence matching by allowing quick lookup of where high-value tokens appear in each rule.
Corresponds to Python: self.high_postings_by_rid = [] (line 209)
In Python: postings = {tid: array('h', [positions, ...])}
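Building one rule's postings can be sketched as a single pass over its token sequence; the function name is illustrative, but the filter (tid < len_legalese) matches the description above.

```rust
use std::collections::HashMap;

type TokenId = u32;

// Sketch: record positions only for high-value tokens
// (tid < len_legalese), mirroring the Python postings shape.
fn build_high_postings(
    tids: &[TokenId],
    len_legalese: TokenId,
) -> HashMap<TokenId, Vec<usize>> {
    let mut postings: HashMap<TokenId, Vec<usize>> = HashMap::new();
    for (pos, &tid) in tids.iter().enumerate() {
        if tid < len_legalese {
            postings.entry(tid).or_default().push(pos);
        }
    }
    postings
}

fn main() {
    let postings = build_high_postings(&[5, 900, 5, 1200], 1000);
    assert_eq!(postings.get(&5), Some(&vec![0, 2])); // token 5 at positions 0 and 2
    assert_eq!(postings.get(&1200), None); // low-value token, not indexed
}
```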
false_positive_rids: HashSet<usize>
Set of rule IDs for false positive rules.
False positive rules are used for exact matching and post-matching filtering to subtract spurious matches.
Corresponds to Python: self.false_positive_rids = set() (line 230)
approx_matchable_rids: HashSet<usize>
Set of rule IDs that can be matched approximately.
Only rules marked as approx-matchable participate in sequence matching. Other rules can only be matched exactly using the automaton.
Note: This field is kept for Python parity documentation and test usage.
The inverted index (rids_by_high_tid) now handles candidate filtering
more efficiently, making direct iteration over this set unnecessary.
Corresponds to Python: self.approx_matchable_rids = set() (line 234)
licenses_by_key: HashMap<String, License>
Mapping from ScanCode license key to License object.
Provides access to license metadata for building SPDX mappings and validating license expressions.
Corresponds to Python: get_licenses_db() in models.py
pattern_id_to_rid: Vec<usize>
Maps AhoCorasick pattern_id to rule ID (rid).
This is needed because the AhoCorasick pattern_id is just the index in the patterns iterator used to build the automaton, not the actual rule id. In Python, the automaton stores (rid, start, end) tuples as values, so the rid is retrieved from the stored value. In Rust, we maintain this mapping instead.
Corresponds to Python: automaton values contain (rid, istart, iend)
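Because pattern IDs are assigned in the order patterns were fed to the automaton builder, recovering the rule ID is a plain index into the side table. A sketch with made-up rule IDs:

```rust
// Sketch of the side table: pattern IDs follow insertion order, so
// mapping a hit back to its rule id is a direct index. Sample rids
// below are hypothetical.
fn rid_for_pattern(pattern_id_to_rid: &[usize], pattern_id: usize) -> usize {
    pattern_id_to_rid[pattern_id]
}

fn main() {
    // Suppose rules 2, 5, and 9 contributed patterns, in that order.
    let pattern_id_to_rid = vec![2, 5, 9];
    // The automaton reports a hit on pattern 1; map it back to rule 5.
    assert_eq!(rid_for_pattern(&pattern_id_to_rid, 1), 5);
}
```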
rid_by_spdx_key: HashMap<String, usize>
Mapping from SPDX license key to rule ID.
Enables direct lookup of rules by their SPDX license key, including aliases such as "GPL-2.0+" -> "gpl-2.0-plus".
Keys are stored lowercase for case-insensitive lookup.
Corresponds to Python: self.licenses_by_spdx_key in cache.py
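The lowercase-key convention means every lookup normalizes the query first. A minimal sketch, with an illustrative helper name and a hypothetical rule id:

```rust
use std::collections::HashMap;

// Sketch of the case-insensitive lookup convention: keys are stored
// lowercase, so queries are lowercased before the HashMap lookup.
fn normalize_spdx_key(key: &str) -> String {
    key.to_lowercase()
}

fn main() {
    let mut rid_by_spdx_key: HashMap<String, usize> = HashMap::new();
    rid_by_spdx_key.insert("gpl-2.0+".to_string(), 12); // hypothetical rid
    assert_eq!(
        rid_by_spdx_key.get(&normalize_spdx_key("GPL-2.0+")),
        Some(&12)
    );
}
```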
unknown_spdx_rid: Option<usize>
Rule ID for the unknown-spdx license.
Used as a fallback when an SPDX identifier is not recognized.
Corresponds to Python: get_unknown_spdx_symbol() in cache.py
rids_by_high_tid: HashMap<TokenId, HashSet<usize>>
Inverted index mapping high-value token IDs to rule IDs.
This enables fast candidate selection by only examining rules that share at least one high-value (legalese) token with the query. Without this index, candidate selection would iterate over all 37,000+ rules for every file, making license detection extremely slow.
Only contains entries for tokens with ID < len_legalese (high-value tokens). Rules not in approx_matchable_rids are excluded from this index.
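Candidate selection with this index is a union of postings, so rules sharing no legalese token with the query are never visited. A sketch under these assumptions (function name illustrative):

```rust
use std::collections::{HashMap, HashSet};

type TokenId = u32;

// Sketch: the candidate set is the union of rule-ID postings for every
// high-value token present in the query.
fn candidates(
    query_high_tids: &[TokenId],
    rids_by_high_tid: &HashMap<TokenId, HashSet<usize>>,
) -> HashSet<usize> {
    let mut out = HashSet::new();
    for tid in query_high_tids {
        if let Some(rids) = rids_by_high_tid.get(tid) {
            out.extend(rids.iter().copied());
        }
    }
    out
}

fn main() {
    let mut index: HashMap<TokenId, HashSet<usize>> = HashMap::new();
    index.insert(1, [0, 3].into_iter().collect());
    index.insert(2, [3].into_iter().collect());
    // Token 99 is absent from the index and contributes no candidates.
    let cands = candidates(&[1, 2, 99], &index);
    let expected: HashSet<usize> = [0, 3].into_iter().collect();
    assert_eq!(cands, expected);
}
```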
Implementations
impl LicenseIndex
pub fn new(dictionary: TokenDictionary) -> Self
Create a new empty license index.
This constructor initializes all index structures with empty collections. The index can be populated with rules using the indexing methods (to be implemented in future phases).
§Returns
A new LicenseIndex instance with empty index structures
pub fn with_legalese_count(legalese_count: usize) -> Self
Trait Implementations
impl Clone for LicenseIndex
fn clone(&self) -> LicenseIndex
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for LicenseIndex
Auto Trait Implementations
impl Freeze for LicenseIndex
impl RefUnwindSafe for LicenseIndex
impl Send for LicenseIndex
impl Sync for LicenseIndex
impl Unpin for LicenseIndex
impl UnsafeUnpin for LicenseIndex
impl UnwindSafe for LicenseIndex
Blanket Implementations
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T where T: Clone
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.