
LicenseIndex

Struct LicenseIndex 

Source
pub struct LicenseIndex {
    pub dictionary: TokenDictionary,
    pub len_legalese: usize,
    pub rid_by_hash: HashMap<[u8; 20], usize>,
    pub rules_by_rid: Vec<Rule>,
    pub tids_by_rid: Vec<Vec<TokenId>>,
    pub rules_automaton: Automaton,
    pub unknown_automaton: Automaton,
    pub sets_by_rid: HashMap<usize, HashSet<TokenId>>,
    pub msets_by_rid: HashMap<usize, HashMap<TokenId, usize>>,
    pub high_sets_by_rid: HashMap<usize, HashSet<TokenId>>,
    pub high_postings_by_rid: HashMap<usize, HashMap<TokenId, Vec<usize>>>,
    pub false_positive_rids: HashSet<usize>,
    pub approx_matchable_rids: HashSet<usize>,
    pub licenses_by_key: HashMap<String, License>,
    pub pattern_id_to_rid: Vec<usize>,
    pub rid_by_spdx_key: HashMap<String, usize>,
    pub unknown_spdx_rid: Option<usize>,
    pub rids_by_high_tid: HashMap<TokenId, HashSet<usize>>,
}

License index containing all data structures for efficient license detection.

The LicenseIndex holds multiple index structures that enable different matching strategies: hash-based exact matching, Aho-Corasick automaton matching, set-based candidate selection, and sequence matching.

Based on the Python ScanCode Toolkit implementation at: reference/scancode-toolkit/src/licensedcode/index.py

§Index Structures

The index maintains several data structures for different matching strategies:

  • Hash matching: rid_by_hash for exact hash-based matches
  • Automaton matching: rules_automaton and unknown_automaton for pattern matching
  • Candidate selection: sets_by_rid and msets_by_rid for set-based ranking
  • Sequence matching: high_postings_by_rid for high-value token position tracking
  • Rule classification: false_positive_rids, approx_matchable_rids
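
As a rough sketch of how these strategies could be ordered (names and the dispatch logic here are illustrative, not the crate's actual API), exact hash matching is tried first, then automaton matching of exact fragments, then approximate sequence matching:

```rust
// Illustrative strategy ordering: cheapest and most exact first.
// `Strategy` and `pick_strategy` are hypothetical names for this sketch.
#[derive(Debug, PartialEq)]
enum Strategy {
    Hash,      // whole query hashes to a known rule
    Automaton, // exact rule fragments found in the query
    Sequence,  // fall back to approximate sequence matching
}

fn pick_strategy(exact_hash_hit: bool, automaton_hit: bool) -> Strategy {
    if exact_hash_hit {
        Strategy::Hash
    } else if automaton_hit {
        Strategy::Automaton
    } else {
        Strategy::Sequence
    }
}

fn main() {
    assert_eq!(pick_strategy(true, true), Strategy::Hash);
    assert_eq!(pick_strategy(false, false), Strategy::Sequence);
}
```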

Fields§

§dictionary: TokenDictionary

Token dictionary mapping token strings to integer IDs.

IDs 0 to len_legalese-1 are reserved for legalese tokens (high-value words). IDs len_legalese and above are assigned to other tokens as encountered.

§len_legalese: usize

Number of legalese tokens.

Tokens with ID < len_legalese are considered high-value legalese words. Tokens with ID >= len_legalese are considered low-value tokens.

Corresponds to Python: self.len_legalese = 0 (line 185)
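
A minimal sketch of the high/low split described above, assuming `TokenId` is an integer type (the helper name `is_legalese` is hypothetical):

```rust
type TokenId = u32; // assumed integer representation for this sketch

// IDs below len_legalese are reserved for high-value legalese tokens.
fn is_legalese(tid: TokenId, len_legalese: usize) -> bool {
    (tid as usize) < len_legalese
}

fn main() {
    let len_legalese = 1000;
    assert!(is_legalese(42, len_legalese)); // high-value token
    assert!(!is_legalese(5000, len_legalese)); // low-value token
}
```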

§rid_by_hash: HashMap<[u8; 20], usize>

Mapping from rule hash to rule ID for hash-based exact matching.

This enables fast exact matches using a hash of the rule's token IDs. Each hash maps to exactly one rule ID.

Note: the hash is a 20-byte SHA-1 digest of the rule's token IDs, used directly as the HashMap key.

Corresponds to Python: self.rid_by_hash = {} (line 216)
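
A minimal sketch of the exact-match lookup, assuming the digest has already been computed; the real index hashes the rule's token-ID sequence with SHA-1, while the digest bytes below are placeholders:

```rust
use std::collections::HashMap;

// Look up a rule ID by the 20-byte digest of a token-ID sequence.
fn lookup_exact(rid_by_hash: &HashMap<[u8; 20], usize>, digest: &[u8; 20]) -> Option<usize> {
    rid_by_hash.get(digest).copied()
}

fn main() {
    let mut rid_by_hash: HashMap<[u8; 20], usize> = HashMap::new();
    let digest = [0xab; 20]; // stand-in for SHA-1 of a rule's token IDs
    rid_by_hash.insert(digest, 7);

    // A query whose token-ID digest equals a rule's digest is an exact match.
    assert_eq!(lookup_exact(&rid_by_hash, &digest), Some(7));
    assert_eq!(lookup_exact(&rid_by_hash, &[0u8; 20]), None);
}
```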

§rules_by_rid: Vec<Rule>

Rules indexed by rule ID.

Maps rule IDs to Rule objects for quick lookup.

Corresponds to Python: self.rules_by_rid = [] (line 201)

§tids_by_rid: Vec<Vec<TokenId>>

Token ID sequences indexed by rule ID.

Maps rule IDs to their token ID sequences.

Corresponds to Python: self.tids_by_rid = [] (line 204)

§rules_automaton: Automaton

Aho-Corasick automaton built from all rule token sequences.

Supports efficient multi-pattern matching of token ID sequences. Used for exact matching of complete rules or rule fragments in query text.

Corresponds to Python: self.rules_automaton = match_aho.get_automaton() (line 219)
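
To illustrate what the automaton computes (not how), here is a naive stand-in that scans the query token stream for every rule's token sequence; the automaton does this for all rules in a single pass, and the rule and query data below are hypothetical:

```rust
// Naive multi-pattern scan: returns (rid, start position) for every
// place a rule's full token sequence occurs in the query tokens.
fn find_rule_spans(query: &[u32], rules: &[Vec<u32>]) -> Vec<(usize, usize)> {
    let mut hits = Vec::new();
    for (rid, pat) in rules.iter().enumerate() {
        if pat.is_empty() {
            continue; // windows(0) would panic
        }
        for (start, win) in query.windows(pat.len()).enumerate() {
            if win == pat.as_slice() {
                hits.push((rid, start));
            }
        }
    }
    hits
}

fn main() {
    let rules = vec![vec![1, 2, 3], vec![4, 5]];
    let query = vec![9, 1, 2, 3, 4, 5];
    // Rule 0 matches at position 1, rule 1 at position 4.
    assert_eq!(find_rule_spans(&query, &rules), vec![(0, 1), (1, 4)]);
}
```

An Aho-Corasick automaton avoids the quadratic rescanning this naive version does, which matters when matching tens of thousands of rules at once.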

§unknown_automaton: Automaton

Aho-Corasick automaton for unknown license detection.

Separate automaton used to detect license-like text that doesn't match any known rule. Populated with ngrams from all approx-matchable rules.

Corresponds to Python: self.unknown_automaton = match_unknown.get_automaton() (line 222)

§sets_by_rid: HashMap<usize, HashSet<TokenId>>

Token ID sets per rule for candidate selection.

Maps rule IDs to sets of unique token IDs present in that rule. Used for efficient candidate selection based on token overlap.

Corresponds to Python: self.sets_by_rid = [] (line 212)

§msets_by_rid: HashMap<usize, HashMap<TokenId, usize>>

Token ID multisets per rule for candidate ranking.

Maps rule IDs to multisets (bags) of token IDs with their frequencies. Used for ranking candidates by token frequency overlap.

Corresponds to Python: self.msets_by_rid = [] (line 213)
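
A minimal sketch of set-based candidate scoring, with hypothetical rule data: candidates are ranked by how many of their distinct token IDs also occur in the query (the frequency-weighted multiset stage is analogous but counts occurrences):

```rust
use std::collections::{HashMap, HashSet};

// Score one rule by the size of its token-set overlap with the query.
fn score(query: &HashSet<u32>, rule_set: &HashSet<u32>) -> usize {
    query.intersection(rule_set).count()
}

fn main() {
    let mut sets_by_rid: HashMap<usize, HashSet<u32>> = HashMap::new();
    sets_by_rid.insert(0, [1, 2, 3].into_iter().collect());
    sets_by_rid.insert(1, [2, 3, 4, 5].into_iter().collect());

    let query: HashSet<u32> = [2, 3, 5].into_iter().collect();
    let mut ranked: Vec<(usize, usize)> = sets_by_rid
        .iter()
        .map(|(&rid, set)| (rid, score(&query, set)))
        .collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1)); // best overlap first

    // Rule 1 shares tokens 2, 3, and 5 with the query.
    assert_eq!(ranked[0], (1, 3));
}
```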

§high_sets_by_rid: HashMap<usize, HashSet<TokenId>>

High-value token sets per rule for early candidate rejection.

Maps rule IDs to sets containing only high-value (legalese) token IDs. This is a subset of sets_by_rid for faster intersection computation and early rejection of candidates that won’t pass the high-token threshold.

Precomputed during index building to avoid redundant filtering at runtime.

§high_postings_by_rid: HashMap<usize, HashMap<TokenId, Vec<usize>>>

Inverted index of high-value token positions per rule.

Maps rule IDs to a mapping from high-value token IDs to their positions within the rule. Only contains positions for tokens with IDs < len_legalese.

This structure speeds up sequence matching by allowing quick lookup of where high-value tokens appear in each rule.

Corresponds to Python: self.high_postings_by_rid = [] (line 209). In Python the postings for one rule are stored as postings = {tid: array('h', [positions, ...])}.
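
A minimal sketch of building the postings for one rule, assuming a hypothetical token sequence and legalese boundary:

```rust
use std::collections::HashMap;

// Map each high-value token ID (tid < len_legalese) to the positions
// where it occurs in the rule's token sequence.
fn build_postings(tokens: &[u32], len_legalese: u32) -> HashMap<u32, Vec<usize>> {
    let mut postings: HashMap<u32, Vec<usize>> = HashMap::new();
    for (pos, &tid) in tokens.iter().enumerate() {
        if tid < len_legalese {
            postings.entry(tid).or_default().push(pos);
        }
    }
    postings
}

fn main() {
    // Token IDs 0 and 1 are "legalese" here; 9 is a low-value token.
    let postings = build_postings(&[0, 9, 1, 0], 2);
    assert_eq!(postings[&0], vec![0, 3]);
    assert_eq!(postings[&1], vec![2]);
    assert!(!postings.contains_key(&9)); // low-value tokens are excluded
}
```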

§false_positive_rids: HashSet<usize>

Set of rule IDs for false positive rules.

False positive rules are used for exact matching and post-matching filtering to subtract spurious matches.

Corresponds to Python: self.false_positive_rids = set() (line 230)

§approx_matchable_rids: HashSet<usize>

Set of rule IDs that can be matched approximately.

Only rules marked as approx-matchable participate in sequence matching. Other rules can only be matched exactly using the automaton.

Note: This field is kept for Python parity documentation and test usage. The inverted index (rids_by_high_tid) now handles candidate filtering more efficiently, making direct iteration over this set unnecessary.

Corresponds to Python: self.approx_matchable_rids = set() (line 234)

§licenses_by_key: HashMap<String, License>

Mapping from ScanCode license key to License object.

Provides access to license metadata for building SPDX mappings and validating license expressions.

Corresponds to Python: get_licenses_db() in models.py

§pattern_id_to_rid: Vec<usize>

Maps AhoCorasick pattern_id to rule id (rid).

This is needed because the AhoCorasick pattern_id is just the index in the patterns iterator used to build the automaton, not the actual rule id. In Python, the automaton stores (rid, start, end) tuples as values, so the rid is retrieved from the stored value. In Rust, we maintain this mapping instead.

Corresponds to Python: automaton values contain (rid, istart, iend)
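
A sketch of why the mapping is needed: pattern IDs are assigned in insertion order while the automaton is built, so each inserted pattern records its owning rule ID. The per-rule pattern counts below are hypothetical:

```rust
// Build pattern_id -> rid: each rule may contribute several patterns
// (e.g. ngram fragments), and pattern IDs run in insertion order.
fn build_pattern_map(patterns_per_rule: &[usize]) -> Vec<usize> {
    let mut pattern_id_to_rid = Vec::new();
    for (rid, &n) in patterns_per_rule.iter().enumerate() {
        for _ in 0..n {
            pattern_id_to_rid.push(rid);
        }
    }
    pattern_id_to_rid
}

fn main() {
    // Rules 0, 1, 2 contribute 1, 3, and 2 patterns respectively,
    // so pattern IDs 1..=3 all resolve to rule 1.
    assert_eq!(build_pattern_map(&[1, 3, 2]), vec![0, 1, 1, 1, 2, 2]);
}
```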

§rid_by_spdx_key: HashMap<String, usize>

Mapping from SPDX license key to rule ID.

Enables direct lookup of rules by their SPDX license key, including aliases like “GPL-2.0+” -> gpl-2.0-plus.

Keys are stored lowercase for case-insensitive lookup.

Corresponds to Python: self.licenses_by_spdx_key in cache.py
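
A minimal sketch of the case-insensitive lookup, assuming keys are stored lowercased as described above (the rule ID is hypothetical):

```rust
use std::collections::HashMap;

// Lowercase the query key before lookup; stored keys are lowercase.
fn lookup_spdx(rid_by_spdx_key: &HashMap<String, usize>, key: &str) -> Option<usize> {
    rid_by_spdx_key.get(&key.to_lowercase()).copied()
}

fn main() {
    let mut rid_by_spdx_key: HashMap<String, usize> = HashMap::new();
    rid_by_spdx_key.insert("gpl-2.0+".to_string(), 12);

    // "GPL-2.0+" matches despite the case difference.
    assert_eq!(lookup_spdx(&rid_by_spdx_key, "GPL-2.0+"), Some(12));
    assert_eq!(lookup_spdx(&rid_by_spdx_key, "mit"), None);
}
```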

§unknown_spdx_rid: Option<usize>

Rule ID for the unknown-spdx license.

Used as a fallback when an SPDX identifier is not recognized.

Corresponds to Python: get_unknown_spdx_symbol() in cache.py

§rids_by_high_tid: HashMap<TokenId, HashSet<usize>>

Inverted index mapping high-value token IDs to rule IDs.

This enables fast candidate selection by only examining rules that share at least one high-value (legalese) token with the query. Without this index, candidate selection would iterate over all 37,000+ rules for every file, making license detection extremely slow.

Only contains entries for tokens with ID < len_legalese (high-value tokens). Rules not in approx_matchable_rids are excluded from this index.
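
A minimal sketch of inverted-index candidate selection, with hypothetical postings: only rules sharing at least one high-value token with the query are collected, so rules with no overlap are never visited:

```rust
use std::collections::{HashMap, HashSet};

// Union the posting sets of every high-value token in the query.
fn select_candidates(
    rids_by_high_tid: &HashMap<u32, HashSet<usize>>,
    query_high_tids: &[u32],
) -> HashSet<usize> {
    let mut candidates = HashSet::new();
    for tid in query_high_tids {
        if let Some(rids) = rids_by_high_tid.get(tid) {
            candidates.extend(rids);
        }
    }
    candidates
}

fn main() {
    let mut rids_by_high_tid: HashMap<u32, HashSet<usize>> = HashMap::new();
    rids_by_high_tid.insert(0, [3, 7].into_iter().collect());
    rids_by_high_tid.insert(1, [7].into_iter().collect());

    // Token 5 has no postings, so only rule 7 becomes a candidate.
    let candidates = select_candidates(&rids_by_high_tid, &[1, 5]);
    let expected: HashSet<usize> = [7].into_iter().collect();
    assert_eq!(candidates, expected);
}
```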

Implementations§

Source§

impl LicenseIndex

Source

pub fn new(dictionary: TokenDictionary) -> Self

Create a new empty license index.

This constructor initializes all index structures with empty collections. The index can be populated with rules using the indexing methods (to be implemented in future phases).

§Returns

A new LicenseIndex instance with empty index structures

Source

pub fn with_legalese_count(legalese_count: usize) -> Self

Create a new empty license index with the specified legalese count.

Convenience method that creates a new TokenDictionary and LicenseIndex in one call.

§Arguments
  • legalese_count - Number of reserved legalese token IDs
§Returns

A new LicenseIndex instance with a new TokenDictionary

Trait Implementations§

Source§

impl Clone for LicenseIndex

Source§

fn clone(&self) -> LicenseIndex

Returns a duplicate of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl Debug for LicenseIndex

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl Default for LicenseIndex

Source§

fn default() -> Self

Returns the “default value” for a type. Read more

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T, U> ExactFrom<T> for U
where U: TryFrom<T>,

Source§

fn exact_from(value: T) -> U

Source§

impl<T, U> ExactInto<U> for T
where U: ExactFrom<T>,

Source§

fn exact_into(self) -> U

Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T, U> OverflowingInto<U> for T
where U: OverflowingFrom<T>,

Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of the pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a value with the given initializer. Read more
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer. Read more
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer. Read more
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer. Read more
Source§

impl<T, U> RoundingInto<U> for T
where U: RoundingFrom<T>,

Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
Source§

impl<T, U> SaturatingInto<U> for T
where U: SaturatingFrom<T>,

Source§

impl<T> ToDebugString for T
where T: Debug,

Source§

fn to_debug_string(&self) -> String

Returns the String produced by T's Debug implementation.

§Examples
use malachite_base::strings::ToDebugString;

assert_eq!([1, 2, 3].to_debug_string(), "[1, 2, 3]");
assert_eq!(
    [vec![2, 3], vec![], vec![4]].to_debug_string(),
    "[[2, 3], [], [4]]"
);
assert_eq!(Some(5).to_debug_string(), "Some(5)");
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T, U> WrappingInto<U> for T
where U: WrappingFrom<T>,

Source§

fn wrapping_into(self) -> U