Approximate sequence matching for license detection.
This module implements sequence-based matching using set similarity for candidate selection, followed by sequence alignment to find matching blocks.
Based on the Python ScanCode Toolkit implementation at: reference/scancode-toolkit/src/licensedcode/match_seq.py
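To make "sequence alignment to find matching blocks" concrete, here is a minimal sketch of finding one matching block: the longest contiguous run of tokens shared by a query and a rule. The function name `longest_matching_block` and the use of `u32` token ids are assumptions made for illustration. The actual module may use a different alignment algorithm and may report several blocks rather than one.

```rust
/// Hypothetical helper: longest contiguous run of tokens shared by `query` and `rule`.
/// Returns (query_start, rule_start, length); a length of 0 means no common token.
fn longest_matching_block(query: &[u32], rule: &[u32]) -> (usize, usize, usize) {
    let mut best = (0, 0, 0);
    // prev[j + 1] is the length of the common run ending at the previous
    // query token and rule[j].
    let mut prev = vec![0usize; rule.len() + 1];
    for (i, q) in query.iter().enumerate() {
        let mut curr = vec![0usize; rule.len() + 1];
        for (j, r) in rule.iter().enumerate() {
            if q == r {
                // Extend the run that ended one token earlier in both sequences.
                curr[j + 1] = prev[j] + 1;
                if curr[j + 1] > best.2 {
                    let len = curr[j + 1];
                    best = (i + 1 - len, j + 1 - len, len);
                }
            }
        }
        prev = curr;
    }
    best
}
```

This dynamic-programming version runs in time proportional to the product of the two lengths, which is acceptable for a sketch but is one reason candidate selection by set similarity happens first.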
§Near-Duplicate Detection
This module implements Phase 2 of the Python implementation's 3-phase matching pipeline:
- Phase 1: Hash & Aho-Corasick (exact matches)
- Phase 2: Near-duplicate detection - check whole file for high-resemblance candidates
- Phase 3: Query run matching (if no near-duplicates found)
The near-duplicate detection finds rules with high resemblance (>= 0.8) to the entire query, which helps match combined rules instead of partial rules.
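The threshold check described above can be sketched as follows. This is an illustration only: it assumes resemblance is the Jaccard similarity of token-id sets, and it assumes the threshold is stored in tenths (so 8 stands for 0.8) to keep the comparison in integers. The names `near_duplicate_candidates`, `tenths`, and `max_candidates` are hypothetical and are not taken from the module's API.

```rust
use std::collections::HashSet;

/// Hypothetical sketch: return the indices of rules whose token set has a
/// resemblance of at least `tenths / 10` to the whole query, best first,
/// keeping at most `max_candidates` entries.
/// Assumption: resemblance = |query ∩ rule| / |query ∪ rule| (Jaccard).
fn near_duplicate_candidates(
    query: &HashSet<u32>,
    rules: &[HashSet<u32>],
    tenths: usize,
    max_candidates: usize,
) -> Vec<usize> {
    let mut scored: Vec<(usize, usize, usize)> = rules
        .iter()
        .enumerate()
        .filter_map(|(idx, rule)| {
            let inter = query.intersection(rule).count();
            let union = query.len() + rule.len() - inter;
            // inter / union >= tenths / 10, rewritten without floating point.
            (union > 0 && inter * 10 >= union * tenths).then_some((idx, inter, union))
        })
        .collect();
    // Sort by resemblance, highest first, comparing fractions by cross-multiplication.
    // The sort is stable, so ties keep their original rule order.
    scored.sort_by(|a, b| (b.1 * a.2).cmp(&(a.1 * b.2)));
    scored.truncate(max_candidates);
    scored.into_iter().map(|(idx, _, _)| idx).collect()
}
```

If this returns an empty list, the pipeline described above would fall through to Phase 3 (query run matching).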
Constants§
- HIGH_RESEMBLANCE_THRESHOLD_TENTHS
- MATCH_SEQ
- MAX_NEAR_DUPE_CANDIDATES - Default number of top near-duplicate candidates to consider.