Module seq_match


Approximate sequence matching for license detection.

This module implements sequence-based matching using set similarity for candidate selection, followed by sequence alignment to find matching blocks.

Based on the Python ScanCode Toolkit implementation at reference/scancode-toolkit/src/licensedcode/match_seq.py
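The two stages described above can be sketched roughly as follows. This is a minimal illustration, not this module's actual code: it assumes tokens are represented as `u32` IDs, uses plain Jaccard resemblance for the set-similarity step, and uses a longest-common-contiguous-run search in place of the full sequence alignment. The function names `resemblance` and `longest_matching_block` are hypothetical.

```rust
use std::collections::HashSet;

/// Jaccard resemblance between two token-ID sets: |A ∩ B| / |A ∪ B|.
/// Hypothetical helper; the real candidate scoring may differ.
fn resemblance(query: &HashSet<u32>, rule: &HashSet<u32>) -> f64 {
    let union = query.union(rule).count();
    if union == 0 {
        return 0.0;
    }
    query.intersection(rule).count() as f64 / union as f64
}

/// Longest contiguous block shared by two token sequences.
/// Returns (query_start, rule_start, length); length is 0 if nothing matches.
fn longest_matching_block(query: &[u32], rule: &[u32]) -> (usize, usize, usize) {
    let mut best = (0, 0, 0);
    // prev[j] = length of the common run ending at query[i - 2] and rule[j - 1].
    let mut prev = vec![0usize; rule.len() + 1];
    for i in 1..=query.len() {
        let mut curr = vec![0usize; rule.len() + 1];
        for j in 1..=rule.len() {
            if query[i - 1] == rule[j - 1] {
                // Extend the run that ended one token earlier in both sequences.
                curr[j] = prev[j - 1] + 1;
                if curr[j] > best.2 {
                    best = (i - curr[j], j - curr[j], curr[j]);
                }
            }
        }
        prev = curr;
    }
    best
}
```

In this sketch the cheap set comparison would rank candidate rules first, and only the best-ranked candidates would go through the more expensive block search.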

§Near-Duplicate Detection

This module implements Phase 2 of the three-phase matching pipeline used by the Python implementation:

Phase 1: Hash & Aho-Corasick (exact matches)
Phase 2: Near-duplicate detection - check whole file for high-resemblance candidates
Phase 3: Query run matching (if no near-duplicates found)
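The control flow between the phases listed above might look like the sketch below. Everything here is an assumption made for illustration: the `Match` struct, the `run_phases` function, and the closure-based matchers are hypothetical, and the only behavior taken from the list is that Phase 3 runs when Phase 2 finds no near-duplicates. How Phase 1 results interact with the later phases is not specified here, so the sketch simply keeps them.

```rust
/// Hypothetical match record; the real match type is not shown in this documentation.
#[derive(Debug, Clone, PartialEq)]
struct Match {
    rule_id: usize,
    phase: u8,
}

/// Runs three matchers in order. The query-run matcher (Phase 3) is only
/// invoked when the near-duplicate matcher (Phase 2) returns nothing.
fn run_phases<E, N, Q>(exact: E, near_dupe: N, query_runs: Q) -> Vec<Match>
where
    E: Fn() -> Vec<Match>,
    N: Fn() -> Vec<Match>,
    Q: Fn() -> Vec<Match>,
{
    // Phase 1: exact matches (hash and Aho-Corasick in the real pipeline).
    let mut matches = exact();

    // Phase 2: whole-query near-duplicate detection.
    let near = near_dupe();
    if near.is_empty() {
        // Phase 3: fall back to matching individual query runs.
        matches.extend(query_runs());
    } else {
        matches.extend(near);
    }
    matches
}
```

Passing the matchers as closures keeps the sketch self-contained; a real implementation would more likely call concrete functions that share an index and a query object.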

Near-duplicate detection finds rules whose resemblance to the entire query is at least 0.8. Checking the whole query first helps match a single combined rule instead of several partial rules.
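A resemblance cutoff of this kind can be applied without floating-point arithmetic by expressing it in tenths, which is what the name `HIGH_RESEMBLANCE_THRESHOLD_TENTHS` suggests. The sketch below is an assumption-laden illustration: `select_near_dupes` is a hypothetical function, candidates are modeled as `(rule_id, intersection_size, union_size)` tuples, and both the threshold and the candidate cap are passed in as parameters because the actual constant values are not shown on this page.

```rust
/// Keeps candidates whose resemblance (intersection / union) is at least
/// `threshold_tenths / 10`, ordered from highest to lowest resemblance,
/// and returns at most `max_candidates` rule IDs.
///
/// Each candidate is (rule_id, intersection_size, union_size); this layout
/// is hypothetical and chosen only to keep the example self-contained.
fn select_near_dupes(
    candidates: &[(usize, usize, usize)],
    threshold_tenths: usize,
    max_candidates: usize,
) -> Vec<usize> {
    let mut kept: Vec<(usize, usize, usize)> = candidates
        .iter()
        .copied()
        // inter / union >= tenths / 10, rewritten as an integer comparison.
        .filter(|&(_, inter, union)| union > 0 && inter * 10 >= union * threshold_tenths)
        .collect();

    // Descending resemblance via cross-multiplication: b.1/b.2 vs a.1/a.2.
    kept.sort_by(|a, b| (b.1 * a.2).cmp(&(a.1 * b.2)));
    kept.truncate(max_candidates);
    kept.into_iter().map(|(id, _, _)| id).collect()
}
```

For example, with a threshold of 8 tenths and a cap of 2, candidates with resemblance 0.8, 0.7, 0.9, and 1.0 would be reduced to the two with resemblance 1.0 and 0.9, in that order.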

Constants§

HIGH_RESEMBLANCE_THRESHOLD_TENTHS
MATCH_SEQ
MAX_NEAR_DUPE_CANDIDATES: Default number of top near-duplicate candidates to consider.
