//! Common license-specific word dictionary (legalese).
//!
//! This module defines legalese tokens - common words specific to licenses
//! that are high-value for license detection. These words get lower token IDs,
//! making them more significant during matching.
//!
//! **IMPORTANT**: This dictionary is ported from the Python reference at
//! `reference/scancode-toolkit/src/licensedcode/legalese.py`.
//!
//! The Python reference contains 4506 words (including spelling variants and
//! typos that map to the same token IDs). Multiple words can map to the same
//! token ID when they are considered equivalent.
//!
//! The data is generated at build time by `build.rs` from
//! `resources/license_detection/legalese_data.txt`, serialized as an rkyv
//! `BTreeMap<String, u16>` artifact, and loaded via `include_bytes!` for
//! zero-copy access. Values are bare `u16` rather than `TokenId` because
//! `build.rs` cannot depend on the main crate's types; the caller wraps
//! them with `TokenId::new()` at the call site.
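//!
//! The call-site wrapping described above might look roughly like the sketch
//! below. It is illustrative only: the `TokenId` defined here is a
//! hypothetical stand-in for the crate's real type, the dictionary entries
//! are made up, and a plain `BTreeMap` is used in place of the rkyv archive
//! so that the example is self-contained.
//!
//! ```rust
//! use std::collections::BTreeMap;
//!
//! /// Hypothetical stand-in for the crate's `TokenId` newtype (assumption).
//! #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
//! struct TokenId(u16);
//!
//! impl TokenId {
//!     fn new(raw: u16) -> Self {
//!         TokenId(raw)
//!     }
//! }
//!
//! /// Wrap the bare `u16` values of a word-to-ID map in `TokenId`.
//! fn wrap_ids(raw: &BTreeMap<String, u16>) -> BTreeMap<String, TokenId> {
//!     raw.iter()
//!         .map(|(word, &id)| (word.clone(), TokenId::new(id)))
//!         .collect()
//! }
//!
//! /// Made-up sample entries; the real data comes from `legalese_data.txt`.
//! fn sample_dictionary() -> BTreeMap<String, u16> {
//!     let mut map = BTreeMap::new();
//!     map.insert("license".to_string(), 0);
//!     map.insert("licence".to_string(), 0); // spelling variant, same ID
//!     map.insert("warranty".to_string(), 1);
//!     map
//! }
//!
//! fn main() {
//!     let wrapped = wrap_ids(&sample_dictionary());
//!     // Equivalent words share one token ID; distinct words do not.
//!     assert_eq!(wrapped["license"], wrapped["licence"]);
//!     assert_ne!(wrapped["license"], wrapped["warranty"]);
//! }
//! ```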
use BTreeMap;
use Archived;
;
const LEGALESE_RKYV_LEN: usize = ;
static LEGALESE_RKYV: AlignedSlice = ;
/// Get the archived legalese dictionary for zero-copy iteration.
///
/// Returns a reference to the rkyv-archived `BTreeMap<String, u16>`,
/// which can be iterated directly without intermediate allocations.
/// Values are bare `u16` that get wrapped in `TokenId` at the call site.