//! Confusable / spoof detection (UTS #39). Requires the `alloc` feature.
use confusables as gen;
use nfd;
use is_default_ignorable;
use ;
use String;
use Vec;
/// The UTS #39 *confusable skeleton* of `s`: drop `Default_Ignorable_Code_Point`
/// characters, NFD, replace each character by its confusable prototype, then NFD
/// again. Two strings are visually confusable iff their skeletons are equal — see
/// [`confusable`].
///
/// Stripping default-ignorables (e.g. ZWSP U+200B, ZWJ/ZWNJ, variation selectors)
/// is required by UTS #39: such characters are invisible in rendering, so an
/// attacker could otherwise hide them inside a homograph (`"pay\u{200B}pal"`) to
/// evade detection.
///
/// ```
/// use intl::unicode::spoof::skeleton;
/// // Cyrillic "а" and Latin "a" share a skeleton.
/// assert_eq!(skeleton("pаypal"), skeleton("paypal"));
/// // An interspersed zero-width space is ignored.
/// assert_eq!(skeleton("pay\u{200B}pal"), skeleton("paypal"));
/// ```

/// `true` if `a` and `b` are confusable (have the same [`skeleton`]) yet are not
/// the same string.

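// Illustrative sketch (not this crate's implementation): the skeleton and
// confusable checks described above, reduced to std-only Rust. Assumptions:
// `toy_skeleton`, `toy_confusable`, `toy_prototype` and
// `toy_is_default_ignorable` are hypothetical names; the prototype table and
// the ignorable list are tiny hand-picked samples rather than the UTS #39
// data; both NFD passes are omitted because std has no normalizer.

```rust
/// Hypothetical sample of Default_Ignorable_Code_Point: soft hyphen, ZWSP,
/// ZWNJ, ZWJ, word joiner, and the variation selectors U+FE00..U+FE0F.
fn toy_is_default_ignorable(c: char) -> bool {
    matches!(
        c,
        '\u{00AD}' | '\u{200B}'..='\u{200D}' | '\u{2060}' | '\u{FE00}'..='\u{FE0F}'
    )
}

/// Hypothetical toy prototype table: a few Cyrillic letters mapped to their
/// Latin look-alikes. Every other character maps to itself.
fn toy_prototype(c: char) -> char {
    match c {
        '\u{0430}' => 'a', // CYRILLIC SMALL LETTER A
        '\u{0435}' => 'e', // CYRILLIC SMALL LETTER IE
        '\u{043E}' => 'o', // CYRILLIC SMALL LETTER O
        '\u{0440}' => 'p', // CYRILLIC SMALL LETTER ER
        '\u{0443}' => 'y', // CYRILLIC SMALL LETTER U
        other => other,
    }
}

/// Simplified skeleton: drop ignorables, then map each character to its
/// prototype. The NFD steps of the real algorithm are deliberately left out.
fn toy_skeleton(s: &str) -> String {
    s.chars()
        .filter(|&c| !toy_is_default_ignorable(c))
        .map(toy_prototype)
        .collect()
}

/// Confusable means: equal skeletons, but not the identical string.
fn toy_confusable(a: &str, b: &str) -> bool {
    a != b && toy_skeleton(a) == toy_skeleton(b)
}
```

// With these toy tables, "p\u{0430}ypal" and "paypal" are reported as
// confusable, while two identical strings are not.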
/// `true` if `s` is *single-script* under UTS #39 "Single Script" resolution:
/// the intersection of the `Script_Extensions` sets of all its characters is
/// non-empty, i.e. there exists at least one script every character can be
/// written in.
///
/// Resolution uses each character's full `Script_Extensions` set, not just its
/// primary `Script`. So U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK), whose
/// primary `Script` is `Common` but whose `Script_Extensions` is `{Hira, Kana,
/// ...}`, constrains the running script set rather than being ignored.
/// Characters whose `Script_Extensions` is exactly `{Common}` or `{Inherited}`
/// (shared punctuation, digits, combining marks, …) are compatible with every
/// script and impose no constraint. An empty string is single-script.
///
/// A `false` result flags a mixed-script string — a common spoofing signal.
///
/// Resolution uses the UTS #39 *augmented* script sets, so the CJK writing
/// systems are handled: Han is treated as compatible with Japanese (Han +
/// Hiragana + Katakana), Korean (Han + Hangul), and Chinese (Han + Bopomofo).
/// Thus `日本語` (Han) mixed with kana stays single-script, and Han mixed with
/// Hangul stays single-script — but Hiragana mixed with Hangul (Japanese vs
/// Korean) is *not*, because those share no augmented script.
///
/// ```
/// use intl::unicode::spoof::is_single_script;
/// assert!(is_single_script("hello"));
/// // Latin + Cyrillic 'у' (U+0443) — mixed script.
/// assert!(!is_single_script("paуpal"));
/// assert!(is_single_script(""));
/// // Shared punctuation and digits keep Latin text single-script.
/// assert!(is_single_script("abc-123"));
/// // Han + Hiragana is Japanese — single script.
/// assert!(is_single_script("漢は"));
/// // Hiragana + Hangul is Japanese vs Korean — mixed script.
/// assert!(!is_single_script("は한"));
/// ```
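// Illustrative sketch (not this crate's implementation): single-script
// resolution as a running intersection of per-character script sets.
// Assumptions: `ToyScript`, `toy_script_extensions`, `imposes_no_constraint`
// and `toy_is_single_script` are hypothetical names; the character ranges are
// a coarse stand-in for the real `Script_Extensions` data; the CJK
// augmentation step is left out, so this toy treats Hiragana + Katakana as
// mixed.

```rust
use std::collections::BTreeSet;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum ToyScript {
    Common,
    Inherited,
    Latin,
    Cyrillic,
    Hiragana,
    Katakana,
}

/// Coarse, hypothetical approximation of Script_Extensions by code point range.
fn toy_script_extensions(c: char) -> BTreeSet<ToyScript> {
    use ToyScript::*;
    let scripts: &[ToyScript] = match c {
        'a'..='z' | 'A'..='Z' => &[Latin],
        '\u{0400}'..='\u{04FF}' => &[Cyrillic],
        '\u{3041}'..='\u{3096}' => &[Hiragana],
        '\u{30A1}'..='\u{30FA}' => &[Katakana],
        // Prolonged sound mark: primary Script is Common, but it is shared by
        // Hiragana and Katakana, so it constrains the running set.
        '\u{30FC}' => &[Hiragana, Katakana],
        // Treated as Inherited for this toy table only.
        '\u{0300}'..='\u{036F}' => &[Inherited],
        _ => &[Common],
    };
    scripts.iter().copied().collect()
}

/// A set that is exactly {Common} or {Inherited} is compatible with everything.
fn imposes_no_constraint(scx: &BTreeSet<ToyScript>) -> bool {
    scx.len() == 1
        && (scx.contains(&ToyScript::Common) || scx.contains(&ToyScript::Inherited))
}

fn toy_is_single_script(s: &str) -> bool {
    // `None` stands for "no constraint seen yet", i.e. the set of all scripts.
    let mut running: Option<BTreeSet<ToyScript>> = None;
    for c in s.chars() {
        let scx = toy_script_extensions(c);
        if imposes_no_constraint(&scx) {
            continue;
        }
        let next: BTreeSet<ToyScript> = match running {
            None => scx,
            Some(ref prev) => prev.intersection(&scx).copied().collect(),
        };
        if next.is_empty() {
            return false; // no script is shared by every character so far
        }
        running = Some(next);
    }
    true
}
```

// Keeping `None` for the unconstrained state avoids materializing the set of
// all scripts; the empty string and pure punctuation never leave that state.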

/// A token in an augmented script set (UTS #39 §5.1). Regular scripts are
/// carried as [`ScriptTok::Scr`]; the three CJK "augmented" writing systems get
/// their own tokens so that, e.g., Han + Hiragana resolves to a single script
/// (both contain `Jpan`) while Hiragana + Hangul does not.

/// The augmented script set of `c` (UTS #39 §5.1): its `Script_Extensions`, with
/// Han mapped to {Japanese, Korean, Han-Bopomofo}, Hiragana/Katakana to
/// Japanese, Hangul to Korean, and Bopomofo to Han-Bopomofo. Han thus stays
/// compatible with each individual CJK system without making those systems
/// compatible with each other.
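// Illustrative sketch (not this crate's implementation): the augmentation
// mapping described above, followed by the same running-intersection check.
// Assumptions: `ToyScript`, `ToyTok`, `toy_script`, `toy_augmented_set` and
// `toy_is_single_script` are hypothetical names (only the `Scr` variant and
// the `Jpan` token are mentioned in the docs above); the code point ranges
// are a coarse stand-in for real script data, and each character is given a
// single script rather than a full `Script_Extensions` set.

```rust
use std::collections::BTreeSet;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum ToyScript {
    Latin,
    Han,
    Hiragana,
    Katakana,
    Hangul,
    Bopomofo,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum ToyTok {
    Scr(ToyScript), // a regular script
    Jpan,           // Japanese: Han + Hiragana + Katakana
    Kore,           // Korean: Han + Hangul
    Hanb,           // Han with Bopomofo
}

/// Coarse, hypothetical script lookup. `None` plays the role of Common or
/// Inherited: the character imposes no constraint.
fn toy_script(c: char) -> Option<ToyScript> {
    match c {
        'a'..='z' | 'A'..='Z' => Some(ToyScript::Latin),
        '\u{4E00}'..='\u{9FFF}' => Some(ToyScript::Han),
        '\u{3041}'..='\u{3096}' => Some(ToyScript::Hiragana),
        '\u{30A1}'..='\u{30FA}' => Some(ToyScript::Katakana),
        '\u{AC00}'..='\u{D7A3}' => Some(ToyScript::Hangul),
        '\u{3105}'..='\u{312F}' => Some(ToyScript::Bopomofo),
        _ => None,
    }
}

/// Augmented set of `c`, following the mapping in the doc comment above.
fn toy_augmented_set(c: char) -> Option<BTreeSet<ToyTok>> {
    let script = toy_script(c)?;
    let mut set = BTreeSet::new();
    match script {
        ToyScript::Han => {
            set.insert(ToyTok::Jpan);
            set.insert(ToyTok::Kore);
            set.insert(ToyTok::Hanb);
        }
        ToyScript::Hiragana | ToyScript::Katakana => {
            set.insert(ToyTok::Jpan);
        }
        ToyScript::Hangul => {
            set.insert(ToyTok::Kore);
        }
        ToyScript::Bopomofo => {
            set.insert(ToyTok::Hanb);
        }
        other => {
            set.insert(ToyTok::Scr(other));
        }
    }
    Some(set)
}

fn toy_is_single_script(s: &str) -> bool {
    // `None` stands for "no constraint seen yet".
    let mut running: Option<BTreeSet<ToyTok>> = None;
    for set in s.chars().filter_map(toy_augmented_set) {
        let next: BTreeSet<ToyTok> = match running {
            None => set,
            Some(ref prev) => prev.intersection(&set).cloned().collect(),
        };
        if next.is_empty() {
            return false;
        }
        running = Some(next);
    }
    true
}
```

// Han carries all three CJK tokens, so it intersects with kana (via `Jpan`)
// and with Hangul (via `Kore`), while kana and Hangul share no token.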