//! Shared UTF-8 decoding helper for the transform engines.
/// Decodes one non-ASCII UTF-8 codepoint from `bytes[offset..]`.
///
/// This function handles only multi-byte sequences (lead byte `>= 0xC0`); it
/// must not be called on ASCII bytes (`< 0x80`) or on continuation bytes
/// (`0x80..=0xBF`). Callers must first advance past ASCII bytes (e.g., via
/// [`super::simd::skip_ascii_simd`]) before invoking this function.
///
/// # Returns
///
/// `(codepoint, byte_length)` where:
/// - `codepoint` is the decoded Unicode scalar value.
/// - `byte_length` is the number of bytes consumed: 2 for U+0080–U+07FF,
/// 3 for U+0800–U+FFFF (includes the CJK Unified Ideographs block,
/// U+4E00–U+9FFF), 4 for U+10000–U+10FFFF (supplementary planes, including
/// CJK Extension B and later).
///
/// ```text
/// // Decoding '中' (U+4E2D, 3-byte UTF-8: E4 B8 AD):
/// let text = "abc中def";
/// let bytes = text.as_bytes();
/// let (cp, len) = unsafe { decode_utf8_raw(bytes, 3) };
/// assert_eq!(cp, 0x4E2D); // '中'
/// assert_eq!(len, 3); // 3-byte sequence
/// ```
///
/// # Safety
///
/// - `offset` must point at the start of a multi-byte UTF-8 sequence (lead
/// byte `>= 0xC0`). Callers guarantee this by only invoking at a character
/// boundary after confirming `bytes[offset] >= 0x80`; inside a `&str` (which
/// is always valid UTF-8), a non-ASCII byte at a boundary is a lead byte.
/// - `bytes[offset .. offset + byte_length]` must be in bounds. This is
/// guaranteed because the input originates from a `&str` whose total length
/// covers the full multi-byte sequence.
/// - Each `get_unchecked` reads a continuation byte at a known offset (1, 2,
/// or 3 past the lead byte). The lead byte's high bits determine how many
/// continuation bytes exist, and valid UTF-8 guarantees they are present.
pub unsafe