//! Text transformation pipeline for standardizing input before pattern
//! matching.
//!
//! This module converts raw input text through a configurable series of steps —
//! Traditional-to-Simplified Chinese conversion, codepoint deletion,
//! normalization, and CJK romanization — so that [`crate::SimpleMatcher`] can
//! match patterns against both raw and transformed forms of the same text.
//!
//! # Public API
//!
//! | Item | Purpose |
//! |------|---------|
//! | [`ProcessType`] | Bitflags selecting which transformation steps to apply. |
//! | [`text_process`] | Applies a composite pipeline and returns the final result. |
//! | [`reduce_text_process`] | Applies a pipeline and records each intermediate change. |
//! | [`reduce_text_process_emit`] | Like `reduce_text_process`, but merges replace-type steps in place. |
//!
//! # Internal structure
//!
//! - [`step`] — [`TransformStep`](step::TransformStep) enum and the global
//! `OnceLock` cache that lazily compiles each single-bit step once.
//! - [`transform`] — Low-level engines (charwise page-table, Aho-Corasick
//! normalizer, SIMD delete).
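//!
//! The lazy, compile-once caching described for [`step`] can be pictured with
//! `std::sync::OnceLock`. This is only a minimal sketch under stated
//! assumptions, not the module's actual code: `CompiledStep`, `STEP_COUNT`,
//! `STEP_CACHE`, and `cached_step` are hypothetical names, and the slot count
//! of 8 is an assumption.
//!
//! ```rust
//! use std::sync::OnceLock;
//!
//! // Hypothetical stand-in for a compiled single-bit step.
//! struct CompiledStep {
//!     bit: usize,
//! }
//!
//! // Number of single-bit steps; 8 is an assumption for this sketch.
//! const STEP_COUNT: usize = 8;
//!
//! // One lazily initialized slot per bit position.
//! const EMPTY_SLOT: OnceLock<CompiledStep> = OnceLock::new();
//! static STEP_CACHE: [OnceLock<CompiledStep>; STEP_COUNT] = [EMPTY_SLOT; STEP_COUNT];
//!
//! // Returns the cached step for `bit`, building it on first use only.
//! // Later calls for the same bit return the same `'static` reference.
//! fn cached_step(bit: usize) -> &'static CompiledStep {
//!     STEP_CACHE[bit].get_or_init(|| CompiledStep { bit })
//! }
//! ```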
pub
pub
pub
use std::borrow::Cow;
pub use ProcessType;
use get_transform_step;
/// Applies a composite [`ProcessType`] pipeline to `text` and returns the final
/// result.
///
/// Steps run in [`ProcessType::iter`] order (ascending bit position). If no
/// step changes the text, the return value borrows directly from `text` (zero
/// allocation). When one or more steps produce changes, intermediate
/// allocations are recycled through the thread-local string pool so only the
/// final result is returned as `Cow::Owned`.
///
/// Use this function when only the final transformed text is needed; to keep
/// the intermediate forms as well, see [`reduce_text_process`].
///
/// # Examples
///
/// ```rust
/// use matcher_rs::{ProcessType, text_process};
///
/// // VariantNorm normalizes CJK variants; Delete removes punctuation.
/// let processed = text_process(ProcessType::VariantNorm | ProcessType::Delete, "測!試");
/// assert_eq!(processed, "测试");
///
/// // No-op when the text has nothing to transform.
/// let unchanged = text_process(ProcessType::VariantNorm, "hello");
/// assert_eq!(unchanged, "hello");
/// // Borrowed — no allocation occurred.
/// assert!(matches!(unchanged, std::borrow::Cow::Borrowed(_)));
/// ```
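///
/// The borrow-when-unchanged behavior can be illustrated without the crate.
/// The following is a minimal sketch, assuming each step reports a change by
/// returning `Some(new_text)`; `Step`, `delete_punct`, `lowercase`, and
/// `run_pipeline` are hypothetical names, the two toy steps are not the
/// crate's transformations, and the thread-local string pool is not modeled.
/// With these toy steps, `"He,llo!"` yields an owned `"hello"`, while
/// `"hello"` comes back as `Cow::Borrowed`.
///
/// ```rust
/// use std::borrow::Cow;
///
/// // Hypothetical step type: `Some(new_text)` only when the input changed.
/// type Step = fn(&str) -> Option<String>;
///
/// // Toy "delete" step: drops ASCII punctuation.
/// fn delete_punct(s: &str) -> Option<String> {
///     s.contains(|c: char| c.is_ascii_punctuation())
///         .then(|| s.chars().filter(|c| !c.is_ascii_punctuation()).collect())
/// }
///
/// // Toy "normalize" step: lowercases ASCII letters.
/// fn lowercase(s: &str) -> Option<String> {
///     s.contains(|c: char| c.is_ascii_uppercase())
///         .then(|| s.to_ascii_lowercase())
/// }
///
/// // Runs the steps in order. The result stays borrowed from `text` until
/// // some step actually changes it.
/// fn run_pipeline<'a>(steps: &[Step], text: &'a str) -> Cow<'a, str> {
///     let mut current = Cow::Borrowed(text);
///     for step in steps {
///         if let Some(changed) = step(&current) {
///             current = Cow::Owned(changed);
///         }
///     }
///     current
/// }
/// ```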
/// Applies a composite [`ProcessType`] pipeline to `text`, recording every
/// intermediate change.
///
/// Returns a `Vec` whose first element is always the original `text`
/// (borrowed). Each subsequent element is the output of a step that actually
/// changed the text; steps that leave the text unchanged are skipped. The final
/// element is therefore the fully transformed result.
///
/// This is useful for inspecting how each stage transforms the input, or for
/// collecting all intermediate forms that should be indexed.
///
/// # Examples
///
/// ```rust
/// use matcher_rs::{ProcessType, reduce_text_process};
///
/// // VariantNormDeleteNormalize = VariantNorm | Delete | Normalize, applied in that order.
/// let variants = reduce_text_process(ProcessType::VariantNormDeleteNormalize, "~測~A~");
/// // First entry is always the original input.
/// assert_eq!(variants[0], "~測~A~");
/// // Last entry is the fully transformed result.
/// assert_eq!(variants.last().unwrap(), "测a");
/// ```
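///
/// The recording rule (original first, then one entry per step that changed
/// the text) can be sketched independently of the crate. This is an
/// illustration under assumptions, not the actual implementation: `Step`,
/// `swap_variant`, `strip_tildes`, `lowercase`, and `record_changes` are
/// hypothetical, and the toy steps only imitate the real transformations.
/// Applied in that order to `"~測~A~"`, the sketch records four entries:
/// `"~測~A~"`, `"~测~A~"`, `"测A"`, and `"测a"`.
///
/// ```rust
/// use std::borrow::Cow;
///
/// // Hypothetical step type: `Some(new_text)` only when the input changed.
/// type Step = fn(&str) -> Option<String>;
///
/// // Toy replace step: maps one Traditional character to its Simplified form.
/// fn swap_variant(s: &str) -> Option<String> {
///     s.contains('測').then(|| s.replace('測', "测"))
/// }
///
/// // Toy delete step: removes tildes.
/// fn strip_tildes(s: &str) -> Option<String> {
///     s.contains('~').then(|| s.replace('~', ""))
/// }
///
/// // Toy normalize step: lowercases ASCII letters.
/// fn lowercase(s: &str) -> Option<String> {
///     s.contains(|c: char| c.is_ascii_uppercase())
///         .then(|| s.to_ascii_lowercase())
/// }
///
/// // Keeps the borrowed original as entry 0 and appends the output of every
/// // step that changed the text; unchanged steps add nothing.
/// fn record_changes<'a>(steps: &[Step], text: &'a str) -> Vec<Cow<'a, str>> {
///     let mut variants = vec![Cow::Borrowed(text)];
///     for step in steps {
///         let changed = step(variants.last().expect("vector starts non-empty"));
///         if let Some(next) = changed {
///             variants.push(Cow::Owned(next));
///         }
///     }
///     variants
/// }
/// ```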
/// Like [`reduce_text_process`], but merges replace-type steps in place.
///
/// This variant is used during matcher construction to keep only the strings
/// that the Aho-Corasick automaton will actually scan at match time.
/// Replace-style steps (VariantNorm, Normalize, Romanize, RomanizeChar)
/// overwrite the last entry rather than appending, because the pre-replacement
/// form is never scanned separately. Delete steps still append because deletion
/// changes which character sequences are adjacent, affecting which patterns can
/// match.
///
/// The result therefore contains fewer entries than [`reduce_text_process`]:
/// one entry per "scan boundary" rather than one per transformation step.
///
/// # Examples
///
/// ```rust
/// use matcher_rs::{ProcessType, reduce_text_process_emit};
///
/// // VariantNormDeleteNormalize = VariantNorm | Delete | Normalize.
/// let variants = reduce_text_process_emit(ProcessType::VariantNormDeleteNormalize, "~測~A~");
/// // Only two entries: VariantNorm overwrites the original, then Delete appends.
/// // The Normalize step overwrites the Delete entry in place.
/// assert_eq!(variants.len(), 2);
/// assert_eq!(variants[0], "~测~A~"); // after VariantNorm (replace, overwrites original)
/// assert_eq!(variants[1], "测a"); // after Delete+Normalize
/// ```
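///
/// The overwrite-versus-append rule can be sketched without the crate. The
/// code below is a hedged illustration, not the actual implementation: `Kind`,
/// `Step`, `record_scan_variants`, and the three toy steps are hypothetical
/// names that only imitate VariantNorm, Delete, and Normalize. With the steps
/// in that order and the input `"~測~A~"`, the sketch produces two entries,
/// `"~测~A~"` and `"测a"`.
///
/// ```rust
/// use std::borrow::Cow;
///
/// // Hypothetical classification of a step.
/// enum Kind {
///     Replace,
///     Delete,
/// }
///
/// // Hypothetical step: its kind plus a function that returns
/// // `Some(new_text)` only when the input changed.
/// struct Step {
///     kind: Kind,
///     apply: fn(&str) -> Option<String>,
/// }
///
/// // Toy replace step: maps one Traditional character to its Simplified form.
/// fn swap_variant(s: &str) -> Option<String> {
///     s.contains('測').then(|| s.replace('測', "测"))
/// }
///
/// // Toy delete step: removes tildes.
/// fn strip_tildes(s: &str) -> Option<String> {
///     s.contains('~').then(|| s.replace('~', ""))
/// }
///
/// // Toy replace step: lowercases ASCII letters.
/// fn lowercase(s: &str) -> Option<String> {
///     s.contains(|c: char| c.is_ascii_uppercase())
///         .then(|| s.to_ascii_lowercase())
/// }
///
/// // Replace steps overwrite the last entry; delete steps append a new one.
/// fn record_scan_variants<'a>(steps: &[Step], text: &'a str) -> Vec<Cow<'a, str>> {
///     let mut variants = vec![Cow::Borrowed(text)];
///     for step in steps {
///         let last = variants.len() - 1;
///         if let Some(next) = (step.apply)(&variants[last]) {
///             match step.kind {
///                 Kind::Replace => variants[last] = Cow::Owned(next),
///                 Kind::Delete => variants.push(Cow::Owned(next)),
///             }
///         }
///     }
///     variants
/// }
/// ```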