//! High-performance multi-pattern text matcher with logical operators and
//! transformation pipelines.
//!
//! `matcher_rs` is designed for rule matching tasks where plain substring
//! search is too rigid. A rule can combine multiple sub-patterns, veto on other
//! sub-patterns, and match against raw text, transformed text, or both.
//!
//! The crate is built around three ideas:
//!
//! - **Logical operators**: Rules can require co-occurrence of sub-patterns
//!   (`&`) or veto a match when a sub-pattern is present (`~`).
//! - **Transformation pipelines**: Input can be matched after
//!   Traditional→Simplified CJK variant normalization
//!   ([`ProcessType::VariantNorm`]), deletion of configured codepoints
//!   ([`ProcessType::Delete`]), replacement-table normalization
//!   ([`ProcessType::Normalize`]), and CJK romanization
//!   ([`ProcessType::Romanize`] / [`ProcessType::RomanizeChar`]).
//! - **Two-pass evaluation**: Construction deduplicates emitted patterns and
//!   partitions them into ASCII and charwise matcher engines. Search walks the
//!   needed transform tree once, scans each produced text variant, then
//!   evaluates only touched rules.
//!
//! # Quick Start
//!
//! ```rust
//! use matcher_rs::{ProcessType, SimpleMatcherBuilder};
//!
//! let matcher = SimpleMatcherBuilder::new()
//!     .add_word(ProcessType::None, 1, "hello")
//!     // Matches after converting Traditional Chinese and removing noise chars
//!     .add_word(ProcessType::VariantNormDeleteNormalize, 2, "你好")
//!     // Both sub-patterns must appear in the text
//!     .add_word(ProcessType::None, 3, "apple&pie")
//!     // "banana" matches only when "peel" is absent
//!     .add_word(ProcessType::None, 4, "banana~peel")
//!     .build()
//!     .unwrap();
//!
//! assert!(matcher.is_match("hello world"));
//! assert!(matcher.is_match("apple and pie"));
//! assert!(!matcher.is_match("banana peel")); // vetoed by ~peel
//!
//! let results = matcher.process("hello world");
//! assert_eq!(results[0].word_id, 1);
//! ```
//!
//! Composite [`ProcessType`] values can also include [`ProcessType::None`] to
//! match against both the raw text and a transformed variant. For example, a
//! rule with `ProcessType::None | ProcessType::Romanize` can satisfy one
//! sub-pattern directly from the input and another via CJK romanization during
//! the same search.
//!
//! # Safety
//!
//! This crate uses `unsafe` in the following categories:
//!
//! ## Thread-local state via `#[thread_local]` + `UnsafeCell`
//!
//! | Static | Location |
//! |--------|----------|
//! | `SIMPLE_MATCH_STATE` | `simple_matcher/state.rs` |
//! | `STRING_POOL` | `process/string_pool.rs` |
//!
//! These use `#[thread_local]` + `UnsafeCell` instead of the `thread_local!`
//! macro to avoid per-access closure overhead. Safety relies on two invariants:
//! (1) `#[thread_local]` guarantees single-threaded access — no data races.
//! (2) No public function is re-entrant: the borrow from `UnsafeCell::get()` is
//! always dropped before any call that could re-enter the same pool.
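//!
//! As a minimal sketch of invariant (2), the example below keeps each
//! `UnsafeCell` borrow inside a scope that calls nothing else touching the
//! pool. `POOL`, `take_buffer`, and `return_buffer` are hypothetical names used
//! for illustration only, and the sketch relies on the stable `thread_local!`
//! macro rather than the `#[thread_local]` attribute, so it shows the borrow
//! discipline and not this crate's actual statics.
//!
//! ```rust
//! use std::cell::UnsafeCell;
//!
//! thread_local! {
//!     // Hypothetical per-thread pool of reusable buffers (illustration only).
//!     static POOL: UnsafeCell<Vec<String>> = const { UnsafeCell::new(Vec::new()) };
//! }
//!
//! /// Pops a pooled buffer, or creates an empty one if the pool is empty.
//! fn take_buffer() -> String {
//!     POOL.with(|cell| {
//!         // SAFETY: the pool is thread-local, and this `&mut` is dropped at
//!         // the end of the closure, which calls nothing that touches `POOL`.
//!         let pool = unsafe { &mut *cell.get() };
//!         pool.pop().unwrap_or_default()
//!     })
//! }
//!
//! /// Clears a buffer and hands it back to the pool for reuse.
//! fn return_buffer(mut buf: String) {
//!     buf.clear();
//!     POOL.with(|cell| {
//!         // SAFETY: same argument as in `take_buffer`.
//!         let pool = unsafe { &mut *cell.get() };
//!         pool.push(buf);
//!     });
//! }
//!
//! fn main() {
//!     let mut buf = take_buffer();
//!     buf.push_str("scratch");
//!     let capacity = buf.capacity();
//!     return_buffer(buf);
//!
//!     // The same allocation comes back, emptied but with its capacity kept.
//!     let reused = take_buffer();
//!     assert!(reused.is_empty());
//!     assert_eq!(reused.capacity(), capacity);
//! }
//! ```
//!
//! If `take_buffer` were called while a `&mut` to the pool was still live, two
//! mutable references would alias, which is the situation invariant (2) rules
//! out.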
//!
//! ## Bounds-elided indexing
//!
//! Hot loops use `get_unchecked` / `get_unchecked_mut` to avoid repeated bounds
//! checks on indices that are structurally guaranteed in-bounds by construction
//! (e.g. automaton values, rule indices). Every such site communicates the
//! invariant to the optimizer via [`core::hint::assert_unchecked`].
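//!
//! The pattern can be sketched as follows. `sum_at` is a hypothetical helper
//! written for this illustration and is not part of the crate; it only assumes
//! that the caller has already established the in-bounds invariant.
//!
//! ```rust
//! use core::hint::assert_unchecked;
//!
//! /// Sums `values[i]` for every `i` in `indices` without bounds checks.
//! ///
//! /// # Safety
//! ///
//! /// Every element of `indices` must be less than `values.len()`.
//! unsafe fn sum_at(values: &[u32], indices: &[usize]) -> u32 {
//!     let mut total = 0u32;
//!     for &i in indices {
//!         // SAFETY: the caller guarantees `i < values.len()`.
//!         unsafe {
//!             // State the invariant for the optimizer, then index unchecked.
//!             assert_unchecked(i < values.len());
//!             total = total.wrapping_add(*values.get_unchecked(i));
//!         }
//!     }
//!     total
//! }
//!
//! fn main() {
//!     let values = [10, 20, 30];
//!     // These indices are produced alongside `values`, so they are in-bounds
//!     // by construction.
//!     let indices = [0, 2, 2];
//!     // SAFETY: every index is below `values.len()`, which is 3.
//!     let total = unsafe { sum_at(&values, &indices) };
//!     assert_eq!(total, 70);
//! }
//! ```
//!
//! Violating the stated condition is undefined behavior, so a helper like this
//! is only sound when the indices come from a trusted construction step.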
//!
//! # Feature Flags
//!
//! | Flag | Default | Effect |
//! |------|---------|--------|
//! | `perf` | on | Meta-feature enabling `dfa + simd_runtime_dispatch` |
//! | `dfa` | via `perf` | Enables `aho-corasick` DFA mode in the places where this crate chooses it; other paths still use `daachorse`-backed matchers |
//! | `simd_runtime_dispatch` | via `perf` | Selects the best available transform kernel at runtime (`AVX2` on x86-64, `NEON` on ARM64, portable fallback elsewhere) |
//! | `serde` | off | Enables `Serialize`/`Deserialize` impls for [`ProcessType`] and `Serialize` for [`SimpleResult`] |
/// Uses [`mimalloc`](https://github.com/purpleprotocol/mimalloc_rust) as the global allocator.
///
/// `mimalloc` was chosen because `SimpleMatcher` scanning relies heavily on
/// thread-local buffer pools and short-lived allocations during text
/// transformation. `mimalloc` provides lower fragmentation under these
/// allocation patterns and significantly better multi-threaded throughput
/// compared to the system allocator, especially on workloads where many threads
/// match concurrently.
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
use std::fmt;
/// Error returned when [`SimpleMatcher`] construction fails.
///
/// Each variant describes a specific failure mode. The enum is
/// `#[non_exhaustive]`, so new variants may be added in future minor releases
/// without breaking callers who use a wildcard arm.
///
/// # When does construction fail?
///
/// - **Empty pattern set**: no patterns remain after parsing (all entries were
///   empty strings or pure-NOT rules).
/// - **Invalid [`ProcessType`] bits**: the caller passed a bitflag value with
///   undefined bits (bits 6–7) set.
/// - **Automaton build failure**: the underlying Aho-Corasick libraries
///   (`daachorse` or `aho-corasick`) rejected the compiled pattern set (e.g.,
///   the pattern set exceeded internal capacity limits).
///
/// # Examples
///
/// ```rust
/// use std::collections::HashMap;
///
/// use matcher_rs::{SimpleMatcher, SimpleTable};
///
/// // Empty tables are rejected.
/// let empty: SimpleTable = HashMap::new();
/// assert!(SimpleMatcher::new(&empty).is_err());
/// ```
pub use SimpleMatcherBuilder;
pub use ;
pub use ;