//! High-performance multi-pattern text matcher with logical operators and transformation pipelines.
//!
//! `matcher_rs` is designed for rule-matching tasks where plain substring search is too rigid.
//! A rule can combine multiple sub-patterns, be vetoed by other sub-patterns, and match against
//! raw text, transformed text, or both.
//!
//! The crate is built around three ideas:
//!
//! - **Logical operators**: Rules can require co-occurrence of sub-patterns (`&`) or
//!   veto a match when a sub-pattern is present (`~`).
//! - **Transformation pipelines**: Input can be matched after Traditional→Simplified
//!   Chinese conversion ([`ProcessType::Fanjian`]), deletion of configured codepoints
//!   ([`ProcessType::Delete`]), replacement-table normalization ([`ProcessType::Normalize`]),
//!   and Pinyin transliteration ([`ProcessType::PinYin`] / [`ProcessType::PinYinChar`]).
//! - **Two-pass evaluation**: Construction deduplicates emitted patterns and partitions them
//!   into ASCII and charwise matcher engines. Search walks the needed transform tree once,
//!   scans each produced text variant, then evaluates only touched rules.
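//!
//! The operator semantics can be sketched without the crate. The block below is a
//! minimal illustration, not the crate's implementation: `rule_matches` is a
//! hypothetical helper that uses naive substring search and assumes every `&` term
//! appears before the first `~` term.
//!
//! ```rust
//! /// Hypothetical helper: evaluates a rule such as "apple&pie" or "banana~peel".
//! fn rule_matches(rule: &str, text: &str) -> bool {
//!     // Everything before the first `~` is required; each later segment is a veto.
//!     let mut parts = rule.split('~');
//!     let required = parts.next().unwrap_or("");
//!     // Every `&`-joined sub-pattern must occur somewhere in the text.
//!     let all_present = required.split('&').all(|p| text.contains(p));
//!     // No veto sub-pattern may occur anywhere in the text.
//!     let none_vetoed = !parts.any(|p| text.contains(p));
//!     all_present && none_vetoed
//! }
//! ```
//!
//! Under these assumptions, `rule_matches("banana~peel", "banana peel")` is `false`
//! because the veto term is present, while the same rule accepts `"banana split"`.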
//!
//! # Quick Start
//!
//! ```rust
//! use matcher_rs::{SimpleMatcherBuilder, ProcessType};
//!
//! let matcher = SimpleMatcherBuilder::new()
//!     .add_word(ProcessType::None, 1, "hello")
//!     // Matches after Traditional→Simplified conversion, noise-character deletion, and normalization
//!     .add_word(ProcessType::FanjianDeleteNormalize, 2, "你好")
//!     // Both sub-patterns must appear in the text
//!     .add_word(ProcessType::None, 3, "apple&pie")
//!     // "banana" matches only when "peel" is absent
//!     .add_word(ProcessType::None, 4, "banana~peel")
//!     .build()
//!     .unwrap();
//!
//! assert!(matcher.is_match("hello world"));
//! assert!(matcher.is_match("apple and pie"));
//! assert!(!matcher.is_match("banana peel")); // vetoed by ~peel
//!
//! let results = matcher.process("hello world");
//! assert_eq!(results[0].word_id, 1);
//! ```
//!
//! Composite [`ProcessType`] values can also include [`ProcessType::None`] to match
//! against both the raw text and a transformed variant. For example, a rule with
//! `ProcessType::None | ProcessType::PinYin` can satisfy one sub-pattern directly from
//! the input and another via Pinyin transliteration during the same search.
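//!
//! One way to picture a composite value is as a set of bit flags, each selecting one
//! text variant to scan. The sketch below is an assumption-laden toy, not the crate's
//! definition: `Flags`, `toy_translit`, `variants`, and `satisfied` are hypothetical
//! names, the bit values are invented, and the two-entry transliteration table merely
//! stands in for a real Pinyin map.
//!
//! ```rust
//! use std::ops::BitOr;
//!
//! /// Hypothetical stand-in for a bit-flag process type.
//! #[derive(Clone, Copy, PartialEq, Eq, Debug)]
//! struct Flags(u8);
//!
//! impl Flags {
//!     const RAW: Flags = Flags(0b01); // scan the input as given
//!     const TRANSLIT: Flags = Flags(0b10); // scan a transliterated variant
//!
//!     fn contains(self, other: Flags) -> bool {
//!         self.0 & other.0 == other.0
//!     }
//! }
//!
//! impl BitOr for Flags {
//!     type Output = Flags;
//!     fn bitor(self, rhs: Flags) -> Flags {
//!         Flags(self.0 | rhs.0)
//!     }
//! }
//!
//! /// Toy transliteration covering two characters only.
//! fn toy_translit(text: &str) -> String {
//!     text.chars()
//!         .map(|c| match c {
//!             '你' => "ni".to_string(),
//!             '好' => "hao".to_string(),
//!             other => other.to_string(),
//!         })
//!         .collect()
//! }
//!
//! /// Returns every text variant selected by `flags`.
//! fn variants(flags: Flags, text: &str) -> Vec<String> {
//!     let mut out = Vec::new();
//!     if flags.contains(Flags::RAW) {
//!         out.push(text.to_string());
//!     }
//!     if flags.contains(Flags::TRANSLIT) {
//!         out.push(toy_translit(text));
//!     }
//!     out
//! }
//!
//! /// A sub-pattern is satisfied if any selected variant contains it.
//! fn satisfied(flags: Flags, text: &str, sub_pattern: &str) -> bool {
//!     variants(flags, text).iter().any(|v| v.contains(sub_pattern))
//! }
//! ```
//!
//! With `Flags::RAW | Flags::TRANSLIT`, the input `"你好 world"` satisfies both the
//! sub-pattern `"你好"` (from the raw text) and `"nihao"` (from the toy variant).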
//!
//! # Safety
//!
//! This crate uses `unsafe` in three categories:
//!
//! ## Thread-local state via `#[thread_local]` + `UnsafeCell`
//!
//! | Static | Location |
//! |--------|----------|
//! | `SIMPLE_MATCH_STATE` | `simple_matcher/state.rs` |
//! | `STRING_POOL` | `process/variant.rs` |
//! | `TRANSFORM_STATE` | `process/variant.rs` |
//!
//! These use `#[thread_local]` + `UnsafeCell` instead of the `thread_local!` macro
//! to avoid per-access closure overhead. Safety relies on two invariants:
//! (1) `#[thread_local]` guarantees single-threaded access — no data races.
//! (2) No public function is re-entrant: the borrow from `UnsafeCell::get()` is
//! always dropped before any call that could re-enter the same pool.
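//!
//! For comparison, here is a stable-Rust sketch of the same pooling idea using the
//! `thread_local!` macro and `RefCell`. This is the closure-based form the crate
//! avoids, shown only to illustrate the second invariant; `POOL`, `take_buffer`, and
//! `return_buffer` are hypothetical names.
//!
//! ```rust
//! use std::cell::RefCell;
//!
//! thread_local! {
//!     // Hypothetical pool; each thread gets its own independent instance.
//!     static POOL: RefCell<Vec<String>> = RefCell::new(Vec::new());
//! }
//!
//! /// Takes a reusable buffer from this thread's pool, or allocates a new one.
//! fn take_buffer() -> String {
//!     // The borrow ends inside the closure, before the caller runs any other
//!     // code, so nothing can re-enter the pool while it is borrowed.
//!     POOL.with(|pool| pool.borrow_mut().pop()).unwrap_or_default()
//! }
//!
//! /// Clears a buffer and returns it to this thread's pool, keeping its capacity.
//! fn return_buffer(mut buf: String) {
//!     buf.clear();
//!     POOL.with(|pool| pool.borrow_mut().push(buf));
//! }
//! ```
//!
//! `RefCell` turns a violated re-entrancy invariant into a panic at runtime, whereas
//! the `UnsafeCell` approach relies on the invariant being upheld by code structure.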
//!
//! ## Bounds-elided indexing
//!
//! Hot loops use `get_unchecked` / `get_unchecked_mut` to avoid repeated bounds
//! checks on indices that are structurally guaranteed in-bounds by construction
//! (e.g. automaton values, rule indices). Every such site is guarded by a
//! `debug_assert!` that validates the index in debug builds.
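//!
//! The pattern can be illustrated with a small sketch. `RuleTable` is a hypothetical
//! type, not one of the crate's: it validates its indices once at construction, so
//! the hot loop can skip bounds checks while a `debug_assert!` still guards the
//! access in debug builds.
//!
//! ```rust
//! /// Hypothetical table whose indices are validated once, at construction.
//! struct RuleTable {
//!     weights: Vec<u32>,
//!     hits: Vec<usize>, // invariant: every entry is < weights.len()
//! }
//!
//! impl RuleTable {
//!     /// Returns `None` if any index is out of range, so the invariant holds
//!     /// for every `RuleTable` that exists.
//!     fn new(weights: Vec<u32>, hits: Vec<usize>) -> Option<Self> {
//!         if hits.iter().all(|&i| i < weights.len()) {
//!             Some(Self { weights, hits })
//!         } else {
//!             None
//!         }
//!     }
//!
//!     /// Sums the weights of all hits without per-access bounds checks.
//!     fn total(&self) -> u32 {
//!         let mut sum = 0u32;
//!         for &i in &self.hits {
//!             // Checked in debug builds only; compiled out in release builds.
//!             debug_assert!(i < self.weights.len());
//!             // SAFETY: `new` rejected any index >= weights.len(), and neither
//!             // vector is modified after construction.
//!             sum = sum.wrapping_add(unsafe { *self.weights.get_unchecked(i) });
//!         }
//!         sum
//!     }
//! }
//! ```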
//!
//! ## Lifetime transmute in buffer pooling
//!
//! `return_processed_string_to_pool` (`process/variant.rs`) transmutes an empty
//! `Vec<TextVariant<'_>>` to `Vec<TextVariant<'static>>` after draining all
//! elements. This is sound because an empty `Vec` holds no values — the lifetime
//! parameter exists only at the type level and has no runtime representation.
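//!
//! A reduced sketch of the same idea, using `Vec<&str>` in place of
//! `Vec<TextVariant<'_>>`. The function name `recycle` is hypothetical, and the
//! crate's actual pooling code may differ in its details.
//!
//! ```rust
//! /// Empties a vector of borrowed strings and returns its allocation retyped to
//! /// `'static`, so that it could be stored in a long-lived pool.
//! fn recycle<'a>(mut buf: Vec<&'a str>) -> Vec<&'static str> {
//!     // Drop every borrowed element first; only the allocation remains.
//!     buf.clear();
//!     // SAFETY: the vector is empty, so no `&'a str` value is reinterpreted.
//!     // The two types differ only in a lifetime, which does not affect layout.
//!     unsafe { std::mem::transmute::<Vec<&'a str>, Vec<&'static str>>(buf) }
//! }
//! ```
//!
//! Going the other way needs no `unsafe`: `Vec<&'static str>` coerces to
//! `Vec<&'a str>` through ordinary lifetime covariance when the buffer is reused.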
//!
//! # Feature Flags
//!
//! | Flag | Default | Effect |
//! |------|---------|--------|
//! | `dfa` | on | Enables `aho-corasick` DFA mode in the places where this crate chooses it; other paths still use `daachorse`-backed matchers |
//! | `simd_runtime_dispatch` | on | Selects the best available transform kernel at runtime (`AVX2` on x86-64, `NEON` on ARM64, portable fallback elsewhere) |
//! | `runtime_build` | off | Parses the source transform maps at runtime instead of loading build-time artifacts lazily on first use |
/// Uses [`mimalloc`](https://github.com/purpleprotocol/mimalloc_rust) as the global allocator.
///
/// `mimalloc` was chosen because `SimpleMatcher` scanning relies heavily on thread-local
/// buffer pools and short-lived allocations during text transformation. `mimalloc`
/// provides lower fragmentation under these allocation patterns and significantly better
/// multi-threaded throughput compared to the system allocator, especially on workloads
/// where many threads match concurrently.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
use fmt;
/// Error returned when [`SimpleMatcher`] construction fails.
///
/// Currently this only wraps automaton-build errors from the underlying
/// Aho-Corasick libraries. The type is an opaque struct (not an enum) to
/// avoid coupling the public API to third-party error types, and to allow
/// adding new error variants in the future without breaking callers.
pub use SimpleMatcherBuilder;
pub use ;
pub use ;