//! High-performance multi-pattern text matcher with logical operators and
//! transformation pipelines.
//!
//! `matcher_rs` is designed for rule matching tasks where plain substring
//! search is too rigid. A rule can combine multiple sub-patterns, veto on other
//! sub-patterns, and match against raw text, transformed text, or both.
//!
//! The crate is built around three ideas:
//!
//! - **Logical operators** — Rules can require co-occurrence of sub-patterns
//! (`&`) or veto a match when a sub-pattern is present (`~`).
//! - **Transformation pipelines** — Input can be matched after
//! Traditional→Simplified CJK variant normalization
//! ([`ProcessType::VariantNorm`]), deletion of configured codepoints
//! ([`ProcessType::Delete`]), replacement-table normalization
//! ([`ProcessType::Normalize`]), and CJK romanization
//! ([`ProcessType::Romanize`] / [`ProcessType::RomanizeChar`]).
//! - **Two-pass evaluation** — Construction deduplicates emitted patterns and
//! partitions them into ASCII and charwise matcher engines. Search walks the
//! needed transform tree once, scans each produced text variant, then
//! evaluates only touched rules.
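//!
//! The operator semantics can be illustrated with a deliberately naive,
//! brute-force sketch. It is not how this crate evaluates rules (there is no
//! automaton, deduplication, or transformation here). `naive_rule_match` is a
//! hypothetical helper, and the sketch assumes that the first segment is
//! required and that the operator preceding each later segment decides
//! whether it is required (`&`) or a veto (`~`).
//!
//! ```rust
//! /// Hypothetical helper: brute-force check of one rule against one text.
//! fn naive_rule_match(rule: &str, text: &str) -> bool {
//!     let mut veto = false; // assumption: the first segment is a required one
//!     let mut start = 0;
//!     // Cut a segment at every `&` or `~`; a trailing sentinel closes the last one.
//!     for (i, c) in rule.char_indices().chain([(rule.len(), '&')]) {
//!         if c != '&' && c != '~' {
//!             continue;
//!         }
//!         let segment = &rule[start..i];
//!         // A segment is present if any non-empty `|` alternative occurs in the text.
//!         let present = segment
//!             .split('|')
//!             .any(|alt| !alt.is_empty() && text.contains(alt));
//!         // Fail when a required segment is missing or a veto segment is present.
//!         if present == veto {
//!             return false;
//!         }
//!         veto = c == '~';
//!         start = i + 1; // `&` and `~` are one byte each
//!     }
//!     true
//! }
//! ```
//!
//! Under these assumptions, `naive_rule_match("apple&pie", "apple and pie")`
//! is `true` and `naive_rule_match("banana~peel", "banana peel")` is `false`.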
//!
//! # Quick Start
//!
//! ```rust
//! use matcher_rs::{ProcessType, SimpleMatcherBuilder};
//!
//! let matcher = SimpleMatcherBuilder::new()
//! .add_word(ProcessType::None, 1, "hello")
//! // Matches after converting Traditional Chinese and removing noise chars
//! .add_word(ProcessType::VariantNormDeleteNormalize, 2, "你好")
//! // Both sub-patterns must appear in the text
//! .add_word(ProcessType::None, 3, "apple&pie")
//! // "banana" matches only when "peel" is absent
//! .add_word(ProcessType::None, 4, "banana~peel")
//! .build()
//! .unwrap();
//!
//! assert!(matcher.is_match("hello world"));
//! assert!(matcher.is_match("apple and pie"));
//! assert!(!matcher.is_match("banana peel")); // vetoed by ~peel
//!
//! let results = matcher.process("hello world");
//! assert_eq!(results[0].word_id, 1);
//! ```
//!
//! Composite [`ProcessType`] values can also include [`ProcessType::None`] to
//! match against both the raw text and a transformed variant. For example, a
//! rule with `ProcessType::None | ProcessType::Romanize` can satisfy one
//! sub-pattern directly from the input and another via CJK romanization during
//! the same search.
//!
//! # Safety
//!
//! This crate uses `unsafe` in the following categories:
//!
//! ## Thread-local state via `#[thread_local]` + `UnsafeCell`
//!
//! | Static | Location |
//! |--------|----------|
//! | `SIMPLE_MATCH_STATE` | `simple_matcher/state.rs` |
//!
//! `SIMPLE_MATCH_STATE` uses `#[thread_local]` + `UnsafeCell` instead of the `thread_local!`
//! macro to avoid per-access closure overhead. Safety relies on two invariants:
//! (1) `#[thread_local]` guarantees single-threaded access — no data races.
//! (2) No public function is re-entrant: the borrow from `UnsafeCell::get()` is
//! always dropped before any call that could re-enter the same state.
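//!
//! The borrow discipline in (2) can be sketched on stable Rust. This sketch
//! swaps in the `thread_local!` macro, because `#[thread_local]` is a
//! nightly-only attribute. `ScanState`, `STATE`, and `record_hits` are
//! hypothetical names, not this crate's internals.
//!
//! ```rust
//! use std::cell::UnsafeCell;
//!
//! /// Hypothetical per-thread scratch state reused across scans.
//! struct ScanState {
//!     hits: Vec<u32>,
//! }
//!
//! thread_local! {
//!     static STATE: UnsafeCell<ScanState> =
//!         const { UnsafeCell::new(ScanState { hits: Vec::new() }) };
//! }
//!
//! /// Stores `ids` in the thread's scratch buffer and returns how many were stored.
//! fn record_hits(ids: &[u32]) -> usize {
//!     STATE.with(|cell| {
//!         // SAFETY: the cell is thread-local, so no other thread can reach it,
//!         // and nothing called while `state` is live touches `STATE` again.
//!         let state = unsafe { &mut *cell.get() };
//!         state.hits.clear();
//!         state.hits.extend_from_slice(ids);
//!         state.hits.len()
//!     })
//! }
//! ```
//!
//! The mutable borrow ends before `record_hits` returns, so a later call on
//! the same thread never observes an outstanding reference.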
//!
//! ## Bounds-elided indexing
//!
//! Hot loops use `get_unchecked` / `get_unchecked_mut` to avoid repeated bounds
//! checks on indices that are structurally guaranteed in-bounds by construction
//! (e.g. automaton values, rule indices). Every such site communicates the
//! invariant to the optimizer via [`core::hint::assert_unchecked`].
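//!
//! A minimal sketch of the pattern, assuming a toolchain on which
//! `core::hint::assert_unchecked` is stable. `sum_indexed` is a hypothetical
//! function, not code from this crate.
//!
//! ```rust
//! use core::hint::assert_unchecked;
//!
//! /// Hypothetical hot loop: sums `values[i]` for each `i` in `indices`.
//! ///
//! /// # Safety
//! /// Every element of `indices` must be less than `values.len()`.
//! unsafe fn sum_indexed(values: &[u32], indices: &[usize]) -> u32 {
//!     let mut total = 0u32;
//!     for &i in indices {
//!         // SAFETY: guaranteed by the caller's contract documented above.
//!         unsafe {
//!             // State the invariant for the optimizer, then skip the bounds check.
//!             assert_unchecked(i < values.len());
//!             total = total.wrapping_add(*values.get_unchecked(i));
//!         }
//!     }
//!     total
//! }
//! ```
//!
//! The invariant lives in the function's safety contract, so the caller that
//! builds `indices` is the one place responsible for upholding it.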
//!
//! # Feature Flags
//!
//! | Flag | Default | Effect |
//! |------|---------|--------|
//! | `perf` | on | Meta-feature enabling `dfa + simd_runtime_dispatch` |
//! | `dfa` | via `perf` | Enables `aho-corasick` DFA mode in the places where this crate chooses it; other paths still use `daachorse`-backed matchers |
//! | `simd_runtime_dispatch` | via `perf` | Selects the best available transform kernel at runtime (`AVX2` on x86-64, `NEON` on ARM64, portable fallback elsewhere) |
//! | `serde` | off | Enables `Serialize`/`Deserialize` impls for [`ProcessType`] and `Serialize` for [`SimpleResult`] |
//!
//! # Terminology
//!
//! | Term | Meaning |
//! |------|---------|
//! | **Rule** | A user-supplied pattern string, possibly with `&` (AND), `~` (NOT), `\|` (OR) operators. Identified by a caller-chosen `word_id`. |
//! | **Segment** | One sub-pattern within a rule, delimited by `&` or `~`. A segment may contain `\|`-separated alternatives. |
//! | **Pattern** | A deduplicated sub-pattern string stored in the AC automaton. Multiple rules may share the same pattern. |
//! | **Variant** | One transformed form of the input text (e.g., after VariantNorm, after Delete). Each variant gets a unique index. |
//! | **Generation** | A monotonic `u16` counter enabling O(1) amortized state reset between scans. Wraps every ~65K scans. |
//! | **Direct encoding** | Bit-packing a single-entry pattern's metadata into the automaton value, bypassing entry-table indirection. See `simple_matcher::pattern`. |
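//!
//! The **Generation** idea can be sketched as follows. `TouchedRules` and its
//! methods are hypothetical names, and the layout is an assumption rather than
//! this crate's actual state struct. A slot counts as touched only when its
//! stamp equals the current generation, so bumping the counter resets every
//! slot at once.
//!
//! ```rust
//! /// Hypothetical per-scan bookkeeping: which rules were touched in this scan.
//! struct TouchedRules {
//!     generation: u16,
//!     stamps: Vec<u16>, // stamps[i] == generation means rule i is touched
//! }
//!
//! impl TouchedRules {
//!     fn new(rule_count: usize) -> Self {
//!         // Stamp 0 is reserved as "never touched"; generations start at 1.
//!         Self { generation: 1, stamps: vec![0; rule_count] }
//!     }
//!
//!     /// Starts a new scan: O(1) except on wrap-around.
//!     fn begin_scan(&mut self) {
//!         self.generation = self.generation.wrapping_add(1);
//!         if self.generation == 0 {
//!             // The counter wrapped, so stale stamps could collide with new
//!             // generations. Pay for one full clear, then restart at 1.
//!             self.stamps.fill(0);
//!             self.generation = 1;
//!         }
//!     }
//!
//!     fn touch(&mut self, rule: usize) {
//!         self.stamps[rule] = self.generation;
//!     }
//!
//!     fn is_touched(&self, rule: usize) -> bool {
//!         self.stamps[rule] == self.generation
//!     }
//! }
//! ```
//!
//! The full clear runs only when the `u16` counter wraps, which is what makes
//! the reset O(1) amortized.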
//!
//! For the full architectural walkthrough, see [DESIGN.md](https://github.com/search?q=repo%3Afoster_guo%2FMatcher+DESIGN.md).

use mimalloc::MiMalloc;

/// Uses [`mimalloc`](https://github.com/purpleprotocol/mimalloc_rust) as the global allocator.
///
/// `mimalloc` was chosen because `SimpleMatcher` scanning relies heavily on
/// thread-local buffer pools and short-lived allocations during text
/// transformation. `mimalloc` provides lower fragmentation under these
/// allocation patterns and significantly better multi-threaded throughput
/// compared to the system allocator, especially on workloads where many threads
/// match concurrently.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
pub use SimpleMatcherBuilder;
pub use ;
pub use ;