// What: `use resharp::Regex;` imports the resharp regex type.
// Resharp's `Regex` holds a `Mutex<RegexInner>` for lazy DFA
// growth, so calling `is_match`/`find_all` on a SHARED Regex
// from multiple threads serializes through that lock. Each
// rule gets its own Regex, so per-rule parallelism still
// works (different mutexes).
// Why: We use resharp only for the (smaller) regex bucket --
// literals go through AC. The combined-over-regex-bucket
// Regex acts as a fast "any regex rule might match?" gate.
// TS map: `import { Regex } from "resharp";`.
//
// In TS you'd write (pseudocode):
// ```ts
// import { Regex } from "resharp";
// ```
use resharp::Regex;
// What: `use std::panic::{catch_unwind, AssertUnwindSafe};` brings
// the panic-recovery primitives into scope.
// - `catch_unwind(closure)` runs the closure on the current
// thread. If the closure panics, the panic is INTERCEPTED
// instead of propagating: the call returns
// `Err(Box<dyn Any + Send>)` carrying the panic payload.
// Successful returns are wrapped in `Ok(value)`. The
// intercepted panic does NOT cause the process to abort or
// the thread to die -- execution continues after the call.
// - `AssertUnwindSafe(value)` is a transparent wrapper that
// asserts to the compiler "I have personally verified this
// value's invariants survive a panic crossing my
// closure". Required because Rust's `UnwindSafe` is an
// AUTO-TRAIT (the compiler derives it structurally): `&T`
// is `UnwindSafe` only when `T: RefUnwindSafe`, and
// `resharp::Regex` is NOT `RefUnwindSafe` because
// `Regex` holds a `Mutex<RegexInner>` (the lazy-DFA cache
// from `resharp`'s lazy-determinisation strategy).
// `AssertUnwindSafe` is sound for our usage because:
// (a) every `Regex` instance is owned by exactly one
// `CompiledRegex` and lives the entire scanner run, so
// there is no shared interior state for a panic to
// corrupt across calls;
// (b) a poisoned `Mutex` after a caught panic returns
// `PoisonError` on the next lock attempt, which
// resharp converts into one of its own `Error`
// variants -- our `.map_err(|_| ())` already swallows
// those into `Err(())`, exactly the same shape callers
// already handle (synthetic "engine error" hit in
// `scan.rs`); the failure stays inside the engine
// boundary;
// (c) we never look at the panic payload, so payload-
// specific UnwindSafe concerns do not apply.
// Siblings: `RefUnwindSafe` (same idea, for shared
// references); `panic::resume_unwind` (re-throws a caught
// payload -- we never want this here because we are the
// top of the engine boundary, not a transparent passthrough).
// Why: Resharp 0.5.x through 0.6.x panics on a handful of fuzzer-
// discovered rule shapes -- one during `Regex::new` (algebra
// overflow at `resharp-algebra/src/lib.rs:2479`), one during
// `find_all` (engine "unexpected end" assertion at
// `resharp/src/engine.rs:1020`, behind a `debug_assert!`
// that fires in test profile but is compiled out of release;
// release silently returns corrupted matches instead).
// Both crashes are inside upstream code we do not own.
// `Result::map_err` cannot catch panics; only `catch_unwind`
// can. Without this wrapper an upstream panic propagates
// through our process, libFuzzer records a crash, and
// (more importantly) a production scanner run on the same
// rule + content pair aborts the process instead of
// degrading gracefully to "skip this rule on this file".
// The scanner is a CI gate: an aborted run silently passes
// the gate.
// TS map: `try { ... } catch (e) { ... }` -- TS exceptions are
// always caught structurally; Rust panics require an
// explicit unwind barrier.
//
// In TS you'd write (pseudocode):
// ```ts
// // No direct equivalent: `try { ... } catch (e) { ... }` is the closest
// // analogue. Rust requires catch_unwind + AssertUnwindSafe to intercept
// // panics across a closure boundary.
// ```
use std::panic::{catch_unwind, AssertUnwindSafe};
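//
// A minimal Rust sketch of the unwind barrier described above, offered
// as an illustration rather than the scanner's actual code. The module
// `unwind_barrier_sketch`, the function `guarded`, and the `()` error
// type are hypothetical names chosen for this example; only
// `catch_unwind` and `AssertUnwindSafe` come from the standard library.
// ```rust
#[allow(dead_code)]
mod unwind_barrier_sketch {
    use std::panic::{catch_unwind, AssertUnwindSafe};

    /// Runs `f` and folds both an `Err` return and a panic into `Err(())`.
    pub fn guarded<T, E>(f: impl FnOnce() -> Result<T, E>) -> Result<T, ()> {
        // AssertUnwindSafe: the caller promises that whatever `f` captured
        // is still usable (or never reused) if a panic unwinds through it.
        match catch_unwind(AssertUnwindSafe(f)) {
            // The closure returned normally; flatten its error, if any.
            Ok(result) => result.map_err(|_| ()),
            // The closure panicked; the payload is deliberately ignored.
            Err(_payload) => Err(()),
        }
    }
}
// ```
// Note: this relies on panics unwinding. Under `panic = "abort"` the
// process terminates and `catch_unwind` cannot intercept anything.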
// What: `use regex::bytes::Regex as PlainRegex;` imports the
// standard `regex` crate's byte-mode regex type under an
// alias to disambiguate from `resharp::Regex`. The `regex`
// crate is Rust's mainline regex engine (Russ Cox-style
// NFA + lazy DFA + Teddy literal accel); its compile path
// is roughly 100x faster than resharp on patterns that
// don't use set-algebra (`A&B`, `~(A)`). Resharp's
// strength is set-algebra and bounded-state guarantees --
// its compile cost is the price of admitting set
// operations as first-class. For rules without set-algebra
// (the overwhelming majority of our secret-detection
// corpus -- 257 of 259 rules in the betterleaks example),
// `regex` produces an equivalent matcher in a fraction of
// the time.
// Why: Phase 1 (regex compile) was the dominant remaining cost
// at 2.0s of 2.96s total wall. Switching the 257
// non-set-algebra rules to `regex` drops Phase 1 to
// tens of milliseconds, putting total wall well under 1s
// on the current corpus and providing the 5x growth
// headroom the user asked for.
// TS map: No equivalent crate exists in TS; closest is the
// built-in `RegExp` which is engineered for pattern-search
// rather than streaming bulk-text scan.
//
// In TS you'd write (pseudocode):
// ```ts
// // No 1:1; pretend `import { Regex as PlainRegex } from "regex-bytes";`
// ```
use regex::bytes::Regex as PlainRegex;
// What: `pub enum CompiledRegex { Resharp(Regex), Plain(PlainRegex) }`
// is the unified compiled-regex container. Each rule's
// source is classified at load time (set-algebra vs not)
// and routed to the appropriate engine. Both engines
// satisfy the same `find_all`/`is_match` contract via
// inherent methods on this enum.
// Why: A single dispatch point keeps `scan.rs` engine-agnostic
// on the hot path. Without this, `RegexRule.re` would have
// to be `Box<dyn Trait>` -- which adds vtable indirection
// per call AND prevents inlining. Static dispatch via
// `match` lets LLVM specialize each branch.
// TS map: `type CompiledRegex = { kind: "resharp"; re: Regex } | { kind: "plain"; re: PlainRegex };`.
//
// In TS you'd write (pseudocode):
// ```ts
// type CompiledRegex =
// | { kind: "resharp"; re: Regex }
// | { kind: "plain"; re: PlainRegex };
// ```
//
// Clippy lint suppressed: `Resharp` carries a 3.3 KiB inner DFA struct,
// while `Plain` is 32 bytes. Boxing the Resharp arm would add a heap
// indirection on every `find_all`/`is_match` (the hot path), regressing
// scan throughput. The size asymmetry is acceptable -- a few hundred
// `RegexRule` values is a one-time per-process cost.
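//
// A hedged sketch of the static-dispatch idea above, using stand-in
// engine types so it compiles without `resharp` or `regex`. `EngineA`,
// `EngineB`, `Compiled`, and the substring-based `is_match` bodies are
// hypothetical; they only illustrate the single `match` dispatch point,
// not how either real engine matches.
// ```rust
#[allow(dead_code)]
mod enum_dispatch_sketch {
    /// Stand-in for the set-algebra engine.
    pub struct EngineA(pub &'static str);
    /// Stand-in for the mainline engine.
    pub struct EngineB(pub &'static str);

    /// Naive substring search shared by both stand-ins.
    fn contains(hay: &[u8], needle: &[u8]) -> bool {
        needle.is_empty() || hay.windows(needle.len()).any(|w| w == needle)
    }

    impl EngineA {
        pub fn is_match(&self, hay: &[u8]) -> bool {
            contains(hay, self.0.as_bytes())
        }
    }

    impl EngineB {
        pub fn is_match(&self, hay: &[u8]) -> bool {
            contains(hay, self.0.as_bytes())
        }
    }

    /// One enum, one variant per engine; no trait object involved.
    pub enum Compiled {
        A(EngineA),
        B(EngineB),
    }

    impl Compiled {
        /// Inherent method: each arm calls a concrete type, so the
        /// compiler is free to inline either branch.
        #[inline]
        pub fn is_match(&self, hay: &[u8]) -> bool {
            match self {
                Compiled::A(re) => re.is_match(hay),
                Compiled::B(re) => re.is_match(hay),
            }
        }
    }
}
// ```
// Callers hold a `Compiled` and never name the engine, which is the
// property that keeps the hot path engine-agnostic.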
// What: `pub struct ScanMatch { pub start: usize, pub end: usize }`
// is the engine-agnostic match record. Field-shape is
// identical to `resharp::Match` so `scan.rs` code reading
// `m.start`/`m.end` works unchanged whether the source
// engine is resharp or regex. The fields are byte offsets
// into the scanned content; `start` is inclusive, `end`
// exclusive (half-open range).
// Why: We can't expose `resharp::Match` directly when the match
// originated from `regex` because regex's match type
// (`regex::bytes::Match`) is a separate library type with
// method-style accessors `.start()`/`.end()`. Translating
// to a common record at the dispatch boundary keeps
// call-sites uniform.
// TS map: `type ScanMatch = { start: number; end: number };`.
//
// In TS you'd write (pseudocode):
// ```ts
// type ScanMatch = { start: number; end: number };
// ```
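//
// A small sketch of the translation at the dispatch boundary. To stay
// dependency-free, `MethodStyleMatch` is a hypothetical stand-in for a
// match type that exposes `.start()`/`.end()` accessors (the shape the
// text attributes to `regex::bytes::Match`); the `From` impl and the
// local copy of `ScanMatch` inside the module are assumptions made for
// illustration only.
// ```rust
#[allow(dead_code)]
mod scan_match_sketch {
    /// Engine-agnostic record: byte offsets, half-open `[start, end)`.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct ScanMatch {
        pub start: usize,
        pub end: usize,
    }

    /// Stand-in for a library match type with method-style accessors.
    pub struct MethodStyleMatch {
        start: usize,
        end: usize,
    }

    impl MethodStyleMatch {
        pub fn new(start: usize, end: usize) -> Self {
            Self { start, end }
        }
        pub fn start(&self) -> usize {
            self.start
        }
        pub fn end(&self) -> usize {
            self.end
        }
    }

    /// Translate once at the boundary so call-sites read plain fields.
    impl From<MethodStyleMatch> for ScanMatch {
        fn from(m: MethodStyleMatch) -> Self {
            ScanMatch { start: m.start(), end: m.end() }
        }
    }
}
// ```
// Because the range is half-open, `&content[m.start..m.end]` yields
// exactly the matched bytes.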