ib_matcher/syntax/glob/
mod.rs

/*!
[glob()-style](https://en.wikipedia.org/wiki/Glob_(programming)) (wildcard) pattern matching syntax support.

Supported syntax:
- [`parse_wildcard`]: `?` and `*`.
  - Windows file name safe.

- [`parse_wildcard_path`]: `?`, `*` and `**`, optionally with [`GlobExtConfig`].
  - Windows file name safe.

  Used by voidtools' Everything, etc.

- [`parse_glob_path`]: `?`, `*`, `[]` and `**`, optionally with [`GlobExtConfig`].
  - Parsing of `[]` is [fallible](#error-behavior).
  - Not Windows file name safe: `[]` may disturb the matching of literal `[]` in file names.

*/
//! - [`GlobExtConfig`]: Two separators (`//`) or a complement separator (`\`) as a glob star (`*/**`).
/*!

The following examples match glob syntax using [`ib_matcher::regex`](crate::regex) engines.

## Example
```
// cargo add ib-matcher --features syntax-glob,regex
use ib_matcher::{regex::lita::Regex, syntax::glob::{parse_wildcard_path, PathSeparator}};

let re = Regex::builder()
    .build_from_hir(
        parse_wildcard_path()
            .separator(PathSeparator::Windows)
            .call(r"Win*\*\*.exe"),
    )
    .unwrap();
assert!(re.is_match(r"C:\Windows\System32\notepad.exe"));

let re = Regex::builder()
    .build_from_hir(
        parse_wildcard_path()
            .separator(PathSeparator::Windows)
            .call(r"Win**.exe"),
    )
    .unwrap();
assert!(re.is_match(r"C:\Windows\System32\notepad.exe"));
```

## With `IbMatcher`
```
use ib_matcher::{
    matcher::MatchConfig,
    regex::lita::Regex,
    syntax::glob::{parse_wildcard_path, PathSeparator}
};

let re = Regex::builder()
    .ib(MatchConfig::builder().pinyin(Default::default()).build())
    .build_from_hir(
        parse_wildcard_path()
            .separator(PathSeparator::Windows)
            .call(r"win**pyss.exe"),
    )
    .unwrap();
assert!(re.is_match(r"C:\Windows\System32\拼音搜索.exe"));
```

## Anchor modes
There are four possible anchor modes:
- Matching from the start of the string. Used by terminal auto-completion.
- Matching from anywhere in the string. Used by this module.
- Matching to the end of the string. Rarely used besides matching file extensions.
- Matching the whole string (from the start to the end). Used by [voidtools' Everything](https://github.com/Chaoses-Ib/IbEverythingExt/issues/98).

This module matches from anywhere in the string by default. For other modes:
- To match from the start of the string only, append a `*` to the pattern (like `foo*`), which will then be considered an anchor (by [`surrounding_wildcard_as_anchor`](ParseWildcardPathBuilder::surrounding_wildcard_as_anchor)).
- To match the whole string only, for the moment you can combine the previous option with a check on the returned match length.
- To match to the end of the string only, prepend a `*`, like `*.mp4`.
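
The match-length check mentioned for whole-string matching can be sketched roughly as follows. This is only an illustration: `is_whole_match` is a hypothetical helper, not part of this crate, and `span` is assumed to be the byte range of a match obtained from a start-anchored pattern.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of this crate): returns `true` when the
/// matched byte range covers the entire haystack.
fn is_whole_match(haystack: &str, span: Range<usize>) -> bool {
    // The match must begin at byte 0 and end at the last byte.
    span.start == 0 && span.end == haystack.len()
}
```

For instance, a match of `0..3` covers all of `foo` but only a prefix of `foobar`, so only the former would count as a whole-string match.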

### Surrounding wildcards as anchors
> TL;DR: When not matching the whole string, enabling [`surrounding_wildcard_as_anchor`](ParseWildcardPathBuilder::surrounding_wildcard_as_anchor) lets patterns like `*.mp4` match `v.mp4` but not `v.mp4_0.webp` (both match if it is disabled). It is enabled by default.

Except when matching the whole string, the anchor modes admit redundant patterns. For example, when matching from anywhere, `*.mp4` matches the same strings as `.mp4`; when matching from the start, `foo*` is the same as `foo`.

These redundant patterns are not syntax errors, but matching them literally is probably not what the user wants. `*.mp4` actually means the match must extend to the end, and `foo*` actually means the match must begin at the start; otherwise the user would just type `.mp4` or `foo`. The redundant forms also produce worse match highlighting (highlighting the whole string isn't useful).

There are two ways to fix these problems: only match the whole string, or treat leading and trailing wildcards differently. From the user's side, the difference is how patterns like `a*b` are treated: the former requires `^a.*b$`, the latter allows `^.*a.*b.*$` (written `*a*b*` in the former). The latter is more user-friendly (in my opinion) and can be converted to the former by adding anchor modes, so it's what is implemented here: [`surrounding_wildcard_as_anchor`](ParseWildcardPathBuilder::surrounding_wildcard_as_anchor), enabled by default.
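
As a rough illustration of the idea, a single leading or trailing `*` can be read as an anchor on the opposite side. This is a sketch only: the helper below is hypothetical and works on plain strings, whereas this module operates on tokens. Here `^` and `$` are just textual markers for the start and end anchors, and patterns with wildcards on both sides (or on neither) are left untouched because their handling is not described here.

```rust
/// Hypothetical helper (not this module's implementation): rewrites one
/// surrounding `*` of a plain wildcard pattern into an anchor marker.
fn surrounding_wildcards_to_anchors(pattern: &str) -> String {
    let leading = pattern.starts_with('*');
    let trailing = pattern.ends_with('*');
    match (leading, trailing) {
        // `*.mp4` -> `.mp4$`: the match must reach the end of the string.
        (true, false) => format!("{}$", pattern.trim_start_matches('*')),
        // `foo*` -> `^foo`: the match must begin at the start of the string.
        (false, true) => format!("^{}", pattern.trim_end_matches('*')),
        // Assumption: every other shape is passed through unchanged.
        _ => pattern.to_string(),
    }
}
```

Under this reading, `*.mp4` becomes `.mp4$`, which is why it can match `v.mp4` but not `v.mp4_0.webp`.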

Related issue: [IbEverythingExt #98](https://github.com/Chaoses-Ib/IbEverythingExt/issues/98)

### Anchors in file paths
> TL;DR: If you are matching file paths, you probably want to set `Regex::builder().thompson(PathSeparator::Windows.look_matcher_config())`.

Another question about anchored matching arises when matching file paths: should the anchors match the start/end of the whole path, or of each path component (i.e. also match at separators)?

The default behavior is the former, for example:
```
use ib_matcher::{
    matcher::MatchConfig,
    regex::lita::Regex,
    syntax::glob::{parse_wildcard_path, PathSeparator}
};

let re = Regex::builder()
    .ib(MatchConfig::default())
    .build_from_hir(
        parse_wildcard_path()
            .separator(PathSeparator::Windows)
            .call(r"?\foo*\"),
    )
    .unwrap();
assert!(re.is_match(r"C\foobar\⑨"));
assert!(re.is_match(r"D\C\foobar\9") == false); // Doesn't match
assert!(re.is_match(r"DC\foobar\9") == false);
assert!(re.is_match(r"C\DC\foobar\9") == false);
```

If you want the latter behavior, i.e. special anchors that match `/` or `\` too, you need to set `look_matcher` in [`crate::regex::nfa::thompson::Config`], for example:
```
use ib_matcher::{
    matcher::MatchConfig,
    regex::lita::Regex,
    syntax::glob::{parse_wildcard_path, PathSeparator}
};

let re = Regex::builder()
    .ib(MatchConfig::default())
    .thompson(PathSeparator::Windows.look_matcher_config())
    .build_from_hir(
        parse_wildcard_path()
            .separator(PathSeparator::Windows)
            .call(r"?\foo*\"),
    )
    .unwrap();
assert!(re.is_match(r"C\foobar\⑨"));
assert!(re.is_match(r"D\C\foobar\9")); // Now matches
assert!(re.is_match(r"DC\foobar\9") == false);
assert!(re.is_match(r"C\DC\foobar\9") == false);
```

The latter behavior is used by voidtools' Everything.
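
To picture what component-aware anchors mean, the sketch below lists the byte offsets at which a start anchor could match once the separator is treated like a line terminator. It is an illustration only: `component_starts` is a hypothetical helper, not part of this crate, and it does not use the regex engine at all.

```rust
/// Hypothetical helper (not part of this crate): byte offsets where a
/// component-aware start anchor could match, i.e. the start of the path
/// and the position right after each separator byte.
fn component_starts(path: &str, sep: u8) -> Vec<usize> {
    let mut starts = vec![0];
    for (i, b) in path.bytes().enumerate() {
        if b == sep {
            starts.push(i + 1);
        }
    }
    starts
}
```

For `D\C\foobar\9` with `\` as the separator, the candidate offsets include the position just before `C`, which is consistent with the second example above matching there while the first one does not.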

Related issue: [IbEverythingExt #99](https://github.com/Chaoses-Ib/IbEverythingExt/issues/99)

## Character classes
<!-- Support the same syntax as in [`regex`](crate::syntax::regex#character-classes), with `^` replaced by `!`. -->

Supports patterns like `[abc]`, `[a-z]`, `[!a-z]` and `[[:ascii:]]`.

Character classes can be used to escape metacharacters: `[?]`, `[*]`, `[[]`, `[]]` match the literal characters `?`, `*`, `[`, `]` respectively.
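
The translation from a glob class to regex class syntax can be sketched as plain string rewriting, along the lines of what [`parse_glob_path`] does before handing the class to a regex parser. The helper name `glob_class_to_regex` is hypothetical and the sketch skips the actual parsing and the literal fallback:

```rust
/// Hypothetical helper (not part of this crate): rewrites one glob class
/// token such as `[!a-z]` into regex class syntax.
fn glob_class_to_regex(class: &str) -> String {
    // `[[]` is the escape for a literal `[`.
    if class == "[[]" {
        return r"\[".to_string();
    }
    // `!` negates in glob syntax, `^` in regex syntax; backslashes are
    // literal in glob classes, so they are escaped for the regex parser.
    class.replace("[!", "[^").replace('\\', r"\\")
}
```

This is why `[\d]` matches a literal backslash or `d` rather than a digit.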

### Error behavior
Parsing of `[]` is fallible: patterns like `a[b` are invalid.

At the moment, if parsing fails, the related characters are treated as literal characters.

### Examples
```
# use ib_matcher::{syntax::glob::{parse_glob_path, PathSeparator}, regex::cp::Regex};
# let is_match = |p, h| {
#     Regex::builder()
#         .build_from_hir(parse_glob_path().separator(PathSeparator::Windows).call(p))
#         .unwrap()
#         .is_match(h)
# };
// Set
assert!(is_match("a[b]z", "abz"));
assert!(is_match("a[b]z", "aBz") == false);
assert!(is_match("a[bcd]z", "acz"));

// Range
assert!(is_match("a[b-z]z", "ayz"));

// Negative set
assert!(is_match("a[!b]z", "abz") == false);
assert!(is_match("a[!b]z", "acz"));

// ASCII character class
assert!(is_match("a[[:space:]]z", "a z"));

// Escape
assert!(is_match("a[?]z", "a?z"));
assert!(is_match("a[*]z", "a*z"));
assert!(is_match("a[[]z", "a[z"));
assert!(is_match("a[-]z", "a-z"));
assert!(is_match("a[]]z", "a]z"));
assert!(is_match(r"a[\d]z", r"a\z"));

// Invalid patterns
assert!(is_match("a[b", "a[bz"));
assert!(is_match("a[[b]z", "a[[b]z"));
assert!(is_match("a[!]z", "a[!]z"));
```
*/
use std::{borrow::Cow, path::MAIN_SEPARATOR};

use bon::{builder, Builder};
use logos::Logos;
use regex_automata::{nfa::thompson, util::look::LookMatcher};
use regex_syntax::{
    hir::{
        Class, ClassBytes, ClassBytesRange, ClassUnicode, ClassUnicodeRange, Dot, Hir, Repetition,
    },
    ParserBuilder,
};

use util::SurroundingWildcardHandler;

mod util;

/// See [`parse_wildcard`].
#[derive(Logos, Clone, Copy, Debug, PartialEq)]
pub enum WildcardToken {
    /// Equivalent to `.`.
    #[token("?")]
    Any,

    /// Equivalent to `.*`.
    #[token("*")]
    Star,

    /// Plain text.
    #[regex("[^*?]+")]
    Text,
}

/// Wildcard-only glob syntax flavor, including `?` and `*`.
#[builder]
pub fn parse_wildcard(
    #[builder(finish_fn)] pattern: &str,
    /// See [`surrounding wildcards as anchors`](super::glob#surrounding-wildcards-as-anchors).
    #[builder(default = true)]
    surrounding_wildcard_as_anchor: bool,
) -> Hir {
    let mut lex = WildcardToken::lexer(&pattern);
    let mut hirs = Vec::new();
    let mut surrounding_handler =
        surrounding_wildcard_as_anchor.then(|| SurroundingWildcardHandler::new(PathSeparator::Any));
    while let Some(Ok(token)) = lex.next() {
        if let Some(h) = &mut surrounding_handler {
            if h.skip(token, &mut hirs, &lex) {
                continue;
            }
        }

        hirs.push(match token {
            WildcardToken::Any => Hir::dot(Dot::AnyChar),
            WildcardToken::Star => Hir::repetition(Repetition {
                min: 0,
                max: None,
                greedy: true,
                sub: Hir::dot(Dot::AnyByte).into(),
            }),
            WildcardToken::Text => Hir::literal(lex.slice().as_bytes()),
        });
    }

    if let Some(h) = surrounding_handler {
        h.insert_anchors(&mut hirs);
    }

    Hir::concat(hirs)
}

/// Defaults to [`PathSeparator::Os`], i.e. `/` on Unix and `\` on Windows.
#[derive(Default, Clone, Copy)]
pub enum PathSeparator {
    /// `/` on Unix and `\` on Windows.
    #[default]
    Os,
    /// i.e. `/`
    Unix,
    /// i.e. `\`
    Windows,
    /// i.e. `/` or `\`
    Any,
}

impl PathSeparator {
    fn os_desugar() -> Self {
        if MAIN_SEPARATOR == '\\' {
            PathSeparator::Windows
        } else {
            PathSeparator::Unix
        }
    }

    fn desugar(self) -> Self {
        match self {
            PathSeparator::Os => Self::os_desugar(),
            sep => sep,
        }
    }

    pub fn is_unix_or_any(self) -> bool {
        matches!(self.desugar(), PathSeparator::Unix | PathSeparator::Any)
    }

    pub fn is_windows_or_any(self) -> bool {
        matches!(self.desugar(), PathSeparator::Windows | PathSeparator::Any)
    }

    fn literal(&self) -> Hir {
        match self.desugar() {
            PathSeparator::Os => unreachable!(),
            PathSeparator::Unix => Hir::literal(*b"/"),
            PathSeparator::Windows => Hir::literal(*b"\\"),
            PathSeparator::Any => Hir::class(Class::Bytes(ClassBytes::new([
                ClassBytesRange::new(b'/', b'/'),
                ClassBytesRange::new(b'\\', b'\\'),
            ]))),
        }
    }

    pub fn any_byte_except(&self) -> Hir {
        match self {
            // Hir::class(Class::Bytes(ClassBytes::new([
            //     ClassBytesRange::new(0, b'\\' - 1),
            //     ClassBytesRange::new(b'\\' + 1, u8::MAX),
            // ])))
            PathSeparator::Os => Hir::dot(Dot::AnyByteExcept(MAIN_SEPARATOR as u8)),
            PathSeparator::Unix => Hir::dot(Dot::AnyByteExcept(b'/')),
            PathSeparator::Windows => Hir::dot(Dot::AnyByteExcept(b'\\')),
            PathSeparator::Any => Hir::class(Class::Bytes(ClassBytes::new([
                ClassBytesRange::new(0, b'/' - 1),
                ClassBytesRange::new(b'/' + 1, b'\\' - 1),
                ClassBytesRange::new(b'\\' + 1, u8::MAX),
            ]))),
        }
    }

    pub fn any_char_except(&self) -> Hir {
        match self {
            PathSeparator::Os => Hir::dot(Dot::AnyCharExcept(MAIN_SEPARATOR)),
            PathSeparator::Unix => Hir::dot(Dot::AnyCharExcept('/')),
            PathSeparator::Windows => Hir::dot(Dot::AnyCharExcept('\\')),
            PathSeparator::Any => Hir::class(Class::Unicode(ClassUnicode::new([
                ClassUnicodeRange::new('\0', '.'),
                ClassUnicodeRange::new('0', '['),
                ClassUnicodeRange::new(']', char::MAX),
            ]))),
        }
    }

    /// Does not support `PathSeparator::Any` yet.
    pub fn look_matcher(&self) -> LookMatcher {
        debug_assert!(!matches!(self, PathSeparator::Any));

        let mut lookm = LookMatcher::new();
        lookm.set_line_terminator(if self.is_unix_or_any() { b'/' } else { b'\\' });
        lookm
    }

    /// Does not support `PathSeparator::Any` yet.
    pub fn look_matcher_config(&self) -> thompson::Config {
        thompson::Config::new().look_matcher(self.look_matcher())
    }

    // fn with_complement_char(&self) -> Option<(char, char)> {
    //     match self {
    //         PathSeparator::Os => Self::os_desugar().with_complement_char(),
    //         PathSeparator::Unix => Some(('/', '\\')),
    //         PathSeparator::Windows => Some(('\\', '/')),
    //         PathSeparator::Any => None,
    //     }
    // }

    /// The complement path separator of the current OS, i.e. `/` on Windows and `\` on Unix.
    pub fn os_complement() -> PathSeparator {
        if MAIN_SEPARATOR == '/' {
            PathSeparator::Windows
        } else {
            PathSeparator::Unix
        }
    }
}

#[derive(Clone, Copy)]
#[non_exhaustive]
pub enum GlobStar {
    /// i.e. `*`, only match within the current component.
    Current,
    /// i.e. `**`, match anywhere, from the current component to children.
    Any,
    /// i.e. `*/**`, match from the current component into a child; at least one separator must be crossed.
    ToChild,
    /// i.e. `**/`, match from the current component up to the start of a child; the match must end right after a separator.
    ToChildStart,
}

impl GlobStar {
    pub fn to_pattern(&self, separator: PathSeparator) -> &'static str {
        match self {
            GlobStar::Current => "*",
            GlobStar::Any => "**",
            GlobStar::ToChild => {
                if separator.is_unix_or_any() {
                    "*/**"
                } else {
                    r"*\**"
                }
            }
            GlobStar::ToChildStart => {
                if separator.is_unix_or_any() {
                    "**/"
                } else {
                    r"**\"
                }
            }
        }
    }
}

/// See [`GlobExtConfig`].
#[derive(Logos, Debug, PartialEq)]
enum GlobExtToken {
    #[token("/")]
    SepUnix,

    #[token(r"\")]
    SepWin,

    #[token("//")]
    TwoSepUnix,

    #[token(r"\\")]
    TwoSepWin,

    /// Plain text.
    #[regex(r"[^/\\]+")]
    Text,
}

/// Supports two separators (`//`) or a complement separator (`\`) as a glob star (`*/**`).
///
/// Optional extensions:
/// - [`two_separator_as_star`](GlobExtConfigBuilder::two_separator_as_star): `\\` as `*\**`.
/// - [`separator_as_star`](GlobExtConfigBuilder::separator_as_star): `/` as `*\**`.
#[derive(Builder, Default, Clone, Copy)]
pub struct GlobExtConfig {
    /// - `sep`: You likely want to use [`PathSeparator::Any`].
    /// - `star`:
    ///   - [`GlobStar::ToChild`]: Replace `\\` with `*\**` for Windows and vice versa for Unix.
    ///
    /// Used by voidtools' Everything.
    #[builder(with = |sep: PathSeparator, star: GlobStar| (sep, star))]
    two_separator_as_star: Option<(PathSeparator, GlobStar)>,
    /// - `sep`: You likely want to use [`PathSeparator::os_complement()`].
    /// - `star`:
    ///   - [`GlobStar::ToChild`]: Replace `/` with `*\**` for Windows and vice versa for Unix.
    ///
    ///     e.g. `xx/hj` can match `xxzl\sj\7yhj` (`学习资料\时间\7月合集` with pinyin match) for Windows.
    ///   - [`GlobStar::ToChildStart`]: Replace `/` with `**\` for Windows and vice versa for Unix.
    ///
    ///     For example:
    ///     - `foo/alice` can, but `foo/lice` can't match `foo\bar\alice` for Windows.
    ///     - `xx/7y` can, but `xx/hj` can't match `xxzl\sj\7yhj` (`学习资料\时间\7月合集` with pinyin match) for Windows.
    ///
    /// Used by IbEverythingExt.
    #[builder(with = |sep: PathSeparator, star: GlobStar| (sep, star))]
    separator_as_star: Option<(PathSeparator, GlobStar)>,
}

impl GlobExtConfig {
    /// The config used by IbEverythingExt. Suitable for common use cases.
    pub fn new_ev() -> Self {
        GlobExtConfig {
            two_separator_as_star: Some((PathSeparator::Any, GlobStar::ToChild)),
            separator_as_star: Some((PathSeparator::os_complement(), GlobStar::ToChildStart)),
        }
    }

    #[cfg(test)]
    fn desugar_single<'p>(&self, pattern: &'p str, to_separator: PathSeparator) -> Cow<'p, str> {
        let mut pattern = Cow::Borrowed(pattern);
        if let Some((sep, star)) = self.two_separator_as_star {
            let star_pattern = star.to_pattern(to_separator);
            pattern = match sep.desugar() {
                PathSeparator::Os => unreachable!(),
                PathSeparator::Unix => pattern.replace("//", star_pattern),
                PathSeparator::Windows => pattern.replace(r"\\", star_pattern),
                PathSeparator::Any => pattern
                    .replace("//", star_pattern)
                    .replace(r"\\", star_pattern),
            }
            .into();
        }
        if let Some((sep, star)) = self.separator_as_star {
            let star_pattern = star.to_pattern(to_separator);
            pattern = match sep.desugar() {
                PathSeparator::Os => unreachable!(),
                PathSeparator::Unix => pattern.replace('/', star_pattern),
                PathSeparator::Windows => pattern.replace('\\', star_pattern),
                PathSeparator::Any => {
                    if to_separator.is_unix_or_any() {
                        pattern
                            .replace('/', star_pattern)
                            .replace('\\', star_pattern)
                    } else {
                        pattern
                            .replace('\\', star_pattern)
                            .replace('/', star_pattern)
                    }
                }
            }
            .into();
        }
        #[cfg(test)]
        dbg!(&pattern);
        pattern
    }

    /// - `to_separator`: The separator the pattern should be desugared to.
    pub fn desugar<'p>(&self, pattern: &'p str, to_separator: PathSeparator) -> Cow<'p, str> {
        if self.two_separator_as_star.is_none() && self.separator_as_star.is_none() {
            return Cow::Borrowed(pattern);
        }
        // TODO: desugar_single optimization?

        let mut lex = GlobExtToken::lexer(&pattern);
        let mut pattern = String::with_capacity(pattern.len());
        let sep_unix = self
            .separator_as_star
            .filter(|(sep, _)| sep.is_unix_or_any())
            .map(|(_, star)| star.to_pattern(to_separator))
            .unwrap_or("/");
        let sep_win = self
            .separator_as_star
            .filter(|(sep, _)| sep.is_windows_or_any())
            .map(|(_, star)| star.to_pattern(to_separator))
            .unwrap_or(r"\");
        let two_sep_unix = self
            .two_separator_as_star
            .filter(|(sep, _)| sep.is_unix_or_any())
            .map(|(_, star)| star.to_pattern(to_separator))
            .unwrap_or("//");
        let two_sep_win = self
            .two_separator_as_star
            .filter(|(sep, _)| sep.is_windows_or_any())
            .map(|(_, star)| star.to_pattern(to_separator))
            .unwrap_or(r"\\");
        while let Some(Ok(token)) = lex.next() {
            pattern.push_str(match token {
                GlobExtToken::SepUnix => sep_unix,
                GlobExtToken::SepWin => sep_win,
                GlobExtToken::TwoSepUnix => two_sep_unix,
                GlobExtToken::TwoSepWin => two_sep_win,
                GlobExtToken::Text => lex.slice(),
            });
        }
        #[cfg(test)]
        dbg!(&pattern);
        Cow::Owned(pattern)
    }
}

/// See [`parse_wildcard_path`].
#[derive(Logos, Clone, Copy, Debug, PartialEq)]
pub enum WildcardPathToken {
    /// Equivalent to `[^/]` on Unix and `[^\\]` on Windows.
    #[token("?")]
    Any,

    /// Equivalent to `[^/]*` on Unix and `[^\\]*` on Windows.
    #[token("*")]
    Star,

    /// Equivalent to `.*`.
    #[token("**")]
    GlobStar,

    #[token("/")]
    SepUnix,

    #[token(r"\")]
    SepWin,

    /// Plain text.
    #[regex(r"[^*?/\\]+")]
    Text,
}

/// Wildcard-only path glob syntax flavor, including `?`, `*` and `**`.
///
/// Used by voidtools' Everything, etc.
#[builder]
pub fn parse_wildcard_path(
    #[builder(finish_fn)] pattern: &str,
    /// The separator used in the pattern. Can be different from the one used in the haystacks to be matched.
    ///
    /// Defaults to the same as `separator`. You may want to use [`PathSeparator::Any`] instead.
    pattern_separator: Option<PathSeparator>,
    /// The path separator used in the haystacks to be matched.
    ///
    /// Only affects `?` and `*`.
    separator: PathSeparator,
    /// See [`surrounding wildcards as anchors`](super::glob#surrounding-wildcards-as-anchors).
    #[builder(default = true)]
    surrounding_wildcard_as_anchor: bool,
    #[builder(default)] ext: GlobExtConfig,
) -> Hir {
    let pattern_separator = pattern_separator.unwrap_or(separator);

    // Desugar
    let pattern = ext.desugar(pattern, pattern_separator);

    let mut lex = WildcardPathToken::lexer(&pattern);
    let mut hirs = Vec::new();
    let mut surrounding_handler =
        surrounding_wildcard_as_anchor.then(|| SurroundingWildcardHandler::new(pattern_separator));
    while let Some(Ok(token)) = lex.next() {
        if let Some(h) = &mut surrounding_handler {
            if h.skip(token, &mut hirs, &lex) {
                continue;
            }
        }

        hirs.push(match token {
            WildcardPathToken::Any => separator.any_char_except(),
            WildcardPathToken::Star => Hir::repetition(Repetition {
                min: 0,
                max: None,
                greedy: true,
                sub: separator.any_byte_except().into(),
            }),
            WildcardPathToken::GlobStar => Hir::repetition(Repetition {
                min: 0,
                max: None,
                greedy: true,
                sub: Hir::dot(Dot::AnyByte).into(),
            }),
            WildcardPathToken::SepUnix if pattern_separator.is_unix_or_any() => separator.literal(),
            WildcardPathToken::SepWin if pattern_separator.is_windows_or_any() => {
                separator.literal()
            }
            WildcardPathToken::Text | WildcardPathToken::SepUnix | WildcardPathToken::SepWin => {
                Hir::literal(lex.slice().as_bytes())
            }
        });
    }

    if let Some(h) = surrounding_handler {
        h.insert_anchors(&mut hirs);
    }

    Hir::concat(hirs)
}

/// See [`parse_glob_path`].
#[derive(Logos, Clone, Copy, Debug, PartialEq)]
pub enum GlobPathToken {
    /// Equivalent to `[^/]` on Unix and `[^\\]` on Windows.
    #[token("?")]
    Any,

    /// Equivalent to `[^/]*` on Unix and `[^\\]*` on Windows.
    #[token("*")]
    Star,

    /// `[...]`.
    #[regex(r"\[[^\]]+\]\]?")]
    Class,

    /// Equivalent to `.*`.
    #[token("**")]
    GlobStar,

    #[token("/")]
    SepUnix,

    #[token(r"\")]
    SepWin,

    /// Plain text.
    #[regex(r"[^*?\[\]/\\]+")]
    Text,
}

/// Glob path syntax flavor, including `?`, `*`, `[]` and `**`.
#[builder]
pub fn parse_glob_path(
    #[builder(finish_fn)] pattern: &str,
    /// The separator used in the pattern. Can be different from the one used in the haystacks to be matched.
    ///
    /// Defaults to the same as `separator`. You may want to use [`PathSeparator::Any`] instead.
    pattern_separator: Option<PathSeparator>,
    /// The path separator used in the haystacks to be matched.
    ///
    /// Only affects `?` and `*`.
    separator: PathSeparator,
    /// See [`surrounding wildcards as anchors`](super::glob#surrounding-wildcards-as-anchors).
    #[builder(default = true)]
    surrounding_wildcard_as_anchor: bool,
    #[builder(default)] ext: GlobExtConfig,
) -> Hir {
    let pattern_separator = pattern_separator.unwrap_or(separator);

    // Desugar
    let pattern = ext.desugar(pattern, pattern_separator);

    let mut lex = GlobPathToken::lexer(&pattern);
    let mut hirs = Vec::new();
    let mut surrounding_handler =
        surrounding_wildcard_as_anchor.then(|| SurroundingWildcardHandler::new(pattern_separator));
    let mut parser = ParserBuilder::new().unicode(false).utf8(false).build();
    while let Some(Ok(token)) = lex.next() {
        if let Some(h) = &mut surrounding_handler {
            if h.skip(token, &mut hirs, &lex) {
                continue;
            }
        }

        hirs.push(match token {
            GlobPathToken::Any => separator.any_char_except(),
            GlobPathToken::Star => Hir::repetition(Repetition {
                min: 0,
                max: None,
                greedy: true,
                sub: separator.any_byte_except().into(),
            }),
            GlobPathToken::GlobStar => Hir::repetition(Repetition {
                min: 0,
                max: None,
                greedy: true,
                sub: Hir::dot(Dot::AnyByte).into(),
            }),
            GlobPathToken::Class => {
                let s = lex.slice();
                match s {
                    "[[]" => Hir::literal("[".as_bytes()),
                    // "[!]" => Hir::literal("!".as_bytes()),
                    _ => {
                        // Life is short
                        match parser.parse(&s.replace("[!", "[^").replace(r"\", r"\\")) {
                            Ok(hir) => hir,
                            Err(_e) => {
                                #[cfg(test)]
                                println!("{_e}");
                                Hir::literal(s.as_bytes())
                            }
                        }
                    }
                }
            }
            GlobPathToken::SepUnix if pattern_separator.is_unix_or_any() => separator.literal(),
            GlobPathToken::SepWin if pattern_separator.is_windows_or_any() => separator.literal(),
            GlobPathToken::Text | GlobPathToken::SepUnix | GlobPathToken::SepWin => {
                Hir::literal(lex.slice().as_bytes())
            }
        });
    }

    if let Some(h) = surrounding_handler {
        h.insert_anchors(&mut hirs);
    }

    Hir::concat(hirs)
}

#[cfg(test)]
mod tests {
    use regex_automata::Match;
    use regex_syntax::ParserBuilder;

    use crate::{matcher::MatchConfig, regex::lita::Regex};

    use super::*;

    #[test]
    fn wildcard_path_token() {
        let input = "*text?more*?text**end";
        let mut lexer = WildcardPathToken::lexer(input);
        assert_eq!(lexer.next(), Some(Ok(WildcardPathToken::Star)));
        assert_eq!(lexer.next(), Some(Ok(WildcardPathToken::Text)));
        assert_eq!(lexer.next(), Some(Ok(WildcardPathToken::Any)));
        assert_eq!(lexer.next(), Some(Ok(WildcardPathToken::Text)));
        assert_eq!(lexer.next(), Some(Ok(WildcardPathToken::Star)));
        assert_eq!(lexer.next(), Some(Ok(WildcardPathToken::Any)));
        assert_eq!(lexer.next(), Some(Ok(WildcardPathToken::Text)));
        assert_eq!(lexer.next(), Some(Ok(WildcardPathToken::GlobStar)));
        assert_eq!(lexer.next(), Some(Ok(WildcardPathToken::Text)));
        assert_eq!(lexer.next(), None);
    }

    #[test]
    fn wildcard() {
        let re = Regex::builder()
            .build_from_hir(parse_wildcard().call("?a*b**c"))
            .unwrap();
        assert!(re.is_match(r"1a2b33c"));
        assert!(re.is_match(r"1a\b33c"));
        assert!(re.is_match(r"b1a\b33c") == false);

        let re = Regex::builder()
            .build_from_hir(parse_wildcard().call(r"Win*\*\*.exe"))
            .unwrap();
        assert!(re.is_match(r"C:\Windows\System32\notepad.exe"));
    }

    #[test]
    fn wildcard_path() {
        let hir1 = ParserBuilder::new()
            .utf8(false)
            .build()
            .parse(r"[^\\](?s-u)a[^\\]*b.*c")
            .unwrap();
        println!("{:?}", hir1);

        let hir2 = parse_wildcard_path()
            .separator(PathSeparator::Windows)
            .surrounding_wildcard_as_anchor(false)
            .call("?a*b**c");
        println!("{:?}", hir2);

        assert_eq!(hir1, hir2);

        let re = Regex::builder().build_from_hir(hir2).unwrap();
        assert!(re.is_match(r"1a2b33c"));
        assert!(re.is_match(r"1a\b33c") == false);

        let re = Regex::builder()
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call(r"Win*\*\*.exe"),
            )
            .unwrap();
        assert!(re.is_match(r"C:\Windows\System32\notepad.exe"));

        let re = Regex::builder()
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call(r"Win**.exe"),
            )
            .unwrap();
        assert!(re.is_match(r"C:\Windows\System32\notepad.exe"));

        let re = Regex::builder()
            .ib(MatchConfig::builder().pinyin(Default::default()).build())
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call(r"win**pyss.exe"),
            )
            .unwrap();
        assert!(re.is_match(r"C:\Windows\System32\拼音搜索.exe"));

        let re = Regex::builder()
            .ib(MatchConfig::builder().romaji(Default::default()).build())
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call("wifi**miku"),
            )
            .unwrap();
        assert!(re.is_match(r"C:\Windows\System32\ja-jp\WiFiTask\ミク.exe"));
    }

    #[test]
    fn glob_path() {
        let is_match = |p, h| {
            Regex::builder()
                .build_from_hir(parse_glob_path().separator(PathSeparator::Windows).call(p))
                .unwrap()
                .is_match(h)
        };

        // Set
        assert!(is_match("a[b]z", "abz"));
        assert!(is_match("a[b]z", "aBz") == false);
        assert!(is_match("a[bcd]z", "acz"));

        // Range
        assert!(is_match("a[b-z]z", "ayz"));

        // Negative set
        assert!(is_match("a[!b]z", "abz") == false);
        assert!(is_match("a[!b]z", "acz"));

        // ASCII character class
        assert!(is_match("a[[:space:]]z", "a z"));

        // Escape: special characters are literal inside `[]`
        assert!(is_match("a[?]z", "a?z"));
        assert!(is_match("a[*]z", "a*z"));
        assert!(is_match("a[[]z", "a[z"));
        assert!(is_match("a[-]z", "a-z"));
        assert!(is_match("a[]]z", "a]z"));
        // `\` is not an escape character inside `[]`
        assert!(is_match(r"a[\d]z", r"a\z"));

        // Invalid patterns are matched literally
        assert!(is_match("a[b", "a[bz"));
        assert!(is_match("a[[b]z", "a[[b]z"));
        assert!(is_match("a[!]z", "a[!]z"));
    }

    #[test]
    fn complement_separator_as_glob_star() {
        let ext = GlobExtConfig::builder()
            .separator_as_star(PathSeparator::Any, GlobStar::ToChild)
            .build();

        // The complement separator (`/` on Windows) desugars to `*\**`.
        assert_eq!(
            ext.desugar_single(r"xx/hj", PathSeparator::Windows),
            r"xx*\**hj"
        );
        assert_eq!(ext.desugar(r"xx/hj", PathSeparator::Windows), r"xx*\**hj");
        let re = Regex::builder()
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .ext(ext)
                    .call(r"xx/hj"),
            )
            .unwrap();
        assert!(re.is_match(r"xxzl\sj\8yhj"));

        // On Unix, the complement separator is `\`.
        let re = Regex::builder()
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Unix)
                    .ext(ext)
                    .call(r"xx\hj"),
            )
            .unwrap();
        assert!(re.is_match(r"xxzl/sj/8yhj"));

        let re = Regex::builder()
            .ib(MatchConfig::builder().pinyin(Default::default()).build())
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .ext(ext)
                    .call(r"xx/hj"),
            )
            .unwrap();
        assert!(re.is_match(r"学习资料\时间\7月合集"));

        // Trailing separator
        let ext = GlobExtConfig::builder()
            .separator_as_star(PathSeparator::Any, GlobStar::ToChildStart)
            .build();
        let re = Regex::builder()
            .ib(MatchConfig::default())
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .ext(ext)
                    .call(r"xx/"),
            )
            .unwrap();
        assert!(re.is_match(r"C:\Xxzl\sj\8yhj"));
        assert!(re.is_match(r"C:\学习\Xxzl\sj\8yhj"));
    }

    #[test]
    fn surrounding_wildcard_as_anchor() {
        // Leading *
        let re = Regex::builder()
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call(r"*.mp4"),
            )
            .unwrap();
        assert!(re.is_match(r"瑠璃の宝石.mp4"));
        assert!(re.is_match(r"瑠璃の宝石.mp4_001947.296.webp") == false);

        // Trailing *
        let re = Regex::builder()
            .ib(MatchConfig::builder().pinyin(Default::default()).build())
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call(r"ll*"),
            )
            .unwrap();
        assert!(re.is_match(r"瑠璃の宝石.mp4"));
        // Match spans are byte offsets: `瑠璃` is 6 bytes in UTF-8.
        assert_eq!(re.find(r"瑠璃の宝石.mp4"), Some(Match::must(0, 0..6)));
        assert!(re.is_match(r"ruri 瑠璃の宝石.mp4") == false);

        let re = Regex::builder()
            .ib(MatchConfig::builder().pinyin(Default::default()).build())
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call(r"ll***"),
            )
            .unwrap();
        assert_eq!(re.find(r"瑠璃の宝石.mp4"), Some(Match::must(0, 0..6)));
        assert!(re.is_match(r"ruri 瑠璃の宝石.mp4") == false);

        // Middle *
        let re = Regex::builder()
            .ib(MatchConfig::builder().pinyin(Default::default()).build())
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call(r"ll*.mp4"),
            )
            .unwrap();
        assert!(re.is_match(r"瑠璃の宝石.mp4"));
        assert!(re.is_match(r"ruri 瑠璃の宝石.mp4"));
        assert!(re.is_match(r"ruri 瑠璃の宝石.mp4_001133.937.webp"));

        // Leading ?
        let re = Regex::builder()
            .ib(MatchConfig::builder().pinyin(Default::default()).build())
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call(r"??.mp4"),
            )
            .unwrap();
        assert_eq!(re.find(r"宝石.mp4"), Some(Match::must(0, 0..10)));
        assert_eq!(re.find(r"瑠璃の宝石.mp4"), None);

        // Trailing ?
        let re = Regex::builder()
            .ib(MatchConfig::builder().pinyin(Default::default()).build())
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call(r"ll???"),
            )
            .unwrap();
        assert_eq!(re.find(r"瑠璃の宝石"), Some(Match::must(0, 0..15)));
        assert!(re.is_match(r"ruri 瑠璃の宝石") == false);
    }

    #[test]
    fn surrounding_wildcard_as_anchor_path() {
        // Leading ?
        let re = Regex::builder()
            .ib(MatchConfig::builder().pinyin(Default::default()).build())
            .thompson(PathSeparator::Windows.look_matcher_config())
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call(r"?:\$RECYCLE*\"),
            )
            .unwrap();
        assert!(re.is_match(r"C:\$RECYCLE.BIN\⑨"));
        assert!(re.is_match(r"C:\$RECYCLE.BIN\9"));
        assert!(re.is_match(r"C:\$RECYCLE.BIN\99"));
        assert!(re.is_match(r"D:\C:\$RECYCLE.BIN\9"));
        assert!(re.is_match(r"DC:\$RECYCLE.BIN\9") == false);
        assert!(re.is_match(r"D:\DC:\$RECYCLE.BIN\9") == false);

        // Trailing ?
        let re = Regex::builder()
            .ib(MatchConfig::builder().pinyin(Default::default()).build())
            .thompson(PathSeparator::Windows.look_matcher_config())
            .build_from_hir(
                parse_wildcard_path()
                    .separator(PathSeparator::Windows)
                    .call(r"?:\$RECYCLE*\?"),
            )
            .unwrap();
        assert!(re.is_match(r"C:\$RECYCLE.BIN\⑨"));
        assert!(re.is_match(r"C:\$RECYCLE.BIN\9"));
        assert!(re.is_match(r"C:\$RECYCLE.BIN\99") == false);
        assert!(re.is_match(r"D:\C:\$RECYCLE.BIN\9"));
        assert!(re.is_match(r"D:\C:\$RECYCLE.BIN\99") == false);
        assert!(re.is_match(r"DC:\$RECYCLE.BIN\9") == false);
        assert!(re.is_match(r"D:\DC:\$RECYCLE.BIN\9") == false);
    }
}