Crate ib_matcher

A multilingual, flexible and fast string, glob and regex matcher. Supports 拼音匹配 (Chinese pinyin match) and ローマ字検索 (Japanese romaji match).

§Features

Unicode support
- Full UTF-8 support and limited support for UTF-16 and UTF-32.
- Unicode case insensitivity (simple case folding).
Chinese pinyin matching (拼音匹配)
- Support characters with multiple readings (i.e. heteronyms, 多音字).
- Support multiple pinyin notations, including Quanpin (全拼), Jianpin (简拼) and many Shuangpin (双拼) notations.
- Support mixing multiple notations during matching.
Japanese romaji matching (ローマ字検索)
- Support characters with multiple readings (i.e. heteronyms, 同形異音語).
- Only the Hepburn romanization system is supported at the moment.
glob()-style pattern matching (i.e. ?, *, [] and **)
- Support different anchor modes, treating surrounding wildcards as anchors and special anchors in file paths.
- Support two separators (//) or a complement separator (\) as a glob star (*/**).
Regular expression
- Support the same syntax as regex, including wildcards, repetitions, alternations, groups, etc.
- Support custom matching callbacks, which can be used to implement ad hoc look-around, backreferences, balancing groups/recursion/subroutines, combining domain-specific parsers, etc.
Relatively high performance
- Generally on par with the regex crate; depending on the case, it can be faster or slower.

All of the above features are optional. You don’t need to pay the performance and binary size cost for features you don’t use.

If you only need Chinese pinyin matching, you can also use ib-pinyin, which is simpler and more stable.
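
To make the pinyin notation bullets above more concrete, here is a minimal, std-only sketch of how Quanpin (full reading) and Jianpin (first letter) can be mixed in one pattern, including a character with two readings. It is not this crate's implementation: the `readings` table, `match_here` and `is_match` are hypothetical names invented for illustration, the dictionary covers only five characters, and Shuangpin is not modeled.

```rust
/// Hypothetical toy dictionary (an assumption for illustration only):
/// each character maps to its possible full pinyin readings.
/// 乐 is a heteronym with two readings.
fn readings(c: char) -> &'static [&'static str] {
    match c {
        '拼' => &["pin"],
        '音' => &["yin"],
        '搜' => &["sou"],
        '索' => &["suo"],
        '乐' => &["le", "yue"],
        _ => &[],
    }
}

/// Returns true if the whole `pattern` can be consumed starting at the
/// beginning of `hay`. Each haystack character may be matched literally
/// (ASCII case-insensitive), by a full reading (Quanpin), or by the first
/// letter of a reading (Jianpin).
fn match_here(pattern: &str, hay: &str) -> bool {
    if pattern.is_empty() {
        return true;
    }
    let Some(c) = hay.chars().next() else {
        return false;
    };
    let rest = &hay[c.len_utf8()..];

    // Literal match of one character.
    let p = pattern.chars().next().unwrap();
    if p.eq_ignore_ascii_case(&c) && match_here(&pattern[p.len_utf8()..], rest) {
        return true;
    }

    for &reading in readings(c) {
        // Quanpin: the pattern continues with the full reading.
        if let Some(tail) = pattern.strip_prefix(reading) {
            if match_here(tail, rest) {
                return true;
            }
        }
        // Jianpin: the pattern continues with the reading's first letter.
        // Readings in the toy table are ASCII, so slicing one byte is safe.
        if let Some(tail) = pattern.strip_prefix(&reading[..1]) {
            if match_here(tail, rest) {
                return true;
            }
        }
    }
    false
}

/// Unanchored search: try every character position as a starting point.
fn is_match(pattern: &str, hay: &str) -> bool {
    pattern.is_empty() || hay.char_indices().any(|(i, _)| match_here(pattern, &hay[i..]))
}
```

With this sketch, `is_match("pysousuoeve", "拼音搜索Everything")` succeeds by reading 拼 and 音 as Jianpin, 搜 and 索 as Quanpin, and the remainder as plain text. The backtracking is exponential in the worst case, which is acceptable for a toy but not for a real matcher.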

§Usage

// cargo add ib-matcher --features pinyin,romaji
use ib_matcher::matcher::{IbMatcher, PinyinMatchConfig, RomajiMatchConfig};

let matcher = IbMatcher::builder("la vie est drôle").build();
assert!(matcher.is_match("LA VIE EST DRÔLE"));

let matcher = IbMatcher::builder("βίος").build();
assert!(matcher.is_match("Βίοσ"));
assert!(matcher.is_match("ΒΊΟΣ"));

let matcher = IbMatcher::builder("pysousuoeve")
    .pinyin(PinyinMatchConfig::default())
    .build();
assert!(matcher.is_match("拼音搜索Everything"));

let matcher = IbMatcher::builder("konosuba")
    .romaji(RomajiMatchConfig::default())
    .is_pattern_partial(true)
    .build();
assert!(matcher.is_match("この素晴らしい世界に祝福を"));
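
The Greek assertions above rely on case folding rather than plain lowercasing: Unicode simple case folding maps Σ, σ and the final form ς to the same character. The following std-only sketch approximates that behavior. It is an assumption-laden illustration, not the crate's algorithm: `fold_char` and `eq_folded` are hypothetical helpers, and using `char::to_lowercase` plus a special case for ς only approximates the real Unicode folding table.

```rust
/// Approximate simple case folding for one character (illustrative only).
/// Characters whose lowercase form is more than one `char` are left as-is,
/// because simple folding never changes the number of characters.
fn fold_char(c: char) -> char {
    if c == 'ς' {
        // Final sigma folds to the regular small sigma.
        return 'σ';
    }
    let mut lower = c.to_lowercase();
    match (lower.next(), lower.next()) {
        (Some(l), None) => l,
        _ => c,
    }
}

/// Compares two strings character by character after folding.
fn eq_folded(a: &str, b: &str) -> bool {
    a.chars().map(fold_char).eq(b.chars().map(fold_char))
}
```

Under these assumptions, "λογος", "λογοσ" and "ΛΟΓΟΣ" all compare equal, whereas a comparison based on `str::to_lowercase` alone would keep σ and ς distinct.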

§glob()-style pattern matching

See glob module for more details. Here is a quick example:

// cargo add ib-matcher --features syntax-glob,regex,romaji
use ib_matcher::{
    matcher::MatchConfig,
    regex::lita::Regex,
    syntax::glob::{parse_wildcard_path, PathSeparator}
};

let re = Regex::builder()
    .ib(MatchConfig::builder().romaji(Default::default()).build())
    .build_from_hir(
        parse_wildcard_path()
            .separator(PathSeparator::Windows)
            .call("wifi**miku"),
    )
    .unwrap();
assert!(re.is_match(r"C:\Windows\System32\ja-jp\WiFiTask\ミク.exe"));
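
For readers unfamiliar with the wildcard semantics, the sketch below shows one common interpretation of `?`, `*` and `**` with a configurable path separator. It is a std-only toy with assumed semantics, not the crate's parser: `glob_match` and `match_chars` are hypothetical names, the match is anchored to the whole path, `[]` classes are not handled, and none of the crate's anchor modes or language matching are modeled.

```rust
/// Anchored wildcard match (illustrative semantics, assumed here):
/// - `?`  matches exactly one character that is not the separator,
/// - `*`  matches any run of characters that are not the separator,
/// - `**` matches any run of characters, separators included.
fn glob_match(pattern: &str, path: &str, sep: char) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = path.chars().collect();
    match_chars(&p, &t, sep)
}

fn match_chars(p: &[char], t: &[char], sep: char) -> bool {
    match p {
        // Pattern exhausted: succeed only if the text is exhausted too.
        [] => t.is_empty(),
        // `**`: try every possible split point, crossing separators.
        ['*', '*', rest @ ..] => (0..=t.len()).any(|i| match_chars(rest, &t[i..], sep)),
        // `*`: try every split point up to the next separator.
        ['*', rest @ ..] => {
            let max = t.iter().position(|&c| c == sep).unwrap_or(t.len());
            (0..=max).any(|i| match_chars(rest, &t[i..], sep))
        }
        // `?`: exactly one non-separator character.
        ['?', rest @ ..] => match t {
            [c, tail @ ..] if *c != sep => match_chars(rest, tail, sep),
            _ => false,
        },
        // Any other character must match literally.
        [lit, rest @ ..] => match t {
            [c, tail @ ..] if c == lit => match_chars(rest, tail, sep),
            _ => false,
        },
    }
}
```

For example, `glob_match("*.txt", "docs/notes.txt", '/')` is false because `*` stops at the separator, while `glob_match("**.txt", "docs/notes.txt", '/')` is true.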

§Regular expression

See regex module for more details. Here is a quick example:

// cargo add ib-matcher --features regex,pinyin,romaji
use ib_matcher::{
    matcher::{MatchConfig, PinyinMatchConfig, RomajiMatchConfig},
    regex::{cp::Regex, Match},
};

let config = MatchConfig::builder()
    .pinyin(PinyinMatchConfig::default())
    .romaji(RomajiMatchConfig::default())
    .build();

let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("raki.suta")
    .unwrap();
assert_eq!(re.find("「らき☆すた」"), Some(Match::must(0, 3..18)));

let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("pysou.*?(any|every)thing")
    .unwrap();
assert_eq!(re.find("拼音搜索Everything"), Some(Match::must(0, 0..22)));

let config = MatchConfig::builder()
    .pinyin(PinyinMatchConfig::default())
    .romaji(RomajiMatchConfig::default())
    .mix_lang(true)
    .build();
let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("(?x)^zangsounofuri-?ren # Mixing pinyin and romaji")
    .unwrap();
assert_eq!(re.find("葬送のフリーレン"), Some(Match::must(0, 0..24)));
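
The spans in the assertions above are byte offsets into the UTF-8 haystack, not character counts, which is why a five-character match such as らき☆すた is reported as 3..18. The std-only helper below, a hypothetical `byte_span` written for illustration, converts a character range into the corresponding byte range so the numbers can be checked by hand.

```rust
use std::ops::Range;

/// Converts a range of character positions in `hay` into the matching
/// range of UTF-8 byte offsets. Panics if the range exceeds the number
/// of characters in `hay`.
fn byte_span(hay: &str, chars: Range<usize>) -> Range<usize> {
    // Byte offset of every character start, plus the end of the string.
    let offsets: Vec<usize> = hay
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(hay.len()))
        .collect();
    offsets[chars.start]..offsets[chars.end]
}
```

Each of the Japanese and Chinese characters in these examples occupies three bytes in UTF-8, so characters 1..6 of 「らき☆すた」 map to bytes 3..18, and the eight characters of 葬送のフリーレン map to bytes 0..24.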

Custom matching callbacks:

// cargo add ib-matcher --features regex,regex-callback
use ib_matcher::regex::cp::Regex;

let re = Regex::builder()
    .callback("ascii", |input, at, push| {
        let haystack = &input.haystack()[at..];
        if haystack.len() > 0 && haystack[0].is_ascii() {
            push(1);
        }
    })
    .build(r"(ascii)+\d(ascii)+")
    .unwrap();
let hay = "that4Ｕ this4me";
assert_eq!(&hay[re.find(hay).unwrap().span()], " this4me");
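
The result above may look surprising: "that4" is followed by a fullwidth Ｕ (U+FF35), which is not ASCII, so the trailing `(ascii)+` cannot match there and the search moves on to " this4me". The hand-written scanner below reproduces that outcome without any regex engine. It is only a sketch of the pattern's semantics under one simplifying assumption: `\d` is treated as an ASCII digit, whereas a Unicode-aware `\d` could also match non-ASCII digits. The function name is hypothetical.

```rust
use std::ops::Range;

/// Leftmost match of "one or more ASCII bytes, one ASCII digit, one or
/// more ASCII bytes", with greedy repetition, returned as a byte range.
fn find_ascii_digit_ascii(hay: &str) -> Option<Range<usize>> {
    let bytes = hay.as_bytes();
    for start in 0..bytes.len() {
        // Longest run of ASCII bytes beginning at `start`.
        let run = bytes[start..].iter().take_while(|b| b.is_ascii()).count();
        let end = start + run;
        // The digit needs at least one ASCII byte on each side, so it
        // must lie strictly inside the run.
        if run >= 3 && bytes[start + 1..end - 1].iter().any(|b| b.is_ascii_digit()) {
            return Some(start..end);
        }
    }
    None
}
```

On the haystack from the example, every start position inside "that4" fails because no digit has an ASCII byte after it, the three bytes of Ｕ give empty runs, and the first successful start is the space at byte 8.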

§Choosing a matcher

Use matcher::IbMatcher if:

- You only need plain text matching, optionally with Unicode case insensitivity, Chinese pinyin match and Japanese romaji match.

Use regex::lita::Regex if:

- You need regex or glob syntax.
- You want high performance (and don’t mind some binary footprint).

regex::lita::Regex can be much faster than regex::cp::Regex, and slightly faster than the regex crate (due to enum dispatch) if the following conditions are met:
- Your pattern is often a literal string (i.e. plain text, optionally with pinyin/romaji match).
- A fair portion of your haystacks is ASCII-only.
A typical use case that meets the above conditions is matching file names and paths.
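
The phrase "enum dispatch" above refers to a general Rust technique: selecting an implementation with a `match` on an enum instead of calling through a trait object. The sketch below contrasts the two. All types here (`Matcher`, `Literal`, `AsciiCaseInsensitive`, `AnyMatcher`) are hypothetical and say nothing about how regex::lita::Regex is actually structured.

```rust
/// A common interface, usable through dynamic dispatch (`Box<dyn Matcher>`).
trait Matcher {
    fn is_match(&self, hay: &str) -> bool;
}

/// Case-sensitive substring search.
struct Literal(String);

/// ASCII case-insensitive substring search.
/// Assumes the stored needle is already lowercase.
struct AsciiCaseInsensitive(String);

impl Matcher for Literal {
    fn is_match(&self, hay: &str) -> bool {
        hay.contains(self.0.as_str())
    }
}

impl Matcher for AsciiCaseInsensitive {
    fn is_match(&self, hay: &str) -> bool {
        hay.to_ascii_lowercase().contains(self.0.as_str())
    }
}

/// Enum dispatch: the set of implementations is closed, so the call is a
/// `match` on the variant and each arm can be inlined by the compiler.
enum AnyMatcher {
    Literal(Literal),
    AsciiCaseInsensitive(AsciiCaseInsensitive),
}

impl AnyMatcher {
    fn is_match(&self, hay: &str) -> bool {
        match self {
            AnyMatcher::Literal(m) => m.is_match(hay),
            AnyMatcher::AsciiCaseInsensitive(m) => m.is_match(hay),
        }
    }
}
```

Both `AnyMatcher` and `Box<dyn Matcher>` give the same results; the enum trades extensibility for the chance to avoid an indirect call, which matters most when each individual match is cheap, as with short file names.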

Use regex::cp::Regex if:

- You need regex or glob syntax.
- You need find_iter() or captures_iter().
- You need build_many().
- You need custom matching callbacks.
- You want a smaller binary size and don’t mind somewhat lower performance.

§Performance

The following Cargo.toml settings are recommended if best performance is desired:

[profile.release]
lto = "fat"
codegen-units = 1

These settings can improve performance by at most 5-10%.

§Crate features

Most commonly used feature combinations:

Languages: pinyin,romaji
glob: syntax-glob,regex
Regex: regex
Regex with custom matching callbacks: regex-callback

Features:

std (enabled by default) —
- For regex: When enabled, this will cause regex to use the standard library. In terms of APIs, std causes error types to implement the std::error::Error trait. Enabling std will also result in performance optimizations, including SIMD and faster synchronization primitives. Notably, disabling the std feature will result in the use of spin locks. To use a regex engine without std and without spin locks, you’ll need to drop down to use APIs that accept a Cache value explicitly.
alloc (enabled by default) — Enables use of the alloc library. This is required for most APIs in this crate.

§Languages

unicode (enabled by default) — Unicode support.
pinyin — Chinese pinyin match support.
romaji — Japanese romaji match support.

The dictionary will take ~4.8 MiB (5.5 MiB without compression) in the binary at the moment, much larger than pinyin’s.
romaji-compress-words (enabled by default) — Reduces binary size (and memory usage) by 696 KiB (771 KiB if zstd is already used) and adds about 1.1 ms to the romanizer build time.

§Syntax

syntax — Pattern syntax support. Equivalent to features syntax-glob,syntax-ev. See syntax for details.
syntax-glob — glob()-style pattern matching syntax support. See syntax::glob for details.
syntax-ev — Support for the syntax used by IbEverythingExt. See syntax::ev for details.
syntax-regex — Enables a dependency on regex-syntax. This makes APIs for building regex engines from pattern strings available. Without the regex-syntax dependency, the only way to build a regex engine is generally to deserialize a previously built DFA or to hand assemble an NFA using its builder API. Once you have an NFA, you can build any of the regex engines in this crate. The syntax feature also enables alloc.

See syntax::regex for details.

§Regular expression engines

regex — Regular expression support. See regex for details.

Includes all regex features except regex-unicode (enabled by default) and regex-callback.
regex-automata — Regex engine types.
regex-nfa — Regex NFA engines.
regex-cp — Enables the regex::cp engine. Includes features regex-nfa,syntax-regex.
regex-callback — Regex with custom matching callbacks.
regex-lita — Enables the regex::lita engine. Includes feature regex-cp.
regex-unicode (enabled by default) — Enables all regex Unicode features. This feature is enabled by default, and will always cover all Unicode features, even if more are added in the future.

§Performance

perf (enabled by default) — Enables all performance related features. This feature is enabled by default and is intended to cover all reasonable features that improve performance, even if more are added in the future.
perf-inline (enabled by default) — Enables aggressive use of inlining.

When enabled, inline(always) is used in (many) strategic locations to help performance at the expense of longer compile times and increased binary size.
perf-literal (enabled by default) — Enables all literal related optimizations.
perf-literal-substring (enabled by default) — Enables all single substring literal optimizations. This includes adding a dependency on the memchr crate.
perf-unicode (enabled by default) — Unicode and ASCII related optimizations.
perf-plain-regex — Not used at the moment.

Build size +837.5 KiB

§FFI

inmut-data — Makes pinyin::PinyinData interior mutable, so that it can easily be used as a static variable.
minimal — Minimal APIs that can be used in one call. See minimal for details.
encoding — Support for non-UTF-8 encodings. Only UTF-16 and UTF-32 at the moment.

Non-UTF-8 Japanese romaji match is not yet supported.
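
As a rough illustration of what matching on a non-UTF-8 haystack involves, the std-only sketch below searches a UTF-16 buffer for a needle by comparing code units directly, without decoding the haystack. It is not the crate's encoding API: `utf16_contains` is a hypothetical helper, and it performs an exact, case-sensitive search only.

```rust
/// Returns true if the UTF-16 haystack contains `needle`.
/// The needle is encoded to UTF-16 once, then compared against every
/// window of code units of the same length.
fn utf16_contains(hay: &[u16], needle: &str) -> bool {
    let needle: Vec<u16> = needle.encode_utf16().collect();
    // `windows(0)` would panic, so an empty needle is handled first.
    needle.is_empty() || hay.windows(needle.len()).any(|w| w == needle.as_slice())
}
```

Because the needle comes from a `&str`, it is always well-formed UTF-16 and never begins with a low surrogate, so a match cannot start in the middle of a surrogate pair in a well-formed haystack.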

Re-exports§

pub use ib_romaji as romaji; (feature: romaji)
pub use ib_unicode as unicode;

Modules§

matcher: The Ib matcher. See IbMatcher.
minimal (feature: minimal): Minimal APIs
pinyin (feature: pinyin): Pinyin
regex (feature: regex-automata): This module provides routines for searching strings for matches of a regular expression (aka “regex”). The regex syntax supported by this crate is similar to other regex engines, but it lacks several features that are not known how to implement efficiently. This includes, but is not limited to, look-around and backreferences. In exchange, all regex searches in this crate have worst case O(m * n) time complexity, where m is proportional to the size of the regex and n is proportional to the size of the string being searched.
syntax (feature: syntax-glob, syntax-ev or syntax-regex): A collection of syntax parsers for either IbMatcher or regex engines.
