Crate ib_matcher

A multilingual, flexible, and fast string, glob, and regex matcher. Supports 拼音匹配 (Chinese pinyin matching) and ローマ字検索 (Japanese romaji matching).

§Features

All features are optional: you don’t pay the performance or binary size cost for features you don’t use.

If you only need Chinese pinyin matching, you can use ib-pinyin instead, which is simpler and more stable.

§Usage

// cargo add ib-matcher --features pinyin,romaji
use ib_matcher::matcher::{IbMatcher, PinyinMatchConfig, RomajiMatchConfig};

let matcher = IbMatcher::builder("la vie est drôle").build();
assert!(matcher.is_match("LA VIE EST DRÔLE"));

let matcher = IbMatcher::builder("βίος").build();
assert!(matcher.is_match("Βίοσ"));
assert!(matcher.is_match("ΒΊΟΣ"));

let matcher = IbMatcher::builder("pysousuoeve")
    .pinyin(PinyinMatchConfig::default())
    .build();
assert!(matcher.is_match("拼音搜索Everything"));

let matcher = IbMatcher::builder("konosuba")
    .romaji(RomajiMatchConfig::default())
    .is_pattern_partial(true)
    .build();
assert!(matcher.is_match("この素晴らしい世界に祝福を"));

See also choosing a matcher.

§glob()-style pattern matching

See glob module for more details. Here is a quick example:

// cargo add ib-matcher --features syntax-glob,regex,romaji
use ib_matcher::{
    matcher::MatchConfig,
    regex::lita::Regex,
    syntax::glob::{parse_wildcard_path, PathSeparator}
};

let re = Regex::builder()
    .ib(MatchConfig::builder().romaji(Default::default()).build())
    .build_from_hir(
        parse_wildcard_path()
            .separator(PathSeparator::Windows)
            .call("wifi**miku"),
    )
    .unwrap();
assert!(re.is_match(r"C:\Windows\System32\ja-jp\WiFiTask\ミク.exe"));

§Regular expression

See regex module for more details. Here is a quick example:

// cargo add ib-matcher --features regex,pinyin,romaji
use ib_matcher::{
    matcher::{MatchConfig, PinyinMatchConfig, RomajiMatchConfig},
    regex::{cp::Regex, Match},
};

let config = MatchConfig::builder()
    .pinyin(PinyinMatchConfig::default())
    .romaji(RomajiMatchConfig::default())
    .build();

let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("raki.suta")
    .unwrap();
assert_eq!(re.find("「らき☆すた」"), Some(Match::must(0, 3..18)));

let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("pysou.*?(any|every)thing")
    .unwrap();
assert_eq!(re.find("拼音搜索Everything"), Some(Match::must(0, 0..22)));

let config = MatchConfig::builder()
    .pinyin(PinyinMatchConfig::default())
    .romaji(RomajiMatchConfig::default())
    .mix_lang(true)
    .build();
let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("(?x)^zangsounofuri-?ren # Mixing pinyin and romaji")
    .unwrap();
assert_eq!(re.find("葬送のフリーレン"), Some(Match::must(0, 0..24)));
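The ranges in the assertions above (3..18, 0..22, 0..24) appear to be UTF-8 byte offsets rather than character indices. The following std-only sketch checks that arithmetic; byte_span and char_count are hypothetical helpers written for this illustration, not part of ib_matcher.

```rust
/// Returns the UTF-8 byte range of the first occurrence of `needle` in
/// `haystack`, or `None` if it does not occur.
/// (Hypothetical helper for illustration; not an ib_matcher API.)
fn byte_span(haystack: &str, needle: &str) -> Option<std::ops::Range<usize>> {
    // `str::find` returns a byte offset, and `str::len` counts bytes.
    let start = haystack.find(needle)?;
    Some(start..start + needle.len())
}

/// Number of Unicode scalar values in `s`, for comparison with `s.len()`.
fn char_count(s: &str) -> usize {
    s.chars().count()
}
```

For example, byte_span("「らき☆すた」", "らき☆すた") gives Some(3..18), because the opening bracket and each of the five matched characters occupy 3 bytes. Likewise, "拼音搜索Everything".len() is 22 (4 × 3 + 10) although char_count reports 14, and "葬送のフリーレン".len() is 24 for 8 characters. Keep this in mind when slicing a haystack with a match span.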

Custom matching callbacks:

// cargo add ib-matcher --features regex,regex-callback
use ib_matcher::regex::cp::Regex;

let re = Regex::builder()
    .callback("ascii", |input, at, push| {
        let haystack = &input.haystack()[at..];
        if haystack.len() > 0 && haystack[0].is_ascii() {
            push(1);
        }
    })
    .build(r"(ascii)+\d(ascii)+")
    .unwrap();
let hay = "that4U this4me";
assert_eq!(&hay[re.find(hay).unwrap().span()], " this4me");

§Choosing a matcher

Use matcher::IbMatcher if:

  • You only need plain text matching, optionally with Unicode case insensitivity, Chinese pinyin match and Japanese romaji match.

Use regex::lita::Regex if:

  • You need regex or glob syntax.

  • You want high performance (and don’t mind some binary footprint).

    regex::lita::Regex can be much faster than regex::cp::Regex, and slightly faster than the regex crate (due to enum dispatch) if the following conditions are met:

    • Your pattern is often a literal string (i.e. plain text, optionally with pinyin/romaji match).
    • A fair portion of your haystacks is ASCII-only.

    A typical use case that meets the above conditions is matching file names and paths.
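To make the conditions above more concrete, here is a minimal, std-only sketch of the general idea of enum dispatch with a literal fast path. It is not the implementation of regex::lita::Regex; Matcher, its variants, and contains_ascii_ignore_case are hypothetical names, and the General variant is only a stand-in for a full regex engine.

```rust
/// Hypothetical matcher: a pattern is either a plain literal or something
/// that needs a general engine. Dispatch is a `match` on the enum variant.
enum Matcher {
    /// Plain-text pattern, matched case-insensitively.
    Literal(String),
    /// Stand-in for a general regex engine (here just a boxed predicate).
    General(Box<dyn Fn(&str) -> bool>),
}

impl Matcher {
    fn is_match(&self, haystack: &str) -> bool {
        match self {
            Matcher::Literal(needle) => {
                if haystack.is_ascii() && needle.is_ascii() {
                    // Fast path: byte-wise comparison, no allocation.
                    contains_ascii_ignore_case(haystack.as_bytes(), needle.as_bytes())
                } else {
                    // Slow path: Unicode lowercasing allocates two new strings.
                    haystack.to_lowercase().contains(&needle.to_lowercase())
                }
            }
            Matcher::General(engine) => engine(haystack),
        }
    }
}

/// Case-insensitive substring search over ASCII bytes.
fn contains_ascii_ignore_case(hay: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        return true; // `windows(0)` would panic, so handle this case first
    }
    hay.windows(needle.len())
        .any(|window| window.eq_ignore_ascii_case(needle))
}
```

With this sketch, Matcher::Literal("wifi".to_string()).is_match(r"C:\Windows\WiFiTask") takes the ASCII fast path and returns true, while a non-ASCII haystack falls back to the slower Unicode comparison. The more often patterns are literals and haystacks are ASCII-only, the more often the cheap branch is taken.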

Use regex::cp::Regex if:

  • You need regex or glob syntax.
  • You need find_iter() or captures_iter().
  • You need build_many().
  • You need custom matching callbacks.
  • You want a smaller binary and are less concerned about performance.

§Performance

The following Cargo.toml settings are recommended if best performance is desired:

[profile.release]
lto = "fat"
codegen-units = 1

These settings can improve performance by up to 5 to 10%.

§Crate features

Most used feature combinations:

Features:

  • std (enabled by default)
    • For regex: When enabled, this will cause regex to use the standard library. In terms of APIs, std causes error types to implement the std::error::Error trait. Enabling std will also result in performance optimizations, including SIMD and faster synchronization primitives. Notably, disabling the std feature will result in the use of spin locks. To use a regex engine without std and without spin locks, you’ll need to drop down to use APIs that accept a Cache value explicitly.
  • alloc (enabled by default) — Enables use of the alloc library. This is required for most APIs in this crate.

§Languages

  • unicode (enabled by default) — Unicode support.

  • pinyin — Chinese pinyin match support.

  • romaji — Japanese romaji match support.

    The dictionary will take ~4.8 MiB (5.5 MiB without compression) in the binary at the moment, much larger than pinyin’s.

  • romaji-compress-words (enabled by default) — Reduces binary size (and memory usage) by 696 KiB (771 KiB if zstd is already used), at the cost of 1.1 ms of extra romanizer build time.

§Syntax

  • syntax — Pattern syntax support. Equivalent to features syntax-glob,syntax-ev. See syntax for details.

  • syntax-glob — glob()-style pattern matching syntax support. See syntax::glob for details.

  • syntax-ev — Support for the syntax used by IbEverythingExt. See syntax::ev for details.

  • syntax-regex — Enables a dependency on regex-syntax. This makes APIs for building regex engines from pattern strings available. Without the regex-syntax dependency, the only way to build a regex engine is generally to deserialize a previously built DFA or to hand assemble an NFA using its builder API. Once you have an NFA, you can build any of the regex engines in this crate. The syntax feature also enables alloc.

    See syntax::regex for details.

§Regular expression engines

  • regex — Regular expression support. See regex for details.

    Includes all regex features except regex-unicode (enabled by default) and regex-callback.

  • regex-automata — Regex engine types.

  • regex-nfa — Regex NFA engines.

  • regex-cp — Enables the regex::cp engine. Includes features regex-nfa,syntax-regex.

  • regex-callback — Regex with custom matching callbacks.

  • regex-lita — Enables the regex::lita engine. Includes feature regex-cp.

  • regex-unicode (enabled by default) — Enables all regex Unicode features. This feature is enabled by default, and will always cover all Unicode features, even if more are added in the future.

§Performance

  • perf (enabled by default) — Enables all performance related features. This feature is enabled by default and is intended to cover all reasonable features that improve performance, even if more are added in the future.

  • perf-inline (enabled by default) — Enables aggressive use of inlining.

    When enabled, inline(always) is used in (many) strategic locations to help performance at the expense of longer compile times and increased binary size.

  • perf-literal (enabled by default) — Enables all literal related optimizations.

  • perf-literal-substring (enabled by default) — Enables all single substring literal optimizations. This includes adding a dependency on the memchr crate.

  • perf-unicode (enabled by default) — Unicode and ASCII related optimizations.

  • perf-plain-regex — Not used at the moment.

    Build size +837.5 KiB

§FFI

  • inmut-data — Makes pinyin::PinyinData interior mutable, so that it can be easily used as a static variable.

  • minimal — Minimal APIs that can be used in one call. See minimal for details.

  • encoding — Support for non-UTF-8 encodings. Only UTF-16 and UTF-32 at the moment.

    Non-UTF-8 Japanese romaji match is not yet supported.
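As general Rust background for the inmut-data feature above: a static can only be mutated safely through a shared reference, which is why a type meant to live in a static benefits from interior mutability. The sketch below illustrates that pattern with std only; Dict and DICT are hypothetical names and do not reflect how pinyin::PinyinData is actually implemented.

```rust
/// Hypothetical data table with interior mutability: every method takes
/// `&self`, so the value can be placed in a plain `static`.
struct Dict {
    entries: std::sync::RwLock<Vec<(char, &'static str)>>,
}

impl Dict {
    /// `const fn` so the table can be built in a static initializer.
    const fn new() -> Self {
        Dict {
            entries: std::sync::RwLock::new(Vec::new()),
        }
    }

    /// Adds an entry through a shared reference.
    fn add(&self, ch: char, reading: &'static str) {
        self.entries.write().unwrap().push((ch, reading));
    }

    /// Looks up the first reading recorded for `ch`.
    fn lookup(&self, ch: char) -> Option<&'static str> {
        self.entries
            .read()
            .unwrap()
            .iter()
            .find(|(c, _)| *c == ch)
            .map(|(_, reading)| *reading)
    }
}

/// No `static mut` and no `unsafe` are needed.
static DICT: Dict = Dict::new();
```

After DICT.add('拼', "pin"), a call to DICT.lookup('拼') returns Some("pin"). Without interior mutability, the same design would need static mut or an external lock around the whole value, which is awkward to expose across an FFI boundary.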

Re-exports§

pub use ib_romaji as romaji; (requires feature romaji)
pub use ib_unicode as unicode;

Modules§

matcher
The Ib matcher. See IbMatcher.
minimal (requires feature minimal)
Minimal APIs
pinyin (requires feature pinyin)
Pinyin
regex (requires feature regex-automata)
This module provides routines for searching strings for matches of a regular expression (aka “regex”). The regex syntax supported by this crate is similar to other regex engines, but it lacks several features that are not known how to implement efficiently. This includes, but is not limited to, look-around and backreferences. In exchange, all regex searches in this crate have worst case O(m * n) time complexity, where m is proportional to the size of the regex and n is proportional to the size of the string being searched.
syntax (requires feature syntax-glob, syntax-ev, or syntax-regex)
A collection of syntax parsers for either IbMatcher or regex engines.