A multilingual, flexible and fast string, glob and regex matcher. Supports 拼音匹配 (Chinese pinyin match) and ローマ字検索 (Japanese romaji match).
§Features
- Unicode support
  - Full UTF-8 support and limited support for UTF-16 and UTF-32.
  - Unicode case insensitivity (simple case folding).
- Chinese pinyin matching (拼音匹配)
  - Support characters with multiple readings (i.e. heteronyms, 多音字).
  - Support multiple pinyin notations, including Quanpin (全拼), Jianpin (简拼) and many Shuangpin (双拼) notations.
  - Support mixing multiple notations during matching.
- Japanese romaji matching (ローマ字検索)
  - Support characters with multiple readings (i.e. heteronyms, 同形異音語).
  - Support only the Hepburn romanization system at the moment.
- glob()-style pattern matching (i.e. ?, *, [] and **)
  - Support different anchor modes, treating surrounding wildcards as anchors and special anchors in file paths.
  - Support two separators (//) or a complement separator (\) as a glob star (*/**).
- Regular expression
  - Support the same syntax as regex, including wildcards, repetitions, alternations, groups, etc.
  - Support custom matching callbacks, which can be used to implement ad hoc look-around, backreferences, balancing groups/recursion/subroutines, combining domain-specific parsers, etc.
- Relatively high performance
  - Generally on par with the regex crate; depending on the case, it can be faster or slower.
And all of the above features are optional. You don’t need to pay the performance and binary size cost for features you don’t use.
You can also use ib-pinyin if you only need Chinese pinyin match, which is simpler and more stable.
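To make "Unicode case insensitivity (simple case folding)" more concrete, here is a standard-library-only sketch of a caseless comparison. It is not how ib-matcher implements it: `simple_fold` and `eq_ignore_case` are hypothetical helpers, and the fold is only approximated with `char::to_lowercase` (used when it yields exactly one character) plus a special case for the Greek final sigma.

```rust
/// Hypothetical helper: approximates simple (one-to-one) case folding.
/// Assumption: `char::to_lowercase` is close enough for illustration;
/// real simple case folding uses the Unicode CaseFolding data.
fn simple_fold(c: char) -> char {
    // Final sigma (U+03C2) folds to regular sigma (U+03C3).
    if c == '\u{3c2}' {
        return '\u{3c3}';
    }
    let mut lower = c.to_lowercase();
    match (lower.next(), lower.next()) {
        // Keep the mapping only when it is a single character.
        (Some(l), None) => l,
        // Multi-character lowercase mappings are left unfolded.
        _ => c,
    }
}

/// Hypothetical helper: compares two strings character by character after folding.
fn eq_ignore_case(a: &str, b: &str) -> bool {
    a.chars().map(simple_fold).eq(b.chars().map(simple_fold))
}
```

Because every character folds to exactly one character, string lengths in characters never change, which is what lets a matcher compare haystack and pattern position by position.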
§Usage
    // cargo add ib-matcher --features pinyin,romaji
    use ib_matcher::matcher::{IbMatcher, PinyinMatchConfig, RomajiMatchConfig};

    // Unicode case-insensitive plain text matching.
    let matcher = IbMatcher::builder("la vie est drôle").build();
    assert!(matcher.is_match("LA VIE EST DRÔLE"));

    let matcher = IbMatcher::builder("βίος").build();
    assert!(matcher.is_match("Βίοσ"));
    assert!(matcher.is_match("ΒΊΟΣ"));

    // Chinese pinyin matching.
    let matcher = IbMatcher::builder("pysousuoeve")
        .pinyin(PinyinMatchConfig::default())
        .build();
    assert!(matcher.is_match("拼音搜索Everything"));

    // Japanese romaji matching.
    let matcher = IbMatcher::builder("konosuba")
        .romaji(RomajiMatchConfig::default())
        .is_pattern_partial(true)
        .build();
    assert!(matcher.is_match("この素晴らしい世界に祝福を"));
See also choosing a matcher.
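To illustrate why a pattern such as "pysousuoeve" can match "拼音搜索Everything", the following standard-library-only sketch matches each haystack character either literally, by a full reading (Quanpin), or by the first letter of a reading (Jianpin), and allows mixing them. Everything here is an assumption for illustration: the tiny `readings` dictionary, `matches_at` and `is_match` are hypothetical and unrelated to the crate's real data or algorithm.

```rust
/// Hypothetical mini dictionary: full pinyin readings per character.
fn readings(c: char) -> &'static [&'static str] {
    match c {
        '拼' => &["pin"],
        '音' => &["yin"],
        '搜' => &["sou"],
        '索' => &["suo"],
        '乐' => &["le", "yue"], // a heteronym with two readings
        _ => &[],
    }
}

/// Tries to consume the whole `pattern` starting at the first char of `hay`.
fn matches_at(pattern: &str, hay: &[char]) -> bool {
    if pattern.is_empty() {
        return true;
    }
    let Some((&c, rest)) = hay.split_first() else {
        return false;
    };
    // 1. Literal match (ASCII case-insensitive only, to keep the sketch short).
    let mut pc = pattern.chars();
    let p0 = pc.next().unwrap();
    if p0.eq_ignore_ascii_case(&c) && matches_at(pc.as_str(), rest) {
        return true;
    }
    for r in readings(c) {
        // 2. Full reading, e.g. "sou" for 搜.
        if let Some(tail) = pattern.strip_prefix(r) {
            if matches_at(tail, rest) {
                return true;
            }
        }
        // 3. First letter only, e.g. "s" for 搜 (readings are ASCII here).
        if pattern.starts_with(&r[..1]) && matches_at(&pattern[1..], rest) {
            return true;
        }
    }
    false
}

/// Substring-style search: try every start position in the haystack.
fn is_match(pattern: &str, hay: &str) -> bool {
    let chars: Vec<char> = hay.chars().collect();
    (0..=chars.len()).any(|i| matches_at(pattern, &chars[i..]))
}
```

With this sketch, `is_match("pysousuoeve", "拼音搜索Everything")` succeeds by taking first letters for 拼 and 音, full readings for 搜 and 索, and literal characters for "Eve". The backtracking search is exponential in the worst case, so it is only suitable as an explanation, not as an implementation.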
§glob()-style pattern matching
See the glob module for more details. Here is a quick example:

    // cargo add ib-matcher --features syntax-glob,regex,romaji
    use ib_matcher::{
        matcher::MatchConfig,
        regex::lita::Regex,
        syntax::glob::{parse_wildcard_path, PathSeparator},
    };

    let re = Regex::builder()
        .ib(MatchConfig::builder().romaji(Default::default()).build())
        .build_from_hir(
            parse_wildcard_path()
                .separator(PathSeparator::Windows)
                .call("wifi**miku"),
        )
        .unwrap();
    assert!(re.is_match(r"C:\Windows\System32\ja-jp\WiFiTask\ミク.exe"));
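For readers unfamiliar with the wildcards themselves, here is a standard-library-only sketch of conventional glob semantics: `?` matches one non-separator character, `*` matches any run of non-separator characters, and `**` may also cross separators. It is an assumption-laden illustration, not the crate's parser: `glob_match` and `glob` are hypothetical, the whole path must match (no anchor modes), `[]` classes are omitted, and `/` is hard-coded as the separator.

```rust
/// Hypothetical recursive matcher over chars; `sep` is the path separator.
fn glob_match(pat: &[char], text: &[char], sep: char) -> bool {
    match pat {
        // Pattern exhausted: succeed only if the text is exhausted too.
        [] => text.is_empty(),
        // `**`: skip any number of characters, separators included.
        ['*', '*', rest @ ..] => {
            (0..=text.len()).any(|i| glob_match(rest, &text[i..], sep))
        }
        // `*`: skip any number of characters, but never a separator.
        ['*', rest @ ..] => {
            for i in 0..=text.len() {
                if glob_match(rest, &text[i..], sep) {
                    return true;
                }
                if i < text.len() && text[i] == sep {
                    break;
                }
            }
            false
        }
        // `?`: exactly one character that is not a separator.
        ['?', rest @ ..] => {
            matches!(text, [c, tail @ ..] if *c != sep && glob_match(rest, tail, sep))
        }
        // Any other character must match literally.
        [p, rest @ ..] => {
            matches!(text, [c, tail @ ..] if c == p && glob_match(rest, tail, sep))
        }
    }
}

/// Convenience wrapper using `/` as the separator.
fn glob(pattern: &str, path: &str) -> bool {
    let pat: Vec<char> = pattern.chars().collect();
    let text: Vec<char> = path.chars().collect();
    glob_match(&pat, &text, '/')
}
```

For example, `glob("*.rs", "src/main.rs")` is false because `*` stops at the separator, while `glob("**/*.rs", "src/a/main.rs")` is true. Compiling the pattern into a regex, as the example above does, avoids the exponential backtracking this naive recursion can hit.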
§Regular expression
See the regex module for more details. Here is a quick example:

    // cargo add ib-matcher --features regex,pinyin,romaji
    use ib_matcher::{
        matcher::{MatchConfig, PinyinMatchConfig, RomajiMatchConfig},
        regex::{cp::Regex, Match},
    };

    let config = MatchConfig::builder()
        .pinyin(PinyinMatchConfig::default())
        .romaji(RomajiMatchConfig::default())
        .build();

    // Romaji match inside a regex.
    let re = Regex::builder()
        .ib(config.shallow_clone())
        .build("raki.suta")
        .unwrap();
    assert_eq!(re.find("「らき☆すた」"), Some(Match::must(0, 3..18)));

    // Pinyin match combined with regex syntax.
    let re = Regex::builder()
        .ib(config.shallow_clone())
        .build("pysou.*?(any|every)thing")
        .unwrap();
    assert_eq!(re.find("拼音搜索Everything"), Some(Match::must(0, 0..22)));

    let config = MatchConfig::builder()
        .pinyin(PinyinMatchConfig::default())
        .romaji(RomajiMatchConfig::default())
        .mix_lang(true)
        .build();
    let re = Regex::builder()
        .ib(config.shallow_clone())
        .build("(?x)^zangsounofuri-?ren # Mixing pinyin and romaji")
        .unwrap();
    assert_eq!(re.find("葬送のフリーレン"), Some(Match::must(0, 0..24)));

    // cargo add ib-matcher --features regex,regex-callback
    use ib_matcher::regex::cp::Regex;

    let re = Regex::builder()
        // The callback pushes the length of each match it accepts at `at`.
        .callback("ascii", |input, at, push| {
            let haystack = &input.haystack()[at..];
            if !haystack.is_empty() && haystack[0].is_ascii() {
                push(1);
            }
        })
        .build(r"(ascii)+\d(ascii)+")
        .unwrap();
    let hay = "that4U this4me";
    assert_eq!(&hay[re.find(hay).unwrap().span()], " this4me");
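The spans in the examples above (3..18, 0..22 and 0..24) are consistent with UTF-8 byte offsets rather than character counts: each kana, kanji and full-width bracket takes three bytes. The standard-library-only sketch below reproduces the first span with a plain substring search; `byte_span` is a hypothetical helper, not part of ib-matcher.

```rust
use std::ops::Range;

/// Hypothetical helper: UTF-8 byte range of the first occurrence of `needle`.
fn byte_span(hay: &str, needle: &str) -> Option<Range<usize>> {
    // `str::find` returns a byte offset, and `str::len` counts bytes.
    let start = hay.find(needle)?;
    Some(start..start + needle.len())
}

/// Slices the haystack with a byte range, as the examples do with `span()`.
fn slice<'a>(hay: &'a str, span: Range<usize>) -> &'a str {
    &hay[span]
}
```

Here `byte_span("「らき☆すた」", "らき☆すた")` gives `Some(3..18)`: the opening bracket occupies bytes 0..3 and the five matched characters occupy 15 bytes. Byte ranges like these can be used to slice the haystack directly, provided they fall on character boundaries.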
§Choosing a matcher
Use matcher::IbMatcher if:
- You only need plain text matching, optionally with Unicode case insensitivity, Chinese pinyin match and Japanese romaji match.

Use regex::lita::Regex if:
- You want high performance (and don’t mind some binary footprint).

  regex::lita::Regex can be much faster than regex::cp::Regex, and slightly faster than the regex crate (due to enum dispatch), if the following conditions are met:
  - Your pattern is often a literal string (i.e. plain text, optionally with pinyin/romaji match).
  - A fair portion of your haystacks is ASCII-only.

  A typical use case that meets the above conditions is matching file names and paths.

Use regex::cp::Regex if:
- You need regex or glob syntax.
- You need find_iter() or captures_iter().
- You need build_many().
- You need custom matching callbacks.
- You want a smaller binary size and don’t mind somewhat lower performance.
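The "enum dispatch" idea mentioned above can be sketched with the standard library alone: a literal pattern gets its own enum variant with a cheap ASCII fast path, and everything else falls back to a general engine. This is only a guess at the general shape, not the crate's actual design; the `Matcher` type, its constructors and `ascii_contains_ignore_case` are all hypothetical, and the general engine is modeled as an arbitrary closure.

```rust
/// Hypothetical helper: case-insensitive substring search over ASCII bytes.
fn ascii_contains_ignore_case(hay: &[u8], needle: &[u8]) -> bool {
    needle.is_empty()
        || hay.windows(needle.len()).any(|w| w.eq_ignore_ascii_case(needle))
}

/// Hypothetical matcher: one variant per strategy, chosen once at build time.
enum Matcher {
    /// Plain text pattern, stored lowercased.
    Literal(String),
    /// Stand-in for a general (slower) engine.
    General(Box<dyn Fn(&str) -> bool>),
}

impl Matcher {
    fn literal(pattern: &str) -> Self {
        Matcher::Literal(pattern.to_lowercase())
    }

    fn general(f: impl Fn(&str) -> bool + 'static) -> Self {
        Matcher::General(Box::new(f))
    }

    fn is_match(&self, hay: &str) -> bool {
        match self {
            // Fast path: literal pattern and ASCII-only haystack.
            Matcher::Literal(needle) if hay.is_ascii() && needle.is_ascii() => {
                ascii_contains_ignore_case(hay.as_bytes(), needle.as_bytes())
            }
            // Slower literal path: lowercase the haystack (allocates).
            Matcher::Literal(needle) => hay.to_lowercase().contains(needle.as_str()),
            // Everything else is delegated to the general engine.
            Matcher::General(f) => f(hay),
        }
    }
}
```

The `match` is resolved with a simple branch instead of a virtual call, and the common case (literal pattern, ASCII file path) never touches Unicode tables or the general engine, which is the kind of workload the two conditions above describe.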
§Performance
The following Cargo.toml settings are recommended if best performance is desired:

    [profile.release]
    lto = "fat"
    codegen-units = 1

These can improve performance by 5 to 10% at most.
§Crate features
Most used feature combinations:
- Languages: pinyin,romaji
- glob: syntax-glob,regex
- Regex: regex
- Regex with custom matching callbacks: regex-callback

Features:
- std (enabled by default)
  - For regex: When enabled, this will cause regex to use the standard library. In terms of APIs, std causes error types to implement the std::error::Error trait. Enabling std will also result in performance optimizations, including SIMD and faster synchronization primitives. Notably, disabling the std feature will result in the use of spin locks. To use a regex engine without std and without spin locks, you’ll need to drop down to use APIs that accept a Cache value explicitly.
- alloc (enabled by default): Enables use of the alloc library. This is required for most APIs in this crate.
§Languages
- unicode (enabled by default): Unicode support.
- pinyin: Chinese pinyin match support.
- romaji: Japanese romaji match support.

  The dictionary currently takes ~4.8 MiB (5.5 MiB without compression) in the binary, much larger than pinyin’s.
- romaji-compress-words (enabled by default): Reduces binary size (and memory usage) by 696 KiB (771 KiB if zstd is already used), and increases romanizer build time by 1.1 ms.
§Syntax
- syntax: Pattern syntax support. Equivalent to features syntax-glob,syntax-ev. See syntax for details.
- syntax-glob: glob()-style pattern matching syntax support. See syntax::glob for details.
- syntax-ev: Support for the syntax used by IbEverythingExt. See syntax::ev for details.
- syntax-regex: Enables a dependency on regex-syntax. This makes APIs for building regex engines from pattern strings available. Without the regex-syntax dependency, the only way to build a regex engine is generally to deserialize a previously built DFA or to hand assemble an NFA using its builder API. Once you have an NFA, you can build any of the regex engines in this crate. The syntax feature also enables alloc.

  See syntax::regex for details.
§Regular expression engines
- regex: Regular expression support. See regex for details.

  Includes all regex features except regex-unicode (enabled by default) and regex-callback.
- regex-automata: Regex engine types.
- regex-nfa: Regex NFA engines.
- regex-cp: Enables the regex::cp engine. Includes features regex-nfa,syntax-regex.
- regex-callback: Regex with custom matching callbacks.
- regex-lita: Enables the regex::lita engine. Includes feature regex-cp.
- regex-unicode (enabled by default): Enables all regex Unicode features. This feature is enabled by default, and will always cover all Unicode features, even if more are added in the future.
§Performance
- perf (enabled by default): Enables all performance related features. This feature is enabled by default and is intended to cover all reasonable features that improve performance, even if more are added in the future.
- perf-inline (enabled by default): Enables aggressive use of inlining.

  When enabled, inline(always) is used in (many) strategic locations to help performance at the expense of longer compile times and increased binary size.
- perf-literal (enabled by default): Enables all literal related optimizations.
- perf-literal-substring (enabled by default): Enables all single substring literal optimizations. This includes adding a dependency on the memchr crate.
- perf-unicode (enabled by default): Unicode and ASCII related optimizations.
- perf-plain-regex: Not used at the moment.

  Build size +837.5 KiB
§FFI
- inmut-data: Makes pinyin::PinyinData interior mutable, so it can easily be used as a static variable.
- minimal: Minimal APIs that can be used in one call. See minimal for details.
- encoding: Support for non-UTF-8 encodings. Only UTF-16 and UTF-32 at the moment.

  Non-UTF-8 Japanese romaji match is not yet supported.
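As background on why interior mutability matters for a static variable (the motivation given for inmut-data), here is a standard-library-only sketch. A plain `static` can only be accessed through shared references, so any later modification has to go through a type such as `RwLock`. The `ReadingTable` type and its methods are hypothetical and do not reflect the layout or API of pinyin::PinyinData.

```rust
use std::sync::RwLock;

/// Hypothetical stand-in for a data table that is filled in after startup.
struct ReadingTable {
    entries: RwLock<Vec<(char, &'static str)>>,
}

impl ReadingTable {
    /// `const fn` so the table can initialize a `static`.
    const fn new() -> Self {
        Self { entries: RwLock::new(Vec::new()) }
    }

    /// Mutates through `&self`: this is the interior mutability.
    fn add(&self, c: char, reading: &'static str) {
        self.entries.write().unwrap().push((c, reading));
    }

    /// Returns the first reading registered for `c`, if any.
    fn lookup(&self, c: char) -> Option<&'static str> {
        self.entries
            .read()
            .unwrap()
            .iter()
            .find(|(k, _)| *k == c)
            .map(|(_, r)| *r)
    }
}

/// Shared table: no `static mut`, no `unsafe`, usable from any thread.
static TABLE: ReadingTable = ReadingTable::new();
```

After `TABLE.add('拼', "pin")`, a call to `TABLE.lookup('拼')` returns `Some("pin")`. A global that can be initialized lazily and shared like this is convenient for FFI callers, which usually cannot thread a Rust value through every call.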
Re-exports§
- pub use ib_romaji as romaji; (crate feature romaji)
- pub use ib_unicode as unicode;
Modules§
- matcher: The Ib matcher. See IbMatcher.
- minimal (crate feature minimal): Minimal APIs
- pinyin (crate feature pinyin): Pinyin
- regex (crate feature regex-automata): This module provides routines for searching strings for matches of a regular expression (aka “regex”). The regex syntax supported by this crate is similar to other regex engines, but it lacks several features that are not known how to implement efficiently. This includes, but is not limited to, look-around and backreferences. In exchange, all regex searches in this crate have worst case O(m * n) time complexity, where m is proportional to the size of the regex and n is proportional to the size of the string being searched.
- syntax (crate feature syntax-glob or syntax-ev or syntax-regex): A collection of syntax parsers for either IbMatcher or regex engines.
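The O(m * n) bound quoted for the regex module comes from simulating all automaton states in lockstep instead of backtracking. The standard-library-only sketch below shows the idea on a deliberately tiny regex subset (literal characters, `.`, and a postfix `*`); it is not the crate's engine, and `Atom`, `parse`, `closure` and `full_match` are hypothetical names.

```rust
/// One pattern element: a literal char (`Some`) or `.` (`None`), optionally starred.
#[derive(Clone, Copy)]
struct Atom {
    ch: Option<char>,
    star: bool,
}

/// Parses the tiny subset: literals, `.`, and postfix `*`.
fn parse(pattern: &str) -> Vec<Atom> {
    let mut atoms: Vec<Atom> = Vec::new();
    for c in pattern.chars() {
        if c == '*' {
            if let Some(last) = atoms.last_mut() {
                last.star = true;
            }
        } else {
            let ch = if c == '.' { None } else { Some(c) };
            atoms.push(Atom { ch, star: false });
        }
    }
    atoms
}

/// Epsilon closure: a starred atom may be skipped without consuming input.
fn closure(atoms: &[Atom], set: &mut [bool]) {
    for i in 0..atoms.len() {
        if set[i] && atoms[i].star {
            set[i + 1] = true;
        }
    }
}

/// Whole-string match: O(m) work per input char, so O(m * n) overall.
fn full_match(pattern: &str, text: &str) -> bool {
    let atoms = parse(pattern);
    let m = atoms.len();
    // cur[i] == true means "the first i atoms have been matched so far".
    let mut cur = vec![false; m + 1];
    cur[0] = true;
    closure(&atoms, &mut cur);
    for c in text.chars() {
        let mut next = vec![false; m + 1];
        for i in 0..m {
            if cur[i] && atoms[i].ch.map_or(true, |a| a == c) {
                // A starred atom stays in place; a plain atom advances.
                if atoms[i].star { next[i] = true } else { next[i + 1] = true }
            }
        }
        closure(&atoms, &mut next);
        cur = next;
    }
    cur[m]
}
```

A pattern like "a*a*a*a*b" against a long run of "a" is a classic slow case for backtracking engines, but here each input character costs at most one pass over the m + 1 states. Features such as backreferences do not fit this state-set model, which is why they are left out in exchange for the bound.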