Expand description

🦞 Crawdad: ChaRActer-Wise Double-Array Dictionary

Crawdad is a library of natural language dictionaries using character-wise double-array tries. The implementation is optimized for strings of multibyte-characters, and you can enjoy fast text processing on such strings such as Japanese or Chinese.

Data structures

Crawdad contains the two trie implementations:

  • Trie is a standard trie form that often provides the fastest queries.
  • MpTrie is a minimal-prefix trie form that is memory-efficient for long strings.

Examples

Looking up an input key

To get a value associated with an input key, use Trie::exact_match().

use crawdad::Trie;

let keys = vec!["世界", "世界中", "国民"];
let trie = Trie::from_keys(&keys).unwrap();

assert_eq!(trie.exact_match("世界中".chars()), Some(1));
assert_eq!(trie.exact_match("日本中".chars()), None);

Finding all occurrences of keys in an input text

To search for all occurrences of registered keys in an input text, use Trie::common_prefix_search() for all starting positions in the text.

use crawdad::Trie;

let keys = vec!["世界", "世界中", "国民"];
let trie = Trie::from_keys(&keys).unwrap();

let haystack: Vec<char> = "国民が世界中にて".chars().collect();
let mut matches = vec![];

for i in 0..haystack.len() {
    for (v, j) in trie.common_prefix_search(haystack[i..].iter().copied()) {
        matches.push((v, i..i + j));
    }
}

assert_eq!(
    matches,
    vec![(2, 0..2), (0, 3..5), (1, 3..6)]
);

Serializing and deserializing the data structure

To serialize/deserialize the data structure into/from a byte sequence, use Trie::serialize_to_vec()/Trie::deserialize_from_slice().

use crawdad::Trie;

let keys = vec!["世界", "世界中", "国民"];
let trie = Trie::from_keys(&keys).unwrap();

let bytes = trie.serialize_to_vec();
let (other, _) = Trie::deserialize_from_slice(&bytes);

assert_eq!(trie.io_bytes(), other.io_bytes());

Re-exports

pub use mptrie::MpTrie;
pub use trie::Trie;

Modules

Definition of errors.

A minimal-prefix trie form that is memory-efficient for long strings.

A standard trie form that often provides the fastest queries.

Constants

Special terminator, which must not be contained in keys.