cedarwood
Efficiently-updatable double-array trie in Rust, ported from the C++ cedar library by Naoki Yoshinaga.
Features
- Fast lookups -- double-array tries offer O(k) lookup time where k is the length of the key, with excellent cache locality compared to pointer-based tries.
- Dynamic updates -- keys can be inserted and deleted after the initial build, unlike many static trie implementations.
- Common-prefix search -- find all keys that are prefixes of a given query string, useful for tokenization and morphological analysis.
- Predictive search -- find all keys that share a given prefix, useful for autocomplete.
- Full Unicode support -- works with any valid UTF-8 string, including CJK characters, supplementary planes (SIP), and combining characters.
- Reduced-trie mode -- optional
reduced-triefeature flag for a more compact representation at the cost of some flexibility.
Installation
Add it to your Cargo.toml:
[]
= "0.5"
To enable the reduced-trie optimization:
[]
= { = "0.5", = ["reduced-trie"] }
Quick Start
use Cedar;
API Overview
| Method | Description |
|---|---|
Cedar::new() |
Create an empty trie |
build(&key_values) |
Bulk-insert key-value pairs |
update(key, value) |
Insert or update a single key |
erase(key) |
Delete a key from the trie |
exact_match_search(key) |
Look up an exact key, returns Option<(value, length, node_pos)> |
common_prefix_search(key) |
Find all dictionary keys that are prefixes of key |
common_prefix_iter(key) |
Iterator version of common_prefix_search |
common_prefix_predict(key) |
Find all dictionary keys that start with key |
common_prefix_predict_iter(key) |
Iterator version of common_prefix_predict |
For detailed API documentation, see docs.rs or the docs/ folder.
Use Cases
- Text segmentation / tokenization -- common-prefix search is the core operation for dictionary-based Chinese/Japanese word segmentation.
- Autocomplete / suggest -- predictive search returns all completions for a typed prefix.
- Morphological analysis -- fast dictionary lookup for NLP pipelines.
- IP routing tables -- longest-prefix matching on serialized address bytes.
- Keyword filtering -- scan text for occurrences of any keyword in a large dictionary.
Benchmarks
A macro-benchmark with a real dictionary is available in benches/macro-benchmark/. A C++ cedar benchmark for comparison is in benches/cpp/.
Documentation
- Architecture & Design -- how the double-array trie works
- API Reference -- detailed method documentation
- Internal Implementation -- block management, conflict resolution, and memory layout
License
This work is released under the BSD-2-Clause license, following the original license of C++ cedar. A copy of the license is provided in the LICENSE file.
Reference
- cedar -- C++ implementation of efficiently-updatable double-array trie by Naoki Yoshinaga
- Aoe, J. (1989). An efficient implementation of trie structures. Software: Practice and Experience.