lexrs 0.2.0

Efficient lexicon data structures: Trie and DAWG
Documentation
  • Coverage
  • 62.26%
    33 out of 53 items documented0 out of 37 items with examples
  • Size
  • Source code size: 79.72 kB This is the summed size of all the files inside the crates.io package for this release.
  • Documentation size: 602.24 kB This is the summed size of all files generated by rustdoc for all configured targets
  • Ø build duration
  • this release: 34s Average build duration of successful builds.
  • all releases: 34s Average build duration of successful builds in releases after 2024-10-23.
  • Links
  • Homepage
  • aosingh/lexrs
    1 0 0
  • crates.io
  • Dependencies
  • Versions
  • Owners
  • aosingh

lexrs — core library

The lexrs Rust crate implements two lexicon data structures — Trie and DAWG — with an optional Python extension module built via PyO3.

Crate layout

lexrs/
  src/
    lib.rs      — public re-exports and Python module registration
    trie.rs     — Trie: arena-allocated prefix tree, wildcard & Levenshtein search
    dawg.rs     — DAWG: minimized Trie with suffix compression
    node.rs     — Node type shared by both structures
    utils.rs    — file I/O helpers and wildcard pattern normalization
    error.rs    — LexError enum (used for out-of-order DAWG insertion, etc.)
    python.rs   — PyO3 wrappers (compiled only with --features python)
    bin/
      genome_bench.rs — standalone benchmark binary
  tests/
    trie_tests.rs — Rust unit tests for Trie
    dawg_tests.rs — Rust unit tests for DAWG

Data structures

Trie

Arena-allocated prefix tree (Vec<Node>). Words may be inserted in any order. Good for write-heavy workloads or when insertion order is unknown.

DAWG (Directed Acyclic Word Graph)

A minimized Trie that merges identical suffix subtrees, yielding a much smaller node count for large lexicons. Words must be inserted in lexicographic order. Call reduce() after individual add() calls to finalize minimization; add_all() handles this automatically.

API

Both types expose the same interface:

Method Description
add(word, count) Insert a word with an optional frequency count
add_all(words) Insert from any iterable (sorts before inserting into DAWG)
add_from_file(path) Insert words from a newline-delimited file
contains(word) Exact membership test
contains_prefix(prefix) True if any stored word starts with this prefix
search(pattern) Wildcard search (* = any chars, ? = exactly one)
search_with_count(pattern) Like search, returns (word, count) pairs
search_with_prefix(prefix) All words beginning with the prefix
search_with_prefix_count(prefix) Like above, with counts
search_within_distance(word, dist) Levenshtein fuzzy search
search_within_distance_count(word, dist) Like above, with counts
word_count() Number of words stored
node_count() Number of nodes in the structure

Consecutive wildcards are normalized: ***, ?**.

Install

Add to your Cargo.toml:

[dependencies]
lexrs = "0.1"

Python bindings

The same data structures are available as a Python package:

pip install pylexrs
from lexrs import Trie, DAWG
t = Trie()
t.add_all(["apple", "apply", "apt"])
t.search("ap*")
from lexrs import Trie, DAWG

t = Trie()
t.add("hello", 5)
t.add_all(["world", "foo"])
"hello" in t               # True
t.search("h*")             # wildcard
t.search_within_distance("helo", 1)  # fuzzy

d = DAWG()
d.add_all(["apple", "apply", "apt"])
"apple" in d
d.search("ap?")

Cargo features

Feature Effect
python Enables PyO3 and compiles the lexrs Python extension module