lexmatch 0.1.0

This is a simple lexicon matching tool that, given a lexicon of words or phrases, identifies all matches in a given target text. Uses suffix arrays.
lexmatch-0.1.0 is not a library.

Lexmatch

This is a simple lexicon matching tool that, given a lexicon of words or phrases, identifies all matches in a given target text. It can be used compute a frequency list for a lexicon, on a target corpus.

The implementation uses suffix arrays. The text must be plain-text UTF-8 and is limited to 2^32 bytes (about 4GB).

Installation

You can build and install the latest stable release using Rust's package manager:

cargo install lexmatch

or if you want the development version after cloning this repository:

cargo install --path .

No cargo/rust on your system yet? Do sudo apt install cargo on Debian/ubuntu based systems, brew install rust on mac, or use rustup.

Usage

See lexmatch --help.

Simple example:

$ lexmatch --lexicon lexicon.lst --text corpus.txt

The lexicon must be plain-text UTF-8 containing one entry per line, an entry need not be a single word and is not constrained in length.