# nlpO3

Thai Natural Language Processing library in Rust,
with Python and Node bindings. Formerly oxidized-thainlp.

## Features

- Thai word tokenizer
  - uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
    - [2.5x faster] than a similar pure-Python implementation (PyThaiNLP's newmm)
  - loads a dictionary from a plain text file (one word per line) or from `Vec<String>`
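
To make the term "maximal matching" concrete, here is a minimal, standard-library-only sketch of the idea: segment the text into the fewest possible dictionary words. It is a toy illustration, not nlpO3's implementation. It ignores Thai Character Cluster boundaries, and the function name `maximal_matching` and the single-character fallback for unknown text are assumptions made for this example.

```rust
use std::collections::HashSet;

/// Toy maximal matching: split `text` into the fewest tokens, where a token
/// is either a dictionary word or a single character not covered by any word.
/// (Hypothetical helper for illustration; not part of the nlpO3 API.)
fn maximal_matching(text: &str, dict: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // Longest dictionary word, in characters, bounds the inner search.
    let max_len = dict.iter().map(|w| w.chars().count()).max().unwrap_or(1);

    // best[i] = (token count for chars[i..], end index of the first token).
    let mut best: Vec<(usize, usize)> = vec![(0, n); n + 1];
    for i in (0..n).rev() {
        // Fallback: emit one character as its own token.
        let mut choice = (best[i + 1].0 + 1, i + 1);
        let longest_end = n.min(i + max_len);
        for j in (i + 1)..=longest_end {
            let candidate: String = chars[i..j].iter().collect();
            // `<=` makes ties prefer the longer word.
            if dict.contains(candidate.as_str()) && best[j].0 + 1 <= choice.0 {
                choice = (best[j].0 + 1, j);
            }
        }
        best[i] = choice;
    }

    // Walk the table from the start to recover the tokens.
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < n {
        let end = best[i].1;
        tokens.push(chars[i..end].iter().collect::<String>());
        i = end;
    }
    tokens
}
```

With the dictionary `{"ab", "abc", "cde", "d", "e"}`, the input `"abcde"` is split into `["ab", "cde"]` (two tokens), whereas a greedy longest-match-first scan would take `"abc"` first and end up with three tokens.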

## Dictionary file

- To keep the library small, nlpO3 does not bundle a dictionary and makes no assumption about which one the developer wants to use.
  The dictionary-based word tokenizer needs a dictionary, so you must supply one.
- For a tokenization dictionary, try
  - [words_th.txt] from [PyThaiNLP] - around 62,000 words (CC0)
  - [word break dictionary] from [libthai] - dictionaries in several categories, with a make script (LGPL-2.1)
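
If you want to inspect or filter a one-word-per-line dictionary before handing it to the tokenizer as a `Vec<String>`, a small standard-library helper is enough. The sketch below is an assumption-laden example, not part of nlpO3: the names `parse_word_list` and `load_word_list` are hypothetical, and trimming whitespace and skipping blank lines are choices made here, not documented nlpO3 behavior.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Turn one-word-per-line text into a word list.
/// Surrounding whitespace is trimmed and blank lines are skipped.
/// (Hypothetical helper for illustration; not part of the nlpO3 API.)
fn parse_word_list(text: &str) -> Vec<String> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Read a dictionary file from disk and parse it with `parse_word_list`.
fn load_word_list<P: AsRef<Path>>(path: P) -> io::Result<Vec<String>> {
    let text = fs::read_to_string(path)?;
    Ok(parse_word_list(&text))
}
```

The resulting `Vec<String>` has the same shape as the word list passed to `NewmmTokenizer::from_word_list` in the Rust usage section.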

## Usage

### Command-line interface

- [nlpo3-cli](nlpo3-cli/)

```bash
echo "ฉันกินข้าว" | nlpo3 segment
```

### Bindings

- [Node.js](nlpo3-nodejs/)
- [Python](nlpo3-python/)

```python
from nlpo3 import load_dict, segment

# Register the dictionary file under a name, then tokenize using that name.
load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
```

### As Rust library

In `Cargo.toml`:

```toml
[dependencies]
# ...
nlpo3 = "1.3.2"
```

Create a tokenizer from a dictionary file,
then use it to tokenize a string (safe mode = true, parallel mode = false):

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

let tokenizer = NewmmTokenizer::new("path/to/dict.file");
// Arguments: text, safe mode, parallel mode.
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();
```

Create a tokenizer from a vector of `String`s:

```rust
let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
let tokenizer = NewmmTokenizer::from_word_list(words);
```

Add words to an existing tokenizer:

```rust
tokenizer.add_word(&["มิวเซียม"]);
```

Remove words from an existing tokenizer:

```rust
tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
```

## Build

### Requirements

- [Rust 2018 Edition]

### Steps

Run the generic tests:

```bash
cargo test
```

Build the API documentation and open it to check:

```bash
cargo doc --open
```

Build (remove `--release` to keep debug information):

```bash
cargo build --release
```

Check `target/` for build artifacts.

## Development documents

- [Notes on custom string](src/)

## Issues

Please report issues at