SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0
nlpO3
Thai natural language processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp.
cargo add nlpo3
pip install nlpo3
Features
- Thai word tokenizer
  - Uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
  - 2.5x faster than a similar pure-Python implementation (PyThaiNLP's newmm)
  - Loads a dictionary from a plain text file (one word per line) or from Vec<String>
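As a rough illustration of dictionary-based matching, the sketch below implements greedy longest matching, a simpler relative of the maximal-matching algorithm named above. It is not nlpO3's implementation: it ignores Thai Character Cluster boundaries, and the function name `longest_match_tokenize` is hypothetical.

```rust
use std::collections::HashSet;

/// Greedy longest-match tokenizer over a word set (illustrative sketch only).
/// At each position it takes the longest dictionary word that starts there,
/// and falls back to a single character when nothing matches.
pub fn longest_match_tokenize(text: &str, dict: &HashSet<String>) -> Vec<String> {
    // Work on chars so that multi-byte Thai characters are never split.
    let chars: Vec<char> = text.chars().collect();
    // Longest dictionary entry, measured in chars; bounds the search window.
    let max_len = dict.iter().map(|w| w.chars().count()).max().unwrap_or(0);

    let mut tokens: Vec<String> = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let window = (chars.len() - start).min(max_len);
        let mut end = start + 1; // fallback: emit one character
        // Try the longest candidate first, then shorter ones.
        for len in (1..=window).rev() {
            let candidate: String = chars[start..start + len].iter().collect();
            if dict.contains(&candidate) {
                end = start + len;
                break;
            }
        }
        let token: String = chars[start..end].iter().collect();
        tokens.push(token);
        start = end;
    }
    tokens
}
```

A greedy scan can choose a long first word that forces a poor split later in the text, which is one reason tokenizers such as newmm consider alternative segmentations instead of committing to the first longest match.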
Dictionary file
- To keep the library small, nlpO3 does not bundle a dictionary and makes no assumption about which dictionary the user wants to use.
- A dictionary is needed for the dictionary-based word tokenizer.
- For a tokenization dictionary, try:
  - words_th.txt from PyThaiNLP
    - ~62,000 words
    - CC0-1.0
  - word break dictionary from libthai
    - consists of dictionaries in different categories, with a make script
    - LGPL-2.1
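The one-word-per-line dictionary format is simple to read with the standard library alone. The sketch below shows one way to turn such a file into a word set. The function names `parse_word_list` and `load_word_list` are hypothetical and are not part of the nlpO3 API.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::Path;

/// Parses dictionary text with one word per line.
/// Surrounding whitespace is trimmed and blank lines are skipped.
pub fn parse_word_list(contents: &str) -> HashSet<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Reads a UTF-8 dictionary file and parses it with `parse_word_list`.
pub fn load_word_list(path: &Path) -> io::Result<HashSet<String>> {
    let contents = fs::read_to_string(path)?;
    Ok(parse_word_list(&contents))
}
```

The resulting set can be converted to a `Vec<String>` with `into_iter().collect()` when a word vector is needed instead of a file path.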
Usage
Node.js binding
See nlpo3-nodejs.
Python binding
Example:
See more at nlpo3-python.
Rust library
Install
cargo add nlpo3
In Cargo.toml:
[dependencies]
# ...
nlpo3 = "1.4.0"
Example
Create a tokenizer using a dictionary from file, then use it to tokenize a string (safe mode = true, and parallel mode = false):
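use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

// "path/to/dict.file" is a placeholder for your dictionary file.
let mut tokenizer = NewmmTokenizer::new("path/to/dict.file");
// Arguments: text, safe mode, parallel mode.
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();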
Create a tokenizer using a dictionary from a vector of Strings:
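// Sample words, for illustration only.
let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
let mut tokenizer = NewmmTokenizer::from_word_list(words);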
Add words to an existing tokenizer:
tokenizer.add_word(&["มิวเซียม"]);
Remove words from an existing tokenizer:
tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
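To show why adding and removing words at runtime can be cheap for a dictionary-based tokenizer, here is a minimal character trie with insertion, removal, and prefix lookup. It is a sketch under assumptions: the `WordTrie` type and its methods are hypothetical and do not describe how nlpO3 stores its dictionary.

```rust
use std::collections::HashMap;

/// One trie node: child nodes keyed by character, plus an end-of-word flag.
#[derive(Default)]
pub struct TrieNode {
    children: HashMap<char, TrieNode>,
    is_word: bool,
}

/// A hypothetical word trie supporting add, remove, and prefix lookup.
#[derive(Default)]
pub struct WordTrie {
    root: TrieNode,
}

impl WordTrie {
    /// Inserts a word, creating nodes along its path as needed.
    pub fn add_word(&mut self, word: &str) {
        let mut node = &mut self.root;
        for ch in word.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.is_word = true;
    }

    /// Unmarks a word and returns whether it was present.
    /// Nodes are left in place to keep the sketch short.
    pub fn remove_word(&mut self, word: &str) -> bool {
        let mut node = &mut self.root;
        for ch in word.chars() {
            match node.children.get_mut(&ch) {
                Some(next) => node = next,
                None => return false,
            }
        }
        let was_word = node.is_word;
        node.is_word = false;
        was_word
    }

    /// Returns the lengths (in chars) of every stored word that is a prefix of `text`.
    pub fn prefix_lengths(&self, text: &str) -> Vec<usize> {
        let mut node = &self.root;
        let mut lengths = Vec::new();
        for (i, ch) in text.chars().enumerate() {
            match node.children.get(&ch) {
                Some(next) => node = next,
                None => break,
            }
            if node.is_word {
                lengths.push(i + 1);
            }
        }
        lengths
    }
}
```

Both updates touch only the nodes along one word's path, so their cost depends on the word length and not on the dictionary size.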
Command-line interface
Example:
echo "ฉันกินข้าว" | nlpo3 segment
See more at nlpo3-cli.
Build
Requirements
Steps
Generic test:
cargo test
Build API document and open it to check:
cargo doc --open
Build (remove --release to keep debug information):
cargo build --release
Check target/ for build artifacts.
Development
Development document:
Issues:
- Please report issues at https://github.com/PyThaiNLP/nlpo3/issues
License
nlpO3 is copyrighted by its authors and licensed under the terms of the Apache Software License 2.0 (Apache-2.0). See the LICENSE file for details.