Crate vn_nlp

Vietnamese NLP library — tokenization, normalization, segmentation.

§Quick Start

use vn_nlp::tokenize;

let tokens = tokenize("Xin chào Việt Nam").unwrap();
assert_eq!(tokens[0].text, "Xin");

Modules§

error
normalize: Text normalization — diacritics, Unicode NFC/NFD.
segment: Sentence segmentation.
tokenize: Tokenization algorithms for Vietnamese.
traits
types

Structs§

Sentence: A sentence produced by segmentation.
Span: Byte-offset position in the original string.
Token: A token produced by tokenization.

Enums§

TokenKind: Token classification.
VnNlpError: Common error type for vn-nlp.

Traits§

Normalizer: Trait for normalization algorithms.
Segmenter: Trait for sentence segmentation algorithms.
Tokenizer: Trait for tokenization algorithms.

Functions§

normalize: Normalizes Vietnamese text: NFC plus whitespace collapsing.
segment: Splits text into a list of sentences (convenience function).
tokenize: Tokenizes Vietnamese text by syllable (convenience function).
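
To make the ideas of whitespace collapsing, syllable-level tokenization, and byte-offset spans concrete, here is a minimal standalone sketch using only the Rust standard library. It is not the crate's implementation: the field layout of `Span` and `Token` and the helper names `collapse_whitespace` and `tokenize_syllables` are assumptions made for illustration. The NFC step is omitted because it needs a Unicode normalization library, and punctuation is not split off.

```rust
/// Half-open byte range `[start, end)` into the original string.
/// Hypothetical layout; the real `vn_nlp::Span` may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// A token with its text and position. Hypothetical layout.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub text: String,
    pub span: Span,
}

/// Collapse every run of whitespace into a single space and trim both ends.
/// Illustrates only the whitespace half of "NFC + collapse whitespace".
pub fn collapse_whitespace(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Split on whitespace, treating each whitespace-delimited chunk as one
/// syllable, and record byte offsets into `input`.
pub fn tokenize_syllables(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;

    // `char_indices` yields byte offsets, so slices stay on char boundaries
    // even for multi-byte Vietnamese letters.
    for (i, ch) in input.char_indices() {
        if ch.is_whitespace() {
            if let Some(s) = start.take() {
                tokens.push(Token {
                    text: input[s..i].to_string(),
                    span: Span { start: s, end: i },
                });
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }

    // Flush the final token when the input does not end in whitespace.
    if let Some(s) = start {
        tokens.push(Token {
            text: input[s..].to_string(),
            span: Span { start: s, end: input.len() },
        });
    }
    tokens
}
```

Because spans are byte offsets rather than character indices, slicing the original string with `&input[span.start..span.end]` recovers each token's text exactly, which is presumably why a byte-based `Span` is convenient for mapping tokens back to the source.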