Lindera UniDic Builder
UniDic builder for Lindera. This project is forked from fulmicoton's kuromoji-rs.
Install
% cargo install lindera-unidic-builder
Build
The following products are required to build:
- Rust >= 1.46.0
% cargo build --release
Dictionary version
This project supports UniDic 2.1.2. See the UniDic website for details.
Building a dictionary
Build a dictionary with the lindera-unidic-builder command:
% curl -L -o /tmp/unidic-mecab-2.1.2_src.zip "https://ccd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/unidic-mecab-2.1.2_src.zip"
% unzip /tmp/unidic-mecab-2.1.2_src.zip -d /tmp
% lindera-unidic-builder -s /tmp/unidic-mecab-2.1.2_src -d /tmp/lindera-unidic-2.1.2
Dictionary format
Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 品詞大分類 | | |
| 1 | 品詞中分類 | | |
| 2 | 品詞小分類 | | |
| 3 | 品詞細分類 | | |
| 4 | 活用型 | | |
| 5 | 活用形 | | |
| 6 | 語彙素読み | | |
| 7 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | |
| 8 | 書字形出現形 | | |
| 9 | 発音形出現形 | | |
| 10 | 書字形基本形 | | |
| 11 | 発音形基本形 | | |
| 12 | 語種 | | |
| 13 | 語頭変化型 | | |
| 14 | 語頭変化形 | | |
| 15 | 語末変化型 | | |
| 16 | 語末変化形 | | |
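
The index-to-name mapping in the table above can be applied to a comma-separated feature string. The following is a minimal sketch, not part of this crate: `FIELD_NAMES` and `label_features` are hypothetical names, and the sketch assumes values contain no quoted commas.

```rust
/// Field names copied from the table above, indexed by column position.
const FIELD_NAMES: [&str; 17] = [
    "品詞大分類",
    "品詞中分類",
    "品詞小分類",
    "品詞細分類",
    "活用型",
    "活用形",
    "語彙素読み",
    "語彙素(語彙素表記 + 語彙素細分類)",
    "書字形出現形",
    "発音形出現形",
    "書字形基本形",
    "発音形基本形",
    "語種",
    "語頭変化型",
    "語頭変化形",
    "語末変化型",
    "語末変化形",
];

/// Pair each comma-separated value with the field name at the same index.
/// Hypothetical helper: it splits on every comma, so it assumes that no
/// value contains a quoted comma. Extra values beyond index 16 are dropped.
fn label_features(features: &str) -> Vec<(&'static str, &str)> {
    FIELD_NAMES
        .iter()
        .copied()
        .zip(features.split(','))
        .collect()
}
```

Calling `label_features` on a 17-value string returns 17 `(name, value)` pairs in table order, so the lexeme can be read from position 7.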
Tokenizing text using produced dictionary
You can tokenize text using the produced dictionary with the lindera command:
% echo "羽田空港限定トートバッグ" | lindera -d /tmp/lindera-unidic-2.1.2
羽田 名詞,固有名詞,人名,姓,*,*,羽田,ハタ,ハタ
空港 名詞,普通名詞,一般,*,*,*,空港,クーコー,クーコー
限定 名詞,普通名詞,サ変可能,*,*,*,限定,ゲンテー,ゲンテー
トート 名詞,普通名詞,一般,*,*,*,トート,トート,トート
バッグ 名詞,普通名詞,一般,*,*,*,バッグ,バッグ,バッグ
EOS
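
Each line of the output above holds a surface form followed by a comma-separated feature list, and the `EOS` line closes the sentence. If you post-process this text in Rust, a minimal sketch could look like the following. `Token` and `parse_line` are hypothetical names, not part of the lindera API, and the sketch assumes the surface form and the features are separated by whitespace.

```rust
/// One line of tokenizer output: the surface form plus its feature list.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
struct Token {
    surface: String,
    features: Vec<String>,
}

/// Parse one output line.
/// Returns None for an empty line, the "EOS" marker, or a line that has
/// no feature list after the surface form.
fn parse_line(line: &str) -> Option<Token> {
    let line = line.trim();
    if line.is_empty() || line == "EOS" {
        return None;
    }
    // Split once, at the first whitespace character (space or tab).
    let mut parts = line.splitn(2, char::is_whitespace);
    let surface = parts.next()?.to_string();
    let features = parts
        .next()?
        .trim()
        .split(',')
        .map(str::to_string)
        .collect();
    Some(Token { surface, features })
}
```

For the first output line above, `parse_line` yields the surface `羽田` and nine feature values.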
For more details about the lindera command, please refer to the following URL:
API reference
The API reference is available. Please see the following URL:
- Lindera UniDic Builder