Lindera IPADIC Builder
IPADIC dictionary builder for Lindera. This project is a fork of kuromoji-rs.
Install
% cargo install lindera-ipadic-builder
Build
The following products are required to build:
- Rust >= 1.46.0
% cargo build --release
Dictionary version
This repository contains mecab-ipadic-2.7.0-20070801.
Building a dictionary
Build a dictionary with the lindera-ipadic-builder command:
% curl -L -o /tmp/mecab-ipadic-2.7.0-20070801.tar.gz "http://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"
% tar zxvf /tmp/mecab-ipadic-2.7.0-20070801.tar.gz -C /tmp
% lindera-ipadic-builder -s /tmp/mecab-ipadic-2.7.0-20070801 -d /tmp/lindera-ipadic-2.7.0-20070801
Building a user dictionary
Build a user dictionary with the lindera-ipadic-builder command:
% lindera-ipadic-builder -S ./resources/userdic.csv -D ./resources/userdic.bin
Dictionary format
Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | surface | |
1 | 左文脈ID | left-context-id | |
2 | 右文脈ID | right-context-id | |
3 | コスト | cost | |
4 | 品詞 | part-of-speech | |
5 | 品詞細分類1 | sub POS 1 | |
6 | 品詞細分類2 | sub POS 2 | |
7 | 品詞細分類3 | sub POS 3 | |
8 | 活用型 | conjugation type | |
9 | 活用形 | conjugation form | |
10 | 原形 | base form | |
11 | 読み | reading | |
12 | 発音 | pronunciation | |
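To make the index column concrete, here is a minimal sketch that splits one entry on commas with awk. The entry is hypothetical: the part-of-speech fields are taken from the tokenizer output shown later in this document, while the context IDs and cost are placeholder zeros, not real IPADIC values. It also assumes no field contains a quoted comma.

```shell
# Hypothetical entry: fields at index 1-3 (context IDs, cost) are placeholder zeros.
entry='限定,0,0,0,名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ'

# awk fields are 1-based, so table index N corresponds to awk field N+1.
printf '%s\n' "$entry" | awk -F, '{
  print "surface=" $1    # index 0
  print "pos=" $5        # index 4
  print "reading=" $12   # index 11
}'
# surface=限定
# pos=名詞
# reading=ゲンテイ
```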
User dictionary format (CSV)
Simple version
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | surface | |
1 | 品詞 | part-of-speech | |
2 | 読み | reading | |
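As an illustration, a user dictionary in the simple format could look like the file written below. The two entries are reconstructed from the user-dictionary tokenization output shown later in this document; the temporary file path is only an assumption for the sketch (the build command above reads ./resources/userdic.csv).

```shell
# Write a two-entry user dictionary in the simple format: surface,part-of-speech,reading
userdic_csv="$(mktemp)"
cat > "$userdic_csv" <<'EOF'
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
EOF

# Show the file that would be passed to the builder as the source CSV.
cat "$userdic_csv"
```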
Detailed version
Index | Name (Japanese) | Name (English) | Notes |
---|---|---|---|
0 | 表層形 | surface | |
1 | 左文脈ID | left-context-id | |
2 | 右文脈ID | right-context-id | |
3 | コスト | cost | |
4 | 品詞 | part-of-speech | |
5 | 品詞細分類1 | sub POS 1 | |
6 | 品詞細分類2 | sub POS 2 | |
7 | 品詞細分類3 | sub POS 3 | |
8 | 活用型 | conjugation type | |
9 | 活用形 | conjugation form | |
10 | 原形 | base form | |
11 | 読み | reading | |
12 | 発音 | pronunciation | |
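The two layouts differ in column count (3 versus 13), so a quick check before building can catch malformed rows. The helper below is a hypothetical sketch, not part of lindera-ipadic-builder, and it assumes fields never contain commas. The context IDs and cost in the detailed sample row are placeholder zeros.

```shell
# classify_row is a hypothetical helper: it reports which layout a CSV row matches.
classify_row() {
  # Count the comma-separated fields in the row passed as $1.
  n=$(printf '%s\n' "$1" | awk -F, '{ print NF }')
  case "$n" in
    3)  echo simple ;;
    13) echo detailed ;;
    *)  echo "unknown ($n fields)" ;;
  esac
}

classify_row '東京スカイツリー,カスタム名詞,トウキョウスカイツリー'
# simple
classify_row '東京スカイツリー,0,0,0,カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*'
# detailed
```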
Tokenizing text using produced dictionary
You can tokenize text using the produced dictionary with the lindera command:
% echo "羽田空港限定トートバッグ" | lindera -d /tmp/lindera-ipadic-2.7.0-20070801
羽田空港 名詞,固有名詞,一般,*,*,*,羽田空港,ハネダクウコウ,ハネダクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ UNK,*,*,*,*,*,*,*,*
EOS
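Each output line holds the surface form followed by comma-separated features, and EOS marks the end of the input. As a rough sketch of post-processing, the snippet below reduces that output to surface and top-level part-of-speech pairs. It works on a copy of the sample output above rather than calling lindera, and it assumes the surface and the features are separated by whitespace.

```shell
# Sample output copied from the run above.
tokens=$(cat <<'EOF'
羽田空港 名詞,固有名詞,一般,*,*,*,羽田空港,ハネダクウコウ,ハネダクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ UNK,*,*,*,*,*,*,*,*
EOS
EOF
)

# Skip the EOS marker, split the feature string on commas,
# and print the surface with its first feature (the part of speech).
printf '%s\n' "$tokens" | awk '$1 != "EOS" { split($2, f, ","); print $1 "\t" f[1] }'
```

Words that are missing from the dictionary, such as トートバッグ here, appear with UNK as their part of speech; these are candidates for a user dictionary.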
Tokenizing text using default dictionary and produced binary user dictionary
You can tokenize text using the default dictionary and the produced binary user dictionary with the lindera command:
% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera -D ./resources/userdic.bin -t bin
東京スカイツリー カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅 名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅 カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
For more details about the lindera command, please refer to the following URL:
API reference
The API reference is available. Please see the following URL:
- lindera-ipadic-builder