lindera-unidic-builder 0.17.0

A Japanese morphological dictionary builder for UniDic.

Lindera UniDic Builder

License: MIT
Join the chat at https://gitter.im/lindera-morphology/lindera

UniDic builder for Lindera.

Install

% cargo install lindera-unidic-builder

Build

The following products are required to build:

  • Rust >= 1.46.0

% cargo build --release

Build small binary

You can reduce the size of the dictionary by enabling the "compress" feature flag.
In exchange, the resulting dictionary can only be used with a Lindera build that supports compression.

For example, in this repository:

% cargo build --release --features compress

The "compress" feature also depends on liblzma to compress the dictionary. On Debian or Ubuntu, install the required package as follows:

% sudo apt install liblzma-dev

Dictionary version

This project supports UniDic 2.1.2. See the UniDic project for details.

Building a dictionary

Build a dictionary with the lindera-unidic-builder command:

% curl -L -o /tmp/unidic-mecab-2.1.2_src.zip "https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/unidic-mecab-2.1.2_src.zip"
% unzip /tmp/unidic-mecab-2.1.2_src.zip -d /tmp
% lindera-unidic-builder -s /tmp/unidic-mecab-2.1.2_src -d /tmp/lindera-unidic-2.1.2

Building a user dictionary

Build a user dictionary with the lindera-unidic-builder command:

% lindera-unidic-builder -S ./resources/simple_userdic.csv -D ./resources/unidic_userdic.bin

Dictionary format

Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.

Index | Name (Japanese) | Name (English) | Notes
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞大分類 | Major POS classification |
5 | 品詞中分類 | Middle POS classification |
6 | 品詞小分類 | Small POS classification |
7 | 品詞細分類 | Fine POS classification |
8 | 活用型 | Conjugation type |
9 | 活用形 | Conjugation form |
10 | 語彙素読み | Lexeme reading |
11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme |
12 | 書字形出現形 | Orthography appearance form |
13 | 発音形出現形 | Pronunciation appearance form |
14 | 書字形基本形 | Orthography basic form |
15 | 発音形基本形 | Pronunciation basic form |
16 | 語種 | Word type |
17 | 語頭変化型 | Word-initial change type |
18 | 語頭変化形 | Word-initial change form |
19 | 語末変化型 | Word-final change type |
20 | 語末変化形 | Word-final change form |

User dictionary format (CSV)

Simple version

Index | Name (Japanese) | Name (English) | Notes
0 | 表層形 | Surface |
1 | 品詞大分類 | Major POS classification |
2 | 語彙素読み | Lexeme reading |
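
As a minimal sketch of the simple format, the following snippet writes a two-row CSV. The two rows are inferred from the tokenization output shown later in this document, not copied from the project's resources; the function name `write_simple_userdic` and the output path are hypothetical.

```shell
# Write a simple-format user dictionary: surface,major POS,lexeme reading.
# The output path is a hypothetical example, not a file shipped with the project.
write_simple_userdic() {
  out="$1"
  {
    printf '%s\n' '東京スカイツリー,カスタム名詞,トウキョウスカイツリー'
    printf '%s\n' 'とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ'
  } > "$out"
}

write_simple_userdic "${TMPDIR:-/tmp}/my_userdic.csv"
```

A file written this way could then be passed to lindera-unidic-builder with the -S option, as in the user dictionary build command above.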

Detailed version

Index | Name (Japanese) | Name (English) | Notes
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞大分類 | Major POS classification |
5 | 品詞中分類 | Middle POS classification |
6 | 品詞小分類 | Small POS classification |
7 | 品詞細分類 | Fine POS classification |
8 | 活用型 | Conjugation type |
9 | 活用形 | Conjugation form |
10 | 語彙素読み | Lexeme reading |
11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme |
12 | 書字形出現形 | Orthography appearance form |
13 | 発音形出現形 | Pronunciation appearance form |
14 | 書字形基本形 | Orthography basic form |
15 | 発音形基本形 | Pronunciation basic form |
16 | 語種 | Word type |
17 | 語頭変化型 | Word-initial change type |
18 | 語頭変化形 | Word-initial change form |
19 | 語末変化型 | Word-final change type |
20 | 語末変化形 | Word-final change form |
21+ | - | - | Fields from index 21 onward can be added freely.
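
The two formats can be told apart by field count alone. The sketch below classifies each CSV row that way; it is based only on the tables above, not on the builder's actual validation logic, and the function name `classify_userdic_rows` is hypothetical. It also assumes that no field contains a quoted or embedded comma.

```shell
# Classify each CSV row read from standard input by its field count:
#   3 fields          -> simple format
#   21 or more fields -> detailed format (indices 0-20 plus optional extras)
#   anything else     -> invalid
# Assumption: fields contain no quoted or embedded commas.
classify_userdic_rows() {
  awk -F',' '
    NF == 3  { print NR ": simple";   next }
    NF >= 21 { print NR ": detailed"; next }
             { print NR ": invalid (" NF " fields)" }
  '
}

# Usage: prints "1: simple"
printf '%s\n' '東京スカイツリー,カスタム名詞,トウキョウスカイツリー' | classify_userdic_rows
```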

Tokenizing text using produced dictionary

You can tokenize text using the produced dictionary with the lindera command:

% echo "羽田空港限定トートバッグ" | lindera -k unidic -d /tmp/lindera-unidic-2.1.2
羽田    名詞,固有名詞,人名,姓,*,*,ハタ,ハタ,羽田,ハタ,羽田,ハタ,固,*,*,*,*
空港    名詞,普通名詞,一般,*,*,*,クウコウ,空港,空港,クーコー,空港,クーコー,漢,*,*,*,*
限定    名詞,普通名詞,サ変可能,*,*,*,ゲンテイ,限定,限定,ゲンテー,限定,ゲンテー,漢,*,*,*,*
トート  名詞,普通名詞,一般,*,*,*,トート,トート,トート,トート,トート,トート,外,*,*,*,*
バッグ  名詞,普通名詞,一般,*,*,*,バッグ,バッグ-bag,バッグ,バッグ,バッグ,バッグ,外,*,*,*,*
EOS
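
Each output line carries the surface followed by the 17 detail fields that correspond to indices 4 to 20 of the dictionary format table. As a rough sketch, the following filter pulls out the surface, the major POS, and the lexeme reading. It assumes the surface and the detail list are separated by whitespace and that the surface itself contains no whitespace; the function name `surface_pos_reading` is hypothetical.

```shell
# Print "surface<TAB>major POS<TAB>lexeme reading" for each token line.
# Assumptions: the surface and the detail list are separated by whitespace,
# the surface contains no whitespace, and the details are comma-separated
# in the order of indices 4-20 of the dictionary format table.
surface_pos_reading() {
  awk '
    $1 == "EOS" && NF == 1 { next }   # skip the end-of-sentence marker
    {
      split($2, detail, ",")
      # detail[1] is index 4 (major POS); detail[7] is index 10 (lexeme reading)
      printf "%s\t%s\t%s\n", $1, detail[1], detail[7]
    }
  '
}

# Usage with one line taken from the output above.
printf '%s\t%s\n' '空港' '名詞,普通名詞,一般,*,*,*,クウコウ,空港,空港,クーコー,空港,クーコー,漢,*,*,*,*' | surface_pos_reading
```

For the sample line, this prints the surface 空港, the POS 名詞, and the reading クウコウ, separated by tabs.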

Tokenizing text using UniDic dictionary and produced binary user dictionary

You can tokenize text using the UniDic dictionary together with the produced binary user dictionary with the lindera command:

% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera -k unidic -u ./resources/unidic_userdic.bin -t binary
東京スカイツリー        カスタム名詞,*,*,*,*,*,トウキョウスカイツリー,*,*,*,*,*,*,*,*,*,*
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
最寄り  名詞,普通名詞,一般,*,*,*,モヨリ,最寄り,最寄り,モヨリ,最寄り,モヨリ,和,*,*,*,*
駅      名詞,普通名詞,一般,*,*,*,エキ,駅,駅,エキ,駅,エキ,漢,*,*,*,*
は      助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,トウキョウスカイツリーエキ,*,*,*,*,*,*,*,*,*,*
です    助動詞,*,*,*,助動詞-デス,終止形-一般,デス,です,です,デス,です,デス,和,*,*,*,*
EOS
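
To check which tokens came from the user dictionary, one option is to filter the output by major POS. The sketch below assumes the same whitespace-separated layout as the output above and that user dictionary entries carry a distinctive POS label such as カスタム名詞, as they do in this example; the function name `surfaces_with_pos` is hypothetical.

```shell
# Print the surface of every token whose major POS equals the first argument.
# Assumptions: the surface and the detail list are separated by whitespace,
# and the POS label (for example カスタム名詞) is whatever the user
# dictionary CSV assigned to its entries.
surfaces_with_pos() {
  awk -v pos="$1" '
    NF >= 2 {
      split($2, detail, ",")
      if (detail[1] == pos) print $1
    }
  '
}

# Usage with two lines taken from the output above; prints only 東京スカイツリー.
{
  printf '%s\t%s\n' '東京スカイツリー' 'カスタム名詞,*,*,*,*,*,トウキョウスカイツリー,*,*,*,*,*,*,*,*,*,*'
  printf '%s\t%s\n' 'の' '助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*'
} | surfaces_with_pos 'カスタム名詞'
```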

You can also use a user dictionary built for another dictionary (e.g. IPADIC) together with UniDic. Note, however, that the detail fields of words from that user dictionary follow the other dictionary's format.

For more details about the lindera command, please refer to the following URL:

API reference

The API reference is available. Please see the following URL:

  • Lindera UniDic Builder