Lindera UniDic Builder

UniDic builder for Lindera.

Install

% cargo install lindera-unidic-builder

Build

Building requires the following:

Rust >= 1.46.0

% cargo build --release

Build small binary

You can reduce the size of the dictionary by enabling the "compress" feature flag.
In exchange, the compressed dictionary can only be used with a Lindera build that supports compression.

For example, in this repository:

% cargo build --release --features compress

The "compress" feature depends on liblzma to compress the dictionary. On Debian or Ubuntu, install the required package as follows:

% sudo apt install liblzma-dev

Dictionary version

This project supports UniDic 2.1.2. See the UniDic website for details.

Building a dictionary

Build a dictionary with the lindera-unidic-builder command:

% curl -L -o /tmp/unidic-mecab-2.1.2_src.zip "https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/unidic-mecab-2.1.2_src.zip"
% unzip /tmp/unidic-mecab-2.1.2_src.zip -d /tmp
% lindera-unidic-builder -s /tmp/unidic-mecab-2.1.2_src -d /tmp/lindera-unidic-2.1.2

Building a user dictionary

Build a user dictionary with the lindera-unidic-builder command:

% lindera-unidic-builder -S ./resources/simple_userdic.csv -D ./resources/unidic_userdic.bin

Dictionary format

Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.

Index	Name (Japanese)	Name (English)
0	表層形	Surface
1	左文脈ID	Left context ID
2	右文脈ID	Right context ID
3	コスト	Cost
4	品詞大分類	Major POS classification
5	品詞中分類	Middle POS classification
6	品詞小分類	Small POS classification
7	品詞細分類	Fine POS classification
8	活用型	Conjugation type
9	活用形	Conjugation form
10	語彙素読み	Lexeme reading
11	語彙素（語彙素表記 + 語彙素細分類）	Lexeme
12	書字形出現形	Orthography appearance type
13	発音形出現形	Pronunciation appearance type
14	書字形基本形	Orthography basic type
15	発音形基本形	Pronunciation basic type
16	語種	Word type
17	語頭変化型	Prefix of a word type
18	語頭変化形	Prefix of a word form
19	語末変化型	Suffix of a word type
20	語末変化形	Suffix of a word form
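
As a rough illustration of the column layout above, the following shell sketch pulls single columns out of one CSV row by index. The helper name unidic_field is hypothetical. The sample row reuses the 空港 features from the tokenizer output shown in this README, and its context IDs and cost are placeholder zeros, not real values. The split on commas is naive, so it assumes no field contains a quoted comma.

```shell
# Print one column of a UniDic CSV row read from stdin.
# $1 is the 0-based index used in the table above (hypothetical helper).
unidic_field() {
  awk -F',' -v idx="$1" '{ print $(idx + 1) }'
}

# Sample row: columns 1-3 (context IDs and cost) are placeholder zeros.
row='空港,0,0,0,名詞,普通名詞,一般,*,*,*,クウコウ,空港,空港,クーコー,空港,クーコー,漢,*,*,*,*'

printf '%s\n' "$row" | unidic_field 0    # surface: 空港
printf '%s\n' "$row" | unidic_field 4    # major POS classification: 名詞
printf '%s\n' "$row" | unidic_field 10   # lexeme reading: クウコウ
```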

User dictionary format (CSV)

Simple version

Index	Name (Japanese)	Name (English)
0	表層形	Surface
1	品詞大分類	Major POS classification
2	語彙素読み	Lexeme reading
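
To make the three-column layout concrete, here is a minimal sketch that writes one entry and checks that every row has exactly three columns before it is passed to the builder. The entry itself (東京スカイツリー with the POS label カスタム名詞) is only an illustrative value, and check_simple_userdic is a hypothetical helper, not part of lindera-unidic-builder.

```shell
# Write one illustrative entry: surface, major POS classification, lexeme reading.
userdic=$(mktemp)
printf '%s\n' '東京スカイツリー,カスタム名詞,トウキョウスカイツリー' > "$userdic"

# Return non-zero if any non-empty line does not have exactly three columns.
# Naive comma split: assumes no field contains a quoted comma.
check_simple_userdic() {
  awk -F',' '
    NF > 0 && NF != 3 {
      printf "line %d: expected 3 columns, got %d\n", NR, NF
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

check_simple_userdic "$userdic" && echo "ok: $userdic"
```

A file that passes this check can then be given to the -S option of lindera-unidic-builder as shown in "Building a user dictionary".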

Detailed version

Index	Name (Japanese)	Name (English)
0	表層形	Surface
1	左文脈ID	Left context ID
2	右文脈ID	Right context ID
3	コスト	Cost
4	品詞大分類	Major POS classification
5	品詞中分類	Middle POS classification
6	品詞小分類	Small POS classification
7	品詞細分類	Fine POS classification
8	活用型	Conjugation type
9	活用形	Conjugation form
10	語彙素読み	Lexeme reading
11	語彙素（語彙素表記 + 語彙素細分類）	Lexeme
12	書字形出現形	Orthography appearance type
13	発音形出現形	Pronunciation appearance type
14	書字形基本形	Orthography basic type
15	発音形基本形	Pronunciation basic type
16	語種	Word type
17	語頭変化型	Prefix of a word type
18	語頭変化形	Prefix of a word form
19	語末変化型	Suffix of a word type
20	語末変化形	Suffix of a word form

Tokenizing text using produced dictionary

You can tokenize text using the produced dictionary with the lindera command:

% echo "羽田空港限定トートバッグ" | lindera -k unidic -d /tmp/lindera-unidic-2.1.2

羽田    名詞,固有名詞,人名,姓,*,*,ハタ,ハタ,羽田,ハタ,羽田,ハタ,固,*,*,*,*
空港    名詞,普通名詞,一般,*,*,*,クウコウ,空港,空港,クーコー,空港,クーコー,漢,*,*,*,*
限定    名詞,普通名詞,サ変可能,*,*,*,ゲンテイ,限定,限定,ゲンテー,限定,ゲンテー,漢,*,*,*,*
トート  名詞,普通名詞,一般,*,*,*,トート,トート,トート,トート,トート,トート,外,*,*,*,*
バッグ  名詞,普通名詞,一般,*,*,*,バッグ,バッグ-bag,バッグ,バッグ,バッグ,バッグ,外,*,*,*,*
EOS
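
Judging by the sample output above, each token line holds the surface followed by 17 comma-separated features, which appear to correspond to indexes 4 to 20 of the dictionary format. Feature n (counting from 1) would then map to index n + 3. Under that assumption, the following sketch reduces the output to surface, major POS classification, and lexeme reading. The helper name surface_pos_reading is hypothetical, and the sketch assumes surfaces contain no whitespace.

```shell
# Reduce lindera output read from stdin to: surface, major POS, lexeme reading.
# Skips the EOS marker; splits the surface from the features on whitespace.
surface_pos_reading() {
  awk '
    $1 != "EOS" && NF >= 2 {
      split($2, detail, ",")
      # detail[1] is index 4 (major POS), detail[7] is index 10 (lexeme reading)
      print $1 "\t" detail[1] "\t" detail[7]
    }
  '
}

# Demonstration on one line copied from the output above.
printf '%s\n' \
  '空港    名詞,普通名詞,一般,*,*,*,クウコウ,空港,空港,クーコー,空港,クーコー,漢,*,*,*,*' \
  'EOS' | surface_pos_reading
```

In practice the lindera command shown above can be piped straight into the helper in place of the printf call.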

Tokenizing text using UniDic dictionary and produced binary user dictionary