Lindera CLI

A command-line interface for Lindera, a Japanese morphological analyzer written in Rust. This project is a fork of fulmicoton's kuromoji-rs.

Installing Lindera CLI

$ cargo install lindera-cli

Building Lindera CLI

Requirements

The following tools are required to build Lindera CLI:

Rust >= 1.39.0
make >= 3.81

Build

Build Lindera CLI with the following command:

$ make build

Usage

Switching dictionary

Use default dictionary:

$ echo "関西国際空港限定トートバッグ" | ./bin/lindera
関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK,*,*,*,*,*,*,*,*
EOS

Switch dictionary (UniDic):

$ echo "関西国際空港限定トートバッグ" | ./bin/lindera -d ../lindera-unidic-builder/lindera-unidic
関西    名詞,固有名詞,地名,一般,*,*,関西,カンサイ,カンサイ
国際    名詞,普通名詞,一般,*,*,*,国際,コクサイ,コクサイ
空港    名詞,普通名詞,一般,*,*,*,空港,クーコー,クーコー
限定    名詞,普通名詞,サ変可能,*,*,*,限定,ゲンテー,ゲンテー
トート  名詞,普通名詞,一般,*,*,*,トート,トート,トート
バッグ  名詞,普通名詞,一般,*,*,*,バッグ,バッグ,バッグ
EOS

Tokenize mode

Normal mode:

$ echo "関西国際空港限定トートバッグ" | ./bin/lindera --mode=normal
関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK,*,*,*,*,*,*,*,*
EOS

Decompose mode:

$ echo "関西国際空港限定トートバッグ" | ./bin/lindera --mode=decompose
関西    名詞,固有名詞,地域,一般,*,*,関西,カンサイ,カンサイ
国際    名詞,一般,*,*,*,*,国際,コクサイ,コクサイ
空港    名詞,一般,*,*,*,*,空港,クウコウ,クーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK,*,*,*,*,*,*,*,*
EOS

Output format

MeCab format:

$ echo "お待ちしております。" | ./bin/lindera --output=mecab
お待ち	名詞,サ変接続,*,*,*,*,お待ち,オマチ,オマチ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
て	助詞,接続助詞,*,*,*,*,て,テ,テ
おり	動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS
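
The MeCab-format output above is line-oriented: each token line holds the surface form, a tab, and a comma-separated feature list, and a line containing only EOS ends the sentence. The following is a minimal Python sketch of reading that output, assuming exactly this layout. The function name parse_mecab is hypothetical and not part of Lindera, and the sample is copied from the first two tokens above.

```python
def parse_mecab(output):
    """Parse MeCab-format text into (surface, features) pairs.

    Assumption: each token line is the surface form, one tab, then
    comma-separated features; a line that is just "EOS" ends the sentence.
    """
    tokens = []
    for line in output.splitlines():
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        tokens.append((surface, features.split(",")))
    return tokens


# Sample copied from the MeCab-format output shown above.
sample = (
    "お待ち\t名詞,サ変接続,*,*,*,*,お待ち,オマチ,オマチ\n"
    "し\t動詞,自立,*,*,サ変・スル,連用形,する,シ,シ\n"
    "EOS\n"
)

for surface, features in parse_mecab(sample):
    # With the default dictionary output shown above, index 0 is the
    # top-level part of speech and index 6 is the base form.
    print(surface, features[0], features[6])
```

The feature positions used here (0 and 6) match the default dictionary output shown above; other dictionaries such as UniDic may order their features differently.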

Wakati format:

$ echo "お待ちしております。" | ./bin/lindera --output=wakati
お待ち し て おり ます 。

JSON format:

$ echo "お待ちしております。" | ./bin/lindera --output=json
[
  {
    "text": "お待ち",
    "detail": {
      "left_id": 1283,
      "right_id": 1283,
      "word_cost": 6376,
      "pos_level1": "名詞",
      "pos_level2": "サ変接続",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "*",
      "conjugate_form": "*",
      "base_form": "お待ち",
      "reading": "オマチ",
      "pronunciation": "オマチ"
    }
  },
  {
    "text": "し",
    "detail": {
      "left_id": 610,
      "right_id": 610,
      "word_cost": 8718,
      "pos_level1": "動詞",
      "pos_level2": "自立",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "サ変・スル",
      "conjugate_form": "連用形",
      "base_form": "する",
      "reading": "シ",
      "pronunciation": "シ"
    }
  },
  {
    "text": "て",
    "detail": {
      "left_id": 307,
      "right_id": 307,
      "word_cost": 5170,
      "pos_level1": "助詞",
      "pos_level2": "接続助詞",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "*",
      "conjugate_form": "*",
      "base_form": "て",
      "reading": "テ",
      "pronunciation": "テ"
    }
  },
  {
    "text": "おり",
    "detail": {
      "left_id": 1197,
      "right_id": 1197,
      "word_cost": 8773,
      "pos_level1": "動詞",
      "pos_level2": "非自立",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "五段・ラ行",
      "conjugate_form": "連用形",
      "base_form": "おる",
      "reading": "オリ",
      "pronunciation": "オリ"
    }
  },
  {
    "text": "ます",
    "detail": {
      "left_id": 491,
      "right_id": 491,
      "word_cost": 5537,
      "pos_level1": "助動詞",
      "pos_level2": "*",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "特殊・マス",
      "conjugate_form": "基本形",
      "base_form": "ます",
      "reading": "マス",
      "pronunciation": "マス"
    }
  },
  {
    "text": "。",
    "detail": {
      "left_id": 8,
      "right_id": 8,
      "word_cost": 215,
      "pos_level1": "記号",
      "pos_level2": "句点",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "*",
      "conjugate_form": "*",
      "base_form": "。",
      "reading": "。",
      "pronunciation": "。"
    }
  }
]

When the result is output in JSON format, tokens can easily be filtered with the jq command.
For example, the following command does three things:

Tokenizes the text
Filters the tokens by part of speech (名詞, noun)
Joins the token texts with a single space

$ echo "すもももももももものうち" | ./bin/lindera --output=json |
    jq -r '.[] | select (.detail.pos_level1 =="名詞")' |
    jq -s -r '. | map(.text) | join(" ")'
すもも もも もも うち
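
The same kind of filtering can be done in any language with a JSON parser. The following is a minimal Python sketch, assuming the JSON layout shown earlier (a list of objects with a "text" field and a "detail" object containing "pos_level1"). The function name join_by_pos is hypothetical, and the sample data is trimmed from the JSON output above to the fields the filter needs.

```python
import json


def join_by_pos(json_text, pos="名詞"):
    """Return the texts of tokens whose top-level POS equals `pos`, joined by spaces.

    Assumption: json_text is a JSON array of tokens shaped like the
    --output=json example above.
    """
    tokens = json.loads(json_text)
    return " ".join(
        token["text"]
        for token in tokens
        if token["detail"]["pos_level1"] == pos
    )


# Trimmed copy of the JSON output for "お待ちしております。" shown above.
sample = json.dumps(
    [
        {"text": "お待ち", "detail": {"pos_level1": "名詞"}},
        {"text": "し", "detail": {"pos_level1": "動詞"}},
        {"text": "て", "detail": {"pos_level1": "助詞"}},
        {"text": "おり", "detail": {"pos_level1": "動詞"}},
        {"text": "ます", "detail": {"pos_level1": "助動詞"}},
        {"text": "。", "detail": {"pos_level1": "記号"}},
    ],
    ensure_ascii=False,
)

print(join_by_pos(sample))          # nouns only
print(join_by_pos(sample, "動詞"))  # verbs only
```

In practice the JSON text would come from the CLI rather than a literal, for example by piping the --output=json result into a script that reads standard input.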

lindera-cli 0.2.0