lindera-ruby
Ruby binding for Lindera, a morphological analysis engine for CJK text.
Overview
lindera-ruby provides a Ruby interface to the Lindera morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis.
- Multi-language Support: Japanese (IPADIC, IPADIC-NEologd, UniDic), Korean (ko-dic), Chinese (CC-CEDICT, Jieba)
- Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
- Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
- Flexible Configuration: Configurable tokenization modes and penalty settings
- Metadata Support: Complete dictionary schema and metadata management
- Training & Export (optional): Train custom morphological analysis models from corpus data
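To make the difference between character filters and token filters concrete, here is a minimal plain-Ruby sketch. It does not call Lindera at all: `normalize_text`, `naive_tokenize`, `filter_tokens`, `analyze`, and the stop-word list are hypothetical names made up for illustration, and whitespace splitting merely stands in for real morphological analysis.

```ruby
# Conceptual illustration only: plain Ruby, no Lindera calls.
# Character filters rewrite the raw text before tokenization;
# token filters rewrite or drop tokens afterwards.

STOP_WORDS = %w[the a of].freeze # hypothetical stop-word list

# Character filter: Unicode NFKC normalization (e.g. full-width to half-width).
def normalize_text(text)
  text.unicode_normalize(:nfkc)
end

# Stand-in for the tokenizer: whitespace splitting, NOT morphological analysis.
def naive_tokenize(text)
  text.split
end

# Token filters: lowercase, minimum length, stop-word removal.
def filter_tokens(tokens, min_length: 2)
  tokens.map(&:downcase)
        .select { |t| t.length >= min_length }
        .reject { |t| STOP_WORDS.include?(t) }
end

# Full pipeline: character filter -> tokenizer -> token filters.
def analyze(text)
  filter_tokens(naive_tokenize(normalize_text(text)))
end

p analyze("Ｔｈｅ Ｌｉｎｄｅｒａ Ｔｏｋｅｎｉｚｅｒ") # => ["lindera", "tokenizer"]
```

The ordering is the point of the sketch: normalization has to happen before tokenization so that the tokenizer sees consistent characters, while the token-level filters can only run once tokens exist.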
Requirements
- Ruby >= 3.1
- Rust >= 1.85
Dictionary
Pre-built dictionaries are available from GitHub Releases.
Download a dictionary archive (e.g. lindera-ipadic-*.zip) and extract it to a local path.
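Before handing the extracted path to the tokenizer, it can help to fail early with a readable message if the directory is missing or empty. The helper below is a hypothetical sketch, not part of lindera-ruby; it uses only the Ruby standard library and makes no assumption about which files a dictionary contains.

```ruby
# Hypothetical helper (not part of lindera-ruby): resolve and sanity-check
# the directory a dictionary archive was extracted to.
def dictionary_dir(path)
  dir = File.expand_path(path)
  raise ArgumentError, "dictionary directory not found: #{dir}" unless Dir.exist?(dir)
  raise ArgumentError, "dictionary directory is empty: #{dir}" if Dir.empty?(dir)

  dir
end
```

The returned absolute path can then be passed wherever a dictionary path is expected.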
Install
Usage
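require "lindera"

# Load dictionary from a local path (download from GitHub Releases).
# "/path/to/dictionary" is a placeholder for the extracted archive.
dictionary = Lindera.load_dictionary("/path/to/dictionary")
# Create a tokenizer ("normal" is assumed here as the mode name)
tokenizer = Lindera::Tokenizer.new(dictionary, "normal", nil)
# Tokenize text (sample input)
tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")
tokens.each do |token|
  puts token.surface # assumes each token exposes a `surface` reader
end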
Using TokenizerBuilder
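builder = Lindera::TokenizerBuilder.new
builder.set_mode("normal") # assumed mode name
builder.set_dictionary("/path/to/dictionary") # placeholder path
# Add filters (the filter names and arguments below are illustrative assumptions)
builder.append_character_filter("unicode_normalize", { "kind" => "nfkc" })
builder.append_token_filter("lowercase", nil)
tokenizer = builder.build
tokens = tokenizer.tokenize("関西国際空港限定トートバッグ") # sample input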
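When the same builder setup is needed in several places, one option is to keep the settings in a plain hash and apply them in a single helper. The sketch below relies only on the four builder methods shown above (`set_mode`, `set_dictionary`, `append_character_filter`, `append_token_filter`). The `configure_builder` helper, the `SETTINGS` hash and its values, and the `RecordingBuilder` stand-in are all hypothetical; the stand-in exists only so the example runs without the native extension.

```ruby
# Hypothetical settings; the mode, path, and filter names are placeholders.
SETTINGS = {
  mode: "normal",
  dictionary: "/path/to/dictionary",
  character_filters: [["unicode_normalize", { "kind" => "nfkc" }]],
  token_filters: [["lowercase", nil]]
}.freeze

# Applies a settings hash to any object that responds to the four
# builder methods used in this README, then returns that object.
def configure_builder(builder, settings)
  builder.set_mode(settings.fetch(:mode))
  builder.set_dictionary(settings.fetch(:dictionary))
  settings.fetch(:character_filters, []).each do |name, args|
    builder.append_character_filter(name, args)
  end
  settings.fetch(:token_filters, []).each do |name, args|
    builder.append_token_filter(name, args)
  end
  builder
end

# Stand-in that records calls instead of building a tokenizer.
class RecordingBuilder
  attr_reader :calls

  def initialize
    @calls = []
  end

  def set_mode(mode)
    @calls << [:set_mode, mode]
  end

  def set_dictionary(path)
    @calls << [:set_dictionary, path]
  end

  def append_character_filter(name, args)
    @calls << [:append_character_filter, name, args]
  end

  def append_token_filter(name, args)
    @calls << [:append_token_filter, name, args]
  end
end

recorded = configure_builder(RecordingBuilder.new, SETTINGS)
recorded.calls.each { |call| p call }
```

With the real gem installed, passing `Lindera::TokenizerBuilder.new` in place of `RecordingBuilder.new` and calling `build` on the result should follow the same flow, assuming the builder methods behave as in the usage shown above.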
Test
License
MIT