§Lindera Ruby Bindings
Ruby bindings for Lindera, a morphological analysis library for CJK text.
Lindera provides high-performance tokenization and morphological analysis for:
- Japanese (IPADIC, IPADIC NEologd, UniDic)
- Korean (ko-dic)
- Chinese (CC-CEDICT, Jieba)
§Features
- Dictionary management: Build, load, and use custom dictionaries
- Tokenization: Multiple tokenization modes (normal, decompose)
- Filters: Character and token filtering pipeline
- Training: Train custom morphological models (with the train feature)
- User dictionaries: Support for custom user dictionaries
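To make the filtering pipeline concrete, here is a minimal plain-Ruby sketch of the general idea: character filters rewrite the input text, a segmenter splits it into tokens, and token filters post-process the token list. Every name in it (`normalize`, `segment`, `drop_stop_words`, `run_pipeline`) is hypothetical and chosen for illustration only. It does not use or mirror the Lindera API, and the whitespace splitter merely stands in for a dictionary-based segmenter.

```ruby
# Illustration only: none of these names are part of Lindera.

# Character filter: Unicode NFKC normalization (e.g. full-width Latin
# letters become their ASCII equivalents).
normalize = ->(text) { text.unicode_normalize(:nfkc) }

# Stand-in segmenter: splits on whitespace. A real morphological
# analyzer would consult a dictionary instead.
segment = ->(text) { text.split }

# Token filter: drop tokens that appear in a small stop list.
stop_words = %w[the a of]
drop_stop_words = ->(tokens) { tokens.reject { |t| stop_words.include?(t.downcase) } }

# Apply the character filters in order, segment the filtered text,
# then apply the token filters in order.
def run_pipeline(text, char_filters:, segmenter:, token_filters:)
  filtered_text = char_filters.reduce(text) { |acc, f| f.call(acc) }
  tokens = segmenter.call(filtered_text)
  token_filters.reduce(tokens) { |acc, f| f.call(acc) }
end

result = run_pipeline(
  "Ｔｈｅ ａｉｒｐｏｒｔ of Ｏｓａｋａ",
  char_filters: [normalize],
  segmenter: segment,
  token_filters: [drop_stop_words]
)
```

The ordering matters in this sketch: normalization runs before segmentation so the stop list only has to contain the normalized form of each word.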
§Examples
require "lindera"
# Create a tokenizer
builder = Lindera::TokenizerBuilder.new
tokenizer = builder.build
# Tokenize text
tokens = tokenizer.tokenize("関西国際空港")
tokens.each { |token| puts "#{token.surface}: #{token.details}" }
§Modules
- character_filter
- Character filters for preprocessing text.
- dictionary
- Dictionary management for morphological analysis.
- error
- Error types for Lindera operations.
- metadata
- Dictionary metadata configuration.
- mode
- Tokenization modes and penalty configurations.
- schema
- Dictionary schema definitions.
- segmenter
- Segmenter implementation for morphological analysis.
- token
- Token representation for morphological analysis results.
- token_filter
- Token filters for post-processing tokens.
- tokenizer
- Tokenizer implementation for morphological analysis.
- trainer
- Training functionality for custom morphological models.
- util
- Utility functions for Ruby-Rust data conversion.