§Lindera Python Bindings
Python bindings for Lindera, a morphological analysis library for CJK text.
Lindera provides high-performance tokenization and morphological analysis for:
- Japanese (IPADIC, IPADIC NEologd, UniDic)
- Korean (ko-dic)
- Chinese (CC-CEDICT)
§Features
- Dictionary management: Build, load, and use custom dictionaries
- Tokenization: Multiple tokenization modes (normal, decompose)
- Filters: Character and token filtering pipeline
- Training: Train custom morphological models (with the train feature)
- User dictionaries: Support for custom user dictionaries
§Examples
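Downstream code often wants tokenizer output as plain tuples. The sketch below assumes each token is a mapping with "text" and "detail" keys, as in the basic usage example in this section. The helper `summarize_tokens` and the stub tokens are hypothetical: they are not part of Lindera, and the stub values are hand-written placeholders rather than real tokenizer output. Whether "detail" is a string or a sequence is also an assumption, so the helper accepts both.

```python
from typing import Any, Iterable, List, Mapping, Tuple


def summarize_tokens(tokens: Iterable[Mapping[str, Any]]) -> List[Tuple[str, str]]:
    """Flatten tokens into (text, detail) pairs.

    Assumes each token has "text" and "detail" keys (hypothetical shape,
    taken from the basic example in this section).
    """
    rows = []
    for token in tokens:
        detail = token["detail"]
        # Join sequence-valued details into one comma-separated string.
        if isinstance(detail, (list, tuple)):
            detail = ",".join(str(part) for part in detail)
        rows.append((token["text"], str(detail)))
    return rows


# Hand-written stub tokens, NOT real Lindera output.
stub_tokens = [
    {"text": "関西国際空港", "detail": ["名詞", "固有名詞"]},
    {"text": "駅", "detail": "名詞"},
]

for text, detail in summarize_tokens(stub_tokens):
    print(f"{text}\t{detail}")
```

With a real tokenizer, the result of `tokenizer.tokenize(...)` would be passed in place of `stub_tokens`, provided the tokens have the shape assumed here.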
import lindera
# Create a tokenizer
tokenizer = lindera.TokenizerBuilder().build()
# Tokenize text
tokens = tokenizer.tokenize("関西国際空港")
for token in tokens:
    print(token["text"], token["detail"])

Modules§
- dictionary
- Dictionary management for morphological analysis.
- error
- Error types for Lindera operations.
- metadata
- Dictionary metadata configuration.
- mode
- Tokenization modes and penalty configurations.
- schema
- Dictionary schema definitions.
- tokenizer
- Tokenizer implementation for morphological analysis.
- trainer
- Training functionality for custom morphological models.
- util
- Utility functions for Python-Rust data conversion.
Functions§
- version
- Returns the version of the lindera-python package.
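As a minimal sketch of checking the installed version defensively, the helper below assumes the function is exposed to Python as `lindera.version()` and returns something convertible to a string. Neither detail is confirmed by this page. The wrapper `lindera_version` is a hypothetical name, and it returns `None` when the package is missing or the attribute is absent.

```python
import importlib
from typing import Optional


def lindera_version() -> Optional[str]:
    """Return the Lindera binding version, or None if it cannot be determined."""
    try:
        module = importlib.import_module("lindera")
    except ImportError:
        # The package is not installed in this environment.
        return None
    # Assumption: the module exposes a callable named `version`.
    version_fn = getattr(module, "version", None)
    if not callable(version_fn):
        return None
    return str(version_fn())


print(lindera_version())
```

Returning `None` instead of raising keeps the check usable in startup diagnostics where the bindings are optional.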