lindera-python
Python binding for Lindera, a morphological analysis engine for Japanese, Korean, and Chinese text.
Overview
lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. The binding exposes all major features of the engine:
- Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
- Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
- Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
- Flexible Configuration: Configurable tokenization modes and penalty settings
- Metadata Support: Complete dictionary schema and metadata management
Features
Core Components
- TokenizerBuilder: Fluent API for building customized tokenizers
- Tokenizer: High-performance text tokenization with integrated filtering
- CharacterFilter: Pre-processing filters for text normalization
- TokenFilter: Post-processing filters for token refinement
- Metadata & Schema: Dictionary structure and configuration management
- Training & Export (optional): Train custom morphological analysis models from corpus data
Supported Dictionaries
- Japanese: IPADIC (embedded), UniDic (embedded)
- Korean: ko-dic (embedded)
- Chinese: CC-CEDICT (embedded)
- Custom: User dictionary support
Filter Types
Character Filters:
- Mapping filter (character replacement)
- Regex filter (pattern-based replacement)
- Unicode normalization (NFKC, etc.)
- Japanese iteration mark normalization
Token Filters:
- Text case transformation (lowercase, uppercase)
- Length filtering (min/max character length)
- Stop words filtering
- Japanese-specific filters (base form, reading form, etc.)
- Korean-specific filters
Install project dependencies
- pyenv : https://github.com/pyenv/pyenv?tab=readme-ov-file#installation
- Poetry : https://python-poetry.org/docs/#installation
- Rust : https://www.rust-lang.org/tools/install
Install Python
# Install Python
% pyenv install 3.13.5
Set up the repository and activate a virtual environment
# Clone lindera-python project repository
% git clone git@github.com:lindera/lindera-python.git
% cd lindera-python
# Set Python version for this project
% pyenv local 3.13.5
# Make Python virtual environment
% python -m venv .venv
# Activate Python virtual environment
% source .venv/bin/activate
# Initialize lindera-python project
(.venv) % make init
Install lindera-python as a library in the virtual environment
This command takes a long time because it builds a library that includes all the dictionaries.
(.venv) % make develop
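Once the build finishes, a quick sanity check confirms the module imports. The module name lindera is an assumption used throughout the examples below; adjust it if the built package exposes a different name.
# Verify that the extension module imports (module name assumed)
(.venv) % python -c "import lindera; print(lindera)"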
Quick Start
Basic Tokenization
from lindera import TokenizerBuilder  # module name as assumed above; see examples/tokenize.py

# Create a tokenizer with default settings
builder = TokenizerBuilder()
tokenizer = builder.build()

# Tokenize Japanese text
text = "関西国際空港限定トートバッグ"
tokens = tokenizer.tokenize(text)
for token in tokens:
    print(token.text)  # e.g. 関西国際空港 / 限定 / トートバッグ with IPADIC
Using Character Filters
from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()

# Add character filters (filter names and arguments are illustrative; see examples/tokenize_with_filters.py)
builder.append_character_filter("unicode_normalize", {"kind": "NFKC"})
builder.append_character_filter("japanese_iteration_mark", {"normalize_kanji": True, "normalize_kana": True})

# Build tokenizer with filters
tokenizer = builder.build()
text = "ﾘﾝﾃﾞﾗは形態素解析ｴﾝｼﾞﾝです。"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically
Using Token Filters
from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()

# Add token filters (filter names and arguments are illustrative; see examples/tokenize_with_filters.py)
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("Lindera is a morphological analysis engine.")
Integrated Pipeline
from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()

# Add character filters
builder.append_character_filter("unicode_normalize", {"kind": "NFKC"})

# Add token filters
builder.append_token_filter("japanese_base_form", {"kind": "ipadic"})

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("すもももももももものうち")
Working with Metadata
from lindera import Metadata

# Get metadata for a specific dictionary
# (accessor names here are illustrative; see test_basic.py for working calls)
metadata = Metadata.load("embedded://ipadic")

# Access schema information
schema = metadata.schema
print(schema.field_names[:5])  # First 5 fields
Advanced Usage
Filter Configuration Examples
Character filters and token filters accept configuration as dictionary arguments:
builder = TokenizerBuilder()

# Character filters with dict configuration
builder.append_character_filter("mapping", {"mapping": {"ﾘﾝﾃﾞﾗ": "リンデラ"}})
# Token filters with dict configuration
builder.append_token_filter("length", {"min": 2, "max": 10})
# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")

tokenizer = builder.build()
See the examples/ directory for comprehensive examples, including:
- tokenize.py: Basic tokenization
- tokenize_with_filters.py: Using character and token filters
- tokenize_with_userdict.py: Custom user dictionary
- train_and_export.py: Train and export custom dictionaries (requires the train feature)
- Multi-language tokenization
- Advanced configuration options
Dictionary Support
Japanese
- IPADIC: Default Japanese dictionary, good for general text
- UniDic: Academic dictionary with detailed morphological information
Korean
- ko-dic: Standard Korean dictionary for morphological analysis
Chinese
- CC-CEDICT: Community-maintained Chinese-English dictionary
Custom Dictionaries
- User dictionary support for domain-specific terms
- CSV format for easy customization (a minimal sketch follows this list)
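As a minimal sketch, a simple user dictionary is a CSV file with one term per line; the three-column layout below (surface form, part-of-speech, reading) follows Lindera's simple user-dictionary format, and examples/tokenize_with_userdict.py shows how to attach the file to a tokenizer:
# Write a small user dictionary in Lindera's simple CSV format:
# surface form, part-of-speech, reading
entries = [
    "東京スカイツリー,カスタム名詞,トウキョウスカイツリー",
    "とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ",
]
with open("userdict.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(entries) + "\n")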
Dictionary Training (Experimental)
lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.
Building with Training Support
# Install with training support
(.venv) % maturin develop --features train
Training a Model
# Train a model from corpus
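# (a minimal sketch: train() is exposed when built with the train feature,
#  but the argument names below are assumptions, not the confirmed signature;
#  see examples/train_and_export.py for the actual call)
from lindera import train

train(
    seed="seed.csv",      # assumed: seed lexicon
    corpus="corpus.txt",  # assumed: annotated training corpus
    output="model.dat",   # assumed: trained model output path
)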
Exporting Dictionary Files
# Export trained model to dictionary files
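# (a minimal sketch with assumed argument names; see examples/train_and_export.py
#  for the confirmed export() call)
from lindera import export

export(
    model="model.dat",  # assumed: model produced by train()
    output="outdir",    # assumed: directory that receives the files listed below
)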
This will create:
- lex.csv: Lexicon file
- matrix.def: Connection cost matrix
- unk.def: Unknown word definitions
- char.def: Character definitions
- metadata.json: Dictionary metadata (if provided)
See examples/train_and_export.py for a complete example.
API Reference
Core Classes
- TokenizerBuilder: Fluent builder for tokenizer configuration
- Tokenizer: Main tokenization engine
- Token: Individual token with text, position, and linguistic features
- CharacterFilter: Text preprocessing filters
- TokenFilter: Token post-processing filters
- Metadata: Dictionary metadata and configuration
- Schema: Dictionary schema definition
Training Functions (requires the train feature)
- train(): Train a morphological analysis model from a corpus
- export(): Export a trained model to dictionary files
See the test_basic.py file for comprehensive API usage examples.