Crate lindera

§Lindera Python Bindings

Python bindings for Lindera, a morphological analysis library for CJK text.

Lindera provides high-performance tokenization and morphological analysis for:

  • Japanese (IPADIC, IPADIC NEologd, UniDic)
  • Korean (ko-dic)
  • Chinese (CC-CEDICT)

§Features

  • Dictionary management: Build, load, and use custom dictionaries
  • Tokenization: Multiple tokenization modes (normal, decompose)
  • Filters: Character and token filtering pipeline
  • Training: Train custom morphological models (requires the train feature)
  • User dictionaries: Support for custom user dictionaries

§Examples

import lindera

# Create a tokenizer
tokenizer = lindera.TokenizerBuilder().build()

# Tokenize text
tokens = tokenizer.tokenize("関西国際空港")
for token in tokens:
    print(token["text"], token["detail"])
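
The loop above can be wrapped in a small helper when only the text and detail of each token are needed. The following is a minimal sketch, assuming tokens support token["text"] and token["detail"] exactly as in the example above. `token_pairs` and `StubTokenizer` are hypothetical names that are not part of lindera; the stub stands in for a real tokenizer so the snippet runs without a dictionary installed.

```python
def token_pairs(tokenizer, text):
    """Return (text, detail) pairs for every token in `text`.

    Assumes each token supports token["text"] and token["detail"],
    as in the example above. Hypothetical helper, not part of lindera.
    """
    return [(token["text"], token["detail"]) for token in tokenizer.tokenize(text)]


class StubTokenizer:
    """Stand-in for a real tokenizer: splits on whitespace only."""

    def tokenize(self, text):
        # A real tokenizer would fill "detail" with dictionary features.
        return [{"text": word, "detail": []} for word in text.split()]


pairs = token_pairs(StubTokenizer(), "関西 国際 空港")
print(pairs)
```

In real use, a tokenizer built with lindera.TokenizerBuilder().build() would presumably be passed in place of the stub.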

§Modules

character_filter
Character filters for preprocessing text.
dictionary
Dictionary management for morphological analysis.
error
Error types for Lindera operations.
metadata
Dictionary metadata configuration.
mode
Tokenization modes and penalty configurations.
schema
Dictionary schema definitions.
segmenter
Segmenter implementation for morphological analysis.
token
token_filter
Token filters for post-processing tokens.
tokenizer
Tokenizer implementation for morphological analysis.
trainer
Training functionality for custom morphological models.
util
Utility functions for Python-Rust data conversion.

§Functions

version
Returns the version of the lindera-python package.
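
A short sketch of how version might be called from Python, for example to log the binding version at startup. It assumes the function is exposed as lindera.version() and returns a string. `lindera_version` is a hypothetical wrapper that returns None when the package is not installed.

```python
import importlib


def lindera_version():
    """Return the installed lindera-python version, or None if unavailable.

    Assumes the package exposes version() at module level, as listed above.
    """
    try:
        lindera = importlib.import_module("lindera")
    except ImportError:
        return None
    return lindera.version()


print(lindera_version())
```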