Tokenization modes and penalty configurations.
This module defines the different tokenization modes available and their penalty configurations for controlling segmentation behavior.
§Modes
- Normal: Standard tokenization based on dictionary cost
- Decompose: Decomposes compound words with penalty-based control (see the sketch below)
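A minimal sketch of the difference between the two modes, assuming a default dictionary is available and that the tokenizer exposes a `tokenize` method returning tokens with a `text` attribute (not shown in this page's examples):

```python
import lindera

# A compound noun that normal mode keeps whole but decompose mode may split.
text = "関西国際空港"  # "Kansai International Airport"

for mode in ("normal", "decompose"):
    tokenizer = lindera.TokenizerBuilder().set_mode(mode).build()
    tokens = tokenizer.tokenize(text)
    print(mode, [token.text for token in tokens])

# Plausible output (the exact splits depend on the dictionary in use):
#   normal    ['関西国際空港']
#   decompose ['関西', '国際', '空港']
```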
§Examples
```python
import lindera

# Normal mode
tokenizer = lindera.TokenizerBuilder().set_mode("normal").build()

# Decompose mode
tokenizer = lindera.TokenizerBuilder().set_mode("decompose").build()

# Custom penalty configuration (these values match lindera's defaults):
# runs of kanji longer than the threshold receive the extra cost,
# which encourages decompose mode to split long compounds.
penalty = lindera.Penalty(
    kanji_penalty_length_threshold=2,
    kanji_penalty_length_penalty=3000,
)
```

Structs§
- PyPenalty
- Penalty configuration for decompose mode.
Enums§
- PyMode
- Tokenization mode.