# lindera-nodejs

Node.js binding for Lindera, a morphological analysis engine supporting Japanese, Korean, and Chinese.

## Overview
lindera-nodejs provides a comprehensive Node.js interface to the Lindera morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. This implementation includes all major features:
- Multi-language Support: Japanese (IPADIC, IPADIC-NEologd, UniDic), Korean (ko-dic), Chinese (CC-CEDICT, Jieba)
- Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
- Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
- Flexible Configuration: Configurable tokenization modes and penalty settings
- Metadata Support: Complete dictionary schema and metadata management
- TypeScript Support: Full type definitions included out of the box
## Features

### Core Components
- TokenizerBuilder: Fluent API for building customized tokenizers
- Tokenizer: High-performance text tokenization with integrated filtering
- CharacterFilter: Pre-processing filters for text normalization
- TokenFilter: Post-processing filters for token refinement
- Metadata & Schema: Dictionary structure and configuration management
- Training & Export (optional): Train custom morphological analysis models from corpus data
### Supported Dictionaries
- Japanese: IPADIC, IPADIC-NEologd, UniDic
- Korean: ko-dic
- Chinese: CC-CEDICT, Jieba
- Custom: User dictionary support
Pre-built dictionaries are available from GitHub Releases.
Download a dictionary archive (e.g. lindera-ipadic-*.zip) and specify the extracted path when loading.
### Filter Types
Character Filters:
- Mapping filter (character replacement)
- Regex filter (pattern-based replacement)
- Unicode normalization (NFKC, etc.)
- Japanese iteration mark normalization
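As a rough illustration of what iteration mark normalization means (a toy sketch, not Lindera's implementation), the kanji iteration mark 々 is replaced by the character it repeats:

```javascript
// Toy illustration only — Lindera's filter also handles kana iteration
// marks and voicing; this sketch covers just the basic kanji case.
function normalizeIterationMarks(text) {
  let out = "";
  for (const ch of text) {
    // Replace 々 with the character immediately before it, if any.
    out += ch === "々" && out.length > 0 ? [...out].pop() : ch;
  }
  return out;
}

console.log(normalizeIterationMarks("時々"));   // → "時時"
console.log(normalizeIterationMarks("人々の声")); // → "人人の声"
```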
Token Filters:
- Text case transformation (lowercase, uppercase)
- Length filtering (min/max character length)
- Stop words filtering
- Japanese-specific filters (base form, reading form, etc.)
- Korean-specific filters
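To illustrate the semantics of a token filter chain (a toy sketch, not Lindera's code), here are lowercase, length, and stop word filters applied in order to a list of surface forms:

```javascript
// Toy illustration of token filtering semantics:
// lowercase, then drop tokens shorter than 2 characters, then remove stop words.
const stopWords = new Set(["the", "of"]);

function filterTokens(tokens) {
  return tokens
    .map((t) => t.toLowerCase())        // case transformation
    .filter((t) => t.length >= 2)       // length filtering
    .filter((t) => !stopWords.has(t));  // stop words filtering
}

console.log(filterTokens(["The", "Art", "of", "War", "I"]));
// → [ 'art', 'war' ]
```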
## Install project dependencies

- Node.js 18+ : https://nodejs.org/
- Rust : https://www.rust-lang.org/tools/install
- @napi-rs/cli :

```shell
npm install -g @napi-rs/cli
```
## Setup repository

```shell
# Clone lindera project repository
git clone git@github.com:lindera/lindera.git
cd lindera
```
## Install lindera-nodejs

The following commands build the library with development settings (debug build).

```shell
cd lindera-nodejs
npm install
npm run build
```
## Quick Start

### Basic Tokenization

A minimal sketch, assuming the module is imported as `lindera`; the function and constructor names are illustrative, so check the bundled TypeScript definitions for the exact signatures.

```javascript
// NOTE: module and method names are illustrative.
const { loadDictionary, Tokenizer } = require("lindera");

// Load dictionary from a local path (download from GitHub Releases)
const dictionary = loadDictionary("/path/to/lindera-ipadic");

// Create a tokenizer
const tokenizer = new Tokenizer(dictionary);

// Tokenize Japanese text
const text = "すもももももももものうち";
const tokens = tokenizer.tokenize(text);
```
### Using Character Filters

A sketch using illustrative builder method names:

```javascript
// NOTE: method and filter names are illustrative.
const { TokenizerBuilder } = require("lindera");

// Create tokenizer builder
const builder = new TokenizerBuilder();
builder.setDictionary("/path/to/lindera-ipadic");
builder.setMode("normal");

// Add character filters
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });
builder.appendCharacterFilter("japanese_iteration_mark", {
  normalizeKanji: true,
  normalizeKana: true,
});

// Build tokenizer with filters
const tokenizer = builder.build();

const text = "テストー123";
const tokens = tokenizer.tokenize(text); // Will apply filters automatically
```
### Using Token Filters

Again with illustrative method and filter names:

```javascript
// NOTE: method and filter names are illustrative.
const { TokenizerBuilder } = require("lindera");

// Create tokenizer builder
const builder = new TokenizerBuilder();
builder.setDictionary("/path/to/lindera-ipadic");
builder.setMode("normal");

// Add token filters
builder.appendTokenFilter("lowercase");
builder.appendTokenFilter("length", { min: 2, max: 10 });
builder.appendTokenFilter("stop_words", { words: ["の", "に", "は"] });

// Build tokenizer with filters
const tokenizer = builder.build();
const tokens = tokenizer.tokenize("すもももももももものうち");
```
### Integrated Pipeline

The same illustrative builder API, combining both filter kinds in one pipeline:

```javascript
// NOTE: method and filter names are illustrative.
const { TokenizerBuilder } = require("lindera");

// Build tokenizer with integrated filters
const builder = new TokenizerBuilder();
builder.setDictionary("/path/to/lindera-ipadic");
builder.setMode("normal");

// Add character filters
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });
builder.appendCharacterFilter("japanese_iteration_mark", { normalizeKanji: true });

// Add token filters
builder.appendTokenFilter("lowercase");
builder.appendTokenFilter("japanese_base_form");

// Build and use
const tokenizer = builder.build();
const tokens = tokenizer.tokenize("関西国際空港限定トートバッグ");
```
### Working with Metadata

Constructor and property names in this sketch are illustrative:

```javascript
// NOTE: names are illustrative.
const { Metadata } = require("lindera");

// Create metadata with default values
const metadata = new Metadata();
console.log(metadata.name);
console.log(metadata.encoding);

// Create metadata from a JSON file
const loaded = Metadata.fromJsonFile("/path/to/metadata.json");
console.log(loaded.name);
```
## Advanced Usage

### Filter Configuration Examples

Character filters and token filters accept configuration as object arguments (filter and method names below are illustrative):

```javascript
// NOTE: filter and method names are illustrative.
const { TokenizerBuilder } = require("lindera");

const builder = new TokenizerBuilder();
builder.setDictionary("/path/to/lindera-ipadic");

// Character filters with object configuration
builder.appendCharacterFilter("mapping", { mapping: { "ｱ": "ア" } });
builder.appendCharacterFilter("regex", { pattern: "\\s+", replacement: " " });
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });

// Token filters with object configuration
builder.appendTokenFilter("length", { min: 2, max: 10 });
builder.appendTokenFilter("stop_words", { words: ["の", "に", "は"] });
builder.appendTokenFilter("japanese_katakana_stem", { min: 3 });

// Filters without configuration can omit the object
builder.appendTokenFilter("lowercase");
builder.appendTokenFilter("japanese_base_form");

const tokenizer = builder.build();
```
See the examples/ directory for comprehensive examples, including:

- tokenize.js: Basic tokenization
- tokenize_with_filters.js: Using character and token filters
- tokenize_with_userdict.js: Custom user dictionary
- train_and_export.js: Train and export custom dictionaries (requires the train feature)
- tokenize_with_decompose.js: Decompose mode tokenization
## Dictionary Support

### Japanese

- IPADIC: Default Japanese dictionary, good for general text
- IPADIC-NEologd: IPADIC extended with neologisms and named entities
- UniDic: Academic dictionary with detailed morphological information

### Korean

- ko-dic: Standard Korean dictionary for morphological analysis

### Chinese

- CC-CEDICT: Community-maintained Chinese-English dictionary
- Jieba: Dictionary derived from the Jieba segmentation project

### Custom Dictionaries

- User dictionary support for domain-specific terms
- CSV format for easy customization
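As a hypothetical illustration of the CSV format, a simple user-dictionary entry lists a surface form, a part-of-speech label, and a reading (the exact columns depend on the dictionary type — see examples/tokenize_with_userdict.js):

```csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
```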
## Dictionary Training (Experimental)

lindera-nodejs supports training custom morphological analysis models from annotated corpus data when built with the train feature.

### Building with Training Support

```shell
npm run build -- --features train
```
### Training a Model

`train()` is exposed when built with the train feature; the option names in this sketch are illustrative:

```javascript
// NOTE: option names are illustrative; see examples/train_and_export.js.
const { train } = require("lindera");

// Train a model from corpus
train({
  corpus: "/path/to/corpus.txt",
  seedLexicon: "/path/to/seed.csv",
  charDef: "/path/to/char.def",
  unkDef: "/path/to/unk.def",
  output: "/path/to/model.dat",
});
```
### Exporting Dictionary Files

`exportModel()` writes a trained model out as dictionary source files; option names here are illustrative:

```javascript
// NOTE: option names are illustrative; see examples/train_and_export.js.
const { exportModel } = require("lindera");

// Export trained model to dictionary files
exportModel({
  model: "/path/to/model.dat",
  outputDir: "/path/to/dictionary",
});
```
This will create:

- lex.csv: Lexicon file
- matrix.def: Connection cost matrix
- unk.def: Unknown word definitions
- char.def: Character definitions
- metadata.json: Dictionary metadata (if provided)
See examples/train_and_export.js for a complete example.
## API Reference

### Core Classes

- TokenizerBuilder: Fluent builder for tokenizer configuration
- Tokenizer: Main tokenization engine
- Token: Individual token with text, position, and linguistic features
- Metadata: Dictionary metadata and configuration
- Schema: Dictionary schema definition
### Training Functions (requires train feature)

- train(): Train a morphological analysis model from corpus
- exportModel(): Export trained model to dictionary files