§lindera-sqlite
A SQLite FTS5 (Full-Text Search 5) tokenizer extension that provides support for Chinese, Japanese, and Korean (CJK) text analysis using the Lindera morphological analyzer.
§Features
- CJK Language Support: Tokenizes Chinese, Japanese, and Korean text using Lindera
- Multiple Dictionaries: Supports various embedded dictionaries (IPADIC, UniDic, ko-dic, CC-CEDICT)
- Configurable: Uses YAML configuration for character filters and token filters
- SQLite Integration: Seamlessly integrates with SQLite’s FTS5 full-text search
§Usage
§Building the Extension
cargo build --release --features=embedded-cjk
§Setting Up Configuration
Set the LINDERA_CONFIG_PATH environment variable to point to your Lindera configuration file:
export LINDERA_CONFIG_PATH=./resources/lindera.yml
§Loading in SQLite
.load ./target/release/liblindera_sqlite lindera_fts5_tokenizer_init
§Creating an FTS5 Table
CREATE VIRTUAL TABLE example USING fts5(content, tokenize='lindera_tokenizer');
§Searching
INSERT INTO example(content) VALUES ('日本語の全文検索');
SELECT * FROM example WHERE content MATCH '検索';
§Architecture
This library provides a C ABI interface for SQLite to use Lindera as a custom FTS5 tokenizer. The main components are:
- load_tokenizer: Initializes a Lindera tokenizer with configuration
- lindera_fts5_tokenize: C-compatible entry point for tokenization (called by SQLite)
- Internal tokenization logic that converts text to tokens and calls back to SQLite
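The callback flow described above can be sketched in plain Rust. This is a minimal illustration, not the crate's implementation: the `TokenFunction` alias is assumed to follow the shape of FTS5's `xToken` callback, `segment` is a hypothetical whitespace splitter standing in for Lindera's morphological analysis, and `tokenize_with_callback`, `collect`, and `collect_tokens` are illustrative names that do not come from the crate.

```rust
use std::os::raw::{c_char, c_int, c_void};

// Same value as SQLite's success code.
const SQLITE_OK: c_int = 0;

/// Callback shape of FTS5's `xToken`. Assumed here to be what the crate's
/// `TokenFunction` alias describes.
type TokenFunction = extern "C" fn(
    p_ctx: *mut c_void,
    t_flags: c_int,
    p_token: *const c_char,
    n_token: c_int,
    i_start: c_int,
    i_end: c_int,
) -> c_int;

/// Hypothetical stand-in for the analysis step: splits on whitespace and
/// returns each piece with its byte offsets into `text`.
fn segment(text: &str) -> Vec<(&str, usize, usize)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in text.char_indices() {
        match (ch.is_whitespace(), start) {
            (false, None) => start = Some(i),
            (true, Some(s)) => {
                out.push((&text[s..i], s, i));
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        out.push((&text[s..], s, text.len()));
    }
    out
}

/// Reports every segment through `x_token` and stops at the first
/// non-OK status returned by the callback.
fn tokenize_with_callback(text: &str, ctx: *mut c_void, x_token: TokenFunction) -> c_int {
    for (token, start, end) in segment(text) {
        let rc = x_token(
            ctx,
            0, // no token flags
            token.as_ptr() as *const c_char,
            token.len() as c_int,
            start as c_int,
            end as c_int,
        );
        if rc != SQLITE_OK {
            return rc;
        }
    }
    SQLITE_OK
}

/// Example callback that plays the role of SQLite: it copies each token
/// into a `Vec` reached through the context pointer.
extern "C" fn collect(
    p_ctx: *mut c_void,
    _t_flags: c_int,
    p_token: *const c_char,
    n_token: c_int,
    i_start: c_int,
    i_end: c_int,
) -> c_int {
    // SAFETY: `p_ctx` is the `Vec` created in `collect_tokens`, and
    // `p_token`/`n_token` describe a live slice of the input string.
    let sink = unsafe { &mut *(p_ctx as *mut Vec<(String, usize, usize)>) };
    let bytes = unsafe { std::slice::from_raw_parts(p_token as *const u8, n_token as usize) };
    sink.push((
        String::from_utf8_lossy(bytes).into_owned(),
        i_start as usize,
        i_end as usize,
    ));
    SQLITE_OK
}

/// Drives the whole flow and returns the collected tokens.
fn collect_tokens(text: &str) -> Vec<(String, usize, usize)> {
    let mut sink: Vec<(String, usize, usize)> = Vec::new();
    let rc = tokenize_with_callback(text, &mut sink as *mut _ as *mut c_void, collect);
    assert_eq!(rc, SQLITE_OK);
    sink
}
```

The offsets passed to the callback are byte offsets into the input, so a three-character Japanese word encoded in UTF-8 spans nine bytes rather than three.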
Structs§
- Fts5Tokenizer - Wrapper for Lindera tokenizer used in FTS5.
Constants§
- SQLITE_INTERNAL - SQLite internal error status code.
- SQLITE_MISUSE - SQLite misuse error status code.
- SQLITE_OK - SQLite success status code.
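In SQLite itself these codes have the values 0 (`SQLITE_OK`), 2 (`SQLITE_INTERNAL`), and 21 (`SQLITE_MISUSE`). How the crate chooses between them is not stated here, so the sketch below shows only one common FFI pattern: invalid arguments map to the misuse code, and a caught panic maps to the internal error code so that it never unwinds into C. The helpers `status_of` and `check_input` are hypothetical names, not part of the crate.

```rust
use std::os::raw::c_int;
use std::panic::{catch_unwind, UnwindSafe};

// Values as defined by SQLite's result code list.
const SQLITE_OK: c_int = 0;
const SQLITE_INTERNAL: c_int = 2;
const SQLITE_MISUSE: c_int = 21;

/// Hypothetical helper: runs `body`, passes through any status code it
/// reports as an error, and converts a panic into `SQLITE_INTERNAL`.
fn status_of<F>(body: F) -> c_int
where
    F: FnOnce() -> Result<(), c_int> + UnwindSafe,
{
    match catch_unwind(body) {
        Ok(Ok(())) => SQLITE_OK,
        Ok(Err(code)) => code,
        Err(_) => SQLITE_INTERNAL,
    }
}

/// Hypothetical argument check: a null pointer or a negative length is
/// treated as caller misuse.
fn check_input(ptr: *const u8, len: c_int) -> Result<(), c_int> {
    if ptr.is_null() || len < 0 {
        Err(SQLITE_MISUSE)
    } else {
        Ok(())
    }
}
```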
Functions§
- lindera_fts5_tokenize - C-compatible FTS5 tokenization function.
- load_tokenizer - Loads and initializes a Lindera tokenizer.
Type Aliases§
- TokenFunction - Token callback function type.