Crate lindera_sqlite


§lindera-sqlite

A SQLite FTS5 (Full-Text Search 5) tokenizer extension that analyzes Chinese, Japanese, and Korean (CJK) text with the Lindera morphological analyzer.

§Features

  • CJK Language Support: Tokenizes Chinese, Japanese, and Korean text using Lindera
  • Multiple Dictionaries: Supports various embedded dictionaries (IPADIC, UniDic, ko-dic, CC-CEDICT)
  • Configurable: Uses YAML configuration for character filters and token filters
  • SQLite Integration: Registers as a custom tokenizer for SQLite’s FTS5 full-text search

§Usage

§Building the Extension

cargo build --release --features=embedded-cjk

§Setting Up Configuration

Set the LINDERA_CONFIG_PATH environment variable to point to your Lindera configuration file:

export LINDERA_CONFIG_PATH=./resources/lindera.yml
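
How the extension itself consumes this variable is not shown on this page. As a minimal sketch, a Rust program could resolve the path as follows; the helper name `resolve_config_path` is hypothetical and not part of this crate, and treating an empty value as "unset" is an assumption.

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical helper (not part of lindera_sqlite): turns the raw value of
/// LINDERA_CONFIG_PATH into a path, treating an unset or empty value as absent.
fn resolve_config_path(raw: Option<OsString>) -> Option<PathBuf> {
    match raw {
        Some(value) if !value.is_empty() => Some(PathBuf::from(value)),
        _ => None,
    }
}

fn main() {
    // Read the variable once and report what would be used.
    match resolve_config_path(env::var_os("LINDERA_CONFIG_PATH")) {
        Some(path) => println!("Lindera config: {}", path.display()),
        None => println!("LINDERA_CONFIG_PATH is not set"),
    }
}
```

Taking the raw value as a parameter keeps the helper testable without mutating the process environment.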

§Loading in SQLite

.load ./target/release/liblindera_sqlite lindera_fts5_tokenizer_init

§Creating an FTS5 Table

CREATE VIRTUAL TABLE example USING fts5(content, tokenize='lindera_tokenizer');

§Searching

INSERT INTO example(content) VALUES ('日本語の全文検索');
SELECT * FROM example WHERE content MATCH '検索';

§Architecture

This library exposes a C ABI so that SQLite can use Lindera as a custom FTS5 tokenizer. The main components are:

  • load_tokenizer: Initializes a Lindera tokenizer with configuration
  • lindera_fts5_tokenize: C-compatible entry point for tokenization (called by SQLite)
  • Internal tokenization logic that converts text to tokens and calls back to SQLite
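
The callback-driven flow described above can be illustrated with a small, self-contained sketch. It is not the crate's implementation: whitespace splitting stands in for Lindera's morphological analysis, and the name `tokenize_with_callback` is hypothetical. What it does mirror is the FTS5 convention that each token is reported together with its start and end byte offsets, and that a non-zero status from the callback stops tokenization.

```rust
// Local stand-in for SQLite's success code (0 in sqlite3.h).
const SQLITE_OK: i32 = 0;

/// Hypothetical sketch: splits `text` on whitespace (a stand-in for Lindera)
/// and reports each token with its byte offsets through `on_token`.
/// Stops early and returns the callback's status if it is not SQLITE_OK.
fn tokenize_with_callback<F>(text: &str, mut on_token: F) -> i32
where
    F: FnMut(&str, usize, usize) -> i32,
{
    let mut start: Option<usize> = None;
    for (i, ch) in text.char_indices() {
        if ch.is_whitespace() {
            // A token ends at the first whitespace character after it.
            if let Some(s) = start.take() {
                let rc = on_token(&text[s..i], s, i);
                if rc != SQLITE_OK {
                    return rc;
                }
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Flush a trailing token that runs to the end of the input.
    if let Some(s) = start {
        return on_token(&text[s..], s, text.len());
    }
    SQLITE_OK
}

fn main() {
    let rc = tokenize_with_callback("全文 検索", |token, start, end| {
        println!("{token} [{start}, {end})");
        SQLITE_OK
    });
    println!("status: {rc}");
}
```

Offsets are in bytes rather than characters, which matters for CJK text where a single character typically occupies three bytes in UTF-8.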

Structs§

Fts5Tokenizer
Wrapper for Lindera tokenizer used in FTS5.

Constants§

SQLITE_INTERNAL
SQLite internal error status code.
SQLITE_MISUSE
SQLite misuse error status code.
SQLITE_OK
SQLite success status code.
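
For orientation, SQLite's `sqlite3.h` defines these status codes as 0, 2, and 21 respectively. The sketch below redeclares them locally and maps them to a `Result`; the `c_int` type and the helper `check` are assumptions made for illustration, not this crate's API.

```rust
use std::os::raw::c_int;

// Numeric values as defined in SQLite's sqlite3.h, redeclared for illustration.
const SQLITE_OK: c_int = 0;
const SQLITE_INTERNAL: c_int = 2;
const SQLITE_MISUSE: c_int = 21;

/// Hypothetical helper: converts a status code into a Result.
fn check(rc: c_int) -> Result<(), String> {
    match rc {
        SQLITE_OK => Ok(()),
        SQLITE_INTERNAL => Err("SQLITE_INTERNAL: internal logic error".to_string()),
        SQLITE_MISUSE => Err("SQLITE_MISUSE: library used incorrectly".to_string()),
        other => Err(format!("unexpected status code {other}")),
    }
}

fn main() {
    for rc in [SQLITE_OK, SQLITE_INTERNAL, SQLITE_MISUSE] {
        println!("{rc}: {:?}", check(rc));
    }
}
```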

Functions§

lindera_fts5_tokenize
C-compatible FTS5 tokenization function.
load_tokenizer
Loads and initializes a Lindera tokenizer.

Type Aliases§

TokenFunction
Token callback function type.
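
FTS5's C API declares the token callback as `int (*xToken)(void *pCtx, int tflags, const char *pToken, int nToken, int iStart, int iEnd)`. A Rust alias with that shape might look like the sketch below. The names `TokenCallback`, `collect_token`, and `emit` are hypothetical, and the crate's actual `TokenFunction` definition may differ.

```rust
use std::os::raw::{c_char, c_int, c_void};

/// Hypothetical alias mirroring FTS5's xToken callback signature.
type TokenCallback =
    extern "C" fn(*mut c_void, c_int, *const c_char, c_int, c_int, c_int) -> c_int;

/// Example callback: appends each token and its byte offsets to a Vec.
extern "C" fn collect_token(
    ctx: *mut c_void,
    _tflags: c_int,
    token: *const c_char,
    n_token: c_int,
    start: c_int,
    end: c_int,
) -> c_int {
    // SAFETY: the caller must pass `ctx` as a valid *mut Vec<(String, c_int, c_int)>
    // and `token` must point to `n_token` readable bytes.
    let out = unsafe { &mut *(ctx as *mut Vec<(String, c_int, c_int)>) };
    let bytes = unsafe { std::slice::from_raw_parts(token as *const u8, n_token as usize) };
    out.push((String::from_utf8_lossy(bytes).into_owned(), start, end));
    0 // SQLITE_OK
}

/// Reports the byte range `start..end` of `text` through the callback.
/// Panics if the range does not fall on UTF-8 character boundaries.
fn emit(text: &str, start: usize, end: usize, ctx: *mut c_void, cb: TokenCallback) -> c_int {
    let slice = &text[start..end];
    cb(
        ctx,
        0, // no token flags
        slice.as_ptr() as *const c_char,
        slice.len() as c_int,
        start as c_int,
        end as c_int,
    )
}

fn main() {
    let text = "日本語の全文検索";
    let mut out: Vec<(String, c_int, c_int)> = Vec::new();
    let ctx = &mut out as *mut Vec<(String, c_int, c_int)> as *mut c_void;
    let rc = emit(text, 18, 24, ctx, collect_token);
    println!("rc = {rc}, tokens = {out:?}");
}
```

The token pointer is paired with an explicit length because FTS5 tokens are not required to be NUL-terminated.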