Crate lindera_sqlite

§lindera-sqlite

A SQLite FTS5 (Full-Text Search 5) tokenizer extension that provides support for Chinese, Japanese, and Korean (CJK) text analysis using the Lindera morphological analyzer.

§Features

  • CJK Language Support: Tokenizes Chinese, Japanese, and Korean text using Lindera
  • Multiple Dictionaries: Supports various embedded dictionaries (IPADIC, UniDic, ko-dic, CC-CEDICT)
  • Configurable: Uses YAML configuration for character filters and token filters
  • SQLite Integration: Registers as a custom FTS5 tokenizer, so standard FTS5 tables and MATCH queries work unchanged

§Usage

§Building the Extension

cargo build --release --features=embedded-cjk

§Setting Up Configuration

Set the LINDERA_CONFIG_PATH environment variable to point to your Lindera configuration file:

export LINDERA_CONFIG_PATH=./resources/lindera.yml

§Loading in SQLite

.load ./target/release/liblindera_sqlite lindera_fts5_tokenizer_init

§Creating an FTS5 Table

CREATE VIRTUAL TABLE example USING fts5(content, tokenize='lindera_tokenizer');

§Searching

INSERT INTO example(content) VALUES ('日本語の全文検索');
SELECT * FROM example WHERE content MATCH '検索';

§Architecture

This library exposes a C ABI so that SQLite can use Lindera as a custom FTS5 tokenizer. The main components are:

  • load_tokenizer: Initializes a Lindera tokenizer with configuration
  • lindera_fts5_tokenize: C-compatible entry point for tokenization (called by SQLite)
  • Internal tokenization logic that converts text to tokens and calls back to SQLite
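
The callback flow described above can be illustrated with a minimal, self-contained sketch. It is not this crate's implementation: a whitespace splitter stands in for Lindera, and the names XTokenFn, emit_tokens, collect, and tokenize_to_vec are hypothetical. The callback shape follows the xToken signature from the SQLite FTS5 documentation, and offsets are byte offsets into the UTF-8 input, as FTS5 expects.

```rust
use std::os::raw::{c_char, c_int, c_void};

// Same value SQLite uses for success.
const SQLITE_OK: c_int = 0;

// Hypothetical alias mirroring FTS5's xToken callback:
// (context, flags, token bytes, token length, start offset, end offset) -> status.
type XTokenFn = extern "C" fn(
    ctx: *mut c_void,
    flags: c_int,
    token: *const c_char,
    token_len: c_int,
    start: c_int,
    end: c_int,
) -> c_int;

// Stand-in for the internal tokenization logic: splits on whitespace
// (an assumption; the real crate uses Lindera) and reports each token
// to the callback with its byte offsets.
fn emit_tokens(text: &str, ctx: *mut c_void, callback: XTokenFn) -> c_int {
    let mut start: Option<usize> = None;
    // The trailing sentinel space flushes the final token.
    let chars = text.char_indices().chain(std::iter::once((text.len(), ' ')));
    for (i, ch) in chars {
        if ch.is_whitespace() {
            if let Some(s) = start.take() {
                let token = &text[s..i];
                let rc = callback(
                    ctx,
                    0,
                    token.as_ptr() as *const c_char,
                    token.len() as c_int,
                    s as c_int,
                    i as c_int,
                );
                // Stop as soon as the callback reports an error.
                if rc != SQLITE_OK {
                    return rc;
                }
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    SQLITE_OK
}

// Test callback that plays the role of SQLite: it stores every token
// in the Vec that `ctx` points to.
extern "C" fn collect(
    ctx: *mut c_void,
    _flags: c_int,
    token: *const c_char,
    token_len: c_int,
    start: c_int,
    end: c_int,
) -> c_int {
    // SAFETY: `ctx` points to a live Vec owned by `tokenize_to_vec`, and
    // `token` points to `token_len` valid bytes of the input string.
    let out = unsafe { &mut *(ctx as *mut Vec<(String, c_int, c_int)>) };
    let bytes = unsafe { std::slice::from_raw_parts(token as *const u8, token_len as usize) };
    out.push((String::from_utf8_lossy(bytes).into_owned(), start, end));
    SQLITE_OK
}

fn tokenize_to_vec(text: &str) -> Vec<(String, c_int, c_int)> {
    let mut out: Vec<(String, c_int, c_int)> = Vec::new();
    let rc = emit_tokens(text, &mut out as *mut _ as *mut c_void, collect);
    assert_eq!(rc, SQLITE_OK);
    out
}

fn main() {
    for (token, start, end) in tokenize_to_vec("日本語 全文 検索") {
        println!("{token} [{start}, {end})");
    }
}
```

Passing the collector through an opaque context pointer mirrors how SQLite hands its own state to the tokenizer; in the real extension that pointer belongs to SQLite and should be treated as opaque.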

Structs§

Fts5Tokenizer
Wrapper for Lindera tokenizer used in FTS5.
TokenCallback
Convenience wrapper around SQLite’s token callback.

Constants§

SQLITE_INTERNAL
SQLite internal error status code.
SQLITE_MISUSE
SQLite misuse error status code.
SQLITE_OK
SQLite success status code.

Functions§

ffi_panic_boundary
Runs an operation behind a panic boundary suitable for the SQLite FFI.
lindera_fts5_tokenize
C-compatible FTS5 tokenization function.
load_tokenizer
Loads and initializes a Lindera tokenizer.
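
As a rough illustration of what a panic boundary for FFI involves, the sketch below catches a Rust panic before it can unwind into C code and maps it to a status code. The signature of ffi_panic_boundary is not shown on this page, so the function here is a hypothetical panic_boundary with an assumed closure type; the constant values match SQLite's SQLITE_OK (0) and SQLITE_INTERNAL (2).

```rust
use std::os::raw::c_int;
use std::panic::{catch_unwind, AssertUnwindSafe};

// Same values SQLite uses for these status codes.
const SQLITE_OK: c_int = 0;
const SQLITE_INTERNAL: c_int = 2;

// Hypothetical helper: runs `op` and converts its outcome into a status
// code. A panic is caught here so it never crosses the C boundary.
fn panic_boundary<F>(op: F) -> c_int
where
    F: FnOnce() -> Result<(), c_int>,
{
    match catch_unwind(AssertUnwindSafe(op)) {
        Ok(Ok(())) => SQLITE_OK,      // operation succeeded
        Ok(Err(code)) => code,        // operation reported an error code
        Err(_) => SQLITE_INTERNAL,    // operation panicked
    }
}

fn main() {
    let ok = panic_boundary(|| Ok(()));
    let failed = panic_boundary(|| -> Result<(), c_int> { panic!("tokenizer bug") });
    println!("ok = {ok}, failed = {failed}");
}
```

AssertUnwindSafe is used here for brevity; code that shares state across the boundary should make sure that state stays consistent if a panic interrupts the operation.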

Type Aliases§

TokenFunction
Token callback function type.