lucivy
BM25 search engine with cross-token fuzzy matching — it finds substrings, handles typos, and matches across word boundaries. Built for code search, technical docs, and as a BM25 complement to vector databases.
Install
Everything is MIT-licensed.
| Language | Install |
|---|---|
| Python | pip install lucivy |
| Node.js | npm install lucivy |
| WASM (browser) | npm install lucivy-wasm |
| Rust | cargo add ld-lucivy |
| C++ | Static library via CXX bridge (build from source) |
Quick start
See the language-specific READMEs for full API docs.
Query types
lucivy queries operate on stored text (cross-token). They handle multi-word phrases, substrings, separators, and special characters naturally.
contains — the workhorse query
Fuzzy substring match with separator awareness.
# Exact substring
# Substring within a token: "program" matches "programming"
# Fuzzy tolerance (default distance=1, catches typos)
# Strict exact: distance=0 disables fuzzy
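The cases listed above can be approximated in plain Python. This is a minimal sketch, not lucivy's implementation or API: `fuzzy_contains` is a hypothetical helper that uses a standard approximate-substring dynamic program (Sellers' algorithm) to check whether a query occurs anywhere in a text within a given edit distance.

```python
def fuzzy_contains(text: str, query: str, distance: int = 1) -> bool:
    """Return True if `query` occurs somewhere in `text` within `distance` edits.

    Hypothetical helper for illustration only (not part of lucivy).
    Matching is case-insensitive, mirroring a lowercase tokenizer.
    """
    text, query = text.lower(), query.lower()
    if len(query) <= distance:
        # Deleting the whole query already fits in the edit budget.
        return True
    # prev[i] = fewest edits to align query[:i] with a substring of `text`
    # ending at the previous character. Index 0 is always 0 because a
    # match may start at any position in the text.
    prev = list(range(len(query) + 1))
    for ch in text:
        cur = [0]
        for i, qc in enumerate(query, start=1):
            cost = 0 if qc == ch else 1
            cur.append(min(prev[i - 1] + cost,  # match or substitute
                           prev[i] + 1,         # extra character in text
                           cur[i - 1] + 1))     # missing character in text
        if cur[-1] <= distance:
            return True
        prev = cur
    return False


doc = "Rust programming is fun"
print(fuzzy_contains(doc, "program", distance=0))     # exact substring
print(fuzzy_contains(doc, "programing"))              # typo, distance=1
print(fuzzy_contains(doc, "programing", distance=0))  # strict mode rejects the typo
```

With `distance=0` the check reduces to an exact, case-insensitive substring test, which is the behavior the "strict exact" case describes.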
contains + regex
Regex on stored text (cross-token).
# Matches "programming language" — the .* spans the space between tokens
# Alternation
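To illustrate why running the regex on stored text matters, the sketch below compares per-token matching with whole-text matching. It uses Python's `re` module as a stand-in (lucivy's regex engine and syntax may differ), and the sample sentence is made up for the example.

```python
import re

text = "Rust is a programming language used for systems work"
pattern = r"program.*language"

# Per-token matching: each token is tested alone, so the pattern never
# sees the space between "programming" and "language".
tokens = text.lower().split()
per_token_hit = any(re.search(pattern, tok) for tok in tokens)

# Stored-text matching: the pattern runs over the whole string,
# so `.*` can span the separator between tokens.
stored_text_hit = re.search(pattern, text.lower()) is not None

# Alternation works the same way on the lowercased stored text.
alternation_hit = re.search(r"python|rust", "Python is versatile".lower()) is not None

print(per_token_hit, stored_text_hit, alternation_hit)
```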
contains_split
Splits the query into words, runs each word as a contains query, and combines the results with OR.
# String query (auto contains_split across all text fields)
# Explicit dict query on a specific field
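The split-then-OR behavior can be sketched as follows. `contains_split` here is a hypothetical stand-alone function, not the library call, and it uses exact substring checks for brevity where the real query would apply fuzzy contains and BM25 scoring per word.

```python
def contains_split(text: str, query: str) -> bool:
    """OR of per-word substring checks (illustrative, exact matching only)."""
    haystack = text.lower()
    words = query.lower().split()
    # A document matches if at least one query word occurs in it.
    return any(word in haystack for word in words)


doc = "Rust programming is fun"
print(contains_split(doc, "python program"))  # "program" matches
print(contains_split(doc, "python java"))     # no word matches
```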
boolean
Combine sub-queries with must (AND), should (OR), must_not (NOT).
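As a rough model of how the three clause types combine, the sketch below works on sets of matching document ids. It assumes the usual Lucene-style semantics (should clauses are required only when there is no must clause); lucivy's exact behavior should be checked against its API docs. `evaluate_boolean` is a hypothetical helper.

```python
def evaluate_boolean(all_docs, must=(), should=(), must_not=()):
    """Combine sets of matching doc ids with must / should / must_not."""
    result = set(all_docs)
    for hits in must:
        result &= hits                      # AND: keep docs present in every must set
    if should and not must:
        result &= set().union(*should)      # OR: at least one should set must match
    for hits in must_not:
        result -= hits                      # NOT: drop excluded docs
    return result


all_docs = {1, 2, 3, 4}
a, b, c = {1, 2}, {2, 3}, {3}
print(evaluate_boolean(all_docs, must=[a, b]))
print(evaluate_boolean(all_docs, should=[a, b], must_not=[c]))
```

In a scoring engine, should clauses alongside must clauses typically still raise the score of documents they match; this set-based sketch ignores scoring.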
Filters on non-text fields
Non-text fields (i64, f64, u64, keyword) can be filtered via the filters key.
# Supported ops: eq, ne, lt, lte, gt, gte, in, not_in, between, starts_with, contains
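The semantics of the listed ops can be modeled with ordinary Python predicates. The filter shape used here (`field`, `op`, `value`), the inclusive bounds for `between`, and AND-ing multiple filters are assumptions made for illustration, not the documented lucivy format.

```python
import operator

# Hypothetical mapping from op name to a predicate over (field value, argument).
FILTER_OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    "in": lambda v, arg: v in arg,
    "not_in": lambda v, arg: v not in arg,
    "between": lambda v, arg: arg[0] <= v <= arg[1],  # assumed inclusive
    "starts_with": lambda v, arg: v.startswith(arg),
    "contains": lambda v, arg: arg in v,
}


def passes(doc: dict, filters: list[dict]) -> bool:
    """True if `doc` satisfies every filter (filters are AND-ed here)."""
    return all(FILTER_OPS[f["op"]](doc[f["field"]], f["value"]) for f in filters)


doc = {"year": 2024, "lang": "rust"}
print(passes(doc, [{"field": "year", "op": "gte", "value": 2020},
                   {"field": "lang", "op": "in", "value": ["rust", "go"]}]))
```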
Highlights
All query types support byte-offset highlights. Internal fields (._raw, ._ngram) are automatically filtered out.
# e.g. "body": [(5, 9), (20, 31)]
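Because the offsets are in bytes, they should be applied to the encoded text rather than to a Python `str`, or non-ASCII content will shift the result. The sketch below assumes half-open `(start, end)` ranges over the UTF-8 encoding of the field; `highlight_snippets` is a hypothetical helper and the sample offsets are computed by hand for the example.

```python
def highlight_snippets(text: str, offsets):
    """Return the substrings covered by (start, end) byte offsets.

    Assumes half-open ranges over the UTF-8 encoding of `text`.
    """
    data = text.encode("utf-8")
    return [data[start:end].decode("utf-8") for start, end in offsets]


text = "Café programming is fun"
# "é" takes two bytes in UTF-8, so "programming" starts at byte 6, not 5.
print(highlight_snippets(text, [(6, 17)]))
# Slicing the str with the same numbers is off by one character:
print(text[6:17])
```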
Fields (stored values)
Retrieve stored field values alongside search results — useful for displaying file names, titles, or content excerpts.
Snapshots (export / import)
Export an index to a portable .luce binary blob, import it elsewhere.
What contains matches
Fuzzy mode (default):
| Query | Document | Match? | Why |
|---|---|---|---|
| programming | "Rust programming is fun" | yes | exact token match |
| programing (typo) | "Rust programming is fun" | yes | fuzzy distance=1 |
| program | "Rust programming is fun" | yes | substring of token |
| programming language | "...programming language used..." | yes | cross-token with separator |
| c++ | "c++ and c# are popular" | yes | separator-aware |
| std::collections | "use std::collections::HashMap" | yes | multi-token + :: separator |
Regex mode (regex: true):
| Pattern | Document | Match? | Why |
|---|---|---|---|
| program.*language | "...programming language used..." | yes | cross-token regex on stored text |
| python\|rust | "Python is versatile" | yes | alternation |
| v[0-9]+ | "version v2.0 released" | yes | full-scan fallback (literal < 3 chars) |
Internals
Triple-field layout
Every text field automatically gets 3 sub-fields:
| Sub-field | Tokenizer | Used by |
|---|---|---|
| {name} | stemmed or lowercase | phrase, parse queries (recall) |
| {name}._raw | lowercase only | contains verification (precision) |
| {name}._ngram | character trigrams | contains candidate generation |
This is transparent to the user — you always reference the base field name.
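To show why a trigram sub-field and a raw sub-field work well together, here is a minimal in-memory sketch of trigram candidate generation followed by verification. It is not lucivy's code: the index is a plain dict, trigrams are taken over the whole lowercased string (a simplification), and verification is an exact substring check.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """All 3-character windows of the lowercased text (the `._ngram` idea)."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


docs = {
    1: "Rust programming is fun",
    2: "Python is versatile",
    3: "use std::collections::HashMap",
}

# Trigram inverted index: trigram -> ids of documents containing it.
ngram_index = defaultdict(set)
for doc_id, text in docs.items():
    for gram in trigrams(text):
        ngram_index[gram].add(doc_id)


def contains_exact(query: str) -> list[int]:
    """Trigram intersection for candidates, then verify on the lowercased text."""
    grams = trigrams(query)
    if not grams:
        # Query shorter than 3 chars yields no trigrams: scan every document.
        candidates = set(docs)
    else:
        candidates = set.intersection(*(ngram_index.get(g, set()) for g in grams))
    # Verification drops false positives (all trigrams present, but not contiguous).
    return sorted(d for d in candidates if query.lower() in docs[d].lower())


print(contains_exact("program"))
print(contains_exact("std::collections"))
```

The intersection keeps the candidate set small, so the comparatively expensive verification step only runs on documents that could plausibly match.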
NgramContainsQuery — how contains works
- Candidate collection (depends on mode):
  - Fuzzy: term dictionary lookup on ._raw (O(1) via FST), falling back to trigram intersection on ._ngram if the exact term isn't found
  - Regex: trigram union on ._ngram from extracted regex literals
  - Short literals: full segment scan when literals < 3 chars
- Verification: read stored text, dispatch to fuzzy or regex verifier
- BM25 scoring: standard idf * (1 + k1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))
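The scoring formula above translates directly into code. In this sketch the per-term score follows the formula as written; the idf variant and the defaults `k1=1.2`, `b=0.75` are common choices assumed for illustration, since the text only says "standard".

```python
import math


def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """A widely used BM25 idf variant (assumed, not confirmed for lucivy)."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf: float, dl: float, avgdl: float, idf: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """idf * (1 + k1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))."""
    return idf * (1 + k1) * tf / (tf + k1 * (1 - b + b * dl / avgdl))


idf = bm25_idf(num_docs=1000, doc_freq=10)
average_doc = bm25_term_score(tf=1, dl=100, avgdl=100, idf=idf)
long_doc = bm25_term_score(tf=1, dl=400, avgdl=100, idf=idf)
print(average_doc > long_doc)  # longer documents are penalized for the same tf
```

Two properties are easy to check from the formula: a single occurrence in an average-length document scores exactly `idf`, and the score saturates below `idf * (1 + k1)` as `tf` grows.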
Building from source
# Rust library tests
# Python bindings
# Node.js bindings
# C++ bindings
Lineage
Fork of tantivy v0.26.0 (via izihawa/tantivy).
quickwit-oss/tantivy v0.22
-> izihawa/tantivy v0.26.0 (regex phrase queries, FST improvements)
-> L-Defraiteur/lucivy (NgramContainsQuery, contains_split, fuzzy/regex/hybrid modes, HighlightSink, Python/Node.js/C++/WASM bindings)
License
MIT. See LICENSE.