A lazily-maintained full-text index over OKF concept files, built on
BurntSushi’s fst crate — the same immutable finite-state-transducer
machinery search engines like tantivy use for their term dictionaries.
§Why this shape
The index is a set of immutable segments. Each incremental update writes
one new segment for the batch of changed files and never rewrites the
existing ones — so “layering in new content” costs only the new segment,
exactly the property the suite wanted. Superseded or removed documents are
recorded as tombstones in the manifest and filtered at query time;
Index::condense later merges every segment into one and drops the
tombstones, reclaiming space without re-reading the source files.
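The segment and tombstone lifecycle described above can be modelled in a few lines. This is only an in-memory sketch, not the crate's implementation: `LayeredIndex`, `Segment`, and the method bodies are hypothetical stand-ins, and a `BTreeMap` takes the place of the on-disk fst and postings files.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for one immutable segment: term -> document ids.
type Segment = BTreeMap<String, Vec<u32>>;

#[derive(Default)]
struct LayeredIndex {
    /// Append-only: an update never rewrites an existing entry.
    segments: Vec<Segment>,
    /// Ids of documents that were superseded or removed.
    tombstones: BTreeSet<u32>,
}

impl LayeredIndex {
    /// One update = one new segment, plus tombstones for replaced documents.
    fn update(&mut self, new_segment: Segment, superseded: &[u32]) {
        self.tombstones.extend(superseded.iter().copied());
        self.segments.push(new_segment);
    }

    /// Consult every segment and filter tombstoned documents at query time.
    fn lookup(&self, term: &str) -> Vec<u32> {
        let mut hits: Vec<u32> = self
            .segments
            .iter()
            .filter_map(|segment| segment.get(term))
            .flatten()
            .copied()
            .filter(|doc| !self.tombstones.contains(doc))
            .collect();
        hits.sort_unstable();
        hits
    }

    /// Merge all segments into one and drop tombstoned postings.
    /// Only existing segment data is read; no source file is touched.
    fn condense(&mut self) {
        let tombstones = std::mem::take(&mut self.tombstones);
        let mut merged = Segment::new();
        for segment in std::mem::take(&mut self.segments) {
            for (term, docs) in segment {
                let live = docs.into_iter().filter(|doc| !tombstones.contains(doc));
                merged.entry(term).or_default().extend(live);
            }
        }
        merged.retain(|_, docs| !docs.is_empty());
        for docs in merged.values_mut() {
            docs.sort_unstable();
        }
        self.segments = vec![merged];
    }
}
```

In this model a re-indexed file gets a fresh document id in the new segment while its old id is tombstoned, so `lookup` returns the same hits before and after `condense`.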
A segment is two files: seg-NNNNN.fst, an fst::Map from term to a
byte offset, and seg-NNNNN.pos, the postings blob those offsets point into
(a varint-encoded (doc_id, term_frequency) list per term). Global state —
the next document id, the live segment list, the per-file (doc, mtime, size) records that drive staleness, per-document metadata, and the tombstone
set — lives in manifest.json. Segment bytes are read into memory on demand
rather than memory-mapped, keeping the dependency surface to just fst.
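To make the postings layout concrete, here is a minimal sketch of a varint-encoded `(doc_id, term_frequency)` list addressed by byte offset. The exact on-disk layout is not specified above, so two things here are assumptions: the LEB128-style varint, and a leading entry count before the pairs. All function names are hypothetical.

```rust
/// Append `value` as an LEB128-style unsigned varint (7 payload bits per byte).
fn write_varint(out: &mut Vec<u8>, mut value: u64) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80); // high bit set: more bytes follow
        value >>= 7;
    }
    out.push(value as u8);
}

/// Read one varint at `*pos`, advancing `*pos`. Returns None on truncated
/// or over-long input.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None;
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Append one term's postings to the blob and return the byte offset that
/// the term map would store for it.
/// Assumed layout: entry count, then (doc_id, term_frequency) pairs.
fn encode_postings(blob: &mut Vec<u8>, postings: &[(u32, u32)]) -> u64 {
    let offset = blob.len() as u64;
    write_varint(blob, postings.len() as u64);
    for &(doc_id, tf) in postings {
        write_varint(blob, u64::from(doc_id));
        write_varint(blob, u64::from(tf));
    }
    offset
}

/// Decode the postings list that starts at `offset` in the blob.
fn decode_postings(blob: &[u8], offset: u64) -> Option<Vec<(u32, u32)>> {
    let mut pos = offset as usize;
    let count = read_varint(blob, &mut pos)?;
    let mut postings = Vec::new();
    for _ in 0..count {
        let doc_id = read_varint(blob, &mut pos)? as u32;
        let tf = read_varint(blob, &mut pos)? as u32;
        postings.push((doc_id, tf));
    }
    Some(postings)
}
```

Small document ids and frequencies take a single byte each under this encoding, which is the usual reason to prefer varints for postings.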
§Query modes
Index::search understands a small query grammar (see QueryTerm):
plain terms (exact), term* (prefix), term~/term~2 (Levenshtein fuzzy),
and /regex/. Exact, prefix, and fuzzy are native fst automata; the regex
mode drives a regex-automata dense DFA as an fst::Automaton (the
modern equivalent of the transducer feature dropped from regex-automata
0.4), so it prunes the term FST during traversal instead of scanning it.
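The difference between pruning and scanning can be illustrated without the fst crate. The sketch below is a stand-in under stated assumptions: a byte trie replaces the term FST, and `TermAutomaton` is a hypothetical trait loosely modelled on the shape of an automaton with `start`, `accept`, `is_match`, and `can_match`. The traversal abandons a branch as soon as the automaton reports that no continuation can match.

```rust
use std::collections::BTreeMap;

/// Byte trie standing in for the term FST (assumption for illustration).
#[derive(Default)]
struct Trie {
    children: BTreeMap<u8, Trie>,
    terminal: bool,
}

impl Trie {
    fn insert(&mut self, term: &str) {
        let mut node = self;
        for &byte in term.as_bytes() {
            node = node.children.entry(byte).or_default();
        }
        node.terminal = true;
    }
}

/// Hypothetical automaton interface: one state transition per input byte.
trait TermAutomaton {
    type State;
    fn start(&self) -> Self::State;
    fn accept(&self, state: &Self::State, byte: u8) -> Self::State;
    fn is_match(&self, state: &Self::State) -> bool;
    /// False once no further input can lead to a match.
    fn can_match(&self, state: &Self::State) -> bool;
}

/// Prefix matcher. State = number of prefix bytes matched, None = dead.
struct Prefix<'a>(&'a [u8]);

impl<'a> TermAutomaton for Prefix<'a> {
    type State = Option<usize>;
    fn start(&self) -> Option<usize> {
        Some(0)
    }
    fn accept(&self, state: &Option<usize>, byte: u8) -> Option<usize> {
        match *state {
            Some(n) if n >= self.0.len() => Some(n), // prefix consumed: accept anything
            Some(n) if self.0[n] == byte => Some(n + 1),
            _ => None,
        }
    }
    fn is_match(&self, state: &Option<usize>) -> bool {
        state.map_or(false, |n| n >= self.0.len())
    }
    fn can_match(&self, state: &Option<usize>) -> bool {
        state.is_some()
    }
}

/// Walk trie and automaton in lockstep, skipping subtrees that cannot match.
/// `visited` counts the trie nodes actually entered.
fn search<A: TermAutomaton>(
    node: &Trie,
    automaton: &A,
    state: &A::State,
    term: &mut Vec<u8>,
    out: &mut Vec<String>,
    visited: &mut usize,
) {
    *visited += 1;
    if node.terminal && automaton.is_match(state) {
        out.push(String::from_utf8_lossy(term).into_owned());
    }
    for (&byte, child) in &node.children {
        let next = automaton.accept(state, byte);
        if automaton.can_match(&next) {
            term.push(byte);
            search(child, automaton, &next, term, out, visited);
            term.pop();
        }
    }
}
```

A fuzzy or regex matcher would plug into the same traversal by supplying a different state type and transition function; only the automaton changes, not the walk.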
Structs§
- DocSource
- A document ready to index: its searchable text plus the metadata carried into search results.
- FileStat
- A file’s identity for staleness: its path-relative key plus the mtime/size the indexer compares against the manifest to decide what to re-index.
- Index
- A lazily-maintained fst-segment index rooted at a directory (.ct/okf/).
- SearchHit
- One search result.
- UpdateReport
- What an Index::update changed.
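The staleness decision that FileStat feeds can be sketched as a comparison between the manifest's recorded (doc, mtime, size) and the file's current stat. The field names, the `Plan` type, and `plan_update` below are hypothetical; the real structs may be shaped differently.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical current on-disk state of one file.
struct StatSketch {
    key: String, // path-relative key
    mtime: u64,
    size: u64,
}

/// Hypothetical manifest record for one previously indexed file.
struct ManifestEntry {
    doc: u32,
    mtime: u64,
    size: u64,
}

#[derive(Debug, Default, PartialEq)]
struct Plan {
    /// Keys whose content goes into the next segment.
    reindex: Vec<String>,
    /// Document ids to record as tombstones.
    tombstone: Vec<u32>,
}

fn plan_update(manifest: &BTreeMap<String, ManifestEntry>, current: &[StatSketch]) -> Plan {
    let mut plan = Plan::default();
    let mut seen = BTreeSet::new();
    for stat in current {
        seen.insert(stat.key.as_str());
        match manifest.get(&stat.key) {
            // Same mtime and size: treated as fresh, nothing to do.
            Some(entry) if entry.mtime == stat.mtime && entry.size == stat.size => {}
            // Changed: the old document is superseded, the file is re-indexed.
            Some(entry) => {
                plan.tombstone.push(entry.doc);
                plan.reindex.push(stat.key.clone());
            }
            // Never indexed before.
            None => plan.reindex.push(stat.key.clone()),
        }
    }
    // Files recorded in the manifest but no longer present are removed.
    for (key, entry) in manifest {
        if !seen.contains(key.as_str()) {
            plan.tombstone.push(entry.doc);
        }
    }
    plan
}
```

Comparing mtime and size avoids reading file contents for unchanged files, at the cost of missing an edit that preserves both values.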
Enums§
- QueryTerm
- One parsed query token and how it should match the term dictionary.
Functions§
- parse_query
- Parse a query string into its QueryTerms. Whitespace separates tokens; /.../ is one regex token. The non-regex modes share the index’s tokenize so a query splits exactly the way the stored terms did: a token that tokenizes to several terms contributes leading Exact terms, and any */~ operator applies to the final term (the typical token is a single term, so this is just Prefix/Fuzzy). Empty/operator-only tokens are dropped.
- tokenize
- Split text into lowercased alphanumeric terms. This is the shared tokenizer for both indexing and (per-token) querying. Deliberately minimal: Unicode alphanumeric runs, lowercased, no stemming or stop-words, so behaviour is predictable and dependency-free.
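The tokenizing and parsing rules described for these two functions can be approximated as follows. This is a re-implementation sketch, not the crate's code: `QueryTermSketch`, its `Regex` variant and payload shapes, the default fuzzy distance of 1 for a bare `~`, and the simplification that a regex token contains no whitespace are all assumptions.

```rust
/// Hypothetical mirror of the parsed-token enum.
#[derive(Debug, PartialEq)]
enum QueryTermSketch {
    Exact(String),
    Prefix(String),
    Fuzzy(String, u32),
    Regex(String),
}

/// Unicode alphanumeric runs, lowercased; no stemming, no stop-words.
fn tokenize_sketch(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|run| !run.is_empty())
        .map(|run| run.to_lowercase())
        .collect()
}

enum Op {
    Plain,
    Prefix,
    Fuzzy(u32),
}

/// Split a trailing `*`, `~`, or `~N` operator off a token.
fn split_operator(token: &str) -> (&str, Op) {
    if let Some(body) = token.strip_suffix('*') {
        return (body, Op::Prefix);
    }
    if let Some(tilde) = token.rfind('~') {
        let digits = &token[tilde + 1..];
        if digits.is_empty() {
            return (&token[..tilde], Op::Fuzzy(1)); // assumed default distance
        }
        if let Ok(distance) = digits.parse::<u32>() {
            return (&token[..tilde], Op::Fuzzy(distance));
        }
    }
    (token, Op::Plain)
}

fn parse_query_sketch(query: &str) -> Vec<QueryTermSketch> {
    let mut out = Vec::new();
    for token in query.split_whitespace() {
        // Simplification: a regex token is /.../ with no embedded whitespace.
        if token.len() >= 2 && token.starts_with('/') && token.ends_with('/') {
            let body = &token[1..token.len() - 1];
            if !body.is_empty() {
                out.push(QueryTermSketch::Regex(body.to_string()));
            }
            continue;
        }
        let (body, op) = split_operator(token);
        let mut terms = tokenize_sketch(body);
        // Empty or operator-only tokens produce no terms and are dropped.
        let last = match terms.pop() {
            Some(term) => term,
            None => continue,
        };
        // Leading terms are exact; the operator applies to the final term.
        out.extend(terms.into_iter().map(QueryTermSketch::Exact));
        out.push(match op {
            Op::Plain => QueryTermSketch::Exact(last),
            Op::Prefix => QueryTermSketch::Prefix(last),
            Op::Fuzzy(distance) => QueryTermSketch::Fuzzy(last, distance),
        });
    }
    out
}
```

Under these assumptions, a token such as `Full-Text` yields two exact terms, while `index*` yields a single prefix term, matching the rule that the operator binds to the last term of a token.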