Module okfindex

A lazily-maintained full-text index over OKF concept files, built on BurntSushi’s fst crate — the same immutable finite-state-transducer machinery search engines like tantivy use for their term dictionaries.

§Why this shape

The index is a set of immutable segments. Each incremental update writes one new segment for the batch of changed files and never rewrites the existing ones, so layering in new content costs only the new segment, which is the property the suite needed. Superseded or removed documents are recorded as tombstones in the manifest and filtered out at query time; Index::condense later merges every segment into one and drops the tombstoned entries, reclaiming space without re-reading the source files.
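
As an illustration of query-time tombstone filtering, here is a minimal sketch. It assumes document ids are `u32` and that the tombstone set is held as a `HashSet`; the function name `live_hits` and these types are hypothetical and are not taken from this module's API.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of okfindex): keep only the postings whose
/// document has not been tombstoned. `postings` holds
/// (doc_id, term_frequency) pairs gathered from every live segment.
fn live_hits(postings: &[(u32, u32)], tombstones: &HashSet<u32>) -> Vec<(u32, u32)> {
    postings
        .iter()
        .copied()
        // A tombstoned doc id may still sit in an old, immutable segment;
        // it is skipped here instead of being rewritten out of that segment.
        .filter(|(doc_id, _)| !tombstones.contains(doc_id))
        .collect()
}
```

Under these assumptions the segments stay untouched when a document is superseded; only the tombstone set grows until a condense pass rewrites the data.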

A segment is two files: seg-NNNNN.fst, an fst::Map from term to a byte offset, and seg-NNNNN.pos, the postings blob those offsets point into (a varint-encoded (doc_id, term_frequency) list per term). Global state — the next document id, the live segment list, the per-file (doc, mtime, size) records that drive staleness, per-document metadata, and the tombstone set — lives in manifest.json. Segment bytes are read into memory on demand rather than memory-mapped, keeping the dependency surface to just fst.
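
The exact byte layout of the postings blob is not specified here. The sketch below shows one plausible encoding, assuming unsigned LEB128 varints and a leading entry count per term; both are assumptions, and every function name is hypothetical.

```rust
/// Append `value` as an unsigned LEB128 varint: 7 payload bits per byte,
/// with the high bit set on every byte except the last.
fn write_varint(out: &mut Vec<u8>, mut value: u64) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Read one varint starting at `*pos`, advancing `*pos` past it.
/// Returns None on truncated or over-long input.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None;
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Assumed layout for one term: entry count, then (doc_id, term_frequency) pairs.
fn encode_postings(postings: &[(u64, u64)]) -> Vec<u8> {
    let mut out = Vec::new();
    write_varint(&mut out, postings.len() as u64);
    for &(doc_id, term_frequency) in postings {
        write_varint(&mut out, doc_id);
        write_varint(&mut out, term_frequency);
    }
    out
}

/// Decode the postings list that starts at `offset`, the value an
/// fst lookup for the term would return in this sketch.
fn decode_postings(blob: &[u8], offset: usize) -> Option<Vec<(u64, u64)>> {
    let mut pos = offset;
    let count = read_varint(blob, &mut pos)?;
    let mut postings = Vec::new();
    for _ in 0..count {
        let doc_id = read_varint(blob, &mut pos)?;
        let term_frequency = read_varint(blob, &mut pos)?;
        postings.push((doc_id, term_frequency));
    }
    Some(postings)
}
```

With this layout, small doc ids and frequencies take one byte each, and the term dictionary only needs to store a single offset per term.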

§Query modes

Index::search understands a small query grammar (see QueryTerm): plain terms (exact), term* (prefix), term~/term~2 (Levenshtein fuzzy), and /regex/. Exact, prefix, and fuzzy are native fst automata; the regex mode drives a regex-automata dense DFA as an fst::Automaton (the modern equivalent of the transducer feature dropped from regex-automata 0.4), so it prunes the term FST during traversal instead of scanning it.
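
To make the grammar concrete, the following sketch classifies a single whitespace-delimited token into one of the four modes. It is a simplified stand-in, not the module's QueryTerm or parse_query: the enum `SketchTerm`, the function `classify`, and the choice to represent an unspecified fuzzy distance as `None` are all assumptions, and it does not handle regexes containing whitespace or tokens that tokenize to several terms.

```rust
/// Hypothetical stand-in for a parsed query token (not the module's QueryTerm).
#[derive(Debug, PartialEq)]
enum SketchTerm {
    Exact(String),
    Prefix(String),
    /// `distance` is None when the token was written as `term~` with no number.
    Fuzzy { term: String, distance: Option<u32> },
    Regex(String),
}

/// Classify one token. Returns None for empty or operator-only tokens.
fn classify(token: &str) -> Option<SketchTerm> {
    // /regex/ : the body is kept verbatim (no lowercasing).
    if token.len() >= 2 && token.starts_with('/') && token.ends_with('/') {
        let body = &token[1..token.len() - 1];
        return if body.is_empty() {
            None
        } else {
            Some(SketchTerm::Regex(body.to_string()))
        };
    }
    // term* : prefix match.
    if let Some(stem) = token.strip_suffix('*') {
        return if stem.is_empty() {
            None
        } else {
            Some(SketchTerm::Prefix(stem.to_lowercase()))
        };
    }
    // term~ or term~N : fuzzy match with an optional edit distance.
    if let Some(idx) = token.rfind('~') {
        let (stem, rest) = (&token[..idx], &token[idx + 1..]);
        if stem.is_empty() {
            return None;
        }
        if rest.is_empty() {
            return Some(SketchTerm::Fuzzy { term: stem.to_lowercase(), distance: None });
        }
        if let Ok(distance) = rest.parse::<u32>() {
            return Some(SketchTerm::Fuzzy {
                term: stem.to_lowercase(),
                distance: Some(distance),
            });
        }
        // Anything else after `~` falls through and is treated as a plain term.
    }
    if token.is_empty() {
        None
    } else {
        Some(SketchTerm::Exact(token.to_lowercase()))
    }
}
```

Each resulting variant would then select which automaton walks the term FST: an exact lookup, a prefix automaton, a Levenshtein automaton, or a regex DFA.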

Structs§

DocSource
A document ready to index: its searchable text plus the metadata carried into search results.
FileStat
A file’s identity for staleness: its path-relative key plus the mtime/size the indexer compares against the manifest to decide what to re-index.
Index
A lazily-maintained fst-segment index rooted at a directory (.ct/okf/).
SearchHit
One search result.
UpdateReport
What an Index::update changed.

Enums§

QueryTerm
One parsed query token and how it should match the term dictionary.

Functions§

parse_query
Parse a query string into its QueryTerms. Whitespace separates tokens; /.../ is one regex token. The non-regex modes share the index’s tokenize so a query splits exactly the way the stored terms did: a token that tokenizes to several terms contributes leading Exact terms, and any */~ operator applies to the final term (the typical token is a single term, so this is just Prefix/Fuzzy). Empty/operator-only tokens are dropped.
tokenize
Split text into lowercased alphanumeric terms — the shared tokenizer for both indexing and (per-token) querying. Deliberately minimal: Unicode alphanumeric runs, lowercased, no stemming or stop-words, so behaviour is predictable and dependency-free.
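
The tokenizer described above can be approximated with the standard library alone. This is a sketch of the stated behaviour (Unicode alphanumeric runs, lowercased, nothing else), not the module's actual source; the name `tokenize_sketch` and its return type are assumptions.

```rust
/// Hypothetical approximation of the shared tokenizer: split on every
/// character that is not Unicode-alphanumeric, drop empty runs, lowercase
/// what remains. No stemming and no stop-word removal.
fn tokenize_sketch(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|run| !run.is_empty())
        .map(|run| run.to_lowercase())
        .collect()
}
```

Because the same function would run at index time and at query time, a hyphenated token such as `Full-Text` yields the same two terms in both places, which is what lets a query split exactly the way the stored terms did.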