Skip to main content

Module bm25

Module bm25 

Source
Expand description

BM25 keyword search index for code chunks.

Provides camelCase/snake_case-aware tokenization via CodeSplitFilter and an in-RAM tantivy index (Bm25Index) that supports per-field boosted queries so identifier sub-tokens (e.g. json from parseJsonConfig) are matched correctly.

Structs§

BM25Fields
Handles to the tantivy schema fields used by Bm25Index.
Bm25Index
In-RAM BM25 index over a slice of CodeChunks.
CodeSplitFilter
Tantivy TokenFilter that emits sub-tokens for camelCase/snake_case identifiers in addition to the original token.
CodeSplitFilterWrapper
Wrapper tokenizer produced by CodeSplitFilter::transform.
CodeSplitTokenStream
Token stream produced by CodeSplitFilterWrapper.

Functions§

build_schema
Construct the tantivy Schema and return field handles.
code_analyzer
Build a tantivy TextAnalyzer that tokenizes, expands camelCase/snake_case identifiers into sub-tokens, then lowercases everything.
split_code_identifier
Split a code identifier into its constituent sub-words.