Module analysis

luci-analysis — text analysis pipeline for Luci.

Transforms raw text into indexed terms via a three-stage pipeline of character filters, a tokenizer, and token filters:

Raw Text → Char Filters → Tokenizer → Token Filters → Indexed Terms

Provides the CharFilter, Tokenizer, and TokenFilter traits, with built-in implementations modeled on Elasticsearch’s analyzers: standard, simple, whitespace, keyword, stop, and language.

See [[analyzers]] for the full specification.
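
The tokenizer and token-filter stages described above can be sketched as follows. This is a minimal illustration, not the crate's actual API: every name prefixed with `Sketch` or `sketch_` is hypothetical, and the real `Tokenizer`, `TokenFilter`, and `Token` definitions may differ in signatures and fields.

```rust
// Hypothetical sketch: these traits and types are illustrative only and
// are NOT the luci-analysis definitions.

/// A token with its text and byte offsets into the input.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchToken {
    pub text: String,
    pub start: usize,
    pub end: usize,
}

/// Breaks input text into tokens.
pub trait SketchTokenizer {
    fn tokenize(&self, input: &str) -> Vec<SketchToken>;
}

/// Transforms a token stream.
pub trait SketchTokenFilter {
    fn filter(&self, tokens: Vec<SketchToken>) -> Vec<SketchToken>;
}

/// Splits on Unicode whitespace, tracking byte offsets.
pub struct SketchWhitespaceTokenizer;

impl SketchTokenizer for SketchWhitespaceTokenizer {
    fn tokenize(&self, input: &str) -> Vec<SketchToken> {
        let mut tokens = Vec::new();
        let mut start: Option<usize> = None;
        for (i, ch) in input.char_indices() {
            if ch.is_whitespace() {
                // Close the token in progress, if any.
                if let Some(s) = start.take() {
                    tokens.push(SketchToken { text: input[s..i].to_string(), start: s, end: i });
                }
            } else if start.is_none() {
                start = Some(i);
            }
        }
        // Flush a token that runs to the end of the input.
        if let Some(s) = start {
            tokens.push(SketchToken { text: input[s..].to_string(), start: s, end: input.len() });
        }
        tokens
    }
}

/// Lowercases token text, leaving offsets untouched.
pub struct SketchLowercaseFilter;

impl SketchTokenFilter for SketchLowercaseFilter {
    fn filter(&self, tokens: Vec<SketchToken>) -> Vec<SketchToken> {
        tokens
            .into_iter()
            .map(|mut t| {
                t.text = t.text.to_lowercase();
                t
            })
            .collect()
    }
}

/// One tokenizer followed by zero or more token filters, applied in order.
pub struct SketchAnalyzer {
    pub tokenizer: Box<dyn SketchTokenizer>,
    pub filters: Vec<Box<dyn SketchTokenFilter>>,
}

impl SketchAnalyzer {
    pub fn analyze(&self, input: &str) -> Vec<SketchToken> {
        let mut tokens = self.tokenizer.tokenize(input);
        for f in &self.filters {
            tokens = f.filter(tokens);
        }
        tokens
    }
}

/// Whitespace tokenizer + lowercase filter, returning only the term text.
pub fn sketch_terms(input: &str) -> Vec<String> {
    let filters: Vec<Box<dyn SketchTokenFilter>> = vec![Box::new(SketchLowercaseFilter)];
    let analyzer = SketchAnalyzer { tokenizer: Box::new(SketchWhitespaceTokenizer), filters };
    analyzer.analyze(input).into_iter().map(|t| t.text).collect()
}
```

Calling `sketch_terms("Hello  Quick World")` yields the terms `hello`, `quick`, and `world`. Keeping filters behind a trait object lets a pipeline be assembled at runtime, which is one plausible way a named-analyzer registry could be built.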

Modules§

config

Structs§

Analyzer: A complete text analysis pipeline: char filters + tokenizer + token filters.
AnalyzerRegistry: Registry of named analyzers with fallback resolution.
AsciiFoldingFilter: Converts Unicode characters to their ASCII equivalents.
EdgeNGramTokenFilter: Generates edge n-grams (prefix n-grams) from each token.
EdgeNGramTokenizer: Produces edge n-grams (prefix n-grams) from the input text.
HtmlStripCharFilter: Strips HTML tags and decodes common HTML entities.
KeywordTokenizer: Emits the entire input as a single token.
LetterTokenizer: Splits text on non-letter characters.
LowercaseFilter: Lowercases all token text.
MappingCharFilter: Replaces characters/strings using a mapping table.
NGramTokenFilter: Generates n-grams from each token.
NGramTokenizer: Produces n-grams of specified sizes from the input text.
OffsetCorrection: Character filters transform raw text before tokenization.
PathHierarchyTokenizer: Splits filesystem paths into hierarchical tokens.
PatternReplaceCharFilter: Replaces characters matching a regex pattern.
PatternTokenizer: Splits text using a regular expression pattern.
ShingleFilter: Produces word-level n-grams (shingles) from the token stream.
StandardTokenizer: Unicode Text Segmentation tokenizer (UAX#29 word boundaries).
StemmerFilter: Reduces tokens to their word stems using the Snowball algorithm.
StopFilter: Removes stop words from the token stream.
SynonymFilter: Expands or replaces tokens with synonyms.
Token: A single token produced by the analysis pipeline.
WhitespaceTokenizer: Splits text on Unicode whitespace.
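
To make the difference between edge n-grams and shingles concrete, here is a small sketch of both ideas. The function names and parameters (`sketch_edge_ngrams`, `sketch_shingles`, `min_gram`, `max_gram`, `size`) are assumptions for illustration and do not reflect the configuration of `EdgeNGramTokenFilter` or `ShingleFilter`; in particular, this shingle sketch emits only fixed-size shingles and no unigrams.

```rust
// Hypothetical helpers, not part of luci-analysis.

/// Prefixes of `token` with lengths from `min_gram` to `max_gram`,
/// counted in characters rather than bytes.
pub fn sketch_edge_ngrams(token: &str, min_gram: usize, max_gram: usize) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let upper = max_gram.min(chars.len());
    // An empty range (min > upper) produces no grams.
    (min_gram.max(1)..=upper)
        .map(|n| chars[..n].iter().collect::<String>())
        .collect()
}

/// Word-level n-grams: every run of `size` adjacent tokens, joined by a space.
pub fn sketch_shingles(tokens: &[&str], size: usize) -> Vec<String> {
    if size == 0 {
        // `windows(0)` would panic, so treat zero as "no shingles".
        return Vec::new();
    }
    tokens.windows(size).map(|w| w.join(" ")).collect()
}
```

For example, `sketch_edge_ngrams("quick", 1, 3)` gives `q`, `qu`, `qui`, while `sketch_shingles(&["the", "quick", "fox"], 2)` gives `the quick` and `quick fox`.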

Enums§

StemmerAlgorithm: Re-export of the rust-stemmers Algorithm enum, so callers don’t need to depend on rust-stemmers directly. Lists all supported stemming algorithms; see the Snowball website for details.

Traits§

CharFilter: Transforms raw text before tokenization.
TokenFilter: Transforms tokens in the analysis pipeline.
Tokenizer: Breaks input text into a sequence of tokens.
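
As an illustration of the character-filter stage, the sketch below rewrites raw text with a mapping table before any tokenization happens. The trait `SketchCharFilter`, its `apply` method, and the first-match-in-table-order rule are all assumptions made for this example; the actual `CharFilter` trait and `MappingCharFilter` may use different signatures and matching rules.

```rust
// Hypothetical sketch, not the luci-analysis CharFilter API.

/// Transforms raw text before tokenization.
pub trait SketchCharFilter {
    fn apply(&self, input: &str) -> String;
}

/// Replaces substrings using a (from, to) mapping table.
pub struct SketchMappingCharFilter {
    pub mappings: Vec<(String, String)>,
}

impl SketchCharFilter for SketchMappingCharFilter {
    fn apply(&self, input: &str) -> String {
        let mut out = String::with_capacity(input.len());
        let mut rest = input;
        'outer: while !rest.is_empty() {
            // Assumption: the first table entry that matches here wins.
            for (from, to) in &self.mappings {
                if !from.is_empty() && rest.starts_with(from.as_str()) {
                    out.push_str(to);
                    rest = &rest[from.len()..];
                    continue 'outer;
                }
            }
            // No mapping matched: copy one character through unchanged.
            let ch = rest.chars().next().unwrap();
            out.push(ch);
            rest = &rest[ch.len_utf8()..];
        }
        out
    }
}
```

With the mappings `&` → `and` and `:)` → `smile`, the input `R&D :)` becomes `RandD smile`. Because replacements can change the text length, token offsets computed on the filtered text no longer line up with the original, which is why offset correction exists.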

Functions§

correct_offset: Map a byte offset in filtered text back to the original text.
keyword_analyzer: keyword analyzer: keyword tokenizer, no filters.
language_analyzer: language analyzer: UAX#29 tokenizer + lowercase + stop words + stemmer.
simple_analyzer: simple analyzer: letter tokenizer + lowercase filter.
standard_analyzer: standard analyzer: UAX#29 tokenizer + lowercase filter.
stop_analyzer: stop analyzer: UAX#29 tokenizer + lowercase + English stop words.
whitespace_analyzer: whitespace analyzer: whitespace tokenizer, no filters.
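
One common way to map offsets in filtered text back to the original is to record a cumulative delta at each point where a character filter changed the text length. The sketch below shows that idea under stated assumptions: `SketchOffsetMap`, `record`, and `correct` are hypothetical names, and the real `OffsetCorrection` type and `correct_offset` function may store and look up corrections differently.

```rust
// Hypothetical sketch, not the luci-analysis implementation.

/// Records where filtered text diverges from the original.
pub struct SketchOffsetMap {
    /// (byte offset in filtered text, cumulative delta to add),
    /// kept sorted by offset.
    corrections: Vec<(usize, isize)>,
}

impl SketchOffsetMap {
    pub fn new() -> Self {
        SketchOffsetMap { corrections: Vec::new() }
    }

    /// Note that from `filtered_offset` onward, original = filtered + delta.
    /// Assumes calls are made in increasing offset order.
    pub fn record(&mut self, filtered_offset: usize, cumulative_delta: isize) {
        self.corrections.push((filtered_offset, cumulative_delta));
    }

    /// Map a byte offset in the filtered text back to the original text.
    pub fn correct(&self, filtered_offset: usize) -> usize {
        // Index of the first correction that starts after the query offset.
        let idx = self
            .corrections
            .partition_point(|&(off, _)| off <= filtered_offset);
        if idx == 0 {
            // No correction applies yet: offsets are identical.
            return filtered_offset;
        }
        let delta = self.corrections[idx - 1].1;
        (filtered_offset as isize + delta) as usize
    }
}
```

Take the original text `<b>hi</b> there` filtered to `hi there`. Stripping `<b>` records `(0, 3)` and stripping `</b>` records `(2, 7)`. The token `there` starts at byte 3 in the filtered text, and `correct(3)` maps it back to byte 10 in the original.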
