Module analysis

luci-analysis — text analysis pipeline for Luci.

Transforms raw text into indexed terms via a three-stage pipeline:

Raw Text → Char Filters → Tokenizer → Token Filters → Indexed Terms

Provides the CharFilter, Tokenizer, and TokenFilter traits, with built-in implementations that follow Elasticsearch’s analyzer model: the standard, simple, whitespace, keyword, stop, and language analyzers.

See [[analyzers]] for the full specification.
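
The tokenizer and token-filter stages described above can be pictured with a small, self-contained sketch. Every name in it (`SketchToken`, `SketchTokenizer`, `SketchFilter`, `SplitOnWhitespace`, `Lowercase`, `SketchAnalyzer`) is hypothetical and chosen for illustration; the actual trait signatures in this module may differ, and char filters are omitted for brevity.

```rust
// Illustrative only: these types are assumptions, not this crate's API.

/// A token with its text and byte offsets into the original input.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchToken {
    pub text: String,
    pub start: usize,
    pub end: usize,
}

/// Stage 1 (hypothetical shape): break input text into tokens.
pub trait SketchTokenizer {
    fn tokenize(&self, input: &str) -> Vec<SketchToken>;
}

/// Stage 2 (hypothetical shape): transform the token stream.
pub trait SketchFilter {
    fn filter(&self, tokens: Vec<SketchToken>) -> Vec<SketchToken>;
}

/// Splits on Unicode whitespace, keeping byte offsets.
pub struct SplitOnWhitespace;

impl SketchTokenizer for SplitOnWhitespace {
    fn tokenize(&self, input: &str) -> Vec<SketchToken> {
        let mut tokens = Vec::new();
        let mut start: Option<usize> = None;
        for (i, ch) in input.char_indices() {
            if ch.is_whitespace() {
                // Whitespace closes the token in progress, if any.
                if let Some(s) = start.take() {
                    tokens.push(SketchToken { text: input[s..i].to_string(), start: s, end: i });
                }
            } else if start.is_none() {
                start = Some(i);
            }
        }
        // Flush a token that runs to the end of the input.
        if let Some(s) = start {
            tokens.push(SketchToken { text: input[s..].to_string(), start: s, end: input.len() });
        }
        tokens
    }
}

/// Lowercases token text; offsets are left untouched.
pub struct Lowercase;

impl SketchFilter for Lowercase {
    fn filter(&self, tokens: Vec<SketchToken>) -> Vec<SketchToken> {
        tokens
            .into_iter()
            .map(|mut t| {
                t.text = t.text.to_lowercase();
                t
            })
            .collect()
    }
}

/// Chains one tokenizer with an ordered list of filters.
pub struct SketchAnalyzer {
    tokenizer: Box<dyn SketchTokenizer>,
    filters: Vec<Box<dyn SketchFilter>>,
}

impl SketchAnalyzer {
    pub fn new(tokenizer: Box<dyn SketchTokenizer>, filters: Vec<Box<dyn SketchFilter>>) -> Self {
        Self { tokenizer, filters }
    }

    /// Runs the tokenizer, then each filter in order.
    pub fn analyze(&self, input: &str) -> Vec<SketchToken> {
        let tokens = self.tokenizer.tokenize(input);
        self.filters.iter().fold(tokens, |acc, f| f.filter(acc))
    }
}
```

Under these assumptions, building `SketchAnalyzer::new(Box::new(SplitOnWhitespace), vec![Box::new(Lowercase) as Box<dyn SketchFilter>])` and calling `analyze` on some text returns lowercased tokens whose offsets still point into the original input.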

Modules

config

Structs

Analyzer
A complete text analysis pipeline: char filters + tokenizer + token filters.
AnalyzerRegistry
Registry of named analyzers with fallback resolution.
AsciiFoldingFilter
Converts Unicode characters to their ASCII equivalents.
EdgeNGramTokenFilter
Generates edge n-grams (prefix n-grams) from each token.
EdgeNGramTokenizer
Produces edge n-grams (prefix n-grams) from the input text.
HtmlStripCharFilter
Strips HTML tags and decodes common HTML entities.
KeywordTokenizer
Emits the entire input as a single token.
LetterTokenizer
Splits text on non-letter characters.
LowercaseFilter
Lowercases all token text.
MappingCharFilter
Replaces characters/strings using a mapping table.
NGramTokenFilter
Generates n-grams from each token.
NGramTokenizer
Produces n-grams of specified sizes from the input text.
OffsetCorrection
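Offset correction recorded when a character filter transforms raw text before tokenization; see correct_offset for mapping offsets back to the original text.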
PathHierarchyTokenizer
Splits filesystem paths into hierarchical tokens.
PatternReplaceCharFilter
Replaces characters matching a regex pattern.
PatternTokenizer
Splits text using a regular expression pattern.
ShingleFilter
Produces word-level n-grams (shingles) from the token stream.
StandardTokenizer
Unicode Text Segmentation tokenizer (UAX#29 word boundaries).
StemmerFilter
Reduces tokens to their word stems using the Snowball algorithm.
StopFilter
Removes stop words from the token stream.
SynonymFilter
Expands or replaces tokens with synonyms.
Token
A single token produced by the analysis pipeline.
WhitespaceTokenizer
Splits text on Unicode whitespace.
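
Several of the structs above are built around n-grams. As a rough illustration of the terms "edge n-gram", "n-gram", and "shingle", the sketch below uses free functions with hypothetical names (`edge_ngrams`, `ngrams`, `shingles`). It does not reproduce the configuration or output ordering of EdgeNGramTokenFilter, NGramTokenFilter, or ShingleFilter, which may differ.

```rust
// Illustrative only: hypothetical helpers, not this crate's API.

/// Edge n-grams: prefixes of `token` with lengths `min..=max`, counted in chars.
pub fn edge_ngrams(token: &str, min: usize, max: usize) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let upper = max.min(chars.len());
    // An inverted range (min > upper) is simply empty.
    (min.max(1)..=upper)
        .map(|n| chars[..n].iter().collect())
        .collect()
}

/// N-grams: every substring of `token` with lengths `min..=max`, counted in chars.
pub fn ngrams(token: &str, min: usize, max: usize) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let mut out = Vec::new();
    for start in 0..chars.len() {
        for n in min.max(1)..=max {
            if start + n > chars.len() {
                break;
            }
            out.push(chars[start..start + n].iter().collect());
        }
    }
    out
}

/// Shingles: word-level n-grams of exactly `size` adjacent tokens, joined by a space.
pub fn shingles(tokens: &[&str], size: usize) -> Vec<String> {
    if size == 0 {
        // `windows(0)` would panic, so treat size 0 as "no shingles".
        return Vec::new();
    }
    tokens.windows(size).map(|w| w.join(" ")).collect()
}
```

With these definitions, `edge_ngrams("quick", 1, 3)` yields the prefixes `q`, `qu`, `qui`, and `shingles(&["the", "quick", "fox"], 2)` yields `the quick` and `quick fox`.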

Enums

StemmerAlgorithm
Re-export of the Algorithm enum from rust-stemmers, so callers don’t need to depend on rust-stemmers directly. Lists all supported stemming algorithms; see the Snowball website for details.

Traits

CharFilter
Transforms raw text before tokenization.
TokenFilter
Transforms tokens in the analysis pipeline.
Tokenizer
Breaks input text into a sequence of tokens.

Functions

correct_offset
Map a byte offset in filtered text back to the original text.
keyword_analyzer
keyword analyzer: keyword tokenizer, no filters.
language_analyzer
language analyzer: UAX#29 tokenizer + lowercase + stop words + stemmer.
simple_analyzer
simple analyzer: letter tokenizer + lowercase filter.
standard_analyzer
standard analyzer: UAX#29 tokenizer + lowercase filter.
stop_analyzer
stop analyzer: UAX#29 tokenizer + lowercase + English stop words.
whitespace_analyzer
whitespace analyzer: whitespace tokenizer, no filters.
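
The idea behind correct_offset, mapping a byte offset in filtered text back to the original text, can be sketched as follows. This is one possible scheme under stated assumptions, not this module's implementation: the `Correction` record, the toy `strip_tags` filter, and `map_to_original` are all hypothetical names, and the real OffsetCorrection type may store different data.

```rust
// Illustrative only: hypothetical types and functions, not this crate's API.

/// From `filtered_offset` onward, add `diff` (cumulative bytes removed so far)
/// to a filtered-text offset to obtain the original-text offset.
#[derive(Debug, Clone, Copy)]
pub struct Correction {
    pub filtered_offset: usize,
    pub diff: isize,
}

/// Toy char filter: removes `<...>` spans and records a correction after each one.
pub fn strip_tags(input: &str) -> (String, Vec<Correction>) {
    let mut out = String::new();
    let mut corrections = Vec::new();
    let mut removed = 0usize; // total bytes dropped so far
    let mut in_tag = false;
    for ch in input.chars() {
        match ch {
            '<' if !in_tag => {
                in_tag = true;
                removed += 1;
            }
            '>' if in_tag => {
                in_tag = false;
                removed += 1;
                // Text emitted from here on is shifted by `removed` bytes.
                corrections.push(Correction {
                    filtered_offset: out.len(),
                    diff: removed as isize,
                });
            }
            _ if in_tag => removed += ch.len_utf8(),
            _ => out.push(ch),
        }
    }
    (out, corrections)
}

/// Maps a byte offset in the filtered text back to the original text by
/// applying the last correction at or before that offset.
pub fn map_to_original(corrections: &[Correction], offset: usize) -> usize {
    // Corrections are recorded in increasing `filtered_offset` order,
    // so a binary search via `partition_point` is valid.
    let idx = corrections.partition_point(|c| c.filtered_offset <= offset);
    let diff = if idx == 0 { 0 } else { corrections[idx - 1].diff };
    (offset as isize + diff) as usize
}
```

For the input `<b>hi</b> there`, the toy filter produces `hi there`; offset 0 in the filtered text maps back to byte 3 of the original, where `h` actually sits. A token's start offset can then be reported against the original input even though tokenization ran on the filtered text.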