luci-analysis — text analysis pipeline for Luci.
Transforms raw text into indexed terms via a three-stage pipeline:
Raw Text → Tokenizer → Token Filters → Indexed Terms

Provides the Tokenizer and TokenFilter traits with built-in
implementations matching Elasticsearch’s analyzer model: standard,
simple, whitespace, and keyword analyzers.
See [[analyzers]] for the full specification.
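
The pipeline described above can be sketched in a few lines of standalone Rust. This is a minimal illustration, not the crate's code: the trait method names (`tokenize`, `filter`), the `Token` fields, and the `Analyzer` layout below are assumptions made for the example, and the real luci-analysis signatures may differ.

```rust
// Hypothetical stand-ins for illustration only; the actual luci-analysis
// types and method signatures are not shown in this page and may differ.

/// A token with its text and byte offsets into the original input.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    text: String,
    start: usize,
    end: usize,
}

/// Stage 1: break raw text into tokens.
trait Tokenizer {
    fn tokenize(&self, input: &str) -> Vec<Token>;
}

/// Stage 2: transform the token stream.
trait TokenFilter {
    fn filter(&self, tokens: Vec<Token>) -> Vec<Token>;
}

/// Splits on Unicode whitespace and records byte offsets.
struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, input: &str) -> Vec<Token> {
        let mut tokens = Vec::new();
        let mut start: Option<usize> = None;
        for (i, ch) in input.char_indices() {
            if ch.is_whitespace() {
                // Whitespace closes the token in progress, if any.
                if let Some(s) = start.take() {
                    tokens.push(Token { text: input[s..i].to_string(), start: s, end: i });
                }
            } else if start.is_none() {
                start = Some(i);
            }
        }
        // Flush a token that runs to the end of the input.
        if let Some(s) = start {
            tokens.push(Token { text: input[s..].to_string(), start: s, end: input.len() });
        }
        tokens
    }
}

/// Lowercases the text of every token; offsets are left unchanged.
struct LowercaseFilter;

impl TokenFilter for LowercaseFilter {
    fn filter(&self, tokens: Vec<Token>) -> Vec<Token> {
        tokens
            .into_iter()
            .map(|mut t| {
                t.text = t.text.to_lowercase();
                t
            })
            .collect()
    }
}

/// One tokenizer followed by an ordered chain of token filters.
struct Analyzer {
    tokenizer: Box<dyn Tokenizer>,
    filters: Vec<Box<dyn TokenFilter>>,
}

impl Analyzer {
    /// Stage 3: the filtered tokens are the terms that would be indexed.
    fn analyze(&self, input: &str) -> Vec<Token> {
        let tokens = self.tokenizer.tokenize(input);
        self.filters.iter().fold(tokens, |acc, f| f.filter(acc))
    }
}
```

Building an `Analyzer` from `WhitespaceTokenizer` and `LowercaseFilter` and calling `analyze` on mixed-case input yields lowercased tokens whose offsets still point into the original string. Boxed trait objects keep the filter chain configurable at runtime, which is one plausible way to support a registry of named analyzers.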
Modules§
Structs§
- Analyzer - A complete text analysis pipeline: char filters + tokenizer + token filters.
- AnalyzerRegistry - Registry of named analyzers with fallback resolution.
- AsciiFoldingFilter - Converts Unicode characters to their ASCII equivalents.
- EdgeNGramTokenFilter - Generates edge n-grams (prefix n-grams) from each token.
- EdgeNGramTokenizer - Produces edge n-grams (prefix n-grams) from the input text.
- HtmlStripCharFilter - Strips HTML tags and decodes common HTML entities.
- KeywordTokenizer - Emits the entire input as a single token.
- LetterTokenizer - Splits text on non-letter characters.
- LowercaseFilter - Lowercases all token text.
- MappingCharFilter - Replaces characters/strings using a mapping table.
- NGramTokenFilter - Generates n-grams from each token.
- NGramTokenizer - Produces n-grams of specified sizes from the input text.
- OffsetCorrection - Character filters transform raw text before tokenization.
- PathHierarchyTokenizer - Splits filesystem paths into hierarchical tokens.
- PatternReplaceCharFilter - Replaces characters matching a regex pattern.
- PatternTokenizer - Splits text using a regular expression pattern.
- ShingleFilter - Produces word-level n-grams (shingles) from the token stream.
- StandardTokenizer - Unicode Text Segmentation tokenizer (UAX#29 word boundaries).
- StemmerFilter - Reduces tokens to their word stems using the Snowball algorithm.
- StopFilter - Removes stop words from the token stream.
- SynonymFilter - Expands or replaces tokens with synonyms.
- Token - A single token produced by the analysis pipeline.
- WhitespaceTokenizer - Splits text on Unicode whitespace.
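
To make "edge n-grams (prefix n-grams)" concrete, here is a small standalone sketch of the technique. The function name `edge_ngrams` and the `min_gram` / `max_gram` parameters are assumptions borrowed from Elasticsearch's naming; they are not taken from the crate's API, and gram lengths are counted in characters here, which the crate may or may not do.

```rust
/// Hypothetical helper (not part of luci-analysis): returns every prefix of
/// `token` whose length in characters lies in `min_gram..=max_gram`.
fn edge_ngrams(token: &str, min_gram: usize, max_gram: usize) -> Vec<String> {
    let mut grams = Vec::new();
    for (n, (i, ch)) in token.char_indices().enumerate() {
        let len = n + 1; // prefix length in characters
        if len > max_gram {
            break;
        }
        if len >= min_gram {
            // Slice up to the end of the current character so the cut
            // always falls on a UTF-8 boundary.
            grams.push(token[..i + ch.len_utf8()].to_string());
        }
    }
    grams
}
```

With `min_gram = 1` and `max_gram = 3`, the token `quick` produces `q`, `qu`, and `qui`. Indexing those prefixes is what makes search-as-you-type matching possible with exact term lookups.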
Enums§
- StemmerAlgorithm - Re-export the Algorithm enum so callers don’t need to depend on rust-stemmers directly. Enum of all supported algorithms. Check the Snowball-Website for details.
Traits§
- CharFilter - Transforms raw text before tokenization.
- TokenFilter - Transforms tokens in the analysis pipeline.
- Tokenizer - Breaks input text into a sequence of tokens.
Functions§
- correct_offset - Map a byte offset in filtered text back to the original text.
- keyword_analyzer - keyword analyzer: keyword tokenizer, no filters.
- language_analyzer - language analyzer: UAX#29 tokenizer + lowercase + stop words + stemmer.
- simple_analyzer - simple analyzer: letter tokenizer + lowercase filter.
- standard_analyzer - standard analyzer: UAX#29 tokenizer + lowercase filter.
- stop_analyzer - stop analyzer: UAX#29 tokenizer + lowercase + English stop words.
- whitespace_analyzer - whitespace analyzer: whitespace tokenizer, no filters.
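
Mapping a byte offset in filtered text back to the original, as correct_offset is described as doing, is commonly implemented with a table of cumulative length differences. The sketch below shows that general technique under stated assumptions: `OffsetMap`, `replace_with_offsets`, and `correct` are hypothetical names invented for this example, and nothing here reflects how the crate's `correct_offset` or `OffsetCorrection` actually work.

```rust
/// Hypothetical offset table (not the crate's type). Each entry is
/// (offset in filtered text, cumulative bytes removed up to that point).
struct OffsetMap {
    corrections: Vec<(usize, isize)>,
}

impl OffsetMap {
    /// Map a byte offset in the filtered text back to the original text.
    fn correct(&self, offset: usize) -> usize {
        // Number of corrections that apply at or before `offset`.
        let idx = self.corrections.partition_point(|&(off, _)| off <= offset);
        if idx == 0 {
            offset // before the first replacement: nothing has shifted
        } else {
            (offset as isize + self.corrections[idx - 1].1) as usize
        }
    }
}

/// Replace every occurrence of `from` (assumed non-empty) with `to`,
/// recording how far later offsets have shifted.
fn replace_with_offsets(input: &str, from: &str, to: &str) -> (String, OffsetMap) {
    let mut out = String::new();
    let mut corrections = Vec::new();
    let mut cumulative: isize = 0;
    let mut last = 0;
    for (start, matched) in input.match_indices(from) {
        out.push_str(&input[last..start]);
        out.push_str(to);
        last = start + matched.len();
        cumulative += from.len() as isize - to.len() as isize;
        // Offsets at or after the end of this replacement shift by `cumulative`.
        corrections.push((out.len(), cumulative));
    }
    out.push_str(&input[last..]);
    (out, OffsetMap { corrections })
}
```

Replacing `&amp;` with `&` in `a &amp; b` gives `a & b`, and the `b` at filtered offset 4 maps back to original offset 8. Offsets that fall inside a replaced span are only approximated by this scheme, which is usually acceptable because token boundaries rarely land there.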