Module analiticcl::search

Structs§

  • Refers to a match and its unigram context
  • Represents a match between the input text and the lexicon.
  • Byte Offset
  • Intermediate data structure tied to the Finite State Transducer used in most_likely_sequence(). Holds the output symbol for each FST state and allows relating output symbols back to the input structures.
  • A complete sequence of output symbols with associated emission and language model (log) probabilities.
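
To illustrate why a sequence carries both an emission and a language model log probability, here is a minimal sketch of how two candidate sequences might be ranked. The type SequenceSketch, its field names, the function best_sequence and the lm_weight parameter are all hypothetical and are not taken from the crate; the weighted sum is an assumed scoring scheme, shown only because log probabilities combine by addition.

```rust
/// Hypothetical stand-in for a complete output sequence (not the crate's type).
#[derive(Debug, Clone, PartialEq)]
pub struct SequenceSketch {
    /// Output symbols, represented here simply as strings.
    pub symbols: Vec<String>,
    /// Sum of the emission log probabilities along the sequence.
    pub emission_logprob: f64,
    /// Sum of the language model log probabilities along the sequence.
    pub lm_logprob: f64,
}

impl SequenceSketch {
    /// Assumed scoring: log probabilities are added, with the language
    /// model term scaled by `lm_weight`.
    pub fn score(&self, lm_weight: f64) -> f64 {
        self.emission_logprob + lm_weight * self.lm_logprob
    }
}

/// Return the candidate with the highest combined score, if any.
pub fn best_sequence(candidates: &[SequenceSketch], lm_weight: f64) -> Option<&SequenceSketch> {
    candidates
        .iter()
        .max_by(|a, b| a.score(lm_weight).total_cmp(&b.score(lm_weight)))
}
```

With lm_weight set to 0.0 only the emission term counts; raising it lets a fluent sequence overtake one whose individual matches are closer to the input.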

Enums§

Constants§

Functions§

  • Classify the token boundaries as detected by find_boundaries as either weak, normal or hard boundaries. This information determines how eager the system is to split on certain boundaries.
  • Given a text string, identify at what points token boundaries occur, for instance between alphabetic characters and punctuation. The text string always ends with a boundary (but it may be a dummy one that covers no length).
  • Find all ngrams in the text of the specified order, respecting the boundaries. This will return a vector of Match instances, referring to the precise (untokenised) text.
  • A redundant match is a higher order match whose unigram components, considered separately, already achieve a perfect distance score.
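
The boundary and ngram steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the crate's implementation: the names Boundary, Strength, find_boundaries_sketch, classify_sketch and find_ngrams_sketch are hypothetical, every run of non-alphanumeric characters is assumed to be one boundary, and the weak/normal/hard rules are invented for the example.

```rust
/// Hypothetical boundary: byte offset where it begins and its length in bytes.
#[derive(Debug, Clone, PartialEq)]
pub struct Boundary {
    pub begin: usize,
    pub len: usize,
}

/// Hypothetical boundary strength, mirroring the weak/normal/hard distinction.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Strength {
    Weak,
    Normal,
    Hard,
}

/// Assumption: each run of non-alphanumeric characters forms one boundary.
/// The result always ends with a boundary; if the text ends in a token,
/// a zero-length dummy boundary is appended at the end of the text.
pub fn find_boundaries_sketch(text: &str) -> Vec<Boundary> {
    let mut boundaries = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in text.char_indices() {
        if c.is_alphanumeric() {
            // A token character closes any boundary that was open.
            if let Some(b) = start.take() {
                boundaries.push(Boundary { begin: b, len: i - b });
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    match start {
        Some(b) => boundaries.push(Boundary { begin: b, len: text.len() - b }),
        None => boundaries.push(Boundary { begin: text.len(), len: 0 }),
    }
    boundaries
}

/// Assumed rules: sentence punctuation or a newline is hard, a boundary made
/// up only of hyphens or apostrophes is weak, everything else is normal.
pub fn classify_sketch(text: &str, b: &Boundary) -> Strength {
    let s = &text[b.begin..b.begin + b.len];
    if s.contains(|c: char| matches!(c, '.' | '!' | '?' | '\n')) {
        Strength::Hard
    } else if !s.is_empty() && s.chars().all(|c| matches!(c, '-' | '\'')) {
        Strength::Weak
    } else {
        Strength::Normal
    }
}

/// Return (begin, end) byte ranges of all ngrams of the given order.
/// Each range runs from the start of the first token to the end of the last,
/// so it indexes into the original, untokenised text.
pub fn find_ngrams_sketch(boundaries: &[Boundary], order: usize) -> Vec<(usize, usize)> {
    if order == 0 {
        return Vec::new();
    }
    // Tokens are the non-empty gaps between consecutive boundaries.
    let mut tokens = Vec::new();
    let mut begin = 0;
    for b in boundaries {
        if b.begin > begin {
            tokens.push((begin, b.begin));
        }
        begin = b.begin + b.len;
    }
    tokens
        .windows(order)
        .map(|w| (w[0].0, w[order - 1].1))
        .collect()
}
```

Because the returned ranges are byte offsets into the input, slicing the text with a bigram range yields the two tokens together with the boundary text between them, unchanged.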