Crate whydrogen

Whydrogen is a slightly opinionated parser for search queries from humans.

Its main purpose is converting strings of text from a search entry into an easier-to-process list of tokens/lexemes (depending on what you are doing with them).

The search syntax in a nutshell:

  • Unless something else applies, everything separated by a space is a word.
  • Characters are classified by Unicode category groups.
  • Any sequence of whitespace (Unicode whitespace category plus newline and tab), or the start or end of a query, can be a token separator.
  • Phrases are quoted sequences of text.
    • Supported pairs of quotes are "…", »…« and «…»; this will be expanded in the future.
    • A phrase can only start after a token separator.
    • The closing quote must be followed by a token separator (if not, it is taken as part of the phrase).
    • Any token separator inside a quote is taken as its literal character (as in most quoting syntaxes).
    • Inside a phrase, a backslash \ can be used to escape the closing quote character (independent of any token separators).
    • A double backslash \\ can be used to unambiguously represent a backslash inside the quotes.
    • A backslash followed by anything else is taken as is.
    • A minus - before the first quote marks the phrase as inverted.
    • An unclosed phrase is ignored: the part with the opening quote is treated as a word, and parsing continues as usual after that.
  • Key-Value pairs are a keyword and an optionally quoted value separated by a colon :.
    • A keyword may contain any alphanumeric character (Unicode letter or number) as well as the characters -, _ and . but it may only start with an alphanumeric.
    • Valid keywords are implementation defined.
    • A minus - before the keyword marks the Key-Value pair as inverted.
  • Prefixed values are optionally quoted values prefixed with a single non-alphanumeric character.
    • Valid prefixes are implementation defined.
    • Prefixed values are parsed to the same data structure as Key-Value pairs.
    • A minus - before the prefix marks the prefixed value as inverted.
  • Optionally quoted means either:
    • A text literal that ends at the next token separator, like a word.
    • Quoted text following the same quoting rules as phrases (but starting immediately instead of after a token separator); an additional quote pair of semicolons ;…; is supported.
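
A few of the rules above can be sketched in plain Rust. This is only an illustration of the described syntax, not the crate's implementation: the helper names closing_quote, is_keyword_shaped and unescape_phrase are hypothetical and are not part of the whydrogen API. It also assumes that char::is_alphanumeric is a close enough stand-in for "Unicode letter or number".

```rust
/// Hypothetical helper: closing quote for a supported opening phrase quote.
/// (The extra ;…; pair for optionally quoted values is not covered here.)
fn closing_quote(open: char) -> Option<char> {
    match open {
        '"' => Some('"'),
        '»' => Some('«'),
        '«' => Some('»'),
        _ => None,
    }
}

/// Hypothetical helper: checks the keyword shape described above.
/// Starts with an alphanumeric, then alphanumerics, '-', '_' or '.'.
/// Assumption: `char::is_alphanumeric` approximates "Unicode letter or number".
fn is_keyword_shaped(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) if first.is_alphanumeric() => {
            chars.all(|c| c.is_alphanumeric() || matches!(c, '-' | '_' | '.'))
        }
        _ => false, // empty, or starts with a non-alphanumeric
    }
}

/// Hypothetical helper: resolves escapes in the text between the quotes.
/// `\<closing quote>` becomes the quote, `\\` becomes one backslash,
/// and a backslash followed by anything else is kept as is.
fn unescape_phrase(body: &str, close: char) -> String {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.peek() {
            Some(&next) if next == close || next == '\\' => {
                out.push(next);
                chars.next(); // consume the escaped character
            }
            _ => out.push('\\'), // not an escape: keep the backslash
        }
    }
    out
}
```

Under these assumptions, is_keyword_shaped accepts file-type.v2 but rejects -tag, and unescape_phrase leaves a sequence such as \n untouched, since only the closing quote and the backslash itself are escapable.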

Design goals of the syntax were:

  • Familiar to anyone who has used such syntax in other search engines.
  • Fault tolerant, without syntax errors in case of clumsy use.
  • Pasting things like error messages into the search field should not trigger any search syntax.
  • Quoting must be able to reliably encode any sequence of characters without getting in the way of more casual use.

The name is made up of the word for asking the most important kind of question and the most abundant chemical element in the universe, which also happens to be a very important component of answer-seeking beings :D.

Re-exports

pub use crate::keyword_converter::KeywordConverter;

Modules

keyword_converter
Keyword converters know which keywords and prefixes are part of your search syntax and how to best encode them for your backend.

Structs

Parser
Thingy that does the actual parsing.

Enums

Token
Parser output token (lexeme).