# Common
Alkale features a `common` module containing many extra methods for `TokenizerContext`. Each is documented with
its purpose and intended usage. This file provides a brief overview of each additional method provided by `common`.
The `common` module also has a few built-in Notification types at `common::error`, which are
covered as well.
## Methods
### Module: [`common`](./src/common/mod.rs)
- `skip_until`: Takes in a `char` and repeatedly skips characters until that character is found or all characters have
been consumed. The character in the source code that matched the argument will NOT be consumed. This method has no
specific intended purpose, but may be used for recovery.
- `capture_span`: Takes a predicate function and passes the `TokenizerContext` reference into it. The predicate function
will be executed and its result returned alongside a `Span` that covers the area of code consumed by the predicate. This
may be used if a token has an unknown length and/or can span multiple lines.
- `map_single_char_token`: Takes a predicate function that converts a `char` into an `Option<T>`, where `T` is the `TokenData`
used by the `TokenizerContext`. This method will peek a single character and pass it into the predicate. If the predicate returns
`Some` data, then the character is consumed, a `Token` with the relevant data is pushed to the context's token list, and
`true` is returned. If the predicate returns `None`, nothing happens and the function returns `false`. This method should be used for
single-character tokens such as `+`, `-`, `*`, `/`, etc.
- `read_into_while`: Takes in a mutable `String` reference and a predicate function that converts `&char`s to `bool`s.
The next character in the source code will be repeatedly passed into the predicate and appended to the `String`. Once a
character causes the predicate to produce `false`, the function exits and that character is not consumed. This may be used
if a run of similar characters needs to be collected into a single `String`, such as for identifiers, keywords, or the like.
- `fold`: Takes an initial accumulator and a predicate function supplied with a `char` and a mutable reference to the accumulator.
This method will repeatedly pass in the next `char` to the predicate until it returns a `Break`. The character that caused the
predicate to return `Break` is NOT consumed. This method returns the final accumulator value. This method is essentially a more
general form of `read_into_while` and as such may be used for a lot of different applications.
- `recover_with`: Takes in a predicate function that converts `char`s into `bool`s. The context will pass
the next character into the predicate and repeatedly consume characters until it returns `false` or the source runs
out of characters. The character that caused the predicate to return `false` will NOT be consumed. This should
usually be used to recover from error states, such as finding the next whitespace character after parsing a complex
token goes wrong.
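
The consumption rules above (in particular, which character is left unconsumed) can be sketched without Alkale itself. The `Cursor` type below is a hypothetical stand-in built on `Peekable<Chars>`, not `TokenizerContext`. Its method signatures, and the use of `std::ops::ControlFlow` for `fold`, are assumptions made for illustration and may differ from the real API.

```rust
use std::iter::Peekable;
use std::ops::ControlFlow;
use std::str::Chars;

/// Hypothetical stand-in for a tokenizer context: a peekable character stream.
struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Cursor<'a> {
    fn new(source: &'a str) -> Self {
        Cursor { chars: source.chars().peekable() }
    }

    /// Skip characters until `target` is next in the stream (or the stream ends).
    /// `target` itself is left unconsumed.
    fn skip_until(&mut self, target: char) {
        while let Some(&c) = self.chars.peek() {
            if c == target {
                break;
            }
            self.chars.next();
        }
    }

    /// Append characters to `buffer` while `predicate` holds.
    /// The first character that fails the predicate is left unconsumed.
    fn read_into_while(&mut self, buffer: &mut String, predicate: impl Fn(&char) -> bool) {
        while let Some(c) = self.chars.next_if(|c| predicate(c)) {
            buffer.push(c);
        }
    }

    /// Feed characters to `step` until it returns `Break` (or the stream ends).
    /// The character that caused the `Break` is left unconsumed.
    fn fold<A>(
        &mut self,
        mut acc: A,
        mut step: impl FnMut(char, &mut A) -> ControlFlow<()>,
    ) -> A {
        while let Some(&c) = self.chars.peek() {
            if step(c, &mut acc).is_break() {
                break;
            }
            self.chars.next();
        }
        acc
    }
}
```

With this stand-in, `read_into_while` is the special case of `fold` where the accumulator is a `String` and the step function pushes each accepted character, which is the sense in which `fold` is the more general form.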
### Module: [`common::structure`](./src/common/structure.rs)
- `skip_whitespace`: This method just skips every character that returns true for `char::is_whitespace`. Once a non-whitespace
character is found, this method stops and doesn't consume it. Note that this method may skip newline characters,
and thus may mess up the functionality of `get_indent_level` (the next listed method).
- `get_indent_level`: Takes a `char` (the indent character) and a `usize` (the number of characters per indent level).
This method will consume all contiguous characters that match the indent character until a non-matching character
is found. Then, the method will return an "indent level" based on how many of those characters were found divided by
the input `usize`. This method is intended to be used for languages where line indentation is relevant to functionality,
such as Python. As such, this method should only be called after a newline is consumed, and should be called every time
that happens.
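
As a rough illustration of the "call it after every newline" rule, the sketch below computes an indent level per line over a plain `Peekable<Chars>`. Both `indent_level` and `indent_levels` are hypothetical helpers rather than Alkale functions, and the sketch assumes that a partial indent rounds down and that the characters-per-level argument is non-zero.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Consume the run of `indent_char` at the current position and convert its
/// length into an indent level. Assumes `chars_per_level` is non-zero; a
/// partial level rounds down (integer division).
fn indent_level(
    chars: &mut Peekable<Chars<'_>>,
    indent_char: char,
    chars_per_level: usize,
) -> usize {
    let mut count = 0;
    while chars.next_if_eq(&indent_char).is_some() {
        count += 1;
    }
    count / chars_per_level
}

/// Measure the indent once at the start of the input and once after each
/// consumed newline, using four spaces per level.
fn indent_levels(source: &str) -> Vec<usize> {
    let mut chars = source.chars().peekable();
    let mut levels = vec![indent_level(&mut chars, ' ', 4)];
    while let Some(c) = chars.next() {
        if c == '\n' {
            levels.push(indent_level(&mut chars, ' ', 4));
        }
    }
    levels
}
```

If whitespace skipping ran before the indent measurement, it would already have consumed the leading spaces (and possibly the newline), so the measured level would be zero.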
### Module: [`common::string`](./src/common/string.rs)
- `parse_string`: Takes a "character consumer" predicate, which takes a `&mut String` and the next `char` from the source
code, and returns a possible error. This method will consume 1 character and declare it the delimiter. It will then repeatedly
call the predicate (logging any errors) to consume characters and append them to the buffer `String` until the delimiter character
is found, consuming it too. This method will return a `Span` over the consumed range as well as either a complete `String` or a
list of errors. This method is a very general form of a string parser; specific implementations using it are listed below.
- `try_parse_simple_string`: If the next character is a `"` or `'`, parse a string according to the above method and forward the result.
This method supports all valid one-character escape codes that Rust does. If no delimiter was found initially, return `None`. This
method is preferred over `parse_string` as you will generally not need to provide your own character parser unless you want
to support different escape characters.
- `try_parse_strict_string`: Exactly the same as `try_parse_simple_string`, but only supports `"` as a delimiter. Use this
if you don't want `'` as a delimiter and/or want to reserve `'` for character tokens.
- `try_parse_character_token`: If the next character is a `'`, consume it and parse the next "character"
(escapes are valid with the same rules as `try_parse_simple_string`). Finally, parse a second `'`. This method will error
if the inner character is invalid or the apostrophes are incorrect, otherwise it returns the parsed `char`. A `Span` is returned
either way, and `None` is returned if `'` was not found initially. This method should be used for languages that want to support
a very standard single-character token.
- `parse_simple_character`: This is not a method on `TokenizerContext`, but rather a function stored in this module. This function
defines the baseline character-parsing behavior and may serve as a model if you wish to create your own for different behavior. This function
is what's passed into `parse_string` by `try_parse_simple_string` and `try_parse_strict_string`.
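
The delimiter-then-characters flow described above can be sketched as follows. This is a simplified model, not the library's code: `parse_delimited` and `unescape` are hypothetical names, errors are plain `String` messages, and no `Span` is tracked. The escape set is Rust's one-character escapes (`\n`, `\r`, `\t`, `\0`, `\\`, `\'`, `\"`).

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Resolve the character following a backslash, or `None` if it is not a
/// recognized one-character escape.
fn unescape(c: char) -> Option<char> {
    match c {
        'n' => Some('\n'),
        'r' => Some('\r'),
        't' => Some('\t'),
        '0' => Some('\0'),
        '\\' | '\'' | '"' => Some(c),
        _ => None,
    }
}

/// Returns `None` if the next character is not a quote. Otherwise consumes
/// through the matching closing quote and returns the contents or every
/// error encountered along the way.
fn parse_delimited(chars: &mut Peekable<Chars<'_>>) -> Option<Result<String, Vec<String>>> {
    let delimiter = chars.next_if(|c| *c == '"' || *c == '\'')?;
    let mut buffer = String::new();
    let mut errors = Vec::new();
    loop {
        match chars.next() {
            // The closing delimiter is consumed and ends the string
            Some(c) if c == delimiter => break,
            Some('\\') => match chars.next() {
                Some(escaped) => match unescape(escaped) {
                    Some(resolved) => buffer.push(resolved),
                    // Record the error but keep going so later errors are found too
                    None => errors.push(format!("unknown escape `\\{escaped}`")),
                },
                None => {
                    errors.push("unterminated string".to_string());
                    break;
                }
            },
            Some(c) => buffer.push(c),
            None => {
                errors.push("unterminated string".to_string());
                break;
            }
        }
    }
    Some(if errors.is_empty() { Ok(buffer) } else { Err(errors) })
}
```

Restricting the first line of `parse_delimited` to `"` alone would give the "strict" behavior, leaving `'` free for character tokens.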
### Module: [`common::identifier`](./src/common/identifier.rs)
- `try_parse_identifier`: Takes two predicate functions, a "first" and a "rest." Both convert `&char`s into `bool`s. If the
"first" predicate matches the next character, consume it and add it to a `String` buffer. The method will then consume as many
characters as possible that match the "rest" predicate, adding them to the same buffer. Once out of matching characters, the `String`
is returned along with a `Span`. Returns `None` if the initial predicate failed. This method is used to parse identifiers, and it
effectively acts as `read_into_while` with a second predicate for the first character (as identifiers tend to have a different set of allowed
characters for the initial character).
- `try_parse_standard_identifier`: A default implementation of `try_parse_identifier`. This method will parse out
a `String` in the form `[a-zA-Z_][a-zA-Z0-9_]*` and return it with a `Span`. If the next character in the stream doesn't
match `[a-zA-Z_]`, the method returns `None`. This method should be used a majority of the time for parsing identifiers, as almost
all languages will get by fine with this set of characters. If a different set is needed (such as allowing `$`), use `try_parse_identifier`.
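
The "first"/"rest" split can be illustrated with a small standalone sketch. The functions below are hypothetical stand-ins operating on a `Peekable<Chars>` and returning only the `String` (no `Span`); they are meant to show the shape of the technique, not Alkale's actual signatures.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Parse one character accepted by `first`, then as many as possible accepted
/// by `rest`. Returns `None` (consuming nothing) if `first` rejects the next
/// character.
fn parse_identifier(
    chars: &mut Peekable<Chars<'_>>,
    first: impl Fn(&char) -> bool,
    rest: impl Fn(&char) -> bool,
) -> Option<String> {
    let mut buffer = String::new();
    buffer.push(chars.next_if(|c| first(c))?);
    while let Some(c) = chars.next_if(|c| rest(c)) {
        buffer.push(c);
    }
    Some(buffer)
}

/// The `[a-zA-Z_][a-zA-Z0-9_]*` form described above.
fn parse_standard_identifier(chars: &mut Peekable<Chars<'_>>) -> Option<String> {
    parse_identifier(
        chars,
        |c| c.is_ascii_alphabetic() || *c == '_',
        |c| c.is_ascii_alphanumeric() || *c == '_',
    )
}
```

Allowing `$` would then only require passing different predicates to the general function, for example `|c| c.is_ascii_alphabetic() || *c == '_' || *c == '$'` as the first predicate.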
### Module: [`common::numeric`](./src/common/numeric/mod.rs)
**NOTE:** None of these numerical methods are intended to parse negative numbers. In Alkale, numbers should be tokenized as unsigned/positive
numbers, and then later corrected during parsing. This is to prevent issues regarding minus signs incorrectly being tokenized into a
number when they were intended to be a subtraction operation.
The magnitude of any signed integer fits in the unsigned integer type of the same width, and floats can losslessly switch signs,
so this should have no impact on precision or what numbers are representable.
- `consume_standard_number`: If the next character matches `[0-9.]` (a digit or `.`), consume as many characters matching `[A-Za-z0-9_.]` as possible.
Every consumed character is collected into a `String` and returned along with a `Span`. If no characters were consumed, `None` is returned.
Additionally, `+` and `-` may be matched, but only if preceded by an `e` or `E`. This method should be used when manual number parsing is
necessary, and may reduce code complexity when collecting the initial number-like bit of text for processing.
- `parse_standard_base_strict`: If the next character is `0`, consume it and the character following it. If it is a `b`, `x`, or `o`,
return the `StandardBase` represented by the character. This method will error if the character after the `0` is missing or is anything else.
This method will return `None` if no `0` was initially found. This method is used by other methods in this library, but may come in handy on its own.
- `parse_standard_base`: Exactly the same as `parse_standard_base_strict`, but will return a `StandardBase` representing
base-10 instead of `None`.
- `try_parse_integer_from_base`: Takes in a `NumericalBase` object (such as `StandardBase`) as well as a generic integer type (such as `u64`,
make sure to select a type that can fit the largest number you plan to allow). This method will parse an integer that matches the digits defined
by the base, and stop parsing when a non-base, non-underscore digit is found. If no digit was found initially, this method will return `None`;
otherwise it will return an error/result integer, alongside a `Span`. This method will always consume an entire number, even if an error occurs.
This method should generally be used after `parse_standard_base` or the strict variant, but shouldn't be super necessary unless custom bases are used.
- `try_parse_integer`: Works exactly like `try_parse_integer_from_base` except it always uses base-10. This method should be marginally
more performant than its base-neutral counterpart. Very, very simple languages can use this for number parsing but it's recommended to
allow for different bases to be used if possible.
- `try_parse_float`: Consumes a number according to `consume_standard_number`, removes all underscores, and then parses it using
Rust's default float parser. Returns the value as an `f64` or an error; either way a `Span` is returned as well. Returns `None` if
no number was found initially. This method should be used by languages that want to *only* support floats, for example a basic
calculator might be simplified if all numbers are floats.
- `try_parse_number`: This method combines `try_parse_float` and `try_parse_integer_from_base` into one. If `0X` is found,
where `X` is `b`, `o`, or `x`, then they are consumed and an integer is parsed using that base. Otherwise, `consume_standard_number` is
called and the result is parsed as either a float or a base-10 integer depending on the presence of `.`, `+`, `-`, or an `f` suffix. This method
should be the go-to number parser in most cases, unless a specific feature is needed that is missing as-is (for example, parsing a specific
suffix like Rust does).
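
To make the prefix-then-digits flow concrete, here is a minimal standalone sketch. All three functions are hypothetical and simplified: the base is represented as a plain `u32` radix rather than a `StandardBase`, errors are `String` messages, no `Span` is produced, and the prefix check only consumes input when a full `0b`/`0o`/`0x` prefix is present (unlike the strict method described above). The radix is assumed to be between 2 and 36.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Consume a `0b`, `0o`, or `0x` prefix if present and return its radix.
/// Returns 10 and consumes nothing otherwise.
fn parse_base_prefix(chars: &mut Peekable<Chars<'_>>) -> u32 {
    // Look ahead on a clone so nothing is consumed unless the whole prefix matches
    let mut ahead = chars.clone();
    let radix = match (ahead.next(), ahead.next()) {
        (Some('0'), Some('b')) => 2,
        (Some('0'), Some('o')) => 8,
        (Some('0'), Some('x')) => 16,
        _ => return 10,
    };
    *chars = ahead;
    radix
}

/// Parse digits of `radix`, ignoring underscores. Returns `None` if the next
/// character is not a digit. On overflow the whole number is still consumed
/// and an error is returned.
fn parse_integer(chars: &mut Peekable<Chars<'_>>, radix: u32) -> Option<Result<u64, String>> {
    chars.peek().filter(|c| c.is_digit(radix))?;
    let mut value: Option<u64> = Some(0);
    while let Some(c) = chars.next_if(|c| c.is_digit(radix) || *c == '_') {
        if let Some(digit) = c.to_digit(radix) {
            // `None` marks an overflow; keep consuming so the stream ends up past the number
            value = value
                .and_then(|v| v.checked_mul(u64::from(radix)))
                .and_then(|v| v.checked_add(u64::from(digit)));
        }
    }
    Some(value.ok_or_else(|| "integer overflow".to_string()))
}

/// Strip underscores from already-collected number text, then defer to
/// Rust's standard `f64` parser.
fn parse_float_text(text: &str) -> Result<f64, std::num::ParseFloatError> {
    text.replace('_', "").parse::<f64>()
}
```

None of these accept a leading minus sign, in line with the note at the top of this section: a `-` would be emitted as its own token and applied by the parser.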
#### Module: [`common::numeric::base`](./src/common/numeric/base.rs)
This module doesn't contain any methods, but holds `NumericalBase`, a trait to define a custom base for use with the above methods.
It also contains `StandardBase`, a default implementation of `NumericalBase` for Binary, Decimal, Octal, and Hexadecimal.
## Notifications
[`common::error`](./src/common/error.rs) contains a few built-in Notification types.
| Type | Data | Purpose |
|---|---|---|
| SimpleErrorNotification | `String` | Prototype errors without Span information. |
| ErrorNotification | `String`, `Span` | Prototype errors with Span information. |
| SimpleWarningNotification | `String` | Prototype warnings without Span information. |
| WarningNotification | `String`, `Span` | Prototype warnings with Span information. |
These types should really only be used for debug cases or for early lexer prototypes. Once your
language or parser has progressed into a usable state, these should be swapped out for custom
`Notification` implementations. (You can, of course, use custom types from the get-go as well.)