The core of a parser for a unique textual notation that can be used as both a data format and a markup language, and that offers powerful extensibility of both syntax and semantics. It is inspired by the little-known Curl programming language. It is heavily parameterized to allow maximal reuse across different applications. Depending on how you concretize it, it is capable of zero-copy operation, including for its generic designs of chunked text representations and for omitting escape characters.
§Overview
The notation is similar to Lisp S-expressions in that there are nested forms delimited by brackets and in that the first sub-form in a nest (the “head”) can be interpreted as an operator (which can also be thought of as a constructor). Unlike S-expressions, but like Curl, the parsing and meaning of nested text and nested forms can be extended by two types of macros (somewhat like “reader macros” of Lisp). Also unlike S-expressions, all text outside nested forms is preserved exactly, as is the text inside some nested forms, and so the notation is also a markup language. Head forms can be bound to macros, which is what causes them to be interpreted as operators, but they can also be unbound, which leaves a nested form uninterpreted.
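To make the nesting and text-preservation ideas concrete, here is a minimal toy sketch. It does not use this crate's API. The `Node` type, the `parse` function, and the choice of `{` and `}` as brackets are all assumptions made for illustration, and the sketch does not model head forms, escapes, or macros.

```rust
/// Toy AST (hypothetical, not the crate's `Datum`): plain text or a nested form.
#[derive(Debug, PartialEq)]
enum Node {
    Text(String),
    Nest(Vec<Node>),
}

/// Parse `input`, treating `{` and `}` as brackets (an assumption for this sketch).
/// Every other character, including whitespace, is kept exactly as written.
fn parse(input: &str) -> Result<Vec<Node>, &'static str> {
    // Stack of partially built sequences; the bottom entry is the top level.
    let mut stack: Vec<Vec<Node>> = vec![Vec::new()];
    let mut text = String::new();
    for c in input.chars() {
        match c {
            '{' | '}' => {
                // A bracket ends the current run of text, if there is one.
                if !text.is_empty() {
                    let run = std::mem::take(&mut text);
                    stack.last_mut().unwrap().push(Node::Text(run));
                }
                if c == '{' {
                    stack.push(Vec::new());
                } else {
                    let done = stack.pop().unwrap();
                    match stack.last_mut() {
                        Some(parent) => parent.push(Node::Nest(done)),
                        None => return Err("close bracket without matching open"),
                    }
                }
            }
            _ => text.push(c),
        }
    }
    if !text.is_empty() {
        stack.last_mut().unwrap().push(Node::Text(text));
    }
    if stack.len() != 1 {
        return Err("open bracket without matching close");
    }
    Ok(stack.pop().unwrap())
}

fn main() {
    println!("{:?}", parse("a {b {c}} d"));
}
```

Note that the text outside the nested forms (`"a "` and `" d"`) survives unchanged, which is the property that lets such a notation double as a markup language.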
The macros are implemented as functions, termed “combiners”. One of the types of combiner, termed “operative”, takes nested text unparsed and can parse it however it wants. The other type of combiner, termed “applicative”, takes a list of forms produced by recursively parsing nested text. For both combiner types, whatever is returned is substituted for the nested form in the abstract syntax tree (AST) returned by the parser. (The terms “combiner”, “operative”, and “applicative” come from the Kernel programming language and its F-expressions, which are somewhat analogous.)
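The difference between the two combiner types can be sketched as follows. This is a toy model, not the crate's types: `Form`, `Combiner`, `combine`, and the helper functions are hypothetical names, and splitting on whitespace merely stands in for the recursive parsing that would produce an applicative's operands.

```rust
/// Toy result type (hypothetical): what a combiner substitutes for a nested form.
#[derive(Debug, PartialEq)]
enum Form {
    Text(String),
    Int(i64),
}

/// Toy combiner: an operative sees raw text, an applicative sees parsed forms.
enum Combiner {
    Operative(fn(&str) -> Form),
    Applicative(fn(Vec<Form>) -> Form),
}

/// Stand-in for recursive parsing (assumption): one `Form::Text` per word.
fn parse_operands(raw: &str) -> Vec<Form> {
    raw.split_whitespace()
        .map(|word| Form::Text(word.to_string()))
        .collect()
}

/// Apply a combiner to the unparsed operand text of a nested form.
fn combine(combiner: &Combiner, raw_operands: &str) -> Form {
    match combiner {
        // The operative gets the text untouched and may parse it however it wants.
        Combiner::Operative(f) => f(raw_operands),
        // The applicative gets the operands only after they have been parsed.
        Combiner::Applicative(f) => f(parse_operands(raw_operands)),
    }
}

/// Example operative: keep the operand text verbatim.
fn verbatim(raw: &str) -> Form {
    Form::Text(raw.to_string())
}

/// Example applicative: replace the form with its number of operands.
fn count(forms: Vec<Form>) -> Form {
    Form::Int(forms.len() as i64)
}

fn main() {
    println!("{:?}", combine(&Combiner::Operative(verbatim), "a  b c"));
    println!("{:?}", combine(&Combiner::Applicative(count), "a  b c"));
}
```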
The parser is intended to be extended, by binding combiners, for each application, but it can be used without extension, i.e. without any macros, as a simplistic kind of S-expression language where the basic AST is used as your data structure.
This core crate is `no_std` and so can be used in constrained environments without heap allocation. The crate is generically parameterized over what allocates the “datums” used as nodes in the constructed ASTs. Allocation can be done from fixed-size, pre-established, stack arrays. Or, allocation can be done from a heap, e.g. using the standard `Box` type, or from whatever kind of allocator you can arrange.
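As a rough illustration of being generic over the allocation strategy, the sketch below defines a hypothetical `Alloc` trait with one implementation backed by a fixed-size stack array and one backed by `Box`. None of these names come from this crate. The example itself uses `std` for `Box` and printing, although the array-backed part needs only `core`.

```rust
/// Hypothetical allocator abstraction: hands out a reference-like value,
/// or gives the value back if no space is left.
trait Alloc<T> {
    type Ref: core::ops::DerefMut<Target = T>;
    fn alloc(&mut self, value: T) -> Result<Self::Ref, T>;
}

/// Allocates from a pre-established slice, e.g. one borrowed from a stack array.
struct SliceAlloc<'a, T> {
    free: &'a mut [T],
}

impl<'a, T> Alloc<T> for SliceAlloc<'a, T> {
    type Ref = &'a mut T;

    fn alloc(&mut self, value: T) -> Result<&'a mut T, T> {
        // Take the remaining free slice, leaving an empty one in its place.
        let free = core::mem::take(&mut self.free);
        match free.split_first_mut() {
            Some((slot, rest)) => {
                *slot = value;
                self.free = rest;
                Ok(slot)
            }
            None => Err(value), // Out of space: no heap to fall back on.
        }
    }
}

/// Allocates from the heap using `Box`.
struct BoxAlloc;

impl<T> Alloc<T> for BoxAlloc {
    type Ref = Box<T>;

    fn alloc(&mut self, value: T) -> Result<Box<T>, T> {
        Ok(Box::new(value))
    }
}

/// Generic user of any allocator: counts how many values could be stored.
/// (The returned references are dropped right away; this is only a demo.)
fn store_all<A: Alloc<u32>>(allocator: &mut A, values: &[u32]) -> usize {
    values.iter().filter(|&&v| allocator.alloc(v).is_ok()).count()
}

fn main() {
    let mut slots = [0u32; 2];
    let mut fixed = SliceAlloc { free: &mut slots[..] };
    println!("{}", store_all(&mut fixed, &[1, 2, 3]));
    println!("{}", store_all(&mut BoxAlloc, &[1, 2, 3]));
}
```

The same generic function works with either strategy, and the array-backed one reports exhaustion through its `Result` instead of relying on a heap.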
This core crate’s purpose mostly is to define the generic types, traits, and logic that other crates depend on to create their own concrete implementations to use for their actual parsing. But some basic premade implementations, that fit with `no_std` use, are provided by this core crate in sub-modules named `premade`, and these might be sufficient by themselves for some limited applications.
§Unicode
Parsing is done based on Rust’s `char` type (which is a Unicode scalar value). The configurable delimiters are single `char`s, and so they cannot be general grapheme clusters (because those can be sequences of multiple `char`s). It seems very unlikely that anyone would seriously want to use grapheme clusters as the delimiters because the few delimiters only have bracket and escape semantics. For a parsed input text, all non-delimiter `char`s are preserved exactly (except whitespace around head forms), and so grapheme clusters are always preserved where it makes sense for our format.
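The distinction between a `char` and a grapheme cluster can be seen with the standard library alone. In the sketch below, the helper names and the choice of `⟨` and `⟩` as delimiters are assumptions for illustration. The point is that a delimiter is one `char`, while a multi-`char` cluster that is not a delimiter passes through intact.

```rust
/// Number of Unicode scalar values (`char`s) in a string.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// Drop the two single-`char` delimiters and keep every other `char` unchanged.
/// (Hypothetical helper; it only illustrates `char`-level processing.)
fn without_delims(s: &str, open: char, close: char) -> String {
    s.chars().filter(|&c| c != open && c != close).collect()
}

fn main() {
    // One user-perceived character made of two `char`s:
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    let cluster = "e\u{301}";
    println!("{}", scalar_count(cluster));

    // The combining sequence is not a delimiter, so it is preserved as is.
    let inner = without_delims("⟨e\u{301}⟩", '⟨', '⟩');
    println!("{}", inner == cluster);
}
```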
Modules§
- Parts for “combiners”. Combiners are custom user-defined macros for our notation/format/language.
- Datum type used in the abstract syntax tree (AST) returned by parsing.
- Traits and types that provide the different aspects of `Parser`s’ functionality.
- Implementations provided for ready use.
- Traits that are our abstraction of “text”.
Structs§
- Represents: the ability to parse a string; the characters used to delimit the nesting form; the method of allocating the `Datum`s; and the environment of bindings of macros.
- Item produced by a `SourceStream` iterator that represents its next character, possibly with positional information.
Enums§
- A macro function, bound to an operator sub-form, which is called with the operands sub-form(s) to determine what should be substituted for the whole form. The `OperativeRef` and `ApplicativeRef` type parameters determine the types used to refer to the functions.
- The abstract syntax tree (AST) type returned by parsing. It is extensible by the `ExtraType` parameter, and it is parameterized over the `DatumRef` type used to refer to the other `Datum`s in an AST. It can also be used for DAGs.
- The possible errors that might be returned by parsing.
Traits§
- Exists to be used similarly to but differently than `DerefMut` so that types like `Rc` and its `get_mut` method can be used to hold `Datum`s. `DerefMut` must never fail, so it can’t be used. We want mutability of `Datum`s so that we can construct lists of them using only the space of the values allocated by a `Parser`’s `DatumAllocator`, since this crate is intended to be usable in `no_std` environments which don’t provide heap allocation.
- Positional information of a character or text chunk relative to the original source it is from.
- A stream of characters that might know its characters’ positions in the source it is from.
- A logical sequence of characters, possibly represented as separate chunks, that can be iterated multiple times without consuming or destroying the source, and that might know its characters’ positions in the source it is from.
- The basic interface common across both `Text`s and `TextChunk`s. This determines the associated type of the characters’ positional information; and this provides the ability to construct and check for emptiness.
- A sequence of characters that serves as a single chunk in the underlying representation of some `Text` type.
- A `Text` that can logically concatenate its values, optionally by using a provided `DatumAllocator`.
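The first trait above exists because mutable access through some pointer types can fail. The sketch below shows the idea with a hypothetical `TryMut` trait (not the crate's actual trait or signature) implemented for `Box`, where access always succeeds, and for `Rc`, where the standard `Rc::get_mut` returns `None` while the value is shared.

```rust
use std::rc::Rc;

/// Hypothetical fallible counterpart to `DerefMut`.
trait TryMut {
    type Target;
    fn try_mut(this: &mut Self) -> Option<&mut Self::Target>;
}

impl<T> TryMut for Box<T> {
    type Target = T;
    fn try_mut(this: &mut Self) -> Option<&mut T> {
        // A `Box` owns its value uniquely, so this never fails.
        Some(&mut **this)
    }
}

impl<T> TryMut for Rc<T> {
    type Target = T;
    fn try_mut(this: &mut Self) -> Option<&mut T> {
        // Succeeds only when no other `Rc` or `Weak` points to the value.
        Rc::get_mut(this)
    }
}

/// Overwrite the referenced value if mutable access is available.
fn set<R: TryMut<Target = i32>>(r: &mut R, value: i32) -> bool {
    match R::try_mut(r) {
        Some(slot) => {
            *slot = value;
            true
        }
        None => false,
    }
}

fn main() {
    let mut unique = Rc::new(1);
    println!("{}", set(&mut unique, 2)); // Sole owner: mutation is allowed.

    let shared = Rc::clone(&unique);
    println!("{}", set(&mut unique, 3)); // Shared: mutation is refused.
    println!("{}", *shared);
}
```

`DerefMut` cannot express the refused case because its `deref_mut` method must always return a reference, which is why a separate fallible trait is useful for types like `Rc`.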
Type Aliases§
- The type of values given by the parser iterator