unsynn (from German ‘Unsinn’ for ‘nonsense’) is a minimalist Rust parser library. It stays minimal by leaving out the actual grammar implementations and doing only minimal error handling. In exchange it offers simple, composable parsers and ergonomic parser construction. Grammars will be implemented in their own crates (see unsynn-rust).
Its primary intended use is in proc macros for Rust that define their own grammar or need only a sparse Rust parser. It can also be used to build parsers for grammars outside a rust/proc-macro context: unsynn can parse any `&str` data (the tokenizer step relies on proc_macro2).
§Examples
§Creating and Parsing Custom Types
The `unsynn!{}` macro generates the `Parser` and `ToTokens` impls (and more). This is optional; the impls can be written by hand when necessary. Note that unsynn implements `Parser` and `ToTokens` for many standard Rust types, such as the `u32` used in this example.
```rust
let mut token_iter = "foo ( 1, 2, 3 )".to_token_iter();

unsynn!{
    struct IdentThenParenthesisedNumbers {
        ident: Ident,
        numbers: ParenthesisGroupContaining::<CommaDelimitedVec<u32>>,
    }
}

// iter.parse() is from the IParse trait
let ast: IdentThenParenthesisedNumbers = token_iter.parse().unwrap();

assert_eq!(
    ast.tokens_to_string(),
    "foo(1,2,3)".tokens_to_string()
);
```
§Using Composition
Composition can be used without defining new datatypes. This is useful for simple parsers or when one wants to parse things on the fly that are deconstructed immediately (a deconstruction sketch follows the example below).
```rust
// We parse this below
let mut token_iter = "foo ( 1, 2, 3 )".to_token_iter();

// Type::parse() is from the Parse trait
let ast =
    Cons::<Ident, ParenthesisGroupContaining::<CommaDelimitedVec<u32>>>
        ::parse(&mut token_iter).unwrap();

assert_eq!(
    ast.tokens_to_string(),
    "foo ( 1, 2, 3 )".tokens_to_string()
);
```
§Custom Operators and Keywords
To define keywords and operators we provide the `keyword!` and `operator!` macros:
```rust
keyword! {
    Calc = "CALC";
}

operator! {
    Add = "+";
    Substract = "-";
    Multiply = "*";
    Divide = "/";
}

// The above can also be written within an unsynn!{} block;
// see the next example about parsing recursive grammars.

// This looks like BNF, but type aliases can't express recursive types.
type Expression = Cons<Calc, AdditiveExpr, Semicolon>;
type AdditiveOp = Either<Add, Substract>;
type AdditiveExpr = Either<Cons<MultiplicativeExpr, AdditiveOp, MultiplicativeExpr>, MultiplicativeExpr>;
type MultiplicativeOp = Either<Multiply, Divide>;
type MultiplicativeExpr = Either<Cons<LiteralInteger, MultiplicativeOp, LiteralInteger>, LiteralInteger>;

let ast = "CALC 2*3+4/5 ;".to_token_iter()
    .parse::<Expression>().expect("syntax error");
```
§Parsing Recursive Grammars
Recursive grammars can be parsed by defining structs and resolving the recursive parts in a `Box` or `Rc`. This looks less BNF-like but acts closer to it:
```rust
unsynn! {
    keyword Calc = "CALC";
    operator Add = "+";
    operator Substract = "-";
    operator Multiply = "*";
    operator Divide = "/";

    struct Expression(Calc, AdditiveExpr, Semicolon);
    // We keep the nested Either and Cons here instead of defining
    // new enums and structs, because that would be more noisy.
    struct AdditiveOp(Either<Add, Substract>);
    // With a Rc (or Box) here we can resolve the recursive nature of the grammar.
    struct AdditiveExpr(Either<Cons<MultiplicativeExpr, AdditiveOp, Either<Rc<AdditiveExpr>, MultiplicativeExpr>>, MultiplicativeExpr>);
    struct MultiplicativeOp(Either<Multiply, Divide>);
    struct MultiplicativeExpr(Either<Cons<LiteralInteger, MultiplicativeOp, Rc<MultiplicativeExpr>>, LiteralInteger>);
}

// Now we can parse more complex expressions.
// Adding parentheses is left as an exercise to the reader.
let ast = "CALC 10+1-2*3+4/5*100 ;".to_token_iter()
    .parse::<Expression>().expect("syntax error");
```
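Since unsynn does only minimal error handling, a failed parse simply yields an `Err` that can be inspected or reported. A small sketch using the definitions above:

```rust
// The trailing Semicolon required by Expression is missing here.
let result = "CALC 1+2".to_token_iter().parse::<Expression>();
assert!(result.is_err());
```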
§Feature Flags
- `hash_keywords` - Enables hash tables for larger keyword groups. This is enabled by default since it guarantees fast lookups in all use-cases and the extra dependency it introduces is very small. Nevertheless, this feature can be disabled when keyword grouping is not or only rarely used, removing the dependency on `fxhash`. Keyword lookups then fall back to a binary-search implementation. Note that the implementation already optimizes the cases where only one or a few keywords are in a group.
- `docgen` - The `unsynn!{}`, `keyword!{}` and `operator!{}` macros will automatically generate some additional docs. This is enabled by default.
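For example, disabling the default features in Cargo.toml drops the `fxhash` dependency while keeping the generated docs (the version number is illustrative; adjust as needed):

```toml
[dependencies]
unsynn = { version = "0.1", default-features = false, features = ["docgen"] }
```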
§Detailed Introduction / Cookbook
For a more detailed introduction about how to use unsynn see the Cookbook section in the `Parse` trait.
§Roadmap
With v0.1.0 we follow Rust's practice of semantic versioning; there will be breaking changes on 0.x releases, but we try to keep these to a minimum. The planned ‘unsynn-rust’ and, along with it, an ‘unsynn-derive’ crate will be implemented. When these are ready and no major deficiencies in ‘unsynn’ are found, it is time for a 1.0.0 release.
§Planned/Ideas
- Can we add prettyprinting for `tokens_to_string`? This needs a thread-local storing some context (indent level, indent-by string of spaces, prettyprint flag).
- Make proc_macro2 optional with a feature flag. This would disable parsing `&str` and related APIs and most of the test suite, but should be sufficient for writing lean proc_macro parsers.
- `Enclosed<Begin, Content, End>`, like `Cons<Begin, Cons<Except<End>, Content>, End>`.
- Improve error handling:
  - Document how errors are reported and what the user can do to handle them.
  - Users can/should write forgiving grammars that are error tolerant.
  - Add tests for error/span handling.
  - v0.2.0 will improve the Span handling considerably, probably behind an extra feature flag. We aim for ergonomic/automagical correct spans; the user shouldn't be burdened with making things correct. Details for this need to be laid out. Maybe a `SpanOf<T: Parse>`.
  - Can we have some `Explain<T>` that explains what was expected and why it failed, to simplify complex errors?
- Brainfart: dynamic parser construction. Instead of `parse::<UnsynnType>()`, create a parse function dynamically from a str parsed by unsynn itself: `"Either<This, That>".to_parser().parse()`. This will need a `trait DynUnsynn` implementing the common/dynamic parts of these, and a registry where all entities supporting dynamic construction are registered. This will likely be factored out into a unsynn-dyn crate. Add some scanf-like DSL to generate these parsers; xmacro may use it like `$(foo@Ident: values)`.
- Brainfart: memoization in TokenIter: `Rc<enum { Countdown(Cell<usize>), Memo(HashMap<(counter, typeid, Option<NonZeroU64 hash_over_extra_parameters>), (Result, TokenIter_after)>) }>` with a countdown counter which activates memoization only after a certain number of tokens were parsed; parsing small things does not need the overhead of memoizing. Can we somehow (auto) trait which types become memoized? Small things don't need to be memoized. Note to my future self: Result needs to be `dyn Memo` where `trait Memo: Clone` and clone is cheap: `enum MaybeRc<T> { Direkt(T), Shared(Rc<T>) }`.
- Add Rust types:
  - `f32`: 32-bit floating point number
  - `f64`: 64-bit floating point number (default)
§Design Priorities
Unsynn's foremost goal is to make parsers easy and ergonomic to use. We deliberately provide some duplicated functionality and type aliases to prioritize expressiveness. Fast compile times with as few dependencies as necessary come second. We do not explicitly focus on Rust syntax; this will be addressed by other crates.
§Development
unsynn is meant to evolve opportunistically. When you spot a problem or need a new feature, feel free to open an issue or (preferred!) send a PR. Commits and other git operations are augmented and validated with cehgit. Contributors are recommended to enable cehgit too by calling `./.cehgit install-hook --all` within a checked-out unsynn repository.
§Contribution/Coding Guidelines
Chances to get contributions merged increase when you:
- Include documentation following the existing documentation practice. Write examples/doctests.
- Pass `./.cehgit run` without errors or warnings.
- Pass test coverage with `cargo mutants`.
- Implement reasonably complete things. Not everything needs to be included in a first version, but it must be usable.
§Git Branches
- `main` - Will be updated on new releases. When you plan to make a small contribution that should be merged soon, you can work on top of `main`.
- `release-*` - Stable releases may get their own branch for fixes and backported features.
- `devel` - Development branch which will eventually be merged into `main`. Non-trivial contributions that may take some time to develop should use `devel` as a starting point. But be prepared to rebase frequently on top of the ongoing `devel`.
- `feature-*` - More complex features and experiments are developed in feature branches. Any non-trivial contribution should be done in a `feature-*` branch as well. Once complete, they are merged into `devel`. Some of these experiments may stall or be abandoned; do not base your contribution on an existing feature branch.
Modules§
- combinator - A unique feature of unsynn is that one can define a parser as a composition of other parsers on the fly without the need to define custom structures. This is done by using the `Cons` and `Either` types. The `Cons` type is used to define a parser that is a conjunction of two to four other parsers, while the `Either` type is used to define a parser that is a disjunction of two to four other parsers.
- container - This module provides parsers for types that contain possibly multiple values. This includes stdlib types like `Option`, `Vec`, `Box`, `Rc`, `RefCell`, and types for delimited and repeated values with numbered repeats.
- delimited - For easier composition we define the `Delimited` type here, which is a `T` followed by an optional delimiting entity `D`. This is used by the `DelimitedVec` type to parse a list of entities separated by a delimiter (see the sketch after this list).
- fundamental - This module contains the fundamental parsers. These parsers are the basic tokens from `proc_macro2` and a few other ones defined by unsynn. These are the terminal entities when parsing tokens. Being able to parse `TokenTree` and `TokenStream` allows one to parse opaque entities where internal details are left out. The `Cached` type is used to cache the string representation of the parsed entity. The `Nothing` type is used to match without consuming any tokens. The `Except` type is used to match when the next token does not match the given type. The `EndOfStream` type is used to match the end of the stream when no tokens are left. The `HiddenState` type is used to hold additional information that is not part of the parsed syntax.
- group - Groups are a way to group tokens together. They are used to represent the contents between `()`, `{}`, `[]`, or no delimiters at all. This module provides parser implementations for opaque group types with defined delimiters and the `GroupContaining` types that parse the surrounding delimiters and content of a group type.
- literal - This module provides a set of literal types that can be used to parse and tokenize literals. The literals are parsed from the token stream and can be used to represent the parsed value. unsynn defines only simplified literals, such as integers, characters and strings. The literals here are not full Rust syntax, which will be defined in the unsynn-rust crate.
- names - Unsynn does not implement Rust grammar, but for common operators we make an exception because they are mostly universal and already partially lexed (`Spacing::Alone/Joint`), and it would add a lot of confusion if every user had to redefine common operator types. These operator names have their own module and are reexported at the crate root. This allows one to import only the named operators.
- operator - Combined punctuation tokens are represented by `Operator`. The `crate::operator!` macro can be used to define custom operators.
- punct - This module contains types for punctuation tokens. These are used to represent single and multi character punctuation tokens. For single character punctuation tokens, there are the `PunctAny`, `PunctAlone` and `PunctJoint` types.
- rust_types - Parsers for Rust's types.
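A small sketch of the container/delimited types in action, assuming the types shown in the examples above are in scope: parse a comma-separated list of identifiers on the fly.

```rust
let mut tokens = "foo, bar, baz".to_token_iter();
let list: CommaDelimitedVec<Ident> = tokens.parse().unwrap();
assert_eq!(
    list.tokens_to_string(),
    "foo, bar, baz".tokens_to_string()
);
```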
Macros§
- keyword - Define types matching keywords.
- operator - Define types matching operators (punctuation sequences).
- unsynn - This macro supports the definition of enums, tuple structs and normal structs and generates `Parser` and `ToTokens` implementations for them. It will derive `Debug`. Generics/lifetimes are not supported on the primary type. Note: eventually a derive macro for `Parser` and `ToTokens` will become supported by a ‘unsynn-derive’ crate to give finer control over the expansion. `#[derive(Copy, Clone)]` has to be added manually. Keyword and operator definitions can also be included; they delegate to the `keyword!` and `operator!` macros described below. All entities can be prefixed by `pub` to make them public. Type aliases are supported and are just passed through. This makes things easier to read when you define larger unsynn macro blocks.
Structs§
- BraceGroup - An opaque group of tokens within a Brace
- BraceGroupContaining - Parseable content within a Brace
- BracketGroup - An opaque group of tokens within a Bracket
- BracketGroupContaining - Parseable content within a Bracket
- Cached - Getting the underlying string is expensive as it always allocates a new `String`. This type caches the string representation of a given entity. Note that this is only reliable for fundamental entities that represent a single token. Spacing between composed tokens is not stable and should be considered informal only.
- Cons - Conjunctive `A` followed by `B` and optional `C` and `D`. When `C` and `D` are not used, they are set to `Nothing`.
- Delimited - This is used when one wants to parse a list of entities separated by delimiters. The delimiter is optional and can be `None`, e.g. when the entity is the last in the list. Usually the delimiter will be some simple punctuation token, but it is not limited to that.
- DelimitedVec - Since the delimiter in `Delimited<T,D>` is optional, a `Vec<Delimited<T,D>>` would parse consecutive values even without delimiters. `DelimitedVec<T,D>` will stop parsing after the first value without a delimiter.
- Discard - Succeeds when the next token matches `T`. The token will be removed from the stream but not stored. Consequently the `ToTokens` implementations will panic with a message that it can not be emitted. This can only be used when a token should be present but not stored and never emitted.
- EndOfStream - Matches the end of the stream when no tokens are left.
- Error - Error type for parsing.
- Except - Succeeds when the next token does not match `T`. Will not consume any tokens.
- Expect - Succeeds when the next token would match `T`. Will not consume any tokens. This is similar to peeking.
- Group - A delimited token stream.
- GroupContaining - Any kind of Group `G` with parseable content `C`. The content `C` must parse exhaustively; an `EndOfStream` is automatically implied.
- HiddenState - Sometimes one wants to compose types or create structures for unsynn that have members that are not part of the parsed syntax but add some additional information. This struct can be used to hold such members while still using the `Parser` and `ToTokens` trait implementations automatically generated by the `unsynn!{}` macro or composition syntax. `HiddenState` will not consume any tokens when parsing and will not emit any tokens when generating a `TokenStream`. On parsing it is initialized with a default value. It has `Deref` and `DerefMut` implemented to access the inner value.
- Ident - A word of Rust code, which may be a keyword or legal variable name.
- Invalid - A unit that always fails to match. This is useful as a default for generics. See how `Either<A, B, C, D>` uses this for unused alternatives.
- LazyVec - A `Vec<T>` that is filled up to the first appearance of a terminating `S`. This `S` may be a subset of `T`, thus parsing becomes lazy. This is the same as `Cons<Vec<Cons<Except<S>,T>>,S>` but more convenient and efficient (see the sketch after this list).
- Literal - A literal string (`"hello"`), byte string (`b"hello"`), character (`'a'`), byte character (`b'a'`), or an integer or floating point number with or without a suffix (`1`, `1u8`, `2.3`, `2.3f32`).
- LiteralCharacter - A single quoted character literal (`'x'`).
- LiteralInteger - A simple unsigned 128-bit integer. This is the most simple form to parse integers. Note that only decimal integers without any other characters, signs or suffixes are supported; this is not full Rust syntax.
- LiteralString - A double quoted string literal (`"hello"`). The quotes are included in the value. Note that this is a simplified string literal: only double quoted strings are supported, this is not full Rust syntax, e.g. byte and C string literals are not supported.
- NonEmptyTokenStream - Since parsing a `TokenStream` succeeds even when no tokens are left, this type is used to parse a `TokenStream` that is not empty.
- NoneGroup - An opaque group of tokens within a None delimiter
- NoneGroupContaining - Parseable content within a None delimiter
- Nothing - A unit that always matches without consuming any tokens. This is required when one wants to parse `Repeats` without a delimiter. Note that using `Nothing` as the primary entity in a `Vec`, `LazyVec`, `DelimitedVec` or `Repeats` will result in an infinite loop.
- Operator - Operators made from up to four ASCII punctuation characters. Unused characters default to `\0`. Custom operators can be defined with the `crate::operator!` macro. All but the last character are `Spacing::Joint`. Attention must be paid when operators have the same prefix; the shorter ones need to be tried first.
- ParenthesisGroup - An opaque group of tokens within a Parenthesis
- ParenthesisGroupContaining - Parseable content within a Parenthesis
- Punct - A `Punct` is a single punctuation character like `+`, `-` or `#`.
- PunctAlone - A single character punctuation token which is not followed by another punctuation character.
- PunctAny - A single character punctuation token with any kind of `Spacing`.
- PunctJoint - A single character punctuation token where the lexer joined it with the next `Punct`, or a single quote followed by an identifier (Rust lifetime).
- Repeats - Like `DelimitedVec<T,D>` but with a minimum and maximum (inclusive) number of elements. Parsing will succeed when at least the minimum number of elements is reached and stop at the maximum number. The delimiter `D` defaults to `Nothing` to parse sequences which don't have delimiters.
- Skip - Skips over expected tokens. Will parse and consume the tokens but not store them. Consequently the `ToTokens` implementations will not output any tokens.
- Span - A region of source code, along with macro expansion information.
- TokenStream - An abstract stream of tokens, or more concretely a sequence of token trees.
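A short hedged sketch of `LazyVec` from the list above: collect arbitrary tokens up to a terminating semicolon, then continue parsing after it.

```rust
let mut tokens = "1 + 2 ; rest".to_token_iter();
// Consumes everything up to and including the `;`.
let upto: LazyVec<TokenTree, Semicolon> = tokens.parse().unwrap();
// The iterator continues after the terminator.
let rest: Ident = tokens.parse().unwrap();
```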
Enums§
- Delimiter - Describes how a sequence of token trees is delimited.
- Either - Disjunctive `A` or `B` or optional `C` or `D`, tried in that order. When `C` and `D` are not used, they are set to `Invalid`.
- ErrorKind - Actual kind of an error.
- Spacing - Whether a `Punct` is followed immediately by another `Punct` or followed by another token or whitespace.
- TokenTree - A single token or a delimited sequence of token trees (e.g. `[1, (), ..]`).
Traits§
- GroupDelimiter - Access to the surrounding `Delimiter` of a `GroupContaining` and its variants.
- IParse - Extension trait for `TokenIter` that calls `Parse::parse()`.
- Parse - This trait provides the user-facing API to parse grammatical entities. It is implemented for anything that implements the `Parser` trait. The methods here encapsulate the iterator that is used for parsing into a transaction. This iterator is always `Copy`. Instead of using a peekable iterator or implementing deeper peeking, parse clones this iterator to make access transactional: when parsing succeeds the transaction becomes committed, otherwise it is rolled back (see the sketch after this list).
- Parser - The `Parser` trait that must be implemented by anything we want to parse. We are parsing over a `TokenIter` (a `proc_macro2::TokenStream` iterator).
- RangedRepeats - A trait for parsing a repeating `T` with a minimum and maximum limit. Sometimes the number of elements to be parsed is determined at runtime, e.g. a number of header items needs a matching number of values.
- ToTokens - unsynn defines its own `ToTokens` trait to be able to implement it for std container types. This is similar to the `ToTokens` from the quote crate but adds some extra methods and is implemented for more types. Moreover, the `to_token_iter()` method is the main entry point for creating an iterator that can be used for parsing.
- TokenCount - We track the position of an error by counting tokens. This trait is implemented for references to shadow counted `TokenIter`, and for `usize`. The latter allows one to pass in a position directly, or to use `usize::MAX` in case no position data is available (which will make this error the final one when upgrading).
- Transaction - Helper trait to make `TokenIter` transactional.
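To illustrate the transactional parsing described for `Parse`: a disjunction tries each alternative and rolls the iterator back when one fails. A sketch, assuming `Either`'s variants are named `First` through `Fourth` (check the `Either` docs for the exact names):

```rust
let mut tokens = "42".to_token_iter();
// Ident fails on "42" and is rolled back, then LiteralInteger matches.
match tokens.parse::<Either<Ident, LiteralInteger>>().unwrap() {
    Either::First(ident) => println!("identifier: {}", ident.tokens_to_string()),
    Either::Second(int) => println!("integer: {}", int.tokens_to_string()),
    _ => unreachable!("unused alternatives default to Invalid"),
}
```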
Type Aliases§
- And - `&`
- AndAnd - `&&`
- AndEq - `&=`
- Any - Any number of T delimited by D or `Nothing`
- Assign - `=`
- At - `@`
- AtLeast - At least N of T delimited by D or `Nothing`
- AtMost - At most N of T delimited by D or `Nothing`
- Backslash - `\`
- Bang - `!`
- CachedGroup - `Group` with cached string representation.
- CachedIdent - `Ident` with cached string representation.
- CachedLiteral - `Literal` with cached string representation.
- CachedPunct - `Punct` with cached string representation.
- CachedTokenTree - `TokenTree` (any token) with cached string representation.
- Caret - `^`
- CaretEq - `^=`
- Colon - `:`
- ColonDelimited - `T` followed by an optional `:`
- ColonDelimitedVec - Vector of `T` delimited by `:`
- Comma - `,`
- CommaDelimited - `T` followed by an optional `,`
- CommaDelimitedVec - Vector of `T` delimited by `,`
- Dollar - `$`
- Dot - `.`
- DotDelimited - `T` followed by an optional `.`
- DotDelimitedVec - Vector of `T` delimited by `.`
- DotDot - `..`
- DotDotEq - `..=`
- Ellipsis - `...`
- Equal - `==`
- Exactly - Exactly N of T delimited by D or `Nothing`
- FatArrow - `=>`
- Ge - `>=`
- Gt - `>`
- LArrow - `<-`
- Le - `<=`
- LifetimeTick - `'` with `Spacing::Joint`
- Lt - `<`
- Many - One or more of T delimited by D or `Nothing`
- Minus - `-`
- MinusEq - `-=`
- NotEqual - `!=`
- Optional - Zero or one of T delimited by D or `Nothing`
- Or - `|`
- OrEq - `|=`
- OrOr - `||`
- PathSep - `::`
- PathSepDelimited - `T` followed by an optional `::`
- PathSepDelimitedVec - Vector of `T` delimited by `::` (see the sketch after this list)
- Percent - `%`
- PercentEq - `%=`
- Plus - `+`
- PlusEq - `+=`
- Pound - `#`
- Question - `?`
- RArrow - `->`
- Result - Result type for parsing.
- Semicolon - `;`
- SemicolonDelimited - `T` followed by an optional `;`
- SemicolonDelimitedVec - Vector of `T` delimited by `;`
- Shl - `<<`
- ShlEq - `<<=`
- Shr - `>>`
- ShrEq - `>>=`
- Slash - `/`
- SlashEq - `/=`
- Star - `*`
- StarEq - `*=`
- Tilde - `~`
- TokenIter - Type alias for the iterator type we use for parsing. This iterator is `Clone` and produces `&TokenTree`. The shadow counter counts tokens in the background to track progress, which is used to keep the error that made the most progress in disjunctive parsers.
- Underscore - `_`
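A sketch using the alias types above: parse a `::` separated path of identifiers.

```rust
let mut tokens = "std::collections::HashMap".to_token_iter();
let path: PathSepDelimitedVec<Ident> = tokens.parse().unwrap();
assert_eq!(
    path.tokens_to_string(),
    "std::collections::HashMap".tokens_to_string()
);
```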