Types and functionality related to the lexer.
This module contains the structs and enums used to represent tokens, as well as the parse() function, which returns a result containing a TokenStream and Metadata. In line with the specification, the lexer correctly detects the GLSL version and switches grammar accordingly. There is also an alternative parse_with_version() function which assumes a given GLSL version rather than detecting it on-the-fly. The preprocessor submodule contains types used to represent tokens within preprocessor directives.
The way spans are counted can differ depending on your needs. By default, the spans count offsets between
individual chars, but there are alternate functions that assume different encodings:
parse_with_utf_16_offsets(), parse_with_utf_16_offsets_and_version(), parse_with_utf_8_offsets() and parse_with_utf_8_offsets_and_version().
§Lexer
This lexer uses the “Maximal munch” principle to greedily create tokens. This means the longest possible valid token is always produced. Some examples:
i---7 becomes (i) (--) (-) (7)
i----7 becomes (i) (--) (--) (7)
i-----7 becomes (i) (--) (--) (-) (7)
i-- - --7 becomes (i) (--) (-) (--) (7)

The longest possible tokens are produced even if they form an invalid expression. For example, i----7 could have been a valid GLSL expression if it were parsed as (i) (--) (-) (-) (7), but this behaviour is not exhibited because that would require knowing the context, and the lexer is not context-aware.
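The greedy behaviour above can be sketched with a minimal, self-contained tokenizer for just identifiers, numbers, and the - and -- operators. This is an illustrative sketch of the maximal-munch principle, not the crate's actual implementation:

```rust
// A minimal sketch of maximal munch for `-`/`--`, using a simplified token
// set. Hypothetical helper, not part of this crate's API.
fn tokenize(src: &str) -> Vec<String> {
    let chars: Vec<char> = src.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        match chars[i] {
            '-' => {
                // Greedily prefer the longest operator: `--` over `-`.
                if chars.get(i + 1) == Some(&'-') {
                    tokens.push("--".to_string());
                    i += 2;
                } else {
                    tokens.push("-".to_string());
                    i += 1;
                }
            }
            c if c.is_ascii_alphanumeric() => {
                // Consume a whole identifier/number in one go.
                let start = i;
                while i < chars.len() && chars[i].is_ascii_alphanumeric() {
                    i += 1;
                }
                tokens.push(chars[start..i].iter().collect());
            }
            _ => i += 1, // skip whitespace and anything else
        }
    }
    tokens
}

fn main() {
    // `i----7` lexes as (i) (--) (--) (7): the longest token wins each step.
    println!("{:?}", tokenize("i----7"));
}
```

Because each step takes the longest match without looking ahead at expression context, i----7 can never lex as (i) (--) (-) (-) (7), matching the behaviour described above.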
For a BNF notation of the official lexer grammar, see this file.
§Differences in behaviour
Since this crate is part of a larger effort to provide an LSP implementation, it is designed to handle errors in a user-friendly manner. This means there are some minor differences between the behaviour of this lexer and that of a lexer as specified by the GLSL specification. The differences are listed below:
- When the lexer comes across a character which is not part of the allowed character set, it emits the Invalid token. The specification has no such token; it just mentions that a character outside of the allowed character set must produce a compile-time error.
- When the lexer comes across a block comment which does not have a closing delimiter (and therefore runs to the end-of-file), it still produces a BlockComment token with the contains_eof field set to true. The specification does not mention what should technically happen in such a case, but compilers seem to produce a compile-time error.
- The lexer treats any number that matches the pattern 0[0-9]+ as an octal number. The specification says that an octal number can only contain the digits 0-7. This change was made to produce better errors; the entire span 009 is highlighted as an invalid octal number token, rather than producing an error about two consecutive number tokens (00 and 9), which would be more confusing.
- The lexer treats any identifier immediately after a number (without separating whitespace) as a suffix. The specification only defines the u|U suffix as valid for integers, and the f|F and lf|LF suffixes as valid for floating-point numbers. Anything afterwards should be treated as a new token, so this would be valid: #define TEST +5 \n uint i = 5uTEST. Currently, this crate doesn't work according to this behaviour, hence for now the lexer will treat the suffix as uTEST instead.
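The lenient octal rule can be sketched as follows: the whole 0[0-9]+ span is captured as a single candidate token and only then validated, so 009 surfaces as one invalid-octal token rather than two confusing number tokens. This is a hypothetical helper illustrating the idea, not the crate's actual code:

```rust
// Hedged sketch: classify a number span already matched against `0[0-9]+`.
// Returns the full span together with whether it is a valid octal literal
// (i.e. every digit is in `0`..=`7`).
fn classify_octal(num: &str) -> (&str, bool) {
    let valid = num.bytes().all(|b| (b'0'..=b'7').contains(&b));
    (num, valid)
}

fn main() {
    // `017` is valid octal; `009` is captured whole but flagged invalid.
    println!("{:?}", classify_octal("017"));
    println!("{:?}", classify_octal("009"));
}
```

Keeping the invalid span as one token is what lets later error reporting highlight 009 in its entirety.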
See the preprocessor submodule for an overview of the lexer’s behaviour for each individual preprocessor
directive.
To be certain that the source is valid, these cases (apart from the macro issue) must be checked afterwards by
iterating over the TokenStream. The parsing functions provided in this crate do this for you, but if you
are performing your own manipulation you must perform these checks yourself.
A potential idea for consideration would be to include the alternate behaviour behind a flag (i.e. stop parsing
after encountering an error). This is currently not a priority, but if you would like such functionality please
file an issue on the GitHub repository to show interest. An alternative would be to set a flag in the
Metadata which signifies whether any errors were encountered.
Modules§
- preprocessor
- Types related to preprocessor token streams.
Structs§
- Metadata
- Metadata about the GLSL source string.
Enums§
- NumType
- The type/notation of a number token.
- OpTy
- A mathematical/comparison operator.
- ParseErr
- The error type for lexer parsing operations.
- Token
- A token representing a unit of text in the GLSL source string.
Functions§
- parse
- Parses a GLSL source string into a token stream.
- parse_with_utf_8_offsets
- Parses a GLSL source string into a token stream.
- parse_with_utf_8_offsets_and_version
- Parses a GLSL source string into a token stream, assuming a specified GLSL version.
- parse_with_utf_16_offsets
- Parses a GLSL source string into a token stream.
- parse_with_utf_16_offsets_and_version
- Parses a GLSL source string into a token stream, assuming a specified GLSL version.
- parse_with_version
- Parses a GLSL source string into a token stream, assuming a specific GLSL version.
Type Aliases§
- TokenStream
- A vector of tokens representing a GLSL source string.