Module lexer

Types and functionality related to the lexer.

This module contains the structs and enums used to represent tokens, and the parse() function, which returns a Result containing a TokenStream and Metadata. In line with the specification, the lexer detects the GLSL version and switches grammar accordingly. There is also an alternative parse_with_version() function which assumes a given GLSL version rather than detecting it on the fly. The preprocessor submodule contains types used to represent tokens within preprocessor directives.

The way spans are counted can differ depending on your needs. By default, the spans count offsets between individual chars, but there are alternate functions that assume different encodings: parse_with_utf_8_offsets() counts span offsets in UTF-8 bytes, and parse_with_utf_16_offsets() counts span offsets in UTF-16 code units.
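To illustrate why the choice of encoding matters, the snippet below (plain standard-library Rust, not part of this crate's API) shows how the same short source string has different lengths depending on whether you count chars, UTF-8 bytes, or UTF-16 code units:

```rust
fn main() {
    // 'é' is 1 char, 2 UTF-8 bytes, and 1 UTF-16 code unit,
    // so span offsets past it differ between the three schemes.
    let src = "é=1";
    assert_eq!(src.chars().count(), 3); // char-based offsets (the default)
    assert_eq!(src.len(), 4); // UTF-8 byte offsets
    assert_eq!(src.encode_utf16().count(), 3); // UTF-16 code-unit offsets
    println!("chars=3 utf8=4 utf16=3");
}
```

UTF-16 offsets are what the Language Server Protocol uses by default, which is why such variants are useful for an LSP implementation.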

§Lexer

This lexer uses the “Maximal munch” principle to greedily create tokens. This means the longest possible valid token is always produced. Some examples:

i---7      becomes (i) (--) (-) (7)
i----7     becomes (i) (--) (--) (7)
i-----7    becomes (i) (--) (--) (-) (7)
i-- - --7  becomes (i) (--) (-) (--) (7)

The longest possible tokens are produced even if they form an invalid expression. For example, i----7 could have been a valid GLSL expression if it was parsed as (i) (--) (-) (-) (7), but the lexer does not do this, as it would require knowing the surrounding context and the lexer is not context-aware.
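The principle can be sketched with a toy tokenizer (hypothetical, not this crate's implementation) that always tries the longest operator first, so "--" wins over two separate "-" tokens:

```rust
// Minimal maximal-munch sketch: at each position, try the longest
// operator first; whitespace separates tokens; anything else is
// passed through as a single-character token for this illustration.
fn munch_ops(src: &str) -> Vec<&str> {
    let mut toks = Vec::new();
    let mut rest = src;
    while !rest.is_empty() {
        if rest.starts_with("--") {
            toks.push("--");
            rest = &rest[2..];
        } else if rest.starts_with('-') {
            toks.push("-");
            rest = &rest[1..];
        } else if rest.starts_with(' ') {
            rest = &rest[1..]; // whitespace only separates tokens
        } else {
            let len = rest.chars().next().unwrap().len_utf8();
            toks.push(&rest[..len]);
            rest = &rest[len..];
        }
    }
    toks
}

fn main() {
    // Mirrors the table above.
    assert_eq!(munch_ops("i---7"), ["i", "--", "-", "7"]);
    assert_eq!(munch_ops("i----7"), ["i", "--", "--", "7"]);
    assert_eq!(munch_ops("i-- - --7"), ["i", "--", "-", "--", "7"]);
}
```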

For a BNF notation of the official lexer grammar, see this file.

§Differences in behaviour

Since this crate is part of a larger effort to provide an LSP implementation, it is designed to handle errors in a UX-friendly manner. This means that there are some minor differences between the behaviour of this lexer and that of a lexer as specified by the GLSL specification. The differences are listed below:

  • When the lexer comes across a character which is not part of the allowed character set, it emits the Invalid token. The specification has no such token; it just mentions that a character outside of the allowed character set must produce a compile-time error.
  • When the lexer comes across a block comment which does not have a closing delimiter (and therefore runs to the end of the file), it still produces a BlockComment token with the contains_eof field set to true. The specification does not mention what should happen in such a case, but compilers seem to produce a compile-time error.
  • The lexer treats any number that matches the pattern 0[0-9]+ as an octal number. The specification says that an octal number can only contain the digits 0-7. This change was made to produce better errors; the entire span 009 is highlighted as an invalid octal number token, rather than producing a more confusing error about two consecutive number tokens (00 and 9).
  • The lexer treats any identifier immediately following a number (without separating whitespace) as a suffix. The specification only defines the u|U suffix as valid for integers, and the f|F and lf|LF suffixes as valid for floating-point numbers. Anything afterwards should be treated as a new token, so this would be valid: #define TEST +5 \n uint i = 5uTEST. This crate does not currently implement that behaviour; for now, the lexer will treat the whole of uTEST as the suffix instead.
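The relaxed octal rule above can be sketched as follows (a hypothetical helper, not this crate's API): any lexeme shaped like 0 followed by more decimal digits is classified as a single octal-number span, and separately flagged invalid if it contains an 8 or 9:

```rust
// Returns (is_octal_shaped, is_valid_octal) for a lexeme.
// "009" is one invalid octal token rather than two tokens "00" and "9".
fn classify_octal(lexeme: &str) -> (bool, bool) {
    let is_octal_shaped = lexeme.len() > 1
        && lexeme.starts_with('0')
        && lexeme.chars().all(|c| c.is_ascii_digit());
    let is_valid = is_octal_shaped && lexeme.chars().all(|c| ('0'..='7').contains(&c));
    (is_octal_shaped, is_valid)
}

fn main() {
    assert_eq!(classify_octal("009"), (true, false)); // one invalid octal span
    assert_eq!(classify_octal("017"), (true, true)); // valid octal
    assert_eq!(classify_octal("9"), (false, false)); // plain decimal digit
}
```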

See the preprocessor submodule for an overview of the lexer’s behaviour for each individual preprocessor directive.

To be certain that the source is valid, these cases (apart from the macro issue) must be checked afterwards by iterating over the TokenStream. The parsing functions provided in this crate do this for you, but if you are performing your own manipulation, you must perform these checks yourself.

A potential idea for consideration would be to gate the alternate behaviour behind a flag (i.e. stop lexing after encountering an error). This is currently not a priority, but if you would like such functionality, please file an issue on the GitHub repository to show interest. An alternative would be to set a flag in the Metadata which signifies whether any errors were encountered.

Modules§

preprocessor
Types related to preprocessor token streams.

Structs§

Metadata
Metadata about the GLSL source string.

Enums§

NumType
The type/notation of a number token.
OpTy
A mathematical/comparison operator.
ParseErr
The error type for lexer parsing operations.
Token
A token representing a unit of text in the GLSL source string.

Functions§

parse
Parses a GLSL source string into a token stream.
parse_with_utf_8_offsets
Parses a GLSL source string into a token stream, counting span offsets in UTF-8 bytes.
parse_with_utf_8_offsets_and_version
Parses a GLSL source string into a token stream counting span offsets in UTF-8 bytes, assuming a specified GLSL version.
parse_with_utf_16_offsets
Parses a GLSL source string into a token stream, counting span offsets in UTF-16 code units.
parse_with_utf_16_offsets_and_version
Parses a GLSL source string into a token stream counting span offsets in UTF-16 code units, assuming a specified GLSL version.
parse_with_version
Parses a GLSL source string into a token stream, assuming a specified GLSL version.

Type Aliases§

TokenStream
A vector of tokens representing a GLSL source string.