css_lexer 0.0.4

Refreshing CSS!

An implementation of the CSS Syntax Level 3 tokenization algorithm. It is intended as a low-level building block for building parsers for CSS or CSS-like languages (for example Sass).

This crate provides the [Lexer] struct, which borrows a &str and incrementally produces [Tokens][Token]. The &str is assumed to be encoded as UTF-8.

The [Lexer] may be configured with additional [Features][Feature] to allow lexing tokens in ways which diverge from the CSS specification (such as tokenizing comments beginning with //). With no additional features, this lexer is fully spec-compliant.
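
As a sketch, opting into such a divergence might look like the following. The Feature variant named here (SingleLineComments) and the Kind::Comment assertion are hypothetical illustrations, not confirmed API; consult the [Feature] and [Kind] documentation for the actual names.

use css_lexer::*;
// Feature::SingleLineComments is a hypothetical variant name used for illustration.
let features = Feature::SingleLineComments;
let mut lexer = Lexer::new_with_features(&EmptyAtomSet::ATOMS, "// a comment", features);
// With such a feature enabled, "//" would begin a comment token
// rather than lexing as two delim tokens.
assert_eq!(lexer.advance(), Kind::Comment);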

[Tokens][Token] are untyped (there are no super-classes such as Ident); instead, each has a [Kind] which can be used to determine its type. Tokens do not store the underlying character data, nor do they store their offsets; they only provide "facts" about the underlying data. To rebuild a string, each [Token] needs to be wrapped in a [Cursor], which consults the original &str to get the character data. This design allows Tokens to live on the stack, avoiding heap allocation, as they are always 8 bytes. Likewise, [Cursors][Cursor] are always 12 bytes.
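
As a quick illustration, both of those size guarantees can be checked with std::mem::size_of:

use css_lexer::*;
// Token is 8 bytes and Cursor is 12 bytes, so both live comfortably on the stack.
assert_eq!(std::mem::size_of::<Token>(), 8);
assert_eq!(std::mem::size_of::<Cursor>(), 12);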

Limitations

The [Lexer] has limitations on document size and token size, in order to keep [Token], [SourceOffset] and [Cursor] small. It's very unlikely the average document will run into these limitations, but they're listed here for completeness:

  • Documents are limited to ~4GB in size. [SourceOffset] is a [u32], so it cannot represent larger offsets. Attempting to lex larger documents is considered undefined behaviour; a caller-side guard is sketched after this list.

  • [Tokens][Token] are limited to ~4GB in length. A [Token's][Token] length is a [u32], so it cannot represent larger lengths. If the lexer encounters a token with a larger length, this is considered undefined behaviour.

  • Number [Tokens][Token] are limited to 16,777,216 characters in length. For example, encountering a number with 17 million zeros is considered undefined behaviour. This limit applies to the character length, not the numeric value, which is an [f32]. (Note that the CSS specification dictates numbers are f32; CSS does not have larger numbers.)

  • Dimension [Tokens][Token] are limited to 4,096 numeric characters and 4,096 ident characters in length. For example, encountering a dimension with 4,097 zeros is considered undefined behaviour.
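
Because oversized inputs are undefined behaviour rather than reported errors, a caller may want to reject them before lexing. A minimal sketch of such a guard; the length check is plain Rust on the caller's side, not part of this crate's API:

use css_lexer::*;
let source = "width: 1px";
// SourceOffset is a u32, so reject documents larger than u32::MAX bytes up front.
assert!(source.len() <= u32::MAX as usize);
let mut lexer = Lexer::new(&EmptyAtomSet::ATOMS, source);
let _ = lexer.advance();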

General usage

A parser can be implemented on top of the [Lexer] by instantiating it with [Lexer::new()], or with [Lexer::new_with_features()] if you wish to opt into non-spec-compliant features. The [Lexer] needs to be given a &str, which it will reference to produce Tokens.

Repeatedly calling [Lexer::advance()] will move the Lexer's internal position one [Token] forward and return the newly lexed [Token]. Once the end of the &str is reached, [Lexer::advance()] will repeatedly return [Token::EOF].
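
A minimal sketch of that loop, draining the whole source; this assumes a lexed [Token] can be compared against [Token::EOF] for equality:

use css_lexer::*;
let mut lexer = Lexer::new(&EmptyAtomSet::ATOMS, "width: 1px");
loop {
    let token = lexer.advance();
    if token == Token::EOF {
        break;
    }
    // ... hand each lexed token to the parser here ...
}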

Example

use css_lexer::*;
let mut lexer = Lexer::new(&EmptyAtomSet::ATOMS, "width: 1px");
assert_eq!(lexer.offset(), 0);
{
    // The first token is the ident "width". To recover its characters,
    // wrap the token in a Cursor and slice the original source.
    let token = lexer.advance();
    assert_eq!(token, Kind::Ident);
    let cursor = token.with_cursor(SourceOffset(0));
    assert_eq!(cursor.str_slice(lexer.source()), "width");
}
{
    // Tokens can be compared against a Kind, and also against a char.
    let token = lexer.advance();
    assert_eq!(token, Kind::Colon);
    assert_eq!(token, ':');
}
{
    // Whitespace is lexed as a token of its own rather than skipped.
    let token = lexer.advance();
    assert_eq!(token, Kind::Whitespace);
}
{
    // "1px" lexes as a single Dimension token (a number plus an ident unit).
    let token = lexer.advance();
    assert_eq!(token, Kind::Dimension);
}