pub struct TokenizerConfig {
pub keywords: HashMap<String, TokenType>,
pub single_tokens: HashMap<char, TokenType>,
pub quotes: HashMap<String, String>,
pub identifiers: HashMap<char, char>,
pub comments: HashMap<String, Option<String>>,
pub string_escapes: Vec<char>,
pub nested_comments: bool,
pub escape_follow_chars: Vec<char>,
pub b_prefix_is_byte_string: bool,
pub numeric_literals: HashMap<String, String>,
pub identifiers_can_start_with_digit: bool,
pub hex_number_strings: bool,
pub hex_string_is_integer_type: bool,
pub string_escapes_allowed_in_raw_strings: bool,
}
Tokenizer configuration for a dialect
Fields
keywords: HashMap<String, TokenType>
    Keywords mapping (uppercase keyword -> token type).

single_tokens: HashMap<char, TokenType>
    Single-character tokens.

quotes: HashMap<String, String>
    String quote delimiters (start -> end).

identifiers: HashMap<char, char>
    Identifier quote characters (start -> end).

comments: HashMap<String, Option<String>>
    Comment delimiters (start -> optional end; None indicates a line comment).

string_escapes: Vec<char>
    String escape characters.

nested_comments: bool
    Whether nested block comments are supported.
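When nested comments are enabled, a closing `*/` only ends the comment once every inner `/* … */` has been closed. A minimal depth-counting sketch of that behavior (illustrative names, not the crate's actual scanner):

```rust
/// Illustrative depth-counting scan for nested block comments.
/// `s` starts just after the opening "/*"; returns the index just past
/// the matching "*/", or None if the comment is unterminated.
fn comment_end(s: &str) -> Option<usize> {
    let b = s.as_bytes();
    let mut depth = 1;
    let mut i = 0;
    while i < b.len() {
        if i + 1 < b.len() && b[i] == b'/' && b[i + 1] == b'*' {
            depth += 1; // nested open
            i += 2;
        } else if i + 1 < b.len() && b[i] == b'*' && b[i + 1] == b'/' {
            depth -= 1; // close
            i += 2;
            if depth == 0 {
                return Some(i);
            }
        } else {
            i += 1;
        }
    }
    None // unterminated comment
}

fn main() {
    // With nesting, the inner /* b */ does not end the outer comment.
    assert_eq!(comment_end("a /* b */ c */"), Some(14));
    assert_eq!(comment_end("plain */"), Some(8));
    assert_eq!(comment_end("never closed"), None);
}
```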
escape_follow_chars: Vec<char>
    Valid escape follow characters (for MySQL-style escaping). When a backslash is followed by a character NOT in this list, the backslash is discarded. When the list is empty, the backslash is preserved for all unrecognized escape sequences.
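The two rules above can be sketched as a single decision on the character following the backslash (the follow set below is a hypothetical MySQL-like example, not the crate's default):

```rust
/// Illustrative handling of a backslash escape, given the configured
/// follow characters. Returns the text that replaces "\<next>".
fn apply_escape(follow_chars: &[char], next: char) -> String {
    if follow_chars.is_empty() || follow_chars.contains(&next) {
        format!("\\{}", next) // recognized (or no list configured): keep the backslash
    } else {
        next.to_string() // MySQL-style: unrecognized escape drops the backslash
    }
}

fn main() {
    // Hypothetical MySQL-like follow set.
    let follow = ['n', 't', '\\', '\''];
    assert_eq!(apply_escape(&follow, 'n'), "\\n"); // in the list: kept
    assert_eq!(apply_escape(&follow, 'q'), "q");   // not in the list: discarded
    assert_eq!(apply_escape(&[], 'q'), "\\q");     // empty list: preserved
}
```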
b_prefix_is_byte_string: bool
    Whether b'…' is a byte string (true, as in BigQuery) or a bit string (false, as in standard SQL). Defaults to false (bit string).
numeric_literals: HashMap<String, String>
    Numeric literal suffixes (uppercase suffix -> type name), e.g. {"L": "BIGINT", "S": "SMALLINT"}. Used by Hive/Spark to parse 1L as CAST(1 AS BIGINT).
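The suffix lookup can be sketched as follows, using the example table above ({"L": "BIGINT", "S": "SMALLINT"}); `rewrite_numeric` is a hypothetical helper, not the crate's API:

```rust
use std::collections::HashMap;

/// Illustrative rewrite of a suffixed numeric literal into a CAST,
/// using the example suffix table above (hypothetical helper).
fn rewrite_numeric(literal: &str) -> Option<String> {
    let numeric_literals: HashMap<&str, &str> =
        [("L", "BIGINT"), ("S", "SMALLINT")].into_iter().collect();
    // Split off the final character as a candidate suffix.
    let (digits, suffix) = literal.split_at(literal.len() - 1);
    if !digits.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    numeric_literals
        .get(suffix.to_uppercase().as_str())
        .map(|ty| format!("CAST({} AS {})", digits, ty))
}

fn main() {
    assert_eq!(rewrite_numeric("1L"), Some("CAST(1 AS BIGINT)".to_string()));
    assert_eq!(rewrite_numeric("2S"), Some("CAST(2 AS SMALLINT)".to_string()));
    assert_eq!(rewrite_numeric("10"), None); // "0" is not a registered suffix
}
```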
identifiers_can_start_with_digit: bool
    Whether unquoted identifiers can start with a digit (e.g. 1a, 1_a). When true, a number followed by letters/underscores is treated as an identifier. Used by Hive, Spark, MySQL, and ClickHouse.
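The effect of the flag on a digit-led word can be sketched like this (the token names below are illustrative):

```rust
/// Illustrative classification of a digit-led word, per the
/// identifiers_can_start_with_digit flag (hypothetical helper).
fn classify(word: &str, digit_idents: bool) -> &'static str {
    let starts_with_digit = word.chars().next().map_or(false, |c| c.is_ascii_digit());
    let has_ident_chars = word.chars().any(|c| c.is_ascii_alphabetic() || c == '_');
    if starts_with_digit && has_ident_chars {
        // The whole word is one identifier, or it splits at the first letter.
        if digit_idents { "Identifier" } else { "Number + Identifier" }
    } else if starts_with_digit {
        "Number"
    } else {
        "Identifier"
    }
}

fn main() {
    assert_eq!(classify("1_a", true), "Identifier");          // Hive/MySQL style
    assert_eq!(classify("1a", false), "Number + Identifier"); // strict split
    assert_eq!(classify("123", true), "Number");
}
```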
hex_number_strings: bool
    Whether a 0x/0X prefix is treated as a hex literal. When true, 0XCC is tokenized as a single hex token instead of Number("0") + Identifier("XCC"). Used by BigQuery, SQLite, and Teradata.
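Recognizing the prefix amounts to checking that everything after 0x/0X is a hex digit; a minimal sketch (hypothetical helper, not the crate's scanner):

```rust
/// Illustrative 0x/0X recognition, per the hex_number_strings flag.
fn is_hex_literal(word: &str) -> bool {
    let rest = match word.strip_prefix("0x").or_else(|| word.strip_prefix("0X")) {
        Some(r) => r,
        None => return false,
    };
    // At least one digit, and all of them must be hex.
    !rest.is_empty() && rest.chars().all(|c| c.is_ascii_hexdigit())
}

fn main() {
    assert!(is_hex_literal("0XCC")); // one hex token when the flag is on
    assert!(is_hex_literal("0xcc"));
    assert!(!is_hex_literal("0XZZ")); // not hex digits
    assert!(!is_hex_literal("99"));   // no prefix
}
```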
hex_string_is_integer_type: bool
    Whether hex string literals from the 0x prefix represent integer values. When true (BigQuery), 0xA is tokenized as HexNumber (an integer in hex notation). When false (SQLite, Teradata), 0xCC is tokenized as HexString (a binary/blob value).
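The two interpretations of the digits after 0x can be sketched side by side (hypothetical helpers, not the crate's API):

```rust
/// Illustrative BigQuery-style reading: the hex digits are an integer.
fn as_hex_number(digits: &str) -> Option<i64> {
    i64::from_str_radix(digits, 16).ok()
}

/// Illustrative SQLite/Teradata-style reading: pairs of hex digits
/// decode to raw bytes (a binary/blob value).
fn as_hex_string(digits: &str) -> Option<Vec<u8>> {
    if digits.len() % 2 != 0 {
        return None; // an odd digit count cannot form whole bytes
    }
    (0..digits.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&digits[i..i + 2], 16).ok())
        .collect()
}

fn main() {
    assert_eq!(as_hex_number("A"), Some(10));          // 0xA as an integer
    assert_eq!(as_hex_string("CC"), Some(vec![0xCC])); // 0xCC as a blob
    assert_eq!(as_hex_string("C"), None);              // odd length
}
```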
string_escapes_allowed_in_raw_strings: bool
    Whether string escape sequences (like \') are allowed in raw strings. When true (the BigQuery default), \' inside r'…' escapes the quote. When false (Spark/Databricks), backslashes in raw strings are always literal. Mirrors Python sqlglot's STRING_ESCAPES_ALLOWED_IN_RAW_STRINGS (default True).
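The difference shows up when scanning for the closing quote of a raw string; a minimal sketch of the two behaviors (hypothetical helper, not the crate's scanner):

```rust
/// Illustrative scan for the closing quote of a raw string r'…'.
/// `body` is the text after the opening quote; returns the index of
/// the closing quote, or None if the string is unterminated.
fn find_close(body: &str, escapes_allowed: bool) -> Option<usize> {
    let chars: Vec<char> = body.chars().collect();
    let mut i = 0;
    while i < chars.len() {
        if escapes_allowed && chars[i] == '\\' && i + 1 < chars.len() {
            i += 2; // BigQuery-style: \' does not terminate the string
            continue;
        }
        if chars[i] == '\'' {
            return Some(i);
        }
        i += 1; // Spark-style: the backslash is just a literal character
    }
    None
}

fn main() {
    let body = "a\\'b'"; // i.e. the source text  a\'b'
    assert_eq!(find_close(body, true), Some(4));  // escape skips the quote
    assert_eq!(find_close(body, false), Some(2)); // backslash is literal
}
```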
Trait Implementations
impl Clone for TokenizerConfig

    fn clone(&self) -> TokenizerConfig

    fn clone_from(&mut self, source: &Self)
        Performs copy-assignment from source.