Struct alkale::TokenizerContext

pub struct TokenizerContext<Source: Iterator<Item = char>, TokenData> { /* private fields */ }

Provides many helpful methods for tokenization. One TokenizerContext should be created per tokenizer, and it may be passed around as needed. When tokenization is done, result converts this object into usable output.

A new context may be created with new or from.

Source code may be traversed with next, peek, skip, and has_next.

Implementations


impl<S: Iterator<Item = char>, T> TokenizerContext<S, T>


pub fn try_parse_identifier( &mut self, first_predicate: impl Fn(&char) -> bool, rest_predicate: impl Fn(&char) -> bool, ) -> Option<(String, Span)>

This function reads a character and checks it against the first predicate (first_predicate). If it matches, the function then consumes as many subsequent characters as possible that match the second predicate (rest_predicate).

If the first predicate matched, this function returns a String of all consumed characters along with the Span of the consumed region. Otherwise, it returns None.

If you are looking to match standard language identifiers (that is, [a-zA-Z_][a-zA-Z0-9_]*), use try_parse_standard_identifier.
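The first/rest scanning behavior can be sketched with only the standard library. This is an illustrative re-implementation over Peekable<Chars>, not the crate's actual code, and it omits the Span half of the return value:

```rust
// Standalone sketch of the first/rest predicate scan. The real method also
// returns a Span over the consumed region, which is omitted here.
fn try_parse_identifier(
    input: &mut std::iter::Peekable<std::str::Chars>,
    first: impl Fn(&char) -> bool,
    rest: impl Fn(&char) -> bool,
) -> Option<String> {
    // Only commit if the very first character matches `first`.
    if !input.peek().map_or(false, |c| first(c)) {
        return None;
    }
    let mut out = String::new();
    out.push(input.next().unwrap());
    // Greedily consume characters matching `rest`.
    while input.peek().map_or(false, |c| rest(c)) {
        out.push(input.next().unwrap());
    }
    Some(out)
}

fn main() {
    let mut chars = "x1 = 5".chars().peekable();
    let ident = try_parse_identifier(
        &mut chars,
        |c| c.is_ascii_alphabetic() || *c == '_',
        |c| c.is_ascii_alphanumeric() || *c == '_',
    );
    assert_eq!(ident, Some("x1".to_string()));
    // The iterator is left at the first non-matching character.
    assert_eq!(chars.next(), Some(' '));
}
```

Passing the [a-zA-Z_] / [a-zA-Z0-9_] predicates shown in main reproduces the behavior of try_parse_standard_identifier.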


pub fn try_parse_standard_identifier(&mut self) -> Option<(String, Span)>

If the next character matches [a-zA-Z_], this function consumes it, then continues consuming [a-zA-Z0-9_] characters until EOF or a non-matching character.

All consumed characters will be returned as a String, along with the Span of the consumed region. If the first check failed, this function returns None.

See try_parse_identifier for a more generic version of this method.

Examples found in repository
examples/file/main.rs (line 63)
pub fn main() {
    // Buffer for the input path
    let mut str_path = String::new();

    println!(
        "Current path: {:?}",
        std::env::current_dir().expect("No current dir exists.")
    );
    println!("Leave input blank to use default path.");
    print!("Input path to file: ");

    stdout().lock().flush().expect("Could not flush stdout.");

    // Read path input
    stdin()
        .lock()
        .read_line(&mut str_path)
        .expect("Could not read stdin.");

    let trimmed = str_path.trim();

    // Get path from string.
    let path = if trimmed.is_empty() {
        Path::new("./examples/file/file.txt")
    } else {
        Path::new(trimmed)
    };

    println!();
    println!("Path: {:?}", path);

    // Load file from path
    let file = File::open(path).expect("Could not open file.");

    // Get a buffered reader of the file.
    let mut reader = BufReader::new(file);

    // Create a context from the BufReader. The closure dictates
    // how we want to handle any read failures— in this case just panic.
    let mut context = TokenizerContext::new_file(&mut reader, |x| match x {
        Ok(char) => char,
        Err(_) => panic!("Unable to read file."),
    });

    // Tokenizer logic
    while context.has_next() {
        if let Some((ident, span)) = context.try_parse_standard_identifier() {
            context.push_token(Token::new(ident, span));
        } else {
            context.skip();
        }
    }

    // Print result
    println!();
    println!("{:#?}", context.result());
}
More examples
examples/json.rs (line 64)
pub fn main() {
    use JsonToken::*;

    let code = r##"
        {
            "a": "Assigned to the \"a\" property!",
            "b" : 25.2,
            "c": false,
            "d": true,
            "12": null,
            "list": [
                1,
                -2e+2,
                3.1
            ]
        }
    "##;

    let mut context = TokenizerContext::new(code.chars());

    while context.has_next() {
        // Map single-character tokens to their data.
        let found_token = context.map_single_char_token(|x| match x {
            '{' => Some(OpenBrace),
            '}' => Some(CloseBrace),
            '[' => Some(OpenBracket),
            ']' => Some(CloseBracket),
            ':' => Some(Colon),
            ',' => Some(Comma),
            _ => None,
        });

        // If the above codeblock pushed a new token, move onto the next iteration.
        if found_token {
            continue;
        }

        // Attempt to parse an identifier for certain tokens.
        if let Some((name, span)) = context.try_parse_standard_identifier() {
            let data = match name.as_ref() {
                "null" => Some(Null),
                "true" => Some(Bool(true)),
                "false" => Some(Bool(false)),
                _ => None,
            };

        // If the above table mapped the identifier to valid token data, push a token.
        // If it didn't, report an error. Either way, restart the loop.
            if let Some(data) = data {
                context.push_token(Token::new(data, span));
            } else {
                context.report(ErrorNotification(
                    format!("Unexpected identifier {}", name),
                    span,
                ))
            }

            continue;
        }

        // If the next element in the source code is a string, parse it and report errors as necessary.
        if let Some((result, span)) = context.try_parse_strict_string() {
            match result {
                Ok(string) => {
                    // Push the valid string.
                    context.push_token(Token::new(Str(string), span));
                }
                Err(errors) => {
                    // Create a notification for every error.
                    for error in errors {
                        use ParseCharError::*;
                        use StringTokenError::*;

                        context.report(match error {
                            CharError(NoEscape(span)) => ErrorNotification(
                                "Missing escape code after backslash".to_owned(),
                                span,
                            ),
                            CharError(IllegalEscape(char, span)) => {
                                ErrorNotification(format!("Illegal escape code '{}'", char), span)
                            }
                            CharError(NoCharFound) => {
                                unreachable!("Strings will never create this error")
                            }
                            NoClosingDelimiter => ErrorNotification(
                                "Missing closing delimiter on string".to_owned(),
                                span.clone(),
                            ),
                        });
                    }
                }
            }
            continue;
        }

        // If the next character is a minus sign, skip it and set
        // this variable to its span. Otherwise, set it to None.
        //
        // Note: Negatives are parsed during tokenization because no subtraction
        // exists in JSON, every - sign is unary so there's no ambiguity.
        let negative_sign_span = if context.peek_is('-') {
            let span = context.next_span().unwrap().1;
            context.skip_whitespace();
            Some(span)
        } else {
            None
        };

        // Attempt to parse a floating point number. If one was found and it was valid, push
        // a token for it, otherwise report a parsing error.
        if let Some((result, span)) = context.try_parse_float() {
            if let Ok(mut number) = result {
                // If a negative sign was found prior, negate the number.
                // This is completely lossless because floating point numbers use a sign bit.
                negative_sign_span.is_some().then(|| number = -number);

                context.push_token(Token::new(Number(number), span));
            } else {
                context.report(ErrorNotification(
                    "Floating-point number is malformed".to_owned(),
                    span,
                ));
            }

            continue;
        } else if let Some(span) = negative_sign_span {
            // If no number was found, but we DID find a negative sign, then that negative sign is alone and
            // is thus invalid.
            context.report(ErrorNotification(
                "Negative sign should have a number after it".to_owned(),
                span,
            ));
            continue;
        }

        // If whitespace is found, skip it and continue, otherwise throw an error indicating this
        // is an unknown character.
        if context.peek_is_map(|char| char.is_whitespace()) {
            context.skip_whitespace();
        } else {
            let (char, span) = context.next_span().unwrap();

            context.report(ErrorNotification(
                format!("Unexpected character '{char}'"),
                span,
            ))
        }
    }

    println!("{:#?}", context.result());
}

impl<S: Iterator<Item = char>, T> TokenizerContext<S, T>


pub fn consume_standard_number(&mut self) -> Option<(String, Span)>

If the next character is an ascii digit or period, consume it and all ascii digits, ascii letters, underscores, and/or periods following it. The consumed characters and a span over them all will be returned. Otherwise, return None.

This method is intended to be used for more advanced number parsing where manual analysis not covered by this module is needed.

This method may also match + and - if they are directly preceded by an e or E. Otherwise, they are treated like any other number-terminating character.
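The consumption rule above can be modeled with a small standalone function (an illustrative sketch using only std, not the crate's implementation): digits, ASCII letters, underscores, and periods are always consumed, while + and - are consumed only when the previously consumed character was e or E.

```rust
// Sketch of consume_standard_number's consumption rule. Returns the consumed
// prefix, or None if the input does not start with an ASCII digit or period.
fn consume_standard_number(src: &str) -> Option<String> {
    let mut chars = src.chars().peekable();
    // Only start if the first character is an ASCII digit or a period.
    if !chars.peek().map_or(false, |c| c.is_ascii_digit() || *c == '.') {
        return None;
    }
    let mut out = String::new();
    while let Some(&c) = chars.peek() {
        // '+'/'-' are only part of the number directly after 'e' or 'E'.
        let after_e = matches!(out.chars().last(), Some('e') | Some('E'));
        let sign_ok = (c == '+' || c == '-') && after_e;
        if c.is_ascii_alphanumeric() || c == '_' || c == '.' || sign_ok {
            out.push(c);
            chars.next();
        } else {
            break;
        }
    }
    Some(out)
}

fn main() {
    // The exponent sign is kept because it directly follows 'e'.
    assert_eq!(consume_standard_number("5e-3+1"), Some("5e-3".to_string()));
    // A bare '+' terminates the number like any other character.
    assert_eq!(consume_standard_number("12+3"), Some("12".to_string()));
    assert_eq!(consume_standard_number("x12"), None);
}
```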


pub fn parse_standard_base(&mut self) -> Result<StandardBase, InvalidBaseError>

If the next character is 0, consume it and the character after it. The second character is the base.

A base of x will result in hexadecimal.
A base of o will result in octal.
A base of b will result in binary.
Any other character will return an error.

If no 0 was found initially, decimal is returned.
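The base-detection rule can be sketched as a simple classifier. This is a standalone illustration, not the crate's code: the real method consumes from the context and returns an InvalidBaseError, while this sketch just inspects a prefix and uses Option<char> as a stand-in error type (None meaning EOF, an assumption about how end-of-input is treated):

```rust
#[derive(Debug, PartialEq)]
enum StandardBase { Binary, Octal, Decimal, Hexadecimal }

// Sketch of parse_standard_base's rule: a leading 0 makes the next character
// the base marker; no leading 0 means decimal.
fn detect_base(src: &str) -> Result<StandardBase, Option<char>> {
    let mut chars = src.chars();
    if chars.next() != Some('0') {
        // No leading 0: decimal, nothing special consumed.
        return Ok(StandardBase::Decimal);
    }
    match chars.next() {
        Some('x') => Ok(StandardBase::Hexadecimal),
        Some('o') => Ok(StandardBase::Octal),
        Some('b') => Ok(StandardBase::Binary),
        other => Err(other), // any other character (or EOF) is invalid
    }
}

fn main() {
    assert_eq!(detect_base("0x1F"), Ok(StandardBase::Hexadecimal));
    assert_eq!(detect_base("0b101"), Ok(StandardBase::Binary));
    assert_eq!(detect_base("42"), Ok(StandardBase::Decimal));
    assert_eq!(detect_base("09"), Err(Some('9')));
}
```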


pub fn parse_standard_base_strict( &mut self, ) -> Option<Result<StandardBase, InvalidBaseError>>

Same as TokenizerContext::parse_standard_base but returns None if no 0 was found initially.


pub fn try_parse_integer_from_base<R: Copy + Zero + CheckedAdd<Output = R> + CheckedMul<Output = R> + From<u8> + Unsigned>( &mut self, base: &impl NumericalBase, ) -> Option<(Span, Result<R, IntegerOutOfRangeError>)>

Takes a NumericalBase and continuously consumes that base’s digits from the source code until a non-digit character is encountered. Consumed digits will be accumulated into a single numerical value. (E.g. “12A” in hex returns 298, assuming the number type can support 298)

If the next character in the source code is not a valid digit for this base, return None.

If the target number type cannot support the value, then an IntegerOutOfRangeError is returned. This error should be raised as a tokenizer notification as appropriate.

No matter what the result is, this function will always consume as many digits as possible. An overflow will not cause the function to exit early. This means you may use the returned Span for reporting the overflow location.

Only unsigned values may be parsed; this method ignores negative signs entirely.

This method should be used for languages that only support integers. To detect which base to use, see TokenizerContext::parse_standard_base.
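The accumulation described above (fold each digit into a single value with checked arithmetic, and keep consuming even after an overflow) can be sketched as follows. This is a simplified standalone model using u16 and pre-converted digit values rather than the generic R and source-code characters:

```rust
// Sketch of the digit-accumulation loop: each digit folds in as
// `acc * base + digit` using checked arithmetic, so overflow is detected
// instead of wrapping. Iteration continues past an overflow, mirroring the
// documented behavior of consuming as many digits as possible.
fn accumulate_digits(digits: &[u8], base: u16) -> Result<u16, ()> {
    let mut acc: u16 = 0;
    let mut overflowed = false;
    for &d in digits {
        match acc.checked_mul(base).and_then(|a| a.checked_add(d as u16)) {
            Some(next) => acc = next,
            None => overflowed = true, // keep consuming; caller still gets a full span
        }
    }
    if overflowed { Err(()) } else { Ok(acc) }
}

fn main() {
    // "12A" in hex: digits 1, 2, 10 accumulate to 298.
    assert_eq!(accumulate_digits(&[1, 2, 10], 16), Ok(298));
    assert_eq!(accumulate_digits(&[2, 9, 8], 10), Ok(298));
    // 65536 does not fit in u16, so the overflow is reported.
    assert!(accumulate_digits(&[6, 5, 5, 3, 6], 10).is_err());
}
```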


pub fn try_parse_integer<R: Copy + Zero + CheckedAdd<Output = R> + CheckedMul<Output = R> + From<u8> + Unsigned>( &mut self, ) -> Option<(Span, Result<R, IntegerOutOfRangeError>)>

Continuously consumes ascii digits from the source code until a non-digit character is encountered. Consumed digits will be accumulated into a single numerical value. (E.g. “298” returns 298, assuming the number type can support 298)

If the next character in the source code is not a valid digit, return None.

If the target number type cannot support the value, then an IntegerOutOfRangeError is returned. This error should be raised as a tokenizer notification as appropriate.

No matter what the result is, this function will always consume as many digits as possible. An overflow will not cause the function to exit early. This means you may use the returned Span for reporting the overflow location.

Only unsigned values may be parsed; this method ignores negative signs entirely.

This method should be used for languages that only support integers. See TokenizerContext::try_parse_integer_from_base for a version that works with arbitrary bases.


pub fn try_parse_float( &mut self, ) -> Option<(Result<f64, ParseFloatError>, Span)>

Consumes a standard number with TokenizerContext::consume_standard_number and parses the consumed string as a 64-bit floating-point number. Negative signs are not consumed, so returned floats are always positive.

This should generally be used for languages that only have floating-point numbers and no direct need for pure integers. This method does not account for NaN or Infinity; those will need to be handled manually elsewhere.

Returns None if no number was found, otherwise returns a 2-tuple of the parsed number (or relevant error) and a span over the entire number’s position in the source code.

If integers need to be parsed, use TokenizerContext::try_parse_integer.

Examples found in repository
examples/expression.rs (line 58)
fn main() {
    use MathSymbol::*;

    // The test program we're going to tokenize.
    let program = "23 * (012 - 3) / 1_2_3 + 5e-3";

    // The TokenizerContext for our example program.
    let mut context = TokenizerContext::new(program.chars());

    // While there are more characters in the source code...
    while context.has_next() {
        // If the next character is one of these, push its respective token.
        let single_char_pushed = context.map_single_char_token(|char| match char {
            '+' => Some(Plus),
            '-' => Some(Minus),
            '*' => Some(Times),
            '/' => Some(Divide),
            '%' => Some(Modulo),
            '(' => Some(OpenParen),
            ')' => Some(CloseParen),
            _ => None,
        });

        // If the above properly pushed a token, skip this iteration.
        if single_char_pushed {
            continue;
        }

        // Try to parse a floating point number (f64), throw an error or push the token as necessary.
        if let Some((result, span)) = context.try_parse_float() {
            if let Ok(value) = result {
                context.push_token(Token::new(Number(value), span));
            } else {
                context.report(ErrorNotification(String::from("Malformed number."), span));
            }

            continue;
        }

        // If the above didn't occur, skip the next character. Throw an error if it isn't whitespace.
        if let Some((char, span)) = context.next_span() {
            if !char.is_whitespace() {
                context.report(ErrorNotification(
                    format!("Unexpected character `{}` in expression.", char),
                    span,
                ));
            }
        }
    }

    // Print the result— this example should have no error notifications.
    println!("{:#?}", context.result());
}
More examples
examples/json.rs (line 136): this method is also called in the json.rs listing shown in full above.

pub fn try_parse_number<R: Copy + Zero + CheckedAdd<Output = R> + CheckedMul<Output = R> + From<u8> + Unsigned>( &mut self, ) -> Option<(Result<ParsedNumber<R>, ParseNumberError>, Span)>

Attempts to parse a general-form number. This method will return None if no number is found at all.

If the number starts with a 0, the second character must represent a StandardBase (b, o, or x); otherwise an error is returned.

If the number has an explicit base as defined above, or if it contains no decimal point and does not end in f, it is treated as an integer and parsed the same as TokenizerContext::try_parse_integer for the generic integer type R.

Otherwise, the number is parsed as an f64 and returned.
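One plausible reading of the integer/float classification rules can be sketched as follows. This is a hedged illustration of the rules as stated (the crate's exact precedence, and its handling of exponents, may differ), operating on an already-consumed number string:

```rust
#[derive(Debug, PartialEq)]
enum NumberKind { Integer, Float }

// Sketch: explicit-base numbers (0b/0o/0x) are integers; otherwise a number
// with no decimal point and no trailing 'f' is an integer, and anything else
// is parsed as an f64.
fn classify(text: &str) -> NumberKind {
    let explicit_base = text.len() >= 2
        && text.starts_with('0')
        && matches!(text.as_bytes()[1], b'b' | b'o' | b'x');
    let has_point = text.contains('.');
    let float_suffix = text.ends_with('f');
    if explicit_base || (!has_point && !float_suffix) {
        NumberKind::Integer
    } else {
        NumberKind::Float
    }
}

fn main() {
    assert_eq!(classify("0x1F"), NumberKind::Integer);
    assert_eq!(classify("12"), NumberKind::Integer);
    assert_eq!(classify("3.1"), NumberKind::Float);
    assert_eq!(classify("2f"), NumberKind::Float);
}
```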


impl<S: Iterator<Item = char>, T> TokenizerContext<S, T>


pub fn parse_string<E>( &mut self, char_consumer: impl Fn(&mut Self, &mut String, char) -> Result<(), E>, ) -> (Result<String, Vec<StringTokenError<E>>>, Span)

Attempts to read a string-like structure from source code.

This method immediately consumes a character and treats it as the delimiter, then consumes characters until EOF or a matching closing delimiter is reached.

Characters are consumed by the char_consumer argument; errors it returns are collected into the Vec<StringTokenError<E>> of the result.
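The delimiter loop with a pluggable consumer can be sketched as below. This is a simplified standalone model: the real char_consumer also receives &mut Self (so it can consume further characters, e.g. for escape sequences), and the real method additionally reports a NoClosingDelimiter error when EOF is hit first; both are omitted here.

```rust
// Sketch of parse_string's shape: the first character becomes the delimiter,
// then the consumer handles each character until the matching delimiter.
// Errors from the consumer are collected instead of aborting the parse.
fn parse_string_like<E>(
    src: &str,
    mut char_consumer: impl FnMut(&mut String, char) -> Result<(), E>,
) -> Result<String, Vec<E>> {
    let mut chars = src.chars();
    let delimiter = chars.next().expect("caller checked a delimiter exists");
    let mut out = String::new();
    let mut errors = Vec::new();
    for c in chars {
        if c == delimiter {
            break; // closing delimiter found, string complete
        }
        if let Err(e) = char_consumer(&mut out, c) {
            errors.push(e);
        }
    }
    if errors.is_empty() { Ok(out) } else { Err(errors) }
}

fn main() {
    // A trivial consumer that accepts every character verbatim.
    let parsed: Result<String, Vec<()>> =
        parse_string_like("'abc' tail", |out, c| { out.push(c); Ok(()) });
    assert_eq!(parsed, Ok("abc".to_string()));
}
```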


pub fn try_parse_simple_string( &mut self, ) -> Option<(Result<String, Vec<StringTokenError<ParseCharError>>>, Span)>

Parses a simple, Rust-like string. Both ' and " are allowed as delimiters. This string tokenizer should be sufficient for most languages. It uses parse_simple_character to parse each character within the string.

If the next character is not ' or ", this function is a no-op.

If your language has both string and character tokens, try_parse_strict_string may be more applicable.


pub fn try_parse_strict_string( &mut self, ) -> Option<(Result<String, Vec<StringTokenError<ParseCharError>>>, Span)>

Same as try_parse_simple_string but only allows " as a string delimiter. This should generally be used if ' is reserved for a character token.

If the next character is not ", this function is a no-op.

Also see: parse_simple_character

Examples found in repository
examples/json.rs (line 87): this method is called at line 87 of the json.rs listing shown in full above.

pub fn try_parse_character_token( &mut self, ) -> Option<(Result<char, CharTokenError>, Span)>

Parses a simple character token. If the next character is a ', this method uses parse_simple_character to read the character, then parses the closing ' delimiter.

Returns None if no opening ' was found, and an Err variant if the character token could not be parsed.


impl<S: Iterator<Item = char>, T> TokenizerContext<S, T>


pub fn get_indent_level( &mut self, indent_char: char, chars_per_level: usize, ) -> usize

This function consumes as many indent_char characters (specified by argument) as it can, then returns an "indent level" based on the chars_per_level argument.

This method should only be called at the beginning of a line. It is intended for tokenizers where line indentation is significant, such as in Python. indent_char should usually be a space or a tab.

If a line has N indent characters, the returned value is floor(N / chars_per_level). For example, if chars_per_level is 4, then 8 through 11 indent characters are all considered part of indentation level 2; 12 would be indentation level 3, and 7 would be indentation level 1.
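The level computation is plain integer (floor) division over the count of consumed indent characters, which the worked numbers above confirm:

```rust
// floor(N / chars_per_level): Rust's integer division already floors for
// unsigned operands, so no explicit floor call is needed.
fn indent_level(indent_chars_consumed: usize, chars_per_level: usize) -> usize {
    indent_chars_consumed / chars_per_level
}

fn main() {
    assert_eq!(indent_level(8, 4), 2);
    assert_eq!(indent_level(11, 4), 2); // still level 2
    assert_eq!(indent_level(12, 4), 3);
    assert_eq!(indent_level(7, 4), 1);
}
```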


pub fn skip_whitespace(&mut self)

Repeatedly consume whitespace until a non-whitespace character or EOF.

Examples found in repository?
examples/json.rs (line 128)
pub fn main() {
    use JsonToken::*;

    let code = r##"
        {
            "a": "Assigned to the \"a\" property!",
            "b" : 25.2,
            "c": false,
            "d": true,
            "12": null,
            "list": [
                1,
                -2e+2,
                3.1
            ]
        }
    "##;

    let mut context = TokenizerContext::new(code.chars());

    while context.has_next() {
        // Map single-character tokens to their data.
        let found_token = context.map_single_char_token(|x| match x {
            '{' => Some(OpenBrace),
            '}' => Some(CloseBrace),
            '[' => Some(OpenBracket),
            ']' => Some(CloseBracket),
            ':' => Some(Colon),
            ',' => Some(Comma),
            _ => None,
        });

        // If the above codeblock pushed a new token, move onto the next iteration.
        if found_token {
            continue;
        }

        // Attempt to parse an identifier for certain tokens.
        if let Some((name, span)) = context.try_parse_standard_identifier() {
            let data = match name.as_ref() {
                "null" => Some(Null),
                "true" => Some(Bool(true)),
                "false" => Some(Bool(false)),
                _ => None,
            };

            // If the above table mapped the identifier to valid token data, push a token.
            // If it didn't, report an error. Either way, restart the loop.
            if let Some(data) = data {
                context.push_token(Token::new(data, span));
            } else {
                context.report(ErrorNotification(
                    format!("Unexpected identifier {}", name),
                    span,
                ))
            }

            continue;
        }

        // If the next element in the source code is a string, parse it and report errors as necessary.
        if let Some((result, span)) = context.try_parse_strict_string() {
            match result {
                Ok(string) => {
                    // Push the valid string.
                    context.push_token(Token::new(Str(string), span));
                }
                Err(errors) => {
                    // Create a notification for every error.
                    for error in errors {
                        use ParseCharError::*;
                        use StringTokenError::*;

                        match error {
                            CharError(NoEscape(span)) => ErrorNotification(
                                "Missing escape code after backslash".to_owned(),
                                span,
                            ),
                            CharError(IllegalEscape(char, span)) => {
                                ErrorNotification(format!("Illegal escape code '{}'", char), span)
                            }
                            CharError(NoCharFound) => {
                                unreachable!("Strings will never create this error")
                            }
                            NoClosingDelimiter => ErrorNotification(
                                "Missing closing delimiter on string".to_owned(),
                                span.clone(),
                            ),
                        };
                    }
                }
            }
            continue;
        }

        // If the next character is a minus sign, skip it and set
        // this variable to its span. Otherwise, set it to None.
        //
        // Note: Negatives are parsed during tokenization because no subtraction
        // exists in JSON, every - sign is unary so there's no ambiguity.
        let negative_sign_span = if context.peek_is('-') {
            let span = context.next_span().unwrap().1;
            context.skip_whitespace();
            Some(span)
        } else {
            None
        };

        // Attempt to parse a floating point number. If one was found and it was valid, push
        // a token for it, otherwise report a parsing error.
        if let Some((result, span)) = context.try_parse_float() {
            if let Ok(mut number) = result {
                // If a negative sign was found prior, negate the number.
                // This is completely lossless because floating point numbers use a sign bit.
                negative_sign_span.is_some().then(|| number = -number);

                context.push_token(Token::new(Number(number), span));
            } else {
                context.report(ErrorNotification(
                    "Floating-point number is malformed".to_owned(),
                    span,
                ));
            }

            continue;
        } else if let Some(span) = negative_sign_span {
            // If no number was found, but we DID find a negative sign, then that negative sign is alone and
            // is thus invalid.
            context.report(ErrorNotification(
                "Negative sign should have a number after it".to_owned(),
                span,
            ));
            continue;
        }

        // If whitespace is found, skip it and continue, otherwise throw an error indicating this
        // is an unknown character.
        if context.peek_is_map(|char| char.is_whitespace()) {
            context.skip_whitespace();
        } else {
            let (char, span) = context.next_span().unwrap();

            context.report(ErrorNotification(
                format!("Unexpected character '{char}'"),
                span,
            ))
        }
    }

    println!("{:#?}", context.result());
}
source§

impl<S: Iterator<Item = char>, T> TokenizerContext<S, T>

source

pub fn recover_with(&mut self, predicate: impl Fn(char) -> bool)

This function will repeatedly skip characters until EOF or until it finds a character that matches the input predicate.

While this can be used for any purpose, this is intended to be used for error recovery to bring the tokenizer into a “safe” position to resume tokenization.

Also see: [skip_until]

source

pub fn skip_until(&mut self, match_character: char)

This function will repeatedly skip characters until EOF or until it finds a character that matches the input character.
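The behavior can be sketched over a plain `Peekable<Chars>`. Note this is an illustration, not the crate's code; in particular, whether the real method leaves the matching character unconsumed is an assumption here.

```rust
use std::iter::Peekable;
use std::str::Chars;

// Sketch of skip_until: consume characters until EOF or until the
// matching character is next. The match itself is left unconsumed.
fn skip_until(iter: &mut Peekable<Chars<'_>>, match_character: char) {
    while let Some(&c) = iter.peek() {
        if c == match_character {
            break;
        }
        iter.next();
    }
}

fn main() {
    let mut it = "// comment\nnext".chars().peekable();
    skip_until(&mut it, '\n');
    assert_eq!(it.next(), Some('\n'));
}
```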

Examples found in repository?
examples/foreach.rs (line 61)
fn tokenize(source: &str) -> TokenizerResult<ForeachToken> {
    use ForeachToken::*;

    // Create the reader context
    let mut context = TokenizerContext::new(source.chars());

    // Iterate as long as more characters exist in the tokenizer
    while context.has_next() {
        // Attempt to read an identifier.
        let mut identifier = String::with_capacity(64);
        let identifier_span = context.read_into_while(&mut identifier, is_identifier_char);

        // If span is None, then 0 characters were read; i.e. there is no identifier.
        let Some(span) = identifier_span else {
            // Because there's no identifier here, push a single-character token, if there is one.
            // Consume a single character either way.
            let (char, span) = context.next_span().unwrap();

            let token = match char {
                '[' => OpenBracket,
                ']' => CloseBracket,
                '{' => OpenBrace,
                '}' => CloseBrace,
                ';' => Semicolon,
                _ => continue, // Any other character will just be ignored.
            };

            context.push_token(Token::new(token, span));
            continue;
        };

        // "//" will be matched as an identifier due to language rules.
        // If it's found, then skip until the next newline and continue.
        // Note: Something like "A//" passes this check, this is correct behavior.
        if identifier.starts_with("//") {
            context.skip_until('\n');
            continue;
        }

        // Create a token from the identifier. Some specific identifiers are their own tokens.
        let token = match identifier.as_str() {
            "=" => Assign,
            ":=" => ConstAssign,
            "=>" => Foreach,
            "->" => Return,
            _ => Identifier(identifier),
        };

        // Push the token from above along with the identifier's span.
        context.push_token(Token::new(token, span));
    }

    // Return the result
    context.result()
}
source

pub fn capture_span<R>( &mut self, predicate: impl FnOnce(&mut TokenizerContext<S, T>) -> R, ) -> (R, Span)

Takes an input predicate and executes it immediately. The result of the predicate will be returned alongside a Span that captures the entire affected region of source code of the predicate.

In other words, this method creates a span between the source code position before and after the predicate is executed. The span is then returned alongside the predicate’s result.

The predicate is given the same mutable TokenizerContext reference this method was called on, which avoids needing two simultaneous mutable references.

source

pub fn read_into_while( &mut self, string: &mut String, predicate: impl Fn(&char) -> bool, ) -> Option<Span>

If the next character matches the input predicate, append it to the String argument. Repeat until EOF or until next character doesn’t match.

Returns a Span over all of the consumed characters, or None if no characters were consumed.
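As a standalone sketch over a plain `Peekable<Chars>`: the real method returns an `Option<Span>`, which requires the crate's position tracking, so this illustration returns whether any character was consumed instead.

```rust
use std::iter::Peekable;
use std::str::Chars;

// Sketch of read_into_while: append matching characters to the buffer
// until EOF or the first non-matching character.
fn read_into_while(
    iter: &mut Peekable<Chars<'_>>,
    string: &mut String,
    predicate: impl Fn(&char) -> bool,
) -> bool {
    let mut consumed = false;
    while let Some(c) = iter.peek() {
        if !predicate(c) {
            break;
        }
        string.push(*c);
        iter.next();
        consumed = true;
    }
    consumed
}

fn main() {
    let mut it = "abc123".chars().peekable();
    let mut buf = String::new();
    assert!(read_into_while(&mut it, &mut buf, |c| c.is_ascii_alphabetic()));
    assert_eq!(buf, "abc");
    assert_eq!(it.next(), Some('1')); // the non-matching character remains
}
```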

Examples found in repository?
examples/foreach.rs (line 36)
source

pub fn fold<A>( &mut self, accumulator: A, predicate: impl Fn(char, &mut A) -> ControlFlow<(), ()>, ) -> A

Continuously peeks the next character and executes the predicate function. If the predicate function returns ControlFlow::Break, stop and return the accumulator. If the predicate function returns ControlFlow::Continue, skip the character that was peeked and repeat.

The predicate function is supplied the peeked character and a mutable reference to the current accumulator. The predicate will not be called if EOF is reached.
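The peek-then-decide loop described above can be sketched over a plain `Peekable<Chars>` (an illustration of the semantics, not the crate's implementation):

```rust
use std::iter::Peekable;
use std::ops::ControlFlow;
use std::str::Chars;

// Sketch of fold: peek, run the predicate, and only consume the character
// when the predicate returns Continue; Break stops without consuming.
fn fold<A>(
    iter: &mut Peekable<Chars<'_>>,
    mut accumulator: A,
    predicate: impl Fn(char, &mut A) -> ControlFlow<(), ()>,
) -> A {
    while let Some(&c) = iter.peek() {
        match predicate(c, &mut accumulator) {
            ControlFlow::Continue(()) => {
                iter.next();
            }
            ControlFlow::Break(()) => break,
        }
    }
    accumulator
}

fn main() {
    // Accumulate a decimal number, stopping at the first non-digit.
    let mut it = "123abc".chars().peekable();
    let n = fold(&mut it, 0u32, |c, acc| match c.to_digit(10) {
        Some(d) => {
            *acc = *acc * 10 + d;
            ControlFlow::Continue(())
        }
        None => ControlFlow::Break(()),
    });
    assert_eq!(n, 123);
    assert_eq!(it.next(), Some('a')); // the non-digit was not consumed
}
```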

source

pub fn map_single_char_token( &mut self, predicate: impl FnOnce(char) -> Option<T>, ) -> bool

Peeks a single character and passes it into the predicate. If the predicate returns None, nothing happens. If the predicate returns Some token data, the character is consumed and its span is used to construct a Token alongside the returned data; the constructed token is then pushed to this context's token list.

This method returns true if a token was pushed, and false if nothing occurred. If this context has no characters left to iterate, this method does nothing.

This method is intended to be used to easily handle single-character tokens.

Examples found in repository?
examples/brainfuck.rs (lines 31-41)
fn tokenize(source: &str) -> TokenizerResult<BFTokenType> {
    use BFTokenType::*;

    // Create the reader context
    let mut context = TokenizerContext::new(source.chars());

    // Repeat this code while there are more characters in the source code.
    while context.has_next() {
        // Attempt to map these characters to their respective tokens.
        let pushed_token = context.map_single_char_token(|char| match char {
            '+' => Some(Increment),
            '-' => Some(Decrement),
            '>' => Some(MoveRight),
            '<' => Some(MoveLeft),
            '[' => Some(BeginWhile),
            ']' => Some(EndWhile),
            '.' => Some(WriteIO),
            ',' => Some(ReadIO),
            _ => None,
        });

        // If a token was NOT pushed above— i.e. it was a different character, just skip it and move on.
        if !pushed_token {
            context.skip();
        }
    }

    // Return the result
    context.result()
}
More examples
examples/expression.rs (lines 41-50)
fn main() {
    use MathSymbol::*;

    // The test program we're going to tokenize.
    let program = "23 * (012 - 3) / 1_2_3 + 5e-3";

    // The TokenizerContext for our example program.
    let mut context = TokenizerContext::new(program.chars());

    // While there are more characters in the source code...
    while context.has_next() {
        // If the next character is one of these, push its respective token.
        let single_char_pushed = context.map_single_char_token(|char| match char {
            '+' => Some(Plus),
            '-' => Some(Minus),
            '*' => Some(Times),
            '/' => Some(Divide),
            '%' => Some(Modulo),
            '(' => Some(OpenParen),
            ')' => Some(CloseParen),
            _ => None,
        });

        // If the above properly pushed a token, skip this iteration.
        if single_char_pushed {
            continue;
        }

        // Try to parse a floating point number (f64), throw an error or push the token as necessary.
        if let Some((result, span)) = context.try_parse_float() {
            if let Ok(value) = result {
                context.push_token(Token::new(Number(value), span));
            } else {
                context.report(ErrorNotification(String::from("Malformed number."), span));
            }

            continue;
        }

        // If the above didn't occur, skip the next character. Throw an error if it isn't whitespace.
        if let Some((char, span)) = context.next_span() {
            if !char.is_whitespace() {
                context.report(ErrorNotification(
                    format!("Unexpected character `{}` in expression.", char),
                    span,
                ));
            }
        }
    }

    // Print the result— this example should have no error notifications.
    println!("{:#?}", context.result());
}
examples/json.rs (lines 48-56)
source§

impl<'src, TokenData> TokenizerContext<Peekable<Chars<'src>>, TokenData>

source

pub fn new_str(string: &'src str) -> Self

Create a new TokenizerContext from a string. This method will generally be more useful than new.

source§

impl<'src, TokenData, F: FnMut(Result<char, ReadCharError>) -> char> TokenizerContext<Map<CharsRaw<'src, BufReader<File>>, F>, TokenData>

source

pub fn new_file(reader: &'src mut BufReader<File>, handler: F) -> Self

Create a new TokenizerContext from a BufReader.

Examples found in repository?
examples/file/main.rs (lines 56-59)
pub fn main() {
    // Buffer for the input path
    let mut str_path = String::new();

    println!(
        "Current path: {:?}",
        std::env::current_dir().expect("No current dir exists.")
    );
    println!("Leave input blank to use default path.");
    print!("Input path to file: ");

    stdout().lock().flush().expect("Could not flush stdout.");

    // Read path input
    stdin()
        .lock()
        .read_line(&mut str_path)
        .expect("Could not read stdin.");

    let trimmed = str_path.trim();

    // Get path from string.
    let path = if trimmed.is_empty() {
        Path::new("./examples/file/file.txt")
    } else {
        Path::new(trimmed)
    };

    println!();
    println!("Path: {:?}", path);

    // Load file from path
    let file = File::open(path).expect("Could not open file.");

    // Get a buffered reader of the file.
    let mut reader = BufReader::new(file);

    // Create a context from the BufReader. The closure dictates
    // how we want to handle any read failures— in this case just panic.
    let mut context = TokenizerContext::new_file(&mut reader, |x| match x {
        Ok(char) => char,
        Err(_) => panic!("Unable to read file."),
    });

    // Tokenizer logic
    while context.has_next() {
        if let Some((ident, span)) = context.try_parse_standard_identifier() {
            context.push_token(Token::new(ident, span));
        } else {
            context.skip();
        }
    }

    // Print result
    println!();
    println!("{:#?}", context.result());
}
source§

impl<Source: Iterator<Item = char>, TokenData> TokenizerContext<Source, TokenData>

source

pub fn new(source: impl IntoIterator<IntoIter = Source, Item = char>) -> Self

Create a new TokenizerContext from a character iterator.

Examples found in repository?
examples/brainfuck.rs (line 26)
More examples
examples/iterator.rs (line 20)
fn main() {
    let source = "this is my source code :3";
    let mut ctx = TokenizerContext::new(source.chars());

    // Collect every pair of characters into a Pair token.
    // If there is a lone character at the end, collect it into a Single token.
    let iter = ctx.create_iterator(|context| {
        // If there's another character to look at, consume it and continue
        if let Some((char1, span1)) = context.next_span() {
            // If another character follows this one, consume it and create a Pair token.
            // If there is no character after this one, create a Single token.
            if let Some((char2, span2)) = context.next_span() {
                context.push_token(Token::new(
                    MyToken::Pair(char1, char2),
                    span1.between(&span2),
                ))
            } else {
                context.push_token(Token::new(MyToken::Single(char1), span1))
            }
        }
    });

    // Print each token in the iterator.
    // This has the benefit of not having every single token in-memory at once.
    for token in iter {
        println!("{:?}", token)    
    }
}
examples/expression.rs (line 36)
examples/foreach.rs (line 30)
examples/json.rs (line 44)
pub fn main() {
    use JsonToken::*;

    let code = r##"
        {
            "a": "Assigned to the \"a\" property!",
            "b" : 25.2,
            "c": false,
            "d": true,
            "12": null,
            "list": [
                1,
                -2e+2,
                3.1
            ]
        }
    "##;

    let mut context = TokenizerContext::new(code.chars());

    while context.has_next() {
        // Map single-character tokens to their data.
        let found_token = context.map_single_char_token(|x| match x {
            '{' => Some(OpenBrace),
            '}' => Some(CloseBrace),
            '[' => Some(OpenBracket),
            ']' => Some(CloseBracket),
            ':' => Some(Colon),
            ',' => Some(Comma),
            _ => None,
        });

        // If the above codeblock pushed a new token, move onto the next iteration.
        if found_token {
            continue;
        }

        // Attempt to parse an identifier for certain tokens.
        if let Some((name, span)) = context.try_parse_standard_identifier() {
            let data = match name.as_ref() {
                "null" => Some(Null),
                "true" => Some(Bool(true)),
                "false" => Some(Bool(false)),
                _ => None,
            };

            // If the above table mapped the identifier to valid token data, push a token.
            // If it didn't, report an error; either way, restart the loop.
            if let Some(data) = data {
                context.push_token(Token::new(data, span));
            } else {
                context.report(ErrorNotification(
                    format!("Unexpected identifier {}", name),
                    span,
                ))
            }

            continue;
        }

        // If the next element in the source code is a string, parse it and report errors as necessary.
        if let Some((result, span)) = context.try_parse_strict_string() {
            match result {
                Ok(string) => {
                    // Push the valid string.
                    context.push_token(Token::new(Str(string), span));
                }
                Err(errors) => {
                    // Create a notification for every error.
                    for error in errors {
                        use ParseCharError::*;
                        use StringTokenError::*;

                        // Wrap the match in `report` so each notification is
                        // actually delivered rather than silently discarded.
                        context.report(match error {
                            CharError(NoEscape(span)) => ErrorNotification(
                                "Missing escape code after backslash".to_owned(),
                                span,
                            ),
                            CharError(IllegalEscape(char, span)) => {
                                ErrorNotification(format!("Illegal escape code '{}'", char), span)
                            }
                            CharError(NoCharFound) => {
                                unreachable!("Strings will never create this error")
                            }
                            NoClosingDelimiter => ErrorNotification(
                                "Missing closing delimiter on string".to_owned(),
                                span.clone(),
                            ),
                        });
                    }
                }
            }
            continue;
        }

        // If the next character is a minus sign, skip it and set
        // this variable to its span. Otherwise, set it to None.
        //
        // Note: Negatives are parsed during tokenization because no subtraction
        // exists in JSON, every - sign is unary so there's no ambiguity.
        let negative_sign_span = if context.peek_is('-') {
            let span = context.next_span().unwrap().1;
            context.skip_whitespace();
            Some(span)
        } else {
            None
        };

        // Attempt to parse a floating point number. If one was found and it was valid, push
        // a token for it, otherwise report a parsing error.
        if let Some((result, span)) = context.try_parse_float() {
            if let Ok(mut number) = result {
                // If a negative sign was found prior, negate the number.
                // This is completely lossless because floating-point numbers use a sign bit.
                if negative_sign_span.is_some() {
                    number = -number;
                }

                context.push_token(Token::new(Number(number), span));
            } else {
                context.report(ErrorNotification(
                    "Floating-point number is malformed".to_owned(),
                    span,
                ));
            }

            continue;
        } else if let Some(span) = negative_sign_span {
            // If no number was found, but we DID find a negative sign, then that negative sign is alone and
            // is thus invalid.
            context.report(ErrorNotification(
                "Negative sign should have a number after it".to_owned(),
                span,
            ));
            continue;
        }

        // If whitespace is found, skip it and continue, otherwise throw an error indicating this
        // is an unknown character.
        if context.peek_is_map(|char| char.is_whitespace()) {
            context.skip_whitespace();
        } else {
            let (char, span) = context.next_span().unwrap();

            context.report(ErrorNotification(
                format!("Unexpected character '{char}'"),
                span,
            ))
        }
    }

    println!("{:#?}", context.result());
}

pub fn next(&mut self) -> Option<char>

Returns the next character in the source code and advances the underlying iterator. If the position of this character is needed, either use span before calling this method, or use next_span.
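The contract of next together with peek mirrors the standard library's Peekable iterator: look ahead without consuming, then consume only once a check passes. A minimal std-only sketch of that pattern (the `first_word` helper is hypothetical and does not use this crate):

```rust
// Sketch of the peek-then-next pattern using std's Peekable.
// TokenizerContext layers span tracking on top of this same shape.
fn first_word(source: &str) -> String {
    let mut chars = source.chars().peekable();
    let mut word = String::new();
    // Peek before consuming: only advance while the predicate holds.
    while let Some(&c) = chars.peek() {
        if c.is_alphanumeric() {
            word.push(c);
            chars.next(); // consume only after the check passed
        } else {
            break;
        }
    }
    word
}

fn main() {
    assert_eq!(first_word("let x = 5;"), "let");
    assert_eq!(first_word("  leading"), "");
}
```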


pub fn next_span(&mut self) -> Option<(char, Span)>

The same as next but includes the Span of the next character in the returned value.
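As a rough std-only analogy for pairing a character with its location, char_indices yields byte offsets that can be turned into ranges. In the sketch below, the crate's Span type is replaced by a plain byte range, and the `next_span` free function is a hypothetical stand-in, not the crate's method:

```rust
use std::str::CharIndices;

// Hypothetical stand-in: pair each char with its byte range in the source,
// the way TokenizerContext::next_span pairs a char with its Span.
fn next_span(iter: &mut CharIndices<'_>) -> Option<(char, std::ops::Range<usize>)> {
    iter.next().map(|(start, c)| (c, start..start + c.len_utf8()))
}

fn main() {
    let mut it = "hé".char_indices();
    // 'h' occupies one byte; 'é' occupies two, so its range is 1..3.
    assert_eq!(next_span(&mut it), Some(('h', 0..1)));
    assert_eq!(next_span(&mut it), Some(('é', 1..3)));
    assert_eq!(next_span(&mut it), None);
}
```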

Examples found in repository?
examples/iterator.rs (line 26)
fn main() {
    let source = "this is my source code :3";
    let mut ctx = TokenizerContext::new(source.chars());

    // Collect every pair of characters into a Pair token.
    // If there is a lone character at the end, collect it into a Single token.
    let iter = ctx.create_iterator(|context| {
        // If there's another character to look at, consume it and continue
        if let Some((char1, span1)) = context.next_span() {
            // If another character follows this one, consume it and create a Pair token.
            // If there is no character after this one, create a Single token.
            if let Some((char2, span2)) = context.next_span() {
                context.push_token(Token::new(
                    MyToken::Pair(char1, char2),
                    span1.between(&span2),
                ))
            } else {
                context.push_token(Token::new(MyToken::Single(char1), span1))
            }
        }
    });

    // Print each token in the iterator.
    // This has the benefit of not having every single token in-memory at once.
    for token in iter {
        println!("{:?}", token);
    }
}
More examples
examples/expression.rs (line 69)
fn main() {
    use MathSymbol::*;

    // The test program we're going to tokenize.
    let program = "23 * (012 - 3) / 1_2_3 + 5e-3";

    // The TokenizerContext for our example program.
    let mut context = TokenizerContext::new(program.chars());

    // While there are more characters in the source code...
    while context.has_next() {
        // If the next character is one of these, push its respective token.
        let single_char_pushed = context.map_single_char_token(|char| match char {
            '+' => Some(Plus),
            '-' => Some(Minus),
            '*' => Some(Times),
            '/' => Some(Divide),
            '%' => Some(Modulo),
            '(' => Some(OpenParen),
            ')' => Some(CloseParen),
            _ => None,
        });

        // If the above properly pushed a token, skip this iteration.
        if single_char_pushed {
            continue;
        }

        // Try to parse a floating point number (f64), throw an error or push the token as necessary.
        if let Some((result, span)) = context.try_parse_float() {
            if let Ok(value) = result {
                context.push_token(Token::new(Number(value), span));
            } else {
                context.report(ErrorNotification(String::from("Malformed number."), span));
            }

            continue;
        }

        // If the above didn't occur, skip the next character. Throw an error if it isn't whitespace.
        if let Some((char, span)) = context.next_span() {
            if !char.is_whitespace() {
                context.report(ErrorNotification(
                    format!("Unexpected character `{}` in expression.", char),
                    span,
                ));
            }
        }
    }

    // Print the result; this example should have no error notifications.
    println!("{:#?}", context.result());
}
examples/foreach.rs (line 42)
examples/json.rs (line 127)

pub fn peek(&mut self) -> Option<char>

Return the next character in the source code, but do not progress the underlying iterator.


pub fn has_next(&mut self) -> bool

Returns true if there are more characters in the source code to iterate over. If you are about to read the next character anyway, prefer calling peek and checking whether it returns None.
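Behaviorally, has_next is equivalent to checking that lookahead yields a character. A minimal std-only sketch of that equivalence (the free `has_next` function below is hypothetical, not the crate's code):

```rust
use std::iter::Peekable;
use std::str::Chars;

// Hypothetical equivalent of TokenizerContext::has_next in terms of peek:
// there is a next character exactly when lookahead yields Some.
fn has_next(chars: &mut Peekable<Chars<'_>>) -> bool {
    chars.peek().is_some()
}

fn main() {
    let mut chars = "ab".chars().peekable();
    assert!(has_next(&mut chars));
    chars.next(); // consume 'a'
    assert!(has_next(&mut chars));
    chars.next(); // consume 'b'
    assert!(!has_next(&mut chars));
}
```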

Examples found in repository?
examples/brainfuck.rs (line 29)
fn tokenize(source: &str) -> TokenizerResult<BFTokenType> {
    use BFTokenType::*;

    // Create the reader context
    let mut context = TokenizerContext::new(source.chars());

    // Repeat this code while there are more characters in the source code.
    while context.has_next() {
        // Attempt to map these characters to their respective tokens.
        let pushed_token = context.map_single_char_token(|char| match char {
            '+' => Some(Increment),
            '-' => Some(Decrement),
            '>' => Some(MoveRight),
            '<' => Some(MoveLeft),
            '[' => Some(BeginWhile),
            ']' => Some(EndWhile),
            '.' => Some(WriteIO),
            ',' => Some(ReadIO),
            _ => None,
        });

        // If a token was NOT pushed above (i.e. it was some other character), just skip it and move on.
        if !pushed_token {
            context.skip();
        }
    }

    // Return the result
    context.result()
}
More examples
examples/file/main.rs (line 62)
pub fn main() {
    // Buffer for the input path
    let mut str_path = String::new();

    println!(
        "Current path: {:?}",
        std::env::current_dir().expect("No current dir exists.")
    );
    println!("Leave input blank to use default path.");
    print!("Input path to file: ");

    stdout().lock().flush().expect("Could not flush stdout.");

    // Read path input
    stdin()
        .lock()
        .read_line(&mut str_path)
        .expect("Could not read stdin.");

    let trimmed = str_path.trim();

    // Get path from string.
    let path = if trimmed.is_empty() {
        Path::new("./examples/file/file.txt")
    } else {
        Path::new(trimmed)
    };

    println!();
    println!("Path: {:?}", path);

    // Load file from path
    let file = File::open(path).expect("Could not open file.");

    // Get a buffered reader of the file.
    let mut reader = BufReader::new(file);

    // Create a context from the BufReader. The closure dictates
    // how we want to handle any read failures— in this case just panic.
    let mut context = TokenizerContext::new_file(&mut reader, |x| match x {
        Ok(char) => char,
        Err(_) => panic!("Unable to read file."),
    });

    // Tokenizer logic
    while context.has_next() {
        if let Some((ident, span)) = context.try_parse_standard_identifier() {
            context.push_token(Token::new(ident, span));
        } else {
            context.skip();
        }
    }

    // Print result
    println!();
    println!("{:#?}", context.result());
}
examples/expression.rs (line 39)
examples/foreach.rs (line 33)
examples/json.rs (line 46)
source

pub fn skip(&mut self)

Advances the underlying iterator, discarding a single character.

Examples found in repository?
examples/brainfuck.rs (line 45)
fn tokenize(source: &str) -> TokenizerResult<BFTokenType> {
    use BFTokenType::*;

    // Create the reader context
    let mut context = TokenizerContext::new(source.chars());

    // Repeat this code while there are more characters in the source code.
    while context.has_next() {
        // Attempt to map these characters to their respective tokens.
        let pushed_token = context.map_single_char_token(|char| match char {
            '+' => Some(Increment),
            '-' => Some(Decrement),
            '>' => Some(MoveRight),
            '<' => Some(MoveLeft),
            '[' => Some(BeginWhile),
            ']' => Some(EndWhile),
            '.' => Some(WriteIO),
            ',' => Some(ReadIO),
            _ => None,
        });

        // If a token was NOT pushed above— i.e. it was a different character, just skip it and move on.
        if !pushed_token {
            context.skip();
        }
    }

    // Return the result
    context.result()
}
examples/file/main.rs (line 66)
pub fn main() {
    // Buffer for the input path
    let mut str_path = String::new();

    println!(
        "Current path: {:?}",
        std::env::current_dir().expect("No current dir exists.")
    );
    println!("Leave input blank to use default path.");
    print!("Input path to file: ");

    stdout().lock().flush().expect("Could not flush stdout.");

    // Read path input
    stdin()
        .lock()
        .read_line(&mut str_path)
        .expect("Could not read stdin.");

    let trimmed = str_path.trim();

    // Get path from string.
    let path = if trimmed.is_empty() {
        Path::new("./examples/file/file.txt")
    } else {
        Path::new(trimmed)
    };

    println!();
    println!("Path: {:?}", path);

    // Load file from path
    let file = File::open(path).expect("Could not open file.");

    // Get a buffered reader of the file.
    let mut reader = BufReader::new(file);

    // Create a context from the BufReader. The closure dictates
    // how we want to handle any read failures— in this case just panic.
    let mut context = TokenizerContext::new_file(&mut reader, |x| match x {
        Ok(char) => char,
        Err(_) => panic!("Unable to read file."),
    });

    // Tokenizer logic
    while context.has_next() {
        if let Some((ident, span)) = context.try_parse_standard_identifier() {
            context.push_token(Token::new(ident, span));
        } else {
            context.skip();
        }
    }

    // Print result
    println!();
    println!("{:#?}", context.result());
}
source

pub fn peek_is(&mut self, char: char) -> bool

Returns true if the next character is equal to the argument character. Returns false on EOF. Does not modify the underlying iterator.

Examples found in repository?
examples/json.rs (line 126)
pub fn main() {
    use JsonToken::*;

    let code = r##"
        {
            "a": "Assigned to the \"a\" property!",
            "b" : 25.2,
            "c": false,
            "d": true,
            "12": null,
            "list": [
                1,
                -2e+2,
                3.1
            ]
        }
    "##;

    let mut context = TokenizerContext::new(code.chars());

    while context.has_next() {
        // Map single-character tokens to their data.
        let found_token = context.map_single_char_token(|x| match x {
            '{' => Some(OpenBrace),
            '}' => Some(CloseBrace),
            '[' => Some(OpenBracket),
            ']' => Some(CloseBracket),
            ':' => Some(Colon),
            ',' => Some(Comma),
            _ => None,
        });

        // If the above codeblock pushed a new token, move onto the next iteration.
        if found_token {
            continue;
        }

        // Attempt to parse an identifier for certain tokens.
        if let Some((name, span)) = context.try_parse_standard_identifier() {
            let data = match name.as_ref() {
                "null" => Some(Null),
                "true" => Some(Bool(true)),
                "false" => Some(Bool(false)),
                _ => None,
            };

            // If the above table mapped the identifier to valid token data, push a token.
            // If it didn't, report an error. Either way, restart the loop.
            if let Some(data) = data {
                context.push_token(Token::new(data, span));
            } else {
                context.report(ErrorNotification(
                    format!("Unexpected identifier {}", name),
                    span,
                ))
            }

            continue;
        }

        // If the next element in the source code is a string, parse it and report errors as necessary.
        if let Some((result, span)) = context.try_parse_strict_string() {
            match result {
                Ok(string) => {
                    // Push the valid string.
                    context.push_token(Token::new(Str(string), span));
                }
                Err(errors) => {
                    // Create a notification for every error.
                    for error in errors {
                        use ParseCharError::*;
                        use StringTokenError::*;

                        context.report(match error {
                            CharError(NoEscape(span)) => ErrorNotification(
                                "Missing escape code after backslash".to_owned(),
                                span,
                            ),
                            CharError(IllegalEscape(char, span)) => {
                                ErrorNotification(format!("Illegal escape code '{}'", char), span)
                            }
                            CharError(NoCharFound) => {
                                unreachable!("Strings will never create this error")
                            }
                            NoClosingDelimiter => ErrorNotification(
                                "Missing closing delimiter on string".to_owned(),
                                span.clone(),
                            ),
                        });
                    }
                }
            }
            continue;
        }

        // If the next character is a minus sign, skip it and set
        // this variable to its span. Otherwise, set it to None.
        //
        // Note: Negatives are parsed during tokenization because no subtraction
        // exists in JSON, every - sign is unary so there's no ambiguity.
        let negative_sign_span = if context.peek_is('-') {
            let span = context.next_span().unwrap().1;
            context.skip_whitespace();
            Some(span)
        } else {
            None
        };

        // Attempt to parse a floating point number. If one was found and it was valid, push
        // a token for it, otherwise report a parsing error.
        if let Some((result, span)) = context.try_parse_float() {
            if let Ok(mut number) = result {
                // If a negative sign was found prior, negate the number.
                // This is completely lossless because floating point numbers use a sign bit.
                if negative_sign_span.is_some() {
                    number = -number;
                }

                context.push_token(Token::new(Number(number), span));
            } else {
                context.report(ErrorNotification(
                    "Floating-point number is malformed".to_owned(),
                    span,
                ));
            }

            continue;
        } else if let Some(span) = negative_sign_span {
            // If no number was found, but we DID find a negative sign, then that negative sign is alone and
            // is thus invalid.
            context.report(ErrorNotification(
                "Negative sign should have a number after it".to_owned(),
                span,
            ));
            continue;
        }

        // If whitespace is found, skip it and continue, otherwise throw an error indicating this
        // is an unknown character.
        if context.peek_is_map(|char| char.is_whitespace()) {
            context.skip_whitespace();
        } else {
            let (char, span) = context.next_span().unwrap();

            context.report(ErrorNotification(
                format!("Unexpected character '{char}'"),
                span,
            ))
        }
    }

    println!("{:#?}", context.result());
}
source

pub fn peek_is_map(&mut self, predicate: impl FnOnce(&char) -> bool) -> bool

Returns true if the next character matches the predicate. Returns false on EOF. Does not modify the underlying iterator.

Examples found in repository?
examples/json.rs (line 163)
source

pub fn peek_is_not(&mut self, char: char) -> bool

Returns true if the next character isn’t equal to the argument character. Returns true on EOF. Does not modify the underlying iterator.
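The non-consuming peek behavior described above can be modeled with the standard library's Peekable iterator. This is an illustrative sketch of the semantics only, not the crate's implementation; the helper function name is hypothetical:

```rust
// Sketch of peek_is_not semantics over a plain char iterator.
// Returns true when the next character differs from `expected`,
// and also true at EOF; the iterator is never advanced.
fn peek_is_not<I: Iterator<Item = char>>(
    iter: &mut std::iter::Peekable<I>,
    expected: char,
) -> bool {
    match iter.peek() {
        Some(&c) => c != expected,
        None => true, // EOF counts as "not equal"
    }
}

fn main() {
    let mut it = "ab".chars().peekable();
    assert!(peek_is_not(&mut it, 'b')); // next is 'a'
    assert!(!peek_is_not(&mut it, 'a')); // still 'a': peeking didn't consume it
    it.next();
    it.next();
    assert!(peek_is_not(&mut it, 'a')); // EOF
}
```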

source

pub fn span(&self) -> &Span

Get a Span of the character currently pointed to by this reader. If no next character exists, this Span will be the position of a theoretical non-newline character after the final character in the source code.

This method may be useful when a position in the source code needs to be “saved” for later reference. For example, at the beginning of a multi-character token like a number. When using it for this purpose, Span::between may be helpful.
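The "save a position, parse, then combine" pattern can be sketched with a minimal stand-in for Span. The start/end fields and the merging behavior of `between` here are assumptions for illustration; the real type lives in alkale:

```rust
// Minimal stand-in for a source span: offsets into the source text.
#[derive(Clone, Debug, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

impl Span {
    // Combine two spans into one covering both, mirroring how
    // Span::between is used in the crate's examples.
    fn between(&self, other: &Span) -> Span {
        Span {
            start: self.start.min(other.start),
            end: self.end.max(other.end),
        }
    }
}

fn main() {
    // "Save" the span at the start of a multi-character token...
    let start = Span { start: 4, end: 5 };
    // ...consume characters until the token's final span...
    let last = Span { start: 9, end: 10 };
    // ...then produce one span covering the whole token.
    let token_span = start.between(&last);
    assert_eq!(token_span, Span { start: 4, end: 10 });
}
```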

source

pub fn report(&mut self, notification: impl Notification + 'static)

Logs a notification in this TokenizerContext which can be referenced later.

Examples found in repository?
examples/expression.rs (line 62)
fn main() {
    use MathSymbol::*;

    // The test program we're going to tokenize.
    let program = "23 * (012 - 3) / 1_2_3 + 5e-3";

    // The TokenizerContext for our example program.
    let mut context = TokenizerContext::new(program.chars());

    // While there are more characters in the source code...
    while context.has_next() {
        // If the next character is one of these, push its respective token.
        let single_char_pushed = context.map_single_char_token(|char| match char {
            '+' => Some(Plus),
            '-' => Some(Minus),
            '*' => Some(Times),
            '/' => Some(Divide),
            '%' => Some(Modulo),
            '(' => Some(OpenParen),
            ')' => Some(CloseParen),
            _ => None,
        });

        // If the above properly pushed a token, skip this iteration.
        if single_char_pushed {
            continue;
        }

        // Try to parse a floating point number (f64), throw an error or push the token as necessary.
        if let Some((result, span)) = context.try_parse_float() {
            if let Ok(value) = result {
                context.push_token(Token::new(Number(value), span));
            } else {
                context.report(ErrorNotification(String::from("Malformed number."), span));
            }

            continue;
        }

        // If the above didn't occur, skip the next character. Throw an error if it isn't whitespace.
        if let Some((char, span)) = context.next_span() {
            if !char.is_whitespace() {
                context.report(ErrorNotification(
                    format!("Unexpected character `{}` in expression.", char),
                    span,
                ));
            }
        }
    }

    // Print the result— this example should have no error notifications.
    println!("{:#?}", context.result());
}
examples/json.rs (lines 77-80)
source

pub fn report_direct(&mut self, notification: Box<dyn Notification>)

Logs a boxed notification in this TokenizerContext which can be referenced later.
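The difference between report and report_direct is only in who boxes the notification. That pattern is ordinary trait-object boxing, sketched here with hypothetical stand-in types (the Notification trait, Context struct, and ErrorNote are illustrative, not the crate's definitions):

```rust
// Sketch of the report / report_direct pattern: one method boxes the
// value for you, the *_direct variant accepts an existing Box.
trait Notification {
    fn message(&self) -> String;
}

struct ErrorNote(String);
impl Notification for ErrorNote {
    fn message(&self) -> String {
        self.0.clone()
    }
}

struct Context {
    notifications: Vec<Box<dyn Notification>>,
}

impl Context {
    // Like report: takes any Notification and boxes it internally.
    fn report(&mut self, n: impl Notification + 'static) {
        self.notifications.push(Box::new(n));
    }
    // Like report_direct: takes an already-boxed notification, useful
    // when the caller already holds a Box<dyn Notification>.
    fn report_direct(&mut self, n: Box<dyn Notification>) {
        self.notifications.push(n);
    }
}

fn main() {
    let mut ctx = Context { notifications: Vec::new() };
    ctx.report(ErrorNote("unboxed at the call site".into()));
    ctx.report_direct(Box::new(ErrorNote("boxed by the caller".into())));
    assert_eq!(ctx.notifications.len(), 2);
}
```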

source

pub fn push_token(&mut self, token: Token<TokenData>)

Push a token to this TokenizerContext’s result.

Examples found in repository?
examples/iterator.rs (lines 30-33)
fn main() {
    let source = "this is my source code :3";
    let mut ctx = TokenizerContext::new(source.chars());

    // Collect every pair of characters into a Pair token.
    // If there is a lone character at the end, collect it into a Single token.
    let iter = ctx.create_iterator(|context| {
        // If there's another character to look at, consume it and continue
        if let Some((char1, span1)) = context.next_span() {
            // If another character follows this one, consume it and create a Pair token.
            // If there is no character after this one, create a Single token.
            if let Some((char2, span2)) = context.next_span() {
                context.push_token(Token::new(
                    MyToken::Pair(char1, char2),
                    span1.between(&span2),
                ))
            } else {
                context.push_token(Token::new(MyToken::Single(char1), span1))
            }
        }
    });

    // Print each token in the iterator.
    // This has the benefit of not having every single token in-memory at once.
    for token in iter {
        println!("{:?}", token)    
    }
}
examples/file/main.rs (line 64)
examples/expression.rs (line 60)
examples/foreach.rs (line 53)
fn tokenize(source: &str) -> TokenizerResult<ForeachToken> {
    use ForeachToken::*;

    // Create the reader context
    let mut context = TokenizerContext::new(source.chars());

    // Iterate as long as more characters exist in the tokenizer
    while context.has_next() {
        // Attempt to read an identifier.
        let mut identifier = String::with_capacity(64);
        let identifier_span = context.read_into_while(&mut identifier, is_identifier_char);

        // If span is None, then 0 characters were read; i.e. there is no identifier.
        let Some(span) = identifier_span else {
            // Because there's no identifier here, push a single-character token, if there is one.
            // Consume a single character either way.
            let (char, span) = context.next_span().unwrap();

            let token = match char {
                '[' => OpenBracket,
                ']' => CloseBracket,
                '{' => OpenBrace,
                '}' => CloseBrace,
                ';' => Semicolon,
                _ => continue, // Any other character will just be ignored.
            };

            context.push_token(Token::new(token, span));
            continue;
        };

        // "//" will be matched as an identifier due to language rules.
        // If it's found, then skip until the next newline and continue.
        // Note: Something like "A//" passes this check; this is intended behavior.
        if identifier.starts_with("//") {
            context.skip_until('\n');
            continue;
        }

        // Create a token from the identifier. Some specific identifiers are their own tokens.
        let token = match identifier.as_str() {
            "=" => Assign,
            ":=" => ConstAssign,
            "=>" => Foreach,
            "->" => Return,
            _ => Identifier(identifier),
        };

        // Push the token from above along with the identifier's span.
        context.push_token(Token::new(token, span));
    }

    // Return the result
    context.result()
}
examples/json.rs (line 75)
pub fn main() {
    use JsonToken::*;

    let code = r##"
        {
            "a": "Assigned to the \"a\" property!",
            "b" : 25.2,
            "c": false,
            "d": true,
            "12": null,
            "list": [
                1,
                -2e+2,
                3.1
            ]
        }
    "##;

    let mut context = TokenizerContext::new(code.chars());

    while context.has_next() {
        // Map single-character tokens to their data.
        let found_token = context.map_single_char_token(|x| match x {
            '{' => Some(OpenBrace),
            '}' => Some(CloseBrace),
            '[' => Some(OpenBracket),
            ']' => Some(CloseBracket),
            ':' => Some(Colon),
            ',' => Some(Comma),
            _ => None,
        });

        // If the above codeblock pushed a new token, move onto the next iteration.
        if found_token {
            continue;
        }

        // Attempt to parse an identifier for certain tokens.
        if let Some((name, span)) = context.try_parse_standard_identifier() {
            let data = match name.as_ref() {
                "null" => Some(Null),
                "true" => Some(Bool(true)),
                "false" => Some(Bool(false)),
                _ => None,
            };

            // If the above table mapped the identifier to valid token data, push a token.
            // If it didn't, report an error. Either way, restart the loop.
            if let Some(data) = data {
                context.push_token(Token::new(data, span));
            } else {
                context.report(ErrorNotification(
                    format!("Unexpected identifier {}", name),
                    span,
                ))
            }

            continue;
        }

        // If the next element in the source code is a string, parse it and report errors as necessary.
        if let Some((result, span)) = context.try_parse_strict_string() {
            match result {
                Ok(string) => {
                    // Push the valid string.
                    context.push_token(Token::new(Str(string), span));
                }
                Err(errors) => {
                    // Create and report a notification for every error.
                    for error in errors {
                        use ParseCharError::*;
                        use StringTokenError::*;

                        context.report(match error {
                            CharError(NoEscape(span)) => ErrorNotification(
                                "Missing escape code after backslash".to_owned(),
                                span,
                            ),
                            CharError(IllegalEscape(char, span)) => {
                                ErrorNotification(format!("Illegal escape code '{}'", char), span)
                            }
                            CharError(NoCharFound) => {
                                unreachable!("Strings will never create this error")
                            }
                            NoClosingDelimiter => ErrorNotification(
                                "Missing closing delimiter on string".to_owned(),
                                span.clone(),
                            ),
                        });
                    }
                }
            }
            continue;
        }

        // If the next character is a minus sign, skip it and set
        // this variable to its span. Otherwise, set it to None.
        //
        // Note: Negatives are parsed during tokenization because JSON has no
        // subtraction; every '-' sign is unary, so there's no ambiguity.
        let negative_sign_span = if context.peek_is('-') {
            let span = context.next_span().unwrap().1;
            context.skip_whitespace();
            Some(span)
        } else {
            None
        };

        // Attempt to parse a floating point number. If one was found and it was valid, push
        // a token for it, otherwise report a parsing error.
        if let Some((result, span)) = context.try_parse_float() {
            if let Ok(mut number) = result {
                // If a negative sign was found prior, negate the number.
                // This is completely lossless because floating point numbers use a sign bit.
                if negative_sign_span.is_some() {
                    number = -number;
                }

                context.push_token(Token::new(Number(number), span));
            } else {
                context.report(ErrorNotification(
                    "Floating-point number is malformed".to_owned(),
                    span,
                ));
            }

            continue;
        } else if let Some(span) = negative_sign_span {
            // If no number was found, but we DID find a negative sign, then that negative sign is alone and
            // is thus invalid.
            context.report(ErrorNotification(
                "Negative sign should have a number after it".to_owned(),
                span,
            ));
            continue;
        }

        // If whitespace is found, skip it and continue, otherwise throw an error indicating this
        // is an unknown character.
        if context.peek_is_map(|char| char.is_whitespace()) {
            context.skip_whitespace();
        } else {
            let (char, span) = context.next_span().unwrap();

            context.report(ErrorNotification(
                format!("Unexpected character '{char}'"),
                span,
            ))
        }
    }

    println!("{:#?}", context.result());
}
source

pub fn result(self) -> TokenizerResult<TokenData>

Convert this TokenizerContext into a TokenizerResult, containing the token list as well as any generated notifications.

Examples found in repository?
examples/brainfuck.rs (line 50)
fn tokenize(source: &str) -> TokenizerResult<BFTokenType> {
    use BFTokenType::*;

    // Create the reader context
    let mut context = TokenizerContext::new(source.chars());

    // Repeat this code while there are more characters in the source code.
    while context.has_next() {
        // Attempt to map these characters to their respective tokens.
        let pushed_token = context.map_single_char_token(|char| match char {
            '+' => Some(Increment),
            '-' => Some(Decrement),
            '>' => Some(MoveRight),
            '<' => Some(MoveLeft),
            '[' => Some(BeginWhile),
            ']' => Some(EndWhile),
            '.' => Some(WriteIO),
            ',' => Some(ReadIO),
            _ => None,
        });

        // If a token was NOT pushed above (i.e., it was a different character), just skip it and move on.
        if !pushed_token {
            context.skip();
        }
    }

    // Return the result
    context.result()
}
More examples
examples/file/main.rs (line 72)
pub fn main() {
    // Buffer for the input path
    let mut str_path = String::new();

    println!(
        "Current path: {:?}",
        std::env::current_dir().expect("No current dir exists.")
    );
    println!("Leave input blank to use default path.");
    print!("Input path to file: ");

    stdout().lock().flush().expect("Could not flush stdout.");

    // Read path input
    stdin()
        .lock()
        .read_line(&mut str_path)
        .expect("Could not read stdin.");

    let trimmed = str_path.trim();

    // Get path from string.
    let path = if trimmed.is_empty() {
        Path::new("./examples/file/file.txt")
    } else {
        Path::new(trimmed)
    };

    println!();
    println!("Path: {:?}", path);

    // Load file from path
    let file = File::open(path).expect("Could not open file.");

    // Get a buffered reader of the file.
    let mut reader = BufReader::new(file);

    // Create a context from the BufReader. The closure dictates
    // how we want to handle any read failures; in this case, just panic.
    let mut context = TokenizerContext::new_file(&mut reader, |x| match x {
        Ok(char) => char,
        Err(_) => panic!("Unable to read file."),
    });

    // Tokenizer logic
    while context.has_next() {
        if let Some((ident, span)) = context.try_parse_standard_identifier() {
            context.push_token(Token::new(ident, span));
        } else {
            context.skip();
        }
    }

    // Print result
    println!();
    println!("{:#?}", context.result());
}
examples/expression.rs (line 80)
fn main() {
    use MathSymbol::*;

    // The test program we're going to tokenize.
    let program = "23 * (012 - 3) / 1_2_3 + 5e-3";

    // The TokenizerContext for our example program.
    let mut context = TokenizerContext::new(program.chars());

    // While there are more characters in the source code...
    while context.has_next() {
        // If the next character is one of these, push its respective token.
        let single_char_pushed = context.map_single_char_token(|char| match char {
            '+' => Some(Plus),
            '-' => Some(Minus),
            '*' => Some(Times),
            '/' => Some(Divide),
            '%' => Some(Modulo),
            '(' => Some(OpenParen),
            ')' => Some(CloseParen),
            _ => None,
        });

        // If the above properly pushed a token, skip this iteration.
        if single_char_pushed {
            continue;
        }

        // Try to parse a floating point number (f64), throw an error or push the token as necessary.
        if let Some((result, span)) = context.try_parse_float() {
            if let Ok(value) = result {
                context.push_token(Token::new(Number(value), span));
            } else {
                context.report(ErrorNotification(String::from("Malformed number."), span));
            }

            continue;
        }

        // If the above didn't occur, skip the next character. Throw an error if it isn't whitespace.
        if let Some((char, span)) = context.next_span() {
            if !char.is_whitespace() {
                context.report(ErrorNotification(
                    format!("Unexpected character `{}` in expression.", char),
                    span,
                ));
            }
        }
    }

    // Print the result; this example should have no error notifications.
    println!("{:#?}", context.result());
}
examples/foreach.rs (line 79)
fn tokenize(source: &str) -> TokenizerResult<ForeachToken> {
    use ForeachToken::*;

    // Create the reader context
    let mut context = TokenizerContext::new(source.chars());

    // Iterate as long as more characters exist in the tokenizer
    while context.has_next() {
        // Attempt to read an identifier.
        let mut identifier = String::with_capacity(64);
        let identifier_span = context.read_into_while(&mut identifier, is_identifier_char);

        // If span is None, then 0 characters were read; i.e. there is no identifier.
        let Some(span) = identifier_span else {
            // Because there's no identifier here, push a single-character token, if there is one.
            // Consume a single character either way.
            let (char, span) = context.next_span().unwrap();

            let token = match char {
                '[' => OpenBracket,
                ']' => CloseBracket,
                '{' => OpenBrace,
                '}' => CloseBrace,
                ';' => Semicolon,
                _ => continue, // Any other character will just be ignored.
            };

            context.push_token(Token::new(token, span));
            continue;
        };

        // "//" will be matched as an identifier due to language rules.
        // If it's found, then skip until the next newline and continue.
        // Note: Something like "A//" passes this check; this is intended behavior.
        if identifier.starts_with("//") {
            context.skip_until('\n');
            continue;
        }

        // Create a token from the identifier. Some specific identifiers are their own tokens.
        let token = match identifier.as_str() {
            "=" => Assign,
            ":=" => ConstAssign,
            "=>" => Foreach,
            "->" => Return,
            _ => Identifier(identifier),
        };

        // Push the token from above along with the identifier's span.
        context.push_token(Token::new(token, span));
    }

    // Return the result
    context.result()
}
examples/json.rs (line 175)
pub fn main() {
    use JsonToken::*;

    let code = r##"
        {
            "a": "Assigned to the \"a\" property!",
            "b" : 25.2,
            "c": false,
            "d": true,
            "12": null,
            "list": [
                1,
                -2e+2,
                3.1
            ]
        }
    "##;

    let mut context = TokenizerContext::new(code.chars());

    while context.has_next() {
        // Map single-character tokens to their data.
        let found_token = context.map_single_char_token(|x| match x {
            '{' => Some(OpenBrace),
            '}' => Some(CloseBrace),
            '[' => Some(OpenBracket),
            ']' => Some(CloseBracket),
            ':' => Some(Colon),
            ',' => Some(Comma),
            _ => None,
        });

        // If the above codeblock pushed a new token, move onto the next iteration.
        if found_token {
            continue;
        }

        // Attempt to parse an identifier for certain tokens.
        if let Some((name, span)) = context.try_parse_standard_identifier() {
            let data = match name.as_ref() {
                "null" => Some(Null),
                "true" => Some(Bool(true)),
                "false" => Some(Bool(false)),
                _ => None,
            };

            // If the above table mapped the identifier to valid token data, push a token.
            // If it didn't, report an error. Either way, restart the loop.
            if let Some(data) = data {
                context.push_token(Token::new(data, span));
            } else {
                context.report(ErrorNotification(
                    format!("Unexpected identifier {}", name),
                    span,
                ))
            }

            continue;
        }

        // If the next element in the source code is a string, parse it and report errors as necessary.
        if let Some((result, span)) = context.try_parse_strict_string() {
            match result {
                Ok(string) => {
                    // Push the valid string.
                    context.push_token(Token::new(Str(string), span));
                }
                Err(errors) => {
                    // Create and report a notification for every error.
                    for error in errors {
                        use ParseCharError::*;
                        use StringTokenError::*;

                        context.report(match error {
                            CharError(NoEscape(span)) => ErrorNotification(
                                "Missing escape code after backslash".to_owned(),
                                span,
                            ),
                            CharError(IllegalEscape(char, span)) => {
                                ErrorNotification(format!("Illegal escape code '{}'", char), span)
                            }
                            CharError(NoCharFound) => {
                                unreachable!("Strings will never create this error")
                            }
                            NoClosingDelimiter => ErrorNotification(
                                "Missing closing delimiter on string".to_owned(),
                                span.clone(),
                            ),
                        });
                    }
                }
            }
            continue;
        }

        // If the next character is a minus sign, skip it and set
        // this variable to its span. Otherwise, set it to None.
        //
        // Note: Negatives are parsed during tokenization because JSON has no
        // subtraction; every '-' sign is unary, so there's no ambiguity.
        let negative_sign_span = if context.peek_is('-') {
            let span = context.next_span().unwrap().1;
            context.skip_whitespace();
            Some(span)
        } else {
            None
        };

        // Attempt to parse a floating point number. If one was found and it was valid, push
        // a token for it, otherwise report a parsing error.
        if let Some((result, span)) = context.try_parse_float() {
            if let Ok(mut number) = result {
                // If a negative sign was found prior, negate the number.
                // This is completely lossless because floating point numbers use a sign bit.
                if negative_sign_span.is_some() {
                    number = -number;
                }

                context.push_token(Token::new(Number(number), span));
            } else {
                context.report(ErrorNotification(
                    "Floating-point number is malformed".to_owned(),
                    span,
                ));
            }

            continue;
        } else if let Some(span) = negative_sign_span {
            // If no number was found, but we DID find a negative sign, then that negative sign is alone and
            // is thus invalid.
            context.report(ErrorNotification(
                "Negative sign should have a number after it".to_owned(),
                span,
            ));
            continue;
        }

        // If whitespace is found, skip it and continue, otherwise throw an error indicating this
        // is an unknown character.
        if context.peek_is_map(|char| char.is_whitespace()) {
            context.skip_whitespace();
        } else {
            let (char, span) = context.next_span().unwrap();

            context.report(ErrorNotification(
                format!("Unexpected character '{char}'"),
                span,
            ))
        }
    }

    println!("{:#?}", context.result());
}
source

pub fn create_iterator<'ctx, F: FnMut(&mut Self)>( &'ctx mut self, function: F, ) -> TokenizerIterator<'ctx, Source, TokenData, F>

Creates a token iterator over this context. The supplied function will be called with a mutable reference to this TokenizerContext as its argument whenever new tokens are needed.

This iterator ignores notifications; poll them from the context after the iterator is done being used.

Using an iterator tokenizer is preferred because it minimizes the number of tokens held in memory at a given time, which helps avoid issues with large files.

Examples found in repository?
examples/iterator.rs (lines 24-38)
fn main() {
    let source = "this is my source code :3";
    let mut ctx = TokenizerContext::new(source.chars());

    // Collect every pair of characters into a Pair token.
    // If there is a lone character at the end, collect it into a Single token.
    let iter = ctx.create_iterator(|context| {
        // If there's another character to look at, consume it and continue
        if let Some((char1, span1)) = context.next_span() {
            // If another character follows this one, consume it and create a Pair token.
            // If there is no character after this one, create a Single token.
            if let Some((char2, span2)) = context.next_span() {
                context.push_token(Token::new(
                    MyToken::Pair(char1, char2),
                    span1.between(&span2),
                ))
            } else {
                context.push_token(Token::new(MyToken::Single(char1), span1))
            }
        }
    });

    // Print each token in the iterator.
    // This has the benefit of not having every single token in-memory at once.
    for token in iter {
        println!("{:?}", token);
    }
}
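
The pull-based design described above can be illustrated with a minimal, stdlib-only sketch. Nothing here is alkale's actual implementation; `MiniContext`, `MiniIterator`, and `tokenize_words` are hypothetical names, but the core idea matches the docs: the iterator calls a fill function whenever its token buffer runs dry, so only a few tokens exist in memory at once.

```rust
// Minimal, stdlib-only sketch of a pull-based token iterator.
// All names here (MiniContext, MiniIterator, tokenize_words) are
// illustrative and are NOT alkale's actual types.
struct MiniContext<I: Iterator<Item = char>> {
    source: std::iter::Peekable<I>,
    tokens: Vec<String>,
}

struct MiniIterator<'a, I: Iterator<Item = char>, F: FnMut(&mut MiniContext<I>)> {
    ctx: &'a mut MiniContext<I>,
    fill: F,
}

impl<'a, I: Iterator<Item = char>, F: FnMut(&mut MiniContext<I>)> Iterator
    for MiniIterator<'a, I, F>
{
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Call the fill function until it buffers a token or the source
        // runs out; this keeps at most a handful of tokens in memory.
        while self.ctx.tokens.is_empty() && self.ctx.source.peek().is_some() {
            (self.fill)(&mut *self.ctx);
        }
        if self.ctx.tokens.is_empty() {
            None
        } else {
            // A VecDeque would avoid the O(n) front removal; a Vec
            // keeps the sketch short.
            Some(self.ctx.tokens.remove(0))
        }
    }
}

// Tokenize whitespace-separated words using the iterator above.
fn tokenize_words(source: &str) -> Vec<String> {
    let mut ctx = MiniContext {
        source: source.chars().peekable(),
        tokens: Vec::new(),
    };
    let iter = MiniIterator {
        ctx: &mut ctx,
        fill: |c: &mut MiniContext<_>| {
            // Buffer one word per call, consuming one trailing
            // whitespace character if present.
            let word: String =
                std::iter::from_fn(|| c.source.next_if(|ch| !ch.is_whitespace())).collect();
            c.source.next_if(|ch| ch.is_whitespace());
            if !word.is_empty() {
                c.tokens.push(word);
            }
        },
    };
    iter.collect()
}

fn main() {
    println!("{:?}", tokenize_words("this is my source code :3"));
}
```
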

Auto Trait Implementations§

§

impl<Source, TokenData> Freeze for TokenizerContext<Source, TokenData>
where Source: Freeze,

§

impl<Source, TokenData> !RefUnwindSafe for TokenizerContext<Source, TokenData>

§

impl<Source, TokenData> !Send for TokenizerContext<Source, TokenData>

§

impl<Source, TokenData> !Sync for TokenizerContext<Source, TokenData>

§

impl<Source, TokenData> Unpin for TokenizerContext<Source, TokenData>
where Source: Unpin, TokenData: Unpin,

§

impl<Source, TokenData> !UnwindSafe for TokenizerContext<Source, TokenData>

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

source§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.