tokenizer_py-0.1.2 has been yanked.
Python-like Tokenizer in Rust
This project implements a Python-like tokenizer in Rust. It can tokenize a string into a sequence of tokens, which are
represented by the Token enum. The supported tokens are:
- Token::Name: a name token, such as a function or variable name
- Token::Number: a number token, such as a literal integer or floating-point number
- Token::String: a string token, such as a single- or double-quoted string
- Token::OP: an operator token, such as an arithmetic or comparison operator
- Token::Indent: an indent token, indicating that a block of code is being indented
- Token::Dedent: a dedent token, indicating that a block of code is being dedented
- Token::Comment: a comment token, such as a single-line or multi-line comment
- Token::NewLine: a newline token, indicating a new line in the source code
- Token::NL: a token indicating a new line, for compatibility with the original tokenizer
- Token::EndMarker: an end-of-file marker
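The variants above could be modeled roughly as follows. This is an illustrative sketch only, not the crate's actual definition; the payload types are assumptions:

```rust
// Illustrative sketch of a token enum matching the variants listed above.
// The String payloads are assumptions, not the crate's actual definition.
#[derive(Debug, PartialEq)]
enum Token {
    Name(String),    // identifiers and keywords
    Number(String),  // integer and floating-point literals
    String(String),  // string literals
    OP(String),      // operators
    Indent,          // start of an indented block
    Dedent,          // end of an indented block
    Comment(String), // comments
    NewLine,         // logical newline
    NL,              // non-logical newline, kept for compatibility
    EndMarker,       // end of input
}
```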
The tokenizer uses a simple state machine to tokenize the input text. It recognizes the following tokens:
- Whitespace: spaces, tabs, and newlines
- Numbers: integers and floating-point numbers
  - float: floating-point numbers
  - int: integer numbers
- Names: identifiers and keywords
- Strings: single- and double-quoted strings
  - basic string: single- and double-quoted strings
  - format string: Python format strings (f-strings)
  - byte string: Python byte strings
  - raw string: raw strings
  - multi-line string: single- and double-quoted multi-line strings
- Operators: arithmetic, comparison, and other operators
- Comments: single-line comments
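The state-machine idea can be illustrated with a small self-contained example: the scanner picks a state from the first character of each token and keeps consuming characters while that state still matches. This is a simplified sketch for explanation only, not the crate's implementation; it covers only names, numbers, and single-character operators:

```rust
// Minimal state-machine scanner: the first character selects a state
// (number, name, or operator), and the loop consumes characters until
// the state no longer matches. Sketch only, not the crate's implementation.
#[derive(Debug, PartialEq)]
enum Tok {
    Name(String),
    Number(String),
    Op(String),
}

fn scan(input: &str) -> Vec<Tok> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1; // skip whitespace between tokens
        } else if c.is_ascii_digit() {
            // number state: digits plus at most one decimal point
            let start = i;
            let mut seen_dot = false;
            while i < chars.len()
                && (chars[i].is_ascii_digit() || (chars[i] == '.' && !seen_dot))
            {
                if chars[i] == '.' {
                    seen_dot = true;
                }
                i += 1;
            }
            tokens.push(Tok::Number(chars[start..i].iter().collect()));
        } else if c.is_alphabetic() || c == '_' {
            // name state: identifiers and keywords
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            tokens.push(Tok::Name(chars[start..i].iter().collect()));
        } else {
            // operator state: single-character operators only, for brevity
            tokens.push(Tok::Op(c.to_string()));
            i += 1;
        }
    }
    tokens
}
```

For example, `scan("x = 1.5")` produces `Name("x")`, `Op("=")`, `Number("1.5")`.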
The tokenizer also provides a tokenize method that takes a string as input and returns a Result containing a vector
of tokens.
Here is an example of how to use the tokenizer (reconstructed, since the code spans were stripped from this README; the exact item names may differ from the crate's API):

```rust
use tokenizer_py::{Tokenizer, Token};

let tokenizer = Tokenizer::new("hello_world");
let tokens = tokenizer.tokenize().unwrap();
// every token stream ends with an end-of-file marker
assert_eq!(tokens.last(), Some(&Token::EndMarker));
```
Usage
Add this to your Cargo.toml:
```toml
[dependencies]
tokenizer_py = "0.1.1"
```
Error Handling
The tokenizer uses the Result type to indicate possible errors during tokenization. The possible errors are:
- TokenizerError::Operator: an invalid operator was encountered
- TokenizerError::Number: an invalid number was encountered
- TokenizerError::Indent: an invalid indent was encountered
- TokenizerError::String: an invalid string was encountered
Here is an example of how to handle these errors (reconstructed; the variants' payloads, if any, are not shown):

```rust
match tokenizer.tokenize() {
    Ok(tokens) => println!("{:?}", tokens),
    // err is one of the TokenizerError variants listed above
    Err(err) => eprintln!("tokenization failed: {:?}", err),
}
```