# Lexariel

**Lexical Analyzer for the Asmodeus Language**
```text
┌───────────────────────────────────────────────────────────────┐
│                                                               │
│  ██╗     ███████╗██╗  ██╗ █████╗ ██████╗ ██╗███████╗██╗       │
│  ██║     ██╔════╝╚██╗██╔╝██╔══██╗██╔══██╗██║██╔════╝██║       │
│  ██║     █████╗   ╚███╔╝ ███████║██████╔╝██║█████╗  ██║       │
│  ██║     ██╔══╝   ██╔██╗ ██╔══██║██╔══██╗██║██╔══╝  ██║       │
│  ███████╗███████╗██╔╝ ██╗██║  ██║██║  ██║██║███████╗███████╗  │
│  ╚══════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝╚══════╝╚══════╝  │
│                                                               │
│                  Asmodeus Language Tokenizer                  │
└───────────────────────────────────────────────────────────────┘
```
Lexariel is the lexical analyzer (tokenizer) for the Asmodeus language. It converts raw source code into a stream of structured tokens that can be consumed by the parser, and it is built with performance and error recovery in mind.
## 🎯 Features

### Core Tokenization
- **Complete Token Recognition**: All Asmodeus language constructs
- **Multiple Comment Styles**: Both `;` and `//` comment syntax
- **Number Format Support**: Decimal, hexadecimal (`0x`), and binary (`0b`)
- **Identifier Recognition**: Labels, instruction mnemonics, and symbols
- **Addressing Mode Tokens**: `#`, `[`, `]`, register names
- **Error Recovery**: Continues parsing after lexical errors
### Advanced Features
- **Position Tracking**: Line and column information for all tokens
- **Whitespace Handling**: Intelligent whitespace skipping
- **String Literals**: Support for quoted strings and character literals
- **Macro Keywords**: Recognition of `MAKRO` and `KONM` macro delimiters
- **Directive Support**: `RST`, `RPA`, and other assembler directives
## 🚀 Quick Start

### Basic Usage
```rust
use lexariel::tokenize;

let source = r#"
    start:
        POB #42     ; Load immediate value
        WYJSCIE     // Output the value
        STP         ; Stop program
"#;

let tokens = tokenize(source)?;
for token in tokens {
    println!("{:?}", token);
}
```
Output example:

```text
Token { kind: Identifier, value: "start", line: 2, column: 5 }
Token { kind: Colon, value: ":", line: 2, column: 10 }
Token { kind: Keyword, value: "POB", line: 3, column: 9 }
Token { kind: Hash, value: "#", line: 3, column: 13 }
Token { kind: Number, value: "42", line: 3, column: 14 }
Token { kind: Keyword, value: "WYJSCIE", line: 4, column: 9 }
Token { kind: Keyword, value: "STP", line: 5, column: 9 }
```
### Advanced Tokenization
```rust
use lexariel::{tokenize, TokenKind};
use std::collections::HashMap;

let source = r#"
MAKRO add_numbers num1 num2
    POB num1
    DOD num2
    WYJSCIE
KONM

data_section:
    value1: RST 0x2A     ; Hex number
    value2: RST 0b101010 ; Binary number
    buffer: RPA
"#;

let tokens = tokenize(source)?;

// Filter only keywords
let keywords: Vec<_> = tokens.iter()
    .filter(|t| matches!(t.kind, TokenKind::Keyword))
    .collect();

// Count different token types (keyed by the kind's debug name)
let mut counts: HashMap<String, usize> = HashMap::new();
for token in &tokens {
    *counts.entry(format!("{:?}", token.kind)).or_insert(0) += 1;
}
```
## 📚 Token Types

### Core Token Types
| Token Kind | Description | Examples |
|---|---|---|
| `Keyword` | Assembly instructions and directives | `POB`, `DOD`, `STP`, `RST` |
| `Identifier` | User-defined names | `start`, `loop`, `data_value` |
| `Number` | Numeric literals | `42`, `0x2A`, `0b101010` |
| `Hash` | Immediate value prefix | `#` |
| `LeftBracket` | Indirect addressing start | `[` |
| `RightBracket` | Indirect addressing end | `]` |
| `Colon` | Label definition | `:` |
| `Comma` | Parameter separator | `,` |
| `Directive` | Assembler directives | `MAKRO`, `KONM` |
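Downstream code typically dispatches on `TokenKind`. A minimal sketch using the variants from the table (it assumes `Token` and `TokenKind` are exported; the catch-all arm covers any variants not listed here):

```rust
use lexariel::{Token, TokenKind};

/// Map a token to a human-readable category from the table above.
fn describe(token: &Token) -> &'static str {
    match &token.kind {
        TokenKind::Keyword => "instruction or directive",
        TokenKind::Identifier => "user-defined name",
        TokenKind::Number => "numeric literal",
        TokenKind::Hash => "immediate value prefix",
        TokenKind::LeftBracket | TokenKind::RightBracket => "indirect addressing bracket",
        TokenKind::Colon => "label definition",
        TokenKind::Comma => "parameter separator",
        TokenKind::Directive => "assembler directive",
        _ => "other",
    }
}
```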
### Number Format Support
```rust
// Decimal numbers
let tokens = tokenize("RST 42")?;

// Hexadecimal numbers
let tokens = tokenize("RST 0x2A")?;

// Binary numbers
let tokens = tokenize("RST 0b101010")?;

// Negative numbers
let tokens = tokenize("RST -10")?;
```
### Comment Styles
```rust
// Semicolon comments (traditional assembly)
let source = r#"
POB value ; This is a comment
STP       ; Stop the program
"#;

// C-style comments
let source = r#"
POB value // This is also a comment
STP       // Stop the program
"#;

// Both styles can be mixed
let source = r#"
; Program header comment
start:          // Entry point
    POB #42     ; Load value
    STP         // End program
"#;
```
## 🔧 API Reference

### Main Functions
```rust
// Signature sketch; the exact error type name may differ.
pub fn tokenize(source: &str) -> Result<Vec<Token>, LexerError>
```

The primary entry point for tokenization. It takes source code and returns a vector of tokens or a lexer error.
### Core Types
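The central type is the `Token` returned by `tokenize`. A sketch of its shape, with field names taken from the output example above (the exact definition lives in the crate and may differ):

```rust
/// Sketch of the core token type; field names match the debug
/// output shown earlier, the real definition may differ.
pub struct Token {
    pub kind: TokenKind, // Keyword, Identifier, Number, Hash, ...
    pub value: String,   // the matched source text
    pub line: usize,     // 1-based source line
    pub column: usize,   // 1-based source column
}
```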
### The `Lexer` Type
For more control over tokenization:
```rust
use lexariel::Lexer;

let mut lexer = Lexer::new(source);
let tokens = lexer.tokenize()?;

// Access lexer state if needed (field names are illustrative)
println!("stopped at line {}, column {}", lexer.line, lexer.column);
```
## 📖 Examples

### Basic Program Tokenization
```rust
use lexariel::tokenize;

let program = r#"
; Simple addition program
start:
    POB first_num  ; Load first number
    DOD second_num ; Add second number
    WYJSCIE        ; Output result
    STP            ; Stop

first_num: RST 25
second_num: RST 17
"#;

let tokens = tokenize(program)?;

// Print all tokens with position info
for token in tokens {
    println!("{}:{}\t{:?} {:?}", token.line, token.column, token.kind, token.value);
}
```
### Macro Definition Tokenization
let macro_source = r#"
MAKRO multiply_by_two value
POB value
DOD value
WYJSCIE
KONM
start:
multiply_by_two data_value
STP
data_value: RST 21
"#;
let tokens = tokenize?;
// Find macro boundaries
let macro_start = tokens.iter.position;
let macro_end = tokens.iter.position;
println!;
### Number Format Recognition
let numbers_program = r#"
decimal_val: RST 42 ; Decimal
hex_val: RST 0x2A ; Hexadecimal
binary_val: RST 0b101010 ; Binary
negative_val: RST -10 ; Negative
"#;
let tokens = tokenize?;
// Extract all numbers with their formats
for token in tokens
### Error Handling
```rust
use lexariel::tokenize;

// Source with lexical error
let bad_source = r#"
start:
    POB @invalid_char ; @ is not valid
    STP
"#;

match tokenize(bad_source) {
    Ok(tokens) => println!("tokenized {} tokens", tokens.len()),
    Err(e) => eprintln!("lexical error: {e}"),
}
```
### Addressing Mode Recognition
let addressing_examples = r#"
; Direct addressing
POB value
; Immediate addressing
POB #42
; Indirect addressing
POB [address]
; Register addressing
POB R1
"#;
let tokens = tokenize?;
// Find addressing mode indicators
for in tokens.iter.enumerate
## 🧪 Testing

### Unit Tests
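Run the crate's unit tests with Cargo (the package name is assumed to be `lexariel`):

```bash
cargo test -p lexariel
```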
### Specific Test Categories

```bash
# Filter names below are illustrative; list real tests with `cargo test -- --list`.

# Test basic tokenization
cargo test basic

# Test number format recognition
cargo test numbers

# Test comment handling
cargo test comments

# Test error recovery
cargo test error_recovery
```
### Integration Tests
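Assuming the standard Cargo layout with integration tests under `tests/`, all test targets run with:

```bash
cargo test -p lexariel --tests
```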
## 🔍 Performance Characteristics
- **Speed**: ~1M lines per second tokenization
- **Memory**: O(n), where n is the source length
- **Error Recovery**: Continues after lexical errors
- **Position Tracking**: Minimal overhead for line/column info
### Benchmarking
```rust
use lexariel::tokenize;
use std::time::Instant;

let large_source = include_str!("large_program.asmod"); // path is illustrative
let start = Instant::now();
let tokens = tokenize(large_source)?;
let duration = start.elapsed();
println!("tokenized {} tokens in {:?}", tokens.len(), duration);
```
## 🚫 Error Recovery
Lexariel is designed to continue tokenization even after encountering errors:
let source_with_errors = r#"
start:
POB #42 ; Valid
@@@ ; Invalid characters
STP ; Valid again
"#;
// Lexer will report error but continue tokenizing
match tokenize
## 🔗 Integration with Asmodeus Pipeline
Lexariel is the first stage in the Asmodeus compilation pipeline:
```text
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Source    │───▶│  Lexariel   │───▶│   Parseid   │
│    Code     │    │   (Lexer)   │    │  (Parser)   │
│  (.asmod)   │    │             │    │             │
└─────────────┘    └─────────────┘    └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │   Tokens    │
                   │             │
                   └─────────────┘
```
### Usage in Pipeline
```rust
use lexariel::tokenize;
use parseid::parse;

// Complete pipeline from source to AST
let source = "POB #42\nWYJSCIE\nSTP";
let tokens = tokenize(source)?; // Lexariel
let ast = parse(tokens)?;       // Parseid (exact signature may differ)
```
## 🎨 Token Visualization

For debugging and development it is useful to print the token stream in a readable form.
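A minimal sketch, assuming the `Token` fields shown in the output example above (`kind`, `value`, `line`, `column`):

```rust
use lexariel::Token;

/// Print one token per line with its source position.
/// Field names are taken from the debug output earlier in this README.
fn dump_tokens(tokens: &[Token]) {
    for token in tokens {
        println!("{:>4}:{:<4} {:?} {:?}", token.line, token.column, token.kind, token.value);
    }
}
```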
## 🤝 Contributing

### Adding New Token Types
- Add a new variant to the `TokenKind` enum
- Update the lexer logic in `lexer.rs`
- Add tests for the new token type
- Update documentation
### Parser Integration
When adding new syntax to Asmodeus:
- Define tokens in Lexariel
- Update parser in Parseid to handle new tokens
- Add assembler support in Hephasm if needed
## 📜 License
This crate is part of the Asmodeus project and is licensed under the MIT License.
## 🔗 Related Components
- **Parseid** - Parser that consumes Lexariel tokens
- **Shared** - Common types and utilities
- **Main Asmodeus** - Complete language toolchain