# Lexariel

**Lexical Analyzer for Asmodeus Language**

```
┌───────────────────────────────────────────────────────────────┐
│                                                               │
│  ██╗     ███████╗██╗  ██╗ █████╗ ██████╗ ██╗███████╗██╗       │
│  ██║     ██╔════╝╚██╗██╔╝██╔══██╗██╔══██╗██║██╔════╝██║       │
│  ██║     █████╗   ╚███╔╝ ███████║██████╔╝██║█████╗  ██║       │
│  ██║     ██╔══╝   ██╔██╗ ██╔══██║██╔══██╗██║██╔══╝  ██║       │
│  ███████╗███████╗██╔╝ ██╗██║  ██║██║  ██║██║███████╗███████╗  │
│  ╚══════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝╚══════╝╚══════╝  │
│                                                               │
│                Asmodeus Language Tokenizer                    │
└───────────────────────────────────────────────────────────────┘
```

**Lexariel** is the lexical analyzer (tokenizer) for the Asmodeus language. It converts raw source code into a stream of structured tokens for consumption by the parser, and is built with performance and error recovery in mind.

## 🎯 Features

### Core Tokenization
- **Complete Token Recognition**: All Asmodeus language constructs
- **Multiple Comment Styles**: Both `;` and `//` comment syntax
- **Number Format Support**: Decimal, hexadecimal (0x), and binary (0b)
- **Identifier Recognition**: Labels, instruction mnemonics, and symbols
- **Addressing Mode Tokens**: `#`, `[`, `]`, register names
- **Error Recovery**: Continues parsing after lexical errors

### Advanced Features
- **Position Tracking**: Line and column information for all tokens
- **Whitespace Handling**: Intelligent whitespace skipping
- **String Literals**: Support for quoted strings and character literals
- **Macro Keywords**: Recognition of `MAKRO`, `KONM` macro delimiters
- **Directive Support**: `RST`, `RPA` and other assembler directives

## 🚀 Quick Start

### Basic Usage

```rust
use lexariel::tokenize;

let source = r#"
    start:
        POB #42     ; Load immediate value
        WYJSCIE     // Output the value
        STP         ; Stop program
"#;

let tokens = tokenize(source)?;
for token in tokens {
    println!("{:?}", token);
}
```

### Output Example
```
Token { kind: Identifier, value: "start", line: 2, column: 5 }
Token { kind: Colon, value: ":", line: 2, column: 10 }
Token { kind: Keyword, value: "POB", line: 3, column: 9 }
Token { kind: Hash, value: "#", line: 3, column: 13 }
Token { kind: Number, value: "42", line: 3, column: 14 }
Token { kind: Keyword, value: "WYJSCIE", line: 4, column: 9 }
Token { kind: Keyword, value: "STP", line: 5, column: 9 }
```

### Advanced Tokenization

```rust
use lexariel::{tokenize, TokenKind};

let source = r#"
    MAKRO add_numbers num1 num2
        POB num1
        DOD num2
        WYJSCIE
    KONM
    
    data_section:
        value1: RST 0x2A    ; Hex number
        value2: RST 0b101010 ; Binary number
        buffer: RPA
"#;

let tokens = tokenize(source)?;

// Filter only keywords
let keywords: Vec<_> = tokens.iter()
    .filter(|t| t.kind == TokenKind::Keyword)
    .collect();

// Count different token kinds (keyed by their Debug name, so no Hash impl is required)
let mut counts = std::collections::HashMap::new();
for token in &tokens {
    *counts.entry(format!("{:?}", token.kind)).or_insert(0) += 1;
}
```

## 📚 Token Types

### Core Token Types

| Token Kind | Description | Examples |
|------------|-------------|----------|
| `Keyword` | Assembly instructions and directives | `POB`, `DOD`, `STP`, `RST` |
| `Identifier` | User-defined names | `start`, `loop`, `data_value` |
| `Number` | Numeric literals | `42`, `0x2A`, `0b101010` |
| `Hash` | Immediate value prefix | `#` |
| `LeftBracket` | Indirect addressing start | `[` |
| `RightBracket` | Indirect addressing end | `]` |
| `Colon` | Label definition | `:` |
| `Comma` | Parameter separator | `,` |
| `Directive` | Assembler directives | `MAKRO`, `KONM` |
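
As a quick sanity check, the table can be exercised directly. The sketch below assumes, as in the Quick Start output above, that whitespace and comments are skipped and no end-of-input token is emitted:

```rust
use lexariel::{tokenize, TokenKind};

// `start: POB #42` mirrors the Quick Start example on a single line
let tokens = tokenize("start: POB #42")?;
let kinds: Vec<TokenKind> = tokens.iter().map(|t| t.kind).collect();
assert_eq!(
    kinds,
    vec![
        TokenKind::Identifier, // start
        TokenKind::Colon,      // :
        TokenKind::Keyword,    // POB
        TokenKind::Hash,       // #
        TokenKind::Number,     // 42
    ]
);
```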

### Number Format Support

```rust
// Decimal numbers
let tokens = tokenize("RST 42")?;

// Hexadecimal numbers  
let tokens = tokenize("POB 0xFF")?;

// Binary numbers
let tokens = tokenize("DOD 0b1010")?;

// Negative numbers
let tokens = tokenize("RST -10")?;
```
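
In each case the resulting `Number` token's `value` field keeps the literal as it appears in the source (for example `"0xFF"` rather than a converted decimal value), as the token table above shows.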

### Comment Styles

```rust
// Semicolon comments (traditional assembly)
let source = r#"
    POB value    ; This is a comment
    STP          ; Stop the program
"#;

// C-style comments  
let source = r#"
    POB value    // This is also a comment
    STP          // Stop the program
"#;

// Both styles can be mixed
let source = r#"
    ; Program header comment
    start:       // Entry point
        POB #42  ; Load value
        STP      // End program
"#;
```

## 🔧 API Reference

### Main Functions

```rust
pub fn tokenize(input: &str) -> Result<Vec<Token>, LexerError>;
```

The primary entry point for tokenization. Takes source code and returns a vector of tokens or a lexer error.
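
The snippets in this README use `?` for brevity; a minimal complete sketch is shown below, calling `tokenize` from a `Result`-returning `main` (this works because `LexerError` derives `std::error::Error` via `thiserror`, see below):

```rust
use lexariel::tokenize;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let source = "POB #42\nWYJSCIE\nSTP";
    let tokens = tokenize(source)?; // propagates any LexerError
    println!("Tokenized {} tokens", tokens.len());
    Ok(())
}
```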

### Core Types

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub value: String,
    pub line: usize,
    pub column: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    // Literals
    Number,
    Identifier,
    
    // Keywords and Instructions
    Keyword,
    Directive,
    
    // Symbols
    Hash,           // #
    LeftBracket,    // [  
    RightBracket,   // ]
    Colon,          // :
    Comma,          // ,
    
    // Special
    Newline,
    Invalid,
}

#[derive(Debug, thiserror::Error)]
pub enum LexerError {
    #[error("Invalid character '{char}' at line {line}, column {column}")]
    InvalidCharacter { char: char, line: usize, column: usize },
    
    #[error("Unterminated string literal at line {line}")]
    UnterminatedString { line: usize },
    
    #[error("Invalid number format '{value}' at line {line}, column {column}")]
    InvalidNumberFormat { value: String, line: usize, column: usize },
}
```
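
The `#[error(...)]` attributes give each variant a ready-made `Display` message that includes position information. A small sketch, constructing an error value directly just for illustration:

```rust
use lexariel::LexerError;

let err = LexerError::InvalidCharacter { char: '@', line: 3, column: 9 };
// Prints: Invalid character '@' at line 3, column 9
println!("{}", err);
```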

### The `Lexer` Struct

For more control over tokenization:

```rust
use lexariel::Lexer;

let source_code = "POB #42\nWYJSCIE\nSTP";
let mut lexer = Lexer::new(source_code);
let tokens = lexer.tokenize()?;

// Access lexer state if needed
println!("Total lines processed: {}", lexer.current_line());
```

## 📖 Examples

### Basic Program Tokenization

```rust
use lexariel::tokenize;

let program = r#"
    ; Simple addition program
    start:
        POB first_num   ; Load first number
        DOD second_num  ; Add second number  
        WYJSCIE         ; Output result
        STP             ; Stop
        
    first_num:  RST 25
    second_num: RST 17
"#;

let tokens = tokenize(program)?;

// Print all tokens with position info
for token in tokens {
    println!("{}:{} - {:?}: '{}'", 
             token.line, token.column, token.kind, token.value);
}
```

### Macro Definition Tokenization

```rust
let macro_source = r#"
    MAKRO multiply_by_two value
        POB value
        DOD value
        WYJSCIE
    KONM
    
    start:
        multiply_by_two data_value
        STP
        
    data_value: RST 21
"#;

let tokens = tokenize(macro_source)?;

// Find macro boundaries
let macro_start = tokens.iter().position(|t| t.value == "MAKRO");
let macro_end = tokens.iter().position(|t| t.value == "KONM");

println!("Macro defined from token {} to {}", 
         macro_start.unwrap(), macro_end.unwrap());
```

### Number Format Recognition

```rust
use lexariel::{tokenize, TokenKind};

let numbers_program = r#"
    decimal_val:  RST 42        ; Decimal
    hex_val:      RST 0x2A      ; Hexadecimal  
    binary_val:   RST 0b101010  ; Binary
    negative_val: RST -10       ; Negative
"#;

let tokens = tokenize(numbers_program)?;

// Extract all numbers with their formats
for token in tokens {
    if token.kind == TokenKind::Number {
        println!("Number: '{}' at {}:{}", 
                 token.value, token.line, token.column);
    }
}
```

### Error Handling

```rust
use lexariel::{tokenize, LexerError};

// Source with lexical error
let bad_source = r#"
    start:
        POB @invalid_char   ; @ is not valid
        STP
"#;

match tokenize(bad_source) {
    Ok(tokens) => println!("Tokenized successfully: {} tokens", tokens.len()),
    Err(LexerError::InvalidCharacter { char, line, column }) => {
        println!("Invalid character '{}' at line {}, column {}", char, line, column);
    }
    Err(e) => println!("Other lexer error: {}", e),
}
```

### Addressing Mode Recognition

```rust
use lexariel::{tokenize, TokenKind};

let addressing_examples = r#"
    ; Direct addressing
    POB value
    
    ; Immediate addressing  
    POB #42
    
    ; Indirect addressing
    POB [address]
    
    ; Register addressing
    POB R1
"#;

let tokens = tokenize(addressing_examples)?;

// Find addressing mode indicators
for (i, token) in tokens.iter().enumerate() {
    match token.kind {
        TokenKind::Hash => println!("Immediate addressing at token {}", i),
        TokenKind::LeftBracket => println!("Indirect addressing at token {}", i),
        _ => {}
    }
}
```

## 🧪 Testing

### Unit Tests

```bash
cargo test -p lexariel
```

### Specific Test Categories

```bash
# Test basic tokenization
cargo test -p lexariel basic_tokenization

# Test number format recognition  
cargo test -p lexariel number_tests

# Test comment handling
cargo test -p lexariel comment_tests

# Test error recovery
cargo test -p lexariel error_tests
```

### Integration Tests

```bash
cargo test -p lexariel --test integration_tests
```

## 🔍 Performance Characteristics

- **Speed**: ~1M lines per second tokenization
- **Memory**: O(n) where n is source length
- **Error Recovery**: Continues after lexical errors
- **Position Tracking**: Minimal overhead for line/column info

### Benchmarking

```rust
use lexariel::tokenize;
use std::time::Instant;

let large_source = include_str!("large_program.asmod");
let start = Instant::now();
let tokens = tokenize(large_source)?;
let duration = start.elapsed();

println!("Tokenized {} characters into {} tokens in {:?}", 
         large_source.len(), tokens.len(), duration);
```

## 🚫 Error Recovery

Lexariel is designed to continue tokenization even after encountering errors:

```rust
let source_with_errors = r#"
    start:
        POB #42     ; Valid
        @@@         ; Invalid characters
        STP         ; Valid again  
"#;

// Lexer will report error but continue tokenizing
match tokenize(source_with_errors) {
    Ok(tokens) => {
        // Will still get valid tokens before and after error
        println!("Got {} tokens despite errors", tokens.len());
    }
    Err(e) => {
        println!("First error encountered: {}", e);
        // In practice, might want to collect all errors
    }
}
```

## 🔗 Integration with Asmodeus Pipeline

Lexariel is the first stage in the Asmodeus compilation pipeline:

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Source    │───▶│   Lexariel  │───▶│   Parseid   │
│    Code     │    │   (Lexer)   │    │  (Parser)   │
│  (.asmod)   │    │             │    │             │
└─────────────┘    └─────────────┘    └─────────────┘
                    ┌─────────────┐
                    │   Tokens    │
                    │             │
                    └─────────────┘
```

### Usage in Pipeline

```rust
use lexariel::tokenize;
use parseid::parse;

// Complete pipeline from source to AST
let source = "POB #42\nWYJSCIE\nSTP";
let tokens = tokenize(source)?;      // Lexariel
let ast = parse(tokens)?;            // Parseid
```

## 🎨 Token Visualization

For debugging and development:

```rust
use lexariel::{tokenize, TokenKind};

fn visualize_tokens(source: &str) -> Result<(), Box<dyn std::error::Error>> {
    let tokens = tokenize(source)?;
    
    println!("┌─────┬────────────┬─────────────┬──────────┐");
    println!("│ Pos │    Type    │    Value    │ Location │");
    println!("├─────┼────────────┼─────────────┼──────────┤");
    
    for (i, token) in tokens.iter().enumerate() {
        println!("│{:4} │{:11} │{:12} │ {:2}:{:<3}   │", 
                 i, 
                 format!("{:?}", token.kind),
                 format!("'{}'", token.value),
                 token.line, 
                 token.column);
    }
    
    println!("└─────┴────────────┴─────────────┴──────────┘");
    Ok(())
}
```

## 🤝 Contributing

### Adding New Token Types

1. Add a new variant to the `TokenKind` enum
2. Update the lexer logic in `lexer.rs`
3. Add tests for the new token type (see the sketch below)
4. Update documentation
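
For step 3, a unit test usually drives the new kind end to end. The sketch below uses a made-up `Dot` variant purely for illustration; it is not part of the current `TokenKind`:

```rust
// Hypothetical: assumes a new `TokenKind::Dot` variant for a `.` symbol
#[test]
fn tokenizes_dot_symbol() {
    let tokens = lexariel::tokenize(".").expect("`.` should tokenize");
    assert_eq!(tokens[0].kind, lexariel::TokenKind::Dot);
    assert_eq!(tokens[0].value, ".");
}
```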

### Parser Integration

When adding new syntax to Asmodeus:
1. Define tokens in Lexariel
2. Update parser in Parseid to handle new tokens
3. Add assembler support in Hephasm if needed

## 📜 License

This crate is part of the Asmodeus project and is licensed under the MIT License.

## 🔗 Related Components

- **[Parseid](../parseid/)** - Parser that consumes Lexariel tokens
- **[Shared](../shared/)** - Common types and utilities
- **[Main Asmodeus](../)** - Complete language toolchain