# Lexariel
**Lexical Analyzer for Asmodeus Language**
```
┌───────────────────────────────────────────────────────────────┐
│                                                               │
│  ██╗     ███████╗██╗  ██╗ █████╗ ██████╗ ██╗███████╗██╗       │
│  ██║     ██╔════╝╚██╗██╔╝██╔══██╗██╔══██╗██║██╔════╝██║       │
│  ██║     █████╗   ╚███╔╝ ███████║██████╔╝██║█████╗  ██║       │
│  ██║     ██╔══╝   ██╔██╗ ██╔══██║██╔══██╗██║██╔══╝  ██║       │
│  ███████╗███████╗██╔╝ ██╗██║  ██║██║  ██║██║███████╗███████╗  │
│  ╚══════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝╚══════╝╚══════╝  │
│                                                               │
│                  Asmodeus Language Tokenizer                  │
└───────────────────────────────────────────────────────────────┘
```
**Lexariel** is the lexical analyzer (tokenizer) for the Asmodeus language. It converts raw source code into a stream of structured tokens that the parser can consume, and it is built with performance and error recovery in mind.
## 🎯 Features
### Core Tokenization
- **Complete Token Recognition**: All Asmodeus language constructs
- **Multiple Comment Styles**: Both `;` and `//` comment syntax
- **Number Format Support**: Decimal, hexadecimal (0x), and binary (0b)
- **Identifier Recognition**: Labels, instruction mnemonics, and symbols
- **Addressing Mode Tokens**: `#`, `[`, `]`, register names
- **Error Recovery**: Continues parsing after lexical errors
### Advanced Features
- **Position Tracking**: Line and column information for all tokens
- **Whitespace Handling**: Intelligent whitespace skipping
- **String Literals**: Support for quoted strings and character literals
- **Macro Keywords**: Recognition of `MAKRO`, `KONM` macro delimiters
- **Directive Support**: `RST`, `RPA` and other assembler directives
## 🚀 Quick Start
### Basic Usage
```rust
use lexariel::tokenize;
let source = r#"
start:
POB #42 ; Load immediate value
WYJSCIE // Output the value
STP ; Stop program
"#;
let tokens = tokenize(source)?;
for token in tokens {
println!("{:?}", token);
}
```
### Output Example
```
Token { kind: Identifier, value: "start", line: 2, column: 5 }
Token { kind: Colon, value: ":", line: 2, column: 10 }
Token { kind: Keyword, value: "POB", line: 3, column: 9 }
Token { kind: Hash, value: "#", line: 3, column: 13 }
Token { kind: Number, value: "42", line: 3, column: 14 }
Token { kind: Keyword, value: "WYJSCIE", line: 4, column: 9 }
Token { kind: Keyword, value: "STP", line: 5, column: 9 }
```
### Advanced Tokenization
```rust
use lexariel::{tokenize, TokenKind};

let source = r#"
    MAKRO add_numbers num1 num2
        POB num1
        DOD num2
        WYJSCIE
    KONM

    data_section:
        value1: RST 0x2A      ; Hex number
        value2: RST 0b101010  ; Binary number
        buffer: RPA
"#;

let tokens = tokenize(source)?;

// Filter only keywords
let keywords: Vec<_> = tokens.iter()
    .filter(|t| t.kind == TokenKind::Keyword)
    .collect();

// Count different token types
let mut counts = std::collections::HashMap::new();
for token in &tokens {
    *counts.entry(token.kind).or_insert(0) += 1;
}
```
## 📚 Token Types
### Core Token Types
| Token Kind | Description | Example(s) |
|------------|-------------|------------|
| `Keyword` | Assembly instructions and directives | `POB`, `DOD`, `STP`, `RST` |
| `Identifier` | User-defined names | `start`, `loop`, `data_value` |
| `Number` | Numeric literals | `42`, `0x2A`, `0b101010` |
| `Hash` | Immediate value prefix | `#` |
| `LeftBracket` | Indirect addressing start | `[` |
| `RightBracket` | Indirect addressing end | `]` |
| `Colon` | Label definition | `:` |
| `Comma` | Parameter separator | `,` |
| `Directive` | Assembler directives | `MAKRO`, `KONM` |
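The sketch below shows how these kinds can be matched once a token stream is in hand; the `describe` helper is illustrative only and not part of the crate:

```rust
use lexariel::{tokenize, TokenKind};

// Illustrative helper: maps each documented TokenKind to a short description.
fn describe(kind: TokenKind) -> &'static str {
    match kind {
        TokenKind::Keyword => "instruction or directive mnemonic",
        TokenKind::Identifier => "user-defined name",
        TokenKind::Number => "numeric literal",
        TokenKind::Hash => "immediate value prefix",
        TokenKind::LeftBracket | TokenKind::RightBracket => "indirect addressing bracket",
        TokenKind::Colon => "label definition",
        TokenKind::Comma => "parameter separator",
        TokenKind::Directive => "assembler directive",
        _ => "other",
    }
}

fn main() -> Result<(), lexariel::LexerError> {
    for token in tokenize("start: POB #42")? {
        println!("'{}' -> {}", token.value, describe(token.kind));
    }
    Ok(())
}
```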
### Number Format Support
```rust
// Decimal numbers
let tokens = tokenize("RST 42")?;
// Hexadecimal numbers
let tokens = tokenize("POB 0xFF")?;
// Binary numbers
let tokens = tokenize("DOD 0b1010")?;
// Negative numbers
let tokens = tokenize("RST -10")?;
```
### Comment Styles
```rust
// Semicolon comments (traditional assembly)
let source = r#"
POB value ; This is a comment
STP ; Stop the program
"#;
// C-style comments
let source = r#"
POB value // This is also a comment
STP // Stop the program
"#;
// Both styles can be mixed
let source = r#"
; Program header comment
start: // Entry point
POB #42 ; Load value
STP // End program
"#;
```
## 🔧 API Reference
### Main Functions
```rust
pub fn tokenize(input: &str) -> Result<Vec<Token>, LexerError>;
```
The primary entry point for tokenization. Takes source code and returns a vector of tokens or a lexer error.
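A minimal, self-contained sketch of driving `tokenize` from a binary (assuming `lexariel` is added as a dependency); it uses only the types documented in this section:

```rust
use lexariel::tokenize;

fn main() {
    let source = "start:\n    POB #42\n    STP";
    match tokenize(source) {
        // On success, print every token with its position.
        Ok(tokens) => {
            for token in tokens {
                println!("{:?} '{}' at {}:{}", token.kind, token.value, token.line, token.column);
            }
        }
        // On failure, the error's Display impl gives a human-readable message.
        Err(err) => eprintln!("lexing failed: {}", err),
    }
}
```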
### Core Types
```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub value: String,
    pub line: usize,
    pub column: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    // Literals
    Number,
    Identifier,

    // Keywords and Instructions
    Keyword,
    Directive,

    // Symbols
    Hash,         // #
    LeftBracket,  // [
    RightBracket, // ]
    Colon,        // :
    Comma,        // ,

    // Special
    Newline,
    Invalid,
}

#[derive(Debug, thiserror::Error)]
pub enum LexerError {
    #[error("Invalid character '{char}' at line {line}, column {column}")]
    InvalidCharacter { char: char, line: usize, column: usize },

    #[error("Unterminated string literal at line {line}")]
    UnterminatedString { line: usize },

    #[error("Invalid number format '{value}' at line {line}, column {column}")]
    InvalidNumberFormat { value: String, line: usize, column: usize },
}
```
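Because `LexerError` derives `thiserror::Error`, each variant's `#[error(...)]` attribute becomes its `Display` output. A small sketch, assuming the variants are constructible as declared above:

```rust
use lexariel::LexerError;

fn main() {
    let err = LexerError::InvalidCharacter { char: '@', line: 3, column: 9 };
    // Prints: Invalid character '@' at line 3, column 9
    println!("{}", err);
}
```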
### The `Lexer` Struct
For more control over tokenization:
```rust
use lexariel::Lexer;
let mut lexer = Lexer::new(source_code);
let tokens = lexer.tokenize()?;
// Access lexer state if needed
println!("Total lines processed: {}", lexer.current_line());
```
## 📖 Examples
### Basic Program Tokenization
```rust
use lexariel::tokenize;
let program = r#"
    ; Simple addition program
    start:
        POB first_num    ; Load first number
        DOD second_num   ; Add second number
        WYJSCIE          ; Output result
        STP              ; Stop

    first_num:  RST 25
    second_num: RST 17
"#;

let tokens = tokenize(program)?;

// Print all tokens with position info
for token in tokens {
    println!("{}:{} - {:?}: '{}'",
        token.line, token.column, token.kind, token.value);
}
```
### Macro Definition Tokenization
```rust
use lexariel::tokenize;

let macro_source = r#"
    MAKRO multiply_by_two value
        POB value
        DOD value
        WYJSCIE
    KONM

    start:
        multiply_by_two data_value
        STP

    data_value: RST 21
"#;

let tokens = tokenize(macro_source)?;

// Find macro boundaries
let macro_start = tokens.iter().position(|t| t.value == "MAKRO");
let macro_end = tokens.iter().position(|t| t.value == "KONM");

println!("Macro defined from token {} to {}",
    macro_start.unwrap(), macro_end.unwrap());
```
### Number Format Recognition
```rust
use lexariel::{tokenize, TokenKind};

let numbers_program = r#"
    decimal_val:  RST 42        ; Decimal
    hex_val:      RST 0x2A      ; Hexadecimal
    binary_val:   RST 0b101010  ; Binary
    negative_val: RST -10       ; Negative
"#;

let tokens = tokenize(numbers_program)?;

// Extract all numbers with their formats
for token in tokens {
    if token.kind == TokenKind::Number {
        println!("Number: '{}' at {}:{}",
            token.value, token.line, token.column);
    }
}
```
### Error Handling
```rust
use lexariel::{tokenize, LexerError};
// Source with lexical error
let bad_source = r#"
start:
POB @invalid_char ; @ is not valid
STP
"#;
match tokenize(bad_source) {
Ok(tokens) => println!("Tokenized successfully: {} tokens", tokens.len()),
Err(LexerError::InvalidCharacter { char, line, column }) => {
println!("Invalid character '{}' at line {}, column {}", char, line, column);
}
Err(e) => println!("Other lexer error: {}", e),
}
```
### Addressing Mode Recognition
```rust
use lexariel::{tokenize, TokenKind};

let addressing_examples = r#"
    ; Direct addressing
    POB value

    ; Immediate addressing
    POB #42

    ; Indirect addressing
    POB [address]

    ; Register addressing
    POB R1
"#;

let tokens = tokenize(addressing_examples)?;

// Find addressing mode indicators
for (i, token) in tokens.iter().enumerate() {
    match token.kind {
        TokenKind::Hash => println!("Immediate addressing at token {}", i),
        TokenKind::LeftBracket => println!("Indirect addressing at token {}", i),
        _ => {}
    }
}
```
## 🧪 Testing
### Unit Tests
```bash
cargo test -p lexariel
```
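A sketch of the kind of assertion these unit tests make, written against the public `tokenize` API; the test name and expected kinds follow the output example earlier in this README and are illustrative only:

```rust
use lexariel::{tokenize, TokenKind};

#[test]
fn tokenizes_label_and_immediate_operand() {
    let tokens = tokenize("start:\n    POB #42").expect("valid source should tokenize");
    let kinds: Vec<TokenKind> = tokens.iter().map(|t| t.kind).collect();

    // The stream should begin with the label: Identifier, then Colon.
    assert!(kinds.starts_with(&[TokenKind::Identifier, TokenKind::Colon]));

    // The instruction and its immediate operand should appear as well.
    assert!(kinds.contains(&TokenKind::Keyword)); // POB
    assert!(kinds.contains(&TokenKind::Hash));    // #
    assert!(kinds.contains(&TokenKind::Number));  // 42
}
```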
### Specific Test Categories
```bash
# Test basic tokenization
cargo test -p lexariel basic_tokenization
# Test number format recognition
cargo test -p lexariel number_tests
# Test comment handling
cargo test -p lexariel comment_tests
# Test error recovery
cargo test -p lexariel error_tests
```
### Integration Tests
```bash
cargo test -p lexariel --test integration_tests
```
## 🔍 Performance Characteristics
- **Speed**: ~1M lines per second tokenization
- **Memory**: O(n) where n is source length
- **Error Recovery**: Continues after lexical errors
- **Position Tracking**: Minimal overhead for line/column info
### Benchmarking
```rust
use lexariel::tokenize;
use std::time::Instant;
let large_source = include_str!("large_program.asmod");
let start = Instant::now();
let tokens = tokenize(large_source)?;
let duration = start.elapsed();
println!("Tokenized {} characters into {} tokens in {:?}",
large_source.len(), tokens.len(), duration);
```
## 🚫 Error Recovery
Lexariel is designed to continue tokenization even after encountering errors:
```rust
use lexariel::tokenize;

let source_with_errors = r#"
    start:
        POB #42   ; Valid
        @@@       ; Invalid characters
        STP       ; Valid again
"#;

// The lexer will report the error but continue tokenizing
match tokenize(source_with_errors) {
    Ok(tokens) => {
        // Still yields the valid tokens before and after the error
        println!("Got {} tokens despite errors", tokens.len());
    }
    Err(e) => {
        println!("First error encountered: {}", e);
        // In practice, you might want to collect all errors
    }
}
```
## 🔗 Integration with Asmodeus Pipeline
Lexariel is the first stage in the Asmodeus compilation pipeline:
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Source    │───▶│  Lexariel   │───▶│   Parseid   │
│    Code     │    │   (Lexer)   │    │  (Parser)   │
│  (.asmod)   │    │             │    │             │
└─────────────┘    └─────────────┘    └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │   Tokens    │
                   │             │
                   └─────────────┘
```
### Usage in Pipeline
```rust
use lexariel::tokenize;
use parseid::parse;
// Complete pipeline from source to AST
let source = "POB #42\nWYJSCIE\nSTP";
let tokens = tokenize(source)?; // Lexariel
let ast = parse(tokens)?; // Parseid
```
## 🎨 Token Visualization
For debugging and development:
```rust
use lexariel::{tokenize, TokenKind};
fn visualize_tokens(source: &str) -> Result<(), Box<dyn std::error::Error>> {
    let tokens = tokenize(source)?;

    println!("┌─────┬────────────┬─────────────┬──────────┐");
    println!("│ Pos │ Type       │ Value       │ Location │");
    println!("├─────┼────────────┼─────────────┼──────────┤");

    for (i, token) in tokens.iter().enumerate() {
        println!("│{:4} │{:11} │{:12} │ {:3}:{:<4} │",
            i,
            format!("{:?}", token.kind),
            format!("'{}'", token.value),
            token.line,
            token.column);
    }

    println!("└─────┴────────────┴─────────────┴──────────┘");
    Ok(())
}
```
## 🤝 Contributing
### Adding New Token Types
1. Add new variant to `TokenKind` enum
2. Update the lexer logic in `lexer.rs`
3. Add tests for the new token type
4. Update documentation
### Parser Integration
When adding new syntax to Asmodeus:
1. Define tokens in Lexariel
2. Update parser in Parseid to handle new tokens
3. Add assembler support in Hephasm if needed
## 📜 License
This crate is part of the Asmodeus project and is licensed under the MIT License.
## 🔗 Related Components
- **[Parseid](../parseid/)** - Parser that consumes Lexariel tokens
- **[Shared](../shared/)** - Common types and utilities
- **[Main Asmodeus](../)** - Complete language toolchain