runmat-lexer 0.2.2

# RunMat Lexer

This crate tokenizes MATLAB/Octave source code into a stream of tokens for the parser.
It uses the `logos` library to define a fast, zero-copy DFA with a small amount of
context via `LexerExtras` to handle MATLAB-specific ambiguities.

## Design goals

- Correct tokenization for the full MATLAB language surface
- Minimal, explicit state for disambiguation (apostrophe transpose vs string, section markers, etc.)
- Compatibility with the rest of the toolchain (parser, HIR, interpreter, JIT)
- Predictable tokens: avoid over-encoding semantics at the lexing stage

## Context-aware lexing

We track two pieces of context in `LexerExtras`:

- `last_was_value: bool` — true if the previous emitted token forms a value.
  Used to disambiguate `'` as transpose vs string start.
- `line_start: bool` — true if we are at the beginning of a logical line.
  Used for `%%` section markers.
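
As a rough illustration of how these two flags might evolve, here is a minimal, dependency-free sketch. The `TokenKind` enum, the `observe` method, and the `apostrophe_is_transpose` helper are hypothetical names invented for this example; the crate's actual token type and update logic may differ.

```rust
/// Hypothetical subset of token kinds, just enough for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    Ident,
    Number,
    RParen,
    RBracket,
    RBrace,
    Assign,
    Comma,
    Newline,
}

/// The two context flags described above.
#[derive(Debug, Clone, Copy, PartialEq)]
struct LexerExtras {
    last_was_value: bool,
    line_start: bool,
}

impl LexerExtras {
    /// Lexing begins at the start of a line, with no preceding value.
    fn new() -> Self {
        LexerExtras { last_was_value: false, line_start: true }
    }

    /// Update both flags after a token has been emitted (assumed rule set).
    fn observe(&mut self, tok: TokenKind) {
        // Value-forming tokens: identifiers, numbers, and closing delimiters.
        self.last_was_value = matches!(
            tok,
            TokenKind::Ident
                | TokenKind::Number
                | TokenKind::RParen
                | TokenKind::RBracket
                | TokenKind::RBrace
        );
        // Only a newline puts the lexer back at the start of a line.
        self.line_start = tok == TokenKind::Newline;
    }

    /// An apostrophe is a transpose only when it directly follows a value.
    fn apostrophe_is_transpose(&self) -> bool {
        self.last_was_value
    }
}
```

With this rule set, the apostrophe in `a'` reads as a transpose because an identifier was just observed, while the apostrophe in `x = 'abc'` follows `=` and therefore starts a character array.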

## Tokens overview

- Keywords: `function if elseif else for while break continue return end`
- Additional keywords: `switch case otherwise try catch global persistent true false`
- OOP keywords: `classdef properties methods events enumeration arguments`
- Import: `import`
- Identifiers: `[A-Za-z_][A-Za-z0-9_]*`
- Numbers: integers and floats with optional exponents
- Strings:
  - Single-quoted character arrays: `'...'` with doubled quotes `''` inside
  - Double-quoted string scalars: `"..."` with doubled quotes `""` inside
- Operators and punctuation:
  - Arithmetic: `+ - * / \ ^`
  - Element-wise: `.* ./ .\ .^`
  - Relational: `== ~= < <= > >=`
  - Logical: `&& || & | ~`
  - Transpose: `'` (contextual)
  - Colon: `:`
  - Dotted member access: `.`
  - Function handle/anonymous: `@`
  - Meta-class query: `?` (e.g., `?MyClass`)
  - Assignment and separators: `= , ;`
  - Grouping and containers: `() [] {}`
- Comments & layout:
  - Line comment: `%` to end of line
  - Section marker: `%%` at start of line
  - Block comment: `%{ ... %}` (non-nesting)
  - Line continuation: `...` (skips remainder of physical line)
  - Newlines set `line_start` back to `true`
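
The doubled-quote rule for string literals listed above can be sketched with a small hand-written scanner. This is not how the crate does it (the crate uses a `logos` regex); `scan_quoted` is a hypothetical helper, and stopping at a newline is an assumption of this sketch. The decoded contents are returned only to show what the doubled quotes mean; a zero-copy lexer would just record the span.

```rust
/// Scan a quoted literal whose opening quote (`'` or `"`) sits at byte
/// offset `start`. A doubled quote inside the literal stands for one quote
/// character. Returns the decoded contents and the offset just past the
/// closing quote, or `None` if the literal is unterminated.
fn scan_quoted(src: &str, start: usize) -> Option<(String, usize)> {
    let bytes = src.as_bytes();
    let quote = *bytes.get(start)?;
    if quote != b'\'' && quote != b'"' {
        return None;
    }
    let mut out = String::new();
    let mut i = start + 1;
    let mut seg = i; // start of the current run of ordinary characters
    while i < bytes.len() {
        let b = bytes[i];
        if b == b'\n' {
            // Assumption: a literal may not span physical lines.
            return None;
        }
        if b == quote {
            out.push_str(&src[seg..i]);
            if bytes.get(i + 1) == Some(&quote) {
                // Doubled quote: keep one quote and continue scanning.
                out.push(quote as char);
                i += 2;
                seg = i;
                continue;
            }
            return Some((out, i + 1));
        }
        i += 1;
    }
    None
}
```

For example, scanning `'it''s'` from offset 0 yields the contents `it's` and an end offset of 7.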

## Notable disambiguations

- Apostrophe `'`:
  - If the previous token was a value (identifier, number, `) ] }`), emit `Transpose`
  - Otherwise, let the string regex capture a full single-quoted character array
- Section `%%`:
  - Only emitted when `line_start == true`; otherwise `%` starts a normal line comment
- Line continuation `...`:
  - Emits `Ellipsis` and consumes the remainder of the physical line, including any `%` comment following it
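
The line-continuation rule can be illustrated with a short sketch. `resume_after_ellipsis` is a hypothetical helper written for this example, not a function from the crate, and it assumes that the terminating newline is consumed along with the rest of the physical line.

```rust
/// Given the byte offset `pos` of a `...` marker in `src`, return the
/// offset where lexing would resume: just past the newline that ends the
/// physical line, or the end of input if no newline follows.
/// Everything in between, including any `%` comment, is discarded.
fn resume_after_ellipsis(src: &str, pos: usize) -> Option<usize> {
    let rest = src.get(pos..)?;
    if !rest.starts_with("...") {
        return None; // `pos` does not point at a continuation marker
    }
    match rest.find('\n') {
        Some(nl) => Some(pos + nl + 1),
        None => Some(src.len()),
    }
}
```

Applied to the two-line input `x = 1 + ... % first half` followed by `    2;`, lexing would resume at the start of the second line, so the comment never reaches the token stream.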

## Non-goals at lexing time

The lexer purposefully does not encode high-level semantics:
- Integer class names like `int8`/`uint64` are identifiers
- Special variables like `varargin`/`varargout`/`ans` are identifiers
- OOP features (`handle` inheritance, method attributes) are parsed/handled later
- Command/function syntax duality is resolved in parsing/semantic phases

## Tests

See `tests/` for comprehensive coverage, organized by topic:
- `lexer.rs`: core tokens, operators, single-quoted strings, comments, ellipsis
- `transpose.rs`: detailed diagnostics and assertions for apostrophe (`'`) transpose cases
- `comments_continuation.rs`: `%` line comments, `%{...%}` block comments, `%%` section markers, `...` continuation
- `operators.rs`: logical and element-wise operators (e.g., `.* ./ .\ .^ && || & | ~`)
- `namespaces.rs`: `import` paths (including wildcard) and metaclass `?ClassName`
- `oop_tokens.rs`: OOP keywords (`classdef`, `properties`, `methods`, `events`, `enumeration`, `arguments`) and function handles `@`
- `strings_chars.rs`: double-quoted string scalars and apostrophe disambiguation exercises
- `tokens_basic.rs`: identifiers, numbers, separators (`; ,`), and simple keyword smoke tests

All lexer tests pass when the crate's tests are run on their own (for example, `cargo test -p runmat-lexer`).

## Guidelines for extending the lexer

- Prefer adding new tokens only when lexical distinctions are required.
- When in doubt, keep ambiguous terms as identifiers and resolve in the parser.
- If you need context to disambiguate, add a boolean/flag in `LexerExtras` and
  use a Logos callback to `Emit` or `Skip` appropriately.
- Keep regular expressions simple (no look-around) and rely on token priority
  and callbacks for precedence and control.
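
The flag-plus-callback pattern can be sketched without any dependencies. The `Decision` enum below is a local stand-in for the `Emit`/`Skip` result mentioned above, not the Logos type, and `Extras` and `on_double_percent` are hypothetical names. The sketch also assumes that a `%%` match away from the line start is simply dropped as comment text; the crate may handle that case differently.

```rust
/// Local stand-in for the emit-or-skip result of a lexer callback.
#[derive(Debug, PartialEq)]
enum Decision {
    Emit,
    Skip,
}

/// Hypothetical extras carrying the one flag this callback needs.
struct Extras {
    line_start: bool,
}

/// Callback for a `%%` match: emit a section marker only when the match
/// begins a line; otherwise treat it as ordinary comment text and skip it.
fn on_double_percent(extras: &mut Extras) -> Decision {
    let decision = if extras.line_start {
        Decision::Emit
    } else {
        Decision::Skip
    };
    // Whatever was matched, the lexer is no longer at the start of the line.
    extras.line_start = false;
    decision
}
```

Keeping the decision in a callback means the regular expression for `%%` stays simple, and the context check lives in one small function that reads and updates the flag.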

## Known compatibility notes

- Non-conjugate transpose `.'` is tokenized as `Dot` then `Transpose`.
  The parser should interpret this pair as the non-conjugating transpose.
- Block comments `%{...%}` are treated as non-nesting by design.
- Error recovery keeps the lexer producing useful tokens after invalid input. In recovery mode,
  double-quoted strings are recognized as a single `Str` token, while malformed single-quoted
  sequences may be split so that downstream stages can report the error.
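
To make the `.'` note concrete, here is one way a downstream stage could fold the `Dot`, `Transpose` pair into a single operator. The `Tok` enum and `fold_dot_transpose` are hypothetical and exist only for this sketch; the real parser works on the crate's own token type and would presumably also check that the two tokens are adjacent in the source, which this sketch ignores.

```rust
/// Hypothetical token type for this sketch only.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Ident(String),
    Dot,
    Transpose,
    /// Produced by folding; the lexer itself never emits this.
    NonConjTranspose,
}

/// Replace every `Dot` immediately followed by `Transpose` with a single
/// `NonConjTranspose`, leaving all other tokens untouched.
fn fold_dot_transpose(tokens: &[Tok]) -> Vec<Tok> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if tokens[i] == Tok::Dot && tokens.get(i + 1) == Some(&Tok::Transpose) {
            out.push(Tok::NonConjTranspose);
            i += 2; // consume both tokens of the pair
        } else {
            out.push(tokens[i].clone());
            i += 1;
        }
    }
    out
}
```

Under these assumptions, the tokens for `a.'` fold to an identifier followed by `NonConjTranspose`, while member access followed by a transpose, as in `s.f'`, passes through unchanged.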

## Remaining edges

- Apostrophe vs string: extreme adjacency cases across `...` continuation and `%` comments are covered by tests. A few rare permutations may still be added as test seeds; none of them affect parser semantics.
- Block comments are intentionally non-nesting; any future change would be a parser/runtime decision, not lexing.
- Command form is resolved in the parser; the lexer's role is complete for this milestone.

## Crate integration

- This crate only produces tokens; it does not attempt to validate grammar.
- Downstream crates (`runmat-parser`, `runmat-hir`, `runmat-ignition`, `runmat-turbine`) are responsible for structure and semantics.