cljrs-reader 0.1.43

Lexer and parser producing Form AST nodes for clojurust

cljrs-reader

Lexer (tokenizer) and recursive-descent parser for the clojurust language. Turns raw source text into a Form AST that the evaluator and compiler consume.

Phase: 2 — lexer and parser fully implemented.


File layout

src/
  lib.rs      — module declarations and re-exports
  token.rs    — Token enum: one variant per Clojure lexical form
  lexer.rs    — Lexer struct: byte-oriented, UTF-8-safe tokenizer
  form.rs     — Form struct + FormKind enum: the reader AST
  parser.rs   — Parser struct: recursive-descent parser + Iterator impl

Public API

token::Token

Every distinct lexical form the reader can produce:

Variant                Clojure source          Notes
Nil                    nil
Bool(bool)             true / false
Int(i64)               42, -7, 16rFF, 2r1010   decimal or radix literal that fits i64
BigInt(String)         42N, overflowing radix  decimal digits; sign included when negative
Float(f64)             3.14, 1e10, 1.5e-3
BigDecimal(String)     3.14M                   raw text without trailing M
Ratio(String)          3/4, -1/2               full text including /
Char(char)             \a, \newline, \u0041    named chars and \uXXXX resolved
Str(String)            "hello\n"               escape sequences fully processed
Symbol(String)         foo, ns/name, /, ..
Keyword(String)        :foo, :ns/name          leading : stripped
AutoKeyword(String)    ::foo, ::ns/alias       leading :: stripped
LParen / RParen        ( / )
LBracket / RBracket    [ / ]
LBrace / RBrace        { / }
Quote                  '
SyntaxQuote            `
Unquote                ~
UnquoteSplice          ~@
Deref                  @
Meta                   ^
HashFn                 #(
HashSet                #{
HashVar                #'
HashDiscard            #_
Regex(String)          #"[a-z]+"               raw pattern; no escape processing
ReaderCond             #?
ReaderCondSplice       #?@
Symbolic(String)       ##Inf, ##NaN            stores suffix after ##
TaggedLiteral(String)  #inst, #uuid            stores tag name without #
Eof                                            end-of-file sentinel

lexer::Lexer

A byte-oriented, UTF-8-safe tokenizer. Tracks byte position, 1-based line, and 1-based byte column so every token carries a precise Span.
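
The position bookkeeping described here can be pictured with a small sketch. The `Pos` type and its `advance` method are hypothetical names for illustration only; the crate's actual `Span` comes from cljrs-types and its fields are not shown in this README.

```rust
/// Hypothetical position record: byte offset, 1-based line, 1-based byte column.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Pos {
    offset: usize,
    line: u32,
    col: u32,
}

impl Pos {
    /// Position of the first byte of a file.
    fn start() -> Self {
        Pos { offset: 0, line: 1, col: 1 }
    }

    /// Account for one consumed byte. Columns count bytes, not characters,
    /// so a multi-byte UTF-8 character advances the column more than once.
    fn advance(&mut self, byte: u8) {
        self.offset += 1;
        if byte == b'\n' {
            self.line += 1;
            self.col = 1;
        } else {
            self.col += 1;
        }
    }
}
```

Counting columns in bytes keeps the bookkeeping cheap and lets a span map directly onto byte offsets in the source text.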

pub struct Lexer { /* private */ }

impl Lexer {
    /// Create a new lexer for `source` text from `file` (path or `"<repl>"`).
    pub fn new(source: String, file: String) -> Self

    /// Return the next `(Token, Span)` pair.
    /// Returns `Ok((Token::Eof, _))` at end of input.
    /// Returns `Err(CljxError::ReadError { … })` on invalid input.
    pub fn next_token(&mut self) -> CljxResult<(Token, Span)>

    pub fn source(&self) -> &Arc<String>
    pub fn file(&self) -> &Arc<String>
}

impl Iterator for Lexer {
    type Item = CljxResult<(Token, Span)>;
    // Yields None when next_token returns Token::Eof.
}
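
As a rough illustration of the Eof-to-None mapping noted in the comment above, here is a self-contained sketch. `ToyLexer` and the reduced `Token` enum are stand-ins invented for this example, and `String` replaces the crate's error type; none of this is the crate's code.

```rust
/// Reduced stand-in for the crate's Token enum.
#[derive(Debug, PartialEq)]
enum Token {
    Int(i64),
    Eof,
}

/// Toy lexer that replays a fixed token list, then reports Eof forever.
struct ToyLexer {
    tokens: std::vec::IntoIter<Token>,
}

impl ToyLexer {
    fn new(tokens: Vec<Token>) -> Self {
        ToyLexer { tokens: tokens.into_iter() }
    }

    /// Mirrors the shape of `next_token`: Eof is a normal Ok value.
    fn next_token(&mut self) -> Result<Token, String> {
        Ok(self.tokens.next().unwrap_or(Token::Eof))
    }
}

impl Iterator for ToyLexer {
    type Item = Result<Token, String>;

    fn next(&mut self) -> Option<Self::Item> {
        match self.next_token() {
            Ok(Token::Eof) => None, // end of input ends the iteration
            other => Some(other),   // tokens and errors are both yielded
        }
    }
}
```

With this shape, callers that want the explicit sentinel use `next_token`, while `for` loops and iterator adapters never see `Eof`.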

Whitespace and comment handling

  • ASCII spaces, tabs, carriage returns, newlines, and commas are skipped.
  • ; through end-of-line is a line comment.
  • #! at the very start of the file (byte offset 0) is a shebang; the rest of that line is skipped.
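
The three rules above can be sketched as a single skipping routine. This is a minimal illustration under the stated rules, not the crate's implementation; `skip_trivia` and `skip_line` are hypothetical names.

```rust
/// Returns the byte offset of the first significant byte at or after `pos`.
fn skip_trivia(src: &[u8], mut pos: usize) -> usize {
    // A shebang only counts at byte offset 0.
    if pos == 0 && src.starts_with(b"#!") {
        pos = skip_line(src, pos);
    }
    while pos < src.len() {
        match src[pos] {
            // Commas are whitespace, as in Clojure.
            b' ' | b'\t' | b'\r' | b'\n' | b',' => pos += 1,
            // Line comment: skip through the end of the line.
            b';' => pos = skip_line(src, pos),
            _ => break,
        }
    }
    pos
}

/// Advances to the newline that ends the current line, or to end of input.
fn skip_line(src: &[u8], mut pos: usize) -> usize {
    while pos < src.len() && src[pos] != b'\n' {
        pos += 1;
    }
    pos
}
```

Because every skipped byte is ASCII, the routine never stops in the middle of a multi-byte UTF-8 character.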

Number parsing rules

  • +/- are only routed to the number path when immediately followed by an ASCII digit; otherwise they lex as symbols.
  • 3/foo lexes as Int(3) then Symbol("/foo"), not a ratio — the / is only consumed as part of a ratio when the character immediately after it is a digit.
  • Radix literals: NNrDIGITS where NN is 2–36. Overflow of i64 yields BigInt.
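
A sketch of the sign-routing and radix rules, for illustration only. `starts_number`, `parse_radix`, `to_decimal`, and the `Number` enum are hypothetical names, and accepting a sign in front of the radix prefix is an assumption this README does not confirm. The overflow branch converts to decimal digits to match the BigInt row of the token table.

```rust
#[derive(Debug, PartialEq)]
enum Number {
    Int(i64),
    BigInt(String), // decimal digits; sign included when negative
}

/// A leading +/- starts a number only when an ASCII digit follows.
fn starts_number(bytes: &[u8]) -> bool {
    match bytes {
        [b'+' | b'-', d, ..] => d.is_ascii_digit(),
        [d, ..] => d.is_ascii_digit(),
        [] => false,
    }
}

/// Parses `NNrDIGITS` with NN in 2..=36; None if the text is malformed.
/// Assumption: an optional sign may precede the radix prefix.
fn parse_radix(text: &str) -> Option<Number> {
    let (negative, body) = match text.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, text.strip_prefix('+').unwrap_or(text)),
    };
    let (radix_text, digits) = body.split_once('r')?;
    if radix_text.is_empty() || !radix_text.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let radix: u32 = radix_text.parse().ok()?;
    if !(2..=36).contains(&radix) || digits.is_empty() {
        return None;
    }
    if !digits.chars().all(|c| c.is_digit(radix)) {
        return None;
    }
    let signed = if negative { format!("-{digits}") } else { digits.to_string() };
    match i64::from_str_radix(&signed, radix) {
        Ok(n) => Some(Number::Int(n)),
        // Digits were validated, so the only failure left is overflow.
        Err(_) => {
            let dec = to_decimal(digits, radix);
            Some(Number::BigInt(if negative { format!("-{dec}") } else { dec }))
        }
    }
}

/// Schoolbook base conversion: multiply-and-add over little-endian decimal digits.
fn to_decimal(digits: &str, radix: u32) -> String {
    let mut dec: Vec<u32> = vec![0];
    for c in digits.chars() {
        let mut carry = c.to_digit(radix).expect("digit validated by caller");
        for d in dec.iter_mut() {
            let v = *d * radix + carry;
            *d = v % 10;
            carry = v / 10;
        }
        while carry > 0 {
            dec.push(carry % 10);
            carry /= 10;
        }
    }
    dec.iter().rev().map(|d| char::from_digit(*d, 10).unwrap()).collect()
}
```

For example, `parse_radix("16rFF")` gives `Some(Number::Int(255))`, while a literal too large for i64 comes back as `Number::BigInt` holding its decimal digits.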

form::Form / form::FormKind

The reader AST. Every Form carries a Span for diagnostics.

PartialEq on Form ignores spans — equality tests compare only FormKind.

#[derive(Debug, Clone)]
pub struct Form {
    pub kind: FormKind,
    pub span: Span,
}

impl Form {
    pub fn new(kind: FormKind, span: Span) -> Self
}

#[derive(Debug, Clone, PartialEq)]
pub enum FormKind {
    // Atoms
    Nil,
    Bool(bool),
    Int(i64),
    BigInt(String),
    Float(f64),
    BigDecimal(String),
    Ratio(String),
    Char(char),
    Str(String),
    Regex(String),
    Symbolic(f64),        // ##Inf→INFINITY  ##-Inf→NEG_INFINITY  ##NaN→NAN

    // Identifiers
    Symbol(String),
    Keyword(String),
    AutoKeyword(String),

    // Collections
    List(Vec<Form>),
    Vector(Vec<Form>),
    Map(Vec<Form>),       // flat key/value pairs; length always even
    Set(Vec<Form>),

    // Wrapping reader macros
    Quote(Box<Form>),
    SyntaxQuote(Box<Form>),
    Unquote(Box<Form>),
    UnquoteSplice(Box<Form>),
    Deref(Box<Form>),
    Var(Box<Form>),                      // #'symbol
    Meta(Box<Form>, Box<Form>),          // raw meta-form, annotated-form

    // Dispatch forms
    AnonFn(Vec<Form>),                   // #(...)
    TaggedLiteral(String, Box<Form>),    // #tag form

    // Reader conditionals — all branches kept; evaluator filters by :rust
    // clauses is flat: [keyword, form, keyword, form, …]
    ReaderCond { splicing: bool, clauses: Vec<Form> },
}
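
The span-insensitive equality mentioned above could be written along these lines. This is a reduced sketch: `Span` here is a hypothetical stand-in for the cljrs-types type, and `FormKind` is trimmed to two variants.

```rust
/// Hypothetical stand-in for the Span type from cljrs-types.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct Span {
    start: usize,
    end: usize,
}

/// Trimmed-down FormKind; the derived PartialEq recurses into child Forms.
#[derive(Debug, Clone, PartialEq)]
enum FormKind {
    Int(i64),
    Vector(Vec<Form>),
}

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct Form {
    kind: FormKind,
    span: Span,
}

/// Manual impl: two forms are equal when their kinds are equal,
/// regardless of where they appeared in the source.
impl PartialEq for Form {
    fn eq(&self, other: &Self) -> bool {
        self.kind == other.kind
    }
}
```

Since nested forms go through the same impl, a test can compare a parsed tree against a hand-built one without reproducing any source positions.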

parser::Parser

A recursive-descent parser that consumes (Token, Span) pairs from a Lexer and produces Form nodes.

pub struct Parser { /* private */ }

impl Parser {
    /// Create a parser for `source` text labelled with `file`.
    pub fn new(source: String, file: String) -> Self

    /// Return the next form (skipping `#_` discards).
    /// Returns `Ok(None)` at EOF.
    pub fn parse_one(&mut self) -> CljxResult<Option<Form>>

    /// Parse all forms until EOF.
    pub fn parse_all(&mut self) -> CljxResult<Vec<Form>>
}

impl Iterator for Parser {
    type Item = CljxResult<Form>;
    // Yields None at EOF, Err on parse error.
}

#_ discard semantics

#_ consumes itself plus the next form and produces nothing. Discards can be chained: [#_ #_ 1 2 3] → [2, 3] (the outer #_ discards the #_ 1 group, leaving 2 and 3).
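
A toy model of the chaining rule exactly as this section states it, restricted to integer atoms. `Tok`, `read_items`, and `skip_discard` are hypothetical names, and the real parser works on full forms rather than a flat token slice.

```rust
/// Minimal token set for the model: `#_` and integer atoms.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Discard, // #_
    Int(i64),
}

/// Collects the atoms that survive discard processing.
fn read_items(tokens: &[Tok]) -> Vec<i64> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < tokens.len() {
        match &tokens[pos] {
            Tok::Discard => pos = skip_discard(tokens, pos),
            Tok::Int(n) => {
                out.push(*n);
                pos += 1;
            }
        }
    }
    out
}

/// `pos` points at a `#_`. Returns the position just past the discarded unit.
fn skip_discard(tokens: &[Tok], pos: usize) -> usize {
    match tokens.get(pos + 1) {
        // A nested `#_ form` group is the unit the outer `#_` discards.
        Some(Tok::Discard) => skip_discard(tokens, pos + 1),
        // Plain atom: drop the `#_` and the atom.
        Some(Tok::Int(_)) => pos + 2,
        // Dangling `#_`; a real parser would report an error here.
        None => tokens.len(),
    }
}
```

Feeding the model the tokens for `#_ #_ 1 2 3` leaves 2 and 3, matching the chained example above.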

Reader conditionals

All branches of #?(…) and #?@(…) are parsed and stored as FormKind::ReaderCond { splicing, clauses } with a flat clauses vec. The evaluator is responsible for filtering by :rust.
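
A sketch of how a consumer might pick the :rust branch out of the flat clauses vec. The reduced `Form` enum and `select_rust` are hypothetical; splicing and the evaluator's real selection logic are not modeled.

```rust
/// Reduced stand-in for reader forms; Keyword stores the name without ':'.
#[derive(Debug, Clone, PartialEq)]
enum Form {
    Keyword(String),
    Int(i64),
}

/// Walks the flat [keyword, form, keyword, form, …] list in pairs and
/// returns the form paired with `:rust`, if any.
fn select_rust(clauses: &[Form]) -> Option<&Form> {
    clauses.chunks_exact(2).find_map(|pair| match &pair[0] {
        Form::Keyword(name) if name == "rust" => Some(&pair[1]),
        _ => None,
    })
}
```

For `#?(:clj 1 :rust 2)` the clauses would be `[:clj, 1, :rust, 2]`, and `select_rust` returns the form `2`. Keeping every branch in the AST means tools other than the evaluator can still inspect the branches for other platforms.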


Error construction

On any read or parse error the crate produces a CljxError::ReadError containing the offending Span and the full source text, which miette uses to render a pointed diagnostic in the terminal.


Re-exports from lib.rs

pub use form::{Form, FormKind};
pub use lexer::Lexer;
pub use parser::Parser;
pub use token::Token;

Dependencies

Crate                    Role
cljrs-types (workspace)  Span, CljxError, CljxResult
miette (workspace)       NamedSource used in error construction