jzero 0.1.2

A compiler and bytecode VM for the Jzero language (a subset of Java)
Documentation
# jzero-rs

[![Crates.io](https://img.shields.io/crates/v/jzero.svg)](https://crates.io/crates/jzero)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

A compiler + bytecode VM for **Jzero** — a small subset of Java — written entirely in Rust.

This project follows the book *Build Your Own Programming Language (Edition 2)* by Clinton L. Jeffery, replacing the original Java/Unicon tooling with idiomatic Rust equivalents.

## Using as a library

Add to your `Cargo.toml`:

```toml
[dependencies]
jzero = "0.1.0"
```

Then use the `Compiler` API:

```rust
use jzero::Compiler;

let out = Compiler::new()
    .source(r#"
        public class hello {
            public static void main(String argv[]) {
                System.out.println("hello, jzero!");
            }
        }
    "#)
    .run(&[])
    .unwrap();

println!("{}", out.stdout); // hello, jzero!
```

## What is Jzero?

Jzero is a strict subset of Java designed for teaching compiler construction. Every valid Jzero program is also a valid Java program. It supports a minimal but complete set of features: classes, methods, control flow, basic types (`int`, `double`, `bool`, `string`), arrays, and simple I/O.

## Roadmap

| Phase | Crate | Book Chapter | Status |
|---|---|---|---|
| Lexical Analysis | `jzero-lexer` | Ch. 3 | ✅ Done |
| Parsing (accept/reject) | `jzero-parser` | Ch. 4 | ✅ Done |
| Syntax Tree Construction | `jzero-ast`, `jzero-parser` | Ch. 5 | ✅ Done |
| Symbol Tables | `jzero-symtab`, `jzero-semantic` | Ch. 6 | ✅ Done |
| Type Checking (base types) | `jzero-semantic` | Ch. 7 | ✅ Done |
| Type Checking (arrays, methods, structs) | `jzero-semantic` | Ch. 8 | ✅ Done |
| Intermediate Code Generation | `jzero-codegen` | Ch. 9 | ✅ Done |
| Bytecode Generation | `jzero-codegen` | Ch. 13 | ✅ Done |
| Bytecode Interpreter / VM | `jzero-vm` | Ch. 12 | ✅ Done |
| Operators & Built-in Functions | `jzero-codegen`, `jzero-vm` | Ch. 15 | ✅ Done |

Chapters 10 (IDE/syntax coloring), 11 (transpiler), 14 (native code), 16 (domain control structures), and 17 (garbage collection) were intentionally skipped as detours unrelated to the core compiler pipeline or beyond the scope of Jzero.

## Architecture

```
jzero-rs/
├── Cargo.toml
├── crates/
│   ├── jzero/              # Public facade crate (published to crates.io)
│   ├── jzero-lexer/        # Lexical analysis (Logos)
│   ├── jzero-parser/       # Parsing & syntax tree construction (LALRPOP)
│   ├── jzero-ast/          # Syntax tree data structures & DOT output
│   ├── jzero-symtab/       # Symbol table types (SymTab, SymTabEntry, TypeInfo)
│   ├── jzero-semantic/     # Symbol table construction & type checking
│   ├── jzero-codegen/      # TAC + bytecode generation
│   ├── jzero-vm/           # Bytecode interpreter + string pool
│   └── jzero-cli/          # CLI tool (j0, not published)
└── tests/
    └── examples/
        ├── hello.java           # Minimal hello-world
        ├── hello_loop.java      # Loop + array + I/O
        ├── concat.java          # String concatenation
        ├── countdown.java       # While loop + argv.length
        ├── fizzbuzz.java        # Nested if/else + modulo + String.valueOf
        └── greet.java           # String concatenation + String.valueOf
```

Clean one-way dependency chain:

```
jzero-lexer → jzero-symtab → jzero-ast → jzero-parser → jzero-semantic → jzero-codegen → jzero-vm → jzero
```

## Tool Mapping

| Original (Book) | Rust Equivalent | Notes |
|---|---|---|
| JFlex | [Logos]https://github.com/maciejhirsz/logos | Derive-macro based, zero-copy |
| BYACC/J | [LALRPOP]https://github.com/lalrpop/lalrpop | LR(1) with lane table algorithm |
| `tree.java` | `jzero-ast` crate | Uniform `Tree` struct with DOT output |
| `symtab.java` | `jzero-symtab` crate | Rust enum-based type hierarchy |
| `typeinfo.java` + subclasses | `TypeInfo` enum | `Base`, `Array`, `Method`, `Class` variants |
| `address.java` + `tac.java` | `jzero-codegen` crate | `Address` enum + `Tac` struct |
| `byc.java` + `j0machine.java` | `jzero-codegen` + `jzero-vm` | Bytecode format + stack-machine VM |
| `stringpool.java` | `StringPool` in `jzero-vm` | Runtime string interning for concatenation |

## Building & Testing

```bash
# Build everything
cargo build

# Run all tests (single-threaded to avoid global ID counter races)
cargo test --workspace -- --test-threads=1

# Parse a file and visualize the syntax tree
cargo run --bin j0 -- tests/examples/hello.java --png

# Print TAC intermediate code (Chapter 9)
cargo run --bin j0 -- tests/examples/hello_loop.java --codegen

# Compile to bytecode and print assembler listing (Chapter 13)
cargo run --bin j0 -- tests/examples/hello_loop.java --bytecode

# Compile and execute in the VM (Chapters 12+13)
cargo run --bin j0 -- tests/examples/hello_loop.java --run a b c d e

# String concatenation example (Chapter 15)
cargo run --bin j0 -- tests/examples/concat.java --run
```

## End-to-End Examples

### Loop + array access

```java
// tests/hello_loop.java
public class hello_loop {
   public static void main(String argv[]) {
      int x;
      x = argv.length;
      x = x + 2;
      while (x > 3) {
         System.out.println("hello, jzero!");
         x = x - 1;
      }
   }
}
```

Running `j0 tests/examples/hello_loop.java --run a b c d e` executes the program end-to-end:

```
hello, jzero!
hello, jzero!
hello, jzero!
hello, jzero!
no errors
```

### String concatenation

```java
// tests/concat.java
public class concat {
    public static void main(String argv[]) {
        String s;
        s = "hello, " + "jzero!";
        System.out.println(s);
    }
}
```

Running `j0 tests/examples/concat.java --run` produces:

```
hello, jzero!
no errors
```

## Semantic Analysis (Chapters 6–8)

### Symbol table construction (Ch. 6)

A two-pass tree traversal builds one `SymTab` per scope:

- **First pass** registers all field and method signatures, enabling forward references within a class.
- **Second pass** walks method bodies to insert parameters and local variables.
- `System.out.println` is pre-registered in the global scope.
- Each `Tree` node gets its `stab` field set to the nearest enclosing scope (inherited, top-down).

### Type checking (Ch. 7–8 + 15)

Three cooperative functions — `check_type`, `check_kids`, `check_types` — perform a selective post-order traversal of method bodies.

Key design decisions:
- **`TypeInfo` enum** replaces the book's `typeinfo` class hierarchy — `Base`, `Array`, `Method`, `Class` variants with exhaustive matching.
- **`"return"` dummy symbol** — each method's stab gets a `"return"` entry with the declared return type. `ReturnStmt` nodes look it up directly.
- **`mkcls` pass** — after symbol tables are fully built, a dedicated pass constructs a complete `ClassType` for each class.
- **String concatenation**`String + String` is allowed by the type checker and emits `SADD` in the TAC layer. `String - String` is rejected.

## Intermediate Code Generation (Chapter 9)

Five passes over the typed syntax tree produce TAC:

```
assign_addresses → genfirst → genfollow → gentargets → gencode → emit
```

## Bytecode Generation + VM (Chapters 12 + 13)

### Bytecode format

Each instruction is a fixed 8-byte word: `[opcode:1][region:1][operand:6, little-endian]`.

The `.j0` binary file layout:
```
Word 0:   magic   "Jzero!!\0"
Word 1:   version "1.0\0\0\0\0\0"
Word 2:   first-instruction word offset
Word 3…:  data section (string literals, NUL-terminated)
Word N…:  startup sequence + instructions
```

The startup sequence calls `main` with the real `argc` (number of CLI arguments passed after `--run`), so `argv.length` behaves correctly without hardcoding.

### Instruction set (27 opcodes)

```
HALT NOOP ADD SUB MUL DIV MOD NEG PUSH POP CALL RETURN GOTO BIF
LT LE GT GE EQ NEQ LOCAL LOAD STORE
SPUSH SPOP SADD ITOS
```

`SPUSH`, `SPOP`, `SADD` were added in Chapter 15 for string operations.
`ITOS` converts an integer to a string pool key (`String.valueOf`).

### TAC → bytecode translation

| TAC | Bytecode |
|---|---|
| `ADD op1,op2,op3` | `PUSH op2, PUSH op3, ADD, POP op1` |
| `ASN op1,op2` | `PUSH op2, POP op1` |
| `BGT label,op2,op3` | `PUSH op2, PUSH op3, GT, BIF label` |
| `PARM arg` + `CALL fn,n` | `PUSH fn_addr, PUSH arg, …, CALL n` |
| `Proc` (method entry) | `LOCAL n` (pre-allocates local slots) |
| `SADD op1,op2,op3` | `SPUSH op2, SPUSH op3, SADD, SPOP op1` |
| `String.valueOf(x)` | `ITOS op1,op2` (int → string pool key) |

Label resolution is two-pass: pass 1 records byte offsets, pass 2 patches branch targets. All GOTO/BIF targets are relocated by `code_base_bytes` to be absolute offsets from word 0.

### VM calling convention

Saved registers `(ip, bp, fn_slot)` are kept in an off-stack `call_stack: Vec<(usize, i64, i64)>`, leaving the data stack clean for locals:

```
stack[bp+0] = fn_addr  (loc:0)
stack[bp+1] = arg0     (loc:8)
stack[bp+2] = local0   (loc:16)
…
```

`LOCAL n` pre-allocates all local slots at function entry so expression temporaries never overwrite them.

## String Operations (Chapter 15)

String concatenation (`+`) is implemented via a runtime `StringPool`:

- **`SPUSH offset`** — reads a NUL-terminated string from the data section at `offset` (or resolves a pool key if the value is negative), interns it in the pool, and pushes a negative integer key onto the stack.
- **`SADD`** — pops two pool keys, concatenates the strings, interns the result, and pushes the new key.
- **`SPOP dst`** — pops a pool key and stores it in the destination stack slot.
- **`ITOS`** — pops an integer, converts it to a decimal string via `String.valueOf`, interns it in the pool, and pushes the pool key.
- **`do_println`** — resolves its argument via `resolve_string()`, which handles both data-section offsets (≥ 0) and pool keys (< 0) transparently.

Pool keys are negative integers (`-1, -2, …`) so they are visually distinct from data-section offsets on the stack.

## Parser Design Notes

**Why LALRPOP over grmtools/lrpar?** The original grammar has inherent LALR(1) ambiguities. grmtools resolved conflicts silently in ways that broke dotted method calls like `System.out.println(...)`. LALRPOP's LR(1) lane table algorithm handles more grammars without conflicts, and its explicit conflict reporting made it easier to restructure the grammar correctly.

## References

- [*Build Your Own Programming Language (Edition 2)*]https://a.co/d/0hHvJYWA — Clinton L. Jeffery
- [Logos]https://github.com/maciejhirsz/logos — fast lexer generator for Rust
- [LALRPOP]https://github.com/lalrpop/lalrpop — LR(1) parser generator for Rust