jzero-rs
A compiler + bytecode VM for Jzero — a small subset of Java — written entirely in Rust.
This project follows the book Build Your Own Programming Language (Edition 2) by Clinton L. Jeffery, replacing the original Java/Unicon tooling with idiomatic Rust equivalents.
Using as a library
Add to your Cargo.toml:
[]
= "0.1.0"
Then use the Compiler API:
use Compiler;
let out = new
.source
.run
.unwrap;
println!; // hello, jzero!
What is Jzero?
Jzero is a strict subset of Java designed for teaching compiler construction. Every valid Jzero program is also a valid Java program. It supports a minimal but complete set of features: classes, methods, control flow, basic types (int, double, bool, string), arrays, and simple I/O.
Roadmap
| Phase | Crate | Book Chapter | Status |
|---|---|---|---|
| Lexical Analysis | jzero-lexer |
Ch. 3 | ✅ Done |
| Parsing (accept/reject) | jzero-parser |
Ch. 4 | ✅ Done |
| Syntax Tree Construction | jzero-ast, jzero-parser |
Ch. 5 | ✅ Done |
| Symbol Tables | jzero-symtab, jzero-semantic |
Ch. 6 | ✅ Done |
| Type Checking (base types) | jzero-semantic |
Ch. 7 | ✅ Done |
| Type Checking (arrays, methods, structs) | jzero-semantic |
Ch. 8 | ✅ Done |
| Intermediate Code Generation | jzero-codegen |
Ch. 9 | ✅ Done |
| Bytecode Generation | jzero-codegen |
Ch. 13 | ✅ Done |
| Bytecode Interpreter / VM | jzero-vm |
Ch. 12 | ✅ Done |
| Operators & Built-in Functions | jzero-codegen, jzero-vm |
Ch. 15 | ✅ Done |
Chapters 10 (IDE/syntax coloring), 11 (transpiler), 14 (native code), 16 (domain control structures), and 17 (garbage collection) were intentionally skipped as detours unrelated to the core compiler pipeline or beyond the scope of Jzero.
Architecture
jzero-rs/
├── Cargo.toml
├── crates/
│ ├── jzero/ # Public facade crate (published to crates.io)
│ ├── jzero-lexer/ # Lexical analysis (Logos)
│ ├── jzero-parser/ # Parsing & syntax tree construction (LALRPOP)
│ ├── jzero-ast/ # Syntax tree data structures & DOT output
│ ├── jzero-symtab/ # Symbol table types (SymTab, SymTabEntry, TypeInfo)
│ ├── jzero-semantic/ # Symbol table construction & type checking
│ ├── jzero-codegen/ # TAC + bytecode generation
│ ├── jzero-vm/ # Bytecode interpreter + string pool
│ └── jzero-cli/ # CLI tool (j0, not published)
└── tests/
└── examples/
├── hello.java # Minimal hello-world
├── hello_loop.java # Loop + array + I/O
├── concat.java # String concatenation
├── countdown.java # While loop + argv.length
├── fizzbuzz.java # Nested if/else + modulo + String.valueOf
└── greet.java # String concatenation + String.valueOf
Clean one-way dependency chain:
jzero-lexer → jzero-symtab → jzero-ast → jzero-parser → jzero-semantic → jzero-codegen → jzero-vm → jzero
Tool Mapping
| Original (Book) | Rust Equivalent | Notes |
|---|---|---|
| JFlex | Logos | Derive-macro based, zero-copy |
| BYACC/J | LALRPOP | LR(1) with lane table algorithm |
tree.java |
jzero-ast crate |
Uniform Tree struct with DOT output |
symtab.java |
jzero-symtab crate |
Rust enum-based type hierarchy |
typeinfo.java + subclasses |
TypeInfo enum |
Base, Array, Method, Class variants |
address.java + tac.java |
jzero-codegen crate |
Address enum + Tac struct |
byc.java + j0machine.java |
jzero-codegen + jzero-vm |
Bytecode format + stack-machine VM |
stringpool.java |
StringPool in jzero-vm |
Runtime string interning for concatenation |
Building & Testing
# Build everything
# Run all tests (single-threaded to avoid global ID counter races)
# Parse a file and visualize the syntax tree
# Print TAC intermediate code (Chapter 9)
# Compile to bytecode and print assembler listing (Chapter 13)
# Compile and execute in the VM (Chapters 12+13)
# String concatenation example (Chapter 15)
End-to-End Examples
Loop + array access
// tests/hello_loop.java
Running j0 tests/examples/hello_loop.java --run a b c d e executes the program end-to-end:
hello, jzero!
hello, jzero!
hello, jzero!
hello, jzero!
no errors
String concatenation
// tests/concat.java
Running j0 tests/examples/concat.java --run produces:
hello, jzero!
no errors
Semantic Analysis (Chapters 6–8)
Symbol table construction (Ch. 6)
A two-pass tree traversal builds one SymTab per scope:
- First pass registers all field and method signatures, enabling forward references within a class.
- Second pass walks method bodies to insert parameters and local variables.
System.out.printlnis pre-registered in the global scope.- Each
Treenode gets itsstabfield set to the nearest enclosing scope (inherited, top-down).
Type checking (Ch. 7–8 + 15)
Three cooperative functions — check_type, check_kids, check_types — perform a selective post-order traversal of method bodies.
Key design decisions:
TypeInfoenum replaces the book'stypeinfoclass hierarchy —Base,Array,Method,Classvariants with exhaustive matching."return"dummy symbol — each method's stab gets a"return"entry with the declared return type.ReturnStmtnodes look it up directly.mkclspass — after symbol tables are fully built, a dedicated pass constructs a completeClassTypefor each class.- String concatenation —
String + Stringis allowed by the type checker and emitsSADDin the TAC layer.String - Stringis rejected.
Intermediate Code Generation (Chapter 9)
Five passes over the typed syntax tree produce TAC:
assign_addresses → genfirst → genfollow → gentargets → gencode → emit
Bytecode Generation + VM (Chapters 12 + 13)
Bytecode format
Each instruction is a fixed 8-byte word: [opcode:1][region:1][operand:6, little-endian].
The .j0 binary file layout:
Word 0: magic "Jzero!!\0"
Word 1: version "1.0\0\0\0\0\0"
Word 2: first-instruction word offset
Word 3…: data section (string literals, NUL-terminated)
Word N…: startup sequence + instructions
The startup sequence calls main with the real argc (number of CLI arguments passed after --run), so argv.length behaves correctly without hardcoding.
Instruction set (27 opcodes)
HALT NOOP ADD SUB MUL DIV MOD NEG PUSH POP CALL RETURN GOTO BIF
LT LE GT GE EQ NEQ LOCAL LOAD STORE
SPUSH SPOP SADD ITOS
SPUSH, SPOP, SADD were added in Chapter 15 for string operations.
ITOS converts an integer to a string pool key (String.valueOf).
TAC → bytecode translation
| TAC | Bytecode |
|---|---|
ADD op1,op2,op3 |
PUSH op2, PUSH op3, ADD, POP op1 |
ASN op1,op2 |
PUSH op2, POP op1 |
BGT label,op2,op3 |
PUSH op2, PUSH op3, GT, BIF label |
PARM arg + CALL fn,n |
PUSH fn_addr, PUSH arg, …, CALL n |
Proc (method entry) |
LOCAL n (pre-allocates local slots) |
SADD op1,op2,op3 |
SPUSH op2, SPUSH op3, SADD, SPOP op1 |
String.valueOf(x) |
ITOS op1,op2 (int → string pool key) |
Label resolution is two-pass: pass 1 records byte offsets, pass 2 patches branch targets. All GOTO/BIF targets are relocated by code_base_bytes to be absolute offsets from word 0.
VM calling convention
Saved registers (ip, bp, fn_slot) are kept in an off-stack call_stack: Vec<(usize, i64, i64)>, leaving the data stack clean for locals:
stack[bp+0] = fn_addr (loc:0)
stack[bp+1] = arg0 (loc:8)
stack[bp+2] = local0 (loc:16)
…
LOCAL n pre-allocates all local slots at function entry so expression temporaries never overwrite them.
String Operations (Chapter 15)
String concatenation (+) is implemented via a runtime StringPool:
SPUSH offset— reads a NUL-terminated string from the data section atoffset(or resolves a pool key if the value is negative), interns it in the pool, and pushes a negative integer key onto the stack.SADD— pops two pool keys, concatenates the strings, interns the result, and pushes the new key.SPOP dst— pops a pool key and stores it in the destination stack slot.ITOS— pops an integer, converts it to a decimal string viaString.valueOf, interns it in the pool, and pushes the pool key.do_println— resolves its argument viaresolve_string(), which handles both data-section offsets (≥ 0) and pool keys (< 0) transparently.
Pool keys are negative integers (-1, -2, …) so they are visually distinct from data-section offsets on the stack.
Parser Design Notes
Why LALRPOP over grmtools/lrpar? The original grammar has inherent LALR(1) ambiguities. grmtools resolved conflicts silently in ways that broke dotted method calls like System.out.println(...). LALRPOP's LR(1) lane table algorithm handles more grammars without conflicts, and its explicit conflict reporting made it easier to restructure the grammar correctly.
References
- Build Your Own Programming Language (Edition 2) — Clinton L. Jeffery
- Logos — fast lexer generator for Rust
- LALRPOP — LR(1) parser generator for Rust