# H2 Language Specification v0.5.3
## 1. Overview
H2 Language is a domain-specific language for robot control, inspired by HOJ (Herbert Online Judge) H language,
with extensions for multi-agent support and execution limits.
**Design Philosophy**:
- Simple, unambiguous syntax
- Predictable type system with compile-time checking
- Safe execution with configurable limits
---
## 2. Lexical Structure
### 2.1 Character Sets
```
letter_lower ::= 'a'..'z'
letter_upper ::= 'A'..'Z'
digit ::= '0'..'9'
parameter ::= letter_upper (* A-Z *)
```
### 2.2 Tokens
| `COMMAND` | `s \| r \| l` | Basic movement commands |
| `IDENT` | `[a-z]` except `s,r,l` | Function identifiers |
| `PARAM` | `[A-Z]` | Function parameters |
| `NUMBER` | `[0-9]+` | Non-negative integer literals |
| `AGENT_ID` | `[0-9]+` at line start **immediately** followed by `:` | Agent identifier |
| Symbols | `: ( ) , + -` | Punctuation |
| `NEWLINE` | `\n \| \r\n` | Line terminator |
| `SPACE` | `[ \t]+` | Whitespace (H2 extension) |
| `COMMENT` | `#...` or `//...` | Comment to end of line (H2 extension) |
### 2.3 Lexical Rules
- **Longest match**: `12` is read as a single `NUMBER(12)` token.
- **AGENT_ID**: Valid only at line start AND **immediately** followed by `:` (no spaces).
- `0:` → `AGENT_ID(0)`
- `0 :` → `NUMBER(0)`, `SPACE`, `COLON` (NOT AgentId)
- **SPACE**: Whitespace between tokens is allowed (H2 extension, not HOJ-compatible).
- **COMMENT**: `#` or `//` starts a comment until end of line (H2 extension).
---
## 3. Syntax Grammar (EBNF)
### 3.1 Program Structure
```ebnf
program ::= directives agent_content
directives ::= (SPACE? directive NEWLINE)*
directive ::= directive_name '=' directive_value
agent_block ::= agent (NEWLINE agent)*
agent ::= AGENT_ID agent_body
single_agent_block ::= agent_body (* treated as agent 0 *)
```
### 3.2 Agent Body (H2 Mode)
```ebnf
(* H2 Mode: definitions and expressions can be mixed *)
agent_body ::= statement*
definition ::= func_def
func_def ::= IDENT '(' param_list? ')' ':' expression
| IDENT ':' expression
param_list ::= PARAM (',' PARAM)*
```
**H2 Mode Features**:
- Definitions and expressions can be **mixed** on the same line or across lines.
- No "last line = main" constraint.
- `f():ss` (0-arg with parentheses) is valid and equivalent to `f:ss`.
### 3.3 Expressions
```ebnf
expression ::= term+
term ::= COMMAND
| IDENT '(' arg_list? ')' (* function call *)
| IDENT (* 0-arg function call *)
| PARAM (* parameter reference *)
arg_list ::= argument (',' argument)*
argument ::= num_expr (* numeric expression *)
| cmd_expr (* command expression *)
cmd_expr ::= term+
```
**Important**: `IDENT` alone (e.g., `x`) is a **0-arg function call**, not a special "macro reference".
### 3.4 Numeric Expressions
```ebnf
num_atom ::= NUMBER
| PARAM (* must be Int-typed parameter *)
```
**Parsing Rule**: If an argument starts with `NUMBER` or `PARAM` followed by `+`/`-`, it is parsed as `num_expr`.
---
## 4. Semantics
### 4.1 Evaluation Order
1. Parse directives and build limit configuration
2. Parse all definitions and register in function table
3. Parse main expression (all non-definition terms)
4. Expand main expression, generating command sequence
### 4.2 Function Definitions (Unified Model)
All definitions are functions:
| `x:ss` | 0 | 0-arg function |
| `f():ss` | 0 | 0-arg function (explicit) |
| `f(X):XXX` | 1 | 1-arg function |
| `g(A,B):AB` | 2 | 2-arg function |
**Constraints**:
- Same-name collision is **forbidden** (one definition per identifier).
- There is no separate "macro" concept; `x:ss` is simply a 0-arg function.
### 4.3 Type System
#### 4.3.1 Parameter Types
Each parameter has exactly one type, determined at **definition time**:
| `CmdSeq` | Command sequence | Any sequence of s/r/l |
| `Int` | Integer value | -255..255 |
#### 4.3.2 Type Inference Rules (Definition Time)
Type is inferred by analyzing parameter usage in the function body:
| `PARAM` as term (e.g., `X` alone in body) | CmdSeq |
| `PARAM` in num_expr (e.g., `X-1`, `X+2`) | Int |
**Type Conflict**: If a parameter is used both ways → **E010** (definition-time error).
```
# Error: X used as both CmdSeq and Int
f(X):Xf(X-1) # E010: Type conflict for parameter 'X'
```
#### 4.3.3 Type Storage
```rust
struct FuncDef {
name: char,
params: Vec<char>,
param_types: HashMap<char, ParamType>, // Inferred at definition
body: Expr,
span: Span,
}
```
#### 4.3.4 Call-Site Type Checking
At call site, argument type must match parameter type:
| CmdSeq | Command expression (terms) |
| Int | Numeric expression |
Type mismatch → **E008**.
```
f(X):XX f(sr) # OK: X is CmdSeq, sr is CmdSeq
f(X):sf(X-1) f(3) # OK: X is Int, 3 is Int
f(X):XX f(3) # E008: X is CmdSeq, but 3 is Int
f(X):sf(X-1) f(sr)# E008: X is Int, but sr is CmdSeq
```
### 4.4 Function Call Semantics
#### 4.4.1 Arity Check
Argument count must match parameter count, with one exception:
```
f(X,Y):XY f(s) # E003: expects 2, got 1
f(X,Y):XY f(s,r,l)# E003: expects 2, got 3
```
**HOJ Compatibility - Empty Call Exception**:
When `f()` is called on a function with parameters, default values are bound:
- `CmdSeq` parameters → empty sequence
- `Int` parameters → 0 (triggers ≤0 termination)
```
a(X):X a() # OK: X=empty → returns empty
a(X):Xrra(sX) a() # OK: generates rr s rr ss rr sss rr ...
a(X):sa(X-1) a() # OK: X=0 → immediate termination → empty
```
This is a known HOJ technique for generating increasing sequences from empty.
#### 4.4.2 Numeric Termination
If ANY Int-typed parameter has value `<= 0`, the function call returns empty:
```
a(X):sa(X-1) a(4)
→ s a(3) → ss a(2) → sss a(1) → ssss a(0)
→ ssss (empty) # X=0 terminates
```
#### 4.4.3 Expansion
1. Check arity: mismatch → E003
2. Type-check each argument: mismatch → E008
3. Check numeric termination: any Int param ≤ 0 → return empty
4. Substitute parameters and expand body recursively
### 4.5 Numeric Expression Evaluation
```
1. Substitute all `PARAM` with their bound Int values
2. Evaluate left-to-right: `12-3+4` → `9+4` → `13`
3. Range check at each step: `-255 ≤ result ≤ 255`, otherwise E007
---
## 5. Execution Limits
### 5.1 Directives
| `MAX_STEP` | int | 1,000,000 | 1..10,000,000 |
| `MAX_DEPTH` | int | 100 | 1..10,000 |
| `ON_LIMIT` | enum | TRUNCATE | ERROR / TRUNCATE |
**Note**: `MAX_MEMORY` is reserved for future use (not implemented in v0.5.0).
### 5.2 ON_LIMIT Behavior
| `ERROR` | Return error (E004/E005) and stop |
| `TRUNCATE` | Return commands generated so far |
**Default**: `TRUNCATE` (for HOJ compatibility).
### 5.3 Step Counting
H2 counts generated commands (`s`, `r`, `l`) only.
---
## 6. Multi-Agent Extension
### 6.1 Agent Definition
```
0: x:ss xx # Agent 0
1: srl # Agent 1
2: f(X):XX f(s) # Agent 2
```
### 6.2 Single Agent Mode
When no `AGENT_ID:` prefix is present, the entire program is agent 0:
```
# These are equivalent:
0: f(X):XXXX f(sssr)
f(X):XXXX f(sssr)
```
### 6.3 Parallel Timeline
- Each agent executes independently
- Timestep `t`: each agent executes `cmd[agent][t]`
- If sequence exhausted: no-op
- Timeline length: `max(len(cmd[agent]))`
---
## 7. Error Codes
| E001 | Undefined function | 0-arg call to undefined function |
| E002 | Undefined function | N-arg call to undefined function |
| E003 | Argument count mismatch | `f(X,Y)` called with wrong arity |
| E004 | MAX_STEP exceeded | Step count > MAX_STEP |
| E005 | MAX_DEPTH exceeded | Recursion depth > MAX_DEPTH |
| E007 | Numeric out of range | Value outside -255..255 |
| E008 | Type error | CmdSeq/Int mismatch at call site |
| E009 | Invalid directive | Unknown directive name/value |
| E010 | Type conflict | Parameter used as both CmdSeq and Int |
**Note**: E006 (MAX_MEMORY exceeded) is reserved for future use.
---
## 8. Compatibility Notes
### 8.1 HOJ Compatibility
| Commands `s`, `r`, `l` | ✓ | ✓ |
| Function definitions | ✓ | ✓ |
| 0-arg functions | ✓ | ✓ |
| Numeric recursion | ✓ | ✓ |
| ≤0 termination | ✓ | ✓ |
| Complex num_expr | ✓ | ✓ |
| Fixed parameter types | ✓ | ✓ |
### 8.2 H2 Extensions
| Multi-agent | `0:`, `1:`, etc. |
| Directives | `MAX_STEP`, `MAX_DEPTH`, `ON_LIMIT` |
| Whitespace | Spaces allowed between tokens |
| Comments | `#` and `//` |
| Mixed structure | Definitions and expressions can be mixed |
### 8.3 Breaking Changes from v0.4.0
1. **AgentId requires immediate `:`**: `0 :` is no longer AgentId
2. **No `a()` special case**: Empty args always requires 0-param function
3. **E010 added**: Type conflict error for definition-time checking
4. **MAX_MEMORY removed**: Reserved for future implementation
---
## 9. Conformance Tests
See `tests/spec_conformance.rs` and `tests/hoj_conformance.rs`.
---
## 10. References
- [Herbert Online Judge](http://herbert.tealang.info/)
- [HOJ GitHub Repository](https://github.com/quolc/hoj)
---
## Appendix A: Migration from v0.4.0
### Breaking Changes
1. **AgentId lexing**: `0 :` is now `NUMBER SPACE COLON`, not `AGENT_ID`
2. **Arity check strict**: `f()` on function with params is E003 (no special case)
3. **Type conflict error**: E010 added for definition-time type conflicts
4. **MAX_MEMORY removed**: Directive parsing will reject it
### AST Changes
```rust
// v0.5.0 FuncDef with type information
struct FuncDef {
name: char,
params: Vec<char>,
param_types: HashMap<char, ParamType>, // NEW
body: Expr,
span: Span,
}
// v0.5.0 unified Definition (no Macro variant)
enum Definition {
Function(FuncDef),
}
// Expr::Ident removed, use FuncCallArgs with empty args
enum Expr {
Primitive(Primitive, Span),
Param(char, Span),
FuncCallArgs { name: char, args: Vec<Arg>, span: Span },
Sequence(Vec<Expr>),
}
```
### Error Handling
```rust
// New error
ExpandError::type_conflict(param: char, span: Span) -> Self {
// E010: Parameter 'X' used as both CmdSeq and Int
}
```
---
## Appendix B: Byte Count Specification (HOJ Golf Scoring)
### B.1 Overview
HOJ (Herbert Online Judge) uses a byte count metric for code golf scoring.
This specification defines how H2 Language counts bytes for HOJ compatibility.
### B.2 Counting Rules
| Lowercase letters (a-z) | 1 byte each | `s` → 1, `abc` → 3 |
| Uppercase letters (A-Z) | 1 byte each | `X` → 1, `XY` → 2 |
| Numeric literals | 1 byte per literal | `12` → 1, `3` → 1 |
| Colon `:` | 0 bytes | Not counted |
| Parentheses `( )` | 0 bytes | Not counted |
| Comma `,` | 0 bytes | Not counted |
| Plus/Minus `+ -` | 0 bytes | Not counted |
| Whitespace (space, tab) | 0 bytes | Not counted |
| Newline (`\n`, `\r\n`) | 0 bytes | Not counted |
| Comments (`#`, `//`) | 0 bytes | Entire comment ignored |
| Directives | 0 bytes | `MAX_STEP=100` not counted |
| Agent ID prefix | 0 bytes | `0:` not counted |
### B.3 Examples
```
Input: a:sa a
Count: a + s + a + a = 4 bytes
^ ^ ^ ^
def cmd def call
Input: f(X):sa(X-1) f(10)
Count: f + X + s + a + X + 1 + f + 10 = 8 bytes
^ ^ ^ ^ ^ ^ ^ ^^
def prm cmd call prm num call num
Input: 0: x:ss xrx
Count: x + s + s + x + r + x = 6 bytes
(agent id "0:" is not counted)
Input: f(X,Y):XYf(X-1,Y) f(3,sr)
Count: f + X + Y + X + Y + f + X + 1 + Y + f + 3 + s + r = 13 bytes
```
### B.4 Formal Definition
```
byte_count(source) = sum of:
- count(letter) for each letter in [a-zA-Z]
- count(number) for each numeric literal (contiguous digits = 1)
Excluded:
- All punctuation: : ( ) , + - = _
- All whitespace: space, tab, newline
- All comments: from # or // to end of line
- Directive lines: lines starting with directive names
- Agent ID prefix: NUMBER followed immediately by :
```
### B.5 API
#### Native Rust
```rust
use h2lang::count_bytes;
let bytes = count_bytes("a:sa a");
assert_eq!(bytes, 4);
```
#### WebAssembly (JavaScript)
```javascript
import { count_bytes } from 'h2lang';
const bytes = count_bytes("f(X):sa(X-1) f(10)");
console.log(bytes); // 8
```
### B.6 Implementation Notes
1. **Lexer-based counting**: Use the lexer to tokenize, then count relevant tokens.
2. **Number handling**: Each `NUMBER` token counts as 1, regardless of digit count.
3. **Comment stripping**: Comments are removed during lexing, not counted.
4. **Directive handling**: Directive lines are parsed separately and excluded.
5. **Agent ID**: The `AGENT_ID` token (number + colon) is not counted.