# Nightjar Language — Supplement
Comprehensive reference for contributors, maintainers, and AI coding agents.
This document is a superset of [README.md](README.md): it reproduces the formal
grammar, defines every operator's exact semantics (including edge cases),
describes the crate's internal architecture, catalogs every error code, and
provides step-by-step recipes for extending the language.
A new user of the library does **not** need to read this document to use
Nightjar. A contributor designing a new operator, a new quantifier, a new data
type, or a new parser subsystem **does**.
---
## Table of contents
1. [Design philosophy](#design-philosophy)
2. [Formal language specification](#formal-language-specification)
3. [Operator semantics](#operator-semantics)
4. [Symbol table and flattening](#symbol-table-and-flattening)
5. [Execution pipeline](#execution-pipeline)
6. [Public API reference](#public-api-reference)
7. [Error codes — full reference](#error-codes--full-reference)
8. [Architecture and module layout](#architecture-and-module-layout)
9. [Extending the language](#extending-the-language)
10. [Testing strategy](#testing-strategy)
11. [Design decisions and rationale](#design-decisions-and-rationale)
12. [Known limitations and deferred features](#known-limitations-and-deferred-features)
13. [License](#license)
---
## Design philosophy
Nightjar is built around five non-negotiable principles. Every design decision
below follows from one of them; when a change proposes to break one, treat it
as a substantial change that requires explicit justification.
1. **Correctness first.** Every syntactically valid expression reduces to a
well-defined result in `{ True, False, Error }`. There is no "undefined",
no "maybe", no silent fallback to a default value. Errors are first-class,
carry spans and codes, and are distinguishable from a well-formed `False`.
2. **Minimal surface.** The language is intentionally tiny. It has no
variables, no lambdas, no user-defined functions, no I/O, no modules,
no side effects, no state. Complexity belongs in the host application,
not in the DSL.
3. **Formal foundations.** Every operator has a precise mathematical
definition. Quantifiers mean what they mean in first-order logic;
verifiers are binary relations; connectives are the Boolean algebra.
The authoritative EBNF lives in-tree ([src/language/grammar.rs](src/language/grammar.rs))
and this document reproduces it verbatim.
4. **Strictness.** No implicit type coercion, with one carefully bounded
exception: `Int` promotes to `Float` when the other operand of an
arithmetic function or comparison verifier is a `Float`. Every other
cross-type combination is a `TypeError`. Missing symbols are
`SymbolNotFound`; they do not default to `Null`.
5. **Safety.** Nightjar is always embedded in someone else's application.
It must not crash the host on adversarial input. The parser enforces a
configurable depth limit; integer arithmetic is checked; list unrolling
is O(N) and the host is responsible for bounding the input.
Unicode is not a separate principle — it is a direct consequence of
correctness. Identifiers, string content, and map keys all go through
`char::is_alphanumeric`, which recognises Unicode letter and number
categories (L* and N*), so `.營收`, `.données.résultat`, and
`"紅嘴黑鵯"` are all first-class.
---
## Formal language specification
### EBNF grammar
This block is reproduced verbatim from
[src/language/grammar.rs:23-110](src/language/grammar.rs#L23-L110), which is the
authoritative copy. When editing the grammar, update both places in the same
commit; see the EBNF drift check in [Extending the language](#extending-the-language).
```ebnf
(* A program is a single expression that must reduce to Boolean. *)
program = bool_expr ;
bool_expr = bool_literal
| verifier_expr
| connective_expr
| not_expr
| unary_check_expr
| quantifier_expr ;
verifier_expr = "(" verifier_op value_expr value_expr ")" ;
verifier_op = "EQ" | "NE" | "LT" | "LE" | "GT" | "GE" ;
connective_expr = "(" connective_op bool_expr bool_expr ")" ;
connective_op = "AND" | "OR" ;
not_expr = "(" "NOT" bool_expr ")" ;
unary_check_expr = "(" "NonEmpty" value_expr ")" ;
(* Quantifiers: assert a predicate over a List-typed Entity. *)
quantifier_expr = "(" quantifier_op predicate value_expr ")" ;
quantifier_op = "ForAll" | "Exists" ;
(* Predicates are only legal inside quantifiers. *)
(* Partial vs. full predicate is disambiguated by operand count *)
(* at parse time, not by a syntactic marker: *)
(* (VerifierOp x) — partial verifier (1 operand) *)
(* (VerifierOp x y) — full bool_expr (2 operands) *)
(* NonEmpty (bare) — unary check *)
(* any other bool_expr — full bool_expr *)
(* The body of a full predicate may use the "@" element-rooted *)
(* symbol form to refer to the current iteration element. *)
predicate = partial_verifier | "NonEmpty" | bool_expr ;
partial_verifier = "(" verifier_op value_expr ")" ;
(* Value expressions produce an Entity. *)
value_expr = literal
| symbol
| func_expr ;
(* Arity is enforced at parse time from FuncOp::expected_arity(): *)
(* 1-ary: Neg, Abs, Length, Upper, Lower, *)
(* Head, Tail, Count, GetKeys, GetValues *)
(* 2-ary: Add, Sub, Mul, Div, Mod, Concat, Get *)
(* 3-ary: Substring *)
func_expr = "(" func_op value_expr { value_expr } ")" ;
func_op = arith_op | string_op | collection_op ;
arith_op = "Add" | "Sub" | "Mul" | "Div"
| "Mod" | "Neg" | "Abs" ;
string_op = "Concat" | "Length" | "Substring"
| "Upper" | "Lower" ;
collection_op = "Head" | "Tail" | "Get" | "Count"
| "GetKeys" | "GetValues" ;
(* Terminals. *)
literal = int_literal | float_literal
| string_literal | bool_literal | null_literal ;
(* A leading "-" is part of the numeric literal *)
int_literal = [ "-" ] digit { digit } ;
float_literal = [ "-" ] digit { digit } "." digit { digit } ;
string_literal = '"' { any_char } '"' ;
bool_literal = "True" | "False" ;
null_literal = "Null" ;
(* Symbols have two namespaces: *)
(* "." — root-rooted (resolved against the whole input). *)
(* "@" — element-rooted (resolved against the current iteration *)
(* element of the nearest enclosing ForAll/Exists). *)
(* Bare "." is the whole input; bare "@" is the current element. *)
(* "@" is only legal inside a quantifier predicate; the parser *)
(* rejects it elsewhere with a ParseError. *)
(* Segment characters are Unicode-aware: *)
(* char::is_alphanumeric() covers Unicode categories L* and N*, *)
(* so keys like ".營收" and ".données.résultat" are valid. *)
segment = ident_start { ident_char } ;
ident_start = unicode_letter | "_" ;
ident_char = unicode_letter | unicode_digit | "_" ;
```
### Lexical rules
- **Whitespace-insensitive.** Spaces, tabs, and newlines only separate tokens.
`(GT 1 2)`, `(GT\t1\t2)`, and `( GT\n 1\n 2\n)` all parse identically.
- **No comments.** Nightjar expressions have no comment syntax. Comments and
documentation live in the host program, not in the rule text.
- **Numeric literals.** A `-` immediately followed by a digit is part of the
number: `-5` and `-3.14` are single tokens. `- 5` (with a space) is a
`ParseError` because `-` is not a standalone token.
- **String literals.** Double-quoted. There are no escape sequences defined;
every character between the opening and closing quotes is literal, including
any whitespace and any Unicode scalar. An unterminated string (missing
closing quote before EOF) is a `ParseError`.
- **Keywords vs identifiers.** Operator names (`EQ`, `Add`, `ForAll`, …),
boolean literals (`True`, `False`), and `Null` are keywords. They are
case-sensitive: `true` is not a literal, `add` is not an operator. Keywords
never collide with symbols, because every symbol starts with `.` or `@` and
no keyword does.
- **Symbol segments.** A segment starts with `ident_start` (Unicode letter or
`_`) and continues with `ident_char` (Unicode letter or digit or `_`). This
means `.1x` is a `ParseError` but `._1` is a valid list-index segment.
Segments are joined by literal `.` characters.
- **Symbol sigils.** `.` starts a root-rooted symbol; `@` starts an
element-rooted symbol. Bare `.` and bare `@` (with no segments) refer to the
whole payload and the whole current element respectively.
### Data types
The seven runtime types are defined by the `Entity` enum in
[src/context/entity.rs:53-61](src/context/entity.rs#L53-L61):
| Variant | Rust payload | Literal examples | Emptiness |
|---|---|---|---|
| `Entity::Int` | `i64` | `42`, `-7` | never empty |
| `Entity::Float` | `f64` | `3.14`, `-0.5` | never empty |
| `Entity::String` | `String` | `"hello"`, `"營收"` | empty iff `""` |
| `Entity::Bool` | `bool` | `True`, `False` | never empty |
| `Entity::List` | `Vec<Entity>` | (from host data) | empty iff `[]` |
| `Entity::Map` | `HashMap<String, Entity>` | (from host data) | empty iff `{}` |
| `Entity::Null` | (unit) | `Null` | always empty |
`Entity::type_tag()` projects to a `TypeTag` enum, used throughout the
runtime for type checks and error messages.
### Type coercion
The only implicit coercion in the language is **Int → Float auto-promotion**,
and it applies in exactly two places:
- **Arithmetic functions** `Add`, `Sub`, `Mul`, `Div`, `Mod`: if either
operand is `Float`, the other (if `Int`) is promoted, and the result is
`Float`. Two `Int` operands give an `Int` result; two `Float` operands give a
`Float` result.
- **Comparison verifiers** `EQ`, `NE`, `LT`, `LE`, `GT`, `GE`: when one side
is `Int` and the other is `Float`, the `Int` is promoted before comparison.
Every other type mismatch is a `TypeError` (E002). `(Add 1 "abc")`,
`(GT "a" 1)`, `(Concat 1 2)`, `(Head 42)` are all errors.
`Null` is never silently converted. A `Null` operand to an arithmetic op is
a `TypeError`, a `Null` operand to `NonEmpty` always yields `False`, and a
missing key raises `SymbolNotFound` rather than resolving to `Null`.
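The promotion rule can be pictured as a small pairing step that runs before the actual operation. The sketch below is illustrative only: `Num`, `Promoted`, and `promote` are hypothetical names, not types from the crate, and the `as f64` cast is an assumption about how the promotion is performed (it loses precision for integers beyond 2^53).

```rust
/// Hypothetical stand-in for the numeric variants of `Entity`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Num {
    Int(i64),
    Float(f64),
}

/// The two shapes an operand pair can take once promotion has run.
#[derive(Debug, PartialEq)]
enum Promoted {
    Ints(i64, i64),
    Floats(f64, f64),
}

/// Promote only when the two sides disagree; Int/Int stays integral.
fn promote(a: Num, b: Num) -> Promoted {
    match (a, b) {
        (Num::Int(x), Num::Int(y)) => Promoted::Ints(x, y),
        (Num::Int(x), Num::Float(y)) => Promoted::Floats(x as f64, y),
        (Num::Float(x), Num::Int(y)) => Promoted::Floats(x, y as f64),
        (Num::Float(x), Num::Float(y)) => Promoted::Floats(x, y),
    }
}
```

Any operand that is not `Int` or `Float` never reaches this step; it is rejected as a `TypeError` first.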
---
## Operator semantics
Every operator below is listed with arity, input types, output type, and
every edge case worth documenting. Operators are grouped by family, matching
the AST enums in [src/language/grammar.rs](src/language/grammar.rs).
### Verifiers — `EQ NE LT LE GT GE`
Binary, two value expressions → `Bool`. Implemented in
[src/context/verifier.rs](src/context/verifier.rs).
- **Equality (`EQ`, `NE`) on `Float`** uses **epsilon-based comparison**:
`EQ(a, b) ⇔ |a − b| < ε`, where `ε = ExecOptions::float_epsilon`
(default `1e-10`). This is what makes `(EQ (Add 0.1 0.2) 0.3)` evaluate to
`True`, despite IEEE 754 representation error. `NE` is the negation.
- **Ordering verifiers (`LT`, `LE`, `GT`, `GE`) on `Float`** use standard
IEEE 754 comparison (`partial_cmp`). Epsilon does not apply.
- **NaN** — any comparison involving NaN (EQ, NE, LT, LE, GT, GE) returns
`false`. This matches Rust's `partial_cmp` semantics and IEEE 754.
Specifically, `(EQ NaN NaN)` is `False` (because
`|NaN − NaN|` is `NaN`, not `< ε`).
- **Int ↔ Float promotion** applies for mixed-type compares.
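- **String equality** is exact byte equality, which for valid UTF-8 is the
same as comparing Unicode scalar values one by one. No normalisation is
applied, so canonically equivalent strings with different encodings (for
example precomposed `é` versus `e` plus a combining accent) compare unequal.
- **Bool equality** compares the two truth values: `(EQ True True)` and
`(EQ False False)` are `True`; `(EQ True False)` is `False`.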
- **Cross-type comparisons** (e.g. `(GT "a" 1)`, `(EQ .list .int)`) are a
`TypeError`.
- **Null equality** — `(EQ Null Null)` is `True`. `(EQ Null anything_else)`
is a `TypeError`: we deliberately do not let `Null` silently equal scalars.
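A minimal sketch of the two float comparison modes described above, assuming the epsilon is passed in explicitly. The function names are hypothetical; the crate's own dispatch lives in `apply_verifier`.

```rust
use std::cmp::Ordering;

/// Epsilon-based equality: |a - b| < eps.
/// With a NaN operand the difference is NaN, and `NaN < eps` is false.
fn float_eq(a: f64, b: f64, eps: f64) -> bool {
    (a - b).abs() < eps
}

/// Ordering uses plain IEEE 754 comparison; epsilon plays no part.
/// `partial_cmp` returns `None` when either side is NaN.
fn float_lt(a: f64, b: f64) -> bool {
    matches!(a.partial_cmp(&b), Some(Ordering::Less))
}
```

One consequence of mixing the two modes is that two floats closer together than epsilon can satisfy both `EQ` and `LT` at once, because `LT` does not consult the epsilon.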
### Unary check — `NonEmpty`
Unary, one value → `Bool`. Returns the result of `Entity::is_non_empty()`
([src/context/entity.rs:81-89](src/context/entity.rs#L81-L89)):
| Operand | Result |
|---|---|
| `Int`, `Float`, `Bool` (any value) | `True` |
| `String ""` | `False` |
| `String "anything else"` | `True` |
| `List []` | `False` |
| `List [ … ]` | `True` |
| `Map {}` | `False` |
| `Map { … }` | `True` |
| `Null` | `False` |
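The table above translates directly into a single `match`. The sketch below uses a local mirror of the seven variants so that it compiles on its own; it is not the crate's `Entity`, and the free function `is_non_empty` is a stand-in for the method of the same name.

```rust
use std::collections::HashMap;

/// Hypothetical mirror of the seven `Entity` variants, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Entity {
    Int(i64),
    Float(f64),
    String(String),
    Bool(bool),
    List(Vec<Entity>),
    Map(HashMap<String, Entity>),
    Null,
}

/// Emptiness rule: scalars are never empty, containers and strings are
/// empty when they hold nothing, and `Null` is always empty.
fn is_non_empty(e: &Entity) -> bool {
    match e {
        Entity::Int(_) | Entity::Float(_) | Entity::Bool(_) => true,
        Entity::String(s) => !s.is_empty(),
        Entity::List(xs) => !xs.is_empty(),
        Entity::Map(m) => !m.is_empty(),
        Entity::Null => false,
    }
}
```

Note that the check looks only at the outermost value: a list containing a single `Null` is non-empty, and `Bool(false)` is non-empty too.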
### Connectives — `AND OR NOT`
`AND` and `OR` are binary boolean-in, boolean-out; `NOT` is unary.
Implemented in [src/context/connective.rs](src/context/connective.rs).
- **No short-circuit evaluation.** Both operands of `AND`/`OR` are always
evaluated. If either operand produces an `Error`, the whole expression is an
`Error`, regardless of whether the other operand alone would have decided the
result. This keeps error behaviour deterministic: an error in a rule is never
masked by a short-circuit.
- **Adding short-circuit later is compatible** with the API shape, but would
change the observable error behaviour. If it is ever added, it must be
opt-in (e.g. via `ExecOptions`) so existing rules keep their diagnostic
behaviour.
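To make the difference from ordinary `&&` concrete, here is a sketch of a strict conjunction over fallible operands. `and_strict` is a hypothetical helper, not the crate's `apply_and`, and reporting the left error first when both sides fail is a choice made for this sketch only.

```rust
/// Evaluate both operands before combining them, so an error on either
/// side is reported even when the other side alone would decide the result.
fn and_strict<E>(
    left: impl FnOnce() -> Result<bool, E>,
    right: impl FnOnce() -> Result<bool, E>,
) -> Result<bool, E> {
    let l = left();
    let r = right(); // always runs, even if `l` is Ok(false) or Err(..)
    // Unwrap both results before applying `&&`; writing `l? && r?` would
    // skip `r?` whenever the left side is false and hide its error.
    let (l, r) = (l?, r?);
    Ok(l && r)
}
```

With Rust's built-in `&&`, a `false` left operand would stop evaluation and a failing right operand would go unnoticed; here it surfaces as `Err`.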
### Quantifiers — `ForAll Exists`
`(QuantifierOp predicate operand)`. Implemented in
[src/context/quantifier.rs](src/context/quantifier.rs).
- **Predicate forms.** Three shapes are accepted, disambiguated at parse time
by operand count:
  - `NonEmpty` (bare) — unary check, applied to the element.
  - `(VerifierOp x)` — *partial verifier*: the bound value `x` is the second
    operand of the verifier; the element fills the first. So
    `(ForAll (GT 0) xs)` means "∀e ∈ xs. e > 0".
  - Any other `bool_expr` — *full predicate*: re-evaluated once per element
    with the element bound as `@` in scope. The body can use `@`, `@.field`,
    `@._i`, etc.
- **Operand is a `List` or a scalar, never a `Map`.** Passing a `Map` is a
`TypeError`: quantifiers iterate over ordered sequences. For Maps, convert
explicitly with `GetKeys` or `GetValues`: `(ForAll (GT 0) (GetValues .m))`.
- **Scalar fallback.** If the operand is a scalar (`Int`, `Float`, `String`,
`Bool`, `Null`), the quantifier reduces to a single predicate application
on that scalar. So `(ForAll (GT 0) 5)` is `True` and `(Exists (EQ 2) 10)` is
`False`. This is intentional: it lets callers treat "one value" and "many
values" uniformly.
- **Empty list.** `(ForAll p [])` is `True` (vacuously true);
`(Exists p [])` is `False` (no witness exists).
- **Nested quantifiers.** `@` always refers to the **innermost** enclosing
element. Outer elements are accessible only through root-rooted paths
(e.g. `.outer.inner.field`). This is lexical, innermost-wins scoping.
- **`@` outside a quantifier predicate** is a `ScopeError` (E010), caught by
`validate_scope` during post-parse static analysis (see
[Execution pipeline](#execution-pipeline)).
- **Evaluation strategy.** Partial verifiers and `NonEmpty` use
`apply_quantifier`, which resolves the bound operand once and applies the
predicate per element. Full predicates use `apply_quantifier_full`, which
takes a closure that invokes `eval_bool` per element with the element
bound in the `scope` parameter. Full predicates therefore re-evaluate
their body N times for N elements.
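The list, empty-list, and scalar-fallback rules can be condensed into one loop. The sketch below is a simplified model with hypothetical names (`Value`, `Quantifier`, `quantify`); it is not `apply_quantifier`. It visits every element and stops only on an error, which is an assumption of the sketch rather than documented behaviour.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    List(Vec<Value>),
}

#[derive(Debug, Clone, Copy)]
enum Quantifier {
    ForAll,
    Exists,
}

/// Apply `pred` to every element of a list, or once to a scalar operand.
fn quantify<E>(
    q: Quantifier,
    operand: &Value,
    mut pred: impl FnMut(&Value) -> Result<bool, E>,
) -> Result<bool, E> {
    // Scalar fallback: treat a non-list operand as a one-element sequence.
    let elements: &[Value] = match operand {
        Value::List(xs) => xs.as_slice(),
        scalar => std::slice::from_ref(scalar),
    };
    // ForAll starts at true (vacuous truth), Exists at false (no witness).
    let mut acc = matches!(q, Quantifier::ForAll);
    for e in elements {
        let hit = pred(e)?;
        acc = match q {
            Quantifier::ForAll => acc && hit,
            Quantifier::Exists => acc || hit,
        };
    }
    Ok(acc)
}
```

A partial verifier such as `(GT 0)` corresponds to a closure that compares each element against the pre-resolved bound value; a full predicate corresponds to a closure that re-runs the whole body with the element in scope.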
### Arithmetic — `Add Sub Mul Div Mod Neg Abs`
Implemented in [src/context/function.rs](src/context/function.rs).
- **Input types.** `Int` or `Float`. Anything else is `TypeError`.
- **Int + Int → Int** using `checked_add`, `checked_sub`, `checked_mul`,
`checked_div`, `checked_rem`, `checked_neg`. Overflow → `IntegerOverflow`
(E009). In particular `Abs(i64::MIN)` and `Neg(i64::MIN)` are overflow.
- **Mixed Int/Float** → Int is promoted, result is `Float`.
- **Float + Float → Float** using native IEEE operations. No overflow error;
inputs that would overflow return `inf`/`-inf`, and NaN arithmetic
propagates in the usual IEEE way.
- **Integer division truncates.** `(Div 7 2)` is `Int(3)`. For real division,
promote explicitly: `(Div 7 2.0)` is `Float(3.5)`.
- **Division/modulo by zero** — both `Int 0` and `Float 0.0` divisors produce
`DivisionByZero` (E006). Nightjar does not produce `inf` or NaN from
`1.0 / 0.0`; we raise an error for consistency with integer semantics.
- **`Mod` works on floats.** `(Mod 3.5 1.5)` is `Float(0.5)` via Rust's `%`.
- **`Neg`, `Abs`** are unary; every other arithmetic op is binary.
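The integer and float rules above map onto the standard library's `checked_*` methods plus an explicit zero check. The following is a sketch with a hypothetical `ArithError` type standing in for the crate's error variants; only a few representative operations are shown.

```rust
#[derive(Debug, PartialEq)]
enum ArithError {
    DivisionByZero,  // plays the role of E006
    IntegerOverflow, // plays the role of E009
}

fn int_add(a: i64, b: i64) -> Result<i64, ArithError> {
    a.checked_add(b).ok_or(ArithError::IntegerOverflow)
}

fn int_div(a: i64, b: i64) -> Result<i64, ArithError> {
    if b == 0 {
        return Err(ArithError::DivisionByZero);
    }
    // With zero ruled out, the only remaining failure is i64::MIN / -1.
    a.checked_div(b).ok_or(ArithError::IntegerOverflow)
}

fn int_neg(a: i64) -> Result<i64, ArithError> {
    a.checked_neg().ok_or(ArithError::IntegerOverflow)
}

fn float_div(a: f64, b: f64) -> Result<f64, ArithError> {
    // 0.0 and -0.0 both compare equal to 0.0, so both are rejected.
    if b == 0.0 {
        return Err(ArithError::DivisionByZero);
    }
    Ok(a / b)
}
```

Integer division truncates toward zero, so `int_div(-7, 2)` yields `-3`, not `-4`.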
### String — `Concat Length Substring Upper Lower`
- **`Concat`** (2-ary, `String × String → String`).
- **`Length`** (1-ary, `String → Int`). **Counts Unicode scalar values**, not
bytes. `(Length "abc")` is `3`; `(Length "營收")` is `2`. This is what
`Substring` indexes into — the two are consistent.
- **`Substring`** (3-ary, `String × Int × Int → String`). `(Substring s start
len)` returns `len` characters starting at character index `start` (0-based,
char-indexed). Going off the end of the string is an error; see
[src/context/function.rs](src/context/function.rs) for the exact bounds.
- **`Upper`, `Lower`** (1-ary) — Unicode-aware case mapping via Rust's
`to_uppercase`/`to_lowercase`. Characters without a case variant pass
through unchanged.
- Any non-String argument is a `TypeError`.
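Character-indexed `Length` and `Substring` can be sketched with `str::chars`. The function names are hypothetical, and the bounds check shown here (reject any range that runs past the end) is an assumption; the crate's exact rule is in `src/context/function.rs`.

```rust
/// Length in Unicode scalar values, not bytes.
fn length(s: &str) -> usize {
    s.chars().count()
}

/// `len` characters starting at character index `start` (0-based).
/// Returns `None` when the requested range runs past the end of the string.
fn substring(s: &str, start: usize, len: usize) -> Option<String> {
    let total = s.chars().count();
    let end = start.checked_add(len)?;
    if end > total {
        return None;
    }
    Some(s.chars().skip(start).take(len).collect())
}
```

Because both functions count scalar values, an index obtained from `length` is always safe to feed back into `substring`, which would not hold if one counted bytes and the other characters.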
### Collection — `Head Tail Get Count GetKeys GetValues`
- **`Head`** (1-ary) — first element of a list. Empty list → `IndexError`
(E008). Non-list input → `TypeError`.
- **`Tail`** (1-ary) — list of all but the first element. Empty list →
`IndexError`. Non-list input → `TypeError`.
- **`Get`** (2-ary) — polymorphic index:
  - `(Get list Int)` returns the element at that 0-based index. Out of range
    → `IndexError`. Negative indices are not supported.
  - `(Get map String)` returns the value at that key. Missing key →
    `SymbolNotFound` with a message scoped to `Get`.
  - Any other combination is a `TypeError`.
- **`Count`** (1-ary) — length of a `List` or size of a `Map`. Non-container
input is a `TypeError`.
- **`GetKeys`** (1-ary) — `Map → List<String>`, sorted by key for
determinism. Non-map input is a `TypeError`.
- **`GetValues`** (1-ary) — `Map → List<Entity>`, values sorted by key
(same ordering as `GetKeys`). Non-map input is a `TypeError`.
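Since `HashMap` iteration order is unspecified, the determinism of `GetKeys` and `GetValues` has to come from an explicit sort. A minimal sketch, with hypothetical generic helpers and assuming plain `String` ordering (byte-wise lexicographic) as the sort key:

```rust
use std::collections::HashMap;

/// Keys in sorted order, so the result does not depend on hash order.
fn get_keys<V>(m: &HashMap<String, V>) -> Vec<String> {
    let mut keys: Vec<String> = m.keys().cloned().collect();
    keys.sort();
    keys
}

/// Values in the same order as `get_keys`, so the two lists line up.
fn get_values<V: Clone>(m: &HashMap<String, V>) -> Vec<V> {
    get_keys(m).iter().map(|k| m[k].clone()).collect()
}
```

Sorting values by key rather than by value is what keeps position `i` of one list paired with position `i` of the other.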
---
## Symbol table and flattening
Root-rooted (`.`) symbols are resolved against a **flattened symbol table**
built once per evaluation. The construction is in
[src/symbol_table.rs](src/symbol_table.rs).
### Flattening rules
Starting from the root `Entity`, every nested path is registered with its
fully qualified dotted key:
- The root itself is registered under `"."`.
- Each `Map` child is registered under `{parent}.{key}`.
- Each `List` element is registered under `{parent}._{i}` with `i` the
**0-based** index.
- Recursion continues into nested maps and lists.
- Scalars and `Null` are registered at their current prefix; they are not
descended into.
### Worked example
```json
{
  "ids": [10, 20, 30],
  "meta": {"name": "x"}
}
```
Flattens to (all entries live in the same `HashMap<String, Entity>`):
| Key | Value |
|---|---|
| `.` | the whole root `Map` |
| `.ids` | `List [10, 20, 30]` |
| `.ids._0` | `Int 10` |
| `.ids._1` | `Int 20` |
| `.ids._2` | `Int 30` |
| `.meta` | `Map { name: "x" }` |
| `.meta.name` | `String "x"` |
Nested containers chain naturally: `{m: [[1,2],[3,4]]}` produces `.m._0._0 =
1`, `.m._1._1 = 4`, etc.
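The flattening rules can be sketched as a short recursive function. Everything here is illustrative: the reduced `Entity` enum, `child_key`, and `flatten` are hypothetical names, and cloning every sub-tree into the table is a simplification that may not match how the crate stores entries.

```rust
use std::collections::HashMap;

/// Reduced, hypothetical version of `Entity`; enough for the worked example.
#[derive(Debug, Clone, PartialEq)]
enum Entity {
    Int(i64),
    String(String),
    List(Vec<Entity>),
    Map(HashMap<String, Entity>),
    Null,
}

/// Join a parent key and a child segment; the root key "." needs no extra dot.
fn child_key(parent: &str, segment: &str) -> String {
    if parent == "." {
        format!(".{segment}")
    } else {
        format!("{parent}.{segment}")
    }
}

/// Register `entity` under `key`, then recurse into containers.
fn flatten(key: &str, entity: &Entity, out: &mut HashMap<String, Entity>) {
    // Containers are registered too, not only their leaves.
    out.insert(key.to_string(), entity.clone());
    match entity {
        Entity::Map(m) => {
            for (k, v) in m {
                flatten(&child_key(key, k), v, out);
            }
        }
        Entity::List(xs) => {
            for (i, v) in xs.iter().enumerate() {
                flatten(&child_key(key, &format!("_{i}")), v, out);
            }
        }
        _ => {} // scalars and Null are leaves
    }
}
```

Calling `flatten(".", &root, &mut table)` on the worked example produces exactly the seven keys listed in the table.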
### Resolution
- **Root-rooted (`.path`).** `HashMap::get` — O(1) amortised. Missing path
→ `SymbolNotFound`.
- **Element-rooted (`@path`).** Resolved by `resolve_in_entity` in
[src/symbol_table.rs](src/symbol_table.rs): walks the `path` directly
against the current element `Entity`. No flattening involved — cost is
O(path length), and there's no extra allocation of a per-element table.
`_N` segments are still list-index segments with the same 0-based convention.
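For comparison, an element-rooted lookup needs no table at all: it walks the path segment by segment. The sketch below uses a reduced, hypothetical `Entity` and a hypothetical `resolve_in` that returns `Option` instead of the crate's error type.

```rust
use std::collections::HashMap;

/// Reduced, hypothetical version of `Entity`.
#[derive(Debug, Clone, PartialEq)]
enum Entity {
    Int(i64),
    List(Vec<Entity>),
    Map(HashMap<String, Entity>),
}

/// Walk a dotted path such as "_1.price" against `root`.
/// An empty path (bare "@") returns the element itself.
fn resolve_in<'a>(root: &'a Entity, path: &str) -> Option<&'a Entity> {
    let mut current = root;
    for segment in path.split('.').filter(|s| !s.is_empty()) {
        current = match current {
            Entity::Map(m) => m.get(segment)?,
            Entity::List(xs) => {
                // List segments look like "_N" with a 0-based index.
                let index: usize = segment.strip_prefix('_')?.parse().ok()?;
                xs.get(index)?
            }
            _ => return None, // cannot descend into a scalar
        };
    }
    Some(current)
}
```

The walker borrows from the element rather than copying it, which is why the per-element cost stays proportional to the path length.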
### Invariants to preserve
Anything that touches the symbol table must preserve these invariants, or
quantifiers and lookups will silently disagree:
1. The flattening convention (`.` for maps, `._N` for lists, 0-based) must
match `resolve_in_entity`'s walking convention.
2. Intermediate containers must be registered (not only leaves), so
`(NonEmpty .data)` works on the container as a whole.
3. `HashMap` is allowed to iterate in arbitrary order internally, but any
operator that exposes ordering to the user (today: `GetKeys`, `GetValues`)
must sort.
---
## Execution pipeline
Nightjar is strictly two-phase. The entry points in
[src/executor.rs](src/executor.rs) drive both phases, but they are cleanly
separable — `parse` / `parse_with_config` give you Phase 1 alone.
```
source string ──► tokens ──► AST (Spanned<…>) ──► ExecResult
               │          │                    │
               │          │                    └── Phase 2: symbol table + scope
               │          └── Phase 1b: parser + validate_scope
               └── Phase 1a: tokenizer
```
### Phase 1a — Tokenizer
Located in [src/language/parser.rs](src/language/parser.rs). Walks the source
with `char_indices` so all byte offsets land on character boundaries
(UTF-8-safe). Produces `Spanned<Token>` values. Highlights:
- **Negative literals.** `-5` and `-3.14` are single tokens when the `-` is
immediately followed by a digit. `- 5` (with a space between) is a
`ParseError` because `-` is not a standalone token.
- **Strings.** No escape sequences. An unterminated string literal
(`"abc` with EOF before the closing quote) is a `ParseError` with a span
pointing at the opening quote.
- **Keywords.** Case-sensitive. The tokenizer has an explicit keyword table
for operator names and reserved literals.
- **Symbols.** `.` and `@` sigils with dot-separated segments. Segment
characters are validated against `char::is_alphanumeric` (Unicode L* and
N* categories) plus `_`.
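The negative-literal rule is easy to get subtly wrong, so here is a sketch of it for integer literals only. `scan_int` is a hypothetical helper, not the crate's tokenizer, and treating only ASCII digits as `digit` is an assumption. It uses `char_indices` so that the returned offset is a valid byte boundary.

```rust
/// Try to read an integer literal starting at byte offset `start`.
/// Returns the literal text and the byte offset just past it.
fn scan_int(src: &str, start: usize) -> Option<(&str, usize)> {
    // `get` returns None if `start` is out of range or splits a character.
    let rest = src.get(start..)?;
    let mut end = 0;
    for (i, c) in rest.char_indices() {
        let is_sign = i == 0 && c == '-';
        if is_sign || c.is_ascii_digit() {
            end = i + c.len_utf8();
        } else {
            break;
        }
    }
    let text = &rest[..end];
    // A lone "-" is not a token: at least one digit must follow the sign.
    if text.is_empty() || text == "-" {
        return None;
    }
    Some((text, start + end))
}
```

A caller would turn the `None` case for a leading `-` into a `ParseError` whose span starts at `start`.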
### Phase 1b — Parser
Recursive-descent over the token stream. Key properties:
- Per-operator arity is enforced at parse time using `FuncOp::expected_arity`
([src/language/grammar.rs](src/language/grammar.rs)), so `(Add 1)` and
`(Substring "a" 0)` are caught before any evaluation.
- Depth tracking uses `ParserConfig::max_depth` (default 256). Exceeding it
produces `RecursionError` (E007). The default is tunable via
`ExecOptions::max_depth` → `ParserConfig::max_depth`.
- Every AST node is wrapped in `Spanned<T>` carrying the span of the
originating tokens, so runtime errors can point back into the source
string.
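The depth guard amounts to threading a counter through the recursive calls and bailing out before descending further. The toy parser below handles only `True`, `False`, and `(NOT …)` over pre-split tokens; all names are hypothetical, and whether the real limit is checked with `>` or `>=` is an assumption of this sketch.

```rust
#[derive(Debug, PartialEq)]
enum ParseFailure {
    Recursion { limit: usize }, // plays the role of E007
    Syntax,
}

/// Parse `True`, `False`, or `( NOT <expr> )` from a token slice, returning
/// the evaluated value and the number of tokens consumed.
fn parse_expr(
    tokens: &[&str],
    depth: usize,
    max_depth: usize,
) -> Result<(bool, usize), ParseFailure> {
    // Refuse to recurse past the configured limit.
    if depth > max_depth {
        return Err(ParseFailure::Recursion { limit: max_depth });
    }
    match tokens {
        ["True", ..] => Ok((true, 1)),
        ["False", ..] => Ok((false, 1)),
        ["(", "NOT", rest @ ..] => {
            let (inner, used) = parse_expr(rest, depth + 1, max_depth)?;
            if rest.get(used) != Some(&")") {
                return Err(ParseFailure::Syntax);
            }
            // "(" + "NOT" + inner tokens + ")"
            Ok((!inner, used + 3))
        }
        _ => Err(ParseFailure::Syntax),
    }
}
```

Because the check happens before any further recursion, adversarially deep input fails with a structured error instead of exhausting the host's stack.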
### Phase 1c — Scope validator
`validate_scope` ([src/language/parser.rs](src/language/parser.rs)) is a
post-parse AST walk that tracks an integer *predicate depth* counter.
- Entering the predicate position of a `Quantifier` increments the counter.
- Leaving it decrements.
- The quantifier's *operand* position stays at the current depth.
- Encountering an `@` symbol with counter `== 0` raises `ScopeError` (E010).
This catches `(EQ @.a 1)` at the top level, or `(AND (ForAll … .xs) (EQ @.a 1))`
where the second `@` is outside any predicate.
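The predicate-depth walk can be sketched on a heavily reduced AST. `Node` and `scope_ok` are hypothetical; passing `depth` as an argument is equivalent to the increment/decrement counter described above, because the value is restored automatically when the recursive call returns.

```rust
/// Hypothetical, heavily reduced AST: just enough shape to show the walk.
#[derive(Debug)]
enum Node {
    RootSymbol,     // ".path"
    ElementSymbol,  // "@path"
    Op(Vec<Node>),  // any verifier, connective, or function call
    Quantifier { predicate: Box<Node>, operand: Box<Node> },
}

/// Returns false if any `@` symbol occurs at predicate depth 0.
fn scope_ok(node: &Node, depth: usize) -> bool {
    match node {
        Node::RootSymbol => true,
        Node::ElementSymbol => depth > 0,
        Node::Op(children) => children.iter().all(|c| scope_ok(c, depth)),
        Node::Quantifier { predicate, operand } => {
            // The predicate is one level deeper; the operand is not.
            scope_ok(predicate, depth + 1) && scope_ok(operand, depth)
        }
    }
}
```

A real validator would return the offending node's span so the `ScopeError` can point into the source; the boolean result here keeps the sketch short.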
### Phase 2 — Executor
[src/executor.rs](src/executor.rs) drives evaluation through two mutually
recursive functions:
- `eval_bool(expr, symbols, opts, scope)` — evaluates a `SpannedBoolExpr` to
`Result<bool, NightjarLanguageError>`. Dispatches on the `BoolExpr` variant.
- `eval_value(expr, symbols, opts, scope)` — evaluates a `SpannedValueExpr` to
`Result<Entity, …>`. Dispatches on `ValueExpr`.
The `scope` parameter is `Option<&Entity>`: the current iteration element
bound inside a quantifier predicate, or `None` at the top level.
Element-rooted (`@`) symbol resolution reads from `scope`; a `None` `scope`
combined with an `@` symbol is a defensive `ScopeError` (in practice
`validate_scope` catches this first).
The quantifier arm branches on predicate kind:
- **Partial verifier / `NonEmpty`** → `resolve_predicate` pre-evaluates the
bound operand once, then calls
`quantifier::apply_quantifier(op, &EvalPredicate, &operand, epsilon, span)`.
- **Full predicate** → calls
`quantifier::apply_quantifier_full(op, &operand, span, closure)` where
`closure: &Entity → Result<bool, …>` invokes `eval_bool` with the element
bound in `scope`. Full predicates re-evaluate their body per element, which
is how `@` inside the body resolves.
Top-level evaluation always starts with `scope = None`.
---
## Public API reference
All of the following are re-exported from the crate root
([src/lib.rs](src/lib.rs)). Consumers should `use nightjar_lang::{…}`.
### Parser
```rust
pub fn parse(input: &str) -> Result<Program, NightjarLanguageError>;
pub fn parse_with_config(
    input: &str,
    config: &ParserConfig,
) -> Result<Program, NightjarLanguageError>;
pub struct ParserConfig {
    pub max_depth: usize, // default 256
}
```
`parse` is a convenience wrapper around `parse_with_config` using the default
`ParserConfig`. Both return a `Program` whose top-level expression is a
`SpannedBoolExpr`.
### AST
```rust
pub struct Program { pub expr: SpannedBoolExpr }
pub struct Spanned<T> { pub node: T, pub span: Span }
pub type SpannedBoolExpr = Spanned<BoolExpr>;
pub type SpannedValueExpr = Spanned<ValueExpr>;
pub enum BoolExpr {
    Literal(bool),
    Verifier { op: VerifierOp, left: Box<SpannedValueExpr>,
               right: Box<SpannedValueExpr> },
    And(Box<SpannedBoolExpr>, Box<SpannedBoolExpr>),
    Or (Box<SpannedBoolExpr>, Box<SpannedBoolExpr>),
    Not(Box<SpannedBoolExpr>),
    UnaryCheck { op: UnaryCheckOp, operand: Box<SpannedValueExpr> },
    Quantifier { op: QuantifierOp,
                 predicate: Spanned<Predicate>,
                 operand: Box<SpannedValueExpr> },
}
pub enum ValueExpr {
    Literal(Literal),
    Symbol { root: SymbolRoot, path: String },
    FuncCall { op: FuncOp, args: Vec<SpannedValueExpr> },
}
pub enum Predicate {
    PartialVerifier { op: VerifierOp, bound: Box<SpannedValueExpr> },
    UnaryCheck(UnaryCheckOp),
    Full(Box<SpannedBoolExpr>),
}
pub enum Literal { Int(i64), Float(f64), String(String), Bool(bool), Null }
pub enum VerifierOp { EQ, NE, LT, LE, GT, GE }
pub enum UnaryCheckOp { NonEmpty }
pub enum QuantifierOp { ForAll, Exists }
pub enum FuncOp {
    Add, Sub, Mul, Div, Mod, Neg, Abs,
    Concat, Length, Substring, Upper, Lower,
    Head, Tail, Get, Count, GetKeys, GetValues,
}
pub enum Keyword { /* unified keyword enum used by the tokenizer */ }
```
`Spanned<T>` exists so every AST node carries its source span for diagnostics;
future passes that want to annotate nodes should wrap in `Spanned` rather
than threading spans separately.
### Runtime
```rust
pub enum Entity {
    Int(i64), Float(f64), String(String), Bool(bool),
    List(Vec<Entity>), Map(std::collections::HashMap<String, Entity>), Null,
}
pub enum TypeTag { Int, Float, String, Bool, List, Map, Null }
impl Entity {
    pub fn type_tag(&self) -> TypeTag;
    pub fn is_non_empty(&self) -> bool;
}
// Always-on conversions:
impl From<i64> for Entity;
impl From<f64> for Entity;
impl From<bool> for Entity;
impl From<String> for Entity;
impl From<&str> for Entity;
// With the `json` feature:
#[cfg(feature = "json")]
impl From<serde_json::Value> for Entity;
```
```rust
pub struct SymbolTable { /* private */ }
impl SymbolTable {
    pub fn from_entity(root: Entity) -> Self;
    pub fn resolve(&self, symbol: &str, span: Span)
        -> Result<Entity, NightjarLanguageError>;
    pub fn resolve_root_path(&self, path: &str, span: Span)
        -> Result<Entity, NightjarLanguageError>;
    pub fn len(&self) -> usize;
    pub fn is_empty(&self) -> bool;
    pub fn contains(&self, symbol: &str) -> bool;
}
#[cfg(feature = "json")]
impl SymbolTable {
    pub fn from_json(value: serde_json::Value) -> Self;
}
```
```rust
pub struct ExecOptions {
    pub float_epsilon: f64, // default 1e-10
    pub max_depth: usize,   // default 256
}
impl Default for ExecOptions { /* the defaults above */ }
pub enum ExecResult { True, False, Error(NightjarLanguageError) }
impl ExecResult {
    pub fn is_true(&self) -> bool;
    pub fn is_false(&self) -> bool;
    pub fn is_error(&self) -> bool;
}
impl From<Result<bool, NightjarLanguageError>> for ExecResult;
pub fn exec_entity(expression: &str, data: Entity, options: ExecOptions)
    -> ExecResult;
#[cfg(feature = "json")]
pub fn exec(expression: &str, data: serde_json::Value, options: ExecOptions)
    -> ExecResult;
```
### Errors
```rust
pub struct Span { pub start: usize, pub end: usize }
impl Span {
    pub const fn new(start: usize, end: usize) -> Self;
    pub const fn point(at: usize) -> Self;
}
pub enum ErrorCode { E001, E002, E003, E004, E005, E006, E007, E008, E009, E010 }
pub enum NightjarLanguageError {
    ParseError { span: Span, code: ErrorCode, message: String },
    TypeError { span: Span, code: ErrorCode, message: String },
    ArgumentError { span: Span, code: ErrorCode, message: String },
    SymbolNotFound { span: Span, code: ErrorCode, message: String },
    AmbiguousSymbol { span: Span, code: ErrorCode, message: String },
    DivisionByZero { span: Span, code: ErrorCode, message: String },
    RecursionError { span: Span, code: ErrorCode, message: String },
    IndexError { span: Span, code: ErrorCode, message: String },
    IntegerOverflow { span: Span, code: ErrorCode, message: String },
    ScopeError { span: Span, code: ErrorCode, message: String },
}
impl NightjarLanguageError {
    pub fn span(&self) -> Span;
    pub fn code(&self) -> ErrorCode;
    pub fn message(&self) -> &str;
}
```
Error construction helpers (`parse_error`, `type_error`, …) live in
[src/error.rs](src/error.rs) and are `pub(crate)` — they are internal
conveniences, not part of the public API. Downstream code inspects errors
through `.code()`, `.span()`, `.message()`.
---
## Error codes — full reference
Every variant of `ErrorCode` that the implementation can actually raise,
with minimal reproducing expressions or conditions.
| Code | Variant | Raised by | Minimal reproducer |
|---|---|---|---|
| E001 | `ParseError` | Tokenizer, parser | `GT 1 2` (no parens); `(GT 1 2` (unclosed); `"abc` (unterminated). |
| E002 | `TypeError` | Verifier, functions, quantifier | `(GT "a" 1)`; `(Head 42)`; `(ForAll (GT 0) .map)`. |
| E003 | `ArgumentError` | Parser (arity check) | `(GT 1 2 3)`; `(Add 1)`; `(Substring "a" 0)`. |
| E004 | `SymbolNotFound` | Symbol resolver, `Get` on Map | `(GT .absent 0)` against `{}`; `(Get .m "missing")`. |
| E005 | `AmbiguousSymbol` | Reserved — not raised today | *(no reproducer; placeholder for future shorthand lookup)* |
| E006 | `DivisionByZero` | `Div`, `Mod` | `(Div 1 0)`; `(Mod 1 0.0)`. |
| E007 | `RecursionError` | Parser (depth guard) | `(NOT (NOT (NOT …)))` deeper than `max_depth` (default 256). |
| E008 | `IndexError` | `Head`, `Tail`, `Get` on List | `(Head [])`; `(Tail [])`; `(Get [1,2] 5)`. |
| E009 | `IntegerOverflow` | Checked arithmetic | `(EQ (Add 9223372036854775807 1) 0)`. |
| E010 | `ScopeError` | `validate_scope` (and defensive runtime check) | `(EQ @.a 1)` at top level. |
E005 is reserved for a future shorthand-lookup mode (leaf-name resolution
with ambiguity detection). Tools should accept it as a valid code but should
not expect to see it from the current executor.
---
## Architecture and module layout
Everything lives under `src/`.
| Path | Contents |
|---|---|
| [src/lib.rs](src/lib.rs) | Crate root and public re-exports. The authoritative list of what is `pub`. |
| [src/error.rs](src/error.rs) | `NightjarLanguageError`, `ErrorCode`, `Span`, internal `pub(crate)` helper constructors. |
| [src/language/grammar.rs](src/language/grammar.rs) | AST types, operator enums (`VerifierOp`, `FuncOp`, `QuantifierOp`, `UnaryCheckOp`, `Keyword`), `Predicate`, `Literal`, `SymbolRoot`, `Spanned`, `FuncOp::expected_arity`, authoritative EBNF in the module doc-comment. |
| [src/language/parser.rs](src/language/parser.rs) | Tokenizer, recursive-descent parser, `ParserConfig`, `parse`, `parse_with_config`, post-parse `validate_scope`. |
| [src/symbol_table.rs](src/symbol_table.rs) | `SymbolTable`, flattening algorithm, `resolve_in_entity` (element-rooted walker). |
| [src/executor.rs](src/executor.rs) | `ExecOptions`, `ExecResult`, `exec`, `exec_entity`, private `eval_bool` / `eval_value` / `resolve_predicate`. |
| [src/context/mod.rs](src/context/mod.rs) | Module grouping. |
| [src/context/entity.rs](src/context/entity.rs) | `Entity`, `TypeTag`, `is_non_empty`, `From` impls (including `serde_json::Value` under the `json` feature). |
| [src/context/verifier.rs](src/context/verifier.rs) | `apply_verifier` — EQ/NE/LT/LE/GT/GE dispatch, epsilon equality, NaN handling. |
| [src/context/function.rs](src/context/function.rs) | `apply_function` — arithmetic, string, collection functions. |
| [src/context/quantifier.rs](src/context/quantifier.rs) | `EvalPredicate`, `apply_predicate`, `apply_quantifier`, `apply_quantifier_full`. |
| [src/context/connective.rs](src/context/connective.rs) | `apply_and`, `apply_or`, `apply_not`. |
| [tests/test_parser.rs](tests/test_parser.rs) | Phase-1 integration tests. |
| [tests/test_executor.rs](tests/test_executor.rs) | Phase-2 integration tests. |
The directory structure mirrors the two-phase pipeline: `language/*` is
everything the parser needs, `context/*` is everything the runtime needs,
and `executor.rs` + `symbol_table.rs` glue them together.
---
## Extending the language
All recipes below assume you are editing the crate in-place. Every extension
should ship with tests — see [Testing strategy](#testing-strategy).
### Recipe A — Add a new built-in function
Suppose you are adding a `Reverse` function that takes a `String` or a `List`
and returns the reversed value.
1. **Grammar layer** — [src/language/grammar.rs](src/language/grammar.rs):
- Add `Reverse` to `FuncOp`.
- Add an entry in `FuncOp::expected_arity` returning `1`.
- Add a keyword constant for `"Reverse"` to the `Keyword` enum (and any
operator-name → `Keyword` mapping used by the tokenizer).
- Update the EBNF comment to list `Reverse` under `arith_op` /
`string_op` / `collection_op` as appropriate. Keep this block in sync
with this document's `## Formal language specification` section.
2. **Tokenizer** — [src/language/parser.rs](src/language/parser.rs):
- Register the keyword string so the tokenizer emits the new `Keyword`
variant.
3. **Parser** — [src/language/parser.rs](src/language/parser.rs):
- `func_expr` parsing is driven by `FuncOp::expected_arity`, so usually
nothing new is needed. Verify by adding a parse test.
4. **Runtime** — [src/context/function.rs](src/context/function.rs):
- Extend the match arms in `apply_function` to handle `FuncOp::Reverse`.
- Return the right `TypeTag`-tagged result; use `type_error` for bad
input types; reuse the existing error helpers.
5. **Public re-exports** — [src/lib.rs](src/lib.rs):
- No change is needed if `FuncOp` is already re-exported (it is).
6. **Tests**:
- Add unit tests in `#[cfg(test)] mod tests` inside
[src/context/function.rs](src/context/function.rs) for the happy path
and each error branch.
- Add at least one integration test in
[tests/test_parser.rs](tests/test_parser.rs) (parses) and
[tests/test_executor.rs](tests/test_executor.rs) (evaluates).
7. **Documentation**:
- Update the operator table in [README.md](README.md) under *Operator
cheat-sheet*.
- Update the relevant subsection under *Operator semantics* in this file.
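The runtime part of this recipe (step 4) can be sketched as follows. This is a minimal, self-contained illustration, not the crate's code: the `Entity`, `FuncOp`, and `EvalError` types here are simplified stand-ins, and the real `apply_function` signature and error helpers may differ.

```rust
// Simplified stand-ins for the crate's types (assumptions for illustration);
// the real `Entity`, `FuncOp`, and error type have more variants and spans.
#[derive(Debug, Clone, PartialEq)]
enum Entity {
    Int(i64),
    String(String),
    List(Vec<Entity>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum FuncOp {
    Upper,
    Reverse, // the new variant from step 1
}

impl FuncOp {
    // Plays the role of `FuncOp::expected_arity` from step 1.
    fn expected_arity(self) -> usize {
        match self {
            FuncOp::Upper | FuncOp::Reverse => 1,
        }
    }
}

#[derive(Debug, PartialEq)]
enum EvalError {
    Arity { expected: usize, got: usize },
    Type(String),
}

fn apply_function(op: FuncOp, args: &[Entity]) -> Result<Entity, EvalError> {
    if args.len() != op.expected_arity() {
        return Err(EvalError::Arity { expected: op.expected_arity(), got: args.len() });
    }
    match (op, &args[0]) {
        (FuncOp::Upper, Entity::String(s)) => Ok(Entity::String(s.to_uppercase())),
        // Reverse by `char`, not by byte, so multi-byte characters stay valid.
        (FuncOp::Reverse, Entity::String(s)) => Ok(Entity::String(s.chars().rev().collect())),
        (FuncOp::Reverse, Entity::List(items)) => {
            Ok(Entity::List(items.iter().rev().cloned().collect()))
        }
        // Every other (op, type) combination is rejected explicitly.
        (op, other) => Err(EvalError::Type(format!("{:?} does not accept {:?}", op, other))),
    }
}

fn main() {
    let reversed = apply_function(FuncOp::Reverse, &[Entity::String("abc".into())]);
    println!("{:?}", reversed);
    let rejected = apply_function(FuncOp::Reverse, &[Entity::Int(7)]);
    println!("{:?}", rejected);
}
```

The happy path and each error branch in this sketch correspond to the unit tests that step 6 asks for.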
### Recipe B — Add a new verifier
Adding, say, `Contains` (string contains substring):
1. Add `Contains` to `VerifierOp` (or, if it's genuinely a new family,
create a new enum alongside `VerifierOp`). If in doubt, prefer a new
family: every existing verifier is an equality or ordering comparison,
and `Contains` is neither.
2. If it lands in `VerifierOp`: extend `apply_verifier` in
[src/context/verifier.rs](src/context/verifier.rs) with the new arm,
including type checks and `TypeError` for bad inputs.
3. Extend tokenizer, parser arity, and EBNF as in Recipe A.
4. Tests + docs as in Recipe A.
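If `Contains` goes into its own family, the runtime side might look like the sketch below. `ContainmentOp`, `apply_containment`, the `TypeError` struct, and the (haystack, needle) argument order are all hypothetical names and choices made for illustration; `Entity` is a reduced stand-in for the crate's type.

```rust
// Reduced stand-in for the crate's `Entity` (assumption for illustration).
#[derive(Debug, Clone, PartialEq)]
enum Entity {
    Int(i64),
    String(String),
}

// Hypothetical new operator family, kept apart from `VerifierOp`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ContainmentOp {
    Contains,
}

// Hypothetical error type standing in for the crate's `TypeError`.
#[derive(Debug, PartialEq)]
struct TypeError(String);

// Assumed argument order: the string being searched first, the substring second.
fn apply_containment(
    op: ContainmentOp,
    haystack: &Entity,
    needle: &Entity,
) -> Result<bool, TypeError> {
    match (op, haystack, needle) {
        (ContainmentOp::Contains, Entity::String(h), Entity::String(n)) => {
            Ok(h.contains(n.as_str()))
        }
        // Any non-string operand is a type error, never a silent `false`.
        (ContainmentOp::Contains, h, n) => Err(TypeError(format!(
            "Contains expects (String, String), got ({:?}, {:?})",
            h, n
        ))),
    }
}

fn main() {
    let hay = Entity::String("nightjar".into());
    let needle = Entity::String("jar".into());
    println!("{:?}", apply_containment(ContainmentOp::Contains, &hay, &needle));
    println!("{:?}", apply_containment(ContainmentOp::Contains, &hay, &Entity::Int(1)));
}
```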
### Recipe C — Add a new quantifier
Example: `Count` (count elements satisfying a predicate) — note this would
return an `Int`, not a `Bool`, so it belongs in a new family (value-producing
quantifier), not in `QuantifierOp`.
1. Decide whether it is boolean-returning (goes alongside `ForAll`/`Exists`)
or value-returning (goes alongside `FuncOp`). Boolean quantifiers reuse
the `Quantifier` arm of `BoolExpr`; value-returning quantifiers need a
new AST variant — plan that change first.
2. For a boolean quantifier: add a variant to `QuantifierOp`; extend
`apply_quantifier` / `apply_quantifier_full` with the new reduction;
extend `eval_bool`'s quantifier arm if new predicate shapes are needed.
3. For a value-returning quantifier: add a new `ValueExpr` variant (e.g.
`ValueQuantifier { op, predicate, operand }`), extend the parser with a
new parse arm, add an executor arm in `eval_value`. Re-export the new
AST types from `lib.rs`.
4. Scope validator: entering the predicate position must still increment
`predicate_depth`, otherwise `@` will escape.
5. Tests + docs as in Recipe A.
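For the value-returning case, the reduction behind a `Count`-style quantifier could be sketched like this. The `apply_count` function, the closure-based predicate, and the `String` error type are assumptions for illustration; the sketch also rejects non-list operands outright, which may not match how the crate's scalar fallback behaves.

```rust
// Reduced stand-in for the crate's `Entity` (assumption for illustration).
#[derive(Debug, Clone, PartialEq)]
enum Entity {
    Int(i64),
    List(Vec<Entity>),
}

// Hypothetical reduction for a value-returning quantifier: it yields an
// `Int`, not a `Bool`, which is why it cannot live in `QuantifierOp`.
fn apply_count<F>(predicate: F, operand: &Entity) -> Result<Entity, String>
where
    F: Fn(&Entity) -> Result<bool, String>,
{
    let items = match operand {
        Entity::List(items) => items,
        // Assumption: no scalar fallback in this sketch.
        other => return Err(format!("Count expects a List, got {:?}", other)),
    };
    let mut count: i64 = 0;
    for item in items {
        // A predicate error aborts the whole reduction instead of being
        // counted as "not satisfied".
        if predicate(item)? {
            count += 1;
        }
    }
    Ok(Entity::Int(count))
}

// Example predicate, playing the role of a partial verifier such as `(GT 0)`.
fn is_positive(element: &Entity) -> Result<bool, String> {
    match element {
        Entity::Int(n) => Ok(*n > 0),
        other => Err(format!("expected Int, got {:?}", other)),
    }
}

fn main() {
    let data = Entity::List(vec![Entity::Int(-1), Entity::Int(2), Entity::Int(3)]);
    println!("{:?}", apply_count(is_positive, &data));
}
```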
### Recipe D — Add a new data type
Any change to `Entity` is load-bearing; every operator that inspects
`TypeTag` potentially needs updating.
1. Add the variant to `Entity` and `TypeTag` in
[src/context/entity.rs](src/context/entity.rs). Implement `type_tag()`
and `is_non_empty()` — both must remain total.
2. Provide `From` impls as appropriate for host integrations. If the `json`
feature has to represent the new type, update `From<serde_json::Value>`.
3. Update the flattener in [src/symbol_table.rs](src/symbol_table.rs) so
that the new type flattens correctly (either descend or not, but make
the choice explicitly).
4. Update `apply_verifier` in
[src/context/verifier.rs](src/context/verifier.rs) — decide equality
semantics for the new type, and whether ordering makes sense. Cross-type
comparisons must remain `TypeError`.
5. Update `apply_function` in
[src/context/function.rs](src/context/function.rs) — every existing op
must either accept or reject the new type explicitly (current match arms
must gain a `_ => TypeError` path if they don't already).
6. Update the scalar-fallback path of `apply_quantifier`: decide whether the
new type is iterated element by element or handled by the scalar fallback.
7. Update `resolve_in_entity` in
[src/symbol_table.rs](src/symbol_table.rs): if the new type is
path-addressable (like Map/List) add a walker arm; otherwise let the
`_ => TypeError` branch catch it.
8. Tests + docs; update the type table in both [README.md](README.md) and
the *Data types* subsection here.
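Steps 1 and 4 can be illustrated with a small sketch. The `Timestamp` variant is a made-up example type, the `is_non_empty` rules for `Null` and numbers are assumptions, and `apply_eq` is a hypothetical reduction of what `apply_verifier` does for `EQ`. The point is that exhaustive matches without a wildcard arm make the compiler flag every place that must handle the new variant.

```rust
// Reduced stand-ins for the crate's `Entity` and `TypeTag`; `Timestamp` is
// a hypothetical new data type used only for illustration.
#[derive(Debug, Clone, PartialEq)]
enum Entity {
    Null,
    Int(i64),
    String(String),
    List(Vec<Entity>),
    Timestamp(i64), // new variant
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum TypeTag {
    Null,
    Int,
    String,
    List,
    Timestamp, // new tag
}

impl Entity {
    // No `_ =>` arm: adding a variant without updating this fails to compile.
    fn type_tag(&self) -> TypeTag {
        match self {
            Entity::Null => TypeTag::Null,
            Entity::Int(_) => TypeTag::Int,
            Entity::String(_) => TypeTag::String,
            Entity::List(_) => TypeTag::List,
            Entity::Timestamp(_) => TypeTag::Timestamp,
        }
    }

    // Assumed emptiness rules; the crate's actual rules may differ.
    fn is_non_empty(&self) -> bool {
        match self {
            Entity::Null => false,
            Entity::Int(_) => true,
            Entity::String(s) => !s.is_empty(),
            Entity::List(items) => !items.is_empty(),
            Entity::Timestamp(_) => true,
        }
    }
}

// Hypothetical equality verifier: cross-type comparison stays an error.
fn apply_eq(a: &Entity, b: &Entity) -> Result<bool, String> {
    if a.type_tag() != b.type_tag() {
        return Err(format!("cannot compare {:?} with {:?}", a.type_tag(), b.type_tag()));
    }
    Ok(a == b)
}

fn main() {
    println!("{:?}", apply_eq(&Entity::Timestamp(5), &Entity::Timestamp(5)));
    println!("{:?}", apply_eq(&Entity::Timestamp(5), &Entity::Int(5)));
    println!("{}", Entity::Null.is_non_empty());
}
```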
### Recipe E — Swap the Map backing or the `Clone` strategy
If you replace `HashMap<String, Entity>` with a different container, the
only externally-visible invariant that must survive is that `GetKeys` and
`GetValues` produce sorted output. If you replace `Entity: Clone` with
`Rc<Entity>`-sharing, every `From` impl, every `apply_*` signature, and
every executor arm that clones will need touching — plan the change as a
whole crate refactor, not an incremental one, and keep the public API
stable.
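The sorted-output invariant for keys can be pinned down independently of the container. The sketch below uses plain `i64` values and two hypothetical helper functions (`get_keys_hash`, `get_keys_btree`) to show that a hash-backed map needs an explicit sort while an ordered map gets it for free; it is not the crate's `GetKeys` implementation.

```rust
use std::collections::{BTreeMap, HashMap};

// Hash-backed map: iteration order is unspecified, so sort explicitly.
fn get_keys_hash(map: &HashMap<String, i64>) -> Vec<String> {
    let mut keys: Vec<String> = map.keys().cloned().collect();
    keys.sort();
    keys
}

// Ordered map: iteration already follows key order.
fn get_keys_btree(map: &BTreeMap<String, i64>) -> Vec<String> {
    map.keys().cloned().collect()
}

fn main() {
    let pairs = [("c", 3), ("a", 1), ("b", 2)];
    let hash: HashMap<String, i64> =
        pairs.iter().map(|(k, v)| (k.to_string(), *v)).collect();
    let btree: BTreeMap<String, i64> =
        pairs.iter().map(|(k, v)| (k.to_string(), *v)).collect();
    // Both backings must yield the same externally visible result.
    println!("{:?}", get_keys_hash(&hash));
    println!("{:?}", get_keys_btree(&btree));
}
```

A regression test that compares the two outputs is a cheap way to guard the invariant while the backing container changes.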
### EBNF drift check
The EBNF in this file must match the EBNF block in
[src/language/grammar.rs:23-110](src/language/grammar.rs#L23-L110) exactly.
When you add an operator, update both and diff them in your commit. If they
drift, the parser and the documentation disagree and the next contributor
will act on the wrong one.
---
## Testing strategy
Nightjar has three layers of tests.
1. **Module-local unit tests.** Every non-trivial module has
`#[cfg(test)] mod tests { … }` right at the bottom. These are the first
line of defence for new behaviour. Every helper function and every match
arm should have at least one happy-path test and one error-branch test
(where an error branch exists).
2. **Integration tests.** [tests/test_parser.rs](tests/test_parser.rs) and
[tests/test_executor.rs](tests/test_executor.rs) exercise the public API
end-to-end: `parse`, `exec`, `exec_entity`, `ExecResult`, error variants.
When you add an operator, add at least one parser test (it parses) and
one executor test (it evaluates correctly on real data).
3. **Property-based testing.** `proptest` is in `[dev-dependencies]`. For
operators with algebraic properties (associativity of `Concat`,
commutativity of `Add` on `Int`, idempotence of `Upper ∘ Upper`, …),
property tests are the appropriate form. Prefer them to hand-rolled
edge-case tables for anything fuzz-adjacent.
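The algebraic properties named above can be written down as plain predicates. The sketch below is deliberately std-only and checks them over a small fixed corpus; it is not `proptest` code. In a real property test, `proptest` strategies would generate the inputs and the bodies of these checks would become the assertions. `upper`, `concat`, and the use of `checked_add` for `Add` are stand-ins for the crate's operators.

```rust
// Stand-ins for the operators under test (assumptions for illustration).
fn upper(s: &str) -> String {
    s.to_uppercase()
}

fn concat(a: &str, b: &str) -> String {
    format!("{}{}", a, b)
}

// Idempotence: Upper(Upper(s)) == Upper(s).
fn upper_is_idempotent(corpus: &[&str]) -> bool {
    corpus.iter().all(|s| upper(&upper(s)) == upper(s))
}

// Associativity: Concat(Concat(a, b), c) == Concat(a, Concat(b, c)).
fn concat_is_associative(corpus: &[&str]) -> bool {
    corpus.iter().all(|a| {
        corpus.iter().all(|b| {
            corpus
                .iter()
                .all(|c| concat(&concat(a, b), c) == concat(a, &concat(b, c)))
        })
    })
}

// Commutativity of integer addition, including the overflow case
// (both orders must overflow together).
fn add_is_commutative(values: &[i64]) -> bool {
    values
        .iter()
        .all(|&a| values.iter().all(|&b| a.checked_add(b) == b.checked_add(a)))
}

fn main() {
    let strings = ["", "a", "Night", "jar"];
    let ints = [i64::MIN, -1, 0, 1, i64::MAX];
    println!("upper idempotent: {}", upper_is_idempotent(&strings));
    println!("concat associative: {}", concat_is_associative(&strings));
    println!("add commutative: {}", add_is_commutative(&ints));
}
```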
### Running tests
```sh
cargo test # default features (json on)
cargo test --no-default-features # core-only build (no serde_json)
cargo test --features yaml # yaml dep compiled in
```
CI should run all three so that regressions in feature-gated code are caught.
---
## Design decisions and rationale
### Why prefix notation?
Prefix (S-expression-style) notation removes operator precedence and
associativity entirely. There is no "does `AND` bind tighter than `OR`?"
question because every expression is fully parenthesised. The parser is a
few hundred lines, the grammar is small enough to fit in this document,
and the AST shape is exactly the expression's surface shape. An infix
surface syntax could be added externally later as a layer that compiles
to this AST — the canonical form stays prefix.
### Why a three-valued `ExecResult`?
Formal verification loses its meaning if a missing key silently becomes
`Null` and the rule silently becomes `False`. The host cannot tell a
rule-was-false from rule-could-not-be-evaluated. By carving `Error` out
from the result type, Nightjar forces the host to decide how to handle
each case (log, fail-open, fail-closed, retry, …) rather than collapsing
them at the library boundary.
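On the host side, the three-way split might be consumed as in the sketch below. The `ExecResult` shown here is a simplified stand-in (the real `Error` payload carries spans and codes rather than a bare `String`), and `Decision` and `decide_fail_closed` are hypothetical host-side names.

```rust
// Simplified stand-in for the crate's `ExecResult`; the real error payload
// is richer than a `String` (assumption for illustration).
#[derive(Debug, PartialEq)]
enum ExecResult {
    True,
    False,
    Error(String),
}

// Hypothetical host-side outcome.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny,
}

// One possible host policy: fail closed, but keep "rule was false" and
// "rule could not be evaluated" distinguishable in the logs.
fn decide_fail_closed(result: &ExecResult) -> Decision {
    match result {
        ExecResult::True => Decision::Allow,
        ExecResult::False => Decision::Deny,
        ExecResult::Error(reason) => {
            eprintln!("rule could not be evaluated: {}", reason);
            Decision::Deny
        }
    }
}

fn main() {
    println!("{:?}", decide_fail_closed(&ExecResult::True));
    println!("{:?}", decide_fail_closed(&ExecResult::False));
    println!("{:?}", decide_fail_closed(&ExecResult::Error("missing key".into())));
}
```

A fail-open host would map the `Error` arm to `Allow` instead; the library leaves that choice to the caller.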
### Why epsilon equality on floats but IEEE ordering?
Equality is the comparison most sensitive to IEEE 754 representation
error: `0.1 + 0.2 != 0.3` is a foot-gun that Nightjar rules should not
step on. Ordering is much less sensitive to the same error (a one-ulp drift
changes the outcome only when the two operands are already nearly equal),
and the IEEE rules for ordering are already what users expect
from comparisons. Mixing the two would require users to reason about an
epsilon in contexts where it doesn't help them.
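The split can be made concrete with a short sketch. The `EPSILON` value, the absolute-difference formula, and the function names are assumptions chosen for illustration; the crate's actual tolerance and NaN handling live in `apply_verifier`.

```rust
// Hypothetical tolerance; the crate's real epsilon may differ.
const EPSILON: f64 = 1e-9;

// Epsilon equality. The `a == b` shortcut keeps equal infinities equal;
// NaN is never equal to anything because every comparison with NaN is false.
fn float_eq(a: f64, b: f64) -> bool {
    a == b || (a - b).abs() <= EPSILON
}

// Ordering uses plain IEEE 754 comparison, with no epsilon involved.
fn float_lt(a: f64, b: f64) -> bool {
    a < b
}

fn main() {
    let sum = 0.1 + 0.2;
    println!("raw ==     : {}", sum == 0.3);
    println!("float_eq   : {}", float_eq(sum, 0.3));
    println!("float_lt   : {}", float_lt(0.3, sum));
    println!("NaN eq NaN : {}", float_eq(f64::NAN, f64::NAN));
}
```

Under these assumptions `0.1 + 0.2` compares equal to `0.3`, while the ordering comparison still observes the one-ulp difference between them, and NaN is equal to nothing, including itself.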
### Why 0-based list indexing via `_N`?
0-based aligns with Rust, JavaScript, Python, C, and nearly every modern
language; 1-based would surprise most implementers. The `_` prefix keeps
the index segment syntactically distinct from map keys (which start with a
letter or digit-less identifier), and the same convention is used both in
the flat symbol table and in `resolve_in_entity`.
### Why flatten into a HashMap?
Most Nightjar rules look up several fields of the same payload; a flat
table makes each lookup O(1) after a single O(N) build. Path-walking at
every symbol reference would be cheaper in memory but much more expensive
per lookup, especially for rules with many references. Flattening pays off
most for wide, shallow data (typical API payloads); it is least favourable for
very long lists, which is why the host is expected to bound list size.
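A minimal sketch of the idea follows. It assumes dot-separated paths such as `.data.revenue`, `_N` segments for list elements, and that intermediate containers are registered alongside leaves; the `Entity` type is a reduced stand-in and `flatten` is a hypothetical function, not the crate's flattener.

```rust
use std::collections::HashMap;

// Reduced stand-in for the crate's `Entity` (assumption for illustration).
#[derive(Debug, Clone, PartialEq)]
enum Entity {
    Int(i64),
    List(Vec<Entity>),
    Map(HashMap<String, Entity>),
}

// One O(N) walk registers every node under its full path; afterwards each
// symbol reference is a single hash lookup.
fn flatten(path: &str, entity: &Entity, table: &mut HashMap<String, Entity>) {
    if !path.is_empty() {
        // Assumption: containers are registered too, not only leaves.
        table.insert(path.to_string(), entity.clone());
    }
    match entity {
        Entity::Map(fields) => {
            for (key, value) in fields {
                flatten(&format!("{}.{}", path, key), value, table);
            }
        }
        Entity::List(items) => {
            // One entry per element: this is the cost that grows with list size.
            for (index, item) in items.iter().enumerate() {
                flatten(&format!("{}._{}", path, index), item, table);
            }
        }
        Entity::Int(_) => {}
    }
}

// Sample payload: { data: { revenue: 100, items: [1, 2] } }.
fn sample() -> Entity {
    let mut data = HashMap::new();
    data.insert("revenue".to_string(), Entity::Int(100));
    data.insert("items".to_string(), Entity::List(vec![Entity::Int(1), Entity::Int(2)]));
    let mut root = HashMap::new();
    root.insert("data".to_string(), Entity::Map(data));
    Entity::Map(root)
}

fn main() {
    let mut table = HashMap::new();
    flatten("", &sample(), &mut table);
    println!("{:?}", table.get(".data.revenue"));
    println!("{:?}", table.get(".data.items._0"));
    println!("{} entries", table.len());
}
```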
### Why no short-circuit in AND / OR today?
Error visibility. If `AND` short-circuits and the right-hand side would
have errored, the rule's author never learns. Non-short-circuit evaluation
surfaces every error, which is the behaviour a verification tool wants.
If a future release adds opt-in short-circuit (via `ExecOptions`), it must
document that errors in the skipped branch are hidden.
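The difference is easiest to see on a case where the left operand is already `False`. The sketch below uses a simplified three-valued `ExecResult` and a hypothetical `apply_and` over already-evaluated operands; the crate's `apply_and` may have a different signature, and the "left error wins" tie-break is an assumption.

```rust
// Simplified stand-in for the crate's result type (assumption for illustration).
#[derive(Debug, Clone, PartialEq)]
enum ExecResult {
    True,
    False,
    Error(String),
}

// Both operands have been evaluated before this function runs, so an error
// on either side is always visible.
fn apply_and(lhs: ExecResult, rhs: ExecResult) -> ExecResult {
    match (lhs, rhs) {
        // Assumption: when both sides fail, the left error is reported.
        (ExecResult::Error(e), _) => ExecResult::Error(e),
        (_, ExecResult::Error(e)) => ExecResult::Error(e),
        (ExecResult::True, ExecResult::True) => ExecResult::True,
        _ => ExecResult::False,
    }
}

fn main() {
    // A short-circuiting AND would stop at `False` and hide the error.
    let result = apply_and(ExecResult::False, ExecResult::Error("missing key".into()));
    println!("{:?}", result);
}
```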
### Why `@` as a separate sigil, not a lambda?
A lambda would bring first-class functions, closures over names, and a
name-resolution layer into the language. Nightjar is deliberately
first-order: predicates are syntactic forms, not values. `@` is a lexical
marker that means "the current element of the innermost quantifier". It
has no runtime representation other than a value binding, and it cannot
escape its quantifier.
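The "cannot escape" property is a purely syntactic check, which a depth counter is enough to enforce. The sketch below uses a tiny hypothetical `Expr` type and a simplified `validate_scope`; only the names `validate_scope` and `predicate_depth` come from this document, and the real AST and error type differ.

```rust
// Tiny hypothetical AST, just large enough to show the scope rule.
#[derive(Debug)]
enum Expr {
    Symbol(String), // e.g. `.data.items`
    Element,        // `@`
    Not(Box<Expr>),
    Quantifier { predicate: Box<Expr>, operand: Box<Expr> },
}

// `@` is legal only while at least one enclosing predicate is open.
fn validate_scope(expr: &Expr, predicate_depth: usize) -> Result<(), String> {
    match expr {
        Expr::Symbol(_) => Ok(()),
        Expr::Element if predicate_depth == 0 => {
            Err("`@` used outside a quantifier predicate".to_string())
        }
        Expr::Element => Ok(()),
        Expr::Not(inner) => validate_scope(inner, predicate_depth),
        Expr::Quantifier { predicate, operand } => {
            // The operand belongs to the enclosing scope: depth is unchanged.
            validate_scope(operand, predicate_depth)?;
            // Only the predicate position opens a new scope for `@`.
            validate_scope(predicate, predicate_depth + 1)
        }
    }
}

fn main() {
    let inside = Expr::Quantifier {
        predicate: Box::new(Expr::Not(Box::new(Expr::Element))),
        operand: Box::new(Expr::Symbol(".data.items".to_string())),
    };
    let outside = Expr::Not(Box::new(Expr::Element));
    println!("{:?}", validate_scope(&inside, 0));
    println!("{:?}", validate_scope(&outside, 0));
}
```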
---
## Known limitations and deferred features
- **Shorthand symbol resolution (E005).** Looking up a leaf name like
`revenue` without the full path `.data.revenue` and reporting
`AmbiguousSymbol` when it matches multiple paths is planned but not
implemented. The strict, fully-qualified form is the only form today.
- **Short-circuit evaluation.** Not available today; see the rationale
above.
- **REPL.** There is no interactive shell for Nightjar rules; the
batch/CLI pattern in the README serves the same purpose.
- **Infix → prefix converter.** An external convenience tool would let
users write `1 + 2 > 0` and compile it to `(GT (Add 1 2) 0)`. Out of
scope for the language itself; a reasonable standalone crate.
- **Currying beyond quantifier predicates.** The partial-verifier form
`(GT 0)` is the only currying the language does. Generalising it is
possible (arity-based disambiguation already discriminates partial from
full) but explicitly deferred.
- **`no_std` / WASM.** Not a current target. Neither `std` removal nor a
dedicated WASM build is in scope today.
- **Unbounded list unrolling.** The flattener registers one symbol-table
entry per list element. The host is responsible for bounding list
sizes before passing data to Nightjar; there is no configurable
upper bound in the library.
- **`Program`-accepting `exec`.** Today, `exec` / `exec_entity` re-parse
on every call. A future release may add a variant that accepts a
pre-built `Program` for hot loops; for now, consumers that need
parse-once behaviour can drive evaluation themselves using the public
AST.
---
## License
Licensed under the Apache License, Version 2.0.
See [LICENSE](LICENSE) for the full text.
Copyright © Wayne Hong (h-alice) <contact@halice.art>.