jetro-core
Parser, compiler, and bytecode VM for the Jetro JSON query language.
Pipeline
source string
│
▼ parser::parse pest PEG → syntax tree
Expr (AST)
│
▼ Compiler::emit structural lowering → Vec<Opcode>
│ Compiler::optimize 10 peephole passes
Program Arc<[Opcode]>, cheap to clone
│
▼ VM::execute stack machine over &serde_json::Value
serde_json::Value
The tree-walker in eval/ is the reference implementation: when VM behaviour diverges from it, the VM is wrong. Every new language feature lands in the tree-walker first, then is mirrored in the compiler and opcode set.
Layer 1 — parser (pest PEG)
grammar.pest defines the full syntax in pest rules. Highlights:
- Root / current:
$(document root),@(current item inside filter / comprehension). - Navigation:
.field,[idx],[start:end],..descendant,?.optional,.{dynamic}. - Method calls:
.name(arg, named: value); named + positional args mix freely. - Operators: arithmetic (
+ - * / %), comparison (== != < <= > >= ~=), logical (and or not), nullish (?|), type cast (as int). - Literals: int, float, bool, null,
"str",'str',f"format {expr}". - Kind checks:
x kind number,x kind not null. - Comprehensions: list
[e for v in iter if cond], dict{k: v for …}, set, generator. - Let:
let x = init in body. - Lambdas:
lambda x, y: body. - Pipelines:
base | step1 | step2, optional named bind-> name |. - Conditional: Python-style ternary
then_ if cond else else_(right-assoc). - Reserved words (guarded against identifier use):
and or not for in if else let lambda kind is as when patch DELETE true false null.
The parser walks pest's concrete syntax tree once, producing an ast::Expr. Identifiers are stored as Arc<str> so that cloning a name into an opcode is a refcount bump instead of a String allocation.
ast::Expr
Variants are orthogonal — each maps to one language concept. Subtrees are Box<Expr> (owned, not shared) because the compiler mutates the AST in place during reorder_and_operands.
Layer 2 — tree-walker (eval/)
eval::evaluate(&ast, &doc) is the shortest path from AST to result.
Val
Every compound variant wraps its payload in Arc. Cloning a Val bumps a refcount — no deep copy, no heap churn during chain traversal. Mutation uses Arc::try_unwrap with a clone-on-fallback pattern.
Val is the engine's internal currency. Conversion to / from serde_json::Value happens exactly twice per query: once on entry (Val::from(&doc)), once on exit (Val → Value).
Env
Let-bindings and comprehension variables push onto vars; scope exit truncates. The inline capacity of 4 covers every realistic query, and the linear lookup is faster than any hashmap at that size.
MethodRegistry
A HashMap<&'static str, Arc<dyn Method>> that the user can extend with custom methods. Method is:
The registry itself is Clone (all entries are Arc<dyn Method>), so threading it through recursive calls is free.
Layer 3 — bytecode VM (vm.rs)
The VM is an accelerated path on top of the tree-walker. It uses:
- A compile cache — parse + compile happen once per unique expression string.
- A pointer cache — purely structural programs (
$.a.b[0]) cache their resolved pointer path keyed by(doc_hash, program_id)and skip traversal on re-run. - A flat opcode array —
Arc<[Opcode]>; the execution loop iterates linearly with no recursion for navigation/arithmetic.
Program
A Program is immutable and Arc-cloneable. The is_structural flag is set when every opcode is pure navigation (PushRoot, GetField, GetIndex, GetSlice, OptField, RootChain, GetPointer) — those programs are eligible for the resolution cache.
Opcode
Representative slice — the full set is ~60 variants, grouped by purpose:
Compilation
compile
Runs three phases:
- AST rewrite —
reorder_and_operandscommutesa and bifbis cheaper/more-selective. The estimate is shallow but cheap. - Lowering —
emitwalks the (possibly rewritten) AST and produces a flatVec<Opcode>. Sub-programs (lambda bodies, method args, comprehension iterators) are recursively compiled and stored asArc<Program>. - Optimisation —
optimizeruns the peephole passes below. - Post-pass —
analysis::dedup_subprogramscanonicalises identicalArc<Program>sub-trees (common-subexpression elimination at the Program level).
Peephole passes
Run in this order — each may expose new fusions for the next:
root_chain—PushRootfollowed byGetField*collapses to a singleRootChainopcode. The whole path is resolved againstdocvia one pointer walk; structural programs made entirely ofRootChainbecome eligible for the pointer cache.filter_count—filter(p).count()→FilterCount(p). No intermediate array.filter_map / map_map / filter_filter— single-pass fusion of adjacent combinators.find_quantifier—filter(p).first()/.one()→ early-exitFindFirst/FindOne.strength_reduction—len() == 0→is_empty;.sort()[0:n]→TopN;map(f).sum()/.avg()fused.redundant_op_removal—PushCurrentfollowed immediately bySetCurrenton the same value cancels.kind_check_fold—expr kind Twhereexpris a literal folds to aPushBool.method_const_fold— builtins applied to literal args fold at compile time ("ab".len()→PushInt(2)).expr_const_fold— arithmetic on adjacent integer/float literals folds.nullness_specialisation— when upstream analysis proves a field cannot be absent,OptFielddowngrades toGetField(one fewer branch per execution).field_specialise—.map(x => x.k)and.filter(@.k op lit)collapse to singleMapField* / FilterField*opcodes; trivial field chains fuse intoMapFieldChain, dedup variants intoMapFieldUnique/MapFieldChainUnique.list_comp_specialise—[x.k for x in iter]and[x.k for x in iter if x.p op lit]splice the iter sub-program inline + emit the same fusedMapField/FilterFieldEqLitMapFieldopcodes as the method-chain form. No per-item opcode dispatch.ifelse_const_fold—then_ if true else e/then_ if false else ecollapse to the taken branch at compile time.
Each pass has a PassConfig toggle; VM::set_pass_config lets you disable any of them at runtime (the cache keys off the config hash, so toggles don't return stale programs).
VM state
compile_cache— keyed by(pass_config_hash, source). LRU eviction atcompile_cap(default 512).path_cache— structural programs map(doc_hash, program_id) → resolved Val.doc_hashis computed once per top-levelexecute()viahash_val_structure, which hashes both structure and primitive values (bounded at depth 8). Two documents with the same shape but different leaf values produce different hashes — otherwise the cache would return stale results across distinct docs.registry— shared viaArcso multiple VMs on the same thread share method implementations.
Execution
;
;
execute enters a tight loop over program.ops, maintaining a Vec<Val> operand stack. Opcodes that need sub-programs (method args, lambda bodies, comprehension iterators) recurse through exec(sub_program, env); the doc-hash is not recomputed on recursion.
Layer 4 — analyses (analysis.rs, schema.rs, plan.rs, cfg.rs, ssa.rs)
Optional IRs. None are mandatory for correctness; they exist so advanced callers can specialise programs further or export a readable data-flow graph.
| Module | Produces | Used for |
|---|---|---|
analysis |
Type / Nullness / Cardinality / cost / monotonicity | Optimiser heuristics; CSE via dedup_subprograms |
schema |
Shape inference from sample JSON | Specialise OptField → GetField |
plan |
Logical relational plan IR | Filter push-down, join detection |
cfg |
Basic blocks, edges, dominators, loop headers | Liveness, slot allocator |
ssa |
SSA numbering + phi nodes | Common-subexpression elimination, def-use |
The analyses operate on Program, not Expr — they observe the already-lowered bytecode.
Public API
// One-shot tree-walker.
;
;
// Persistent VM (thread-local in the umbrella; fresh in this crate).
// Lower-level building blocks.
pub use ;
pub use Expr; // typed-phantom wrapper for compile-time expression literals
pub use ;
pub use ;
pub use Graph; // multi-document queries
pub use Engine; // high-level wrapper
Jetro — document-bound convenience
use Jetro;
use json;
let j = new;
let titles = j.collect.unwrap;
assert_eq!;
Jetro::collect uses a thread-local VM so the compile cache accumulates across calls on the same thread.
query — one-shot
let n = query.unwrap;
Uses the tree-walker directly — no compile/resolution caches. Use Jetro or VM for repeated queries.
Custom methods
use ;
;
let mut vm = VMnew;
vm.register;
let r = vm.run_str.unwrap;
assert_eq!;
Graph — multi-document queries
Graph::new({node: Value, …}) merges named documents into a virtual root and evaluates an expression against the combined object. Joins are expressed inside the query (.#join('orders', 'customer_id', 'id')). The jetrodb crate's GraphBucket is a storage-backed orchestrator on top of this primitive.
Performance notes
- Clone cost:
Val::cloneis always O(1) — the cost is one atomic increment. Nothing in the VM copies compound payloads. - Pointer cache hits: A structural program re-run against the same document is a
HashMaplookup plus anArcbump. No traversal. - Compile cache: First call: parse + compile + optimize (~microseconds for short queries). Subsequent:
HashMaplookup. - Peephole fusions:
map(f).sum()runs one pass, no intermediate vec.filter(p).first()early-exits on the first match. - Dead-code elimination via CSE:
dedup_subprogramscanonicalises identical sub-programs across the compiled tree —map(x.price).sum() + map(x.price).avg()shares onemap(x.price)sub-program byArcidentity after this pass. smallvecfor scopes:Env::varskeeps the first 4 let-bindings inline. Real queries rarely exceed this.
Error types
;
;
Both implement std::error::Error + Display. The umbrella jetro crate re-exports these and keeps the same shape. jetrodb extends with Db and Io variants.
License
MIT. See LICENSE.