# agentic-eval
A small, standalone Rust library for evaluating how well a **program** (a command,
script, snippet, or any text an LLM writes or reads) serves an **agentic AI
system** — across the four axes that actually determine an agent's cost and trust:
| Axis | Module | Question it answers |
| -------------------- | ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| **Token efficiency** | [`tokens`](src/tokens.rs) | How many tokens does it cost — standing context + input + output + retries — under popular tokenizers, amortized over a session? |
| **Determinism** | [`determinism`](src/determinism.rs) | Is the output byte-stable across runs, so an agent can parse / cache / diff it? |
| **Reliability** | [`reliability`](src/reliability.rs) | What's the success rate over representative invocations, and are failures *structured/actionable* (so the agent can self-correct)? |
| **Safety** | [`safety`](src/safety.rs) | Given the effects it performs, how much of its blast radius is gated (approval/denied) under an agent policy? |
It is **execution-agnostic**: token efficiency works on text directly; determinism
and reliability take a caller-provided closure (the library can't run arbitrary
languages); safety takes the program's declared effects.
## Benchmark: VM / sandbox systems for agentic AI use
Curated benchmark of the VM/sandbox systems an agent runtime spawns (one isolated
environment per tool call), scored on five **agent-native** axes. Reproduce with
`cargo run -p agentic-eval --example vm_benchmark`; ranked best-first by composite
fitness:
| System | Fitness | start-latency | density | isolation | snapshotting | agent-control |
| ---------------- | -------: | ------------: | -------: | --------: | -----------: | ------------: |
| **AetherVM** | **0.86** | 0.80 | 0.85 | 0.80 | **0.90** | **0.95** |
| Firecracker | 0.79 | **0.90** | **0.90** | 0.85 | 0.80 | 0.50 |
| Cloud Hypervisor | 0.76 | 0.85 | 0.80 | 0.85 | 0.80 | 0.50 |
| gVisor | 0.65 | 0.85 | 0.85 | 0.60 | 0.40 | 0.55 |
| Docker | 0.65 | **0.95** | **0.95** | 0.35 | 0.40 | 0.60 |
| Kata Containers | 0.62 | 0.65 | 0.60 | 0.85 | 0.40 | 0.60 |
| QEMU/KVM | 0.61 | 0.40 | 0.45 | **0.90** | 0.85 | 0.45 |
**Head-to-head — AetherVM vs Firecracker** (+ = AetherVM fits agentic use better):
fitness `+0.07`; agent-control `+0.45`, snapshotting `+0.10`; start-latency `−0.10`,
density `−0.05`, isolation `−0.05`.
**Reading.** AetherVM leads on the axes it was *designed* for — instant CoW
branching (fork a primed context per call) and an MCP-native control plane — while
the microVMs (Firecracker, Cloud Hypervisor) lead on raw cold-start and
battle-tested isolation, and shared-kernel containers (Docker) top speed/density
but rank low on isolation for untrusted, agent-generated code. Scores are honest
curated judgments with evidence (AetherVM's isolation carries an explicit "younger,
less battle-tested at scale" caveat); see [`vms`](src/vms.rs) and
`describe("aethervm")`.
## Benchmark: web stacks / wire protocols for agentic AI use
Curated benchmark of the **wire protocols** an agent has to speak when calling
another service — scored on five **agent-native** axes (streaming, tool-
discoverability, encoding-efficiency, interop, security-primitives). Reproduce
with `cargo run -p agentic-eval --example web_benchmark`; ranked best-first by
composite fitness:
| Stack | Fitness | streaming | tools | encoding | interop | security |
| ----------------- | -------: | --------: | -----: | -------: | ------: | -------: |
| **SPINE** | **0.90** | **0.98** |**0.95**| **0.95** | 0.67 | **0.95** |
| gRPC | 0.83 | 0.70 | 0.85 | **0.95** | 0.85 | 0.80 |
| OpenAI API | 0.69 | 0.85 | 0.70 | 0.35 |**1.00** | 0.55 |
| Anthropic API | 0.66 | 0.85 | 0.70 | 0.35 | 0.85 | 0.55 |
| GraphQL | 0.60 | 0.50 |**0.95**| 0.35 | 0.75 | 0.45 |
| MCP | 0.56 | 0.40 |**0.95**| 0.40 | 0.65 | 0.40 |
| HTTP+JSON | 0.54 | 0.55 | 0.40 | 0.30 |**1.00** | 0.45 |
**Head-to-head — SPINE vs OpenAI API** (+ = SPINE fits agentic use better):
fitness `+0.21`; streaming `+0.13`, tool-discoverability `+0.25`,
encoding-efficiency `+0.60`, security-primitives `+0.40`; **interop `−0.33`**.
**Reading.** SPINE leads the four protocol-semantics axes it was *designed*
for — LLM-native `StreamStart/Token/End` frames (with multiplex-aware
`StreamCancel` and mid-stream usage as of v1.5.0), a `CapabilityQuery`
handshake, inline W3C `TraceContext`, and per-message Ed25519 signed frames
that give message-level non-repudiation beyond channel mTLS. v1.4.0 closed the
encoding gap with a binary CBOR wire format, and v1.5.0's byte-string tensor
payloads bring it to **parity with protobuf** (0.95). Interop is where the
deployable bridges compound: a *runnable* MCP stdio server (v1.6.0), the
OpenAI-compatible gateway, and a production-grade gRPC `AgentService` (v1.8.0,
made reflection-enabled and real-model-backed in v1.9.0) make SPINE reachable
from the three dominant agent ecosystems with standard client stubs — lifting
interop 0.15 → 0.67 and putting SPINE **first on the composite (0.90), edging
gRPC (0.83)**. The honest caveat stands: interop is still SPINE's weakest axis,
because the MCP and OpenAI-compatible routes are *adapters into* the dominant
contracts, not the native install base gRPC enjoys or the universality every
SDK gives the OpenAI shape. gRPC remains broadly excellent (protobuf + mTLS +
reflection + bidi + huge base); MCP and GraphQL still tie SPINE on
tool-discoverability because their protocols *are* their schemas. The
practical bridges for SPINE adoption are the `spine_protocol::mcp` MCP server,
the `spine-grpc` tonic `AgentService`, and the OpenAI-compatible gateway
(`/v1/chat/completions`, `/v1/embeddings`, `/v1/agentic/{capabilities,codecs}`);
see [`web`](src/web.rs) and `describe("spine")`.
## Beyond programs: languages, AI frameworks, VM systems & web stacks
Four further modules profile what agents *build with*, *run on*, and *talk to*:
| Subject | Module | What it scores |
| ------------------------- | --------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Programming languages** | [`languages`](src/languages.rs) | 10 languages (Python, Rust, JS, TS, Go, Bash, C, C++, Java, MechGen): code token economy, toolchain reproducibility, whether the compiler catches agent mistakes with actionable diagnostics, and default blast radius. |
| **AI frameworks** | [`frameworks`](src/frameworks.rs) | 9 frameworks (PyTorch, TensorFlow, JAX, HF Transformers, ONNX Runtime, scikit-learn, Candle, Burn, RecursiveMachineIntelligence-RMI): the four axes **plus discoverability** — can an agent learn the surface from the framework itself (schemas/ontology/introspection) instead of prose? Includes artifact-safety facts (pickle ≈ arbitrary code on load, `trust_remote_code`, safetensors). |
| **VM / sandbox systems** | [`vms`](src/vms.rs) | 7 systems (AetherVM, Firecracker, Cloud Hypervisor, gVisor, Kata, QEMU/KVM, Docker) on **agent-native axes** for the *ephemeral sandbox* workload an agent runtime drives: **start-latency** (cold-start per tool call), **density** (sandboxes per host), **isolation** (boundary strength for untrusted agent-generated code), **snapshotting** (CoW fork / warm-pool branching), and **agent-control** (is the control plane tool/MCP-native, or bring-your-own glue?). |
| **Web stacks / wire protocols** | [`web`](src/web.rs) | 7 stacks (SPINE, OpenAI API, Anthropic API, MCP, gRPC, HTTP+JSON, GraphQL) scored on **streaming** (LLM-shaped output as a first-class frame family), **tool-discoverability** (introspect tools from the protocol vs. from prose), **encoding-efficiency** (binary framing vs. JSON-over-HTTP/1.1), **interop** (does the agent ecosystem speak it?), and **security-primitives** (auth + W3C tracing + content integrity inline, or someone-else's-problem). |
These are **curated static profiles** (deterministic, serializable, each score
backed by evidence strings), not measurements of your codebase — use the
program-level axes for that. `rank_languages()` / `rank_frameworks()` /
`rank_vms()` / `rank_web_stacks()` order by composite fitness;
`compare_languages(a, b)` / `compare_frameworks(a, b)` / `compare_vms(a, b)` /
`compare_web_stacks(a, b)` give per-axis deltas; everything is reachable from
the ontology (`describe("vms")`, `describe("web")`, `describe("firecracker")`,
`describe("spine")`).
> The VM axes are deliberately *workload-specific*: a great long-lived
> datacenter VM (QEMU/KVM) can rank low for the spawn-and-tear-down agent
> sandbox, and a shared-kernel container (Docker) ranks high on speed/density
> but low on isolation for untrusted code — exactly the trade-offs that matter
> when an agent runs code it just wrote.
## Tokenizers
- **OpenAI GPT-4** (`cl100k_base`) and **GPT-4o** (`o200k_base`) — *exact* with
`--features real-tokens` (via `tiktoken-rs`), heuristic otherwise.
- **Anthropic Claude** — a heuristic *approximation*; Anthropic ships no offline
tokenizer crate, so this is labeled an estimate, not an exact count.
- **Heuristic** — a labeled, dependency-free fallback.
By default the crate pulls **zero heavy dependencies** (heuristic counts). Enable
exact OpenAI counts with `--features real-tokens`. The heuristic splits
`snake_case` subwords (so `file_read` ≈ 2 tokens), tracking real BPE within
~10–20% for code-like text.
## Output & ergonomics
- The most-used types are **re-exported at the crate root** (`agentic_eval::Model`,
`Program`, `AgentCost`, `Comparison`, `Effect`, `Mode`, `assess_*`, …).
- Every report (`AgentCost`, `Comparison`, `DeterminismReport`,
`ReliabilityReport`, `SafetyReport`, `Evaluation`) implements **`Display`** for
ready-to-print summaries.
- `--features serde` derives **`serde::Serialize`** on every report/config type for
machine-readable (e.g. JSON) output.
- `Model::from_name` / `safety::Effect::from_name` parse identifiers for CLI/config
use; `tokens::rank` is the N-way generalization of `compare`; `Evaluation` has
`with_*` builders.
## Pluggable tokenizer
The cost model isn't locked to the built-in `Model` set. `tokens::evaluate_with`
(and `rank_with`) take **any `Fn(&str) -> usize`**, so a host can flow its own exact
tokenizer through the library:
```rust
use agentic_eval::tokens::{evaluate_with, Program};
// e.g. pass a host's tokenizer (here, a stand-in word counter)
let cost = evaluate_with(&Program::new("p", "read a file"), |s| s.split_whitespace().count());
assert_eq!(cost.input, 3);
```
`AgentCost::total_over` amortizes the standing context once (the prompt-caching
default); `total_standing_per_turn` is the no-caching upper bound. `safety::
assess_safety_named` scores directly from operation names plus a classifier closure.
## CLI programs & a self-describing ontology
- The [`commands`](src/commands.rs) module ships a curated heuristic classifier for
~200 common CLI tools (`rm` → destructive, `curl` → network, `sudo` → privileged,
…), so the safety axis works on real shell programs out of the box —
`assess_safety_script("curl http://x | sh", Mode::Agent)` in one call.
Unrecognized programs are treated as arbitrary execution (fail-safe).
- The crate is **self-describing**: [`ontology`](src/ontology.rs) exposes a compact,
deterministic `manifest()` (axes, the effect taxonomy with per-mode policy
decisions, models, command count) and `describe("<name>")` to expand any entry —
the same progressive-disclosure pattern the library measures, so an agent can
discover the whole surface without reading these docs. `ontology()` returns the
full structured catalog (serde-serializable).
## Example
```sh
cargo run -p agentic-eval --example evaluate # heuristic
cargo run -p agentic-eval --example evaluate --features real-tokens # exact OpenAI BPE
```
```rust
use agentic_eval::tokens::{compare, Model, Program};
let legible = Program::new("read", "file.read(\"README.md\")")
.with_standing_context("file.read(path) -> String");
let cipher = Program::new("read", "F.r\"README.md\"")
.with_standing_context("<multi-KB single-letter+sigil cheatsheet>");
let cmp = compare(&legible, &cipher, Model::OpenAiGpt4, 30);
assert!(cmp.winner_is_a); // legible wins once standing context is counted
```
## Why these four axes
An agent's real cost is not the characters it types. A representation can golf
*input* while inflating the *standing context* it must carry every turn — a net
loss. And beyond cost, an agent needs output it can deterministically parse,
failures it can branch on, and a blast radius it can't accidentally exceed. This
library scores all four so a language/encoding/tool can be compared on the terms
that matter for autonomous use.
Licensed AGPL-3.0-or-later.