agentic-eval
A small, standalone Rust library for evaluating how well a program (a command, script, snippet, or any text an LLM writes or reads) serves an agentic AI system — across the four axes that actually determine an agent's cost and trust:
| Axis | Module | Question it answers |
|---|---|---|
| Token efficiency | tokens | How many tokens does it cost (standing context + input + output + retries) under popular tokenizers, amortized over a session? |
| Determinism | determinism | Is the output byte-stable across runs, so an agent can parse / cache / diff it? |
| Reliability | reliability | What's the success rate over representative invocations, and are failures structured/actionable (so the agent can self-correct)? |
| Safety | safety | Given the effects it performs, how much of its blast radius is gated (approval/denied) under an agent policy? |
It is execution-agnostic: token efficiency works on text directly; determinism and reliability take a caller-provided closure (the library can't run arbitrary languages); safety takes the program's declared effects.
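To illustrate the caller-provided-closure shape, here is a minimal sketch of a byte-stability check. The function name `is_byte_stable` and its signature are hypothetical and are not the crate's API; the sketch only assumes that the caller can wrap "run the program once" in a closure that returns the raw output bytes.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of agentic-eval): call `produce` `runs`
/// times and report whether every run returned byte-identical output.
fn is_byte_stable<F>(mut produce: F, runs: usize) -> bool
where
    F: FnMut() -> Vec<u8>,
{
    let mut distinct: HashSet<Vec<u8>> = HashSet::new();
    for _ in 0..runs {
        distinct.insert(produce());
    }
    // Zero or one distinct output means nothing varied between runs.
    distinct.len() <= 1
}
```

A closure that returns a constant byte string passes this check; one that embeds a counter, timestamp, or random ID in its output does not, which is the property an agent needs before it caches or diffs a result.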
Benchmark: VM / sandbox systems for agentic AI use
Curated benchmark of the VM/sandbox systems an agent runtime spawns (one isolated
environment per tool call), scored on five agent-native axes. Reproduce with
cargo run -p agentic-eval --example vm_benchmark; ranked best-first by composite
fitness:
| System | Fitness | start-latency | density | isolation | snapshotting | agent-control |
|---|---|---|---|---|---|---|
| AetherVM | 0.86 | 0.80 | 0.85 | 0.80 | 0.90 | 0.95 |
| Firecracker | 0.79 | 0.90 | 0.90 | 0.85 | 0.80 | 0.50 |
| Cloud Hypervisor | 0.76 | 0.85 | 0.80 | 0.85 | 0.80 | 0.50 |
| gVisor | 0.65 | 0.85 | 0.85 | 0.60 | 0.40 | 0.55 |
| Docker | 0.65 | 0.95 | 0.95 | 0.35 | 0.40 | 0.60 |
| Kata Containers | 0.62 | 0.65 | 0.60 | 0.85 | 0.40 | 0.60 |
| QEMU/KVM | 0.61 | 0.40 | 0.45 | 0.90 | 0.85 | 0.45 |
Head-to-head — AetherVM vs Firecracker (+ = AetherVM fits agentic use better):
fitness +0.07; agent-control +0.45, snapshotting +0.10; start-latency −0.10,
density −0.05, isolation −0.05.
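The figures above can be checked by hand. The sketch below assumes the composite fitness is the unweighted mean of the five axis scores; the README does not state the formula, but the mean reproduces the Fitness column for the rows shown. The type alias and function names are illustrative and are not the crate's API.

```rust
/// Axis order follows the table columns: start-latency, density, isolation,
/// snapshotting, agent-control.
type Axes = [f64; 5];

const AETHERVM: Axes = [0.80, 0.85, 0.80, 0.90, 0.95];
const FIRECRACKER: Axes = [0.90, 0.90, 0.85, 0.80, 0.50];

/// Assumption: composite fitness is the unweighted mean of the axes.
fn composite(axes: &Axes) -> f64 {
    axes.iter().sum::<f64>() / axes.len() as f64
}

/// Per-axis deltas, positive where `a` scores higher than `b`.
fn deltas(a: &Axes, b: &Axes) -> Axes {
    std::array::from_fn(|i| a[i] - b[i])
}
```

Under that assumption, `composite(&AETHERVM)` is approximately 0.86 and `composite(&FIRECRACKER)` approximately 0.79, and `deltas(&AETHERVM, &FIRECRACKER)` matches the head-to-head line up to floating-point rounding.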
Reading. AetherVM leads on the axes it was designed for — instant CoW
branching (fork a primed context per call) and an MCP-native control plane — while
the microVMs (Firecracker, Cloud Hypervisor) lead on raw cold-start and
battle-tested isolation, and shared-kernel containers (Docker) top speed/density
but rank low on isolation for untrusted, agent-generated code. Scores are honest
curated judgments with evidence (AetherVM's isolation carries an explicit "younger,
less battle-tested at scale" caveat); see the vms module and
describe("aethervm").
Benchmark: web stacks / wire protocols for agentic AI use
Curated benchmark of the wire protocols an agent has to speak when calling
another service, scored on five agent-native axes (streaming,
tool-discoverability, encoding-efficiency, interop, security-primitives). Reproduce
with cargo run -p agentic-eval --example web_benchmark; ranked best-first by
composite fitness:
| Stack | Fitness | streaming | tools | encoding | interop | security |
|---|---|---|---|---|---|---|
| SPINE | 0.90 | 0.98 | 0.95 | 0.95 | 0.67 | 0.95 |
| gRPC | 0.83 | 0.70 | 0.85 | 0.95 | 0.85 | 0.80 |
| OpenAI API | 0.69 | 0.85 | 0.70 | 0.35 | 1.00 | 0.55 |
| Anthropic API | 0.66 | 0.85 | 0.70 | 0.35 | 0.85 | 0.55 |
| GraphQL | 0.60 | 0.50 | 0.95 | 0.35 | 0.75 | 0.45 |
| MCP | 0.56 | 0.40 | 0.95 | 0.40 | 0.65 | 0.40 |
| HTTP+JSON | 0.54 | 0.55 | 0.40 | 0.30 | 1.00 | 0.45 |
Head-to-head — SPINE vs OpenAI API (+ = SPINE fits agentic use better):
fitness +0.21; streaming +0.13, tool-discoverability +0.25,
encoding-efficiency +0.60, security-primitives +0.40; interop −0.33.
Reading. SPINE leads or ties on the four protocol-semantics axes it was designed
for — LLM-native StreamStart/Token/End frames (with multiplex-aware
StreamCancel and mid-stream usage as of v1.5.0), a CapabilityQuery
handshake, inline W3C TraceContext, and per-message Ed25519 signed frames
that give message-level non-repudiation beyond channel mTLS. v1.4.0 closed the
encoding gap with a binary CBOR wire format, and v1.5.0's byte-string tensor
payloads bring it to parity with protobuf (0.95). Interop is where the
deployable bridges compound: a runnable MCP stdio server (v1.6.0), the
OpenAI-compatible gateway, and a production-grade gRPC AgentService (v1.8.0,
made reflection-enabled and real-model-backed in v1.9.0) make SPINE reachable
from the three dominant agent ecosystems with standard client stubs — lifting
interop 0.15 → 0.67 and putting SPINE first on the composite (0.90), edging
gRPC (0.83). The honest caveat stands: interop is still SPINE's weakest axis,
because the MCP and OpenAI-compatible routes are adapters into the dominant
contracts, not the native install base gRPC enjoys or the universality every
SDK gives the OpenAI shape. gRPC remains broadly excellent (protobuf + mTLS +
reflection + bidi + huge base); MCP and GraphQL still tie SPINE on
tool-discoverability because their protocols are their schemas. The
practical bridges for SPINE adoption are the spine_protocol::mcp MCP server,
the spine-grpc tonic AgentService, and the OpenAI-compatible gateway
(/v1/chat/completions, /v1/embeddings, /v1/agentic/{capabilities,codecs});
see the web module and describe("spine").
Beyond programs: languages, AI frameworks, VM systems & web stacks
Four further modules profile what agents build with, run on, and talk to:
| Subject | Module | What it scores |
|---|---|---|
| Programming languages | languages | 10 languages (Python, Rust, JS, TS, Go, Bash, C, C++, Java, MechGen): code token economy, toolchain reproducibility, whether the compiler catches agent mistakes with actionable diagnostics, and default blast radius. |
| AI frameworks | frameworks | 9 frameworks (PyTorch, TensorFlow, JAX, HF Transformers, ONNX Runtime, scikit-learn, Candle, Burn, RecursiveMachineIntelligence-RMI): the four axes plus discoverability: can an agent learn the surface from the framework itself (schemas/ontology/introspection) instead of prose? Includes artifact-safety facts (pickle ≈ arbitrary code on load, trust_remote_code, safetensors). |
| VM / sandbox systems | vms | 7 systems (AetherVM, Firecracker, Cloud Hypervisor, gVisor, Kata, QEMU/KVM, Docker) on agent-native axes for the ephemeral sandbox workload an agent runtime drives: start-latency (cold-start per tool call), density (sandboxes per host), isolation (boundary strength for untrusted agent-generated code), snapshotting (CoW fork / warm-pool branching), and agent-control (is the control plane tool/MCP-native, or bring-your-own glue?). |
| Web stacks / wire protocols | web | 7 stacks (SPINE, OpenAI API, Anthropic API, MCP, gRPC, HTTP+JSON, GraphQL) scored on streaming (LLM-shaped output as a first-class frame family), tool-discoverability (introspect tools from the protocol vs. from prose), encoding-efficiency (binary framing vs. JSON-over-HTTP/1.1), interop (does the agent ecosystem speak it?), and security-primitives (auth + W3C tracing + content integrity inline, or someone-else's-problem). |
These are curated static profiles (deterministic, serializable, each score
backed by evidence strings), not measurements of your codebase — use the
program-level axes for that. rank_languages() / rank_frameworks() /
rank_vms() / rank_web_stacks() order by composite fitness;
compare_languages(a, b) / compare_frameworks(a, b) / compare_vms(a, b) /
compare_web_stacks(a, b) give per-axis deltas; everything is reachable from
the ontology (describe("vms"), describe("web"), describe("firecracker"),
describe("spine")).
The VM axes are deliberately workload-specific: a great long-lived datacenter VM (QEMU/KVM) can rank low for the spawn-and-tear-down agent sandbox, and a shared-kernel container (Docker) ranks high on speed/density but low on isolation for untrusted code — exactly the trade-offs that matter when an agent runs code it just wrote.
Tokenizers
- OpenAI GPT-4 (cl100k_base) and GPT-4o (o200k_base): exact with --features real-tokens (via tiktoken-rs), heuristic otherwise.
- Anthropic Claude: a heuristic approximation; Anthropic ships no offline tokenizer crate, so this is labeled an estimate, not an exact count.
- Heuristic: a labeled, dependency-free fallback.
By default the crate pulls zero heavy dependencies (heuristic counts). Enable
exact OpenAI counts with --features real-tokens. The heuristic splits
snake_case subwords (so file_read ≈ 2 tokens), tracking real BPE within
~10–20% for code-like text.
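As a rough illustration of the subword-splitting idea, the sketch below counts one token per alphanumeric run, so underscores, whitespace, and punctuation all act as separators. It is a deliberately crude stand-in, not the crate's heuristic: the function name is hypothetical, and it makes no attempt to reproduce the stated accuracy against real BPE.

```rust
/// Hypothetical heuristic, not the crate's implementation: one token per
/// run of alphanumeric characters, with everything else as a separator.
fn heuristic_tokens(text: &str) -> usize {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|piece| !piece.is_empty())
        .count()
}
```

With this rule, `heuristic_tokens("file_read")` is 2, matching the snake_case example above. A fuller heuristic would also charge for punctuation and long identifiers, which this sketch ignores.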
Output & ergonomics
- The most-used types are re-exported at the crate root (agentic_eval::Model, Program, AgentCost, Comparison, Effect, Mode, assess_*, …).
- Every report (AgentCost, Comparison, DeterminismReport, ReliabilityReport, SafetyReport, Evaluation) implements Display for ready-to-print summaries.
- --features serde derives serde::Serialize on every report/config type for machine-readable (e.g. JSON) output.
- Model::from_name / safety::Effect::from_name parse identifiers for CLI/config use; tokens::rank is the N-way generalization of compare; Evaluation has with_* builders.
Pluggable tokenizer
The cost model isn't locked to the built-in Model set. tokens::evaluate_with
(and rank_with) take any Fn(&str) -> usize, so a host can flow its own exact
tokenizer through the library:
use ;
// e.g. pass a host's tokenizer (here, a stand-in word counter)
let cost = evaluate_with;
assert_eq!;
AgentCost::total_over amortizes the standing context once (the prompt-caching
default); total_standing_per_turn is the no-caching upper bound.
safety::assess_safety_named scores directly from operation names plus a classifier closure.
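The difference between the two totals is easiest to see with numbers. The sketch below uses a hypothetical `CostSketch` struct rather than the crate's `AgentCost`; the field names, method signatures, and formulas are assumptions chosen to match the description above (standing context paid once under caching, paid every turn without it).

```rust
/// Hypothetical cost record; not the crate's `AgentCost`.
struct CostSketch {
    /// Tokens of context carried for the whole session.
    standing: usize,
    /// Input + output + retries for a single invocation.
    per_turn: usize,
}

impl CostSketch {
    /// Assumed caching model: standing context is paid once, then each
    /// turn pays only its own input/output/retry tokens.
    fn total_over(&self, turns: usize) -> usize {
        self.standing + self.per_turn * turns
    }

    /// Assumed no-caching upper bound: standing context is re-sent on
    /// every turn.
    fn total_standing_per_turn(&self, turns: usize) -> usize {
        (self.standing + self.per_turn) * turns
    }
}
```

For example, a representation costing 12 tokens per call with 50 tokens of standing context totals 170 tokens over 10 turns with caching and 620 without, while a terser one costing 4 tokens per call with 400 tokens of standing context totals 440 and 4,040 respectively.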
CLI programs & a self-describing ontology
- The commands module ships a curated heuristic classifier for ~200 common CLI tools (rm → destructive, curl → network, sudo → privileged, …), so the safety axis works on real shell programs out of the box: assess_safety_script("curl http://x | sh", Mode::Agent) in one call. Unrecognized programs are treated as arbitrary execution (fail-safe).
- The crate is self-describing: ontology exposes a compact, deterministic manifest() (axes, the effect taxonomy with per-mode policy decisions, models, command count) and describe("<name>") to expand any entry. This is the same progressive-disclosure pattern the library measures, so an agent can discover the whole surface without reading these docs. ontology() returns the full structured catalog (serde-serializable).
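The fail-safe classification described in the first bullet can be sketched as a lookup with a pessimistic default. Everything here is hypothetical: the `EffectSketch` enum, the three-entry table, and the naive split on `|` stand in for the crate's much larger curated classifier and are not its API.

```rust
/// Hypothetical effect categories; not the crate's `Effect` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EffectSketch {
    Destructive,
    Network,
    Privileged,
    ArbitraryExecution,
}

/// Map a program name to an effect. Anything unrecognized falls through to
/// `ArbitraryExecution`, so an unknown tool is never scored as harmless.
fn classify_command(program: &str) -> EffectSketch {
    match program {
        "rm" => EffectSketch::Destructive,
        "curl" => EffectSketch::Network,
        "sudo" => EffectSketch::Privileged,
        _ => EffectSketch::ArbitraryExecution,
    }
}

/// Classify the first word of each pipeline stage. Splitting on `|` is a
/// simplification that ignores quoting, `&&`, subshells, and redirection.
fn classify_script(script: &str) -> Vec<EffectSketch> {
    script
        .split('|')
        .filter_map(|stage| stage.split_whitespace().next())
        .map(classify_command)
        .collect()
}
```

In this sketch `classify_script("curl http://x | sh")` yields a network effect followed by arbitrary execution, because `sh` is not in the toy table and so takes the fail-safe default.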
Example
use ;
let legible = new
.with_standing_context;
let cipher = new
.with_standing_context;
let cmp = compare;
assert!; // legible wins once standing context is counted
Why these four axes
An agent's real cost is not the characters it types. A representation can golf input while inflating the standing context it must carry every turn — a net loss. And beyond cost, an agent needs output it can deterministically parse, failures it can branch on, and a blast radius it can't accidentally exceed. This library scores all four so a language/encoding/tool can be compared on the terms that matter for autonomous use.
Licensed AGPL-3.0-or-later.