agentic-eval

A small, standalone Rust library for evaluating how well a program (a command, script, snippet, or any text an LLM writes or reads) serves an agentic AI system — across the four axes that actually determine an agent's cost and trust:

Axis	Module	Question it answers
Token efficiency	`tokens`	How many tokens does it cost — standing context + input + output + retries — under popular tokenizers, amortized over a session?
Determinism	`determinism`	Is the output byte-stable across runs, so an agent can parse / cache / diff it?
Reliability	`reliability`	What's the success rate over representative invocations, and are failures structured/actionable (so the agent can self-correct)?
Safety	`safety`	Given the effects it performs, how much of its blast radius is gated (requires approval or is denied) under an agent policy?

It is execution-agnostic: token efficiency works on text directly; determinism and reliability take a caller-provided closure (the library can't run arbitrary languages); safety takes the program's declared effects.
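To make the closure-based contract concrete, here is a minimal standalone sketch of what a determinism check and a success-rate check over a caller-provided closure might look like. It does not use the crate: `distinct_outputs` and `success_rate` are hypothetical helpers invented for illustration, and the crate's own reports likely carry more detail (for example, whether failures are structured).

```rust
use std::collections::HashSet;

/// Hypothetical helper (not the crate's API): run `f` `runs` times and
/// return how many distinct byte sequences it produced.
/// Exactly one distinct output means the program is byte-stable.
fn distinct_outputs<F>(runs: usize, mut f: F) -> usize
where
    F: FnMut() -> Vec<u8>,
{
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    for _ in 0..runs {
        seen.insert(f());
    }
    seen.len()
}

/// Hypothetical helper (not the crate's API): fraction of representative
/// invocations for which the closure returned `Ok`.
fn success_rate<T, E, F>(inputs: &[&str], mut f: F) -> f64
where
    F: FnMut(&str) -> Result<T, E>,
{
    if inputs.is_empty() {
        return 0.0;
    }
    let mut ok = 0usize;
    for &input in inputs {
        if f(input).is_ok() {
            ok += 1;
        }
    }
    ok as f64 / inputs.len() as f64
}

fn main() {
    // A pure function yields the same bytes on every run.
    let stable = distinct_outputs(5, || b"status=ok\n".to_vec());

    // A run counter leaking into the output breaks byte-stability.
    let mut n = 0u32;
    let unstable = distinct_outputs(5, || {
        n += 1;
        format!("run {}\n", n).into_bytes()
    });
    println!("stable: {} distinct, unstable: {} distinct", stable, unstable);

    // Three of the four inputs parse as an unsigned integer.
    let rate = success_rate(&["1", "2", "x", "4"], |s| s.parse::<u32>());
    println!("success rate: {:.2}", rate);
}
```

The caller owns execution in both helpers, which is the point of the design: the closure can shell out, call an interpreter, or replay a recorded transcript, and the check itself stays language-neutral.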

Benchmark: VM / sandbox systems for agentic AI use

Curated benchmark of the VM/sandbox systems an agent runtime spawns (one isolated environment per tool call), scored on five agent-native axes. Reproduce with cargo run -p agentic-eval --example vm_benchmark; ranked best-first by composite fitness:

System	Fitness	start-latency	density	isolation	snapshotting	agent-control
AetherVM	0.86	0.80	0.85	0.80	0.90	0.95
Firecracker	0.79	0.90	0.90	0.85	0.80	0.50
Cloud Hypervisor	0.76	0.85	0.80	0.85	0.80	0.50
gVisor	0.65	0.85	0.85	0.60	0.40	0.55
Docker	0.65	0.95	0.95	0.35	0.40	0.60
Kata Containers	0.62	0.65	0.60	0.85	0.40	0.60
QEMU/KVM	0.61	0.40	0.45	0.90	0.85	0.45

Head-to-head — AetherVM vs Firecracker (+ = AetherVM fits agentic use better): fitness +0.07; agent-control +0.45, snapshotting +0.10; start-latency −0.10, density −0.05, isolation −0.05.
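The fitness column above is consistent with an unweighted mean of the five axis scores (AetherVM: (0.80 + 0.85 + 0.80 + 0.90 + 0.95) / 5 = 0.86). That is an inference from the published numbers, not a documented formula, so treat the sketch below as one plausible reading. The `composite` and `deltas` functions are hypothetical and standalone; the scores are copied from the table.

```rust
/// Axis order used throughout this sketch.
const AXES: [&str; 5] = [
    "start-latency",
    "density",
    "isolation",
    "snapshotting",
    "agent-control",
];

/// Unweighted mean of the axis scores.
/// Assumption: inferred from the table, not confirmed by the crate.
fn composite(scores: &[f64; 5]) -> f64 {
    scores.iter().sum::<f64>() / scores.len() as f64
}

/// Per-axis deltas `a - b`; a positive value means `a` scores higher.
fn deltas(a: &[f64; 5], b: &[f64; 5]) -> [f64; 5] {
    let mut out = [0.0; 5];
    for i in 0..5 {
        out[i] = a[i] - b[i];
    }
    out
}

fn main() {
    // Scores copied from the benchmark table.
    let aethervm = [0.80, 0.85, 0.80, 0.90, 0.95];
    let firecracker = [0.90, 0.90, 0.85, 0.80, 0.50];

    println!(
        "fitness: {:.2} vs {:.2}",
        composite(&aethervm),
        composite(&firecracker)
    );
    let d = deltas(&aethervm, &firecracker);
    for (axis, delta) in AXES.iter().zip(d.iter()) {
        println!("{}: {:+.2}", axis, delta);
    }
}
```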

Reading. AetherVM leads on the axes it was designed for: instant CoW branching (fork a primed context per call) and an MCP-native control plane. The microVMs (Firecracker, Cloud Hypervisor) beat it on raw cold-start and battle-tested isolation, and shared-kernel containers (Docker) top speed/density but rank low on isolation for untrusted, agent-generated code. Scores are curated judgments backed by evidence (AetherVM's isolation carries an explicit "younger, less battle-tested at scale" caveat); see vms and describe("aethervm").

Benchmark: web stacks / wire protocols for agentic AI use

Curated benchmark of the wire protocols an agent has to speak when calling another service, scored on five agent-native axes (streaming, tool-discoverability, encoding-efficiency, interop, security-primitives). Reproduce with cargo run -p agentic-eval --example web_benchmark; ranked best-first by composite fitness:

Stack	Fitness	streaming	tools	encoding	interop	security
SPINE	0.90	0.98	0.95	0.95	0.67	0.95
gRPC	0.83	0.70	0.85	0.95	0.85	0.80
OpenAI API	0.69	0.85	0.70	0.35	1.00	0.55
Anthropic API	0.66	0.85	0.70	0.35	0.85	0.55
GraphQL	0.60	0.50	0.95	0.35	0.75	0.45
MCP	0.56	0.40	0.95	0.40	0.65	0.40
HTTP+JSON	0.54	0.55	0.40	0.30	1.00	0.45

Head-to-head — SPINE vs OpenAI API (+ = SPINE fits agentic use better): fitness +0.21; streaming +0.13, tool-discoverability +0.25, encoding-efficiency +0.60, security-primitives +0.40; interop −0.33.

Reading. SPINE leads or ties on the four protocol-semantics axes it was designed for: LLM-native StreamStart/Token/End frames (with multiplex-aware StreamCancel and mid-stream usage as of v1.5.0), a CapabilityQuery handshake, inline W3C TraceContext, and per-message Ed25519 signed frames that give message-level non-repudiation beyond channel mTLS. v1.4.0 closed the encoding gap with a binary CBOR wire format, and v1.5.0's byte-string tensor payloads bring it to parity with protobuf (0.95).

Interop is where the deployable bridges compound: a runnable MCP stdio server (v1.6.0), the OpenAI-compatible gateway, and a production-grade gRPC AgentService (v1.8.0, made reflection-enabled and real-model-backed in v1.9.0) make SPINE reachable from the three dominant agent ecosystems with standard client stubs. Together they lift interop from 0.15 to 0.67 and put SPINE first on the composite (0.90), ahead of gRPC (0.83). The caveat stands: interop is still SPINE's weakest axis, because the MCP and OpenAI-compatible routes are adapters into the dominant contracts, not the native install base gRPC enjoys or the universality every SDK gives the OpenAI shape.

gRPC remains broadly excellent (protobuf + mTLS + reflection + bidi + huge base); MCP and GraphQL still tie SPINE on tool-discoverability because their protocols are their schemas. The practical bridges for SPINE adoption are the spine_protocol::mcp MCP server, the spine-grpc tonic AgentService, and the OpenAI-compatible gateway (/v1/chat/completions, /v1/embeddings, /v1/agentic/{capabilities,codecs}); see web and describe("spine").

Beyond programs: languages, AI frameworks, VM systems & web stacks

Four further modules profile what agents build with, run on, and talk to:

Subject	Module	What it scores
Programming languages	`languages`	10 languages (Python, Rust, JS, TS, Go, Bash, C, C++, Java, MechGen): code token economy, toolchain reproducibility, whether the compiler catches agent mistakes with actionable diagnostics, and default blast radius.
AI frameworks	`frameworks`	9 frameworks (PyTorch, TensorFlow, JAX, HF Transformers, ONNX Runtime, scikit-learn, Candle, Burn, RecursiveMachineIntelligence-RMI): the four axes plus discoverability — can an agent learn the surface from the framework itself (schemas/ontology/introspection) instead of prose? Includes artifact-safety facts (pickle ≈ arbitrary code on load, `trust_remote_code`, safetensors).
VM / sandbox systems	`vms`	7 systems (AetherVM, Firecracker, Cloud Hypervisor, gVisor, Kata, QEMU/KVM, Docker) on agent-native axes for the ephemeral sandbox workload an agent runtime drives: start-latency (cold-start per tool call), density (sandboxes per host), isolation (boundary strength for untrusted agent-generated code), snapshotting (CoW fork / warm-pool branching), and agent-control (is the control plane tool/MCP-native, or bring-your-own glue?).
Web stacks / wire protocols	`web`	7 stacks (SPINE, OpenAI API, Anthropic API, MCP, gRPC, HTTP+JSON, GraphQL) scored on streaming (LLM-shaped output as a first-class frame family), tool-discoverability (introspect tools from the protocol vs. from prose), encoding-efficiency (binary framing vs. JSON-over-HTTP/1.1), interop (does the agent ecosystem speak it?), and security-primitives (auth + W3C tracing + content integrity inline, or someone-else's-problem).

These are curated static profiles (deterministic, serializable, each score backed by evidence strings), not measurements of your codebase — use the program-level axes for that. rank_languages() / rank_frameworks() / rank_vms() / rank_web_stacks() order by composite fitness; compare_languages(a, b) / compare_frameworks(a, b) / compare_vms(a, b) / compare_web_stacks(a, b) give per-axis deltas; everything is reachable from the ontology (describe("vms"), describe("web"), describe("firecracker"), describe("spine")).

The VM axes are deliberately workload-specific: a great long-lived datacenter VM (QEMU/KVM) can rank low for the spawn-and-tear-down agent sandbox, and a shared-kernel container (Docker) ranks high on speed/density but low on isolation for untrusted code — exactly the trade-offs that matter when an agent runs code it just wrote.

Tokenizers

OpenAI GPT-4 (cl100k_base) and GPT-4o (o200k_base) — exact with --features real-tokens (via tiktoken-rs), heuristic otherwise.
Anthropic Claude — a heuristic approximation; Anthropic ships no offline tokenizer crate, so this is labeled an estimate, not an exact count.
Heuristic — a labeled, dependency-free fallback.

By default the crate pulls zero heavy dependencies (heuristic counts). Enable exact OpenAI counts with --features real-tokens. The heuristic splits snake_case subwords (so file_read ≈ 2 tokens), tracking real BPE within ~10–20% for code-like text.
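As a rough illustration of how a dependency-free counter can split snake_case subwords, here is a standalone sketch. It is not the crate's heuristic: `heuristic_tokens` is a hypothetical function, and its rule (one token per alphanumeric run, one per punctuation symbol, underscores and whitespace as separators only) is an assumption chosen so that `file_read` counts as 2. It makes no claim about accuracy against real BPE.

```rust
/// Hypothetical heuristic token counter (not the crate's implementation).
///
/// Rules assumed for this sketch:
/// - each maximal run of alphanumeric characters is one token;
/// - underscores and whitespace only separate runs;
/// - every other symbol counts as one token.
fn heuristic_tokens(text: &str) -> usize {
    let mut count = 0;
    let mut in_word = false;
    for ch in text.chars() {
        if ch.is_alphanumeric() {
            if !in_word {
                // First character of a new alphanumeric run.
                count += 1;
                in_word = true;
            }
        } else {
            in_word = false;
            if ch != '_' && !ch.is_whitespace() {
                // Punctuation and sigils are counted individually.
                count += 1;
            }
        }
    }
    count
}

fn main() {
    for sample in ["file_read", "read a file", "file.read(\"README.md\")"] {
        println!("{:?} -> {} tokens", sample, heuristic_tokens(sample));
    }
}
```

A counter with this shape fits the pluggable `Fn(&str) -> usize` form the library accepts for host tokenizers, so the same slot can hold an estimate or an exact tokenizer.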

Output & ergonomics

The most-used types are re-exported at the crate root (agentic_eval::Model, Program, AgentCost, Comparison, Effect, Mode, assess_*, …).
Every report (AgentCost, Comparison, DeterminismReport, ReliabilityReport, SafetyReport, Evaluation) implements Display for ready-to-print summaries.
--features serde derives serde::Serialize on every report/config type for machine-readable (e.g. JSON) output.
Model::from_name / safety::Effect::from_name parse identifiers for CLI/config use; tokens::rank is the N-way generalization of compare; Evaluation has with_* builders.

Pluggable tokenizer

The cost model isn't locked to the built-in Model set. tokens::evaluate_with (and rank_with) take any Fn(&str) -> usize, so a host can flow its own exact tokenizer through the library:

use agentic_eval::tokens::{evaluate_with, Program};
// e.g. pass a host's tokenizer (here, a stand-in word counter)
let cost = evaluate_with(&Program::new("p", "read a file"), |s| s.split_whitespace().count());
assert_eq!(cost.input, 3);

AgentCost::total_over amortizes the standing context once (the prompt-caching default); total_standing_per_turn is the no-caching upper bound. safety::assess_safety_named scores directly from operation names plus a classifier closure.
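The difference between the two totals is simple arithmetic, sketched below under stated assumptions. `SketchCost` and its methods are hypothetical stand-ins, not the crate's `AgentCost`; retries are ignored, and the token counts are made-up illustrative inputs, not measurements.

```rust
/// Hypothetical per-turn cost record (not the crate's `AgentCost`).
#[derive(Debug, Clone, Copy)]
struct SketchCost {
    /// Tokens of context the agent must carry to use the program at all.
    standing: usize,
    /// Tokens the agent writes per turn.
    input: usize,
    /// Tokens the agent reads back per turn.
    output: usize,
}

impl SketchCost {
    /// Standing context is paid once, then each turn pays input + output.
    /// Assumption: models the prompt-caching case.
    fn cached_total(&self, turns: usize) -> usize {
        self.standing + turns * (self.input + self.output)
    }

    /// Standing context is re-sent every turn: the no-caching upper bound.
    fn uncached_total(&self, turns: usize) -> usize {
        turns * (self.standing + self.input + self.output)
    }
}

fn main() {
    // Illustrative numbers only: a legible call with a tiny signature to
    // carry, versus a terse call that needs a large cheatsheet in context.
    let legible = SketchCost { standing: 8, input: 6, output: 0 };
    let cipher = SketchCost { standing: 2000, input: 3, output: 0 };

    for (name, cost) in [("legible", legible), ("cipher", cipher)] {
        println!(
            "{}: cached {} / uncached {} tokens over 30 turns",
            name,
            cost.cached_total(30),
            cost.uncached_total(30)
        );
    }
}
```

With these inputs the terse form saves 3 tokens per turn but never recovers the standing context it requires, which is the effect the session-amortized totals are meant to expose.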

CLI programs & a self-describing ontology

The commands module ships a curated heuristic classifier for ~200 common CLI tools (rm → destructive, curl → network, sudo → privileged, …), so the safety axis works on real shell programs out of the box — assess_safety_script("curl http://x | sh", Mode::Agent) in one call. Unrecognized programs are treated as arbitrary execution (fail-safe).
The crate is self-describing: ontology exposes a compact, deterministic manifest() (axes, the effect taxonomy with per-mode policy decisions, models, command count) and describe("<name>") to expand any entry — the same progressive-disclosure pattern the library measures, so an agent can discover the whole surface without reading these docs. ontology() returns the full structured catalog (serde-serializable).
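The fail-safe classification described above for CLI programs can be sketched in a few lines. This is a standalone illustration, not the `commands` module: `SketchEffect`, `classify`, and `script_effects` are hypothetical names, the read-only entries are an assumption, and the pipeline parsing (split on `|`, take the first word) is far cruder than a real shell parser.

```rust
/// Hypothetical effect taxonomy for this sketch (not the crate's `Effect`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchEffect {
    ReadOnly,
    Destructive,
    Network,
    Privileged,
    ArbitraryExec,
}

/// Map a command name to an effect. Anything unrecognized is treated as
/// arbitrary execution, so an unknown tool is never assumed harmless.
fn classify(command: &str) -> SketchEffect {
    match command {
        // Assumption: these read-only entries are illustrative.
        "ls" | "cat" | "grep" => SketchEffect::ReadOnly,
        "rm" => SketchEffect::Destructive,
        "curl" => SketchEffect::Network,
        "sudo" => SketchEffect::Privileged,
        _ => SketchEffect::ArbitraryExec,
    }
}

/// Classify each stage of a simple pipeline by its first word.
/// Simplification: ignores quoting, `&&`, `;`, subshells, and redirects.
fn script_effects(script: &str) -> Vec<SketchEffect> {
    script
        .split('|')
        .filter_map(|stage| stage.split_whitespace().next())
        .map(classify)
        .collect()
}

fn main() {
    let effects = script_effects("curl http://x | sh");
    println!("{:?}", effects);
}
```

In this sketch `sh` lands in the fallback arm only because the table is tiny; a curated classifier would more likely list shells explicitly, with the same outcome.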

Example

cargo run -p agentic-eval --example evaluate                    # heuristic
cargo run -p agentic-eval --example evaluate --features real-tokens   # exact OpenAI BPE

use agentic_eval::tokens::{compare, Model, Program};

let legible = Program::new("read", "file.read(\"README.md\")")
    .with_standing_context("file.read(path) -> String");
let cipher  = Program::new("read", "F.r\"README.md\"")
    .with_standing_context("<multi-KB single-letter+sigil cheatsheet>");

let cmp = compare(&legible, &cipher, Model::OpenAiGpt4, 30);
assert!(cmp.winner_is_a); // legible wins once standing context is counted

Why these four axes

An agent's real cost is not the characters it types. A representation can golf input while inflating the standing context it must carry every turn — a net loss. And beyond cost, an agent needs output it can deterministically parse, failures it can branch on, and a blast radius it can't accidentally exceed. This library scores all four so a language/encoding/tool can be compared on the terms that matter for autonomous use.

Licensed AGPL-3.0-or-later.

agentic-eval 0.14.2