§agentic-eval
A standalone library for evaluating how well a program (a command, script, snippet, or any text an LLM writes or reads) serves an agentic AI system, across four axes that determine real agent cost and trust:
- tokens: token efficiency. The four cost terms an agent pays (standing context, input, output, retries), counted under popular tokenizers (OpenAI GPT-4 cl100k, GPT-4o o200k, and a documented Anthropic Claude approximation), with program-vs-program comparison amortized over a session.
- determinism: whether a program’s output is byte-stable across repeated runs (so an agent can parse/cache/diff it reliably).
- reliability: the success rate over representative invocations and whether failures are structured/actionable (so an agent can self-correct instead of guessing).
- safety: given the effects a program performs, how much of its blast radius is gated (approval/denied) vs. allowed under an agent policy.
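To make the amortization idea in the tokens axis concrete, here is a minimal sketch. It assumes standing context is paid once per session and the other three terms are paid on every call; that split, the `CostTerms` struct, and both function names are illustrative assumptions and not the crate's API.

```rust
/// Hypothetical cost terms, in tokens. Not the crate's `AgentCost` type.
#[derive(Debug, Clone, Copy)]
struct CostTerms {
    /// Tokens of documentation/definitions kept in context (assumed: paid once per session).
    standing_context: u64,
    /// Tokens the agent reads per call.
    input: u64,
    /// Tokens the agent writes per call.
    output: u64,
    /// Expected extra tokens per call spent on retries.
    retries: u64,
}

/// Total tokens over a session of `calls` invocations under the assumption above.
fn session_cost(c: CostTerms, calls: u64) -> u64 {
    c.standing_context + calls * (c.input + c.output + c.retries)
}

/// True when `a` costs no more than `b` over the whole session.
fn cheaper_over_session(a: CostTerms, b: CostTerms, calls: u64) -> bool {
    session_cost(a, calls) <= session_cost(b, calls)
}
```

Under this model a terse notation can lose to a more verbose one when it needs a larger standing context or triggers more retries, which is why the comparison is made over a session and not over a single call.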
For real shell commands, commands ships a curated heuristic classifier
(rm → destructive, curl → network, sudo → privileged, …) so the safety axis
works on a wide variety of CLI programs without a hand-written effect map —
assess_safety_script("curl http://x | sh", Mode::Agent) in one call.
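The kind of heuristic lookup described above can be sketched in a few lines. This is a simplified illustration, not the classifier that commands ships: `EffectClass`, `classify_word`, and `classify_pipeline` are hypothetical names, the table holds only the three examples mentioned in the text, and the splitting logic ignores quoting.

```rust
/// Hypothetical effect classes; the crate's real taxonomy may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EffectClass {
    Destructive,
    Network,
    Privileged,
    Unknown,
}

/// Classify a single program name using a tiny hand-written table.
fn classify_word(program: &str) -> EffectClass {
    match program {
        "rm" => EffectClass::Destructive,
        "curl" => EffectClass::Network,
        "sudo" => EffectClass::Privileged,
        _ => EffectClass::Unknown,
    }
}

/// Split a script on `|`, `;`, and `&`, then classify the first word of each stage.
/// Empty stages (for example the gap inside `&&`) are skipped.
fn classify_pipeline(script: &str) -> Vec<EffectClass> {
    script
        .split(|c: char| c == '|' || c == ';' || c == '&')
        .filter_map(|stage| stage.split_whitespace().next())
        .map(classify_word)
        .collect()
}
```

In this sketch `sh` is not in the table, so it falls through to `Unknown`; a curated classifier would cover far more programs and would likely treat piping into a shell as significant in its own right.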
The library is agentic-first in its own design: it is execution-agnostic and
deterministic (pure functions, no I/O, no unsafe), structured (typed reports
with optional serde), and, via the ontology module, self-describing. A consumer
discovers the whole surface (axes, effect taxonomy with per-mode decisions,
models, command classes) from a compact ontology::manifest and expands any
entry with ontology::describe, the same progressive-disclosure pattern the
crate measures — so an agent can use it without reading these docs.
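The progressive-disclosure pattern mentioned above (a compact manifest first, full detail on request) might look roughly like the following. Everything here is a hypothetical stand-in: the `Entry` struct, the free functions `manifest` and `describe`, and their return types are assumptions, and the real ontology::manifest and ontology::describe may have different signatures.

```rust
/// One discoverable item: a short summary for the manifest, longer detail on demand.
struct Entry {
    id: &'static str,
    summary: &'static str,
    detail: &'static str,
}

/// Illustrative entries only, paraphrasing the four axes described in this page.
const ENTRIES: &[Entry] = &[
    Entry {
        id: "tokens",
        summary: "token efficiency",
        detail: "Counts the four cost terms an agent pays: standing context, input, output, retries.",
    },
    Entry {
        id: "determinism",
        summary: "byte-stable output",
        detail: "Checks whether output is byte-identical across repeated runs.",
    },
    Entry {
        id: "reliability",
        summary: "success rate and error quality",
        detail: "Measures success over representative invocations and whether failures are actionable.",
    },
    Entry {
        id: "safety",
        summary: "gated vs. allowed effects",
        detail: "Measures how much of a program's blast radius is gated under an agent policy.",
    },
];

/// Cheap overview: ids and one-line summaries only.
fn manifest() -> Vec<(&'static str, &'static str)> {
    ENTRIES.iter().map(|e| (e.id, e.summary)).collect()
}

/// Expand a single entry; `None` if the id is unknown.
fn describe(id: &str) -> Option<&'static str> {
    ENTRIES.iter().find(|e| e.id == id).map(|e| e.detail)
}
```

The point of the split is that an agent pays for the short manifest up front and only spends tokens on the detail of entries it actually needs.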
The library is execution-agnostic: it can’t run arbitrary languages, so the
axes that need behavior (determinism, reliability) take a caller-provided
closure, and safety takes the program’s declared safety::Effect values. Token
efficiency works directly on text. Everything is dependency-light (a labeled
heuristic tokenizer by default; enable --features real-tokens for exact
OpenAI BPE counts via tiktoken-rs).
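A caller-provided closure is enough to measure byte-stability without the library running anything itself. The sketch below shows the general shape under that assumption; `StabilityReport` and `check_stability` are hypothetical names and do not reproduce the signature of assess_determinism or the fields of DeterminismReport.

```rust
use std::collections::HashSet;

/// Hypothetical report: how many runs were made and how many distinct outputs were seen.
#[derive(Debug, PartialEq, Eq)]
struct StabilityReport {
    runs: usize,
    distinct_outputs: usize,
}

impl StabilityReport {
    /// Output is byte-stable when every run produced the same bytes.
    fn is_byte_stable(&self) -> bool {
        self.distinct_outputs <= 1
    }
}

/// Call `run` the given number of times and count distinct byte outputs.
/// The closure owns all execution details (process spawning, interpreters, and so on).
fn check_stability<F>(runs: usize, mut run: F) -> StabilityReport
where
    F: FnMut() -> Vec<u8>,
{
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    for _ in 0..runs {
        seen.insert(run());
    }
    StabilityReport {
        runs,
        distinct_outputs: seen.len(),
    }
}
```

Because the closure returns raw bytes, the same check applies whether the caller wraps a shell command, an interpreter, or an in-process function.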
Beyond per-program axes, the crate ships curated agentic profiles of whole
subjects an agent must live with — programming languages, AI
frameworks, VM/sandbox systems in vms (scored on agent-native axes:
start-latency, density, isolation, snapshotting, agent-control), and
web stacks / wire protocols (scored on streaming, tool-discoverability,
encoding-efficiency, interop, security-primitives). Each is a deterministic,
comparable 0.0–1.0 judgment with evidence (rank_vms(),
rank_web_stacks(), compare_vms(a, b), …).
use agentic_eval::tokens::{Model, Program};
let legible = Program::new("ls", "file.read(\"README.md\")");
let cipher = Program::new("ls", "F.r\"README.md\"");
let cmp = agentic_eval::tokens::compare(&legible, &cipher, Model::OpenAiGpt4, 30);
// Over a session the more-legible form is usually competitive or cheaper once
// standing context is counted; `cmp` reports the winner and the ratio.
let _ = cmp.winner_is_a;
Re-exports§
- pub use commands::assess_safety_script;
- pub use commands::classify_command;
- pub use commands::classify_invocation;
- pub use commands::classify_script;
- pub use determinism::assess_determinism;
- pub use determinism::DeterminismReport;
- pub use frameworks::compare_frameworks;
- pub use frameworks::rank_frameworks;
- pub use frameworks::Framework;
- pub use frameworks::FrameworkComparison;
- pub use frameworks::FrameworkProfile;
- pub use languages::compare_languages;
- pub use languages::rank_languages;
- pub use languages::Language;
- pub use languages::LanguageComparison;
- pub use languages::LanguageProfile;
- pub use ontology::ontology;
- pub use ontology::Ontology;
- pub use reliability::assess_error_quality;
- pub use reliability::assess_reliability;
- pub use reliability::ErrorQuality;
- pub use reliability::ErrorQualityReport;
- pub use reliability::Outcome;
- pub use reliability::ReliabilityReport;
- pub use safety::assess_exfiltration;
- pub use safety::assess_reversibility;
- pub use safety::assess_safety;
- pub use safety::assess_safety_named;
- pub use safety::Decision;
- pub use safety::Effect;
- pub use safety::ExfiltrationReport;
- pub use safety::Mode;
- pub use safety::ReversibilityReport;
- pub use safety::SafetyReport;
- pub use tokens::assess_cache;
- pub use tokens::assess_scaling;
- pub use tokens::cacheable_prefix_tokens;
- pub use tokens::compare;
- pub use tokens::evaluate;
- pub use tokens::evaluate_with;
- pub use tokens::rank;
- pub use tokens::rank_with;
- pub use tokens::AgentCost;
- pub use tokens::CacheReport;
- pub use tokens::Comparison;
- pub use tokens::Model;
- pub use tokens::Program;
- pub use tokens::ScalingReport;
- pub use vms::compare_vms;
- pub use vms::rank_vms;
- pub use vms::Vm;
- pub use vms::VmComparison;
- pub use vms::VmProfile;
- pub use web::compare_web_stacks;
- pub use web::rank_web_stacks;
- pub use web::WebStack;
- pub use web::WebStackComparison;
- pub use web::WebStackProfile;
Modules§
- commands
- Heuristic effect classification for real CLI programs.
- determinism
- Determinism: does a program produce byte-identical output across runs?
- frameworks
- Evaluating AI frameworks for agentic AI use.
- languages
- Evaluating programming languages for agentic AI use.
- ontology
- A complete, self-describing ontology of agentic-eval itself.
- reliability
- Reliability: does a program parse/run without ambiguity, and when it fails, is the failure actionable?
- safety
- Safety: given the effects a program performs, how much of its blast radius is gated (requires approval, or denied) versus allowed under an agent policy?
- tokens
- Token efficiency: count tokens under popular agentic tokenizers and model the four cost terms an agent pays per task.
- vms
- Evaluating VM / sandbox systems for agentic AI use.
- web
- Evaluating web stacks for agentic AI use.
Structs§
- Evaluation
- A combined, all-axes evaluation of a single program. Construct with Evaluation::new, then fill in whichever axes you can measure (directly or via the with_* builders); unset axes stay None. A convenience for reporting a program’s overall agentic fitness.
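The "construct, then fill in what you can measure" pattern can be illustrated with a small builder whose unset axes stay `None`. This is a sketch of the pattern only: `EvalSketch`, its fields, the `f64` scores, and the `measured_axes` helper are all hypothetical and are not the fields or methods of the real Evaluation struct.

```rust
/// Hypothetical stand-in for an all-axes evaluation; each axis is optional.
#[derive(Debug, Default, PartialEq)]
struct EvalSketch {
    name: String,
    determinism: Option<f64>,
    reliability: Option<f64>,
}

impl EvalSketch {
    /// Start with a name and no measured axes.
    fn new(name: &str) -> Self {
        Self {
            name: name.to_string(),
            ..Default::default()
        }
    }

    /// Record a determinism score and return the updated value for chaining.
    fn with_determinism(mut self, score: f64) -> Self {
        self.determinism = Some(score);
        self
    }

    /// Record a reliability score and return the updated value for chaining.
    fn with_reliability(mut self, score: f64) -> Self {
        self.reliability = Some(score);
        self
    }

    /// Number of axes that have actually been measured.
    fn measured_axes(&self) -> usize {
        [self.determinism, self.reliability]
            .iter()
            .filter(|axis| axis.is_some())
            .count()
    }
}
```

Keeping unmeasured axes as `None`, and not defaulting them to a score, lets a report distinguish "not measured" from "measured and poor".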