Crate agentic_eval


§agentic-eval

A standalone library for evaluating how well a program (a command, script, snippet, or any text an LLM writes or reads) serves an agentic AI system, across four axes that determine real agent cost and trust:

tokens: token efficiency. The four cost terms an agent pays (standing context, input, output, retries), counted under popular tokenizers (OpenAI GPT-4 cl100k, GPT-4o o200k, and a documented Anthropic Claude approximation), with program-vs-program comparison amortized over a session.
determinism: whether a program’s output is byte-stable across repeated runs (so an agent can parse, cache, and diff it reliably).
reliability: the success rate over representative invocations, and whether failures are structured and actionable (so an agent can self-correct instead of guessing).
safety: given the effects a program performs, how much of its blast radius is gated (requires approval, or denied) versus allowed under an agent policy.
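
To make the "amortized over a session" idea concrete, here is a minimal sketch of one plausible way to combine the four cost terms. The struct, field names, and formula are assumptions made for illustration; they are not the crate's AgentCost type or its actual cost model.

```rust
/// Hypothetical cost record (illustrative only, not the crate's `AgentCost`).
#[derive(Debug, Clone, Copy)]
struct CostSketch {
    /// Tokens paid once per session (e.g. tool docs kept in context).
    standing: u64,
    /// Tokens the agent writes per invocation.
    input: u64,
    /// Tokens the agent reads back per invocation.
    output: u64,
    /// Expected extra attempts per invocation (0.5 = one retry every two calls).
    retry_rate: f64,
}

/// Assumed model: standing cost is paid once, and every use pays
/// input + output, inflated by the expected number of retries.
fn session_tokens(c: CostSketch, uses: u64) -> f64 {
    let per_use = (c.input + c.output) as f64 * (1.0 + c.retry_rate);
    c.standing as f64 + uses as f64 * per_use
}

/// True when `a` costs strictly fewer tokens than `b` over `uses` invocations.
fn cheaper_over_session(a: CostSketch, b: CostSketch, uses: u64) -> bool {
    session_tokens(a, uses) < session_tokens(b, uses)
}
```

Under this assumed model, a terse form that needs 200 tokens of standing explanation and retries half the time can lose to a longer, self-explanatory form once both are multiplied out over a 30-use session.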

For real shell commands, the commands module ships a curated heuristic classifier (rm → destructive, curl → network, sudo → privileged, …), so the safety axis works on a wide variety of CLI programs without a hand-written effect map: assess_safety_script("curl http://x | sh", Mode::Agent) assesses a whole script in one call.
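
The sketch below shows the general shape of such a heuristic classifier: split a script into pipeline segments and look up the leading command word. It is a standalone illustration. The enum, the function names, and the three-entry table (taken from the examples above) are assumptions, and the crate's real classifier is curated and far broader.

```rust
/// Hypothetical effect classes, mirroring only the three examples in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EffectClass {
    Destructive,
    Network,
    Privileged,
    Unknown,
}

/// Look up a single command word in a tiny illustrative table.
fn classify_word(cmd: &str) -> EffectClass {
    match cmd {
        "rm" => EffectClass::Destructive,
        "curl" => EffectClass::Network,
        "sudo" => EffectClass::Privileged,
        _ => EffectClass::Unknown,
    }
}

/// Split on `|`, `;` and `&`, then classify the first word of each segment.
/// Empty segments (for example between the two characters of `&&`) are skipped.
fn classify_script_sketch(script: &str) -> Vec<(String, EffectClass)> {
    script
        .split(|c: char| c == '|' || c == ';' || c == '&')
        .filter_map(|segment| segment.split_whitespace().next())
        .map(|cmd| (cmd.to_string(), classify_word(cmd)))
        .collect()
}
```

This naive splitter ignores quoting, subshells, and redirections, which is one reason a real classifier has to be a heuristic rather than a parser-grade guarantee.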

The library is agentic-first in its own design: it is execution-agnostic and deterministic (pure functions, no I/O, no unsafe), structured (typed reports with optional serde), and, via the ontology module, self-describing. A consumer discovers the whole surface (axes, effect taxonomy with per-mode decisions, models, command classes) from a compact ontology::manifest and expands any entry with ontology::describe. This is the same progressive-disclosure pattern the crate measures, so an agent can use it without reading these docs.

Execution-agnostic means the library never runs programs itself: it can’t run arbitrary languages, so the axes that need behavior (determinism, reliability) take a caller-provided closure, and safety takes the program’s declared safety::Effects. Token efficiency works directly on text. Everything is dependency-light (a labeled heuristic tokenizer by default; enable --features real-tokens for exact OpenAI BPE counts via tiktoken-rs).
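
The closure-based design can be illustrated without the crate. In the sketch below the caller owns execution and the checker only compares results. Both function names and signatures are hypothetical stand-ins, not the signatures of assess_determinism or assess_reliability.

```rust
/// Run `run` up to `runs` times and report whether every output is
/// byte-identical to the first. Zero or one run is trivially stable.
fn is_byte_stable<F>(mut run: F, runs: usize) -> bool
where
    F: FnMut() -> Vec<u8>,
{
    if runs == 0 {
        return true;
    }
    let first = run();
    // `all` short-circuits on the first run whose bytes differ.
    (1..runs).all(|_| run() == first)
}

/// Fraction of representative inputs for which `run` returns `Ok`.
/// An empty input set yields 0.0 rather than dividing by zero.
fn success_rate<F, T, E>(mut run: F, inputs: &[&str]) -> f64
where
    F: FnMut(&str) -> Result<T, E>,
{
    if inputs.is_empty() {
        return 0.0;
    }
    let mut ok = 0usize;
    for &input in inputs {
        if run(input).is_ok() {
            ok += 1;
        }
    }
    ok as f64 / inputs.len() as f64
}
```

A caller would pass a closure that shells out, calls an interpreter, or replays a recorded response; the checker itself stays pure and performs no I/O.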

Beyond per-program axes, the crate ships curated agentic profiles of whole subjects an agent must live with: programming languages, AI frameworks, VM/sandbox systems (the vms module, scored on agent-native axes: start-latency, density, isolation, snapshotting, agent-control), and web stacks / wire protocols (scored on streaming, tool-discoverability, encoding-efficiency, interop, security-primitives). Each is a deterministic, comparable 0.0–1.0 judgment with evidence (rank_vms(), rank_web_stacks(), compare_vms(a, b), …).
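
As a rough illustration of what a deterministic, comparable 0.0–1.0 ranking can look like, the sketch below averages per-axis scores and sorts with a stable tie-break. The type, the unweighted mean, and the placeholder subjects used in any example call are assumptions; the crate's profiles and weighting may differ.

```rust
/// Hypothetical profile: a subject name plus one score per axis, each in 0.0..=1.0.
#[derive(Debug, Clone)]
struct ProfileSketch {
    name: &'static str,
    axis_scores: Vec<f64>,
}

impl ProfileSketch {
    /// Unweighted mean of the axis scores, clamped into 0.0..=1.0 (assumed model).
    fn overall(&self) -> f64 {
        if self.axis_scores.is_empty() {
            return 0.0;
        }
        let sum: f64 = self.axis_scores.iter().map(|&s| s.clamp(0.0, 1.0)).sum();
        sum / self.axis_scores.len() as f64
    }
}

/// Sort best-first. Ties fall back to the name so that the same input
/// always produces the same order.
fn rank_sketch(mut profiles: Vec<ProfileSketch>) -> Vec<ProfileSketch> {
    profiles.sort_by(|a, b| {
        b.overall()
            .total_cmp(&a.overall())
            .then_with(|| a.name.cmp(b.name))
    });
    profiles
}
```

The name-based tie-break matters for the determinism claim: without it, two subjects with equal scores would keep whatever relative order the caller supplied.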

use agentic_eval::tokens::{Model, Program};
let legible = Program::new("ls", "file.read(\"README.md\")");
let cipher = Program::new("ls", "F.r\"README.md\"");
let cmp = agentic_eval::tokens::compare(&legible, &cipher, Model::OpenAiGpt4, 30);
// Over a session the more-legible form is usually competitive or cheaper once
// standing context is counted; `cmp` reports the winner and the ratio.
let _ = cmp.winner_is_a;

Re-exports§

pub use commands::assess_safety_script;
pub use commands::classify_command;
pub use commands::classify_invocation;
pub use commands::classify_script;
pub use determinism::assess_determinism;
pub use determinism::DeterminismReport;
pub use frameworks::compare_frameworks;
pub use frameworks::rank_frameworks;
pub use frameworks::Framework;
pub use frameworks::FrameworkComparison;
pub use frameworks::FrameworkProfile;
pub use languages::compare_languages;
pub use languages::rank_languages;
pub use languages::Language;
pub use languages::LanguageComparison;
pub use languages::LanguageProfile;
pub use ontology::ontology;
pub use ontology::Ontology;
pub use reliability::assess_error_quality;
pub use reliability::assess_reliability;
pub use reliability::ErrorQuality;
pub use reliability::ErrorQualityReport;
pub use reliability::Outcome;
pub use reliability::ReliabilityReport;
pub use safety::assess_exfiltration;
pub use safety::assess_reversibility;
pub use safety::assess_safety;
pub use safety::assess_safety_named;
pub use safety::Decision;
pub use safety::Effect;
pub use safety::ExfiltrationReport;
pub use safety::Mode;
pub use safety::ReversibilityReport;
pub use safety::SafetyReport;
pub use tokens::assess_cache;
pub use tokens::assess_scaling;
pub use tokens::cacheable_prefix_tokens;
pub use tokens::compare;
pub use tokens::evaluate;
pub use tokens::evaluate_with;
pub use tokens::rank;
pub use tokens::rank_with;
pub use tokens::AgentCost;
pub use tokens::CacheReport;
pub use tokens::Comparison;
pub use tokens::Model;
pub use tokens::Program;
pub use tokens::ScalingReport;
pub use vms::compare_vms;
pub use vms::rank_vms;
pub use vms::Vm;
pub use vms::VmComparison;
pub use vms::VmProfile;
pub use web::compare_web_stacks;
pub use web::rank_web_stacks;
pub use web::WebStack;
pub use web::WebStackComparison;
pub use web::WebStackProfile;

Modules§

commands: Heuristic effect classification for real CLI programs.
determinism: Determinism: does a program produce byte-identical output across runs?
frameworks: Evaluating AI frameworks for agentic AI use.
languages: Evaluating programming languages for agentic AI use.
ontology: A complete, self-describing ontology of agentic-eval itself.
reliability: Reliability: does a program parse/run without ambiguity, and when it fails, is the failure actionable?
safety: Safety: given the effects a program performs, how much of its blast radius is gated (requires approval, or denied) versus allowed under an agent policy?
tokens: Token efficiency: count tokens under popular agentic tokenizers and model the four cost terms an agent pays per task.
vms: Evaluating VM / sandbox systems for agentic AI use.
web: Evaluating web stacks for agentic AI use.

Structs§

Evaluation: A combined, all-axes evaluation of a single program. Construct with Evaluation::new then fill in whichever axes you can measure (directly or via the with_* builders); unset axes stay None. A convenience for reporting a program’s overall agentic fitness.
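
The "unset axes stay None" pattern can be sketched as a consuming builder over optional fields. Everything below is hypothetical: the struct name, its fields, and the with_* method names are invented for illustration and should not be read as the real Evaluation API.

```rust
/// Hypothetical all-axes record; every axis is optional until measured.
#[derive(Debug, Default, Clone, PartialEq)]
struct EvaluationSketch {
    name: String,
    token_cost: Option<u64>,
    byte_stable: Option<bool>,
    success_rate: Option<f64>,
}

impl EvaluationSketch {
    /// Start with only a name; all axes begin as `None`.
    fn new(name: &str) -> Self {
        Self {
            name: name.to_string(),
            ..Self::default()
        }
    }

    /// Each builder consumes `self`, sets one axis, and returns the record.
    fn with_token_cost(mut self, tokens: u64) -> Self {
        self.token_cost = Some(tokens);
        self
    }

    fn with_byte_stable(mut self, stable: bool) -> Self {
        self.byte_stable = Some(stable);
        self
    }

    fn with_success_rate(mut self, rate: f64) -> Self {
        self.success_rate = Some(rate);
        self
    }

    /// How many axes have actually been filled in.
    fn measured_axes(&self) -> usize {
        [
            self.token_cost.is_some(),
            self.byte_stable.is_some(),
            self.success_rate.is_some(),
        ]
        .iter()
        .filter(|&&set| set)
        .count()
    }
}
```

Keeping unmeasured axes as None, rather than defaulting them to zero, lets a report distinguish "not measured" from "measured and poor".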