Crate skilltest_core

skilltest-core — the library that powers the skilltest CLI and, through it, the language SDKs and test-framework packages.

The flow is: load a Config and one or more TestCases, build a Provider (the boundary to oneharness or a model), and hand them to a Runner. The Runner turns each case into a conversation, scores the transcript with natural-language Evals, and returns a Report. The report’s JSON form is the stable contract the language SDKs consume.
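The shape of that flow can be sketched with stand-in types. Everything below (`SketchProvider`, `SketchCase`, `SketchCaseResult`, `run_cases`, `UppercaseProvider`) is hypothetical and written only for illustration. None of it is the crate's real API, and the real Runner also handles configuration, multi-turn conversations, and JSON reporting, which this single-turn sketch leaves out.

```rust
// Illustrative stand-ins only; these are NOT skilltest_core's real definitions.

/// One message in a conversation (stand-in for the crate's transcript model).
#[derive(Debug, Clone, PartialEq)]
pub struct SketchMessage {
    pub role: &'static str,
    pub text: String,
}

/// Stand-in for the provider boundary: it runs the skill and judges the transcript.
pub trait SketchProvider {
    fn run_skill(&self, input: &str) -> String;
    fn judge(&self, criterion: &str, transcript: &[SketchMessage]) -> bool;
}

/// A test case: the initial input plus plain-English criteria.
pub struct SketchCase {
    pub name: String,
    pub input: String,
    pub criteria: Vec<String>,
}

/// Per-case result, analogous in spirit to one entry in a report.
#[derive(Debug, PartialEq)]
pub struct SketchCaseResult {
    pub name: String,
    pub passed: bool,
    pub transcript: Vec<SketchMessage>,
}

/// Drive each case into a single-turn conversation, then score it.
pub fn run_cases(provider: &dyn SketchProvider, cases: &[SketchCase]) -> Vec<SketchCaseResult> {
    cases
        .iter()
        .map(|case| {
            let reply = provider.run_skill(&case.input);
            let transcript = vec![
                SketchMessage { role: "user", text: case.input.clone() },
                SketchMessage { role: "assistant", text: reply },
            ];
            // A case passes only if every criterion is judged true.
            let passed = case.criteria.iter().all(|c| provider.judge(c, &transcript));
            SketchCaseResult { name: case.name.clone(), passed, transcript }
        })
        .collect()
}

/// A deterministic fake provider, useful for exercising the loop without a model.
pub struct UppercaseProvider;

impl SketchProvider for UppercaseProvider {
    fn run_skill(&self, input: &str) -> String {
        input.to_uppercase()
    }

    fn judge(&self, criterion: &str, transcript: &[SketchMessage]) -> bool {
        // Toy judge: the criterion is treated as a substring the last reply must contain.
        transcript.last().map_or(false, |m| m.text.contains(criterion))
    }
}
```

Calling `run_cases(&UppercaseProvider, &cases)` yields one result per case. Keeping the provider behind a trait is what lets a fake like this stand in for a model during tests.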

Everything that crosses a trust boundary — config files, test-case YAML, skill frontmatter, and every provider response — is parsed into a typed model before use.
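As a minimal sketch of what "parsed into a typed model before use" can look like, the example below converts a raw judge reply into a typed value and rejects anything unexpected. `SketchVerdict` and `ParseVerdictError` are hypothetical names, and the accepted text format (`true`, `false`, or a finite number) is an assumption made for the example. The crate's actual parsing code and wire format are not shown on this page.

```rust
use std::str::FromStr;

/// Hypothetical typed form of a judge reply: a boolean assertion or a numeric score.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SketchVerdict {
    Bool(bool),
    Score(f64),
}

/// Hypothetical error carrying a description of the rejected input.
#[derive(Debug, PartialEq)]
pub struct ParseVerdictError(pub String);

impl FromStr for SketchVerdict {
    type Err = ParseVerdictError;

    fn from_str(raw: &str) -> Result<Self, Self::Err> {
        let s = raw.trim();
        // Accept the two boolean literals exactly.
        match s {
            "true" => return Ok(SketchVerdict::Bool(true)),
            "false" => return Ok(SketchVerdict::Bool(false)),
            _ => {}
        }
        // Otherwise require a finite number; f64 parsing alone would accept "NaN" and "inf".
        match s.parse::<f64>() {
            Ok(n) if n.is_finite() => Ok(SketchVerdict::Score(n)),
            _ => Err(ParseVerdictError(format!("unrecognized verdict: {s:?}"))),
        }
    }
}
```

With this in place, `"true".parse::<SketchVerdict>()` succeeds while `"maybe".parse::<SketchVerdict>()` returns an error, so later code only ever sees a well-formed value.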

Re-exports

pub use config::CommandConfig;
pub use config::Config;
pub use config::OneharnessConfig;
pub use config::Overrides;
pub use config::ProviderConfig;
pub use conversation::Message;
pub use conversation::Role;
pub use conversation::Transcript;
pub use error::Error;
pub use error::Result;
pub use eval::Comparator;
pub use eval::Eval;
pub use eval::EvalDetail;
pub use eval::EvalOutcome;
pub use eval::JudgeValue;
pub use exit::ExitCode;
pub use provider::supports_resume;
pub use provider::AssistantTurn;
pub use provider::CommandProvider;
pub use provider::JudgeKind;
pub use provider::JudgeQuery;
pub use provider::JudgeVerdict;
pub use provider::OneharnessProvider;
pub use provider::Provider;
pub use provider::SkillRef;
pub use provider::Usage;
pub use provider::UserTurn;
pub use report::CaseRun;
pub use report::Report;
pub use report::Summary;
pub use report::ValidationFinding;
pub use report::ValidationReport;
pub use runner::Runner;
pub use skill::load_skill;
pub use skill::validate_path;
pub use skill::validate_skill;
pub use skill::Finding;
pub use skill::SkillDefinition;
pub use testcase::discover_cases;
pub use testcase::SimulatedUser;
pub use testcase::TestCase;

Modules

config: Configuration: which provider runs skills, the default platforms and models a run fans out across, and the model used for natural-language evals.
conversation: The conversation model: the transcript that flows between the runner and the provider, and is ultimately handed to evals.
error: Error type for the core library. The mapping from these errors to process exit codes lives in the CLI (see exit.rs for the documented codes).
eval: Natural-language evaluations. An eval poses a criterion in plain English and asks the provider’s judge to score the transcript: a boolean assertion, or a numeric score compared against a threshold.
exit: Documented process exit codes. Defined in the core so they are part of the library’s contract; the CLI maps crate::Error onto them.
provider: The provider boundary. skilltest never talks to a model directly; a Provider runs the skill, plays the simulated user, and judges the transcript.
report: Run results and the JSON report. The serialized shape here is the stable contract the language SDKs parse. These types are the source of truth: their JSON Schemas (via skilltest schema, goldens in schemas/) are what the SDK contract tests compare their Pydantic/Zod models against.
runner: The runner: orchestrates a test case into a conversation, drives the provider across turns, scores the transcript with evals, and fans out over the configured platform × model matrix.
skill: Skill definitions: a directory containing a SKILL.md with YAML frontmatter and a Markdown body. This module loads them and validates them, powering the skilltest validate subcommand.
testcase: Test cases: the YAML a user writes to describe one test of a skill — the initial data to hand the skill, an optional simulated user for multi-turn runs, and the evals that decide pass/fail.
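
Two ideas from the eval and runner summaries above can be illustrated in a few lines: deciding pass/fail from a boolean assertion or a score compared against a threshold, and fanning a run out over a platform × model matrix. This is a sketch under assumptions. `SketchComparator`, `SketchJudgeValue`, `eval_passes`, and `fan_out` are hypothetical names, and the variants and semantics of the crate's real `Comparator` and `JudgeValue` types may differ.

```rust
// Hypothetical names throughout; the real `Comparator`, `JudgeValue`, and `Runner` may differ.

/// How a numeric score is compared against its threshold.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SketchComparator {
    AtLeast,
    AtMost,
    GreaterThan,
    LessThan,
}

/// What a judge returns: a boolean assertion or a numeric score.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SketchJudgeValue {
    Bool(bool),
    Score(f64),
}

/// Decide pass/fail. A boolean passes when true; a score passes when it
/// satisfies the comparator against the threshold. A score with no threshold
/// is treated as a failure here (an assumption of this sketch).
pub fn eval_passes(value: SketchJudgeValue, threshold: Option<(SketchComparator, f64)>) -> bool {
    match (value, threshold) {
        (SketchJudgeValue::Bool(b), _) => b,
        (SketchJudgeValue::Score(s), Some((cmp, t))) => match cmp {
            SketchComparator::AtLeast => s >= t,
            SketchComparator::AtMost => s <= t,
            SketchComparator::GreaterThan => s > t,
            SketchComparator::LessThan => s < t,
        },
        (SketchJudgeValue::Score(_), None) => false,
    }
}

/// Build the platform × model matrix as a list of pairs, platforms varying slowest.
pub fn fan_out<'a>(platforms: &[&'a str], models: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    platforms
        .iter()
        .flat_map(|&p| models.iter().map(move |&m| (p, m)))
        .collect()
}
```

For two platforms and two models, `fan_out` produces four pairs, and a runner in this style would execute each case once per pair and apply `eval_passes` to every judged value.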