Experiment engine for adaptive agent behavior testing and hyperparameter tuning.
zeph-experiments provides the infrastructure for running autonomous A/B experiments
over Zeph’s tunable parameters (temperature, top-p, retrieval depth, etc.) using an
LLM-as-judge evaluation loop.
§Architecture
The crate is organized around three main concerns:
- Benchmark datasets — BenchmarkSet / BenchmarkCase: TOML-loaded prompt/reference pairs that define what to measure.
- Evaluation — Evaluator: runs cases against a subject model and scores responses with a judge model, producing an EvalReport.
- Search strategies — VariationGenerator implementations (GridStep, Random, Neighborhood) that decide which parameter to try next.
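Since benchmark datasets are loaded from TOML, a benchmark file might look like the following sketch. The field names mirror BenchmarkCase (prompt, context, reference, tags); the [[cases]] table name is an assumption about the file layout, not confirmed by the crate docs:

```toml
# Hypothetical benchmark file; field names mirror BenchmarkCase.
# The [[cases]] array-of-tables name is an assumption.
[[cases]]
prompt = "What is the capital of France?"
reference = "Paris"
tags = ["geography"]

[[cases]]
prompt = "Summarize the given context in one sentence."
context = "Zeph is an adaptive agent framework."
```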
ExperimentEngine ties all three together: it evaluates a baseline, iterates over
variations produced by the generator, accepts improvements (greedy hill-climbing), and
optionally persists results to SQLite.
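The greedy accept rule can be sketched in isolation. The types below are illustrative only: the real engine compares EvalReports for ConfigSnapshot arms, not bare f64 scores:

```rust
// Minimal sketch of greedy hill-climbing over scored candidates.
// Plain f64 scores stand in for the engine's EvalReport/ConfigSnapshot pairs.
fn hill_climb(baseline_score: f64, candidate_scores: &[f64]) -> f64 {
    let mut best = baseline_score;
    for &score in candidate_scores {
        // Accept a variation only if it strictly improves on the
        // current best; otherwise keep the incumbent configuration.
        if score > best {
            best = score;
        }
    }
    best
}

fn main() {
    // Baseline 6.5; only 7.2 and 8.4 are accepted as improvements.
    let best = hill_climb(6.5, &[6.1, 7.2, 7.0, 8.4]);
    println!("{best}"); // prints 8.4
}
```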
§Quick Start
use std::sync::Arc;
use zeph_experiments::{
    BenchmarkCase, BenchmarkSet, ConfigSnapshot, EvalError, Evaluator, ExperimentEngine,
    GridStep, SearchSpace,
};
// ExperimentConfig is not among the crate re-exports; it is assumed here
// to live in the engine module alongside ExperimentEngine.
use zeph_experiments::engine::ExperimentConfig;
let benchmark = BenchmarkSet {
cases: vec![BenchmarkCase {
prompt: "What is the capital of France?".into(),
context: None,
reference: Some("Paris".into()),
tags: None,
}],
};
// Use a mock provider for the judge in tests; real providers in production.
// (AnyProvider / MockProvider are not part of zeph-experiments; import them
// from Zeph's provider crate as exposed in your build.)
let judge = Arc::new(AnyProvider::Mock(MockProvider::with_responses(vec![
r#"{"score": 9.0, "reason": "correct"}"#.into(),
])));
let subject = Arc::new(AnyProvider::Mock(MockProvider::with_responses(vec![
"Paris".into(),
])));
let evaluator = Evaluator::new(Arc::clone(&judge), benchmark, 100_000)?;
let generator = Box::new(GridStep::new(SearchSpace::default()));
let baseline = ConfigSnapshot::default();
let config = ExperimentConfig::default();
let mut engine = ExperimentEngine::new(evaluator, generator, subject, baseline, config, None);
let report = engine.run().await?;
println!("baseline={:.2} final={:.2}", report.baseline_score, report.final_score);
Re-exports§
pub use benchmark::BenchmarkCase;
pub use benchmark::BenchmarkSet;
pub use engine::ExperimentEngine;
pub use engine::ExperimentSessionReport;
pub use error::EvalError;
pub use evaluator::CaseScore;
pub use evaluator::EvalReport;
pub use evaluator::Evaluator;
pub use evaluator::JudgeOutput;
pub use generator::VariationGenerator;
pub use grid::GridStep;
pub use neighborhood::Neighborhood;
pub use random::Random;
pub use search_space::ParameterRange;
pub use search_space::SearchSpace;
pub use snapshot::ConfigSnapshot;
pub use types::ExperimentResult;
pub use types::ExperimentSource;
pub use types::ParameterKind;
pub use types::Variation;
pub use types::VariationValue;
Modules§
- benchmark
- Benchmark dataset types and TOML loading.
- engine
- Experiment engine — core async loop for autonomous parameter tuning.
- error
- Error types for the experiments module.
- evaluator
- LLM-as-judge evaluator for benchmark datasets.
- generator
- VariationGenerator trait for parameter variation strategies.
- grid
- Systematic grid sweep strategy for parameter variation.
- neighborhood
- Neighborhood perturbation strategy for parameter variation.
- random
- Uniform random sampling strategy for parameter variation.
- search_space
- Search space definition for parameter variation experiments.
- snapshot
- Config snapshot for a single experiment arm.
- types
- Core experiment types (ExperimentResult, ExperimentSource, ParameterKind, Variation, VariationValue).
Structs§
- GenerationOverrides
- Partial LLM generation parameter overrides for experiment variation injection.
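The partial-override idea behind GenerationOverrides can be sketched as Option fields merged over a base parameter set. The field names below are assumptions for illustration, not the crate's actual struct definition:

```rust
// Illustrative sketch of partial-override merging; the field set is
// hypothetical and not taken from the real GenerationOverrides struct.
#[derive(Clone, Debug, PartialEq)]
struct GenParams {
    temperature: f64,
    top_p: f64,
}

#[derive(Default)]
struct Overrides {
    temperature: Option<f64>,
    top_p: Option<f64>,
}

impl Overrides {
    // Apply only the fields that are set, leaving the rest of the
    // base configuration untouched.
    fn apply(&self, base: &GenParams) -> GenParams {
        GenParams {
            temperature: self.temperature.unwrap_or(base.temperature),
            top_p: self.top_p.unwrap_or(base.top_p),
        }
    }
}

fn main() {
    let base = GenParams { temperature: 0.7, top_p: 0.9 };
    // Override only the temperature; top_p falls through from the base.
    let ov = Overrides { temperature: Some(1.0), ..Default::default() };
    let merged = ov.apply(&base);
    println!("{merged:?}");
}
```

This is the usual shape for variation injection: a generator proposes a single-parameter override, and the engine merges it over the baseline snapshot before evaluating the arm.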