Crate zeph_experiments

Experiment engine for adaptive agent behavior testing and hyperparameter tuning.

zeph-experiments provides the infrastructure for running autonomous A/B experiments over Zeph’s tunable parameters (temperature, top-p, retrieval depth, etc.) using an LLM-as-judge evaluation loop.

§Architecture

The crate is organized around three main concerns:

  1. Benchmark datasets (BenchmarkSet / BenchmarkCase): TOML-loaded prompt/reference pairs that define what to measure.
  2. Evaluation (Evaluator): runs cases against a subject model, scores the responses with a judge model, and produces an EvalReport.
  3. Search strategies (VariationGenerator implementations: GridStep, Random, Neighborhood): decide which parameter variation to try next.
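Since benchmark datasets are TOML-loaded, a dataset file mirrors the BenchmarkCase fields shown in the Quick Start (prompt, context, reference, tags). A minimal sketch of such a file — note the `[[cases]]` array-of-tables key is an assumption inferred from the field names; see the benchmark module for the exact schema:

```toml
# Hypothetical benchmark file — field names mirror BenchmarkCase;
# the [[cases]] array-of-tables key is assumed, not confirmed.
[[cases]]
prompt = "What is the capital of France?"
reference = "Paris"
tags = ["geography", "smoke-test"]

[[cases]]
prompt = "Summarize the given context in one sentence."
context = "Zeph is an adaptive agent framework."
tags = ["summarization"]
```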

ExperimentEngine ties all three together: it evaluates a baseline, iterates over variations produced by the generator, accepts improvements (greedy hill-climbing), and optionally persists results to SQLite.
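The accept rule above is plain greedy hill-climbing: a variation is kept only if it strictly beats the current best score. A self-contained sketch of that loop, with a toy `score` function standing in for a full Evaluator pass (the names `score` and `hill_climb` are illustrative, not the engine's actual API):

```rust
// Illustrative greedy hill-climbing over a single parameter (temperature).
// `score` stands in for a full Evaluator pass; the real engine scores each
// candidate with the LLM judge and persists accepted results to SQLite.
fn score(temperature: f64) -> f64 {
    // Toy objective with a peak at temperature = 0.7.
    1.0 - (temperature - 0.7).abs()
}

fn hill_climb(start: f64, candidates: &[f64]) -> (f64, f64) {
    let mut best_param = start;
    let mut best_score = score(start); // evaluate the baseline first
    for &candidate in candidates {
        let s = score(candidate);
        if s > best_score {
            // Accept only strict improvements (greedy).
            best_param = candidate;
            best_score = s;
        }
    }
    (best_param, best_score)
}

fn main() {
    let (param, s) = hill_climb(0.2, &[0.4, 0.6, 0.8, 1.0]);
    println!("best temperature = {param}, score = {s:.2}");
}
```

The real engine does the same thing asynchronously, drawing candidates from the VariationGenerator instead of a fixed slice.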

§Quick Start

use std::sync::Arc;
use zeph_experiments::{
    BenchmarkCase, BenchmarkSet, ConfigSnapshot, EvalError, Evaluator, ExperimentConfig,
    ExperimentEngine, GridStep, SearchSpace,
};

let benchmark = BenchmarkSet {
    cases: vec![BenchmarkCase {
        prompt: "What is the capital of France?".into(),
        context: None,
        reference: Some("Paris".into()),
        tags: None,
    }],
};

// AnyProvider / MockProvider come from Zeph's provider layer (import not shown).
// Use a mock provider for the judge in tests; real providers in production.
let judge = Arc::new(AnyProvider::Mock(MockProvider::with_responses(vec![
    r#"{"score": 9.0, "reason": "correct"}"#.into(),
])));
let subject = Arc::new(AnyProvider::Mock(MockProvider::with_responses(vec![
    "Paris".into(),
])));

let evaluator = Evaluator::new(Arc::clone(&judge), benchmark, 100_000)?;
let generator = Box::new(GridStep::new(SearchSpace::default()));
let baseline = ConfigSnapshot::default();
let config = ExperimentConfig::default();

let mut engine = ExperimentEngine::new(evaluator, generator, subject, baseline, config, None);
let report = engine.run().await?;
println!("baseline={:.2} final={:.2}", report.baseline_score, report.final_score);

Re-exports§

pub use benchmark::BenchmarkCase;
pub use benchmark::BenchmarkSet;
pub use engine::ExperimentEngine;
pub use engine::ExperimentSessionReport;
pub use error::EvalError;
pub use evaluator::CaseScore;
pub use evaluator::EvalReport;
pub use evaluator::Evaluator;
pub use evaluator::JudgeOutput;
pub use generator::VariationGenerator;
pub use grid::GridStep;
pub use neighborhood::Neighborhood;
pub use random::Random;
pub use search_space::ParameterRange;
pub use search_space::SearchSpace;
pub use snapshot::ConfigSnapshot;
pub use types::ExperimentResult;
pub use types::ExperimentSource;
pub use types::ParameterKind;
pub use types::Variation;
pub use types::VariationValue;

Modules§

benchmark
Benchmark dataset types and TOML loading.
engine
Experiment engine — core async loop for autonomous parameter tuning.
error
Error types for the experiments module.
evaluator
LLM-as-judge evaluator for benchmark datasets.
generator
VariationGenerator trait for parameter variation strategies.
grid
Systematic grid sweep strategy for parameter variation.
neighborhood
Neighborhood perturbation strategy for parameter variation.
random
Uniform random sampling strategy for parameter variation.
search_space
Search space definition for parameter variation experiments.
snapshot
Config snapshot for a single experiment arm.
types
Core types for experiment results and parameter variations.

Structs§

GenerationOverrides
Partial LLM generation parameter overrides for experiment variation injection.