Experiment engine for adaptive agent behavior testing and hyperparameter tuning.
zeph-experiments provides the infrastructure for running autonomous A/B experiments
over Zeph’s tunable parameters (temperature, top-p, retrieval depth, etc.) using an
LLM-as-judge evaluation loop.
§Architecture
The crate is organized around three main concerns:
- Benchmark datasets — BenchmarkSet / BenchmarkCase: TOML-loaded prompt/reference pairs that define what to measure.
- Evaluation — Evaluator: runs cases against a subject model and scores responses with a judge model, producing an EvalReport.
- Search strategies — VariationGenerator implementations (GridStep, Random, Neighborhood) that decide which parameter to try next.
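Since benchmark datasets are loaded from TOML, a benchmark file might look like the following sketch. The field names mirror BenchmarkCase (prompt, context, reference, tags); the [[cases]] table name is an assumption about the file layout, not confirmed by the crate docs:

```toml
# Hypothetical benchmark file; field names mirror BenchmarkCase.
# The [[cases]] array-of-tables name is an assumption.
[[cases]]
prompt = "What is the capital of France?"
reference = "Paris"
tags = ["geography"]

[[cases]]
prompt = "Summarize the given context in one sentence."
context = "Zeph is an adaptive agent framework."
```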
ExperimentEngine ties all three together: it evaluates a baseline, iterates over
variations produced by the generator, accepts improvements (greedy hill-climbing), and
optionally persists results to SQLite.
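The greedy accept rule can be sketched in isolation. The types below are illustrative only: the real engine compares EvalReports for ConfigSnapshot arms, not bare f64 scores:

```rust
// Minimal sketch of greedy hill-climbing over scored candidates.
// Plain f64 scores stand in for the engine's EvalReport/ConfigSnapshot pairs.
fn hill_climb(baseline_score: f64, candidate_scores: &[f64]) -> f64 {
    let mut best = baseline_score;
    for &score in candidate_scores {
        // Accept a variation only if it strictly improves on the
        // current best; otherwise keep the incumbent configuration.
        if score > best {
            best = score;
        }
    }
    best
}

fn main() {
    // Baseline 6.5; only 7.2 and 8.4 are accepted as improvements.
    let best = hill_climb(6.5, &[6.1, 7.2, 7.0, 8.4]);
    println!("{best}"); // prints 8.4
}
```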
§Quick Start
use std::sync::Arc;
use zeph_experiments::{
    BenchmarkCase, BenchmarkSet, ConfigSnapshot, EvalError, Evaluator, ExperimentEngine,
    GridStep, SearchSpace,
};
// ExperimentConfig is not among the crate re-exports; it is assumed here
// to live in the engine module alongside ExperimentEngine.
use zeph_experiments::engine::ExperimentConfig;
let benchmark = BenchmarkSet {
cases: vec![BenchmarkCase {
prompt: "What is the capital of France?".into(),
context: None,
reference: Some("Paris".into()),
tags: None,
}],
};
// Use a mock provider for the judge in tests; real providers in production.
// (AnyProvider / MockProvider are not part of zeph-experiments; import them
// from Zeph's provider crate as exposed in your build.)
let judge = Arc::new(AnyProvider::Mock(MockProvider::with_responses(vec![
r#"{"score": 9.0, "reason": "correct"}"#.into(),
])));
let subject = Arc::new(AnyProvider::Mock(MockProvider::with_responses(vec![
"Paris".into(),
])));
let evaluator = Evaluator::new(Arc::clone(&judge), benchmark, 100_000)?;
let generator = Box::new(GridStep::new(SearchSpace::default()));
let baseline = ConfigSnapshot::default();
let config = ExperimentConfig::default();
let mut engine = ExperimentEngine::new(evaluator, generator, subject, baseline, config, None);
let report = engine.run().await?;
println!("baseline={:.2} final={:.2}", report.baseline_score, report.final_score);
Re-exports§
pub use benchmark::BenchmarkCase;
pub use benchmark::BenchmarkSet;
pub use engine::ExperimentEngine;
pub use engine::ExperimentSessionReport;
pub use error::EvalError;
pub use evaluator::CaseScore;
pub use evaluator::EvalReport;
pub use evaluator::Evaluator;
pub use evaluator::JudgeOutput;
pub use generator::VariationGenerator;
pub use grid::GridStep;
pub use neighborhood::Neighborhood;
pub use random::Random;
pub use search_space::ParameterRange;
pub use search_space::SearchSpace;
pub use snapshot::ConfigSnapshot;
pub use types::ExperimentResult;
pub use types::ExperimentSource;
pub use types::ParameterKind;
pub use types::Variation;
pub use types::VariationValue;
Modules§
- benchmark
- Benchmark dataset types and TOML loading.
- engine
- Experiment engine — core async loop for autonomous parameter tuning.
- error
- Error types for the experiments module.
- evaluator
- LLM-as-judge evaluator for benchmark datasets.
- generator
- VariationGenerator trait for parameter variation strategies.
- grid
- Systematic grid sweep strategy for parameter variation.
- neighborhood
- Neighborhood perturbation strategy for parameter variation.
- random
- Uniform random sampling strategy for parameter variation.
- search_space
- Search space definition for parameter variation experiments.
- snapshot
- Config snapshot for a single experiment arm.
- types
- Core experiment types (ExperimentResult, ExperimentSource, ParameterKind, Variation, VariationValue).
Structs§
- GenerationOverrides
- Partial LLM generation parameter overrides for experiment variation injection.
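The partial-override idea behind GenerationOverrides can be sketched as Option fields merged over a base parameter set. The field names below are assumptions for illustration, not the crate's actual struct definition:

```rust
// Illustrative sketch of partial-override merging; the field set is
// hypothetical and not taken from the real GenerationOverrides struct.
#[derive(Clone, Debug, PartialEq)]
struct GenParams {
    temperature: f64,
    top_p: f64,
}

#[derive(Default)]
struct Overrides {
    temperature: Option<f64>,
    top_p: Option<f64>,
}

impl Overrides {
    // Apply only the fields that are set, leaving the rest of the
    // base configuration untouched.
    fn apply(&self, base: &GenParams) -> GenParams {
        GenParams {
            temperature: self.temperature.unwrap_or(base.temperature),
            top_p: self.top_p.unwrap_or(base.top_p),
        }
    }
}

fn main() {
    let base = GenParams { temperature: 0.7, top_p: 0.9 };
    // Override only the temperature; top_p falls through from the base.
    let ov = Overrides { temperature: Some(1.0), ..Default::default() };
    let merged = ov.apply(&base);
    println!("{merged:?}");
}
```

This is the usual shape for variation injection: a generator proposes a single-parameter override, and the engine merges it over the baseline snapshot before evaluating the arm.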