Crate sharpebench_core

sb-core: the SharpeBench scoring kernel

A pure, deterministic library that turns a set of agent trajectories (per-seed × per-window return series + decision traces) into a luck-robust, risk-adjusted score and leaderboard ranking.

Design invariants — these are what make a SharpeBench score reproducible forever:

Pure. No I/O, no system clock, no ambient randomness. Any randomness (the significance bootstrap) takes an explicit seed argument.
Deterministic. Plain f64 math, fixed reduction order, no parallel float sums. The same input yields byte-identical output on any platform.
No unsafe.
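
To make the first two invariants concrete, here is a minimal sketch of what "fixed reduction order" and "explicit seed" can look like in practice. The names `sum_in_order` and `SplitMix64` are hypothetical and are not part of this crate's API; the generator is the well-known SplitMix64 mixer, chosen only as an example of a seed-driven source of randomness.

```rust
/// Sum in a fixed left-to-right order. Floating-point addition is not
/// associative, so the reduction order is part of the result.
/// (Hypothetical helper, not a function exported by the crate.)
pub fn sum_in_order(xs: &[f64]) -> f64 {
    let mut acc = 0.0_f64;
    for &x in xs {
        acc += x;
    }
    acc
}

/// SplitMix64: a tiny deterministic generator used here as a stand-in for
/// whatever generator the crate actually uses. All of its state comes from
/// the caller's seed; nothing is read from the clock or the OS.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Advance the state and return the next 64-bit value.
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}
```

Summing `[0.1, 0.2, 0.3]` and `[0.3, 0.2, 0.1]` with `sum_in_order` gives results that differ in their last bit, which is why a parallel or reordered float sum would break byte-identical output. Two `SplitMix64` instances built from the same seed produce the same sequence.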

The headline idea: an agent is not ranked on raw return. It ranks only if its edge survives (a) deflation for the number of agents tested (deflated_sharpe), (b) reliability across every seed×window run (pass_k), and (c) decision-process discipline (process). Raw return is reported but is never used as the rank key; see composite.
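
The gate-then-rank idea can be sketched as follows. Everything in this block is hypothetical: the `Scored` struct, its field names, the `min_dsr` threshold, and the choice of deflated Sharpe as the ordering key are assumptions made for illustration, and the crate's own `rank` and `CompositeScore` may be shaped differently.

```rust
/// Hypothetical per-agent summary; the field names are illustrative only
/// and do not mirror the crate's CompositeScore.
#[derive(Debug, Clone, PartialEq)]
pub struct Scored {
    pub name: String,
    pub raw_return: f64, // reported, never used for ordering
    pub deflated_sharpe: f64,
    pub passed_every_run: bool,
    pub process_ok: bool,
}

/// Keep only agents that clear every gate, then order the survivors by
/// deflated Sharpe (descending). Ties are broken by name so the output
/// order is deterministic. The ordering key is an assumption of this sketch.
pub fn rank_sketch(mut field: Vec<Scored>, min_dsr: f64) -> Vec<Scored> {
    field.retain(|a| a.passed_every_run && a.process_ok && a.deflated_sharpe >= min_dsr);
    field.sort_by(|a, b| {
        b.deflated_sharpe
            .total_cmp(&a.deflated_sharpe)
            .then_with(|| a.name.cmp(&b.name))
    });
    field
}
```

In this sketch an agent with the highest raw return but one failed run is dropped before sorting, so it cannot appear on the board at all. `f64::total_cmp` gives a total order over floats, which avoids a panic or an unstable ordering if a NaN slips through.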

Re-exports

pub use allocation::check_weights;
pub use allocation::score_allocation;
pub use allocation::turnover;
pub use allocation::AllocationPolicy;
pub use allocation::AllocationReport;
pub use allocation::AllocationStep;
pub use allocation::AllocationTrajectory;
pub use allocation::WeightValidity;
pub use allocation::WeightViolation;
pub use briefing::audit_briefing;
pub use briefing::Briefing;
pub use briefing::BriefingAudit;
pub use briefing::BriefingPolicy;
pub use briefing::BriefingSection;
pub use briefing::BriefingViolation;
pub use budget_curve::budget_curve;
pub use budget_curve::BudgetCurveOpts;
pub use budget_curve::BudgetCurveReport;
pub use budget_curve::BudgetPoint;
pub use comparison_sets::comparison_set;
pub use comparison_sets::qualifies;
pub use comparison_sets::restrict_field;
pub use comparison_sets::restrict_to_shared;
pub use comparison_sets::ComparisonSet;
pub use comparison_sets::TaggedRun;
pub use comparison_sets::TaggedSubmission;
pub use composite::rank;
pub use composite::score_agent;
pub use composite::AgentSubmission;
pub use composite::CompositeScore;
pub use composite::Run;
pub use composite::ScoreConfig;
pub use correlation::crowdedness;
pub use correlation::Crowdedness;
pub use disqualification::classify_disqualification;
pub use disqualification::rollup;
pub use disqualification::DisqualThresholds;
pub use disqualification::FailReason;
pub use econrationality::assess_rationality;
pub use econrationality::DominanceChoice;
pub use econrationality::EconRationalityReport;
pub use greeks::bs_greeks;
pub use greeks::bs_price;
pub use greeks::classify_greeks_risk;
pub use greeks::portfolio_greeks;
pub use greeks::Greeks;
pub use greeks::GreeksPolicy;
pub use greeks::GreeksRisk;
pub use greeks::Leg;
pub use oos::oos_decay;
pub use oos::OosDecayReport;
pub use percentile::percentile_of;
pub use percentile::BaselineBand;
pub use percentile::HumanBaseline;
pub use process::ProcessEvent;
pub use process::ProcessScore;
pub use process::Trace;
pub use rediscovery::classify_rediscovery;
pub use rediscovery::cosine_similarity;
pub use rediscovery::RediscoveryVerdict;
pub use rediscovery::DEFAULT_REDISCOVERY_THRESHOLD;
pub use rolling::rolling_sharpe;
pub use rolling::RollingSharpe;
pub use selfaudit::run_self_audit;
pub use selfaudit::SelfAuditReport;

Modules

allocation: Allocation-vector scoring contract + turnover penalty.
attribution: Performance attribution — separate skill (alpha) from market beta.
briefing: Briefing-neutrality / input-salience-bias audit.
budget_curve: Luck-robust performance-vs-budget curve: the honest, out-of-sample inversion of an in-distribution scaling law.
calibration: Confidence calibration — does the agent’s stated conviction predict its outcomes? An agent that knows when it doesn’t know is more trustworthy with capital than one with a marginally higher Sharpe and no self-knowledge. We score this with the Brier score (lower is better; 0 = perfect, 0.25 = the always-0.5 baseline).
comparison_sets: Benchmark Comparison Sets — cross-agent ranking fairness.
composite: The composite score + leaderboard ranking — where the gates compose.
correlation: Cross-agent correlation / crowdedness — how much an agent is just riding the same factor as everyone else. Two agents with identical Sharpe are not equally valuable: the one whose returns are uncorrelated with the field is diversifying skill; the one tracking the crowd is renting a common beta that will decay (and crash) for the whole field at once. We report each agent’s correlation with the rest of the board so crowded edges are visible.
decay: Edge decay — does the agent’s signal survive forward in time, or is it a one-regime fluke? We estimate the half-life of the absolute information coefficient by regressing ln|IC| on time; a fast-decaying edge is penalized by the composite even if its average looks good. (After QuantBench’s IC half-life.)
deflated_sharpe: Sharpe ratio, Probabilistic Sharpe Ratio (PSR), and Deflated Sharpe Ratio (DSR).
disqualification: Disqualification-reason taxonomy — a legibility layer over the composite score.
econrationality: Economic-rationality litmus tests (after EconEvals).
greeks: Options pricing + Greeks-exposure risk scoring.
oos: Out-of-sample decay — how much does an agent’s edge erode after the window it was (implicitly) tuned on?
pass_k: pass^k reliability — does the agent clear the bar on every run, not on average? Stochastic agents (LLMs) can win once by luck; a benchmark that ranks the lucky single run is measuring noise. For safety-relevant suites use PassMode::All (after Sierra’s τ²-bench pass^k).
percentile: Legibility: a bare Deflated Sharpe number is illegible to outsiders.
process: Process-discipline scoring over a decision trace.
rediscovery: Rediscovery / strategy-recycling detection.
roles: Multi-agent role attribution — which role in a trading team adds skill?
rolling: Rolling-window Sharpe stability — is the deflated edge one lucky window?
selection: Selection-axis luck control.
selfaudit: Benchmark self-audit — does SharpeBench resist being gamed?
significance: Significance via a deterministic stationary bootstrap (Politis & Romano).
stats: Small, dependency-free, deterministic statistics helpers.
stylized_facts: Cont’s stylized facts — a deterministic realism validator for a return dataset.
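
Two of the module summaries above describe simple, checkable quantities: the Brier score used by calibration, and the "every run, not on average" test behind pass_k. A minimal standalone sketch of both follows. The function names and signatures are hypothetical and are not the crate's API; the Brier formula is the standard one for binary outcomes.

```rust
/// Brier score for binary outcomes: the mean squared gap between the stated
/// probability and what actually happened (1.0 for a hit, 0.0 for a miss).
/// Returns None for empty or length-mismatched input.
/// (Hypothetical helper, not a function exported by the crate.)
pub fn brier_score(forecasts: &[f64], outcomes: &[bool]) -> Option<f64> {
    if forecasts.is_empty() || forecasts.len() != outcomes.len() {
        return None;
    }
    let mut acc = 0.0_f64;
    for (&p, &hit) in forecasts.iter().zip(outcomes) {
        let y = if hit { 1.0 } else { 0.0 };
        acc += (p - y) * (p - y);
    }
    Some(acc / forecasts.len() as f64)
}

/// "All runs" reliability: true only if every run clears the bar.
/// An empty set of runs is treated as a failure in this sketch.
pub fn passes_every_run(run_scores: &[f64], bar: f64) -> bool {
    !run_scores.is_empty() && run_scores.iter().all(|&s| s >= bar)
}
```

An agent that always states 0.5 scores 0.25 regardless of outcomes, which is the baseline quoted in the calibration summary. For run scores `[1.2, 0.9, 1.1]` and a bar of `1.0`, the mean clears the bar but `passes_every_run` returns `false`, which is the distinction the pass_k summary draws.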

Structs

RealismThresholds: Realism-gate thresholds. Defaults are deliberately permissive lower bounds — a dataset only has to clear each stylized fact, not match any particular market.
RealismVerdict: The certification verdict: the measured profile, the thresholds applied, whether every gated stylized fact held, and the specific failures if not.
SelectionRobustness: Deflated-Sharpe summary across a set of candidate return streams.
StylizedFactsReport: The measured stylized-facts profile of a return series. Each field is a plain statistic; the realism predicates (StylizedFactsReport::has_fat_tails …) compare them against a RealismThresholds.

Enums

RealismFailure: A single stylized fact a dataset failed to exhibit.

Functions

selection_robustness: Compute selection robustness over candidate return streams. Each slice is one candidate strategy’s pooled returns; all candidates are deflated with the same trial footprint and then summarized. Empty input yields an all-zero summary.
stylized_facts: Measure the stylized-facts profile of a return series. Pure; deterministic.
validate_dataset: Certify a dataset against the default RealismThresholds.
validate_dataset_with: Certify a dataset against explicit thresholds.
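
As an illustration of what one stylized-fact measurement might involve, the sketch below computes excess kurtosis, a common statistic for fat tails, and compares it against a threshold. This is an assumption about the kind of check involved, not the crate's implementation: `excess_kurtosis` and `has_fat_tails_sketch` are hypothetical names, and the actual fields of StylizedFactsReport and RealismThresholds are not shown on this page.

```rust
/// Excess kurtosis from population moments, summed in a fixed order.
/// Returns None for fewer than two points or for zero variance.
/// (Hypothetical helper, not a function exported by the crate.)
pub fn excess_kurtosis(returns: &[f64]) -> Option<f64> {
    let n = returns.len();
    if n < 2 {
        return None;
    }
    let nf = n as f64;
    let mut sum = 0.0_f64;
    for &r in returns {
        sum += r;
    }
    let mean = sum / nf;

    // Second and fourth central moments.
    let (mut m2, mut m4) = (0.0_f64, 0.0_f64);
    for &r in returns {
        let d = r - mean;
        let d2 = d * d;
        m2 += d2;
        m4 += d2 * d2;
    }
    m2 /= nf;
    m4 /= nf;
    if m2 == 0.0 {
        return None;
    }
    // A normal distribution has kurtosis 3, so subtract it to get the excess.
    Some(m4 / (m2 * m2) - 3.0)
}

/// Threshold predicate in the spirit of a realism gate: the series passes
/// only if its excess kurtosis is defined and strictly above the bound.
pub fn has_fat_tails_sketch(returns: &[f64], min_excess_kurtosis: f64) -> bool {
    matches!(excess_kurtosis(returns), Some(k) if k > min_excess_kurtosis)
}
```

A series that alternates between +1 and -1 has excess kurtosis of exactly -2 (thinner tails than a normal distribution), while a mostly flat series with one large outlier comes out strongly positive. A permissive lower bound such as `0.0` would accept the second and reject the first.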