Crate sharpebench_core

sb-core: the SharpeBench scoring kernel

A pure, deterministic library that turns a set of agent trajectories (per-seed × per-window return series + decision traces) into a luck-robust, risk-adjusted score and leaderboard ranking.

Design invariants — these are what make a SharpeBench score reproducible forever:

  • Pure. No I/O, no system clock, no ambient randomness. Any randomness (the significance bootstrap) takes an explicit seed argument.
  • Deterministic. Plain f64 math, fixed reduction order, no parallel float sums. The same input yields byte-identical output on any platform.
  • No unsafe.
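
As a minimal sketch of what "explicit seed, fixed reduction order" can look like in practice: the generator (SplitMix64), the `resample_mean` helper, and all names below are assumptions for illustration, not the crate's actual internals. The resampling here is a plain i.i.d. bootstrap draw, which is simpler than the stationary bootstrap the `significance` module refers to.

```rust
/// SplitMix64, a small well-known deterministic generator.
/// Its use here is an assumption; the crate's generator is not documented above.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    /// The seed is an explicit argument: no clock, no OS entropy.
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Index in 0..n for n > 0 (modulo bias ignored for brevity).
    pub fn next_index(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Mean of one resample drawn with replacement (hypothetical helper).
/// The sum is a plain left-to-right loop, so the reduction order is fixed.
pub fn resample_mean(returns: &[f64], seed: u64) -> f64 {
    if returns.is_empty() {
        return 0.0;
    }
    let mut rng = SplitMix64::new(seed);
    let mut sum = 0.0_f64;
    for _ in 0..returns.len() {
        sum += returns[rng.next_index(returns.len())];
    }
    sum / returns.len() as f64
}
```

Calling `resample_mean(&returns, 42)` twice gives the same bits both times, because the only source of randomness is the seed passed in and the additions always happen in the same order.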

The headline idea: an agent does not rank on raw return. It ranks only if its edge survives (a) deflation for the number of agents tested (deflated_sharpe), (b) reliability across every seed×window (pass_k), and (c) decision-process discipline (process). Raw return is reported but never the rank key — see composite.
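
The gating rule above can be sketched roughly as follows. Everything here is hypothetical: `GateSummary`, `rank_survivors`, and the two thresholds are illustrative names, not the `composite` API, and ordering survivors by deflated Sharpe is an assumption (the text only says that raw return is never the key).

```rust
/// Hypothetical per-agent summary; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct GateSummary {
    pub name: String,
    pub raw_return: f64,        // reported, never used for ordering
    pub deflated_sharpe: f64,   // gate (a)
    pub passed_every_run: bool, // gate (b)
    pub process_score: f64,     // gate (c)
}

/// Keep only agents that clear all three gates, then order them.
/// The sort key is an assumption; raw_return is deliberately ignored.
pub fn rank_survivors(
    agents: &[GateSummary],
    min_dsr: f64,
    min_process: f64,
) -> Vec<&GateSummary> {
    let mut survivors: Vec<&GateSummary> = agents
        .iter()
        .filter(|a| {
            a.deflated_sharpe >= min_dsr
                && a.passed_every_run
                && a.process_score >= min_process
        })
        .collect();
    // Descending by deflated Sharpe; ties broken by name so the
    // resulting order does not depend on input order.
    survivors.sort_by(|x, y| {
        y.deflated_sharpe
            .total_cmp(&x.deflated_sharpe)
            .then_with(|| x.name.cmp(&y.name))
    });
    survivors
}
```

In this sketch an agent with the highest raw return but one failed run is simply absent from the result, which is the behavior the headline idea describes.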

Re-exports

pub use allocation::check_weights;
pub use allocation::score_allocation;
pub use allocation::turnover;
pub use allocation::AllocationPolicy;
pub use allocation::AllocationReport;
pub use allocation::AllocationStep;
pub use allocation::AllocationTrajectory;
pub use allocation::WeightValidity;
pub use allocation::WeightViolation;
pub use briefing::audit_briefing;
pub use briefing::Briefing;
pub use briefing::BriefingAudit;
pub use briefing::BriefingPolicy;
pub use briefing::BriefingSection;
pub use briefing::BriefingViolation;
pub use comparison_sets::comparison_set;
pub use comparison_sets::qualifies;
pub use comparison_sets::restrict_field;
pub use comparison_sets::restrict_to_shared;
pub use comparison_sets::ComparisonSet;
pub use comparison_sets::TaggedRun;
pub use comparison_sets::TaggedSubmission;
pub use composite::rank;
pub use composite::score_agent;
pub use composite::AgentSubmission;
pub use composite::CompositeScore;
pub use composite::Run;
pub use composite::ScoreConfig;
pub use correlation::crowdedness;
pub use correlation::Crowdedness;
pub use disqualification::classify_disqualification;
pub use disqualification::rollup;
pub use disqualification::DisqualThresholds;
pub use disqualification::FailReason;
pub use econrationality::assess_rationality;
pub use econrationality::DominanceChoice;
pub use econrationality::EconRationalityReport;
pub use greeks::bs_greeks;
pub use greeks::bs_price;
pub use greeks::classify_greeks_risk;
pub use greeks::portfolio_greeks;
pub use greeks::Greeks;
pub use greeks::GreeksPolicy;
pub use greeks::GreeksRisk;
pub use greeks::Leg;
pub use oos::oos_decay;
pub use oos::OosDecayReport;
pub use percentile::percentile_of;
pub use process::ProcessEvent;
pub use process::ProcessScore;
pub use process::Trace;
pub use rediscovery::classify_rediscovery;
pub use rediscovery::cosine_similarity;
pub use rediscovery::RediscoveryVerdict;
pub use rediscovery::DEFAULT_REDISCOVERY_THRESHOLD;
pub use rolling::rolling_sharpe;
pub use rolling::RollingSharpe;
pub use selfaudit::run_self_audit;
pub use selfaudit::SelfAuditReport;

Modules

allocation
Allocation-vector scoring contract + turnover penalty.
attribution
Performance attribution — separate skill (alpha) from market beta.
briefing
Briefing-neutrality / input-salience-bias audit.
calibration
Confidence calibration — does the agent’s stated conviction predict its outcomes? An agent that knows when it doesn’t know is more trustworthy with capital than one with a marginally higher Sharpe and no self-knowledge. We score this with the Brier score (lower is better; 0 = perfect, 0.25 = the always-0.5 baseline).
comparison_sets
Benchmark Comparison Sets — cross-agent ranking fairness.
composite
The composite score + leaderboard ranking — where the gates compose.
correlation
Cross-agent correlation / crowdedness — how much an agent is just riding the same factor as everyone else. Two agents with identical Sharpe are not equally valuable: the one whose returns are uncorrelated with the field is diversifying skill; the one tracking the crowd is renting a common beta that will decay (and crash) for the whole field at once. We report each agent’s correlation with the rest of the board so crowded edges are visible.
decay
Edge decay — does the agent’s signal survive forward in time, or is it a one-regime fluke? We estimate the half-life of the absolute information coefficient by regressing ln|IC| on time; a fast-decaying edge is penalized by the composite even if its average looks good. (After QuantBench’s IC half-life.)
deflated_sharpe
Sharpe ratio, Probabilistic Sharpe Ratio (PSR), and Deflated Sharpe Ratio (DSR).
disqualification
Disqualification-reason taxonomy — a legibility layer over the composite score.
econrationality
Economic-rationality litmus tests (after EconEvals).
greeks
Options pricing + Greeks-exposure risk scoring.
oos
Out-of-sample decay — how much does an agent’s edge erode after the window it was (implicitly) tuned on?
pass_k
pass^k reliability — does the agent clear the bar on every run, not on average? Stochastic agents (LLMs) can win once by luck; a benchmark that ranks the lucky single run is measuring noise. For safety-relevant suites use PassMode::All (after Sierra’s τ²-bench pass^k).
percentile
Legibility: a bare Deflated Sharpe number is illegible to outsiders.
process
Process-discipline scoring over a decision trace.
rediscovery
Rediscovery / strategy-recycling detection.
roles
Multi-agent role attribution — which role in a trading team adds skill?
rolling
Rolling-window Sharpe stability — is the deflated edge one lucky window?
selection
Selection-axis luck control.
selfaudit
Benchmark self-audit — does SharpeBench resist being gamed?
significance
Significance via a deterministic stationary bootstrap (Politis & Romano).
stats
Small, dependency-free, deterministic statistics helpers.
stylized_facts
Cont’s stylized facts — a deterministic realism validator for a return dataset.
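
To make the Brier score mentioned under calibration concrete, here is a minimal sketch. The function name, signature, and `Option` return are assumptions for illustration; the `calibration` module's real interface is not shown on this page.

```rust
/// Brier score: mean squared error between stated probabilities and
/// binary outcomes. Lower is better. Hypothetical helper, not the
/// crate's API. Returns None for empty or mismatched-length input.
pub fn brier_score(confidences: &[f64], outcomes: &[bool]) -> Option<f64> {
    if confidences.is_empty() || confidences.len() != outcomes.len() {
        return None;
    }
    let mut sum = 0.0_f64;
    // Plain sequential loop keeps the summation order fixed.
    for (&p, &hit) in confidences.iter().zip(outcomes.iter()) {
        let y = if hit { 1.0 } else { 0.0 };
        sum += (p - y) * (p - y);
    }
    Some(sum / confidences.len() as f64)
}
```

An agent that always states 0.5 scores 0.25 regardless of outcomes, since every term is (0.5)² = 0.25; an agent that states 1.0 on every hit and 0.0 on every miss scores 0.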

Structs

RealismThresholds
Realism-gate thresholds. Defaults are deliberately permissive lower bounds — a dataset only has to clear each stylized fact, not match any particular market.
RealismVerdict
The certification verdict: the measured profile, the thresholds applied, whether every gated stylized fact held, and the specific failures if not.
SelectionRobustness
Deflated-Sharpe summary across a set of candidate return streams.
StylizedFactsReport
The measured stylized-facts profile of a return series. Each field is a plain statistic; the realism predicates (StylizedFactsReport::has_fat_tails …) compare them against a RealismThresholds.

Enums

RealismFailure
A single stylized fact a dataset failed to exhibit.

Functions

selection_robustness
Compute selection robustness over candidate return streams. Each slice is one candidate strategy’s pooled returns; they are deflated with the same trial footprint and summarized. Empty input → all-zero.
stylized_facts
Measure the stylized-facts profile of a return series. Pure; deterministic.
validate_dataset
Certify a dataset against the default RealismThresholds.
validate_dataset_with
Certify a dataset against explicit thresholds.
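
As an illustration of the "plain statistic compared against a threshold" pattern described for StylizedFactsReport, the sketch below measures excess kurtosis, a common fat-tail statistic, and gates on it. Whether the crate uses excess kurtosis for its fat-tail check is an assumption, and both function names are hypothetical.

```rust
/// Excess kurtosis using population moments (hypothetical helper).
/// Returns None for fewer than two points or zero variance.
pub fn excess_kurtosis(returns: &[f64]) -> Option<f64> {
    let n = returns.len();
    if n < 2 {
        return None;
    }
    let nf = n as f64;
    let mut sum = 0.0_f64;
    for &r in returns {
        sum += r;
    }
    let mean = sum / nf;
    let mut m2 = 0.0_f64; // second central moment (accumulated)
    let mut m4 = 0.0_f64; // fourth central moment (accumulated)
    for &r in returns {
        let d2 = (r - mean) * (r - mean);
        m2 += d2;
        m4 += d2 * d2;
    }
    m2 /= nf;
    m4 /= nf;
    if m2 == 0.0 {
        return None;
    }
    // A normal distribution has kurtosis 3, so subtract it out.
    Some(m4 / (m2 * m2) - 3.0)
}

/// Permissive lower-bound gate: the series only has to exceed the
/// threshold, not match any particular market.
pub fn has_fat_tails_sketch(returns: &[f64], min_excess_kurtosis: f64) -> bool {
    excess_kurtosis(returns).map_or(false, |k| k > min_excess_kurtosis)
}
```

A series that alternates between -1 and 1 has excess kurtosis -2 and fails a threshold of 0, while a series with a single large outlier clears it.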