§sb-core — the SharpeBench scoring kernel
A pure, deterministic library that turns a set of agent trajectories (per-seed × per-window return series + decision traces) into a luck-robust, risk-adjusted score and leaderboard ranking.
Design invariants — these are what make a SharpeBench score reproducible forever:
- Pure. No I/O, no system clock, no ambient randomness. Any randomness (the significance bootstrap) takes an explicit seed argument.
- Deterministic. Plain f64 math, fixed reduction order, no parallel float sums. The same input yields byte-identical output on any platform.
- No unsafe.
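A minimal sketch of what the first two invariants can look like in practice. Every name here (`sum_in_order`, `SplitMix64`, `resample_indices`) is hypothetical and is not claimed to be part of sb-core; SplitMix64 is simply a small, well-known generator standing in for whatever the significance bootstrap actually uses.

```rust
/// Sum strictly left to right so the sequence of roundings is fixed.
/// A parallel or tree-shaped reduction could round differently.
pub fn sum_in_order(xs: &[f64]) -> f64 {
    let mut acc = 0.0_f64;
    for &x in xs {
        acc += x;
    }
    acc
}

/// SplitMix64: a tiny PRNG whose entire state comes from the caller's seed.
/// No clock, no OS entropy, no global state.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Draw `n` resampling indices in `0..len` from an explicit seed.
/// The plain modulo has a slight bias, which is acceptable for a sketch.
pub fn resample_indices(seed: u64, len: usize, n: usize) -> Vec<usize> {
    if len == 0 {
        return Vec::new();
    }
    let mut rng = SplitMix64::new(seed);
    (0..n)
        .map(|_| (rng.next_u64() % len as u64) as usize)
        .collect()
}
```

Summing `[0.1, 0.2, 0.3]` and `[0.3, 0.2, 0.1]` with `sum_in_order` gives results that differ in the last bit, which is why the reduction order has to be pinned for byte-identical output. Calling `resample_indices` twice with the same seed returns the same indices.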
The headline idea: an agent does not rank on raw return. It ranks only if
its edge survives (a) deflation for the number of agents tested
(deflated_sharpe), (b) reliability across every seed×window
(pass_k), and (c) decision-process discipline (process). Raw return is
reported but never the rank key — see composite.
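The gating idea can be illustrated with a short sketch. This is not the crate's `composite::rank`: the `Candidate` and `Gates` types, their fields, the threshold semantics, and the choice of deflated Sharpe as the sort key are all assumptions made for illustration.

```rust
/// Hypothetical per-agent summary; field names are illustrative only.
#[derive(Debug, Clone)]
pub struct Candidate {
    pub name: String,
    pub raw_return: f64,        // reported, never used for ordering
    pub deflated_sharpe: f64,   // gate (a): edge after deflation
    pub passed_every_run: bool, // gate (b): reliability across seed x window
    pub process_score: f64,     // gate (c): decision-process discipline
}

/// Hypothetical gate thresholds.
pub struct Gates {
    pub min_deflated_sharpe: f64,
    pub min_process_score: f64,
}

/// Keep only agents that clear all three gates, then order the survivors
/// by deflated Sharpe (descending), breaking ties by name so the result
/// is fully deterministic. `raw_return` is deliberately ignored.
pub fn rank_gated<'a>(field: &'a [Candidate], gates: &Gates) -> Vec<&'a Candidate> {
    let mut ranked: Vec<&Candidate> = field
        .iter()
        .filter(|c| {
            // A NaN score fails these comparisons and is excluded.
            c.deflated_sharpe >= gates.min_deflated_sharpe
                && c.passed_every_run
                && c.process_score >= gates.min_process_score
        })
        .collect();
    ranked.sort_by(|a, b| {
        b.deflated_sharpe
            .total_cmp(&a.deflated_sharpe)
            .then_with(|| a.name.cmp(&b.name))
    });
    ranked
}
```

Under these assumptions, an agent with the highest raw return and the highest deflated Sharpe still drops off the board if it failed a single run, because reliability is a gate rather than a weight.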
Re-exports§
- pub use allocation::check_weights;
- pub use allocation::score_allocation;
- pub use allocation::turnover;
- pub use allocation::AllocationPolicy;
- pub use allocation::AllocationReport;
- pub use allocation::AllocationStep;
- pub use allocation::AllocationTrajectory;
- pub use allocation::WeightValidity;
- pub use allocation::WeightViolation;
- pub use briefing::audit_briefing;
- pub use briefing::Briefing;
- pub use briefing::BriefingAudit;
- pub use briefing::BriefingPolicy;
- pub use briefing::BriefingSection;
- pub use briefing::BriefingViolation;
- pub use comparison_sets::comparison_set;
- pub use comparison_sets::qualifies;
- pub use comparison_sets::restrict_field;
- pub use comparison_sets::ComparisonSet;
- pub use comparison_sets::TaggedRun;
- pub use comparison_sets::TaggedSubmission;
- pub use composite::rank;
- pub use composite::score_agent;
- pub use composite::AgentSubmission;
- pub use composite::CompositeScore;
- pub use composite::Run;
- pub use composite::ScoreConfig;
- pub use correlation::crowdedness;
- pub use correlation::Crowdedness;
- pub use disqualification::classify_disqualification;
- pub use disqualification::rollup;
- pub use disqualification::DisqualThresholds;
- pub use disqualification::FailReason;
- pub use econrationality::assess_rationality;
- pub use econrationality::DominanceChoice;
- pub use econrationality::EconRationalityReport;
- pub use greeks::bs_greeks;
- pub use greeks::bs_price;
- pub use greeks::classify_greeks_risk;
- pub use greeks::portfolio_greeks;
- pub use greeks::Greeks;
- pub use greeks::GreeksPolicy;
- pub use greeks::GreeksRisk;
- pub use greeks::Leg;
- pub use oos::oos_decay;
- pub use oos::OosDecayReport;
- pub use percentile::percentile_of;
- pub use process::ProcessEvent;
- pub use process::ProcessScore;
- pub use process::Trace;
- pub use rediscovery::classify_rediscovery;
- pub use rediscovery::cosine_similarity;
- pub use rediscovery::RediscoveryVerdict;
- pub use rediscovery::DEFAULT_REDISCOVERY_THRESHOLD;
- pub use rolling::rolling_sharpe;
- pub use rolling::RollingSharpe;
- pub use selfaudit::run_self_audit;
- pub use selfaudit::SelfAuditReport;
Modules§
- allocation
- Allocation-vector scoring contract + turnover penalty.
- attribution
- Performance attribution — separate skill (alpha) from market beta.
- briefing
- Briefing-neutrality / input-salience-bias audit.
- calibration
- Confidence calibration — does the agent’s stated conviction predict its outcomes? An agent that knows when it doesn’t know is more trustworthy with capital than one with a marginally higher Sharpe and no self-knowledge. We score this with the Brier score (lower is better; 0 = perfect, 0.25 = the always-0.5 baseline).
- comparison_sets
- Benchmark Comparison Sets: cross-agent ranking fairness.
- composite
- The composite score + leaderboard ranking — where the gates compose.
- correlation
- Cross-agent correlation / crowdedness — how much an agent is just riding the same factor as everyone else. Two agents with identical Sharpe are not equally valuable: the one whose returns are uncorrelated with the field is diversifying skill; the one tracking the crowd is renting a common beta that will decay (and crash) for the whole field at once. We report each agent’s correlation with the rest of the board so crowded edges are visible.
- decay
- Edge decay: does the agent’s signal survive forward in time, or is it a one-regime fluke? We estimate the half-life of the absolute information coefficient by regressing ln|IC| on time; a fast-decaying edge is penalized by the composite even if its average looks good. (After QuantBench’s IC half-life.)
- deflated_sharpe
- Sharpe ratio, Probabilistic Sharpe Ratio (PSR), and Deflated Sharpe Ratio (DSR).
- disqualification
- Disqualification-reason taxonomy — a legibility layer over the composite score.
- econrationality
- Economic-rationality litmus tests (after EconEvals).
- greeks
- Options pricing + Greeks-exposure risk scoring.
- oos
- Out-of-sample decay — how much does an agent’s edge erode after the window it was (implicitly) tuned on?
- pass_k
- pass^k reliability: does the agent clear the bar on every run, not on average? Stochastic agents (LLMs) can win once by luck; a benchmark that ranks the lucky single run is measuring noise. For safety-relevant suites use PassMode::All (after Sierra’s τ²-bench pass^k).
- percentile
- Legibility: a bare Deflated Sharpe number is illegible to outsiders.
- process
- Process-discipline scoring over a decision trace.
- rediscovery
- Rediscovery / strategy-recycling detection.
- roles
- Multi-agent role attribution — which role in a trading team adds skill?
- rolling
- Rolling-window Sharpe stability — is the deflated edge one lucky window?
- selection
- Selection-axis luck control.
- selfaudit
- Benchmark self-audit — does SharpeBench resist being gamed?
- significance
- Significance via a deterministic stationary bootstrap (Politis & Romano).
- stats
- Small, dependency-free, deterministic statistics helpers.
- stylized_facts
- Cont’s stylized facts: a deterministic realism validator for a return dataset.
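Two of the module descriptions above name concrete formulas, sketched here under stated assumptions. `brier_score` and `ic_half_life` are hypothetical helpers, not the signatures of the `calibration` or `decay` modules. Converting the fitted slope to a half-life as -ln 2 / slope is the standard exponential-decay identity and is assumed here rather than taken from the crate.

```rust
/// Brier score: mean squared gap between a stated probability and the
/// realized 0/1 outcome. Lower is better. Returns None for empty input.
pub fn brier_score(forecasts: &[(f64, bool)]) -> Option<f64> {
    if forecasts.is_empty() {
        return None;
    }
    let mut acc = 0.0_f64;
    for &(p, hit) in forecasts {
        let y = if hit { 1.0 } else { 0.0 };
        acc += (p - y) * (p - y);
    }
    Some(acc / forecasts.len() as f64)
}

/// Fit ln|IC_t| = a + b*t by ordinary least squares over t = 0, 1, 2, ...
/// and return the half-life -ln(2)/b when the edge is decaying (b < 0).
/// Zero or non-finite ICs are skipped because ln|IC| is undefined there.
pub fn ic_half_life(ic: &[f64]) -> Option<f64> {
    let mut pts: Vec<(f64, f64)> = Vec::new();
    for (t, &v) in ic.iter().enumerate() {
        if v.is_finite() && v != 0.0 {
            pts.push((t as f64, v.abs().ln()));
        }
    }
    if pts.len() < 2 {
        return None;
    }
    let n = pts.len() as f64;
    let (mut sum_t, mut sum_y) = (0.0_f64, 0.0_f64);
    for &(t, y) in &pts {
        sum_t += t;
        sum_y += y;
    }
    let (mean_t, mean_y) = (sum_t / n, sum_y / n);
    let (mut sxy, mut sxx) = (0.0_f64, 0.0_f64);
    for &(t, y) in &pts {
        sxy += (t - mean_t) * (y - mean_y);
        sxx += (t - mean_t) * (t - mean_t);
    }
    // sxx > 0 here: at least two points with distinct time indices.
    let slope = sxy / sxx;
    if slope < 0.0 {
        Some(-std::f64::consts::LN_2 / slope)
    } else {
        None // flat or growing |IC|: no finite half-life
    }
}
```

A forecaster that always says 0.5 scores 0.25 whatever the outcomes, which matches the baseline quoted in the calibration entry. An IC series that halves every four steps yields a half-life of about 4.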
Structs§
- RealismThresholds
- Realism-gate thresholds. Defaults are deliberately permissive lower bounds: a dataset only has to clear each stylized fact, not match any particular market.
- RealismVerdict
- The certification verdict: the measured profile, the thresholds applied, whether every gated stylized fact held, and the specific failures if not.
- SelectionRobustness
- Deflated-Sharpe summary across a set of candidate return streams.
- StylizedFactsReport
- The measured stylized-facts profile of a return series. Each field is a plain statistic; the realism predicates (StylizedFactsReport::has_fat_tails…) compare them against a RealismThresholds.
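To make the "plain statistic compared against a threshold" pattern concrete, here is a sketch of one such predicate. It assumes fat tails are measured by excess kurtosis, which is a common choice but is not confirmed by this page. `excess_kurtosis` and `has_fat_tails_sketch` are hypothetical names, not the fields or methods of StylizedFactsReport or RealismThresholds.

```rust
/// Population excess kurtosis: m4 / m2^2 - 3, where m2 and m4 are the
/// second and fourth central moments. A Gaussian has excess kurtosis 0.
/// Returns None when there are fewer than two points or no variance.
pub fn excess_kurtosis(returns: &[f64]) -> Option<f64> {
    let n = returns.len();
    if n < 2 {
        return None;
    }
    let nf = n as f64;
    let mut sum = 0.0_f64;
    for &r in returns {
        sum += r;
    }
    let mean = sum / nf;
    let (mut m2, mut m4) = (0.0_f64, 0.0_f64);
    for &r in returns {
        let d2 = (r - mean) * (r - mean);
        m2 += d2;
        m4 += d2 * d2;
    }
    m2 /= nf;
    m4 /= nf;
    if m2 == 0.0 {
        return None;
    }
    Some(m4 / (m2 * m2) - 3.0)
}

/// Threshold-style predicate: the measured statistic must exceed a
/// permissive lower bound. The bound is supplied by the caller.
pub fn has_fat_tails_sketch(returns: &[f64], min_excess_kurtosis: f64) -> bool {
    matches!(excess_kurtosis(returns), Some(k) if k > min_excess_kurtosis)
}
```

With a lower bound of 0.0, a series of nine zeros and one large spike clears the check, while a series alternating between -1 and 1 does not.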
Enums§
- RealismFailure
- A single stylized fact a dataset failed to exhibit.
Functions§
- selection_robustness
- Compute selection robustness over candidate return streams. Each slice is one candidate strategy’s pooled returns; they are deflated with the same trial footprint and summarized. Empty input → all-zero.
- stylized_facts
- Measure the stylized-facts profile of a return series. Pure; deterministic.
- validate_dataset
- Certify a dataset against the default RealismThresholds.
- validate_dataset_with
- Certify a dataset against explicit thresholds.