Axis 3: safety — the rate at which the model abstained from completing the user’s request.
Deliberately narrow. The signal is the model’s OWN refusal behaviour:
- stop_reason == "content_filter" (a provider-standardised signal meaning the response was suppressed by the provider’s safety layer), OR
- the response text matches a caller-supplied refusal pattern.
A default pattern set covers common English refusals from modern
RLHF-trained chat models (“I can’t help with that”, “I’m unable
to”, etc.). Callers using non-English models or domain-specific
refusal phrasings should pass a custom list to
compute_with_patterns.
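The two signals above can be sketched as a single predicate. This is a minimal illustration, not the crate's implementation: the Response struct, the EXAMPLE_PATTERNS list, and the is_abstention_sketch function are hypothetical names introduced here, and the real types and default pattern list may differ. It assumes patterns are stored in lowercase and matched as substrings against the lowercased response text.

```rust
/// Hypothetical response shape, for illustration only.
/// The crate's real response type may carry different fields.
pub struct Response {
    pub stop_reason: Option<String>,
    pub text: String,
}

/// Illustrative patterns (assumption); not the crate's DEFAULT_REFUSAL_PATTERNS.
/// Stored in lowercase so they can be compared against lowercased text.
pub const EXAMPLE_PATTERNS: &[&str] = &["i can't help with that", "i'm unable to"];

/// Returns true when the provider suppressed the response
/// (stop_reason == "content_filter") or when the lowercased text
/// contains any of the supplied patterns as a substring.
pub fn is_abstention_sketch(resp: &Response, patterns: &[&str]) -> bool {
    if resp.stop_reason.as_deref() == Some("content_filter") {
        return true;
    }
    let lowered = resp.text.to_lowercase();
    patterns.iter().any(|p| lowered.contains(*p))
}
```

A caller working with a non-English model would pass its own list in place of EXAMPLE_PATTERNS, for example `&["je ne peux pas"]`, which mirrors the role the text describes for compute_with_patterns.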
This axis does NOT detect tool-call divergence — “candidate skipped
a tool the baseline called” surfaces on the
crate::diff::trajectory axis via edit distance, which is
principled and domain-free.
It also does NOT detect harmful semantic content delivered without
refusal: an agent that confidently invents medical dosages,
fabricates legal citations, or gives unsafe advice will still pass
this axis (the model didn’t refuse, so safety_score = 1.0). Harm
semantics need a domain rubric — Shadow’s answer is the Judge axis
(axis 8), where the user supplies an LLM-as-judge rubric. See
examples/harmful-content-judge/ for a worked example covering
medical / legal / eating-disorder content. Domain-specific policy
violations (“assistant should have asked for confirmation before
issuing a refund”, “ESI-1 must page physician”) are the Judge
axis’s territory.
§Coverage cross-references
When this axis reports severity = None but you suspect a
safety regression, check these other surfaces:
- Harmful content delivered without refusal → Judge axis (axis 8) with a domain rubric.
- Agent stopped saying “I can’t” without flipping stop_reason → fingerprint dimension error_token_flag in the v2.7+ shadow.statistical.fingerprint (catches “unable”, “cannot”, “error” substrings) routed through Hotelling T².
- Required disclaimers missing (“consult a clinician”, “this is not legal advice”) → must_include_text LTLf rule.
The goal of keeping safety narrow: the axis must mean the same thing in every domain. A rising safety rate in a customer-support bot means the same as a rising safety rate in a coding agent or a clinical-triage assistant — the model is refusing more.
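To make "the model is refusing more" concrete, the sketch below compares abstention rates between a baseline run and a candidate run. It is an illustration under assumptions: abstention_rate and rate_delta are hypothetical helpers, each input is a slice of per-response abstention flags, and how the crate turns such rates into safety_score or a severity level is not shown in this page.

```rust
/// Fraction of responses flagged as abstentions.
/// Returns 0.0 for an empty slice (an assumption made for this sketch).
pub fn abstention_rate(flags: &[bool]) -> f64 {
    if flags.is_empty() {
        return 0.0;
    }
    let abstained = flags.iter().filter(|f| **f).count();
    abstained as f64 / flags.len() as f64
}

/// Candidate rate minus baseline rate.
/// A positive value means the candidate abstained more often.
pub fn rate_delta(baseline: &[bool], candidate: &[bool]) -> f64 {
    abstention_rate(candidate) - abstention_rate(baseline)
}
```

Because the inputs are plain booleans, the same comparison reads identically whether the flags come from a support bot, a coding agent, or a triage assistant, which is the domain-independence the paragraph above argues for.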
Constants§
- DEFAULT_REFUSAL_PATTERNS
- Default refusal patterns: English substrings, compared in lowercase.
Matches common phrasings produced by modern chat models across
providers. The list is LLM-general (not tied to a particular domain)
and is user-overridable via compute_with_patterns.
Functions§
- compute
- Compute the safety axis over paired responses.
- compute_with_patterns
- compute with a caller-supplied refusal-pattern list.
- is_abstention
- True iff the response is an abstention (refusal or filter stop).
- is_abstention_with
- Variant that lets the caller supply a custom refusal pattern list.