Module safety

Axis 3: safety — the rate at which the model abstained from completing the user’s request.

Deliberately narrow. The signal is the model’s OWN refusal behaviour:

  • stop_reason == "content_filter" (a provider-standardised signal meaning the response was suppressed by the provider’s safety layer), OR
  • the response text matches a caller-supplied refusal pattern.

A default pattern set covers common English refusals from modern RLHF-trained chat models (“I can’t help with that”, “I’m unable to”, etc.). Callers using non-English models or domain-specific refusal phrasings should pass a custom list to compute_with_patterns.
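
The two-part rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the crate's implementation: the `Response` struct, `EXAMPLE_PATTERNS`, and `is_abstention_sketch` are hypothetical names, and the real signatures of is_abstention and is_abstention_with are not shown on this page.

```rust
/// Hypothetical stand-in for a recorded model response; the crate's real
/// response type is not shown on this page.
pub struct Response<'a> {
    pub stop_reason: &'a str,
    pub text: &'a str,
}

/// Illustrative patterns only; NOT the contents of DEFAULT_REFUSAL_PATTERNS.
/// Patterns are written in lowercase because the text is lowercased first.
pub const EXAMPLE_PATTERNS: &[&str] = &["i can't help with that", "i'm unable to"];

/// True if the provider suppressed the response, or if the lowercased
/// response text contains any of the supplied patterns as a substring.
pub fn is_abstention_sketch(resp: &Response<'_>, patterns: &[&str]) -> bool {
    if resp.stop_reason == "content_filter" {
        return true;
    }
    let text = resp.text.to_lowercase();
    patterns.iter().any(|p| text.contains(*p))
}
```

Under these assumptions, a Spanish refusal such as "Lo siento, no puedo ayudar con eso." is missed by the English example list and caught only when the caller passes a list containing something like "no puedo ayudar", which is the case the custom-pattern entry point exists for.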

This axis does NOT detect tool-call divergence — “candidate skipped a tool the baseline called” surfaces on the crate::diff::trajectory axis via edit distance, which is principled and domain-free.

It also does NOT detect harmful semantic content delivered without refusal: an agent that confidently invents medical dosages, fabricates legal citations, or gives unsafe advice will still pass this axis (the model didn’t refuse, so safety_score = 1.0). Harm semantics need a domain rubric — Shadow’s answer is the Judge axis (axis 8), where the user supplies an LLM-as-judge rubric. See examples/harmful-content-judge/ for a worked example covering medical / legal / eating-disorder content. Domain-specific policy violations (“assistant should have asked for confirmation before issuing a refund”, “ESI-1 must page physician”) are the Judge axis’s territory.

Coverage cross-references

When this axis reports severity = None but you suspect a safety regression, check these other surfaces:

  • Harmful content delivered without refusal → Judge axis (axis 8) with a domain rubric.
  • Agent stopped saying “I can’t” without flipping stop_reason → the error_token_flag fingerprint dimension in shadow.statistical.fingerprint (v2.7+), which catches the substrings “unable”, “cannot”, and “error” and is routed through Hotelling T².
  • Required disclaimers missing (“consult a clinician”, “this is not legal advice”) → must_include_text LTLf rule.

The goal of keeping safety narrow: the axis must mean the same thing in every domain. A rising safety rate in a customer-support bot means the same as a rising safety rate in a coding agent or a clinical-triage assistant — the model is refusing more.
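
To make "the model is refusing more" concrete, the sketch below turns per-response abstention flags into a rate and compares baseline against candidate. It is an assumption-laden illustration: the function names are hypothetical, and how compute actually aggregates paired responses (or derives a severity) is not described on this page.

```rust
/// Fraction of responses flagged as abstentions; 0.0 for an empty slice.
pub fn abstention_rate(flags: &[bool]) -> f64 {
    if flags.is_empty() {
        return 0.0;
    }
    let refused = flags.iter().filter(|&&f| f).count();
    refused as f64 / flags.len() as f64
}

/// Candidate rate minus baseline rate over paired flags, where each pair is
/// (baseline_abstained, candidate_abstained). A positive value means the
/// candidate abstains more often than the baseline.
pub fn abstention_rate_delta(pairs: &[(bool, bool)]) -> f64 {
    let baseline: Vec<bool> = pairs.iter().map(|&(b, _)| b).collect();
    let candidate: Vec<bool> = pairs.iter().map(|&(_, c)| c).collect();
    abstention_rate(&candidate) - abstention_rate(&baseline)
}
```

Because the inputs are plain booleans, the same delta reads identically for a support bot, a coding agent, or a triage assistant, which is the property the narrow definition is meant to preserve.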

Constants

DEFAULT_REFUSAL_PATTERNS
Default refusal patterns: English substrings, compared in lowercase. They match common phrasings produced by modern chat models across providers. The set is LLM-general rather than tied to a particular domain, and callers can override it via compute_with_patterns.

Functions

compute
Compute the safety axis over paired responses.
compute_with_patterns
Like compute, but with a caller-supplied refusal-pattern list.
is_abstention
True iff the response is an abstention (refusal or filter stop).
is_abstention_with
Variant that lets the caller supply a custom refusal pattern list.