split-brain-harness
Split-Brain Harness (SBH) is a Rust security layer that wraps any LLM and detects prompt injection, insider threat patterns, authority impersonation, and multi-turn session escalation before a response is ever generated. It runs as a drop-in OpenAI-compatible proxy with no changes to the downstream application, works fully offline against a local model, and ships as a single static binary.
Benchmarked against three adversarial datasets (llama3.2:3b, local Ollama, air-gapped):
- Deepset (546 rows): precision 0.81 · recall 0.37 · F1 0.51
- CyberEC (141 rows): precision 1.00 · recall 0.50 · F1 0.67
- TrustAI (1,398 unlabeled jailbreaks): 94.8% flagging rate

CyberEC precision is perfect, with zero false positives. The Stage 0 normalizer catches 50% of CyberEC encoding-evasion false negatives (homoglyphs, base64, Morse, backslash-escape, leet).
354 tests · CI green · MIT license
Stage 0 normalizer components are available as standalone crates: deobfuscate — 7-pass encoding-evasion normalizer · unicode-interference — forward/reverse script-mixing detector
Quick demo (no backend required)
What it does
Two-stage pipeline. The proposer wraps every input in a soul-injected system prompt and produces structured telemetry. The verifier runs deterministic consistency checks against the output — and optionally a second LLM pass for deeper grounding.
input text
↓
[Adaptor] trigger-matched context packs injected into system prompt
↓
[Transformer] soul + RAG corpus → prompt assembly
↓
[Stage 1: Propose] soul-wrapped LLM call → TelemetryResult JSON
↓
[Stage 2: Verify] 6 deterministic checks (always on) ± LLM verifier pass
↓
HarnessResult { telemetry, verification, trace }
If the model returns non-JSON or a refusal, a safe structured fallback is returned instead of crashing. When stop_and_ask=true fires (confidence < 0.4 or 3+ flags), the result must not be acted on blindly.
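The stop_and_ask rule above can be sketched as a small predicate. This is only an illustration: the function name `should_stop_and_ask` and its parameters are hypothetical and are not taken from the crate's API; only the two thresholds come from the text.

```rust
/// Hypothetical helper (not part of the sbh API): mirrors the documented
/// rule that stop_and_ask fires when confidence < 0.4 or 3+ flags are raised.
pub fn should_stop_and_ask(confidence: f64, flag_count: usize) -> bool {
    confidence < 0.4 || flag_count >= 3
}
```

A caller would typically route a `true` result to a human reviewer or a clarification prompt rather than acting on the telemetry directly.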
Subcommands
| Command | Description |
|---|---|
| sbh analyze | Run a single input through the full pipeline |
| sbh demo | 5-scenario threat demo: DHS-relevant cases (prompt injection, insider, adversary impersonation, BEC) |
| sbh demo --serve | Multi-turn slow-boil session escalation demo; shows escalation detection firing at the inflection point |
| sbh bench | Calibration benchmark against a JSONL input set with baseline diff |
| sbh serve | OpenAI-compatible HTTPS proxy; routes any OpenAI client through the harness |
| sbh doctor | Health check: backend, soul, context corpus, forge toolchain, witness layer |
| sbh forge | Ephemeral Tool Forge: generate, sandbox (WASM/WASI), and reputation-gate LLM-produced tools |
| sbh audit | View the forge audit trail |
| sbh export-ollama | Bake soul + context docs into an Ollama Modelfile for self-contained deployment |
| sbh debug-bundle | Capture full diagnostic snapshot to JSON |
| sbh-monitor | TUI chat interface with live telemetry panel |
Quick start
# local Ollama (default)
# prompt injection
# insider threat probe
# authority impersonation
# Anthropic backend
SBH_BACKEND=anthropic SBH_API_KEY=sk-ant-... \
# offline demo — no backend required, presentation-safe
# proxy mode — three-turn escalation trace via curl
SBH_SESSION_LOG=/tmp/sbh_sessions.jsonl \
&
Output schema
Telemetry fields
| Field | Type | Description |
|---|---|---|
| primary_emotion | string | Dominant emotional register |
| emotional_intensity | float 0–1 | 0 = flat, 1 = extreme distress or excitement |
| structural_tone | string[] | Rhetorical posture: adversarial, cooperative, coercive, … |
| stated_objective | string | 10-word summary of the explicit request |
| subtextual_motive | string | Unstated psychological goal or hidden lever |
| manipulation_risk | low / medium / high | Coercion directed at the AI system specifically |
| urgency_vector | float 0–1 | Manufactured time pressure (advocacy/fiction do not raise this) |
| coherence_rating | float 0–1 | Rational and focused vs. scattered or chaotic |
Verifier consistency checks (6, always-on, no extra API cost)
| Check | Fires when |
|---|---|
| Hostile emotion vs risk | intensity ≥ 0.7 + hostile emotion + risk=low |
| Adversarial tone vs risk | tone contains adversarial/coercive/threatening + risk=low |
| Urgency vs risk | urgency_vector ≥ 0.7 + risk=low |
| Low coherence | coherence_rating < 0.3 |
| High confidence, high flags | confidence > 0.8 but 2+ flags fire |
| High risk, no coercive signals | risk=high but urgency < 0.4 and no coercive tone |
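The table above maps naturally onto a pure function over the telemetry fields. The following is a minimal sketch under several assumptions, not the harness's actual verifier: the `Telemetry` struct, the `confidence` field, the flag names, and the `HOSTILE_EMOTIONS` list are all hypothetical, and the "high confidence, high flags" check is evaluated last over the other five.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Risk { Low, Medium, High }

/// Hypothetical mirror of the telemetry fields used by the checks.
#[derive(Debug, Clone)]
pub struct Telemetry {
    pub primary_emotion: String,
    pub emotional_intensity: f64,
    pub structural_tone: Vec<String>,
    pub manipulation_risk: Risk,
    pub urgency_vector: f64,
    pub coherence_rating: f64,
    pub confidence: f64, // assumed field; referenced by the checks table
}

// Assumption: which emotions count as "hostile" is not specified in the README.
const HOSTILE_EMOTIONS: [&str; 3] = ["anger", "hostility", "contempt"];
// Tones named in the "Adversarial tone vs risk" row.
const COERCIVE_TONES: [&str; 3] = ["adversarial", "coercive", "threatening"];

/// Returns the names of every consistency check that fires.
pub fn consistency_flags(t: &Telemetry) -> Vec<&'static str> {
    let hostile = HOSTILE_EMOTIONS.iter().any(|e| *e == t.primary_emotion.as_str());
    let coercive = t
        .structural_tone
        .iter()
        .any(|tone| COERCIVE_TONES.iter().any(|c| *c == tone.as_str()));
    let low = t.manipulation_risk == Risk::Low;

    let mut flags = Vec::new();
    if t.emotional_intensity >= 0.7 && hostile && low {
        flags.push("hostile_emotion_vs_risk");
    }
    if coercive && low {
        flags.push("adversarial_tone_vs_risk");
    }
    if t.urgency_vector >= 0.7 && low {
        flags.push("urgency_vs_risk");
    }
    if t.coherence_rating < 0.3 {
        flags.push("low_coherence");
    }
    if t.manipulation_risk == Risk::High && t.urgency_vector < 0.4 && !coercive {
        flags.push("high_risk_no_coercive_signals");
    }
    // Evaluated last: high confidence is suspicious when 2+ other checks fired.
    if t.confidence > 0.8 && flags.len() >= 2 {
        flags.push("high_confidence_high_flags");
    }
    flags
}
```

Because the function is deterministic and touches no network, it illustrates why these checks add no extra API cost.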
CLI reference
analyze
demo
bench
Calibration benchmark — run a JSONL question set and compare against a baseline:
Input JSONL supports {text}, {turns:[...]}, or {question} fields — compatible with MT-Bench, LLM-Sec-Eval, and prior sbh output.
Per-item output: [N/total] status risk elapsed text...
Status: same (dim) / fixed (green) / REGRESSED (red) / new
--fail-on-regression exits 1 if any input moves to a higher risk level — suitable for CI gates on soul.md changes.
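The status labels and the regression gate can be sketched as a comparison between baseline and current risk levels. This is an illustrative sketch only: the `Risk`, `Status`, `classify`, and `exit_code` names are hypothetical, and it assumes "fixed" means the risk level dropped relative to the baseline.

```rust
use std::cmp::Ordering;

// Variant order gives Low < Medium < High via the derived Ord.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Risk { Low, Medium, High }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status { Same, Fixed, Regressed, New }

/// Compare one input's current risk with its baseline entry, if any.
pub fn classify(baseline: Option<Risk>, current: Risk) -> Status {
    match baseline {
        None => Status::New, // input not present in the baseline file
        Some(prev) => match current.cmp(&prev) {
            Ordering::Equal => Status::Same,
            Ordering::Less => Status::Fixed,        // assumed meaning: risk went down
            Ordering::Greater => Status::Regressed, // risk moved to a higher level
        },
    }
}

/// Process exit code for a run: 1 only when the gate is on and something regressed.
pub fn exit_code(statuses: &[Status], fail_on_regression: bool) -> i32 {
    if fail_on_regression && statuses.contains(&Status::Regressed) { 1 } else { 0 }
}
```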
serve
OpenAI-compatible HTTPS proxy:
Routes:
- POST /v1/chat/completions: full harness pipeline behind the OpenAI API
- GET /health: liveness/version
- GET /metrics: Prometheus text exposition (6 counters + gauges)
Response extras:
- x-sbh-telemetry header: URL-encoded telemetry JSON
- x-sbh-session / x-sbh-session-turns: multi-turn session tracking
- x-sbh-session-alert: escalation_detected, set by slow-boil escalation detection (≥3 turns, risk delta > 0.5)
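The escalation rule (at least 3 turns, risk delta above 0.5) can be illustrated with a short function. This is a sketch under stated assumptions, not the proxy's implementation: it assumes each turn carries a numeric risk score in the 0 to 1 range and that "delta" means latest turn minus first turn; the function name is hypothetical.

```rust
/// Hypothetical slow-boil check over per-turn risk scores (assumed 0.0..=1.0).
/// Fires only when the session has 3+ turns and risk rose by more than 0.5
/// between the first and the latest turn.
pub fn escalation_detected(turn_risks: &[f64]) -> bool {
    match (turn_risks.first(), turn_risks.last()) {
        (Some(first), Some(last)) if turn_risks.len() >= 3 => last - first > 0.5,
        _ => false,
    }
}
```

Other readings of "delta" are plausible (for example, maximum minus minimum over the session); the threshold logic would stay the same.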
Security hardening:
- SBH_SERVE_KEY: Bearer token auth; 401 on mismatch; key never forwarded upstream
- SBH_SERVE_RATE: per-IP sliding window rate limit (default 60/min); 429 on breach
- SBH_SERVE_MAX_BODY: body size cap (default 1 MiB)
- --tls-cert / --tls-key (or SBH_TLS_CERT / SBH_TLS_KEY): rustls TLS termination, no OpenSSL
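For readers unfamiliar with per-IP sliding-window limiting, here is a minimal standard-library sketch of the technique. It is not the code used by sbh serve; the `RateLimiter` type and its methods are hypothetical, and the current time is passed in explicitly to keep the example deterministic.

```rust
use std::collections::{HashMap, VecDeque};
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Hypothetical per-IP sliding-window limiter.
pub struct RateLimiter {
    limit: usize,
    window: Duration,
    hits: HashMap<IpAddr, VecDeque<Instant>>,
}

impl RateLimiter {
    pub fn new(limit: usize, window: Duration) -> Self {
        Self { limit, window, hits: HashMap::new() }
    }

    /// Returns true if the request is allowed; a false result would map to HTTP 429.
    pub fn allow(&mut self, ip: IpAddr, now: Instant) -> bool {
        let queue = self.hits.entry(ip).or_default();
        // Drop timestamps that have slid out of the window.
        while let Some(&oldest) = queue.front() {
            if now.duration_since(oldest) >= self.window {
                queue.pop_front();
            } else {
                break;
            }
        }
        if queue.len() < self.limit {
            queue.push_back(now);
            true
        } else {
            false
        }
    }
}
```

With the documented default, a limiter would be built as `RateLimiter::new(60, Duration::from_secs(60))`.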
doctor
Reports: backend reachability, forge toolchain (wasm32-wasip1, wasmtime), soul sections, context corpus doc count, witness layer status.
export-ollama
Bake soul + context docs into a self-contained Ollama Modelfile:
SBH_CONTEXT_PATH=/path/to/ops-doctrine.toml \
The model has the soul and doctrine baked in. No runtime dependency on the harness binary — fully air-gapped deployable.
forge
Ephemeral Tool Forge — LLM generates a Rust tool, compiles to WASM/WASI, runs in sandbox, tracks reputation:
Five phases: schema validation → mock supervisor → LLM code gen → WASM/WASI sandbox → reputation + regeneration. Full audit trail via SBH_AUDIT_PATH.
sbh-monitor
TUI chat interface with live telemetry panel:
Split-screen: chat + streaming response on the left, telemetry panel (all fields) on the right, updates after each turn.
Keys: Enter send · Backspace delete · ? help · Esc/q quit · /clear reset
Context corpus (RAG layer)
Four threat-pattern docs are compiled into the binary and injected into every system prompt:
| Doc | Content |
|---|---|
| schema.telemetry | TelemetryResult field reference with calibration notes |
| threat.prompt_injection | Direct and indirect injection patterns |
| threat.social_engineering | Authority + urgency, flattery, guilt patterns |
| threat.adversarial_probing | System prompt extraction, jailbreak scaffolding |
Operators can extend or replace this corpus:
SBH_CONTEXT_PATH=/path/to/agency-doctrine.toml
SBH_CONTEXT_PATH=/path/to/doctrine-dir/
TOML format:
[[]]
= "my.doctrine"
= "Agency Threat Policy"
= "..."
= ["threat", "policy"]
Benchmark results
MT-Bench (80 questions, 10 categories)
Run on llama3.2:3b (local, offline). Baseline: fixtures/mt_bench_sbh_results_v2.jsonl
| Risk | Count |
|---|---|
| low | 78 |
| medium | 1 (base rate fallacy/politicians — known 3B model limitation) |
| high | 0 |
Script: python3 scripts/run_mt_bench.py
LLM-Sec-Evaluation (150 Chinese-language security questions)
| Risk | Count | Notes |
|---|---|---|
| low | 121 | Clean: OS/networking, legal/compliance, secure-dev, asset-mgmt |
| medium | 22 | Edge cases |
| high | 6 | ✓ Correctly detected: wget dropper, SQL injection on .gov, phishing HTML, JSP webshells, buffer overflow |
motive: unknown on most Chinese input — llama3.2:3b limitation; resolved with a larger model.
Backends
| SBH_BACKEND | Description |
|---|---|
| ollama-native | Ollama native API (/api/chat), the default |
| openai-compat | Any OpenAI-compatible endpoint (/chat/completions) |
| anthropic | Anthropic Messages API |
Recommended models:
| Use case | Model |
|---|---|
| Local dev / quick triage | llama3.2:3b — fast, 2 GB |
| Higher assurance local | qwen3.5:latest — 6.6 GB |
| Production / high assurance | claude-sonnet-4-6 via Anthropic backend |
Configuration
Priority order: env vars → config.toml → hardcoded defaults
# config.toml
= "anthropic"
= "claude-sonnet-4-6"
= "sk-ant-..."
= "deterministic"
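The priority order (env vars, then config.toml, then hardcoded defaults) amounts to a first-match lookup per setting. A minimal sketch, assuming each source has already been read into an `Option<String>`; the `resolve` function is hypothetical and not part of the crate:

```rust
/// Hypothetical resolver: the environment value wins, then the config file
/// value, then the hardcoded default.
pub fn resolve(env_value: Option<String>, file_value: Option<String>, default: &str) -> String {
    env_value
        .or(file_value)
        .unwrap_or_else(|| default.to_string())
}
```

In practice the first argument would come from something like `std::env::var("SBH_MODEL").ok()`, so an unset variable falls through to the file value.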
Environment variables
| Variable | Default | Description |
|---|---|---|
| SBH_BACKEND | ollama-native | Backend |
| SBH_ENDPOINT | (backend default) | API endpoint |
| SBH_MODEL | llama3.2:3b | Model name |
| SBH_API_KEY | (none) | API key (required for anthropic) |
| SBH_VERIFY | deterministic | deterministic / llm / none |
| SBH_SOUL_PATH | (none) | Custom soul.md path (empty = compiled-in default) |
| SBH_CONTEXT_PATH | (none) | Extra context TOML file or directory |
| SBH_CONFIG | ./config.toml | Config file path |
| SBH_TIMEOUT_SECONDS | 120 | Backend request timeout |
| SBH_MEMORY_PATH | (none) | Forge reputation persistence path |
| SBH_AUDIT_PATH | (none) | Forge audit log path (append-only JSONL) |
| SBH_SERVE_KEY | (none) | Bearer token for serve auth |
| SBH_SERVE_RATE | 60 | Rate limit requests/min/IP |
| SBH_SERVE_MAX_BODY | 1048576 | Body size cap (bytes) |
| SBH_SESSION_LOG | (none) | Session escalation log path (append-only JSONL) |
| SBH_TLS_CERT | (none) | TLS certificate PEM path |
| SBH_TLS_KEY | (none) | TLS private key PEM path |
Library usage
use ;
let config = Config ;
let result = analyze.await?;
println!;
println!;
if result.verification.stop_and_ask
Custom soul
The soul is embedded at compile time from soul.md. Override at runtime:
SBH_SOUL_PATH=/path/to/your/soul.md
Required sections: [LOGIC_SYSTEM_PROMPT] and [VERIFIER_SYSTEM_PROMPT].
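Before pointing SBH_SOUL_PATH at a custom file, it can help to confirm both required section markers are present. The check below is a hypothetical pre-flight sketch, not the harness's own parser; it assumes each marker appears on a line of its own.

```rust
/// Section markers named in the README as required.
pub const REQUIRED_SECTIONS: [&str; 2] = ["[LOGIC_SYSTEM_PROMPT]", "[VERIFIER_SYSTEM_PROMPT]"];

/// Hypothetical pre-flight check: returns the required markers that do not
/// appear as a standalone line in the soul text.
pub fn missing_sections(soul: &str) -> Vec<&'static str> {
    REQUIRED_SECTIONS
        .iter()
        .copied()
        .filter(|marker| !soul.lines().any(|line| line.trim() == *marker))
        .collect()
}
```

An empty result means both sections were found; sbh doctor also reports on soul sections and remains the authoritative check.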
HTTPS deployment
# Self-signed cert (dev/demo)
SBH_SERVE_KEY=your-secret-token \
TLS is handled by rustls — no OpenSSL dependency, no system library requirement.
For production, a reverse proxy (nginx, caddy) terminating TLS at the edge is also valid.
Building
Requires Rust 1.75+. For the Forge WASM sandbox:
License
MIT