JailGuard

JailGuard is a pure-Rust prompt-injection detector with a 1.5 MB embedded MLP classifier. It scores text with a p50 latency of about 14 ms on a single CPU thread (Apple M3), achieves 98.40% accuracy on a 7,049-sample held-out test set drawn from 17 public datasets, and ships bindings for Rust, Python, JavaScript, Go, and Elixir. No external service, no API key. Dual-licensed under MIT OR Apache-2.0.

Quick start

Rust — cargo add jailguard

use jailguard::{detect, is_injection};

fn main() -> Result<(), &'static str> {
    // Reject the request outright when the text is flagged.
    if is_injection("ignore previous instructions") {
        return Err("blocked");
    }

    let result = detect("What is the capital of France?");
    println!("score={:.3} risk={:?}", result.score, result.risk);
    Ok(())
}

Python — pip install jailguard

import jailguard

if jailguard.is_injection("ignore previous instructions"):
    raise RuntimeError("blocked")

result = jailguard.detect("What is the capital of France?")
print(result.score, result.risk)

JavaScript / TypeScript — npm install @yfedoseev/jailguard

import { detect, isInjection } from "@yfedoseev/jailguard";

if (isInjection("ignore previous instructions")) {
    throw new Error("blocked");
}

const r = detect("What is the capital of France?");
console.log(r.score, r.risk);

Go — go get github.com/yfedoseev/jailguard/go

import (
    "fmt"
    "log"

    jailguard "github.com/yfedoseev/jailguard/go"
)

if injection, _ := jailguard.IsInjection("ignore previous instructions"); injection {
    log.Fatal("blocked")
}

result, _ := jailguard.Detect("What is the capital of France?")
fmt.Printf("score=%.3f risk=%v\n", result.Score, result.Risk)

Elixir — mix.exs

def deps do
  [{:jailguard, "~> 0.1.2"}]
end

:ok = JailGuard.download_model()

{:ok, injection?} = JailGuard.is_injection("ignore previous instructions")
if injection?, do: raise("blocked")

{:ok, result} = JailGuard.detect("What is the capital of France?")
IO.inspect({result.score, result.risk})

Precompiled NIFs ship for Linux (x86_64, aarch64), macOS (x86_64, aarch64), and Windows (x86_64), so installation does not require a Rust toolchain. Set JAILGUARD_BUILD=1 to compile from source on unsupported targets.

The classifier is embedded in every binding. The 90 MB MiniLM ONNX embedder is auto-downloaded to ~/.cache/jailguard/ on first use. In production, call jailguard::download_model() at startup to warm the cache before serving traffic.

JailGuard vs alternatives in 2026

Feature	JailGuard	Lakera Guard	Rebuff	ProtectAI deberta-v3	Meta Prompt Guard 2
License	Apache 2.0 / MIT	proprietary (Check Point announced acquisition Sep 16, 2025)	Apache 2.0 — archived May 16, 2025	Apache 2.0 (parent acq. by Palo Alto Jul 22, 2025)	Llama 4 Community (non-OSI)
Deployment	embedded library	SaaS API	self-host Python SDK	HF model	HF model
Model size	1.5 MB MLP + 90 MB MiniLM ONNX	n/a (API)	n/a	~440 MB	22 M or 86 M params
Latency (CPU)	p50 14 ms	~150–300 ms RTT	n/a	104–212 ms	92 ms (A100 GPU)¹
Classification	binary (trained on 8 attack families)	binary	binary	binary	binary
Active in 2026?	✅	✅ (Check Point pending)	❌ archived	✅ (Palo Alto)	✅
No PyTorch / no runtime dep	✅ (Rust)	❌ HTTP client	❌ Python+OpenAI	❌ PyTorch	❌ PyTorch
Multi-language bindings	Rust, Python, JS, Go, Elixir	API clients	Python	Python	Python

¹ Meta does not publish CPU latency for Prompt Guard 2.

Full methodology, dataset breakdown, and head-to-head local-CPU comparisons against protectai/deberta-v3-base-prompt-injection-v2, deepset/deberta-v3-base-injection, and madhurjindal/Jailbreak-Detector-Large are in BENCHMARKS.md.

API at a glance

pub fn detect(text: &str) -> DetectionOutput
pub fn is_injection(text: &str) -> bool
pub fn score(text: &str) -> f32
pub fn detect_batch(texts: &[&str]) -> Vec<DetectionOutput>
pub fn download_model() -> Result<PathBuf, Error>

pub struct DetectionOutput {
    pub is_injection: bool,
    pub score: f32,
    pub confidence: f32,
    pub risk: RiskLevel,
}

pub enum RiskLevel { Safe, Low, Medium, High, Critical }

Python / JS / Go / Elixir expose the same surface in idiomatic form. See docs/API.md for full per-language signatures.
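
As a rough illustration of acting on a DetectionOutput, the sketch below gates requests on a minimum risk level. The struct and enum are local copies of the definitions listed above so that the snippet compiles on its own. The helper should_block, its min_risk parameter, and the ordering derives on RiskLevel are assumptions made for illustration; they are not part of the documented API.

```rust
// Local mirrors of the types listed above, so this sketch compiles without
// the jailguard crate. The derives are assumptions, not the crate's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum RiskLevel {
    Safe,
    Low,
    Medium,
    High,
    Critical,
}

#[derive(Debug, Clone)]
pub struct DetectionOutput {
    pub is_injection: bool,
    pub score: f32,
    pub confidence: f32,
    pub risk: RiskLevel,
}

/// Hypothetical policy helper: block only when the detector flags the text
/// and its risk level is at or above `min_risk`.
/// Derived ordering follows declaration order: Safe < Low < ... < Critical.
pub fn should_block(out: &DetectionOutput, min_risk: RiskLevel) -> bool {
    out.is_injection && out.risk >= min_risk
}
```

A stricter service might call should_block(&out, RiskLevel::Low), while a more permissive one might only block at RiskLevel::High and log everything below that.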

How it works

JailGuard pairs a frozen sentence-embedding model with a small classifier:

MiniLM-L6-v2 (384-dim, ONNX) produces a semantic vector for the input.
A 3-layer MLP (384 → 256 → 128 → 1, ~130 K parameters, ReLU + dropout 0.2 + sigmoid) scores it as injection vs. benign.

The embedding model is frozen — no fine-tuning — which keeps training and inference cost on CPU modest. The classifier weights are a 1.5 MB JSON file include_str!'d into the binary at compile time.

┌─────────────────────────────────────────────────────────────┐
│                 JAILGUARD DETECTION PIPELINE                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   User Prompt                                               │
│       │                                                     │
│       ▼                                                     │
│   ┌─────────────┐                                           │
│   │  MiniLM-L6  │  Semantic Embedding (384-dim)             │
│   │   (ONNX)    │  • Pre-trained by Microsoft               │
│   └──────┬──────┘  • Captures meaning, not just keywords    │
│          │                                                  │
│          ▼                                                  │
│   ┌─────────────────────────────────────────┐               │
│   │     Binary Classifier (Pure Rust)       │               │
│   │  ┌─────────────┐  ┌─────────────────┐   │               │
│   │  │ Dense 256   │→ │   Dense 128     │   │               │
│   │  │ ReLU+Drop   │  │   ReLU+Drop     │   │               │
│   │  └─────────────┘  └─────────────────┘   │               │
│   │                          │              │               │
│   │                          ▼              │               │
│   │              ┌─────────────────┐        │               │
│   │              │  Sigmoid (0-1)  │        │               │
│   │              └─────────────────┘        │               │
│   └─────────────────────────────────────────┘               │
│          │                                                  │
│          ▼                                                  │
│   Detection Result                                          │
│   • confidence: 0.0 - 1.0                                   │
│   • is_injection: confidence > 0.5                          │
│   • risk: Safe | Low | Medium | High | Critical             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
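
The classifier stage in the diagram can be sketched in plain Rust as follows. This is a minimal sketch under stated assumptions, not JailGuard's actual implementation: the names dense, relu, sigmoid, Layer, constant_layer, forward, and param_count are hypothetical, the weights are placeholders, and dropout is treated as a training-time step that is skipped at inference. Only the layer sizes and activations come from the description above.

```rust
/// Layer sizes from the text: 384 -> 256 -> 128 -> 1.
const SIZES: [usize; 4] = [384, 256, 128, 1];

/// Trainable parameters (weights plus biases) for a chain of dense layers.
fn param_count(sizes: &[usize]) -> usize {
    sizes.windows(2).map(|w| w[0] * w[1] + w[1]).sum()
}

/// Hypothetical container for one dense layer; `weights[j]` is the row for output j.
struct Layer {
    weights: Vec<Vec<f32>>,
    bias: Vec<f32>,
}

/// Placeholder layer (every weight = `w`, bias = 0). Real weights come from training.
fn constant_layer(inputs: usize, outputs: usize, w: f32) -> Layer {
    Layer {
        weights: vec![vec![w; inputs]; outputs],
        bias: vec![0.0; outputs],
    }
}

/// One fully connected layer: out[j] = bias[j] + sum_i weights[j][i] * input[i].
fn dense(input: &[f32], weights: &[Vec<f32>], bias: &[f32]) -> Vec<f32> {
    weights
        .iter()
        .zip(bias)
        .map(|(row, b)| row.iter().zip(input).map(|(w, x)| w * x).sum::<f32>() + b)
        .collect()
}

fn relu(v: Vec<f32>) -> Vec<f32> {
    v.into_iter().map(|x| x.max(0.0)).collect()
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Inference-time forward pass: ReLU after every hidden layer, sigmoid on the
/// single output. Dropout is omitted because it only applies during training.
fn forward(layers: &[Layer], embedding: &[f32]) -> f32 {
    let mut x = embedding.to_vec();
    for (i, layer) in layers.iter().enumerate() {
        x = dense(&x, &layer.weights, &layer.bias);
        if i + 1 < layers.len() {
            x = relu(x);
        }
    }
    sigmoid(x[0])
}
```

With these layer sizes, param_count(&SIZES) evaluates to 131,585 (98,560 + 32,896 + 129), which is consistent with the ~130 K figure quoted above.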

Measurements

Measured on Apple M3, last revalidated 2026-05-03. The pipeline test split is in-distribution (held out from the same 17-source training mix). J1N2 and shalyhinpavel are external datasets, neither used during training.

Test set	Samples	Accuracy	Precision	Recall	F1
Pipeline (in-distribution)	7,049	98.40%	98.56%	97.98%	0.983
J1N2 mix (OOD)	5,000	99.38%	98.09%	99.94%	0.990
shalyhinpavel hard-negatives (OOD)	147	89.12%	76.60%	87.80%	0.818
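
The F1 column can be re-derived from the precision and recall columns, since F1 is their harmonic mean. The helper names f1 and round3 below are illustrative only; the inputs are the table's own percentages expressed as fractions.

```rust
/// F1 is the harmonic mean of precision and recall (both as fractions in 0..=1).
fn f1(precision: f64, recall: f64) -> f64 {
    2.0 * precision * recall / (precision + recall)
}

/// Round to three decimals, matching the precision of the table's F1 column.
fn round3(x: f64) -> f64 {
    (x * 1000.0).round() / 1000.0
}
```

For example, round3(f1(0.9856, 0.9798)) gives 0.983 for the pipeline row, and round3(f1(0.7660, 0.8780)) gives 0.818 for the hard-negatives row.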

Latency (single CPU thread)

Component	Apple M3	Intel i5-10210U @ 1.6 GHz¹
Embedding (MiniLM ONNX)	~13 ms	~36 ms
Classification (MLP)	~1 ms	~1 ms
Total (p50)	~14 ms	~37 ms
Total (p99)	~19 ms	~43 ms
Cold start	~140 ms	~350 ms

¹ A low-power Chromebook CPU from 2019 (Comet Lake-U, 4c/8t, running ChromeOS Crostini, Linux 6.6). Included to show that JailGuard runs well on older or weaker hardware. Modern desktop or server CPUs land closer to the M3 column. Full per-benchmark numbers are in BENCHMARKS.md.
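
The p50 and p99 rows are percentiles over per-call latencies. As a hedged sketch of how such figures can be computed from raw samples, the function below uses the nearest-rank definition. The function name and the choice of nearest-rank are assumptions; the project's harnesses (for example Criterion) compute their own statistics and may use a different estimator.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `p` percent
/// of all samples are less than or equal to it. `p` must lie in (0, 100].
/// Returns None for an empty slice or an out-of-range `p`.
fn percentile(samples_ms: &[f64], p: f64) -> Option<f64> {
    if samples_ms.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // 1-based rank, clamped to guard against floating-point rounding.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    let rank = rank.clamp(1, sorted.len());
    Some(sorted[rank - 1])
}
```

With the made-up samples [5.0, 1.0, 4.0, 2.0, 3.0], percentile(&samples, 50.0) returns Some(3.0) and percentile(&samples, 99.0) returns Some(5.0). With so few samples the p99 is simply the maximum, which is why latency benchmarks usually collect many more iterations.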

Benchmarks

Reproducible latency and throughput numbers come from three harnesses:

benches/detect.rs — Criterion bench covering single-shot is_injection / detect / score and batch throughput at n = 1, 8, 32, 128. Run with cargo bench --bench detect.
examples/cold_start_bench.rs — process-startup cost (ONNX session init + first inference). Run with cargo run --release --example cold_start_bench.
scripts/bench.sh — portable POSIX wrapper that captures machine metadata (CPU, arch, kernel, toolchain) and emits a single markdown report. Works on Linux x86_64, Linux aarch64, macOS Intel, macOS Apple Silicon, and Chromebook Crostini.

Full methodology and head-to-head local-CPU comparisons in BENCHMARKS.md.

Attack categories covered in training

The classifier output is binary at the public API (injection / benign), but its training mix spans eight attack families:

Category	Examples
Direct injection	"Ignore previous instructions"
Jailbreak	DAN, developer-mode prompts
Role-play	Persona-based overrides
System prompt leak	"Reveal your instructions"
Encoding attacks	Base64, ROT13, Unicode obfuscation
Context manipulation	Framing and separator tricks
Output manipulation	Format coercion
Indirect injection	Malicious content embedded in documents

References

all-MiniLM-L6-v2 — sentence embeddings
PromptGuard (Meta)
Rebuff (archived)
Sentinel: SOTA model to protect against prompt injections
Not What You've Signed Up For — indirect injection

Citation

If you use JailGuard in research or production, please cite:

@software{jailguard,
  title = {JailGuard: Efficient Prompt Injection Detection via Pre-trained Embeddings},
  author = {Yury Fedoseev},
  year = {2026},
  url = {https://github.com/yfedoseev/jailguard}
}

A machine-readable CITATION.cff is also available.

License

Dual-licensed under MIT OR Apache-2.0 at your option.

jailguard 0.1.2