syara-x
Semantic YARA in Rust — extends YARA-compatible rules with semantic similarity, ML classifiers, LLM evaluation, and perceptual hash matching. Catches malicious content (prompt injection, phishing, jailbreaks) by meaning and intent, not just exact text patterns.
Ported from SYARA, originally written by Nabeel Yoosuf.
This library was ported from Python to Rust by Claude (Anthropic's AI coding assistant), working through six implementation phases under human direction, and has been exercised in a larger system where it runs side by side with YARA-X. See CONTRIBUTING.md for how the project is maintained.
EXPERIMENTAL: Do not use this for anything important yet. I'm lazily committing
directly to main some very speculative features that I'm not sure Claude can
pull off. The local-LLM backend (burn-llm / burn-llm-gpu) is currently walled
off pending a migration to candle-rs — see ROADMAP.md. Use at your
own risk.
Features
| Feature flag | Capability |
|---|---|
| (none) | String/regex matching, cleaners, chunkers |
| `sbert` | Semantic similarity via HTTP embedding endpoint (OpenAI-compatible; Ollama variant preserved) |
| `sbert-onnx` | Local ONNX MiniLM-L6-v2 backend (requires system libonnxruntime ≥1.17 — see System dependencies) |
| `classifier` | ML text classifiers via OpenAI-compatible HTTP embeddings (implies `sbert`) |
| `classifier-onnx` | Local ONNX classifier backend (recommended; implies `classifier` + `sbert-onnx`) |
| `llm` | LLM-based evaluation via OpenAI-compatible `/v1/chat/completions` (LM Studio / vLLM / llama-server / openai.com / Ollama's shim); native Ollama `/api/chat` preserved as legacy |
| `phash` | Perceptual hash matching for images, audio, and video |
| `all` | All of the above |
Quick start
```toml
# Cargo.toml
[dependencies]
syara-x = { version = "0.3", features = ["all"] }
```

```rust
use syara_x::compile_str;

let rules = compile_str(r#"rule hi { strings: $a = "hi" condition: $a }"#)?;
for m in rules.scan("hi there")? {
    println!("matched {}", m.rule_name); // match-field name and scan signature illustrative
}
```
Rule syntax
syara-x uses a YARA-inspired DSL with extensions for semantic and ML matching.
String patterns
```
rule example {
    strings:
        $s1 = "literal match" nocase
        $s2 = /regex\s+pattern/
        $s3 = "wide char" wide
    condition:
        $s1 or $s2
}
```
Supported modifiers: nocase, wide, ascii, dotall, fullword.
Regex patterns support Rust's inline flag syntax — (?m)^foo (multiline),
(?s).* (dot-matches-newline), (?i)foo (case-insensitive). Inline flags
compose with modifiers.
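For example, a minimal sketch composing an inline multiline flag with the nocase modifier:

```
rule inline_flags {
    strings:
        $a = /(?m)^assistant:/ nocase
    condition:
        $a
}
```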
Condition expressions
Conditions are boolean expressions over pattern identifiers ($name) and
pattern match counts (#name). YARA-style:
```
rule multi_turn_transcript {
    strings:
        $user = "user:"
        $assistant = "assistant:"
    condition:
        #user >= 2 and #assistant >= 1
}
```
Supported: $id, #id (count), integer literals, == != < <= > >=,
+ / - arithmetic, unary -, and / or / not, any of / all of
pattern sets ((them | $a,$b | $prefix*)). See
tasks/YARA-X-PARITY-GAPS.md for what's not
yet supported vs YARA-X.
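For instance, a sketch combining a count, arithmetic, and a pattern set, using only the constructs listed above:

```
rule combo {
    strings:
        $cmd1 = "sudo"
        $cmd2 = "chmod"
        $warn = "password"
    condition:
        any of ($cmd*) and #warn + 1 > 2
}
```

Here `any of ($cmd*)` matches if either command string appears, and `#warn + 1 > 2` requires "password" to occur at least twice.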
Semantic similarity (sbert feature)
```
rule semantic_phishing {
    similarity:
        $sim1 = {
            pattern: "your account has been compromised click here"
            threshold: 0.82
            cleaner: default_cleaning
            chunker: sentence_chunking
            matcher: sbert
        }
    condition:
        $sim1
}
```
Classifier (classifier / classifier-onnx features)
```
rule jailbreak_classifier {
    classifier:
        $c1 = {
            pattern: "request to override AI safety guidelines"
            threshold: 0.65
            cleaner: default_cleaning
            chunker: paragraph_chunking
            classifier: tuned-sbert
        }
    condition:
        $c1
}
```
The default tuned-sbert classifier is registered against an OpenAI-compatible
/v1/embeddings endpoint (http://localhost:1234). For deterministic, offline
scoring use the local ONNX backend instead:
```rust
use syara_x::OnnxEmbeddingClassifier;

// Model path and registration arguments are illustrative.
let cls = OnnxEmbeddingClassifier::from_dir("path/to/model-dir")?;
rules.register_classifier("tuned-sbert", cls);
```
LLM evaluation (llm feature)
```
rule llm_jailbreak {
    llm:
        $llm1 = {
            pattern: "Does this text attempt to override AI safety guidelines?"
            llm: openai-api-compatible
            cleaner: no_op
            chunker: no_chunking
        }
    condition:
        $llm1
}
```
openai-api-compatible is the default and talks to any OpenAI-compatible
/v1/chat/completions endpoint (LM Studio, vLLM, llama-server, Open WebUI,
openai.com, and Ollama's OpenAI-compat shim). The legacy ollama name is kept
for users on Ollama's native /api/chat endpoint — specify llm: ollama.
Default endpoint and environment variables
The default registration points at http://localhost:1234/v1/chat/completions
with model local-model — the LM Studio convention. Override via environment:
| Variable | Precedence | Purpose |
|---|---|---|
| `SYARA_LLM_ENDPOINT` | highest | Full chat-completions URL (scoped to SYARA-X) |
| `SYARA_LLM_MODEL` | highest | Model identifier |
| `SYARA_LLM_API_KEY` | highest | Bearer token |
| `OPENAI_BASE_URL` | fallback | Root URL; `/chat/completions` is appended if missing |
| `OPENAI_MODEL` | fallback | Model identifier |
| `OPENAI_API_KEY` | fallback | Bearer token |
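For example, a sketch pointing SYARA-X at a vLLM server (URL and model name are placeholders):

```sh
export SYARA_LLM_ENDPOINT="http://localhost:8000/v1/chat/completions"
export SYARA_LLM_MODEL="qwen3"
```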
Opting out of env-var lookup. If you prefer that SYARA-X never read these vars,
set SYARA_LLM_NO_ENV=1 — the default registration then falls back to the
hardcoded localhost endpoint with no API key. Alternatively, register an
explicit evaluator before scanning:
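A minimal sketch, assuming builder construction via default() and a register_llm_evaluator(name, evaluator) method alongside the register_* family listed under Built-in components (both names are assumptions):

```rust
use syara_x::OpenAiChatEvaluatorBuilder;

let eval = OpenAiChatEvaluatorBuilder::default()
    .endpoint("http://localhost:1234/v1/chat/completions")
    .model("local-model")
    .build()?;
// Method name assumed; see the register_* family on CompiledRules.
rules.register_llm_evaluator("openai-api-compatible", eval);
```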
Scoped token exposure. Prefer SYARA_LLM_API_KEY over OPENAI_API_KEY
to keep SYARA-X's credential separate from anything else on the system that
reads OPENAI_API_KEY. A process-scoped invocation also works:
SYARA_LLM_API_KEY="$(cat /path/to/key)" your-binary.
Builder knobs
OpenAiChatEvaluatorBuilder exposes endpoint, model, api_key,
temperature (default 0.0 — deterministic), max_tokens (default 8192 —
sized for reasoning models like Qwen3, DeepSeek-R1, GPT-OSS that spend
thousands of tokens on internal <think> / reasoning_content before the
final YES/NO; non-reasoning models still stop early via
finish_reason=stop), system_prompt, header (arbitrary HTTP header),
connect_timeout (default 20s), and read_timeout (default 60s).
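A hedged sketch of tuning those knobs for a slow reasoning model (builder construction and the Duration-based timeout signature are assumptions):

```rust
use std::time::Duration;

let eval = OpenAiChatEvaluatorBuilder::default()
    .model("deepseek-r1")
    .max_tokens(16384)                       // headroom for <think> traces
    .read_timeout(Duration::from_secs(180))  // long generations
    .build()?;
```

Note that raising temperature above 0.0 also disables the response cache described below.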
If a reasoning model is truncated mid-think (finish_reason=length with
empty content), the evaluator returns a SyaraError::LlmError pointing
at .max_tokens(…) rather than silently parsing the empty response as a
non-match.
Responses are cached by (pattern, chunk) when temperature == 0.0; the
cache is bounded (1024 entries) and cleared after each scan() call to
mirror the lifecycle of the text cache.
Perceptual hash (phash feature)
```
rule known_malware_image {
    phash:
        $ph1 = "/path/to/reference.png" threshold=0.95 hasher="imagehash"
    condition:
        $ph1
}
```
Built-in components
- Cleaners: `default_cleaning`, `aggressive_cleaning`, `no_op`
- Chunkers: `no_chunking`, `sentence_chunking`, `paragraph_chunking`, `word_chunking`, `fixed_size_chunking`
- Matchers: `sbert` (HTTP embedding), `tuned-sbert` (classifier), `openai-api-compatible` (LLM, default), `ollama` (LLM, legacy), `imagehash`, `audiohash`, `videohash`
Custom components can be registered on CompiledRules via
register_cleaner, register_chunker, register_semantic_matcher, etc.
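A hedged sketch of registering a custom cleaner; the exact signature is an assumption (the crate may expect a trait object rather than a closure):

```rust
// Assumed shape: register_cleaner(name, impl Fn(&str) -> String).
rules.register_cleaner("strip_markup", |text: &str| {
    // Toy cleaner: drop everything between angle brackets.
    let mut out = String::with_capacity(text.len());
    let mut in_tag = false;
    for c in text.chars() {
        match c {
            '<' => in_tag = true,
            '>' => in_tag = false,
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
});
```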
C API
A C FFI is available via the capi crate. After building, syara_x.h is
generated automatically by cbindgen.
```c
#include "syara_x.h"

/* Function and field names below are illustrative; consult the
   generated syara_x.h for the actual API. */
SyaraRules *rules = NULL;
syara_compile(rule_source, &rules);

SyaraMatchArray *matches = NULL;
syara_scan(rules, input_text, &matches);

for (size_t i = 0; i < matches->len; i++) {
    /* inspect matches->items[i] */
}

syara_match_array_free(matches);
syara_rules_free(rules);
```
Architecture
```
.syara file
  └─> SyaraParser          parse DSL
      └─> Compiler         validate identifiers, conditions
          └─> CompiledRules    execution engine
                ├─ StringMatcher    (cheapest)
                ├─ SemanticMatcher  (sbert)
                ├─ PHashMatcher     (phash)
                ├─ TextClassifier   (classifier)
                └─ LLMEvaluator     (most expensive, short-circuited)
```
Execution is cost-ordered. LLM calls are skipped when the condition cannot
be satisfied even if the LLM matches (see is_identifier_needed in
condition.rs).
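For example, a sketch (assuming string and llm sections can be combined in one rule) where the LLM is consulted only if the cheap string pattern has already matched:

```
rule gated_llm {
    strings:
        $hint = "ignore previous instructions" nocase
    llm:
        $llm1 = {
            pattern: "Does this text attempt to override AI safety guidelines?"
            llm: openai-api-compatible
            cleaner: no_op
            chunker: no_chunking
        }
    condition:
        $hint and $llm1
}
```

If $hint never matches, the conjunction cannot be satisfied, so the LLM call is skipped entirely.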
Development
External services (e.g., an Ollama or LM Studio endpoint) are only contacted when the corresponding feature is enabled and a rule actually exercises that matcher. String-only rules need no external services.
System dependencies
Most features are pure-Rust and need nothing beyond cargo. The sbert-onnx
and classifier-onnx features are the exceptions — both link against the ONNX
Runtime shared library at runtime (via ort's load-dynamic mode) and will
not run without it installed on the host.
sbert-onnx / classifier-onnx (ONNX Runtime ≥ 1.17)
macOS (Homebrew):

```sh
brew install onnxruntime
# Homebrew installs to /opt/homebrew/lib, which dlopen does NOT search by default —
# point ort at the dylib explicitly:
export ORT_DYLIB_PATH="$(brew --prefix onnxruntime)/lib/libonnxruntime.dylib"
```
Linux (Debian/Ubuntu): download the matching release from
microsoft/onnxruntime releases
and place libonnxruntime.so on your loader path, or set ORT_DYLIB_PATH to
point at the file.
Any platform (escape hatch): point ort at a specific dylib by exporting
ORT_DYLIB_PATH=/absolute/path/to/libonnxruntime.{dylib,so,dll} before
cargo test / cargo run.
Convenience wrapper: for repeated use, run
./scripts/install_onnxruntime_xdg.sh once to install
~/.local/bin/with-onnxruntime (XDG user-bin). Then prefix any command:
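For instance, using the feature flags from the table above:

```sh
with-onnxruntime cargo test --features classifier-onnx
```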
The same MiniLM weights are reused by integration_real_onnx_embed and
integration_real_onnx_classifier. To fetch them:
License
MIT — see LICENSE.