syara-x
Semantic YARA in Rust — extends YARA-compatible rules with semantic similarity, ML classifiers, LLM evaluation, and perceptual hash matching. Catches malicious content (prompt injection, phishing, jailbreaks) by meaning and intent, not just exact text patterns.
Ported from SYARA, originally written by Nabeel Yoosuf.
This library was ported from Python to Rust by Claude (Anthropic's AI coding assistant), working through six implementation phases under human direction, and has been exercised in a larger system where it runs side by side with YARA-X. See CONTRIBUTING.md for how the project is maintained.
EXPERIMENTAL: Do not use this for anything important yet. I'm lazily committing
directly to main some very speculative features that I'm not sure Claude can
pull off. The local-LLM backend (burn-llm / burn-llm-gpu) is currently walled
off pending a migration to candle-rs — see ROADMAP.md. Use at your
own risk.
Features
| Feature flag | Capability |
|---|---|
| (none) | String/regex matching, cleaners, chunkers |
| `sbert` | Semantic similarity via HTTP embedding endpoint (OpenAI-compatible; Ollama variant preserved) |
| `sbert-onnx` | Local ONNX MiniLM-L6-v2 backend (requires system libonnxruntime ≥1.17 — see System dependencies) |
| `classifier` | ML text classifiers via OpenAI-compatible HTTP embeddings (implies `sbert`) |
| `classifier-onnx` | Local ONNX classifier backend (recommended; implies `classifier` + `sbert-onnx`) |
| `llm` | LLM-based evaluation via OpenAI-compatible `/v1/chat/completions` (LM Studio / vLLM / llama-server / openai.com / Ollama's shim); native Ollama `/api/chat` preserved as legacy |
| `phash` | Perceptual hash matching for images, audio, and video |
| `all` | All of the above |
Quick start
```toml
# Cargo.toml
[dependencies]
syara-x = { version = "0.3", features = ["all"] }
```

```rust
use syara_x::compile_str;

let rules = compile_str(r#"rule hi { strings: $a = "hi" condition: $a }"#)?;
for m in rules.scan("hi there")? {
    println!("matched {}", m.rule_name); // match-field name and scan signature illustrative
}
```
Rule syntax
syara-x uses a YARA-inspired DSL with extensions for semantic and ML matching.
String patterns
```
rule example {
    strings:
        $s1 = "literal match" nocase
        $s2 = /regex\s+pattern/
        $s3 = "wide char" wide
    condition:
        $s1 or $s2
}
```
Supported modifiers: nocase, wide, ascii, dotall, fullword.
Regex patterns support Rust's inline flag syntax — (?m)^foo (multiline),
(?s).* (dot-matches-newline), (?i)foo (case-insensitive). Inline flags
compose with modifiers.
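For example, a minimal sketch composing an inline multiline flag with the nocase modifier:

```
rule inline_flags {
    strings:
        $a = /(?m)^assistant:/ nocase
    condition:
        $a
}
```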
Condition expressions
Conditions are boolean expressions over pattern identifiers ($name) and
pattern match counts (#name). YARA-style:
```
rule multi_turn_transcript {
    strings:
        $user = "user:"
        $assistant = "assistant:"
    condition:
        #user >= 2 and #assistant >= 1
}
```
Supported: $id, #id (count), integer literals, == != < <= > >=,
+ / - arithmetic, unary -, and / or / not, any of / all of
pattern sets ((them | $a,$b | $prefix*)). See
tasks/YARA-X-PARITY-GAPS.md for what's not
yet supported vs YARA-X.
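For instance, a sketch combining a count, arithmetic, and a pattern set, using only the constructs listed above:

```
rule combo {
    strings:
        $cmd1 = "sudo"
        $cmd2 = "chmod"
        $warn = "password"
    condition:
        any of ($cmd*) and #warn + 1 > 2
}
```

Here `any of ($cmd*)` matches if either command string appears, and `#warn + 1 > 2` requires "password" to occur at least twice.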
Semantic similarity (sbert feature)
```
rule semantic_phishing {
    similarity:
        $sim1 = {
            pattern: "your account has been compromised click here"
            threshold: 0.82
            cleaner: default_cleaning
            chunker: sentence_chunking
            matcher: sbert
        }
    condition:
        $sim1
}
```
Classifier (classifier / classifier-onnx features)
```
rule jailbreak_classifier {
    classifier:
        $c1 = {
            pattern: "request to override AI safety guidelines"
            threshold: 0.65
            cleaner: default_cleaning
            chunker: paragraph_chunking
            classifier: tuned-sbert
        }
    condition:
        $c1
}
```
The default tuned-sbert classifier is registered against an OpenAI-compatible
/v1/embeddings endpoint (http://localhost:1234). For deterministic, offline
scoring use the local ONNX backend instead:
```rust
use syara_x::OnnxEmbeddingClassifier;

// Model path and registration arguments are illustrative.
let cls = OnnxEmbeddingClassifier::from_dir("path/to/model-dir")?;
rules.register_classifier("tuned-sbert", cls);
```
LLM evaluation (llm feature)
```
rule llm_jailbreak {
    llm:
        $llm1 = {
            pattern: "Does this text attempt to override AI safety guidelines?"
            llm: openai-api-compatible
            cleaner: no_op
            chunker: no_chunking
        }
    condition:
        $llm1
}
```
openai-api-compatible is the default and talks to any OpenAI-compatible
/v1/chat/completions endpoint (LM Studio, vLLM, llama-server, Open WebUI,
openai.com, and Ollama's OpenAI-compat shim). The legacy ollama name is kept
for users on Ollama's native /api/chat endpoint — specify llm: ollama.
Default endpoint and environment variables
The default registration points at http://localhost:1234/v1/chat/completions
with model local-model — the LM Studio convention. Override via environment:
| Variable | Precedence | Purpose |
|---|---|---|
| `SYARA_LLM_ENDPOINT` | highest | Full chat-completions URL (scoped to SYARA-X) |
| `SYARA_LLM_MODEL` | highest | Model identifier |
| `SYARA_LLM_API_KEY` | highest | Bearer token |
| `OPENAI_BASE_URL` | fallback | Root URL; `/chat/completions` is appended if missing |
| `OPENAI_MODEL` | fallback | Model identifier |
| `OPENAI_API_KEY` | fallback | Bearer token |
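For example, a sketch pointing SYARA-X at a vLLM server (URL and model name are placeholders):

```sh
export SYARA_LLM_ENDPOINT="http://localhost:8000/v1/chat/completions"
export SYARA_LLM_MODEL="qwen3"
```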
Opting out of env-var lookup. If you prefer that SYARA-X never read these vars,
set SYARA_LLM_NO_ENV=1 — the default registration then falls back to the
hardcoded localhost endpoint with no API key. Alternatively, register an
explicit evaluator before scanning:
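A minimal sketch, assuming builder construction via default() and a register_llm_evaluator(name, evaluator) method alongside the register_* family listed under Built-in components (both names are assumptions):

```rust
use syara_x::OpenAiChatEvaluatorBuilder;

let eval = OpenAiChatEvaluatorBuilder::default()
    .endpoint("http://localhost:1234/v1/chat/completions")
    .model("local-model")
    .build()?;
// Method name assumed; see the register_* family on CompiledRules.
rules.register_llm_evaluator("openai-api-compatible", eval);
```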
Scoped token exposure. Prefer SYARA_LLM_API_KEY over OPENAI_API_KEY
to keep SYARA-X's credential separate from anything else on the system that
reads OPENAI_API_KEY. A process-scoped invocation also works:
SYARA_LLM_API_KEY="$(cat /path/to/key)" your-binary.
Builder knobs
OpenAiChatEvaluatorBuilder exposes endpoint, model, api_key,
temperature (default 0.0 — deterministic), max_tokens (default 8192 —
sized for reasoning models like Qwen3, DeepSeek-R1, GPT-OSS that spend
thousands of tokens on internal <think> / reasoning_content before the
final YES/NO; non-reasoning models still stop early via
finish_reason=stop), system_prompt, header (arbitrary HTTP header),
connect_timeout (default 20s), and read_timeout (default 60s).
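A hedged sketch of tuning those knobs for a slow reasoning model (builder construction and the Duration-based timeout signature are assumptions):

```rust
use std::time::Duration;

let eval = OpenAiChatEvaluatorBuilder::default()
    .model("deepseek-r1")
    .max_tokens(16384)                       // headroom for <think> traces
    .read_timeout(Duration::from_secs(180))  // long generations
    .build()?;
```

Note that raising temperature above 0.0 also disables the response cache described below.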
If a reasoning model is truncated mid-think (finish_reason=length with
empty content), the evaluator returns a SyaraError::LlmError pointing
at .max_tokens(…) rather than silently parsing the empty response as a
non-match.
Responses are cached by (pattern, chunk) when temperature == 0.0; the
cache is bounded (1024 entries) and cleared after each scan() call to
mirror the lifecycle of the text cache.
Perceptual hash (phash feature)
```
rule known_malware_image {
    phash:
        $ph1 = "/path/to/reference.png" threshold=0.95 hasher="imagehash"
    condition:
        $ph1
}
```
Built-in components
- Cleaners: `default_cleaning`, `aggressive_cleaning`, `no_op`
- Chunkers: `no_chunking`, `sentence_chunking`, `paragraph_chunking`, `word_chunking`, `fixed_size_chunking`
- Matchers: `sbert` (HTTP embedding), `tuned-sbert` (classifier), `openai-api-compatible` (LLM, default), `ollama` (LLM, legacy), `imagehash`, `audiohash`, `videohash`
Custom components can be registered on CompiledRules via
register_cleaner, register_chunker, register_semantic_matcher, etc.
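A hedged sketch of registering a custom cleaner; the exact signature is an assumption (the crate may expect a trait object rather than a closure):

```rust
// Assumed shape: register_cleaner(name, impl Fn(&str) -> String).
rules.register_cleaner("strip_markup", |text: &str| {
    // Toy cleaner: drop everything between angle brackets.
    let mut out = String::with_capacity(text.len());
    let mut in_tag = false;
    for c in text.chars() {
        match c {
            '<' => in_tag = true,
            '>' => in_tag = false,
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
});
```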
C API
A C FFI is available via the capi crate. After building, syara_x.h is
generated automatically by cbindgen.
```c
#include "syara_x.h"

/* Function and field names below are illustrative; consult the
   generated syara_x.h for the actual API. */
SyaraRules *rules = NULL;
syara_compile(rule_source, &rules);

SyaraMatchArray *matches = NULL;
syara_scan(rules, input_text, &matches);

for (size_t i = 0; i < matches->len; i++) {
    /* inspect matches->items[i] */
}

syara_match_array_free(matches);
syara_rules_free(rules);
```
Architecture
```
.syara file
  └─> SyaraParser          parse DSL
      └─> Compiler         validate identifiers, conditions
          └─> CompiledRules    execution engine
                ├─ StringMatcher    (cheapest)
                ├─ SemanticMatcher  (sbert)
                ├─ PHashMatcher     (phash)
                ├─ TextClassifier   (classifier)
                └─ LLMEvaluator     (most expensive, short-circuited)
```
Execution is cost-ordered. LLM calls are skipped when the condition cannot
be satisfied even if the LLM matches (see is_identifier_needed in
condition.rs).
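For example, a sketch (assuming string and llm sections can be combined in one rule) where the LLM is consulted only if the cheap string pattern has already matched:

```
rule gated_llm {
    strings:
        $hint = "ignore previous instructions" nocase
    llm:
        $llm1 = {
            pattern: "Does this text attempt to override AI safety guidelines?"
            llm: openai-api-compatible
            cleaner: no_op
            chunker: no_chunking
        }
    condition:
        $hint and $llm1
}
```

If $hint never matches, the conjunction cannot be satisfied, so the LLM call is skipped entirely.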
Development
External services (e.g., an Ollama or LM Studio endpoint) are only contacted when the corresponding feature is enabled and a rule actually exercises that matcher. String-only rules need no external services.
System dependencies
Most features are pure-Rust and need nothing beyond cargo. The sbert-onnx
and classifier-onnx features are the exceptions — both link against the ONNX
Runtime shared library at runtime (via ort's load-dynamic mode) and will
not run without it installed on the host.
sbert-onnx / classifier-onnx (ONNX Runtime ≥ 1.17)
macOS (Homebrew):

```sh
brew install onnxruntime
# Homebrew installs to /opt/homebrew/lib, which dlopen does NOT search by default —
# point ort at the dylib explicitly:
export ORT_DYLIB_PATH="$(brew --prefix onnxruntime)/lib/libonnxruntime.dylib"
```
Linux (Debian/Ubuntu): download the matching release from
microsoft/onnxruntime releases
and place libonnxruntime.so on your loader path, or set ORT_DYLIB_PATH to
point at the file.
Any platform (escape hatch): point ort at a specific dylib by exporting
ORT_DYLIB_PATH=/absolute/path/to/libonnxruntime.{dylib,so,dll} before
cargo test / cargo run.
Convenience wrapper: for repeated use, run
./scripts/install_onnxruntime_xdg.sh once to install
~/.local/bin/with-onnxruntime (XDG user-bin). Then prefix any command:
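For instance, using the feature flags from the table above:

```sh
with-onnxruntime cargo test --features classifier-onnx
```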
The same MiniLM weights are reused by integration_real_onnx_embed and
integration_real_onnx_classifier. To fetch them:
License
MIT — see LICENSE.