# xarf-rs
[](https://github.com/wiggels/xarf-rs/actions/workflows/ci.yml)
[](https://github.com/wiggels/xarf-rs/actions/workflows/audit.yml)
[](https://github.com/wiggels/xarf-rs/actions/workflows/bench.yml)
[](https://codecov.io/gh/wiggels/xarf-rs)
[](https://crates.io/crates/xarf-rs)
[](https://docs.rs/xarf-rs)
[](https://blog.rust-lang.org/)
[](LICENSE)
A Rust implementation of [XARF v4](https://xarf.org/) — the eXtended Abuse
Reporting Format. Parses, validates, and generates abuse reports with a
small, dependency-light API. Compatible with XARF v3 input via automatic
conversion.
```toml
[dependencies]
xarf-rs = "0.1"
```
The crate publishes as `xarf-rs` on crates.io but imports as `xarf`:
```rust
use xarf::{parse, ReportBuilder, Contact, create_evidence};
```
## What XARF is
XARF is a JSON standard for describing abuse incidents (spam, DDoS,
phishing, copyright violations, compromised infrastructure, and so on) in a
machine-readable, schema-validated form. The full v4 specification lives at
[github.com/xarf/xarf-spec](https://github.com/xarf/xarf-spec). This crate
ships the canonical JSON Schemas embedded at compile time — no network or
filesystem access is required at runtime.
## Quick start
```rust
use serde_json::json;
use xarf::{create_evidence, parse, Contact, ReportBuilder};
// Parse an incoming report.
let result = parse(r#"{"xarf_version": "4.2.0", "...": "..."}"#).unwrap();
if result.errors.is_empty() {
let report = result.report.unwrap();
println!("{}/{}", report.category.as_str(), report.type_);
}
// Build a new report programmatically.
let evidence = create_evidence("text/plain", b"original spam payload");
let result = ReportBuilder::new("messaging", "spam", "192.0.2.1")
.reporter(Contact::new("Acme", "abuse@acme.example", "acme.example"))
.sender(Contact::new("Acme", "abuse@acme.example", "acme.example"))
.source_port(25)
.extra("protocol", json!("smtp"))
.extra("smtp_from", json!("spam@bad.example"))
.add_evidence(evidence)
.build()
.unwrap();
```
## API surface
| `parse(json)` | Parse a JSON string. Auto-converts v3 input. |
| `parse_value(value, ...)` | Same, but starts from a `serde_json::Value`. |
| `validate(value, ...)` | Validate without typed deserialization. |
| `ReportBuilder` | Build a report programmatically; auto-fills `report_id`/timestamp. |
| `create_report(...)` | Functional shorthand for `ReportBuilder`. |
| `create_evidence(...)` | Compute hash + base64 + size from raw bytes. |
| `is_v3_report(value)` | Detect XARF v3 format. |
| `convert_v3_to_v4(...)` | Convert v3 → v4 JSON. |
| `Report`, `Contact`, | |
| `Evidence`, `Category` | Typed views over a parsed report. |
## Validation modes
* **Standard** (default): required fields enforced; recommended fields
ignored; unknown fields surface as warnings.
* **Strict** (`ParseOptions { strict: true, .. }`): recommended fields
promoted to required; unknown fields become errors.
* **Show missing optional** (`show_missing_optional: true`): populate
`info` with every absent recommended/optional field — useful for review
tools.
## XARF v3 backward compatibility
`parse()` automatically detects v3 input (`Version: "3"/"3.0"/"3.0.0"` plus
`ReporterInfo` and `Report`), converts it to v4, and surfaces a
deprecation warning. Behavioural parity is held with the reference
[Python implementation](https://github.com/xarf/xarf-python)'s
`v3_compat` module.
## Architecture
* Core fields on `Report` are strongly typed (`Contact`, `Evidence`,
`Category`); category-specific fields flow through a sorted
`BTreeMap<String, serde_json::Value>` for lossless round-tripping and
forward compatibility.
* The 33 type schemas plus the master schema and core schema are bundled
into the binary with `include_str!`.
* Strict-mode validation deep-copies the master schema once at startup and
promotes every `x-recommended: true` property to its parent's `required`
array — matching the algorithm in the v4 implementer's guide.
* Schema reference resolution (`$ref` between type schemas and the core
schema) is satisfied entirely from the embedded documents through a
custom `jsonschema::Retrieve` retriever.
## Performance
The hot paths (parse, validate, build) all run at single-digit µs on
commodity hardware, well inside the XARF v4 spec's `<1ms` typical-report
target:
| `parse` — small spam (~500 B) | ~9 µs | ~68 MiB/s |
| `parse` — DDoS report (~700 B) | ~8 µs | ~68 MiB/s |
| `parse` — large bulk (~40 KB) | ~23 µs | ~1.7 GiB/s |
| `parse` (strict mode) | ~10 µs | ~60 MiB/s |
| `validate` only | ~7 µs | — |
| `convert_v3_to_v4` | ~5 µs | ~100 MiB/s |
| `is_v3_report` detection | ~30 ns | ~16 GiB/s |
| `create_evidence` (SHA-256, 64 KB) | ~39 µs | ~1.6 GiB/s |
| `create_evidence` (MD5, 64 KB) | ~85 µs | ~730 MiB/s |
The compiled `jsonschema::Validator` for the master schema is built once
on first use and cached for the lifetime of the process — concurrent
parses from many tasks share a single immutable validator.
### Regression detection
The crate ships with **three layers** of defense against perf regressions:
1. **Committed perf-budget tests** (`tests/perf_budgets.rs`) — wall-clock
bounds with 5–180× headroom, run as part of `cargo test` on any
contributor's machine. They catch catastrophic regressions (e.g. an
accidentally-uncached schema validator) without any CI setup. Run:
```sh
cargo test --release --test perf_budgets -- --nocapture
```
2. **CI-side benchmark comparison** (`.github/workflows/bench.yml`) — runs
`cargo bench` on every PR, compares against the `main` baseline stored
on a `gh-pages` branch, and **fails the build** if any benchmark
regresses by more than 10%. Posts a comment on the PR with the diff.
Powered by
[`benchmark-action/github-action-benchmark`](https://github.com/benchmark-action/github-action-benchmark).
3. **Local criterion baselines** (`benches/xarf.rs`) — for fine-grained
work on a single machine:
```sh
cargo bench --bench xarf -- --save-baseline main
cargo bench --bench xarf -- --baseline main
```
Criterion writes reports to `target/criterion/` (gitignored, so these
stay local) and flags any operation that regresses by more than 5%.
The committed tests are the universal fallback; the CI workflow is the
authoritative regression gate; the local baselines are for development.
## Sync and async
The public API is synchronous and performs **zero I/O at runtime** — all 34
JSON schemas are embedded into the binary with `include_str!`, so there is
no filesystem or network access to await on. That means:
* Call `parse`, `validate`, `ReportBuilder::build`, etc. directly from any
context — synchronous CLI tools, `tokio` tasks, `async-std`, threads,
anywhere.
* Every public type is `Send + Sync` (verified at compile time with
`static_assertions`), so reports flow across `tokio::spawn` boundaries
and channels without wrapping.
* For very large batches you can push the work off the runtime with
`tokio::task::spawn_blocking(move || xarf::parse(...))` — but for normal
message sizes the parse is microseconds and you don't need to.
See `tests/async_compat.rs` for working examples of multi-threaded tokio
runtimes, `spawn`, `spawn_blocking`, and concurrent registry use.
## Testing
The crate ships with **115 tests** across **eleven test binaries**:
* `golden_parser_tests` — exercises every sample from the shared
[`xarf-parser-tests`](https://github.com/xarf/xarf-parser-tests) corpus
(44 valid + 5 invalid) plus the 32 canonical samples from `xarf-spec`.
* `snapshots` — `insta` snapshot tests covering v3→v4 conversion output,
validation error shape, strict-mode promotion, `show_missing_optional`
output, generator output, and round-trip stability. Run
`cargo install cargo-insta && cargo insta review` to inspect pending
changes.
* `async_compat` — `Send + Sync` compile-time assertions plus `tokio`
integration tests (single + multi-thread runtimes, spawn boundary,
concurrent registry use, `spawn_blocking` batches).
* `parser_tests` — JSON parsing, strict mode, info mode, unknown-field
handling, evidence and tag round-trips.
* `validator_tests` — schema enforcement for every category, format
validation (UUID, email, hostname, datetime), `x-recommended` promotion
in strict mode.
* `generator_tests` — `ReportBuilder` auto-metadata, hash algorithms,
evidence creation, strict-mode generation errors.
* `v3_compat_tests` — detection, conversion, every documented v3
ReportType, error paths.
* `model_tests` — `Category` (un)known, contact/evidence (de)serialization,
internal stripping.
* `round_trip` — parse → mutate → serialize → re-parse for every spec
sample, plus extension-field preservation.
* `smoke`, `api_surface` — sanity checks.
Run them all:
```sh
cargo test
```
## License
MIT — see [LICENSE](LICENSE).