# simdutf8-cli
**🌐 Languages:** **English** · [Deutsch](LIESMICH.md) · [Français](LISEZMOI.md)
A small, security-hardened command-line front-end for the
[`simdutf8`](https://crates.io/crates/simdutf8) crate — SIMD-accelerated UTF-8
validation. It tells you, quickly, whether files (or standard input) contain
well-formed UTF-8, and *where* the first error is when they do not.
The project doubles as a worked example of:
- **Reusing an upstream crate's own test-suite verbatim** — the simdutf8 unit
tests run unchanged against this binary's dependency tree.
- **Hardened path handling** (path traversal, symlink-escape, TOCTOU, and
resource-exhaustion defences) in [`src/path_security.rs`](src/path_security.rs).
- **Structured output** — SARIF 2.1.0 and Markdown for CI ingestion.
- **Test-driven development** and **fuzzing**.
## Code health (rust-doctor)
Measured with [`rust-doctor`](https://crates.io/crates/rust-doctor) `0.1.20`:
```text
100 / 100 Great
Security 100 · Reliability 100 · Maintainability 99 · Performance 100 · Dependencies 100
```
> `clippy.toml` and `deny.toml` were migrated to the current tool schemas, so
> `cargo clippy` and `cargo deny check` both run cleanly and the crate scores a
> perfect 100. (rust-doctor additionally force-warns a few pedantic
> `# Panics`/`# Errors` doc lints on test/internal code; these are allowed by
> project policy and do not affect the score.)
Reproduce:
```sh
rust-doctor . # full report
rust-doctor --score # bare integer for CI
```
## Features
- Validates one or many files, or standard input (`-` / no arguments).
- Reports the exact byte offset and nature of the first UTF-8 error (via
`simdutf8::compat`, matching `std::str::from_utf8`).
- Four output formats: **text**, **json**, **sarif** (SARIF 2.1.0,
strict-validated), and **markdown** (GitHub-Flavored, derived from the SARIF).
- Auto-generates validated `report.sarif` + `report.md` (toggle with
`--no-report`, redirect with `--output-dir`).
- Optional confinement to a `--base-dir`, symlink rejection, and a configurable
size cap.
- No `unsafe` code in this crate (`#![forbid(unsafe_code)]`).
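
The "byte offset and nature" of an error boil down to two values. The following is a minimal sketch using `std::str::from_utf8`, which the list above says `simdutf8::compat` matches. The `FirstError` struct and `first_error` function are hypothetical names for illustration, not this crate's API.

```rust
/// Hypothetical summary of the first UTF-8 error in a buffer.
#[derive(Debug, PartialEq, Eq)]
pub struct FirstError {
    /// Number of leading bytes that are valid UTF-8 (the error offset).
    pub valid_up_to: usize,
    /// `Some(n)`: n invalid bytes at the offset.
    /// `None`: the input ended in the middle of a multi-byte sequence.
    pub error_len: Option<usize>,
}

/// Returns `None` for well-formed UTF-8, otherwise where and how it first fails.
pub fn first_error(bytes: &[u8]) -> Option<FirstError> {
    match std::str::from_utf8(bytes) {
        Ok(_) => None,
        Err(e) => Some(FirstError {
            valid_up_to: e.valid_up_to(),
            error_len: e.error_len(),
        }),
    }
}
```

For the input `b"abc\xE2\x82"` (a three-byte sequence cut short) this yields `valid_up_to: 3` and `error_len: None`, which is why a truncated tail is reported without an invalid-byte count.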
## Build & install
```sh
cargo build --release # -> target/release/simdutf8-cli
cargo install --path . # -> ~/.cargo/bin/simdutf8-cli
```
Requires Rust ≥ 1.93 (edition 2021).
## Usage
```text
simdutf8-cli [OPTIONS] [PATH]...
Arguments:
[PATH]... Files or directories. Directories are walked recursively.
Use `-` or pass none to read standard input.
Options:
--exclude <GLOB> Exclude paths matching this glob when walking dirs
(repeatable, gitignore syntax)
--no-ignore Don't respect .gitignore / .ignore when walking
(they are respected by default)
--hidden Include hidden files/dirs when walking (skipped by default)
--format <FORMAT> text | json | sarif | markdown [default: text]
--base-dir <DIR> Confine inputs to this directory (rejects traversal & symlink escapes)
--no-follow-symlinks Reject symbolic links instead of resolving them
--max-size <BYTES> Maximum bytes to read per input [default: 67108864]
--output-dir <DIR> Directory for report.sarif / report.md [default: .]
--no-report Do not auto-generate the report files
-q, --quiet Suppress stdout; rely on the exit code only
-h, --help Print help
-V, --version Print version
```
### Directory walking
A directory argument is walked recursively. By default the walker **respects
`.gitignore` / `.ignore`** (even outside a git repo), **skips hidden files**, and
**does not follow symlinked directories**. Tune it with:
```sh
simdutf8-cli src/ # walk, honouring .gitignore
simdutf8-cli --no-ignore src/ # walk everything, ignore .gitignore
simdutf8-cli --hidden src/ # also include hidden files and directories
simdutf8-cli --exclude '*.min.js' --exclude vendor docs/ # skip globs / dirs
```
Explicitly named files are always validated; ignore rules apply only while
walking directories. Each discovered file is still opened and read through the
hardened [`PathPolicy`](src/path_security.rs).
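
To illustrate two of the default walking rules (skip hidden entries, never descend into symlinked directories), here is a standard-library-only sketch. It is an assumption-laden stand-in: it does not read `.gitignore` / `.ignore`, it also skips symlinked files, and the function name `walk` is hypothetical. It is not how this tool's walker is implemented.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collects regular files under `dir`, skipping dot-prefixed
/// entries and never following symlinks. Ignore files are NOT consulted.
pub fn walk(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.file_name().to_string_lossy().starts_with('.') {
            continue; // hidden file or directory
        }
        // `DirEntry::file_type` does not traverse symlinks, so a symlinked
        // directory reports as a symlink and is never descended into.
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            walk(&entry.path(), out)?;
        } else if file_type.is_file() {
            out.push(entry.path());
        }
    }
    Ok(())
}
```

The result order follows `read_dir`, which is platform-dependent; sort the vector if a stable order matters.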
### Exit codes
| Code | Meaning |
|------|---------|
| `0` | every input was valid UTF-8 |
| `1` | every input was readable, but at least one was invalid |
| `2` | at least one input could not be read securely (I/O, policy, report) |
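
The precedence implied by these codes (2 wins over 1, which wins over 0) can be sketched as a fold over per-input outcomes. The `Outcome` enum and `exit_code` function are hypothetical names used only for this illustration.

```rust
/// Hypothetical per-input outcome; the names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// Input was read and is well-formed UTF-8.
    Valid,
    /// Input was read but contains invalid UTF-8.
    Invalid,
    /// Input could not be read securely (I/O, policy, report).
    Unreadable,
}

/// Folds all outcomes into one process exit code.
pub fn exit_code(outcomes: &[Outcome]) -> u8 {
    if outcomes.contains(&Outcome::Unreadable) {
        2 // any unreadable input dominates
    } else if outcomes.contains(&Outcome::Invalid) {
        1 // everything was readable, but something was invalid
    } else {
        0 // every input was valid
    }
}
```

A CI script can therefore treat `1` as "content problem" and `2` as "could not check", without parsing stdout.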
### Examples
```sh
# Validate every file matched by a shell glob (output abridged)
simdutf8-cli tests/fixtures/*
# OK tests/fixtures/ascii.txt
# FAIL tests/fixtures/utf16le_bom.txt: invalid UTF-8 at byte 0 (1 invalid byte)
# JSON (one object per input, in an array)
simdutf8-cli --no-report --format json tests/fixtures/ascii.txt tests/fixtures/truncated_utf8.bin
# [{"path":"tests/fixtures/ascii.txt","valid":true},
# {"path":"tests/fixtures/truncated_utf8.bin","valid":false,"valid_up_to":3,"error_len":null}]
# SARIF 2.1.0 for code scanning
simdutf8-cli --no-report --format sarif $(git ls-files) > simdutf8.sarif
# From standard input
cat tests/fixtures/ascii.txt | simdutf8-cli -
# Confine inputs and reject symlinks (e.g. scanning untrusted uploads)
simdutf8-cli --base-dir ./uploads --no-follow-symlinks ./uploads/file.txt
```
## Output formats & reports
`--format` controls stdout: `text` (default), `json`, `sarif`, `markdown`.
Independently, unless `--no-report` is given, every run that produces a finding
writes two validated files into `--output-dir` (default current directory):
- `report.sarif` — SARIF 2.1.0 JSON, strict-validated with `sarif_rust`.
- `report.md` — GitHub-Flavored Markdown, derived from that SARIF via
`sarif-to-md-core`.
This follows `skills/rust-sarif.md`; the report write is scoped through a
`cap-std` capability handle (see `skills/rust-path-security.md`).
## Security model (`rust-path-security`)
All filesystem access goes through [`PathPolicy`](src/path_security.rs). For each
input it:
1. rejects empty paths and paths with an interior NUL byte;
2. with `--base-dir`, rejects `..` lexically **and** verifies the canonicalized
path stays inside the canonical base directory (defeats `../` traversal and
symlink escapes);
3. optionally refuses symbolic links (`--no-follow-symlinks`);
4. re-checks file type/size on the **open descriptor** (`fstat`) — TOCTOU
mitigation;
5. accepts **regular files only**;
6. **hard-caps** bytes read (`--max-size`), bounding memory for files and stdin.
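
Steps 4 to 6 can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `src/path_security.rs`: the function names `read_capped` and `read_regular_file` and the chosen error kinds are hypothetical.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Reads at most `max` bytes from `reader`; errors if more data is available.
pub fn read_capped<R: Read>(reader: R, max: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one extra byte so "exactly max" and "over max" can be told apart.
    reader.take(max.saturating_add(1)).read_to_end(&mut buf)?;
    if (buf.len() as u64) > max {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "input exceeds size cap"));
    }
    Ok(buf)
}

/// Opens `path`, then checks type and size on the open handle, not the path.
pub fn read_regular_file(path: &Path, max: u64) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;
    // `File::metadata` queries the open handle (fstat on Unix), so a path
    // swapped after the open cannot change what is checked here.
    let meta = file.metadata()?;
    if !meta.is_file() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "not a regular file"));
    }
    if meta.len() > max {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file exceeds size cap"));
    }
    // The cap is enforced again while reading, in case the file grows.
    read_capped(file, max)
}
```

Because `read_capped` is generic over `Read`, the same bound applies to standard input, where no size is known up front.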
A lexical [`safe_join`](src/path_security.rs) primitive confines
attacker-influenced relative paths, and report writes use a capability-scoped
`cap-std` directory handle. The crate is `#![forbid(unsafe_code)]` and avoids
`println!` / `std::fs::read_to_string` / `std::fs::write`.
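
A lexical join of this kind might look like the sketch below. The signature is an assumption made for illustration and may differ from the real `safe_join` in `src/path_security.rs`. Being purely lexical, it never touches the filesystem and so cannot detect symlink escapes by itself.

```rust
use std::path::{Component, Path, PathBuf};

/// Joins `rel` onto `base` only if `rel` is a plain relative path.
/// Returns `None` for anything that could step outside `base`.
pub fn safe_join(base: &Path, rel: &str) -> Option<PathBuf> {
    // Reject empty paths and interior NUL bytes up front.
    if rel.is_empty() || rel.contains('\0') {
        return None;
    }
    let mut joined = base.to_path_buf();
    for component in Path::new(rel).components() {
        match component {
            Component::Normal(part) => joined.push(part),
            Component::CurDir => {} // "." is harmless; skip it
            // "..", a root ("/"), or a Windows prefix ("C:") could escape `base`.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(joined)
}
```

Rejecting `..` outright, rather than resolving it, keeps the check simple: `a/../b` is refused even though it would stay inside the base.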
## Example encoding files
[`examples/generate_fixtures.rs`](examples/generate_fixtures.rs) produces files
in different encodings (committed under [`tests/fixtures/`](tests/fixtures/)):
| File | Encoding | Valid UTF-8? |
|------|----------|--------------|
| `ascii.txt` | ASCII | ✅ |
| `utf8_multilingual.txt` | UTF-8 (CJK, emoji, …) | ✅ |
| `utf8_bom.txt` | UTF-8 with BOM | ✅ |
| `utf16le_bom.txt` | UTF-16 LE + BOM | ❌ |
| `utf16be_bom.txt` | UTF-16 BE + BOM | ❌ |
| `utf32le_bom.bin` | UTF-32 LE + BOM | ❌ |
| `latin1.txt` | ISO-8859-1 / Latin-1 | ❌ |
| `truncated_utf8.bin` | UTF-8, truncated tail | ❌ |
| `lone_continuation.bin` | stray continuation byte | ❌ |
```sh
cargo run --example generate_fixtures # -> ./tests/fixtures
cargo run --example validate_bytes # library-usage demo
```
## Testing
```sh
cargo test # unit + integration + upstream tests
cargo test --features public_imp # also exercise the low-level SIMD impls
RUSTFLAGS="-C target-cpu=native" cargo test --features public_imp
```
Three layers: unit tests (TDD, in each `src/*.rs`), `tests/cli.rs` (black-box
binary tests via `assert_cmd`), and `tests/upstream_tests.rs` (the simdutf8
crate's own suite, vendored **verbatim**, dual-licensed Apache-2.0 OR MIT).
## Fuzzing
The [`fuzz/`](fuzz/) directory is a standalone `cargo-fuzz` (libFuzzer) workspace
(nightly required):
```sh
cargo +nightly fuzz build # add --jobs 2 on low-memory hosts
cargo +nightly fuzz run validate_vs_std -- -max_total_time=60
```
Targets: `validate_vs_std` (differential vs `std`), `read_capped`, `json_escape`,
`sarif_build`, `safe_join` (path traversal).
## Documentation
- [User Guide](documentation/user_guide.md)
- [Administrator Guide](documentation/administrator_guide.md)
- [Troubleshooting Guide](documentation/troubleshooting_guide.md)
- [Changelog](CHANGELOG.md)
## Dependencies
Runtime: `simdutf8`, `clap`, `thiserror`, `sarif_rust`, `sarif-to-md-core`,
`cap-std`, `ignore`. Dev-only: `assert_cmd`, `predicates`, `tempfile`, `flexpect`. All
permissively licensed (MIT / Apache-2.0); `cargo audit` reports no advisories.
## License
Apache-2.0. See [LICENSE](LICENSE). The vendored upstream test file is
additionally available under the simdutf8 authors' MIT license.