simdutf8-cli 0.1.6

SIMD-accelerated UTF-8 validation CLI built on the simdutf8 crate, with hardened path handling.
<!--
SPDX-License-Identifier: Apache-2.0
SPDX-FileCopyrightText: 2025,2026 ndaal Gesellschaft für Sicherheit in der Informationstechnik mbH & Co KG, Cologne
SPDX-FileCopyrightText: Author: Pierre Gronau <Pierre.Gronau@ndaal.eu>
-->

# simdutf8-cli

**🌐 Languages:** **English** · [Deutsch](LIESMICH.md) · [Français](LISEZMOI.md)

A small, security-hardened command-line front-end for the
[`simdutf8`](https://crates.io/crates/simdutf8) crate — SIMD-accelerated UTF-8
validation. It tells you, quickly, whether files (or standard input) contain
well-formed UTF-8, and *where* the first error is when they do not.

The project doubles as a worked example of:

- **Reusing an upstream crate's own test-suite verbatim** — the simdutf8 unit
  tests run unchanged against this binary's dependency tree.
- **Hardened path handling** (path traversal, symlink-escape, TOCTOU, and
  resource-exhaustion defences) in [`src/path_security.rs`](src/path_security.rs).
- **Structured output** — SARIF 2.1.0 and Markdown for CI ingestion.
- **Test-driven development** and **fuzzing**.

## Code health (rust-doctor)

Measured with [`rust-doctor`](https://crates.io/crates/rust-doctor) `0.1.20`:

```text
100 / 100  Great
Security 100 · Reliability 100 · Maintainability 99 · Performance 100 · Dependencies 100
```

> `clippy.toml` and `deny.toml` were migrated to the current tool schemas, so
> `cargo clippy` and `cargo deny check` both run cleanly and the crate scores a
> perfect 100. (rust-doctor additionally force-warns a few pedantic
> `# Panics`/`# Errors` doc lints on test/internal code; these are allowed by
> project policy and do not affect the score.)

Reproduce:

```sh
rust-doctor .          # full report
rust-doctor --score    # bare integer for CI
```

## Features

- Validates one or many files, or standard input (`-` / no arguments).
- Reports the exact byte offset and nature of the first UTF-8 error (via
  `simdutf8::compat`, matching `std::str::from_utf8`).
- Four output formats: **text**, **json**, **sarif** (SARIF 2.1.0,
  strict-validated), and **markdown** (GitHub-Flavored, derived from the SARIF).
- Auto-generates validated `report.sarif` + `report.md` (toggle with
  `--no-report`, redirect with `--output-dir`).
- Optional confinement to a `--base-dir`, symlink rejection, and a configurable
  size cap.
- No `unsafe` code in this crate (`#![forbid(unsafe_code)]`).
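
The offset reporting described above can be illustrated with the standard library alone, since `simdutf8::compat` is stated to match `std::str::from_utf8`. This is a minimal sketch, not the crate's code: the `Verdict` type and `validate` function are hypothetical names chosen for illustration, and their field names simply mirror the JSON output.

```rust
use std::str;

/// Hypothetical result type; the field names mirror the CLI's JSON output.
#[derive(Debug, PartialEq)]
enum Verdict {
    Valid,
    Invalid {
        /// Number of leading bytes that are well-formed UTF-8.
        valid_up_to: usize,
        /// Length of the invalid sequence, or `None` if the input ended
        /// in the middle of a character.
        error_len: Option<usize>,
    },
}

/// Validate a byte slice with `std::str::from_utf8` (stand-in for
/// `simdutf8::compat::from_utf8`, which reports the same error details).
fn validate(bytes: &[u8]) -> Verdict {
    match str::from_utf8(bytes) {
        Ok(_) => Verdict::Valid,
        Err(e) => Verdict::Invalid {
            valid_up_to: e.valid_up_to(),
            error_len: e.error_len(),
        },
    }
}

fn main() {
    // "abc" followed by the first two bytes of a three-byte sequence.
    println!("{:?}", validate(b"abc\xE2\x82"));
    // UTF-16 LE byte-order mark: 0xFF can never appear in UTF-8.
    println!("{:?}", validate(b"\xFF\xFEh\x00"));
    // Well-formed multi-byte text.
    println!("{:?}", validate("grüße 😊".as_bytes()));
}
```

A truncated tail yields `error_len: None`, which is why such inputs show `"error_len":null` in the JSON output, while a stray byte such as `0xFF` yields `Some(1)`.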

## Build & install

```sh
cargo build --release      # -> target/release/simdutf8-cli
cargo install --path .     # -> ~/.cargo/bin/simdutf8-cli
```

Requires Rust ≥ 1.93 (edition 2021).

## Usage

```text
simdutf8-cli [OPTIONS] [PATH]...

Arguments:
  [PATH]...   Files or directories. Directories are walked recursively.
              Use `-` or pass none to read standard input.

Options:
      --exclude <GLOB>      Exclude paths matching this glob when walking dirs
                            (repeatable, gitignore syntax)
      --no-ignore           Don't respect .gitignore / .ignore when walking
                            (they are respected by default)
      --hidden              Include hidden files/dirs when walking (skipped by default)
      --format <FORMAT>     text | json | sarif | markdown        [default: text]
      --base-dir <DIR>      Confine inputs to this directory (rejects traversal & symlink escapes)
      --no-follow-symlinks  Reject symbolic links instead of resolving them
      --max-size <BYTES>    Maximum bytes to read per input        [default: 67108864]
      --output-dir <DIR>    Directory for report.sarif / report.md [default: .]
      --no-report           Do not auto-generate the report files
  -q, --quiet               Suppress stdout; rely on the exit code only
  -h, --help                Print help
  -V, --version             Print version
```

### Directory walking

A directory argument is walked recursively. By default the walker **respects
`.gitignore` / `.ignore`** (even outside a git repo), **skips hidden files**, and
**does not follow symlinked directories**. Tune it with:

```sh
simdutf8-cli src/                       # walk, honouring .gitignore
simdutf8-cli --no-ignore src/           # walk everything, ignore .gitignore
simdutf8-cli --hidden src/              # also include hidden files and directories
simdutf8-cli --exclude '*.min.js' --exclude vendor docs/   # skip globs / dirs
```

Explicitly named files are always validated; ignore rules apply only while
walking directories. Each discovered file is still opened and read through the
hardened [`PathPolicy`](src/path_security.rs).

### Exit codes

| code | meaning |
|------|---------|
| `0`  | every input was valid UTF-8 |
| `1`  | every input was readable, but at least one was invalid |
| `2`  | at least one input could not be read securely (I/O, policy, report) |
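
The precedence implied by the table (code `2` wins over `1`, which wins over `0`) can be sketched as follows. The `Outcome` enum and `exit_code` function are hypothetical names for illustration, and the sketch models per-input results only; report-writing failures, which the table also maps to `2`, are left out.

```rust
/// Hypothetical per-input result used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    /// Input was read and is well-formed UTF-8.
    Valid,
    /// Input was read but contains invalid UTF-8.
    Invalid,
    /// Input could not be read securely (I/O or policy error).
    Unreadable,
}

/// Fold all per-input outcomes into a single process exit code.
fn exit_code(outcomes: &[Outcome]) -> u8 {
    if outcomes.contains(&Outcome::Unreadable) {
        2 // any unreadable input dominates
    } else if outcomes.contains(&Outcome::Invalid) {
        1 // everything was readable, but something was invalid
    } else {
        0 // every input was valid
    }
}

fn main() {
    let run = [Outcome::Valid, Outcome::Invalid, Outcome::Valid];
    println!("exit code: {}", exit_code(&run));
}
```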

### Examples

```sh
# Validate a directory of files
simdutf8-cli tests/fixtures/*
# OK    tests/fixtures/ascii.txt
# FAIL  tests/fixtures/utf16le_bom.txt: invalid UTF-8 at byte 0 (1 invalid byte)

# JSON (one object per input, in an array)
simdutf8-cli --no-report --format json tests/fixtures/ascii.txt tests/fixtures/truncated_utf8.bin
# [{"path":"tests/fixtures/ascii.txt","valid":true},
#  {"path":"tests/fixtures/truncated_utf8.bin","valid":false,"valid_up_to":3,"error_len":null}]

# SARIF 2.1.0 for code scanning
simdutf8-cli --no-report --format sarif $(git ls-files) > simdutf8.sarif

# From standard input
printf 'grüße 😊' | simdutf8-cli --no-report

# Confine inputs and reject symlinks (e.g. scanning untrusted uploads)
simdutf8-cli --base-dir ./uploads --no-follow-symlinks ./uploads/file.txt
```
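
To make the JSON shape in the example above concrete, here is a minimal sketch of how one record could be rendered by hand. It is an assumption that the records are built this way; `json_record` is a hypothetical helper, and `json_escape` here is only a guess at what a function of that name might do. Paths are taken as `&str` for simplicity, although real paths on Unix need not be valid UTF-8.

```rust
/// Escape a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one record; `error` carries `(valid_up_to, error_len)` when invalid.
fn json_record(path: &str, error: Option<(usize, Option<usize>)>) -> String {
    let path = json_escape(path);
    match error {
        None => format!("{{\"path\":\"{path}\",\"valid\":true}}"),
        Some((valid_up_to, error_len)) => {
            // A missing error length is emitted as JSON `null`.
            let len = error_len.map_or_else(|| "null".to_string(), |n| n.to_string());
            format!(
                "{{\"path\":\"{path}\",\"valid\":false,\"valid_up_to\":{valid_up_to},\"error_len\":{len}}}"
            )
        }
    }
}

fn main() {
    println!("{}", json_record("tests/fixtures/ascii.txt", None));
    println!("{}", json_record("tests/fixtures/truncated_utf8.bin", Some((3, None))));
}
```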

## Output formats & reports

`--format` controls stdout: `text` (default), `json`, `sarif`, `markdown`.
Independently, unless `--no-report` is given, every run that produces a finding
writes two validated files into `--output-dir` (default current directory):

- `report.sarif` — SARIF 2.1.0 JSON, strict-validated with `sarif_rust`.
- `report.md` — GitHub-Flavored Markdown, derived from that SARIF via
  `sarif-to-md-core`.

This follows `skills/rust-sarif.md`; the report write is scoped through a
`cap-std` capability handle (see `skills/rust-path-security.md`).

## Security model (`rust-path-security`)

All filesystem access goes through [`PathPolicy`](src/path_security.rs). For each
input it:

1. rejects empty paths and paths with an interior NUL byte;
2. with `--base-dir`, rejects `..` lexically **and** verifies the canonicalized
   path stays inside the canonical base directory (defeats `../` traversal and
   symlink escapes);
3. optionally refuses symbolic links (`--no-follow-symlinks`);
4. re-checks file type/size on the **open descriptor** (`fstat`) — TOCTOU
   mitigation;
5. accepts **regular files only**;
6. **hard-caps** bytes read (`--max-size`), bounding memory for files and stdin.
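
Steps 4 to 6 can be sketched with the standard library as shown below. This is an illustration under stated assumptions rather than the code in `src/path_security.rs`: the `read_capped` signature and the error kinds are hypothetical, and the earlier policy checks (NUL bytes, `--base-dir`, symlinks) are omitted.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Open `path`, check type and size on the open handle, then read at most
/// `max_size` bytes. Signature and error kinds are assumptions.
fn read_capped(path: &Path, max_size: u64) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;

    // Metadata is taken from the open handle (fstat on Unix), so the checks
    // apply to the object that will actually be read, not to whatever the
    // path points at a moment later.
    let meta = file.metadata()?;
    if !meta.is_file() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "not a regular file"));
    }
    if meta.len() > max_size {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "input exceeds size cap"));
    }

    // Read one byte beyond the cap so a file that grows after the size
    // check is still detected, while memory use stays bounded.
    let mut buf = Vec::new();
    file.take(max_size.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > max_size {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "input exceeds size cap"));
    }
    Ok(buf)
}

fn main() -> io::Result<()> {
    // Demo setup only: create a small temporary file to read back.
    let path = std::env::temp_dir().join("read_capped_demo.txt");
    std::fs::write(&path, b"hello")?;
    println!("read {} bytes", read_capped(&path, 64)?.len());
    std::fs::remove_file(&path)?;
    Ok(())
}
```

The same `take`-based cap works for standard input, where no size is known in advance.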

A lexical [`safe_join`](src/path_security.rs) primitive confines
attacker-influenced relative paths, and report writes use a capability-scoped
`cap-std` directory handle. The crate is `#![forbid(unsafe_code)]` and avoids
`println!` / `std::fs::read_to_string` / `std::fs::write`.
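
A purely lexical join of the kind described above might look like the following sketch. The name `safe_join` comes from this README, but the signature and the exact rejection rules are assumptions made for illustration. Because the check never touches the filesystem, it does not detect symlinks; that is what the canonicalization step under `--base-dir` is for.

```rust
use std::path::{Component, Path, PathBuf};

/// Join an untrusted relative path onto `base`, refusing anything that could
/// step outside it. Hypothetical signature; returns `None` on rejection.
fn safe_join(base: &Path, untrusted: &str) -> Option<PathBuf> {
    // Empty paths and interior NUL bytes are rejected up front.
    if untrusted.is_empty() || untrusted.contains('\0') {
        return None;
    }
    let mut out = base.to_path_buf();
    for component in Path::new(untrusted).components() {
        match component {
            // Ordinary path segments are appended.
            Component::Normal(part) => out.push(part),
            // `.` changes nothing and is skipped.
            Component::CurDir => {}
            // `..`, an absolute root, or a Windows prefix could leave `base`.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(out)
}

fn main() {
    let base = Path::new("uploads");
    println!("{:?}", safe_join(base, "a/b.txt"));
    println!("{:?}", safe_join(base, "../etc/passwd"));
    println!("{:?}", safe_join(base, "/etc/passwd"));
}
```

Rejecting every `..` component, even one that would stay inside the base after normalization, keeps the rule simple to audit.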

## Example encoding files

[`examples/generate_fixtures.rs`](examples/generate_fixtures.rs) produces files
in different encodings (committed under [`tests/fixtures/`](tests/fixtures/)):

| file | encoding | valid UTF-8? |
|------|----------|:------------:|
| `ascii.txt` | ASCII | yes |
| `utf8_multilingual.txt` | UTF-8 (CJK, emoji, …) | yes |
| `utf8_bom.txt` | UTF-8 with BOM | yes |
| `utf16le_bom.txt` | UTF-16 LE + BOM | no |
| `utf16be_bom.txt` | UTF-16 BE + BOM | no |
| `utf32le_bom.bin` | UTF-32 LE + BOM | no |
| `latin1.txt` | ISO-8859-1 / Latin-1 | no |
| `truncated_utf8.bin` | UTF-8, truncated tail | no |
| `lone_continuation.bin` | stray continuation byte | no |

```sh
cargo run --example generate_fixtures           # -> ./tests/fixtures
cargo run --example validate_bytes              # library-usage demo
```

## Testing

```sh
cargo test                                      # unit + integration + upstream tests
cargo test --features public_imp                # also exercise the low-level SIMD impls
RUSTFLAGS="-C target-cpu=native" cargo test --features public_imp
```

Three layers: unit tests (TDD, in each `src/*.rs`), `tests/cli.rs` (black-box
binary tests via `assert_cmd`), and `tests/upstream_tests.rs` (the simdutf8
crate's own suite, vendored **verbatim**, dual-licensed Apache-2.0 OR MIT).

## Fuzzing

The [`fuzz/`](fuzz/) directory is a standalone `cargo-fuzz` (libFuzzer) workspace
(nightly required):

```sh
cargo +nightly fuzz build           # add --jobs 2 on low-memory hosts
cargo +nightly fuzz run validate_vs_std -- -max_total_time=60
```

Targets: `validate_vs_std` (differential vs `std`), `read_capped`, `json_escape`,
`sarif_build`, `safe_join` (path traversal).

## Documentation

- [User Guide](documentation/user_guide.md)
- [Administrator Guide](documentation/administrator_guide.md)
- [Troubleshooting Guide](documentation/troubleshooting_guide.md)
- [Changelog](CHANGELOG.md)

## Dependencies

Runtime: `simdutf8`, `clap`, `thiserror`, `sarif_rust`, `sarif-to-md-core`,
`cap-std`, `ignore`. Dev-only: `assert_cmd`, `predicates`, `tempfile`,
`flexpect`. All are permissively licensed (MIT / Apache-2.0); `cargo audit`
reports no advisories.

## License

Apache-2.0. See [LICENSE](LICENSE). The vendored upstream test file is
additionally available under the simdutf8 authors' MIT license.