simdutf8-cli 0.2.7

SIMD-accelerated UTF-8 validation CLI built on the simdutf8 crate, with hardened path handling.

simdutf8-cli


A small, security-hardened command-line front-end for the simdutf8 crate (SIMD-accelerated UTF-8 validation). It tells you, quickly, whether files (or standard input) contain well-formed UTF-8, and where the first error is when they do not.

The project doubles as a worked example of:

  • Reusing an upstream crate's own test-suite verbatim: the simdutf8 unit tests run unchanged against this binary's dependency tree.
  • Hardened path handling (path traversal, symlink-escape, TOCTOU, and resource-exhaustion defences) in src/path_security.rs.
  • Structured output β€” SARIF 2.1.0 and Markdown for CI ingestion.
  • Test-driven development and fuzzing.

Code health (rust-doctor)

Measured with rust-doctor 0.1.20:

100 / 100  Great
Security 100 · Reliability 100 · Maintainability 99 · Performance 100 · Dependencies 100

clippy.toml and deny.toml were migrated to the current tool schemas, so cargo clippy and cargo deny check both run cleanly and the crate scores a perfect 100. (rust-doctor additionally force-warns a few pedantic # Panics/# Errors doc lints on test/internal code; these are allowed by project policy and do not affect the score.)

Reproduce:

rust-doctor .          # full report
rust-doctor --score    # bare integer for CI

Features

  • Validates one or many files, or standard input (- / no arguments).
  • Reports the exact byte offset and nature of the first UTF-8 error (via simdutf8::compat, matching std::str::from_utf8).
  • Four output formats: text, json, sarif (SARIF 2.1.0, strict-validated), and markdown (GitHub-Flavored, derived from the SARIF).
  • Auto-generates validated report.sarif + report.md (toggle with --no-report, redirect with --output-dir).
  • Optional confinement to a --base-dir, symlink rejection, and a configurable size cap.
  • No unsafe code in this crate (#![forbid(unsafe_code)]).
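
The first-error report can be illustrated with the standard library alone, since simdutf8::compat is described above as matching std::str::from_utf8. The sketch below is not the crate's code: the `Verdict` type and `validate` function are hypothetical names, and it calls std rather than simdutf8 so that it has no dependencies.

```rust
use std::str;

/// Hypothetical result type; the field names follow the JSON output of the CLI.
#[derive(Debug, PartialEq)]
enum Verdict {
    Valid,
    /// `valid_up_to` is the byte offset of the first error. `error_len` is
    /// `None` when the input ends in the middle of a multi-byte sequence.
    Invalid { valid_up_to: usize, error_len: Option<usize> },
}

/// Validates `bytes` with std (assumption: simdutf8::compat reports the same
/// offsets, as the feature list states).
fn validate(bytes: &[u8]) -> Verdict {
    match str::from_utf8(bytes) {
        Ok(_) => Verdict::Valid,
        Err(e) => Verdict::Invalid {
            valid_up_to: e.valid_up_to(),
            error_len: e.error_len(),
        },
    }
}

fn main() {
    // Plain ASCII is valid UTF-8.
    println!("{:?}", validate(b"plain ascii"));
    // A UTF-16 LE BOM starts with 0xFF, which can never appear in UTF-8.
    println!("{:?}", validate(&[0xFF, 0xFE, b'h', 0x00]));
    // "abc" followed by the first two bytes of a three-byte sequence.
    println!("{:?}", validate(&[b'a', b'b', b'c', 0xE2, 0x82]));
}
```

The truncated case yields `valid_up_to: 3` with no `error_len`, which is the shape the JSON example in this README shows for truncated_utf8.bin.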

Build & install

cargo build --release      # -> target/release/simdutf8-cli
cargo install --path .     # -> ~/.cargo/bin/simdutf8-cli

Requires Rust ≥ 1.93 (edition 2021).

Usage

simdutf8-cli [OPTIONS] [PATH]...

Arguments:
  [PATH]...   Files or directories. Directories are walked recursively.
              Use `-` or pass none to read standard input.

Options:
      --exclude <GLOB>      Exclude paths matching this glob when walking dirs
                            (repeatable, gitignore syntax)
      --no-ignore           Don't respect .gitignore / .ignore when walking
                            (they are respected by default)
      --hidden              Include hidden files/dirs when walking (skipped by default)
      --format <FORMAT>     text | json | sarif | markdown        [default: text]
      --base-dir <DIR>      Confine inputs to this directory (rejects traversal & symlink escapes)
      --no-follow-symlinks  Reject symbolic links instead of resolving them
      --max-size <BYTES>    Maximum bytes to read per input        [default: 67108864]
      --output-dir <DIR>    Directory for report.sarif / report.md [default: .]
      --no-report           Do not auto-generate the report files
  -q, --quiet               Suppress stdout; rely on the exit code only
  -h, --help                Print help
  -V, --version             Print version

Directory walking

A directory argument is walked recursively. By default the walker respects .gitignore / .ignore (even outside a git repo), skips hidden files, and does not follow symlinked directories. Tune it with:

simdutf8-cli src/                       # walk, honouring .gitignore
simdutf8-cli --no-ignore src/           # walk without applying .gitignore / .ignore rules
simdutf8-cli --hidden src/              # also include hidden files and directories
simdutf8-cli --exclude '*.min.js' --exclude vendor docs/   # skip globs / dirs

Explicitly named files are always validated; ignore rules apply only while walking directories. Each discovered file is still opened and read through the hardened PathPolicy.

Exit codes

code  meaning
0     every input was valid UTF-8
1     every input was readable, but at least one was invalid
2     at least one input could not be read securely (I/O, policy, report)
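
One way to read the exit codes is "the worst per-input outcome wins". The sketch below assumes that reading; `Outcome` and `exit_status` are hypothetical names, and treating an empty input list as 0 is also an assumption.

```rust
/// Per-input outcome, declared from best to worst so that the derived `Ord`
/// ranks `Unreadable` above `Invalid` above `Valid`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Outcome {
    Valid = 0,
    Invalid = 1,
    Unreadable = 2,
}

/// Folds all outcomes into one process exit status: the worst one wins.
/// An empty slice maps to 0 (an assumption of this sketch).
fn exit_status(outcomes: &[Outcome]) -> u8 {
    outcomes.iter().copied().max().map_or(0, |worst| worst as u8)
}

fn main() {
    let outcomes = [Outcome::Valid, Outcome::Invalid, Outcome::Valid];
    // A real binary could return std::process::ExitCode::from(status) from main.
    let status = exit_status(&outcomes);
    println!("exit status: {status}");
}
```

In CI this means a status of 1 indicates bad data, while 2 indicates that the scan itself was incomplete.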

Examples

# Validate a directory of files
simdutf8-cli tests/fixtures/*
# OK    tests/fixtures/ascii.txt
# FAIL  tests/fixtures/utf16le_bom.txt: invalid UTF-8 at byte 0 (1 invalid byte)

# JSON (one object per input, in an array)
simdutf8-cli --no-report --format json tests/fixtures/ascii.txt tests/fixtures/truncated_utf8.bin
# [{"path":"tests/fixtures/ascii.txt","valid":true},
#  {"path":"tests/fixtures/truncated_utf8.bin","valid":false,"valid_up_to":3,"error_len":null}]

# SARIF 2.1.0 for code scanning
simdutf8-cli --no-report --format sarif $(git ls-files) > simdutf8.sarif

# From standard input
printf 'grüße 😊' | simdutf8-cli --no-report

# Confine inputs and reject symlinks (e.g. scanning untrusted uploads)
simdutf8-cli --base-dir ./uploads --no-follow-symlinks ./uploads/file.txt
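
The JSON objects shown above could be produced by a renderer along the following lines. This is a minimal sketch, not the crate's renderer: `json_escape` and `render` here are hypothetical helpers written against the output shape in the example, and only the escapes needed for file paths are handled.

```rust
/// Quotes `s` as a JSON string, escaping quotes, backslashes and control characters.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Renders one result object. `error` is `None` for valid input, otherwise
/// `(valid_up_to, error_len)`.
fn render(path: &str, error: Option<(usize, Option<usize>)>) -> String {
    match error {
        None => format!("{{\"path\":{},\"valid\":true}}", json_escape(path)),
        Some((valid_up_to, error_len)) => {
            // A missing error length is emitted as JSON null.
            let len = error_len.map_or_else(|| "null".to_string(), |n| n.to_string());
            format!(
                "{{\"path\":{},\"valid\":false,\"valid_up_to\":{},\"error_len\":{}}}",
                json_escape(path),
                valid_up_to,
                len
            )
        }
    }
}

fn main() {
    println!("{}", render("tests/fixtures/ascii.txt", None));
    println!("{}", render("tests/fixtures/truncated_utf8.bin", Some((3, None))));
}
```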

Output formats & reports

--format controls stdout: text (default), json, sarif, markdown. Independently, unless --no-report is given, every run that produces a finding writes two validated files into --output-dir (default current directory):

  • report.sarif: SARIF 2.1.0 JSON, strict-validated with sarif_rust.
  • report.md: GitHub-Flavored Markdown, derived from that SARIF via sarif-to-md-core.

This follows skills/rust-sarif.md; the report write is scoped through a cap-std capability handle (see skills/rust-path-security.md).

Security model (rust-path-security)

All filesystem access goes through PathPolicy. For each input it:

  1. rejects empty paths and paths with an interior NUL byte;
  2. with --base-dir, rejects .. lexically and verifies the canonicalized path stays inside the canonical base directory (defeats ../ traversal and symlink escapes);
  3. optionally refuses symbolic links (--no-follow-symlinks);
  4. re-checks file type/size on the open descriptor (fstat) as a TOCTOU mitigation;
  5. accepts regular files only;
  6. hard-caps bytes read (--max-size), bounding memory for files and stdin.
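
Step 6 can be sketched with `std::io::Read::take`. The function below is an illustration under assumptions, not the code in src/path_security.rs: the name `read_capped` is borrowed from the fuzz target list, and rejecting oversized input with an error (rather than truncating it) is a choice made for this sketch.

```rust
use std::io::{self, Read};

/// Reads everything from `reader`, but never buffers more than `max + 1` bytes.
/// Returns an error if the source holds more than `max` bytes.
fn read_capped<R: Read>(reader: R, max: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Request one byte beyond the cap so an oversized input can be told apart
    // from one that is exactly `max` bytes long.
    reader.take(max.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > max {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "input exceeds size cap",
        ));
    }
    Ok(buf)
}

fn main() -> io::Result<()> {
    // A byte slice stands in for a file or stdin here.
    let data = read_capped(&b"hello"[..], 64)?;
    println!("read {} bytes", data.len());
    Ok(())
}
```

Because the cap is applied to the reader rather than to a size reported beforehand, it bounds memory for stdin and for files that grow after they were opened.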

A lexical safe_join primitive confines attacker-influenced relative paths, and report writes use a capability-scoped cap-std directory handle. The crate is #![forbid(unsafe_code)] and avoids println! / std::fs::read_to_string / std::fs::write.
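
A lexical join of this kind can be sketched with `std::path::Component`. The signature and the exact rejection rules below are assumptions made for illustration; the crate's own safe_join may differ.

```rust
use std::path::{Component, Path, PathBuf};

/// Joins `untrusted` onto `base` only if it is a relative path made of normal
/// components. Purely lexical: it does not touch the filesystem, so it does
/// not detect symlinks.
fn safe_join(base: &Path, untrusted: &str) -> Option<PathBuf> {
    if untrusted.is_empty() || untrusted.contains('\0') {
        return None;
    }
    let mut joined = base.to_path_buf();
    for component in Path::new(untrusted).components() {
        match component {
            Component::Normal(part) => joined.push(part),
            // `.` is harmless and simply skipped.
            Component::CurDir => {}
            // `..`, a root, or a Windows prefix could step outside `base`.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(joined)
}

fn main() {
    let base = Path::new("/srv/uploads");
    println!("{:?}", safe_join(base, "a/b.txt"));
    println!("{:?}", safe_join(base, "../etc/passwd"));
}
```

Since the check is lexical only, it complements the canonicalization step described above rather than replacing it.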

Example encoding files

examples/generate_fixtures.rs produces files in different encodings (committed under tests/fixtures/):

file                    encoding                  valid UTF-8?
ascii.txt               ASCII                     ✅
utf8_multilingual.txt   UTF-8 (CJK, emoji, …)     ✅
utf8_bom.txt            UTF-8 with BOM            ✅
utf16le_bom.txt         UTF-16 LE + BOM           ❌
utf16be_bom.txt         UTF-16 BE + BOM           ❌
utf32le_bom.bin         UTF-32 LE + BOM           ❌
latin1.txt              ISO-8859-1 / Latin-1      ❌
truncated_utf8.bin      UTF-8, truncated tail     ❌
lone_continuation.bin   stray continuation byte   ❌

cargo run --example generate_fixtures           # -> ./tests/fixtures
cargo run --example validate_bytes              # library-usage demo

Testing

cargo test                                      # unit + integration + upstream tests
cargo test --features public_imp                # also exercise the low-level SIMD impls
RUSTFLAGS="-C target-cpu=native" cargo test --features public_imp

Four layers: unit tests (TDD, in each src/*.rs), tests/cli.rs (black-box binary tests via assert_cmd), tests/proptest_validation.rs (property-based invariants over validation, reporting and path handling, via proptest), and tests/upstream_tests.rs (the simdutf8 crate's own suite, vendored verbatim, dual-licensed Apache-2.0 OR MIT).

Fuzzing

The fuzz/ directory is a standalone cargo-fuzz (libFuzzer) workspace (nightly required):

cargo +nightly fuzz build           # add --jobs 2 on low-memory hosts
cargo +nightly fuzz run validate_vs_std -- -max_total_time=60

Eight targets: validate_vs_std (differential vs std), validate_prefix (reported valid_up_to is a real UTF-8 boundary; validation is deterministic), read_capped, json_escape, render_blocks (text/JSON stdout renderers), sarif_build, sarif_markdown (SARIF → Markdown rendering), safe_join (path traversal).
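
The invariants named for validate_prefix can be written as an ordinary function, which a cargo-fuzz target would then call on each generated input. The sketch below is a stand-in, not the code in fuzz/: it uses std::str::from_utf8 in place of simdutf8 so that it has no dependencies, `prefix_invariant_holds` is a hypothetical name, and the libFuzzer wiring is omitted.

```rust
use std::str;

/// Returns true when validation of `data` is deterministic and, on error, the
/// reported `valid_up_to` marks the end of a prefix that is itself valid UTF-8.
fn prefix_invariant_holds(data: &[u8]) -> bool {
    let first = str::from_utf8(data);
    let second = str::from_utf8(data);
    // Determinism: validating the same bytes twice gives the same answer.
    if first != second {
        return false;
    }
    match first {
        Ok(_) => true,
        Err(e) => {
            let cut = e.valid_up_to();
            // The error offset lies inside the input and the prefix is valid.
            let prefix_ok = cut < data.len() && str::from_utf8(&data[..cut]).is_ok();
            // A reported error length never runs past the end of the input.
            let len_ok = e.error_len().map_or(true, |len| cut + len <= data.len());
            prefix_ok && len_ok
        }
    }
}

fn main() {
    let samples: [&[u8]; 3] = [b"hello", &[0xFF, 0xFE], &[b'a', 0xE2, 0x82]];
    for sample in samples {
        println!("{:?} -> {}", sample, prefix_invariant_holds(sample));
    }
}
```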

Dependencies

Runtime: simdutf8, clap, thiserror, sarif_rust, sarif-to-md-core, cap-std, ignore. Dev-only: assert_cmd, predicates, proptest, tempfile, flexpect. All permissively licensed (MIT / Apache-2.0); cargo audit reports no advisories.

License

Apache-2.0. See LICENSE. The vendored upstream test file is additionally available under the simdutf8 authors' MIT license.