simdutf8-cli 0.1.6

SIMD-accelerated UTF-8 validation CLI built on the simdutf8 crate, with hardened path handling.

A small, security-hardened command-line front-end for the simdutf8 crate — SIMD-accelerated UTF-8 validation. It tells you, quickly, whether files (or standard input) contain well-formed UTF-8, and where the first error is when they do not.

The project doubles as a worked example of:

  • Reusing an upstream crate's own test-suite verbatim — the simdutf8 unit tests run unchanged against this binary's dependency tree.
  • Hardened path handling (path traversal, symlink-escape, TOCTOU, and resource-exhaustion defences) in src/path_security.rs.
  • Structured output — SARIF 2.1.0 and Markdown for CI ingestion.
  • Test-driven development and fuzzing.

Code health (rust-doctor)

Measured with rust-doctor 0.1.20:

100 / 100  Great
Security 100 · Reliability 100 · Maintainability 99 · Performance 100 · Dependencies 100

clippy.toml and deny.toml were migrated to the current tool schemas, so cargo clippy and cargo deny check both run cleanly and the crate scores a perfect 100. (rust-doctor additionally force-warns a few pedantic # Panics/# Errors doc lints on test/internal code; these are allowed by project policy and do not affect the score.)

Reproduce:

rust-doctor .          # full report
rust-doctor --score    # bare integer for CI

Features

  • Validates one or many files, or standard input (- / no arguments).
  • Reports the exact byte offset and nature of the first UTF-8 error (via simdutf8::compat, matching std::str::from_utf8).
  • Four output formats: text, json, sarif (SARIF 2.1.0, strict-validated), and markdown (GitHub-Flavored, derived from the SARIF).
  • Auto-generates validated report.sarif + report.md (toggle with --no-report, redirect with --output-dir).
  • Optional confinement to a --base-dir, symlink rejection, and a configurable size cap.
  • No unsafe code in this crate (#![forbid(unsafe_code)]).
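
The error report above can be sketched with the standard library alone. This is a minimal illustration, not the tool's code: it calls std::str::from_utf8, which the feature list says simdutf8::compat matches, and the names Validation and validate are hypothetical.

```rust
use std::str;

/// Outcome of validating one input (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum Validation {
    Valid,
    Invalid {
        /// Number of leading bytes that are well-formed UTF-8,
        /// i.e. the byte offset of the first error.
        valid_up_to: usize,
        /// Length of the invalid sequence, or `None` when the input
        /// ends in the middle of a multi-byte character.
        error_len: Option<usize>,
    },
}

/// Validate `bytes` and describe the first error, if any.
/// Uses std here; swapping in a SIMD validator with the same error
/// semantics is the assumption this sketch rests on.
pub fn validate(bytes: &[u8]) -> Validation {
    match str::from_utf8(bytes) {
        Ok(_) => Validation::Valid,
        Err(e) => Validation::Invalid {
            valid_up_to: e.valid_up_to(),
            error_len: e.error_len(),
        },
    }
}
```

A truncated tail such as b"abc\xE2\x82" yields valid_up_to 3 with no error length, while a stray 0xFF at the start yields offset 0 with an error length of 1.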

Build & install

cargo build --release      # -> target/release/simdutf8-cli
cargo install --path .     # -> ~/.cargo/bin/simdutf8-cli

Requires Rust ≥ 1.93 (edition 2021).

Usage

simdutf8-cli [OPTIONS] [PATH]...

Arguments:
  [PATH]...   Files or directories. Directories are walked recursively.
              Use `-` or pass none to read standard input.

Options:
      --exclude <GLOB>      Exclude paths matching this glob when walking dirs
                            (repeatable, gitignore syntax)
      --no-ignore           Don't respect .gitignore / .ignore when walking
                            (they are respected by default)
      --hidden              Include hidden files/dirs when walking (skipped by default)
      --format <FORMAT>     text | json | sarif | markdown        [default: text]
      --base-dir <DIR>      Confine inputs to this directory (rejects traversal & symlink escapes)
      --no-follow-symlinks  Reject symbolic links instead of resolving them
      --max-size <BYTES>    Maximum bytes to read per input        [default: 67108864]
      --output-dir <DIR>    Directory for report.sarif / report.md [default: .]
      --no-report           Do not auto-generate the report files
  -q, --quiet               Suppress stdout; rely on the exit code only
  -h, --help                Print help
  -V, --version             Print version

Directory walking

A directory argument is walked recursively. By default the walker respects .gitignore / .ignore (even outside a git repo), skips hidden files, and does not follow symlinked directories. Tune it with:

simdutf8-cli src/                       # walk, honouring .gitignore
simdutf8-cli --no-ignore src/           # walk without applying .gitignore / .ignore
simdutf8-cli --hidden src/              # also include hidden files and directories
simdutf8-cli --exclude '*.min.js' --exclude vendor docs/   # skip globs / dirs

Explicitly named files are always validated; ignore rules apply only while walking directories. Each discovered file is still opened and read through the hardened PathPolicy.
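
Two of the walking defaults (skipping hidden entries and not following symlinked directories) can be illustrated with the standard library. This is a rough sketch under stated assumptions: the function name walk and its parameters are hypothetical, and it does not apply .gitignore / .ignore rules, which the real walker does.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect regular files under `dir` (hypothetical helper).
/// Hidden entries (names starting with '.') are skipped unless
/// `include_hidden` is set. Ignore files are NOT consulted in this sketch.
pub fn walk(dir: &Path, include_hidden: bool, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name();
        if !include_hidden && name.to_string_lossy().starts_with('.') {
            continue;
        }
        // `DirEntry::file_type` does not traverse symlinks, so a symlinked
        // directory is reported as a symlink and is never descended into.
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            walk(&entry.path(), include_hidden, out)?;
        } else if file_type.is_file() {
            out.push(entry.path());
        }
    }
    Ok(())
}
```

Each path collected this way would still need to pass the same policy checks as an explicitly named file before it is read.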

Exit codes

code  meaning
0     every input was valid UTF-8
1     every input was readable, but at least one was invalid
2     at least one input could not be read securely (I/O, policy, report)
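
The precedence implied by the table (an unreadable input outranks an invalid one) can be written as a small function. This is a sketch only: Outcome and exit_code are hypothetical names, and report-write failures, which also map to code 2, are left out.

```rust
/// Per-input result (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The input was read and is well-formed UTF-8.
    Valid,
    /// The input was read but contains invalid UTF-8.
    Invalid,
    /// The input could not be read securely (I/O or policy error).
    Unreadable,
}

/// Fold per-input outcomes into a process exit code.
/// Code 2 takes precedence over 1, and 1 over 0.
pub fn exit_code(outcomes: &[Outcome]) -> u8 {
    if outcomes.contains(&Outcome::Unreadable) {
        2
    } else if outcomes.contains(&Outcome::Invalid) {
        1
    } else {
        0
    }
}
```

In a CI script this means a non-zero status always signals a problem, and a status of exactly 1 signals that all inputs were inspected.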

Examples

# Validate a directory of files
simdutf8-cli tests/fixtures/*
# OK    tests/fixtures/ascii.txt
# FAIL  tests/fixtures/utf16le_bom.txt: invalid UTF-8 at byte 0 (1 invalid byte)

# JSON (one object per input, in an array)
simdutf8-cli --no-report --format json tests/fixtures/ascii.txt tests/fixtures/truncated_utf8.bin
# [{"path":"tests/fixtures/ascii.txt","valid":true},
#  {"path":"tests/fixtures/truncated_utf8.bin","valid":false,"valid_up_to":3,"error_len":null}]

# SARIF 2.1.0 for code scanning
simdutf8-cli --no-report --format sarif $(git ls-files) > simdutf8.sarif

# From standard input
printf 'grüße 😊' | simdutf8-cli --no-report

# Confine inputs and reject symlinks (e.g. scanning untrusted uploads)
simdutf8-cli --base-dir ./uploads --no-follow-symlinks ./uploads/file.txt
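
The JSON records shown in the example above have a small, fixed shape. The following sketch renders one record by hand. It is an illustration under assumptions: json_escape and render_record are hypothetical helpers, and the tool's actual serializer may differ.

```rust
/// Escape a string for use inside a JSON string literal (hypothetical helper).
pub fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be \u-escaped in JSON.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one record. `error` is `None` for valid input, otherwise
/// `(valid_up_to, error_len)` describing the first error.
pub fn render_record(path: &str, error: Option<(usize, Option<usize>)>) -> String {
    let mut out = format!(
        "{{\"path\":\"{}\",\"valid\":{}",
        json_escape(path),
        error.is_none()
    );
    if let Some((valid_up_to, error_len)) = error {
        // A missing error length is emitted as JSON null.
        let len = error_len.map_or_else(|| "null".to_string(), |n| n.to_string());
        out.push_str(&format!(",\"valid_up_to\":{valid_up_to},\"error_len\":{len}"));
    }
    out.push('}');
    out
}
```

Escaping the path matters because file names are attacker-influenced when scanning untrusted uploads.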

Output formats & reports

--format controls stdout: text (default), json, sarif, markdown. Independently, unless --no-report is given, every run that produces a finding writes two validated files into --output-dir (default current directory):

  • report.sarif — SARIF 2.1.0 JSON, strict-validated with sarif_rust.
  • report.md — GitHub-Flavored Markdown, derived from that SARIF via sarif-to-md-core.

This follows skills/rust-sarif.md; the report write is scoped through a cap-std capability handle (see skills/rust-path-security.md).

Security model (rust-path-security)

All filesystem access goes through PathPolicy. For each input it:

  1. rejects empty paths and paths with an interior NUL byte;
  2. with --base-dir, rejects .. lexically and verifies the canonicalized path stays inside the canonical base directory (defeats ../ traversal and symlink escapes);
  3. optionally refuses symbolic links (--no-follow-symlinks);
  4. re-checks file type/size on the open descriptor (fstat) — TOCTOU mitigation;
  5. accepts regular files only;
  6. hard-caps bytes read (--max-size), bounding memory for files and stdin.
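
Steps 4 to 6 can be sketched with the standard library. This is not the PathPolicy implementation: open_and_read_capped is a hypothetical name, and rejecting oversized input (rather than truncating it) is an assumption.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Open `path`, check type and size on the open handle, then read at most
/// `max_size` bytes (hypothetical helper, for illustration only).
pub fn open_and_read_capped(path: &Path, max_size: u64) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;
    // Metadata comes from the open handle (fstat on Unix), not from a second
    // path lookup, so swapping the path after the open cannot change what
    // is being checked.
    let meta = file.metadata()?;
    if !meta.is_file() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "not a regular file"));
    }
    if meta.len() > max_size {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "input exceeds size cap"));
    }
    // The file may still grow after the check, so the read itself is bounded:
    // one extra byte is allowed only to detect that the cap was exceeded.
    let mut buf = Vec::new();
    file.take(max_size.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > max_size {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "input exceeds size cap"));
    }
    Ok(buf)
}
```

The same bounded read (take followed by read_to_end) applies to standard input, where no size is known in advance.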

A lexical safe_join primitive confines attacker-influenced relative paths, and report writes use a capability-scoped cap-std directory handle. The crate is #![forbid(unsafe_code)] and avoids println! / std::fs::read_to_string / std::fs::write.
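
A lexical join of this kind can be sketched as follows. The signature is an assumption and may not match the crate's safe_join. Being purely lexical, it does not resolve symlinks; that is what the canonicalization check under --base-dir is for.

```rust
use std::path::{Component, Path, PathBuf};

/// Join an untrusted relative path onto `base`, refusing anything that could
/// step outside it. Hypothetical signature, for illustration only.
pub fn safe_join(base: &Path, untrusted: &str) -> Option<PathBuf> {
    // Empty paths and interior NUL bytes are rejected outright.
    if untrusted.is_empty() || untrusted.contains('\0') {
        return None;
    }
    let mut joined = base.to_path_buf();
    for component in Path::new(untrusted).components() {
        match component {
            Component::Normal(part) => joined.push(part),
            // A leading `./` is harmless and dropped.
            Component::CurDir => {}
            // `..`, an absolute root, or a Windows drive prefix could
            // escape `base`, so the whole path is refused.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(joined)
}
```

Rejecting `..` instead of normalizing it keeps the rule simple: a path such as a/../b is refused even though it would resolve inside the base.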

Example encoding files

examples/generate_fixtures.rs produces files in different encodings (committed under tests/fixtures/):

file                    encoding                   valid UTF-8?
ascii.txt               ASCII                      yes
utf8_multilingual.txt   UTF-8 (CJK, emoji, …)      yes
utf8_bom.txt            UTF-8 with BOM             yes
utf16le_bom.txt         UTF-16 LE + BOM            no
utf16be_bom.txt         UTF-16 BE + BOM            no
utf32le_bom.bin         UTF-32 LE + BOM            no
latin1.txt              ISO-8859-1 / Latin-1       no
truncated_utf8.bin      UTF-8, truncated tail      no
lone_continuation.bin   stray continuation byte    no

cargo run --example generate_fixtures           # -> ./tests/fixtures
cargo run --example validate_bytes              # library-usage demo

Testing

cargo test                                      # unit + integration + upstream tests
cargo test --features public_imp                # also exercise the low-level SIMD impls
RUSTFLAGS="-C target-cpu=native" cargo test --features public_imp

Three layers: unit tests (TDD, in each src/*.rs), tests/cli.rs (black-box binary tests via assert_cmd), and tests/upstream_tests.rs (the simdutf8 crate's own suite, vendored verbatim, dual-licensed Apache-2.0 OR MIT).

Fuzzing

The fuzz/ directory is a standalone cargo-fuzz (libFuzzer) workspace (nightly required):

cargo +nightly fuzz build           # add --jobs 2 on low-memory hosts
cargo +nightly fuzz run validate_vs_std -- -max_total_time=60

Targets: validate_vs_std (differential vs std), read_capped, json_escape, sarif_build, safe_join (path traversal).

Dependencies

Runtime: simdutf8, clap, thiserror, sarif_rust, sarif-to-md-core, cap-std, ignore. Dev-only: assert_cmd, predicates, tempfile, flexpect. All permissively licensed (MIT / Apache-2.0); cargo audit reports no advisories.

License

Apache-2.0. See LICENSE. The vendored upstream test file is additionally available under the simdutf8 authors' MIT license.