simdutf8-cli

A small, security-hardened command-line front-end for the simdutf8 crate — SIMD-accelerated UTF-8 validation. It tells you, quickly, whether files (or standard input) contain well-formed UTF-8, and where the first error is when they do not.

The project doubles as a worked example of:

Reusing an upstream crate's own test-suite verbatim — the simdutf8 unit tests run unchanged against this binary's dependency tree.
Hardened path handling (path traversal, symlink-escape, TOCTOU, and resource-exhaustion defences) in src/path_security.rs.
Structured output — SARIF 2.1.0 and Markdown for CI ingestion.
Test-driven development and fuzzing.

Code health (rust-doctor)

Measured with rust-doctor 0.1.20:

100 / 100  Great
Security 100 · Reliability 100 · Maintainability 99 · Performance 100 · Dependencies 100

clippy.toml and deny.toml were migrated to the current tool schemas, so cargo clippy and cargo deny check both run cleanly and the crate scores 100 overall. (rust-doctor additionally force-warns a few pedantic # Panics / # Errors doc lints on test and internal code; project policy allows these, and they do not affect the score.)

Reproduce:

rust-doctor .          # full report
rust-doctor --score    # bare integer for CI

Features

Validates one or many files, or standard input (- / no arguments).
Reports the exact byte offset and nature of the first UTF-8 error (via simdutf8::compat, matching std::str::from_utf8).
Four output formats: text, json, sarif (SARIF 2.1.0, strict-validated), and markdown (GitHub-Flavored, derived from the SARIF).
Auto-generates validated report.sarif + report.md (toggle with --no-report, redirect with --output-dir).
Optional confinement to a --base-dir, symlink rejection, and a configurable size cap (--max-size, default 64 MiB).
No unsafe code in this crate (#![forbid(unsafe_code)]).
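
As a rough illustration of the error reporting described above, the following sketch classifies a byte slice and extracts the offset and length of the first error. It uses std::str::from_utf8 so that it runs without dependencies; since simdutf8::compat is stated to match std, swapping in simdutf8::compat::from_utf8 should give the same fields. The Outcome struct and validate function are hypothetical names for this example, not the crate's own types.

```rust
use std::str;

/// Hypothetical result type for this sketch; the field names follow the
/// tool's JSON output, but the crate's real types may differ.
#[derive(Debug, PartialEq)]
struct Outcome {
    valid: bool,
    /// Number of leading bytes that are well-formed UTF-8 (set on failure).
    valid_up_to: Option<usize>,
    /// Length of the invalid sequence, or None if the input ended mid-character.
    error_len: Option<usize>,
}

fn validate(bytes: &[u8]) -> Outcome {
    match str::from_utf8(bytes) {
        Ok(_) => Outcome { valid: true, valid_up_to: None, error_len: None },
        Err(e) => Outcome {
            valid: false,
            valid_up_to: Some(e.valid_up_to()),
            error_len: e.error_len(),
        },
    }
}

fn main() {
    // Plain ASCII is valid UTF-8.
    println!("{:?}", validate(b"abc"));
    // A UTF-16 LE byte-order mark starts with 0xFF, which never occurs in UTF-8.
    println!("{:?}", validate(&[0xFF, 0xFE, b'a', 0x00]));
    // A three-byte sequence cut off after two bytes: truncated, so no error length.
    println!("{:?}", validate(&[b'a', b'b', b'c', 0xE2, 0x82]));
}
```

An error_len of None means the input ended in the middle of a character, which is why truncated input is reported differently from a stray invalid byte.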

Build & install

cargo build --release      # -> target/release/simdutf8-cli
cargo install --path .     # -> ~/.cargo/bin/simdutf8-cli

Requires Rust ≥ 1.93 (edition 2021).

Usage

simdutf8-cli [OPTIONS] [PATH]...

Arguments:
  [PATH]...   Files or directories. Directories are walked recursively.
              Use `-` or pass none to read standard input.

Options:
      --exclude <GLOB>      Exclude paths matching this glob when walking dirs
                            (repeatable, gitignore syntax)
      --no-ignore           Don't respect .gitignore / .ignore when walking
                            (they are respected by default)
      --hidden              Include hidden files/dirs when walking (skipped by default)
      --format <FORMAT>     text | json | sarif | markdown        [default: text]
      --base-dir <DIR>      Confine inputs to this directory (rejects traversal & symlink escapes)
      --no-follow-symlinks  Reject symbolic links instead of resolving them
      --max-size <BYTES>    Maximum bytes to read per input        [default: 67108864]
      --output-dir <DIR>    Directory for report.sarif / report.md [default: .]
      --no-report           Do not auto-generate the report files
  -q, --quiet               Suppress stdout; rely on the exit code only
  -h, --help                Print help
  -V, --version             Print version

Directory walking

A directory argument is walked recursively. By default the walker respects .gitignore / .ignore (even outside a git repo), skips hidden files, and does not follow symlinked directories. Tune it with:

simdutf8-cli src/                       # walk, honouring .gitignore
simdutf8-cli --no-ignore src/           # walk without applying .gitignore / .ignore rules
simdutf8-cli --hidden src/              # also include hidden files and directories
simdutf8-cli --exclude '*.min.js' --exclude vendor docs/   # skip globs / dirs

Explicitly named files are always validated; ignore rules apply only while walking directories. Each discovered file is still opened and read through the hardened PathPolicy.
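
The default walking behaviour (skip hidden entries, never follow symlinks) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: it does not parse .gitignore / .ignore rules or --exclude globs, it skips symlinked files as well as symlinked directories, and the walk function is a hypothetical name rather than the project's actual walker.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect regular files under `dir` in sorted order.
/// Hypothetical helper for illustration only.
fn walk(dir: &Path, include_hidden: bool, out: &mut Vec<PathBuf>) -> io::Result<()> {
    let mut entries = fs::read_dir(dir)?.collect::<io::Result<Vec<_>>>()?;
    // Sort by file name so the result is deterministic across platforms.
    entries.sort_by_key(|e| e.file_name());
    for entry in entries {
        let name = entry.file_name();
        if !include_hidden && name.to_string_lossy().starts_with('.') {
            continue; // hidden file or directory
        }
        // DirEntry::file_type does not follow symlinks, so a symlink is
        // reported as a symlink and falls through both branches below.
        let ty = entry.file_type()?;
        if ty.is_dir() {
            walk(&entry.path(), include_hidden, out)?;
        } else if ty.is_file() {
            out.push(entry.path());
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut files = Vec::new();
    walk(Path::new("."), false, &mut files)?;
    for f in &files {
        println!("{}", f.display());
    }
    Ok(())
}
```

Passing true for include_hidden corresponds roughly to --hidden in the option list above.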

Exit codes

code	meaning
`0`	every input was valid UTF-8
`1`	every input was readable, but at least one was invalid
`2`	at least one input could not be read securely (I/O error, policy violation, or report-write failure)
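
The table implies a precedence: an unreadable input outranks an invalid one, which outranks a valid one. A minimal sketch of that aggregation follows; the Status enum and exit_code function are hypothetical names, and returning 0 for an empty list is an assumption of the sketch.

```rust
/// Per-input outcome; hypothetical type for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Valid,
    Invalid,
    Unreadable,
}

/// Combine per-input outcomes into one process exit code by taking the
/// most severe code seen.
fn exit_code(statuses: &[Status]) -> u8 {
    statuses
        .iter()
        .map(|s| match s {
            Status::Valid => 0,
            Status::Invalid => 1,
            Status::Unreadable => 2,
        })
        .max()
        .unwrap_or(0) // assumption: no inputs counts as success
}

fn main() {
    let run = [Status::Valid, Status::Invalid, Status::Valid];
    // A real binary could return std::process::ExitCode::from(code) from main.
    println!("exit code: {}", exit_code(&run));
}
```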

Examples

# Validate every file in a directory (expanded by the shell glob)
simdutf8-cli tests/fixtures/*
# OK    tests/fixtures/ascii.txt
# FAIL  tests/fixtures/utf16le_bom.txt: invalid UTF-8 at byte 0 (1 invalid byte)

# JSON (one object per input, in an array)
simdutf8-cli --no-report --format json tests/fixtures/ascii.txt tests/fixtures/truncated_utf8.bin
# [{"path":"tests/fixtures/ascii.txt","valid":true},
#  {"path":"tests/fixtures/truncated_utf8.bin","valid":false,"valid_up_to":3,"error_len":null}]

# SARIF 2.1.0 for code scanning
simdutf8-cli --no-report --format sarif $(git ls-files) > simdutf8.sarif

# From standard input
printf 'grüße 😊' | simdutf8-cli --no-report

# Confine inputs and reject symlinks (e.g. scanning untrusted uploads)
simdutf8-cli --base-dir ./uploads --no-follow-symlinks ./uploads/file.txt
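
For readers consuming the JSON output, the object shape shown in the JSON example above can be reproduced with a small renderer. This is a sketch, not the project's renderer: json_escape and render_entry are hypothetical helpers, std::str::from_utf8 stands in for simdutf8::compat, and the byte content used for the truncated fixture is an assumption.

```rust
/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be written as \u00XX.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one result object: valid inputs carry only `path` and `valid`,
/// invalid ones add `valid_up_to` and `error_len` (null when truncated).
fn render_entry(path: &str, bytes: &[u8]) -> String {
    let path = json_escape(path);
    match std::str::from_utf8(bytes) {
        Ok(_) => format!("{{\"path\":\"{}\",\"valid\":true}}", path),
        Err(e) => {
            let error_len = e
                .error_len()
                .map_or_else(|| "null".to_string(), |n| n.to_string());
            format!(
                "{{\"path\":\"{}\",\"valid\":false,\"valid_up_to\":{},\"error_len\":{}}}",
                path,
                e.valid_up_to(),
                error_len
            )
        }
    }
}

fn main() {
    println!("{}", render_entry("tests/fixtures/ascii.txt", b"hello"));
    // Assumed fixture content: "abc" followed by a cut-off three-byte sequence.
    println!(
        "{}",
        render_entry("tests/fixtures/truncated_utf8.bin", &[b'a', b'b', b'c', 0xE2, 0x82])
    );
}
```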

Output formats & reports

--format controls stdout: text (default), json, sarif, markdown. Independently, unless --no-report is given, every run that produces a finding writes two validated files into --output-dir (default current directory):

report.sarif — SARIF 2.1.0 JSON, strict-validated with sarif_rust.
report.md — GitHub-Flavored Markdown, derived from that SARIF via sarif-to-md-core.

This follows skills/rust-sarif.md; the report write is scoped through a cap-std capability handle (see skills/rust-path-security.md).

Security model (`rust-path-security`)

All filesystem access goes through PathPolicy. For each input it:

rejects empty paths and paths with an interior NUL byte;
with --base-dir, rejects .. lexically and verifies the canonicalized path stays inside the canonical base directory (defeats ../ traversal and symlink escapes);
optionally refuses symbolic links (--no-follow-symlinks);
re-checks file type/size on the open descriptor (fstat) — TOCTOU mitigation;
accepts regular files only;
hard-caps bytes read (--max-size), bounding memory for files and stdin.
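
The last three steps (re-check on the open descriptor, regular files only, capped read) might look roughly like the sketch below. It is an illustration under assumptions, not the contents of src/path_security.rs: both function signatures are hypothetical, and the error kinds are chosen for the example.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Read at most `max` bytes from `reader`; fail if the source holds more.
fn read_capped<R: Read>(reader: R, max: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Request one byte beyond the cap so "exactly max" and "too large"
    // can be told apart without reading the whole source.
    reader.take(max.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > max {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "input exceeds size cap"));
    }
    Ok(buf)
}

/// Open a path, then check type and size on the open handle before reading.
fn open_regular_capped(path: &Path, max: u64) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;
    // File::metadata queries the open handle (fstat on Unix), so it describes
    // the object that was actually opened even if the path is swapped later.
    let meta = file.metadata()?;
    if !meta.is_file() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "not a regular file"));
    }
    if meta.len() > max {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file exceeds size cap"));
    }
    // The read stays capped in case the file grows after the size check.
    read_capped(file, max)
}

fn main() -> io::Result<()> {
    let data = read_capped(io::Cursor::new(b"hello".to_vec()), 64 * 1024 * 1024)?;
    println!("read {} bytes", data.len());
    Ok(())
}
```

Because read_capped is generic over Read, the same cap can bound standard input, where no size is known in advance.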

A lexical safe_join primitive confines attacker-influenced relative paths, and report writes use a capability-scoped cap-std directory handle. The crate is #![forbid(unsafe_code)] and avoids println! / std::fs::read_to_string / std::fs::write.
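
A lexical join of this kind can be sketched as follows. The signature and the exact rejection rules are assumptions for illustration; the project's safe_join may differ. Being purely lexical, the check never touches the filesystem, so it cannot detect symlink escapes on its own and complements the canonicalized --base-dir check rather than replacing it.

```rust
use std::path::{Component, Path, PathBuf};

/// Join an untrusted relative path onto `base`, refusing anything that
/// could step outside it. Hypothetical signature for this sketch.
fn safe_join(base: &Path, untrusted: &str) -> Option<PathBuf> {
    if untrusted.is_empty() || untrusted.contains('\0') {
        return None;
    }
    let mut joined = base.to_path_buf();
    for component in Path::new(untrusted).components() {
        match component {
            // Ordinary name: descend into it.
            Component::Normal(part) => joined.push(part),
            // "." is harmless and contributes nothing.
            Component::CurDir => {}
            // "..", a root, or a Windows prefix could escape the base.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(joined)
}

fn main() {
    let base = Path::new("/srv/uploads");
    println!("{:?}", safe_join(base, "docs/readme.txt"));
    println!("{:?}", safe_join(base, "../etc/passwd"));
}
```

Rejecting every .. component, even one that would stay inside the base such as a/../b, keeps the rule simple to audit at the cost of refusing a few harmless paths.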

Example encoding files

examples/generate_fixtures.rs produces files in different encodings (committed under tests/fixtures/):

file	encoding	valid UTF-8?
`ascii.txt`	ASCII	✅
`utf8_multilingual.txt`	UTF-8 (CJK, emoji, …)	✅
`utf8_bom.txt`	UTF-8 with BOM	✅
`utf16le_bom.txt`	UTF-16 LE + BOM	❌
`utf16be_bom.txt`	UTF-16 BE + BOM	❌
`utf32le_bom.bin`	UTF-32 LE + BOM	❌
`latin1.txt`	ISO-8859-1 / Latin-1	❌
`truncated_utf8.bin`	UTF-8, truncated tail	❌
`lone_continuation.bin`	stray continuation byte	❌

cargo run --example generate_fixtures           # -> ./tests/fixtures
cargo run --example validate_bytes              # library-usage demo

Testing

cargo test                                      # unit + integration + upstream tests
cargo test --features public_imp                # also exercise the low-level SIMD impls
RUSTFLAGS="-C target-cpu=native" cargo test --features public_imp

Four layers: unit tests (TDD, in each src/*.rs), tests/cli.rs (black-box binary tests via assert_cmd), tests/proptest_validation.rs (property-based invariants over validation, reporting and path handling, via proptest), and tests/upstream_tests.rs (the simdutf8 crate's own suite, vendored verbatim, dual-licensed Apache-2.0 OR MIT).

Fuzzing

The fuzz/ directory is a standalone cargo-fuzz (libFuzzer) workspace (nightly required):

cargo +nightly fuzz build           # add --jobs 2 on low-memory hosts
cargo +nightly fuzz run validate_vs_std -- -max_total_time=60

Eight targets: validate_vs_std (differential vs std), validate_prefix (reported valid_up_to is a real UTF-8 boundary; validation is deterministic), read_capped, json_escape, render_blocks (text/JSON stdout renderers), sarif_build, sarif_markdown (SARIF → Markdown rendering), safe_join (path traversal).
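
To make the validate_prefix property concrete, here is a sketch of the invariants as a plain function. It is not the actual fuzz target: in a cargo-fuzz target the body would be driven by libFuzzer-generated input, whereas here a few fixed inputs are used, and std::str::from_utf8 stands in for the validator under test. The function name is hypothetical.

```rust
use std::str;

/// Panic if validation is non-deterministic or reports an offset that is
/// not a genuine UTF-8 boundary. Hypothetical helper for this sketch.
fn check_prefix_invariants(data: &[u8]) {
    let first = str::from_utf8(data);
    let second = str::from_utf8(data);
    // Determinism: the same bytes must always give the same verdict.
    assert_eq!(first, second);

    if let Err(e) = first {
        let cut = e.valid_up_to();
        assert!(cut <= data.len());
        // Everything before the reported offset must itself be valid UTF-8.
        assert!(str::from_utf8(&data[..cut]).is_ok());
        if let Some(len) = e.error_len() {
            // std documents the invalid sequence length as between 1 and 3.
            assert!((1..=3).contains(&len));
            assert!(cut + len <= data.len());
        }
    }
}

fn main() {
    let samples: [&[u8]; 4] = [
        b"plain ascii",
        "grüße 😊".as_bytes(),
        &[b'o', b'k', 0x80, b'x'],
        &[b'a', 0xF0, 0x9F, 0x98],
    ];
    for sample in samples {
        check_prefix_invariants(sample);
    }
    println!("all invariants held");
}
```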

Dependencies

Runtime: simdutf8, clap, thiserror, sarif_rust, sarif-to-md-core, cap-std, ignore. Dev-only: assert_cmd, predicates, tempfile, proptest, flexpect. All permissively licensed (MIT / Apache-2.0); cargo audit reports no advisories.

License

Apache-2.0. See LICENSE. The vendored upstream test file is additionally available under the simdutf8 authors' MIT license.

simdutf8-cli 0.2.7