simdutf8-cli
A small, security-hardened command-line front-end for the
simdutf8 crate — SIMD-accelerated UTF-8
validation. It tells you, quickly, whether files (or standard input) contain
well-formed UTF-8, and where the first error is when they do not.
The project doubles as a worked example of:
- Reusing an upstream crate's own test-suite verbatim — the simdutf8 unit tests run unchanged against this binary's dependency tree.
- Hardened path handling (path traversal, symlink-escape, TOCTOU, and resource-exhaustion defences) in src/path_security.rs.
- Structured output: SARIF 2.1.0 and Markdown for CI ingestion.
- Test-driven development and fuzzing.
Code health (rust-doctor)
Measured with rust-doctor 0.1.20:
100 / 100 Great
Security 100 · Reliability 100 · Maintainability 99 · Performance 100 · Dependencies 100
clippy.toml and deny.toml were migrated to the current tool schemas, so cargo clippy and cargo deny check both run cleanly and the crate scores a perfect 100. (rust-doctor additionally force-warns a few pedantic # Panics / # Errors doc lints on test/internal code; these are allowed by project policy and do not affect the score.)
Reproduce:
Features
- Validates one or many files, or standard input (- / no arguments).
- Reports the exact byte offset and nature of the first UTF-8 error (via simdutf8::compat, matching std::str::from_utf8).
- Four output formats: text, json, sarif (SARIF 2.1.0, strict-validated), and markdown (GitHub-Flavored, derived from the SARIF).
- Auto-generates validated report.sarif + report.md (toggle with --no-report, redirect with --output-dir).
- Optional confinement to a --base-dir, symlink rejection, and a configurable size cap.
- No unsafe code in this crate (#![forbid(unsafe_code)]).
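To illustrate what "exact byte offset and nature of the first error" means, here is a minimal sketch that uses only the standard library validator. The feature list says simdutf8::compat matches std::str::from_utf8, so swapping in simdutf8::compat::from_utf8 should yield the same two values. The Verdict type and check function are hypothetical names for this example, not part of the crate.

```rust
use std::str;

/// Hypothetical result type; the two fields mirror `valid_up_to` and
/// `error_len` as they appear in the tool's JSON output.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Valid,
    Invalid {
        /// Number of leading bytes that are well-formed UTF-8.
        valid_up_to: usize,
        /// Length of the invalid sequence, or `None` if the input
        /// ended in the middle of a multi-byte character.
        error_len: Option<usize>,
    },
}

/// Hypothetical helper: classify a buffer with the std validator.
pub fn check(bytes: &[u8]) -> Verdict {
    match str::from_utf8(bytes) {
        Ok(_) => Verdict::Valid,
        Err(e) => Verdict::Invalid {
            valid_up_to: e.valid_up_to(),
            error_len: e.error_len(),
        },
    }
}
```

With this sketch, a buffer starting with a UTF-16 LE byte-order mark (0xFF 0xFE) is reported as invalid at byte 0 with one invalid byte, while "abc" followed by a truncated three-byte sequence is reported as valid up to byte 3 with no error length.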
Build & install
Requires Rust ≥ 1.93 (edition 2021).
Usage
simdutf8-cli [OPTIONS] [PATH]...
Arguments:
[PATH]... Files or directories. Directories are walked recursively.
Use `-` or pass none to read standard input.
Options:
--exclude <GLOB> Exclude paths matching this glob when walking dirs
(repeatable, gitignore syntax)
--no-ignore Don't respect .gitignore / .ignore when walking
(they are respected by default)
--hidden Include hidden files/dirs when walking (skipped by default)
--format <FORMAT> text | json | sarif | markdown [default: text]
--base-dir <DIR> Confine inputs to this directory (rejects traversal & symlink escapes)
--no-follow-symlinks Reject symbolic links instead of resolving them
--max-size <BYTES> Maximum bytes to read per input [default: 67108864]
--output-dir <DIR> Directory for report.sarif / report.md [default: .]
--no-report Do not auto-generate the report files
-q, --quiet Suppress stdout; rely on the exit code only
-h, --help Print help
-V, --version Print version
Directory walking
A directory argument is walked recursively. By default the walker respects
.gitignore / .ignore (even outside a git repo), skips hidden files, and
does not follow symlinked directories. Tune it with:
Explicitly named files are always validated; ignore rules apply only while
walking directories. Each discovered file is still opened and read through the
hardened PathPolicy.
Exit codes
| code | meaning |
|---|---|
| 0 | every input was valid UTF-8 |
| 1 | every input was readable, but at least one was invalid |
| 2 | at least one input could not be read securely (I/O, policy, report) |
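The precedence implied by these exit codes can be sketched as a small fold over per-input outcomes. This is an illustration under assumptions, not the crate's actual code: the Outcome enum and exit_code function are hypothetical names.

```rust
/// Hypothetical per-input outcome.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The input was read and is well-formed UTF-8.
    Valid,
    /// The input was read but contains invalid UTF-8.
    Invalid,
    /// The input could not be read securely (I/O, policy, report).
    Unreadable,
}

/// Collapse all outcomes into one process exit code.
/// An unreadable input (2) outranks an invalid one (1),
/// which outranks an all-valid run (0).
pub fn exit_code(outcomes: &[Outcome]) -> u8 {
    if outcomes.iter().any(|o| *o == Outcome::Unreadable) {
        2
    } else if outcomes.iter().any(|o| *o == Outcome::Invalid) {
        1
    } else {
        0
    }
}
```

Ranking 2 above 1 keeps code 1 meaningful: a caller that sees 1 can rely on every input having been read.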
Examples
# Validate a directory of files
# OK tests/fixtures/ascii.txt
# FAIL tests/fixtures/utf16le_bom.txt: invalid UTF-8 at byte 0 (1 invalid byte)
# JSON (one object per input, in an array)
# [{"path":"tests/fixtures/ascii.txt","valid":true},
# {"path":"tests/fixtures/truncated_utf8.bin","valid":false,"valid_up_to":3,"error_len":null}]
# SARIF 2.1.0 for code scanning
# From standard input
# Confine inputs and reject symlinks (e.g. scanning untrusted uploads)
Output formats & reports
--format controls stdout: text (default), json, sarif, markdown.
Independently, unless --no-report is given, every run that produces a finding
writes two validated files into --output-dir (default current directory):
- report.sarif: SARIF 2.1.0 JSON, strict-validated with sarif_rust.
- report.md: GitHub-Flavored Markdown, derived from that SARIF via sarif-to-md-core.
This follows skills/rust-sarif.md; the report write is scoped through a
cap-std capability handle (see skills/rust-path-security.md).
Security model (rust-path-security)
All filesystem access goes through PathPolicy. For each
input it:
- rejects empty paths and paths with an interior NUL byte;
- with --base-dir, rejects .. lexically and verifies the canonicalized path stays inside the canonical base directory (defeats ../ traversal and symlink escapes);
- optionally refuses symbolic links (--no-follow-symlinks);
- re-checks file type/size on the open descriptor (fstat), a TOCTOU mitigation;
- accepts regular files only;
- hard-caps bytes read (--max-size), bounding memory for files and stdin.
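The size cap in the last item can be sketched with std::io::Read::take. The name read_capped matches one of the fuzz targets, but the signature and the choice to fail rather than truncate are assumptions made for this example.

```rust
use std::io::{self, Read};

/// Read at most `max_size` bytes from `reader`.
/// Assumed behaviour: inputs larger than the cap are rejected with an
/// error rather than silently truncated.
pub fn read_capped<R: Read>(reader: R, max_size: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one byte beyond the cap: it is never used as data and only
    // tells us that the input was larger than `max_size`.
    reader
        .take(max_size.saturating_add(1))
        .read_to_end(&mut buf)?;
    if buf.len() as u64 > max_size {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "input exceeds the configured size cap",
        ));
    }
    Ok(buf)
}
```

Because the limit is enforced on the reader rather than on a reported file length, the same function bounds memory for standard input, where no length is known in advance.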
A lexical safe_join primitive confines
attacker-influenced relative paths, and report writes use a capability-scoped
cap-std directory handle. The crate is #![forbid(unsafe_code)] and avoids
println! / std::fs::read_to_string / std::fs::write.
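A lexical join of this kind might look like the sketch below. Only the name safe_join comes from the project; the signature, the Option return type, and the exact rejection rules are assumptions, and the real implementation in src/path_security.rs may differ.

```rust
use std::path::{Component, Path, PathBuf};

/// Join an untrusted relative path onto `base` without touching the
/// filesystem. Returns `None` if the result could leave `base`.
pub fn safe_join(base: &Path, untrusted: &str) -> Option<PathBuf> {
    // Empty paths and interior NUL bytes are never acceptable.
    if untrusted.is_empty() || untrusted.contains('\0') {
        return None;
    }
    let mut joined = base.to_path_buf();
    for component in Path::new(untrusted).components() {
        match component {
            // Ordinary names are appended as-is.
            Component::Normal(part) => joined.push(part),
            // `.` does not change the location.
            Component::CurDir => {}
            // `..`, a leading `/`, or a Windows drive prefix could all
            // move the result outside `base`, so reject them outright.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => {
                return None;
            }
        }
    }
    Some(joined)
}
```

A purely lexical check cannot detect symlinks inside the base directory, which is why the policy described above also compares canonicalized paths when --base-dir is set.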
Example encoding files
examples/generate_fixtures.rs produces files
in different encodings (committed under tests/fixtures/):
| file | encoding | valid UTF-8? |
|---|---|---|
| ascii.txt | ASCII | ✅ |
| utf8_multilingual.txt | UTF-8 (CJK, emoji, …) | ✅ |
| utf8_bom.txt | UTF-8 with BOM | ✅ |
| utf16le_bom.txt | UTF-16 LE + BOM | ❌ |
| utf16be_bom.txt | UTF-16 BE + BOM | ❌ |
| utf32le_bom.bin | UTF-32 LE + BOM | ❌ |
| latin1.txt | ISO-8859-1 / Latin-1 | ❌ |
| truncated_utf8.bin | UTF-8, truncated tail | ❌ |
| lone_continuation.bin | stray continuation byte | ❌ |
Testing
RUSTFLAGS="-C target-cpu=native"
Three layers: unit tests (TDD, in each src/*.rs), tests/cli.rs (black-box
binary tests via assert_cmd), and tests/upstream_tests.rs (the simdutf8
crate's own suite, vendored verbatim, dual-licensed Apache-2.0 OR MIT).
Fuzzing
The fuzz/ directory is a standalone cargo-fuzz (libFuzzer) workspace
(nightly required):
Targets: validate_vs_std (differential vs std), read_capped, json_escape,
sarif_build, safe_join (path traversal).
Documentation
Dependencies
Runtime: simdutf8, clap, thiserror, sarif_rust, sarif-to-md-core,
cap-std, ignore. Dev-only: assert_cmd, predicates, tempfile, flexpect. All
permissively licensed (MIT / Apache-2.0); cargo audit reports no advisories.
License
Apache-2.0. See LICENSE. The vendored upstream test file is additionally available under the simdutf8 authors' MIT license.