simdutf8-cli
Languages: English · Deutsch · Français
A small, security-hardened command-line front-end for the
simdutf8 crate: SIMD-accelerated UTF-8
validation. It tells you, quickly, whether files (or standard input) contain
well-formed UTF-8, and where the first error is when they do not.
The project doubles as a worked example of:
- Reusing an upstream crate's own test-suite verbatim: the simdutf8 unit tests run unchanged against this binary's dependency tree.
- Hardened path handling (path traversal, symlink-escape, TOCTOU, and resource-exhaustion defences) in `src/path_security.rs`.
- Structured output: SARIF 2.1.0 and Markdown for CI ingestion.
- Test-driven development and fuzzing.
Code health (rust-doctor)
Measured with rust-doctor 0.1.20:
100 / 100 Great
Security 100 · Reliability 100 · Maintainability 99 · Performance 100 · Dependencies 100
`clippy.toml` and `deny.toml` were migrated to the current tool schemas, so `cargo clippy` and `cargo deny check` both run cleanly and the crate scores a perfect 100. (rust-doctor additionally force-warns a few pedantic `# Panics` / `# Errors` doc lints on test/internal code; these are allowed by project policy and do not affect the score.)
Reproduce:
Features
- Validates one or many files, or standard input (`-` / no arguments).
- Reports the exact byte offset and nature of the first UTF-8 error (via `simdutf8::compat`, matching `std::str::from_utf8`).
- Four output formats: text, json, sarif (SARIF 2.1.0, strict-validated), and markdown (GitHub-Flavored, derived from the SARIF).
- Auto-generates validated `report.sarif` + `report.md` (toggle with `--no-report`, redirect with `--output-dir`).
- Optional confinement to a `--base-dir`, symlink rejection, and a configurable size cap.
- No `unsafe` code in this crate (`#![forbid(unsafe_code)]`).
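
Because the first-error report is described as matching `std::str::from_utf8`, its shape can be illustrated with the standard library alone. This is a minimal sketch, not the crate's actual code: the `Verdict` type and the `check` function are hypothetical names introduced only for this example.

```rust
use std::str;

/// Hypothetical result type for illustration; not part of simdutf8-cli.
#[derive(Debug, PartialEq)]
enum Verdict {
    Valid,
    Invalid {
        /// Byte offset of the first error (everything before it is valid).
        valid_up_to: usize,
        /// Number of invalid bytes, or `None` when the input ends in the
        /// middle of a multi-byte sequence.
        error_len: Option<usize>,
    },
}

/// Validate `bytes` with the standard library and report the first error.
fn check(bytes: &[u8]) -> Verdict {
    match str::from_utf8(bytes) {
        Ok(_) => Verdict::Valid,
        Err(e) => Verdict::Invalid {
            valid_up_to: e.valid_up_to(),
            error_len: e.error_len(),
        },
    }
}
```

Under this sketch, a truncated input such as `b"abc\xE2\x82"` gives offset 3 with no error length, and an input starting with a UTF-16 LE byte-order mark (`FF FE`) gives offset 0 with an error length of 1.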
Build & install
Requires Rust ≥ 1.93 (edition 2021).
Usage
simdutf8-cli [OPTIONS] [PATH]...
Arguments:
[PATH]... Files or directories. Directories are walked recursively.
Use `-` or pass none to read standard input.
Options:
--exclude <GLOB> Exclude paths matching this glob when walking dirs
(repeatable, gitignore syntax)
--no-ignore Don't respect .gitignore / .ignore when walking
(they are respected by default)
--hidden Include hidden files/dirs when walking (skipped by default)
--format <FORMAT> text | json | sarif | markdown [default: text]
--base-dir <DIR> Confine inputs to this directory (rejects traversal & symlink escapes)
--no-follow-symlinks Reject symbolic links instead of resolving them
--max-size <BYTES> Maximum bytes to read per input [default: 67108864]
--output-dir <DIR> Directory for report.sarif / report.md [default: .]
--no-report Do not auto-generate the report files
-q, --quiet Suppress stdout; rely on the exit code only
-h, --help Print help
-V, --version Print version
Directory walking
A directory argument is walked recursively. By default the walker respects
.gitignore / .ignore (even outside a git repo), skips hidden files, and
does not follow symlinked directories. Tune it with:
Explicitly named files are always validated; ignore rules apply only while
walking directories. Each discovered file is still opened and read through the
hardened PathPolicy.
Exit codes
| code | meaning |
|---|---|
| 0 | every input was valid UTF-8 |
| 1 | every input was readable, but at least one was invalid |
| 2 | at least one input could not be read securely (I/O, policy, report) |
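
The precedence between these codes can be sketched as a small function. This is an illustration under stated assumptions rather than the crate's implementation: `Outcome` and `exit_code` are hypothetical names, and I/O, policy, and report failures are all folded into a single `Unreadable` variant.

```rust
/// Hypothetical per-input outcome; the real crate's types may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Valid,
    Invalid,
    Unreadable,
}

/// Map a run's outcomes to a process exit code.
/// A read failure (2) takes precedence over an invalid input (1).
fn exit_code(outcomes: &[Outcome]) -> u8 {
    if outcomes.iter().any(|o| *o == Outcome::Unreadable) {
        2
    } else if outcomes.iter().any(|o| *o == Outcome::Invalid) {
        1
    } else {
        0
    }
}
```

The ordering matters for CI: a run containing both an invalid file and an unreadable one returns 2 here, so a script can distinguish "the scan was incomplete" from "the scan finished and found bad input".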
Examples
# Validate a directory of files
# OK tests/fixtures/ascii.txt
# FAIL tests/fixtures/utf16le_bom.txt: invalid UTF-8 at byte 0 (1 invalid byte)
# JSON (one object per input, in an array)
# [{"path":"tests/fixtures/ascii.txt","valid":true},
# {"path":"tests/fixtures/truncated_utf8.bin","valid":false,"valid_up_to":3,"error_len":null}]
# SARIF 2.1.0 for code scanning
# From standard input
|
# Confine inputs and reject symlinks (e.g. scanning untrusted uploads)
Output formats & reports
--format controls stdout: text (default), json, sarif, markdown.
Independently, unless --no-report is given, every run that produces a finding
writes two validated files into --output-dir (default current directory):
- `report.sarif`: SARIF 2.1.0 JSON, strict-validated with `sarif_rust`.
- `report.md`: GitHub-Flavored Markdown, derived from that SARIF via `sarif-to-md-core`.
This follows skills/rust-sarif.md; the report write is scoped through a
cap-std capability handle (see skills/rust-path-security.md).
Security model (rust-path-security)
All filesystem access goes through PathPolicy. For each
input it:
- rejects empty paths and paths with an interior NUL byte;
- with `--base-dir`, rejects `..` lexically and verifies the canonicalized path stays inside the canonical base directory (defeats `../` traversal and symlink escapes);
- optionally refuses symbolic links (`--no-follow-symlinks`);
- re-checks file type/size on the open descriptor (`fstat`) as a TOCTOU mitigation;
- accepts regular files only;
- hard-caps bytes read (`--max-size`), bounding memory for files and stdin.
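
The last three checks can be sketched with the standard library. This is a hedged illustration, not the code in `src/path_security.rs`: the function names and signatures below are assumptions, and the real `PathPolicy` performs more checks than shown.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Read at most `max_size` bytes; fail if the input is larger.
/// (Hypothetical helper; the name and signature are assumptions.)
fn read_capped<R: Read>(reader: R, max_size: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one extra byte so an oversized input can be detected
    // without ever buffering more than `max_size + 1` bytes.
    let limit = max_size.saturating_add(1);
    reader.take(limit).read_to_end(&mut buf)?;
    if buf.len() as u64 > max_size {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "input exceeds size cap",
        ));
    }
    Ok(buf)
}

/// Open `path`, then check type and size on the open handle.
/// `File::metadata` queries the handle (fstat on Unix), not the path,
/// so swapping the path after `open` cannot change what is checked.
fn read_regular_file(path: &Path, max_size: u64) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;
    let meta = file.metadata()?;
    if !meta.is_file() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "not a regular file",
        ));
    }
    if meta.len() > max_size {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "file exceeds size cap",
        ));
    }
    // The cap is enforced again while reading, in case the file grows.
    read_capped(file, max_size)
}
```

Because `read_capped` is generic over `Read`, the same cap can be applied to standard input, where no size is known in advance.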
A lexical safe_join primitive confines
attacker-influenced relative paths, and report writes use a capability-scoped
cap-std directory handle. The crate is #![forbid(unsafe_code)] and avoids
println! / std::fs::read_to_string / std::fs::write.
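
A lexical join of this kind might look like the sketch below. It shares only its name with the crate's `safe_join`; the signature and the exact rejection rules are assumptions made for illustration. Being purely lexical, it does not touch the filesystem and so cannot detect symlink escapes, which the canonicalization check listed above handles separately.

```rust
use std::path::{Component, Path, PathBuf};

/// Join an untrusted relative path onto `base` without allowing it to
/// escape. Returns `None` for anything that is not a plain relative path.
/// (Illustrative sketch; not the crate's actual signature.)
fn safe_join(base: &Path, untrusted: &str) -> Option<PathBuf> {
    if untrusted.is_empty() || untrusted.contains('\0') {
        return None;
    }
    let mut out = base.to_path_buf();
    for component in Path::new(untrusted).components() {
        match component {
            // Ordinary path segment: append it.
            Component::Normal(part) => out.push(part),
            // `.` changes nothing.
            Component::CurDir => {}
            // `..`, a root (`/`), or a Windows prefix could leave `base`.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => {
                return None;
            }
        }
    }
    Some(out)
}
```

Rejecting every `..` outright is stricter than normalizing it, but it avoids reasoning about cases like `a/../../x` where a partial normalization would still climb out of the base.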
Example encoding files
examples/generate_fixtures.rs produces files
in different encodings (committed under tests/fixtures/):
| file | encoding | valid UTF-8? |
|---|---|---|
| ascii.txt | ASCII | ✓ |
| utf8_multilingual.txt | UTF-8 (CJK, emoji, …) | ✓ |
| utf8_bom.txt | UTF-8 with BOM | ✓ |
| utf16le_bom.txt | UTF-16 LE + BOM | ✗ |
| utf16be_bom.txt | UTF-16 BE + BOM | ✗ |
| utf32le_bom.bin | UTF-32 LE + BOM | ✗ |
| latin1.txt | ISO-8859-1 / Latin-1 | ✗ |
| truncated_utf8.bin | UTF-8, truncated tail | ✗ |
| lone_continuation.bin | stray continuation byte | ✗ |
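
To show how fixtures like these could be produced, here is a minimal sketch of three encoders using only the standard library. It is not the content of `examples/generate_fixtures.rs`; the function names are hypothetical and the sample strings are arbitrary.

```rust
/// UTF-8 with a leading byte-order mark (EF BB BF).
/// The BOM is itself a valid UTF-8 sequence, so the result stays valid.
fn utf8_with_bom(text: &str) -> Vec<u8> {
    let mut out = vec![0xEF, 0xBB, 0xBF];
    out.extend_from_slice(text.as_bytes());
    out
}

/// UTF-16 LE with a leading byte-order mark (FF FE).
/// The byte 0xFF never occurs in UTF-8, so this is always invalid UTF-8.
fn utf16le_with_bom(text: &str) -> Vec<u8> {
    let mut out = vec![0xFF, 0xFE];
    for unit in text.encode_utf16() {
        out.extend_from_slice(&unit.to_le_bytes());
    }
    out
}

/// ISO-8859-1: each char up to U+00FF maps to the byte of the same value.
/// Returns `None` if the text contains a character outside Latin-1.
fn latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| if (c as u32) <= 0xFF { Some(c as u8) } else { None })
        .collect()
}
```

For example, `latin1("café")` ends in the single byte `0xE9`, which UTF-8 treats as the start of a three-byte sequence with nothing after it, so validation fails at that byte.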
Testing
RUSTFLAGS="-C target-cpu=native"
Four layers: unit tests (TDD, in each src/*.rs), tests/cli.rs (black-box
binary tests via assert_cmd), tests/proptest_validation.rs (property-based
invariants over validation, reporting and path handling, via proptest), and
tests/upstream_tests.rs (the simdutf8 crate's own suite, vendored
verbatim, dual-licensed Apache-2.0 OR MIT).
Fuzzing
The fuzz/ directory is a standalone cargo-fuzz (libFuzzer) workspace
(nightly required):
Eight targets: validate_vs_std (differential vs std), validate_prefix
(reported valid_up_to is a real UTF-8 boundary; validation is deterministic),
read_capped, json_escape, render_blocks (text/JSON stdout renderers),
sarif_build, sarif_markdown (SARIF → Markdown rendering), safe_join
(path traversal).
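
The invariants named for `validate_prefix` can be expressed as a plain function. The sketch below uses `std::str::from_utf8` as a stand-in validator and a hypothetical function name; the actual fuzz target's code is not shown in this README and may check more.

```rust
use std::str;

/// Returns true when both invariants hold for `data`:
/// 1. validating twice gives the same result (determinism);
/// 2. on error, the bytes before `valid_up_to` are themselves valid UTF-8,
///    so the reported offset is a real boundary.
fn prefix_invariant_holds(data: &[u8]) -> bool {
    let first = str::from_utf8(data);
    let second = str::from_utf8(data);
    if first != second {
        return false;
    }
    match first {
        Ok(_) => true,
        Err(e) => {
            let end = e.valid_up_to();
            end <= data.len() && str::from_utf8(&data[..end]).is_ok()
        }
    }
}
```

In a cargo-fuzz target, a check like this would typically be wrapped in `libfuzzer_sys::fuzz_target!(|data: &[u8]| { ... })` with an `assert!` on the result, so that any violating input is reported as a crash.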
Documentation
Dependencies
Runtime: simdutf8, clap, thiserror, sarif_rust, sarif-to-md-core,
cap-std, ignore. Dev-only: assert_cmd, predicates, tempfile, flexpect. All
permissively licensed (MIT / Apache-2.0); cargo audit reports no advisories.
License
Apache-2.0. See LICENSE. The vendored upstream test file is additionally available under the simdutf8 authors' MIT license.