fsys 0.9.3 - Docs.rs

FSYS (fsys-rs) is a low-level file and directory IO crate for Rust. It is aimed at systems code that needs explicit control over durability, predictable cross-platform behavior, and an API surface that stays close to how storage software actually thinks about IO.

The crate sits between std::fs and a fully bespoke platform layer. It keeps the operational model explicit: you choose a durability method, build a long-lived Handle, and issue file or directory operations through that handle. On supported platforms, fsys uses the best available primitive for the selected method while keeping fallback behavior visible rather than implicit.

That makes it a good fit for storage engines, embedded databases, local-first applications, durable caches, append-heavy services, background workers, and other programs where write semantics matter as much as raw throughput. It is not trying to replace std::fs for ordinary application code.

FEATURES

Journal substrate — open-once append-only log file with atomic LSN reservation, group-commit fsync, and a CRC-32C-protected self-identifying frame format. Intended for write-ahead-log workloads (database WAL, persistent queues, ledgers) where the atomic-replace primitive's per-call fsync cost is the bottleneck. Three throughput tiers are present: a cross-platform synchronous core, a lock-free concurrent append path, and a native io_uring asynchronous substrate on Linux. An opt-in Direct-IO mode (JournalOptions::direct(true)) routes appends through a sector-aligned in-memory log buffer, the architecture used by InnoDB's redo log and the WiredTiger journal. This mode trades the lock-free hot path for predictable tail latency and zero-copy device writes via O_DIRECT / F_NOCACHE / FILE_FLAG_NO_BUFFERING. 0.9.1 adds: a vectored JournalHandle::append_batch(&[&[u8]]) that submits N records as a single framed-write syscall (~1.6× faster than append-in-loop on Windows page cache; larger wins expected on Linux + NVMe); hardware-accelerated CRC-32C with runtime CPU-feature dispatch (SSE4.2 / ARMv8 CRC); cache-padded hot atomics; stack-allocated frame encoding for small records; and a parking_lot Condvar leader/follower group-commit coordinator with two new tuning knobs ported from emdb v0.8.5, JournalOptions::group_commit_window (default Some(500 µs)) and group_commit_max_batch (default 8).
Five real durability methods — Sync, Data, Mmap, Direct, and hardware-aware Auto. Every method is platform-honest: the actual primitive in use is observable via Handle::active_method() and Handle::active_durability_primitive().
Cross-platform IO semantics — one API surface across Linux, macOS, and Windows, with platform-specific fallbacks documented rather than hidden.
NVMe passthrough flush — on Linux (NVME_IOCTL_IO_CMD) and Windows (IOCTL_STORAGE_PROTOCOL_COMMAND) when the hardware supports it and the process has the privilege. Transparent fallback to fdatasync / WRITE_THROUGH otherwise.
Linux io_uring path — Method::Direct on Linux routes through io_uring when available (kernel ≥ 5.1, no SECCOMP/AppArmor block), falling back to O_DIRECT + pwrite + fdatasync cleanly.
Atomic replace-style writes — every public write API (write, write_copy, write_batch, Batch::commit) uses a temp-file + atomic rename pattern. The target file is either entirely the old payload or entirely the new payload — never torn.
Crash-safety verified — per-method crash tests with three kill points (pre-syscall, mid-syscall, post-syscall) and the 100× pre-merge stability protocol.
write_copy with metadata preservation — atomic-swap that preserves the target's existing mode (Unix), owner/group (Unix, when permitted), ACLs (Windows), and timestamps (all platforms).
Root-scoped handles — bind a Handle to a base directory and reject paths that escape it.
Full file and directory CRUD — write, read, append, positioned writes, range reads, truncate, rename, copy, metadata, sync, directory creation/removal, listing, recursive scan, glob find, and recursive count.
Batch operations — grouped writes, deletes, and copies through write_batch, delete_batch, copy_batch, and the chainable Batch builder.
Async layer with two substrates — gated behind the async Cargo feature. Every sync method gets an _async sibling. On Linux + Method::Direct, async ops submit directly to the per-handle io_uring ring (the native substrate, new in 0.7.0). Everywhere else, async ops route through tokio::task::spawn_blocking. Which substrate a handle uses is observable via Handle::async_substrate() -> AsyncSubstrate.
Configurable group lane — tune batch window, batch size, queue depth, io_uring queue depth, and aligned-buffer-pool size per handle.
Quick one-shot API — convenience helpers backed by a lazily initialized default handle for simple cases.
Structured error reporting — 21 explicit error variants with stable FS-XXXXX codes for unsupported methods, alignment failures, atomic-replace failures, NVMe passthrough denial, async-runtime requirements, glob-pattern errors, batch failure position, handle poisoning, io_uring submit failure, and completion-driver liveness.
Hardware-aware database surface (0.9.2) — Handle::is_plp_protected() / Handle::plp_status() for safe per-commit fsync skip on confirmed-PLP enterprise NVMe (3–10× transaction-throughput lever); crate::observer::FsysObserver trait + Builder::observer for typed per-op telemetry (journal append / sync / handle write / read); runtime CPU-feature detection (replacing pre-0.9.2 compile-time cfg!(target_feature = ...) that lied on cross-target builds); Builder::tune_for(Workload::Database) for one-line storage-engine tuning (8 MiB buffer pool, 256-deep io_uring ring, 4096-deep batch queue).
Pipeline throughput tier (0.9.3) — Builder::dispatcher_shards(N) spawns N independent dispatcher threads per handle, each with its own bounded queue; batches hash-routed by first op's path so within-batch order is preserved while concurrent submitters writing to different files scale near-linearly with shard count (was a one-core ceiling pre-0.9.3). Batch::commit_grouped() amortises parent-directory fsync across the entire batch — one syscall per unique parent directory instead of one per op — for bulk-load / SST-flush / checkpoint workloads where the batch is the durability unit.
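
The two ideas in the pipeline tier, routing a batch to a shard by the path of its first op and fsyncing each parent directory once per batch, can be illustrated with the standard library alone. The sketch below is not fsys code: `shard_for` and `unique_parents` are hypothetical helper names, and the hash function fsys actually uses is not documented in this README.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Pick a shard for a batch from the path of its first operation.
/// Hypothetical helper: the same path always maps to the same shard, so all
/// batches that start with that path are handled by one dispatcher.
pub fn shard_for(first_path: &Path, shard_count: usize) -> usize {
    assert!(shard_count > 0, "need at least one shard");
    let mut hasher = DefaultHasher::new();
    first_path.hash(&mut hasher);
    (hasher.finish() % shard_count as u64) as usize
}

/// Collect the distinct parent directories touched by a batch, so that each
/// directory needs to be fsynced once rather than once per operation.
pub fn unique_parents(paths: &[PathBuf]) -> BTreeSet<PathBuf> {
    paths
        .iter()
        .filter_map(|p| p.parent().map(Path::to_path_buf))
        .collect()
}
```

For a batch that writes `db/sst/0001.sst`, `db/sst/0002.sst`, and `db/wal/0001.log`, `unique_parents` yields two directories, so a grouped commit in this style would issue two directory syncs instead of three.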

Installation

[dependencies]
fsys = "0.9.3"

To opt into the async layer:

[dependencies]
fsys = { version = "0.9.3", features = ["async"] }

Cargo features

Feature	Default	Pulls in	Purpose
`async`	off	`tokio` (`rt`, `rt-multi-thread`, `sync`, `macros`)	`_async` siblings for every sync method; async batch via `tokio::sync::oneshot`.
`stress`	off	(none)	Switches the soak tests in `tests/stress.rs` from a 60-second validation run to the full 1-hour soak duration. CI nightly enables this; dev iteration leaves it off.
`fuzz`	off	(none)	Compile-only flag for fuzz instrumentation. The actual fuzz targets live in `fuzz/` (separate `cargo-fuzz` workspace).

Minimum supported Rust version

1.75. The MSRV may be raised in any minor version before 1.0.0. After 1.0.0, MSRV bumps require a minor version bump.

Benchmark results

Numbers below were captured on windows-ntfs-nvme (Windows 11 Pro, x86_64, local NVMe SSD; std::env::temp_dir() resolves to NTFS) with 100 timed iterations after 10 warmup. Run-to-run noise is roughly ±5 % on this host class. The full methodology, additional payload sizes, and Linux numbers live in docs/BENCH.md; reproduce locally with cargo bench.

Journal substrate vs atomic-replace — the headline 0.9.0 result. Atomic-replace pays 5–7 syscalls per durable write; the journal opens once, appends without per-call fsync, and amortises durability across a sync_through call.

Payload	Atomic-replace	Journal (sync at end)	Speedup
64 B	634 ops/s	462.9 K ops/s	730×
4 KiB	891 ops/s	189.3 K ops/s	212×

The "sync at end" cadence is the canonical WAL pattern: append many records, fsync once at a transaction boundary. At an intermediate cadence (sync every 100 appends), the journal still delivers 109–255× the atomic-replace throughput. See docs/BENCH.md for the full table including the per-append sync cadence.
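
To make the "sync at end" cadence concrete, here is a minimal standard-library sketch of the pattern: frame each record, append without syncing, then pay for durability once. It does not use the fsys journal API. The frame layout (`[len][crc32c][payload]`) and the names `crc32c`, `encode_frame`, and `append_all_then_sync` are assumptions made for illustration; the real fsys frame format is described in the crate's own docs.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
/// Slow but dependency-free; production code would use a table or hardware CRC.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all ones when the low bit is set
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Hypothetical frame layout: [len: u32 LE][crc32c(payload): u32 LE][payload].
/// Assumes the payload is shorter than 4 GiB.
pub fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut frame = Vec::with_capacity(8 + payload.len());
    frame.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    frame.extend_from_slice(&crc32c(payload).to_le_bytes());
    frame.extend_from_slice(payload);
    frame
}

/// Append every record without syncing, then make the whole group durable
/// with a single `sync_data` call. Returns the number of bytes appended.
pub fn append_all_then_sync(path: &Path, records: &[&[u8]]) -> io::Result<u64> {
    let mut log: File = OpenOptions::new().create(true).append(true).open(path)?;
    let mut written = 0u64;
    for record in records {
        let frame = encode_frame(record);
        log.write_all(&frame)?;
        written += frame.len() as u64;
    }
    log.sync_data()?; // one durability point for the whole group
    Ok(written)
}
```

The cost of `sync_data` is shared by every record in the group, which is where the gap to per-write atomic replace in the table above comes from. A reader can recompute the checksum of each frame to detect a torn tail after a crash.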

Atomic-replace write vs std::fs::write — fsys trades a slower small-write median for a tighter tail. std::fs::write wins the median on small payloads because it provides no durability guarantee.

Payload	`fsys::Auto` median / p99	`std::fs::write` median / p99
4 KiB	1.08 ms / 4.69 ms	218.7 µs / 7.18 ms
64 KiB	1.23 ms / 5.50 ms	4.48 ms / 5.47 ms
1 MiB	1.80 ms / 5.00 ms	2.84 ms / 16.45 ms

std::fs::write is ~5× faster than fsys::Auto at the 4 KiB median because it skips the fsync + atomic-rename cycle. At p99 the gap inverts: fsys::Auto is 3.3× faster than std::fs::write at 1 MiB because the durability cost is paid deterministically rather than deferred to OS scheduling. The fair comparison for durable writes is fsys::Sync versus std::fs plus a manual temp-file + sync_all + rename dance — the latter is what most application code gets wrong.
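
For reference, the manual temp-file + sync_all + rename sequence mentioned above looks roughly like this with std::fs alone. This is a sketch of the general pattern, not of fsys internals: `atomic_replace`, `write_and_swap`, and `sync_dir` are hypothetical names, and the directory sync is only attempted on Unix, where a directory can be opened as a `File`.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Flush a directory-entry change (the rename) to disk. Only attempted on
/// Unix; on other platforms this sketch does nothing.
#[cfg(unix)]
fn sync_dir(dir: &Path) -> io::Result<()> {
    File::open(dir)?.sync_all()
}

#[cfg(not(unix))]
fn sync_dir(_dir: &Path) -> io::Result<()> {
    Ok(())
}

/// Write the payload to a temp file, flush it, then rename it over `target`.
fn write_and_swap(tmp: &Path, target: &Path, dir: &Path, payload: &[u8]) -> io::Result<()> {
    let mut file = File::create(tmp)?;
    file.write_all(payload)?;
    file.sync_all()?; // data and metadata must be durable before the rename
    drop(file);
    fs::rename(tmp, target)?; // replaces the target in one step
    sync_dir(dir) // make the rename itself durable
}

/// Replace `target` with `payload` using the temp-file + rename pattern.
pub fn atomic_replace(target: &Path, payload: &[u8]) -> io::Result<()> {
    let name = target
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "target has no file name"))?;
    // An empty parent means "current directory".
    let dir: PathBuf = match target.parent() {
        Some(p) if !p.as_os_str().is_empty() => p.to_path_buf(),
        _ => PathBuf::from("."),
    };
    // Keep the temp file in the same directory so the rename stays on one filesystem.
    let mut tmp_name = name.to_os_string();
    tmp_name.push(format!(".tmp.{}", std::process::id()));
    let tmp = dir.join(tmp_name);

    let result = write_and_swap(&tmp, target, &dir, payload);
    if result.is_err() {
        let _ = fs::remove_file(&tmp); // best-effort cleanup of the partial temp file
    }
    result
}
```

The steps that are easy to forget are the `sync_all` before the rename and the directory sync after it. Skipping either can leave an empty or missing file after a power loss even though the rename itself succeeded.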

Read parity — the read path is essentially std::fs::read plus handle bookkeeping.

Payload	`fsys::Auto` median / p99	`std::fs::read` median / p99	`tokio::fs::read` median / p99
4 KiB	25.0 / 89.4 µs	23.7 / 77.1 µs	35.8 / 152.8 µs
64 KiB	25.0 / 58.9 µs	24.1 / 64.0 µs	105.9 / 337.5 µs
1 MiB	182.5 / 482.3 µs	189.0 / 327.4 µs	250.7 / 585.8 µs

tokio::fs::read (simulated via spawn_blocking, which is what tokio's own fs module does internally) is roughly 1.3–4.4× slower than std::fs::read at the median because of the thread-pool hop. On Linux + Method::Direct + the async feature, fsys's native io_uring substrate bypasses that hop entirely; see docs/BENCH.md for the WSL2 measurement.

Documentation

API reference: https://docs.rs/fsys
Architecture overview: docs/ARCHITECTURE.md
Runnable examples (16): docs/EXAMPLES.md — catalogues every example in examples/ with a "when to use this pattern" guide.
Method matrix and Auto decision ladder: docs/METHODS.md
Performance targets and tuning: docs/PERFORMANCE.md
Crash-safety contract per method: docs/CRASH-SAFETY.md
Per-platform behavior + capability requirements: docs/PLATFORM-NOTES.md
API stability + breaking-change policy: see the "Stability + breaking-change policy" and "API changes in 0.9.0" sections of docs/API.md. Per-version migration deltas live in CHANGELOG.md.

Coming Soon...

LICENSE

Licensed under the Apache License, Version 2.0 [ LICENSE-APACHE ] or the MIT License [ LICENSE-MIT ], together referred to as the "License Agreement". You are permitted to use this software, its source code, documentation, concepts, and any associated contents within the limitations defined by the License Agreement.