btrfs-fs 0.13.0

High-level filesystem API on top of btrfs-disk: lookup, readdir, read, xattr.
# btrfs-fs: Userspace btrfs Filesystem Crate

## Goal

A high-level Rust filesystem API on top of `btrfs-disk` (read) and
`btrfs-transaction` (write), exposed as `Filesystem<R>` with all the
operations a userspace driver needs: `lookup`, `readdir`, `read`,
`write`, `getattr`, `xattr_*`, plus btrfs-specific operations like
subvolume creation and send stream generation through ioctl
passthrough.

The crate is the substrate for `btrfs-fuse` (the FUSE driver) and any
other embedder that wants to read or write a btrfs filesystem without
talking to the kernel — offline tools, tests, alternate FUSE bindings,
network-mounted images.

## Design principles

- **FUSE-independent.** The crate exposes plain `io::Result` /
  `Filesystem` ops; nothing depends on `fuser`. The FUSE protocol
  mapping (inode translation, `Stat` → `FileAttr`, `reply.*`) lives
  in `btrfs-fuse`. New embedders depend on `btrfs-fs` directly.

- **Async API from F2 onwards.** All ops become `async fn`. Sync
  internals get wrapped in `tokio::task::spawn_blocking` until we
  have an async I/O backend. The `Filesystem` handle is `Clone` (cheap
  `Arc` bump) so multiple worker tasks can drive it. The FUSE adapter
  spawns a tokio task per callback, moves the `Reply*` handle in,
  awaits the async op, and replies from the task — no FUSE worker
  thread blocked on disk I/O.

- **Single source of truth: `btrfs-disk` for reads, `btrfs-transaction`
  for writes.** No parsing or write-path logic re-implemented at this
  layer. `btrfs-fs` composes the lower-level primitives into
  filesystem-level operations and adds caching, dirty tracking, and
  multi-subvolume bookkeeping.

- **Inode is `(SubvolId, ino)`.** Multi-subvolume support is the
  default mental model from the start, even before F5 implements
  crossing. FUSE adapters translate to a flat `u64` at the boundary.

- **Cache hits skip the I/O path.** `Filesystem` operations consult
  shared in-memory LRU caches first (`Mutex<LruCache<...>>` as of F3,
  since an LRU lookup also updates recency order); cache misses fall
  back to a serialized I/O path. The split is internal: the `&self`
  API surface stays the same as the cache layer evolves.

- **Happy path only.** No degraded RAID mounts, no partial-recovery
  modes. If `btrfs-disk` can open the filesystem, `btrfs-fs` operates
  on it; if not, it errors out. Recovery tooling lives in
  `btrfs cli` (e.g. `btrfs rescue`).

- **Correctness over performance.** Especially for the write path.
  Cross-validation with kernel btrfs (`btrfs check` after a fuse
  session, mount-as-kernel after fuse modifications) is the
  acceptance test for write phases.

- **Tests at every phase.** Unit tests for pure logic, integration
  tests against `mkfs.btrfs --rootdir` fixture images (unprivileged),
  and from F12 onward cross-validation against kernel btrfs.
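
The `(SubvolId, ino)` principle above implies a translation table at the FUSE boundary. A minimal sketch of one, assuming a hypothetical `InodeTable` and stand-in `SubvolId` / `Inode` definitions (the crate's real field layout and the adapter's real table are not shown in this plan):

```rust
use std::collections::HashMap;

/// Stand-in types for illustration; the real definitions live in `inode.rs`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct SubvolId(pub u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Inode {
    pub subvol: SubvolId,
    pub ino: u64,
}

/// Hypothetical bidirectional map between `(SubvolId, ino)` and flat FUSE numbers.
pub struct InodeTable {
    fwd: HashMap<Inode, u64>,
    rev: HashMap<u64, Inode>,
    next: u64,
}

impl InodeTable {
    /// FUSE reserves inode 1 for the mount root.
    pub const FUSE_ROOT: u64 = 1;

    /// Pin the FUSE root to whatever subvolume root the mount was opened with.
    pub fn new(root: Inode) -> Self {
        let mut t = Self { fwd: HashMap::new(), rev: HashMap::new(), next: 2 };
        t.fwd.insert(root, Self::FUSE_ROOT);
        t.rev.insert(Self::FUSE_ROOT, root);
        t
    }

    /// Return the flat number for `inode`, allocating one on first sight.
    pub fn to_fuse(&mut self, inode: Inode) -> u64 {
        if let Some(&n) = self.fwd.get(&inode) {
            return n;
        }
        let n = self.next;
        self.next += 1;
        self.fwd.insert(inode, n);
        self.rev.insert(n, inode);
        n
    }

    /// Reverse lookup for an incoming FUSE request.
    pub fn from_fuse(&self, n: u64) -> Option<Inode> {
        self.rev.get(&n).copied()
    }
}
```

Allocating on first sight keeps the flat numbers stable for the lifetime of the mount without having to encode the subvolume id into the `u64` itself.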

## Existing infrastructure we build on

From `btrfs-disk`:
- `BlockReader<R>` with `read_data`, `read_tree_block`
- `filesystem_open()` → `OpenFilesystem` (superblock + chunk cache + root map)
- `TreeBlock` (Node/Leaf), `Header`, `Item`, `DiskKey`
- `tree_walk()` / `tree_walk_tolerant()` with visitor callbacks
- All on-disk item parsers (`InodeItem`, `DirItem`, `FileExtentItem`, ...)
- `Superblock` parsing
- `ChunkTreeCache` for logical→physical resolution

From `btrfs-transaction` (used from F9 onward):
- `Transaction` with `commit()`, B-tree CoW, delayed refs
- High-level helpers: `create_inode`, `link_dir_entry`, `set_xattr`,
  `write_file_data`, `insert_inline_extent`, `set_root_readonly`,
  `set_default_subvol`, `insert_root_ref`, `reserve_data_extent`
- `Filesystem::create_subvolume_shape` for new subvolume bootstrap
- Free-space tree, block group accounting, csum tree updates

From `btrfs-stream`:
- `StreamReader` and TLV command/attribute encoding (used in F7 send)

## Crate structure (target)

```
fs/
  Cargo.toml          # btrfs-fs, MIT/Apache-2.0
  src/
    lib.rs            # public API re-exports
    filesystem.rs     # Filesystem<R>, Inner<R>, ops
    inode.rs          # Inode, SubvolId types
    dir.rs            # Entry, FileKind
    stat.rs           # Stat
    read.rs           # extent resolution, decompression
    xattr.rs          # xattr enumeration / lookup
    cache/            # tree-block, inode, extent-map caches (F3)
    subvol.rs         # multi-subvol traversal (F5)
    ioctl/            # FUSE_IOCTL decode + dispatch (F6, F11)
    send.rs           # send stream generation (F7)
    write/            # write-path operations (F9-F11)
      mod.rs
      tx.rs           # TxnHandle wrapping btrfs-transaction
      ops.rs          # POSIX ops: create, write, truncate, ...
  tests/
    basic.rs          # F1: read-path integration tests
    compression.rs    # F4: zlib/zstd/LZO sweep
    multisubvol.rs    # F5: subvol traversal
    ioctl.rs          # F6/F11: ioctl behavior
    send.rs           # F7: send stream round-trips
    write.rs          # F10: POSIX write ops
    durability.rs     # F12: cross-validation with kernel mount
```

## Phases

Each phase ends with a green test suite and a single commit. Tests
land *with* the feature, not after.

### F1 — Crate extraction ✅

Done. `btrfs-fs` carved out of `btrfs-fuse`. `Filesystem<R>` exposes
read ops; 19 read-path integration tests pass; fuse shrinks to a thin
adapter.

### F1.5 — `&self` / `Arc<Inner>` handle ✅

Done. `Filesystem<R: Read + Seek + Send>` is `Clone` (cheap `Arc`
bump). All ops take `&self`. Fuse adapter loses its outer Mutex.
Compile-time `Send + Sync` assertion + multithread test.

### F2 — Async refactor ✅

Done. `Filesystem<R>` ops are all `async fn`. Sync I/O wrapped in
`tokio::task::spawn_blocking`; sync `Mutex` held only inside the
blocking task, never across `.await`. Bound: `R: Read + Seek + Send +
'static`.

`btrfs-fuse` carries an internal multi-thread tokio runtime; each
FUSE callback spawns a task that owns the `Reply*`, awaits the async
op, and replies from the task. FUSE worker threads return
immediately.

All 19 read-path tests under `#[tokio::test]`. New
`concurrent_async_reads` test spawns one task per fixture entry on a
4-worker multi-thread runtime and verifies parallel
`lookup → read → getattr` chains all complete correctly. Compile-time
`Send + Sync` assertion still in place.

Native async I/O remains out of scope (deferred to F8 if profiling
justifies it).

### F3 — Caches ✅

Done. Three caches sit on the read path:

- `LruTreeBlockCache`: `Mutex<LruCache<u64, Arc<TreeBlock>>>` keyed
  by logical address, plugged into `BlockReader` via the
  `TreeBlockCache` trait added to `btrfs-disk`. Default 4096 entries
  (~64 MiB).
- `InodeCache`: `Mutex<LruCache<Inode, Arc<InodeItem>>>`, populated
  on `lookup` / `getattr` / `read_inode_item` / `readlink`. Default
  4096 entries.
- `ExtentMapCache`: `Mutex<LruCache<Inode, Arc<ExtentMap>>>` built
  lazily on first `read` of a file; subsequent reads skip the FS
  tree walk entirely. Default 1024 entries.

`Mutex` rather than `RwLock` because LRU mutation happens on every
access (touching MRU order) — even a "read" needs exclusive access
to the cache structure.

`Filesystem::tree_block_cache_stats() -> CacheStats` exposes lock-free
atomic hit/miss/insertion counters for tests and observability.
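
To illustrate why a lookup needs exclusive access, here is a toy LRU behind a `Mutex` with atomic hit/miss counters kept outside the lock. It is a sketch only: `Lru` and `SharedCache` are hypothetical names, eviction is a linear scan, and the real caches use an `LruCache` type rather than this hand-rolled map.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Toy LRU: each entry remembers the tick of its last use.
struct Lru<K, V> {
    cap: usize,
    tick: u64,
    map: HashMap<K, (Arc<V>, u64)>,
}

impl<K: Hash + Eq + Clone, V> Lru<K, V> {
    /// `&mut self` on a lookup: a hit refreshes the entry's recency.
    fn get(&mut self, key: &K) -> Option<Arc<V>> {
        self.tick += 1;
        let tick = self.tick;
        self.map.get_mut(key).map(|entry| {
            entry.1 = tick;
            Arc::clone(&entry.0)
        })
    }

    fn put(&mut self, key: K, value: Arc<V>) {
        self.tick += 1;
        if !self.map.contains_key(&key) && self.map.len() >= self.cap {
            // Evict the entry with the oldest tick (O(n), fine for a sketch).
            let victim = self.map.iter().min_by_key(|(_, entry)| entry.1).map(|(k, _)| k.clone());
            if let Some(victim) = victim {
                self.map.remove(&victim);
            }
        }
        self.map.insert(key, (value, self.tick));
    }
}

/// Shared cache: the LRU sits behind a `Mutex`, the counters do not.
pub struct SharedCache<K, V> {
    inner: Mutex<Lru<K, V>>,
    hits: AtomicU64,
    misses: AtomicU64,
}

impl<K: Hash + Eq + Clone, V> SharedCache<K, V> {
    pub fn new(cap: usize) -> Self {
        Self {
            inner: Mutex::new(Lru { cap, tick: 0, map: HashMap::new() }),
            hits: AtomicU64::new(0),
            misses: AtomicU64::new(0),
        }
    }

    pub fn get(&self, key: &K) -> Option<Arc<V>> {
        let found = self.inner.lock().unwrap().get(key);
        let counter = if found.is_some() { &self.hits } else { &self.misses };
        counter.fetch_add(1, Ordering::Relaxed);
        found
    }

    pub fn insert(&self, key: K, value: V) {
        self.inner.lock().unwrap().put(key, Arc::new(value));
    }

    /// Returns `(hits, misses)` without taking the cache lock.
    pub fn stats(&self) -> (u64, u64) {
        (self.hits.load(Ordering::Relaxed), self.misses.load(Ordering::Relaxed))
    }
}
```

With an `RwLock`, `get` could not update `entry.1` under a read guard, so every lookup would need the write guard anyway.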

The trait + invalidation methods are wired up but the generation
counter for transaction-commit invalidation is deferred to F9 (no
write path yet, no invalidation yet).

Out of scope (deferred): persistent cache (across `Filesystem`
instances), benchmarks under `fs/benches/`. The two-test cache suite
(`fs/tests/cache.rs`) verifies effectiveness directly via the stats
API rather than via timing.

### F4 — Compression test sweep ✅

Done. 33 tests in `fs/tests/compression.rs` (11 per algorithm × 3
algorithms via `compression_suite!` macro). Per-algorithm fixture
built once via `mkfs.btrfs --rootdir --compress <algo>` and shared
across the suite via `OnceLock`.

Coverage per algorithm:
- inline compressed extent (full read + partial-with-offset)
- regular compressed extent on highly-compressible 1 MiB zeros
  (full + partial-offset that lands inside a 128 KiB chunk)
- regular extent on 1 MiB pseudo-random bytes (incompressible —
  exercises the "compress flag set, but extent says None" path)
- 16 MiB multi-extent file with a per-MiB byte pattern (full +
  straddling read across both an inter-extent boundary and a 128 KiB
  internal compression-chunk boundary + last-byte read)
- read at EOF / past EOF returns empty

Bugs the sweep caught:
- zstd: `bulk::decompress` rejects trailing bytes after the first
  zstd frame, so multi-frame compressed extents (anything > 128 KiB
  uncompressed) failed. Switched to the streaming decoder.
- inline compressed extents: the read range math was clamped against
  `inline_size` (on-disk compressed length) rather than `ram_bytes`
  (logical length), so any read of a compressed inline extent
  returned a slice too short. Fixed.

LZO had no new bugs — `decompress_lzo` survived the sweep with all
its per-sector framing edge cases handled correctly.
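
The inline-extent fix comes down to which length the read range is clamped against. A sketch of the corrected math, with hypothetical helper names (`clamp_read`, `read_inline`) rather than the crate's actual functions:

```rust
use std::ops::Range;

/// Clamp a read request against the extent's logical length (`ram_bytes`),
/// not its on-disk compressed length.
pub fn clamp_read(offset: u64, len: usize, ram_bytes: u64) -> Range<usize> {
    let start = offset.min(ram_bytes);
    let end = offset.saturating_add(len as u64).min(ram_bytes);
    start as usize..end as usize
}

/// Slice the already-decompressed inline payload for the requested range.
pub fn read_inline(decompressed: &[u8], offset: u64, len: usize) -> &[u8] {
    &decompressed[clamp_read(offset, len, decompressed.len() as u64)]
}
```

For a 100-byte file stored as a 10-byte compressed inline extent, a read at offset 50 clamps to `50..100`; clamping against the compressed length of 10 would have produced an empty range instead. Reads at or past EOF come back empty in both cases.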

### F5 — Multi-subvolume traversal ✅

Done. `Filesystem::lookup` detects subvolume crossings (DirItem with
`location.key_type == ROOT_ITEM`) and returns an `Inode` carrying
the new subvol id and objectid 256. Reads, readdir, readlink, and
xattr ops automatically follow into the new subvolume's tree via
`tree_root_for(subvol)`. The `..` synthesised at subvolume roots
resolves via `ROOT_BACKREF` in the root tree.

`Filesystem::list_subvolumes() -> Vec<SubvolInfo>` walks the root
tree, returning id, parent, name, ctime, generation, and read-only
flag for every subvolume. System trees are filtered out via
`is_subvolume_id` (id == 5 OR 256 ≤ id ≤ u64::MAX - 256).
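
The filter is small enough to spell out. A sketch using the bounds exactly as stated above; the constant names are illustrative and may not match the crate's:

```rust
/// Tree id of the top-level FS tree.
pub const FS_TREE_OBJECTID: u64 = 5;
/// First id available to user subvolumes.
pub const FIRST_FREE_OBJECTID: u64 = 256;
/// Upper bound as stated in this plan.
pub const LAST_SUBVOL_ID: u64 = u64::MAX - 256;

/// True for the FS tree and user subvolumes, false for system trees.
pub fn is_subvolume_id(id: u64) -> bool {
    id == FS_TREE_OBJECTID || (FIRST_FREE_OBJECTID..=LAST_SUBVOL_ID).contains(&id)
}
```

Low ids such as the root tree (1) and extent tree (2) fall below the range, and the system trees numbered down from `u64::MAX` fall above it.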

`Filesystem::open_subvol(reader, SubvolId)` opens with a non-default
subvolume as `root()`. `Filesystem::default_subvol() -> SubvolId`
exposes the choice.

Fixture: `mkfs.btrfs --rootdir --subvol sub1 --subvol sub1/nested
--subvol sub2`. Tests resolve names → ids dynamically via
`list_subvolumes` since mkfs id assignment isn't argument-order
deterministic. 9 tests cover lookup crossing, nested-subvol
crossing, readdir of subvol root, `..` resolving via ROOT_BACKREF
to FS_TREE, `..` resolving via ROOT_BACKREF to a non-default parent
subvol, list_subvolumes shape, open_subvol happy path,
open_subvol unknown id (NotFound), open_subvol invalid id
(InvalidInput).

Follow-up (now landed): `btrfs-fuse` exposes `--subvol PATH` and
`--subvolid ID`, mutually exclusive. `--subvol` resolves the path
against each subvolume's full parent-chain path; `--subvolid` takes
the tree id directly. `BtrfsFuse::open_subvol` is the matching
library entry point. The fuse adapter learned its
`mount_subvol` field at the same time — the FUSE root inode (`1`)
now maps onto whatever subvolume the `Filesystem` was opened with,
not unconditionally `SubvolId(5)`.

### F6.1 — Read-only ioctls (fixed-size) ✅

Done. `FUSE_IOCTL` plumbing landed in `btrfs-fuse` and three
fixed-size read-only ioctls dispatched through it:

- `BTRFS_IOC_FS_INFO` — superblock geometry, UUIDs, csum type
- `BTRFS_IOC_GET_FEATURES` — compat / compat_ro / incompat words
- `BTRFS_IOC_GET_SUBVOL_INFO` — full subvolume metadata

`btrfs-fs` grew the supporting `Filesystem::superblock()` and
`Filesystem::get_subvol_info(SubvolId)` accessors, and `SubvolInfo`
gained `dirid`/`uuid`/`parent_uuid`/`received_uuid`/`otime`/transids
(marked `#[non_exhaustive]` for future-proofing).

`fuse/src/ioctl.rs` re-derives the kernel ioctl numbers via const
`_IOR` helpers (bindgen doesn't expand the macro family) and
serialises responses into the on-disk C struct layout without
leaking `btrfs_disk::raw` types into the public API.
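
A sketch of what re-deriving the numbers can look like, assuming the asm-generic `_IOC` bit layout used on x86 and arm64 (some architectures split the bits differently). The function names are illustrative, not the ones in `fuse/src/ioctl.rs`:

```rust
const IOC_NRBITS: u32 = 8;
const IOC_TYPEBITS: u32 = 8;
const IOC_SIZEBITS: u32 = 14;

const IOC_NRSHIFT: u32 = 0;
const IOC_TYPESHIFT: u32 = IOC_NRSHIFT + IOC_NRBITS;
const IOC_SIZESHIFT: u32 = IOC_TYPESHIFT + IOC_TYPEBITS;
const IOC_DIRSHIFT: u32 = IOC_SIZESHIFT + IOC_SIZEBITS;

/// Direction bit for "kernel writes, userspace reads".
const IOC_READ: u32 = 2;

/// Largest argument size the 14-bit size field can encode.
pub const IOC_MAX_SIZE: usize = (1 << IOC_SIZEBITS) - 1;

/// Equivalent of the C `_IOC(dir, type, nr, size)` macro.
pub const fn ioc(dir: u32, ty: u8, nr: u8, size: usize) -> u32 {
    assert!(size <= IOC_MAX_SIZE, "argument struct exceeds the 14-bit size field");
    (dir << IOC_DIRSHIFT)
        | ((size as u32) << IOC_SIZESHIFT)
        | ((ty as u32) << IOC_TYPESHIFT)
        | ((nr as u32) << IOC_NRSHIFT)
}

/// Equivalent of the C `_IOR(type, nr, T)` macro, with `size_of::<T>()` passed in.
pub const fn ior(ty: u8, nr: u8, size: usize) -> u32 {
    ioc(IOC_READ, ty, nr, size)
}

/// Extract the size field from an ioctl number.
pub const fn ioc_size(cmd: u32) -> usize {
    ((cmd >> IOC_SIZESHIFT) & ((1 << IOC_SIZEBITS) - 1)) as usize
}

pub const BTRFS_IOCTL_MAGIC: u8 = 0x94;

/// `_IOR(BTRFS_IOCTL_MAGIC, 31, struct btrfs_ioctl_fs_info_args)`; the
/// args struct is 1024 bytes in the kernel uapi header.
pub const BTRFS_IOC_FS_INFO: u32 = ior(BTRFS_IOCTL_MAGIC, 31, 1024);
```

Because everything is `const fn`, an oversized struct fails at compile time rather than producing a silently wrong ioctl number.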

5 new tests in `fuse/tests/ioctl.rs`:
- 3 libc::ioctl-driven tests (one per ioctl)
- 1 unknown-ioctl test verifying `ENOTTY`
- 1 CLI E2E test that runs our `btrfs subvolume show` against the
  fuse mount — exercises `BTRFS_IOC_GET_SUBVOL_INFO` end-to-end
  through real CLI consumer code

### F6.2 — Read-only ioctls (fixed-size subset) ✅

Done. Two more ioctls landed on top of F6.1:

- `BTRFS_IOC_DEV_INFO` — per-device geometry. Returns the primary
  device's `dev_item` from the superblock; multi-device images need
  a dev-tree walk (deferred). Unknown devid returns ENODEV.
- `BTRFS_IOC_INO_LOOKUP` — `(treeid, objectid)` → path within the
  subvolume. Walks the `INODE_REF` chain upwards from `objectid`
  until the subvol root, with a 4096-iteration loop bound to defend
  against corrupted ref cycles. `treeid == 0` resolves against the
  file's containing subvolume.

`btrfs-fs` gained `Filesystem::dev_info(devid)` and
`Filesystem::ino_lookup(subvol, objectid)` plus a re-export of
`DeviceItem`.
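
The bounded upward walk can be sketched independently of the tree code. Here `first_ref` is a hypothetical callback standing in for the `INODE_REF` lookup, and the output format (components joined with `/`, no leading or trailing separator) is an assumption, not necessarily what `Filesystem::ino_lookup` returns:

```rust
use std::io;

/// Objectid of a subvolume's root directory (`BTRFS_FIRST_FREE_OBJECTID`).
pub const SUBVOL_ROOT_OBJECTID: u64 = 256;

/// Loop bound that defends against corrupted ref cycles.
const MAX_DEPTH: usize = 4096;

/// Walk parent refs from `objectid` up to the subvolume root and return the
/// path relative to that root. `first_ref(ino)` yields `(parent_dir, name)`.
pub fn ino_lookup<F>(mut objectid: u64, mut first_ref: F) -> io::Result<Vec<u8>>
where
    F: FnMut(u64) -> Option<(u64, Vec<u8>)>,
{
    let mut components: Vec<Vec<u8>> = Vec::new();
    for _ in 0..MAX_DEPTH {
        if objectid == SUBVOL_ROOT_OBJECTID {
            // Components were collected leaf-first; flip them into path order.
            components.reverse();
            return Ok(components.join(&b'/'));
        }
        let (parent, name) = first_ref(objectid)
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "missing INODE_REF"))?;
        components.push(name);
        objectid = parent;
    }
    Err(io::Error::new(io::ErrorKind::InvalidData, "INODE_REF chain too deep or cyclic"))
}
```

A cycle in the refs exhausts the loop bound and surfaces as `InvalidData` instead of hanging the caller.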

End-to-end CLI tests: `btrfs inspect-internal rootid <mount>` uses
`lookup_path_rootid` (which calls `BTRFS_IOC_INO_LOOKUP` with
objectid=`BTRFS_FIRST_FREE_OBJECTID`) and now succeeds against our
fuse mount, returning the default subvol id 5.

### F6.3 — Variable-size ioctls ✅ (with kernel-imposed scope limit)

`BTRFS_IOC_TREE_SEARCH` (v1, fixed-size 4096) is implemented and is
what the upstream `btrfs` CLI actually uses for `subvolume list`,
giving us a working E2E path. `Filesystem::tree_search(filter,
max_buf_size)` in `btrfs-fs` does the underlying tree walk with
compound-key range filtering (matching kernel semantics).

`BTRFS_IOC_GET_SUBVOL_ROOTREF` (fixed 4096) is implemented on top of
`tree_search` against the root tree; pages through children in
255-entry batches via `min_treeid` (matches the kernel ioctl).
`Filesystem::ino_paths(subvol, objectid) -> Vec<Vec<u8>>` is
exposed by `btrfs-fs` for embedders that want every hardlink path.

`BTRFS_IOC_TREE_SEARCH_V2` has a handler that returns
`IoctlOutcome::Retry`, but in practice it cannot complete — see
the FUSE_IOCTL_RETRY restriction below. Same story for
`BTRFS_IOC_INO_PATHS` and `BTRFS_IOC_LOGICAL_INO_V2`: not wired into
dispatch since the retry round-trip can't happen from a normal
libc `ioctl(2)` caller.

**FUSE_IOCTL_RETRY restriction.** Linux's `fuse_do_ioctl` only
accepts a `FUSE_IOCTL_RETRY` reply when the original request had
`FUSE_IOCTL_UNRESTRICTED` set. The standard
`fuse_file_ioctl` / `fuse_dir_ioctl` paths do not set that flag,
so user-space ioctls reaching us via libc get rejected with
`-EIO` after the first retry response — the kernel never re-issues
the call. Confirmed locally and corroborated by the `xfbs/fuser`
PR review. This means every variable-size btrfs ioctl that needs
retry to extend past the cmd-encoded 14-bit size is blocked at
the kernel boundary today.

Unblock options at the kernel layer:
1. Get the kernel to relax the restriction (unlikely; security).
2. Have the FUSE driver implement a CUSE-style init that opts the
   fd into `FUSE_IOCTL_UNRESTRICTED`. Requires plumbing CUSE_INIT
   in fuser (not implemented today; the upstream PR review noted
   this as a separate gap).
3. Skip fuser and roll our own FUSE protocol implementation that
   sets up the fd as unrestricted from the start.

We are not pursuing any of the kernel-layer fixes this cycle; instead
we route around the restriction at the userspace boundary (see F6.4).

### F6.4 — uapi-level fallback for FUSE-restricted ioctls ✅ (a landed; b/c future)

**Status:** F6.4a (the foundational ENOPROTOOPT contract +
`tree_search_auto` fallback) shipped in commits `6aa4016` and
`25a4af2`. The latter dropped the patched `xfbs/fuser` git
dependency entirely — `btrfs-fuse` is back on released
`fuser = "0.17"` from crates.io and is publishable again.

`ino_paths` and `logical_ino` fallbacks (F6.4b/c) are still
specced below but not implemented; the corresponding ioctls
return ENOPROTOOPT today and have no userspace fallback yet.

The kernel can't relax the retry restriction in our timeline, but
we own both ends of the call: our `btrfs` CLI calls the broken
ioctls through wrappers in `btrfs-uapi`, and our FUSE driver
chooses what each ioctl returns. Pair the two so the round trip
through libc → kernel → FUSE → uapi is self-healing.

**Signal.** For each ioctl that needs retry but can't get it, the
FUSE driver returns `ENOPROTOOPT` up front instead of attempting
`IoctlOutcome::Retry`. The semantic fit is "we recognise this
ioctl, just not in this protocol form" — i.e. not the
indirected/variable-size variant. The pragmatic reason for
`ENOPROTOOPT` specifically (vs. the more obvious `ENOTSUP`):
nothing else in the btrfs ioctl surface ever returns it, and
neither does the VFS for an unsupported op on the wrong fs type,
so it functions as a private channel. If uapi sees it from one of
these specific ioctls, that's overwhelmingly *our* FUSE driver
speaking — we don't risk falling back on a generic
"unsupported op" error from the kernel or another driver.
(`ENOTSUP` would also work; the choice is for clarity, not
correctness — the v1 fallback would surface a real error anyway
if the underlying fs weren't btrfs.)

**Fallback.** Each `btrfs-uapi` wrapper for a restricted-on-FUSE
ioctl catches `ENOPROTOOPT` from its first ioctl call and re-runs
the operation through composition of v1-/fixed-size ioctls that
the FUSE driver does support. The fallback path is a normal Rust
function over the existing wrappers — no new ioctl interfaces.
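
The catch-and-retry shape is the same for every wrapper. A minimal sketch, assuming a hypothetical `with_fuse_fallback` helper (the real `btrfs-uapi` wrappers may inline this logic) and hard-coding the errno value for common Linux targets where real code would use `libc::ENOPROTOOPT`:

```rust
use std::io;

/// `ENOPROTOOPT` on x86 and arm64 Linux; an assumption made to keep the
/// sketch dependency-free. Some architectures use a different value.
pub const ENOPROTOOPT: i32 = 92;

/// Run `primary`; if it fails with `ENOPROTOOPT`, run `fallback` instead.
/// Any other outcome, success or error, is passed through untouched.
pub fn with_fuse_fallback<T>(
    primary: impl FnOnce() -> io::Result<T>,
    fallback: impl FnOnce() -> io::Result<T>,
) -> io::Result<T> {
    match primary() {
        Err(e) if e.raw_os_error() == Some(ENOPROTOOPT) => fallback(),
        other => other,
    }
}
```

Matching on the raw errno rather than `io::ErrorKind` keeps the trigger narrow, so unrelated failures such as `EIO` still surface to the caller.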

**Per-ioctl plan:**

- `tree_search_v2(fd, filter, buf_size)` → on `ENOPROTOOPT`, call
  `tree_search` (v1) with the same filter. v1 paginates internally
  with a 4 KiB buffer; semantics are identical, only slower.

- `ino_paths(fd, inum)` → on `ENOPROTOOPT`:
  1. `lookup_path_rootid(fd)` to get the subvol id.
  2. `tree_search` for `objectid=inum, type ∈ {INODE_REF=12,
     INODE_EXTREF=13}`. For each ref extract `(parent_dirid,
     name)` (`INODE_REF`'s parent is `key.offset`; `INODE_EXTREF`
     stores it in the parsed struct).
  3. For each parent: `BTRFS_IOC_INO_LOOKUP(parent)` → path
     string (works on FUSE — fits in 4 KiB).
  4. Concat `parent_path + "/" + name` per link.

- `logical_ino` / `logical_ino_v2(fd, logical, ...)` → on
  `ENOPROTOOPT`:
  1. `tree_search` on tree id 2 (extent tree) for
     `objectid=logical, type ∈ {EXTENT_ITEM=168,
     METADATA_ITEM=169}`.
  2. Parse `EXTENT_ITEM` to enumerate inline backrefs
     (`EXTENT_DATA_REF`, `SHARED_DATA_REF`).
  3. Optionally walk standalone `EXTENT_DATA_REF_KEY=178` /
     `SHARED_DATA_REF_KEY=184` keys for the same logical addr
     when the inline backref pool is full.
  4. For each `EXTENT_DATA_REF`, emit `(inum, offset, root)`.
  5. `SHARED_DATA_REF` requires following the parent backref;
     skipping initially is reasonable.
  6. Needs an `ExtentItem` parser in `btrfs-disk` (likely a new
     module).

- `space_info` is the one read-side ioctl with no v1 fallback —
  the chunk tree it summarises isn't reachable through any
  fixed-size ioctl. Stays unsupported on FUSE for now; the user
  can read the backing image directly via `btrfs-disk` if they
  need this.

**Optional widening.** Other FUSE-btrfs implementations (none
exist today) wouldn't return `ENOPROTOOPT` — the kernel rejects
their retry response with `EIO` instead. If we ever care about
that case, widen the fallback trigger to `ENOPROTOOPT || EIO`,
accepting that genuine disk errors on those specific ioctls would
also trigger the fallback (low risk; the fallback would then
itself fail with a meaningful error).

**Effect on the fuser dependency.** With F6.4 in place, our CLI
never issues the broken ioctls against our FUSE mount, so our
FUSE driver never needs `ReplyIoctl::retry`. The git pin on
xfbs/fuser becomes unnecessary:

- Drop the `tree_search_v2` retry handler from
  `fuse/src/ioctl.rs` (no longer reachable from any consumer).
- Drop the `arg: u64` parameter use everywhere — none of the
  remaining handlers need it.
- Switch `fuse/Cargo.toml` back to released `fuser = "0.17"`
  from crates.io.
- Drop the `allow-git` entry in `deny.toml`.
- Re-enable `publish = true` on `btrfs-fuse`.

**Test plan.** Each shim gets a uapi-level integration test that
runs against our `btrfs-fuse` mount (currently fails with EIO;
passes after the shim). A hidden env var
`BTRFS_FORCE_FUSE_FALLBACK=1` lets the same test exercise the
fallback path against a kernel mount, where it's the only path
under test. Unit tests for the standalone parsers (extent-item
backrefs in particular).

**Recommended sequencing.** F6.4a: detection plumbing + `ENOPROTOOPT`
returns + `tree_search_v2` fallback (smallest, proves the
pattern). F6.4b: `ino_paths` fallback (~50 lines). F6.4c:
`logical_ino` fallback (~150 lines, needs extent-item backref
parser; defer if not needed by any current CLI command).

### F6.3-historical (blocker resolved)

**Scope:**
- `BTRFS_IOC_TREE_SEARCH_V2` — generic tree search; the args struct
  has a flexible `buf[0]` array that exceeds the 14-bit size encoded
  in the ioctl number
- `BTRFS_IOC_LOGICAL_INO_V2` — logical → inode (extent-tree walk);
  variable-size inodes buffer
- `BTRFS_IOC_INO_PATHS` — inode → all paths (hardlink resolution);
  variable-size paths buffer
- `BTRFS_IOC_GET_SUBVOL_ROOTREF` — subvol parent backrefs; 69 KiB
  fixed struct, but still exceeds the 14-bit cap and needs retry
- `BTRFS_IOC_LOGICAL_INO` (older variant) and other admin ioctls
  whose buffers exceed 16 383 bytes

**Blocker:** `fuser` 0.17's `ReplyIoctl` only exposes `ioctl(result,
data)` — there's no `retry(in_iovs, out_iovs)` method. Without it,
FUSE silently truncates input/output to the size encoded in the
ioctl number's 14-bit size field. We can land this after one of:

1. Upstreaming a `ReplyIoctl::retry(...)` to fuser
2. Forking fuser locally
3. Skipping fuser entirely with a custom FUSE protocol implementation

Test plan once unblocked: `btrfs subvolume list <fuse-mount>` for
TREE_SEARCH_V2, `btrfs inspect-internal inode-resolve <ino>` for
INO_PATHS, `btrfs filesystem show <fuse-mount>` for DEV_INFO (already
works with F6.2's fixed-size subset, but fuller multi-device coverage
needs a dev-tree walk).

**Out of scope:** write ioctls (F11), admin ioctls (balance, scrub,
qgroup) — those are kernel-managed operations; users run them
against a real kernel mount.

### F7 — Send stream generation (tier 1 ✅; tier 2/3 future)

Decomposed into four tiers, with tier 1 shipped and the rest
spec'd. Decomposition rationale and trade-offs are in the
session conversation; the short version is that, of three
orthogonal axes (full vs incremental, stream version, output
target), we prioritised the smallest end-to-end loop first.

**Tier 1 — full v1 sends ✅**

Shipped in commits `38446e6` (encoder), `6742cb8` (walker),
`904da27` (CLI). Surface:

- `btrfs_stream::StreamWriter<W>`: mirror image of `StreamReader`.
  Encodes any `StreamCommand` variant for v1/v2/v3; v2+ DATA
  attribute quirk handled. Roundtrips through the parser.
- `btrfs_fs::Filesystem::send(snapshot, output) -> Result<output>`:
  walks the subvolume tree path-first, emits per-inode
  Mkfile/Mkdir/Symlink/Mknod/Mkfifo/Mksock + xattrs + Write
  chunks (48 KiB cap for v1) + Truncate + Chown/Chmod/Utimes.
  Hardlinks beyond the first ref emit Link rather than
  re-creating. Subvolume crossings skipped. v1 stream only.
- `btrfs send --offline IMAGE [-f OUT]
  [--offline-subvol PATH | --offline-subvolid ID]`: bypasses
  kernel `BTRFS_IOC_SEND` entirely. No `CAP_SYS_ADMIN`, no kernel
  mount, works against FUSE-mounted images. Tier 1 limitations
  enforced via clap's `conflicts_with_all`: `-p`, `-c`,
  `--no-data`, `--proto`, `--compressed-data` all rejected in
  offline mode.
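
The 48 KiB cap on v1 Write commands amounts to a chunking loop over each file's data. A sketch with a hypothetical `write_chunks` helper (the walker's actual structure is not described here):

```rust
/// Maximum payload of a single v1 Write command, per the cap above.
pub const V1_MAX_WRITE: usize = 48 * 1024;

/// Split `data`, which starts at `file_offset`, into `(offset, payload)`
/// pairs that each fit into one v1 Write command.
pub fn write_chunks<'a>(
    file_offset: u64,
    data: &'a [u8],
) -> impl Iterator<Item = (u64, &'a [u8])> + 'a {
    data.chunks(V1_MAX_WRITE)
        .enumerate()
        .map(move |(i, chunk)| (file_offset + (i * V1_MAX_WRITE) as u64, chunk))
}
```

A 100 KiB range therefore becomes three commands of 48, 48, and 4 KiB, each carrying its own absolute file offset.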

Round-trip test (`send_offline_round_trips_through_kernel_receive`,
privileged) generates a stream offline → pipes to real kernel
`btrfs receive` → diffs file contents on the receive side.

**Tier 2 — incremental sends (future)**

`btrfs send --offline -p PARENT IMAGE`. Walker takes
`parent: Option<SubvolId>` and emits a coordinated diff between
the parent and snapshot trees (item-by-item at each
`(objectid, key_type, offset)` triple). Emits SNAPSHOT (not
SUBVOL), then only the deltas. Common operational use case
(rolling snapshot backups), so this is the natural next step.
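
Since both trees are sorted by key, the coordinated diff is essentially a merge walk. A sketch over in-memory item lists, with hypothetical `Key` / `Delta` / `diff_items` names; the real walker would stream items from the two trees and translate deltas into stream commands:

```rust
use std::cmp::Ordering;

/// `(objectid, key_type, offset)`; tuple ordering matches btrfs key order.
pub type Key = (u64, u8, u64);

#[derive(Debug, PartialEq, Eq)]
pub enum Delta<'a> {
    Added(Key, &'a [u8]),
    Removed(Key),
    Changed(Key, &'a [u8]),
}

/// Merge-walk two key-sorted item lists and report only the differences.
pub fn diff_items<'a>(parent: &[(Key, &'a [u8])], snap: &[(Key, &'a [u8])]) -> Vec<Delta<'a>> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < parent.len() && j < snap.len() {
        let (pk, pv) = parent[i];
        let (sk, sv) = snap[j];
        match pk.cmp(&sk) {
            // Key exists only in the parent: it was deleted.
            Ordering::Less => {
                out.push(Delta::Removed(pk));
                i += 1;
            }
            // Key exists only in the snapshot: it was created.
            Ordering::Greater => {
                out.push(Delta::Added(sk, sv));
                j += 1;
            }
            // Key exists in both: report it only if the payload differs.
            Ordering::Equal => {
                if pv != sv {
                    out.push(Delta::Changed(sk, sv));
                }
                i += 1;
                j += 1;
            }
        }
    }
    out.extend(parent[i..].iter().map(|&(k, _)| Delta::Removed(k)));
    out.extend(snap[j..].iter().map(|&(k, v)| Delta::Added(k, v)));
    out
}
```

Unchanged items cost one comparison each and emit nothing, which is what keeps an incremental stream proportional to the delta rather than to the snapshot size.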

**Tier 3 — v2 EncodedWrite passthrough (future)**

For compressed extents, emit `EncodedWrite` directly with the
on-disk compressed bytes rather than decompressing → re-emitting
as plain Write. Saves CPU + bandwidth on the receive side.
Requires reading raw compressed extent payloads (currently
btrfs-fs always decompresses).

**Tier 4 — clone sources (`-c`) and v3 verity (future)**

Lower priority; defer until tier 1/2/3 are solid.

**Online dispatch (future)**

The existing `btrfs send` (no `--offline`) goes through
`BTRFS_IOC_SEND`. A future enhancement: detect FUSE mounts (or
add an `--auto` flag) and route to the in-process path
transparently. Unblocks `btrfs send <fuse-mount>/snap` where
the ioctl currently fails.

### F8 — True parallel I/O

**Scope:**
- Replace `Mutex<BlockReader<R>>` with a small reader pool. For
  `R = File`: pool of `BlockReader<File>` instances each owning a
  `File::try_clone()` (cheap `dup(2)` on Linux). Each
  `spawn_blocking` task checks out a reader from the pool, runs its
  I/O, returns the reader.
- For `R != File` (test cursors etc.): a single mutex'd reader
  remains.
- Cache hits already avoid disk I/O thanks to F3; misses now run in parallel
  on different fds. `pread` on different fds is genuinely
  concurrent at the kernel level.
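
The checkout/return cycle can be sketched as a generic pool with an RAII guard. `Pool` and `Checkout` are hypothetical names; in the real design `T` would be a `BlockReader<File>` built from `File::try_clone()`, and the pool would sit behind an `Arc` so `spawn_blocking` tasks can own a handle to it:

```rust
use std::ops::{Deref, DerefMut};
use std::sync::{Condvar, Mutex};

/// Fixed set of reusable items; callers block when all are checked out.
pub struct Pool<T> {
    idle: Mutex<Vec<T>>,
    available: Condvar,
}

/// RAII guard: the item goes back to the pool when the guard is dropped.
pub struct Checkout<'a, T> {
    pool: &'a Pool<T>,
    item: Option<T>,
}

impl<T> Pool<T> {
    pub fn new(items: Vec<T>) -> Self {
        Self { idle: Mutex::new(items), available: Condvar::new() }
    }

    /// Take an item, waiting on the condvar if none is idle.
    pub fn checkout(&self) -> Checkout<'_, T> {
        let mut idle = self.idle.lock().unwrap();
        loop {
            if let Some(item) = idle.pop() {
                return Checkout { pool: self, item: Some(item) };
            }
            idle = self.available.wait(idle).unwrap();
        }
    }

    pub fn idle_count(&self) -> usize {
        self.idle.lock().unwrap().len()
    }
}

impl<T> Deref for Checkout<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.item.as_ref().expect("item is present until drop")
    }
}

impl<T> DerefMut for Checkout<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.item.as_mut().expect("item is present until drop")
    }
}

impl<T> Drop for Checkout<'_, T> {
    fn drop(&mut self) {
        if let Some(item) = self.item.take() {
            self.pool.idle.lock().unwrap().push(item);
            self.pool.available.notify_one();
        }
    }
}
```

The pool lock is held only for the pop and push, never during the I/O itself, so the readers run concurrently once checked out.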

**Test plan:**
- Benchmark: random reads from N tokio tasks; throughput should
  scale until disk QD saturates.
- Stress: 1000-task fan-out + 10× repeat, no deadlocks or
  wrong-data corruption.

**Out of scope:** native async I/O via `tokio-uring` / `monoio`. If
profiling shows `spawn_blocking` overhead is the bottleneck, do it
later — the API surface doesn't change.

### F9 — Write foundation

**Scope:**
- `Filesystem::open_rw(reader: R, writer: W) -> io::Result<Self>`.
  Replaces `OpenFilesystem<R>` with a `RwOpenFilesystem<R, W>`
  internally.
- `Filesystem::tx() -> TxnHandle` — opens a write transaction
  backed by `btrfs_transaction::Transaction`. Holds the write lock
  for the duration; commits on drop or explicit `commit().await`.
- Dirty inode tracking: `RwLock<HashMap<Inode, DirtyInode>>` on
  `Inner`. Populated by write ops, drained by commit.
- Cache invalidation: on commit, bump generation counter; readers
  see stale entries and re-fetch.
- Empty `TxnHandle` that just commits — no operations yet. Proves
  the plumbing.
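
Generation-based invalidation can be sketched as follows: each cache entry is tagged with the generation it was inserted under, and a commit bumps the counter instead of walking the cache. `GenCache` is a hypothetical name, and the `u64` key stands in for whatever the real caches key on:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Cache whose entries are valid only for the generation they were stored in.
pub struct GenCache<V> {
    generation: AtomicU64,
    entries: Mutex<HashMap<u64, (u64, V)>>,
}

impl<V: Clone> GenCache<V> {
    pub fn new() -> Self {
        Self { generation: AtomicU64::new(0), entries: Mutex::new(HashMap::new()) }
    }

    /// Store `value` tagged with the current generation.
    pub fn insert(&self, key: u64, value: V) {
        let current = self.generation.load(Ordering::Acquire);
        self.entries.lock().unwrap().insert(key, (current, value));
    }

    /// Return the value only if it was stored under the current generation;
    /// stale entries are dropped so the caller re-fetches from disk.
    pub fn get(&self, key: u64) -> Option<V> {
        let current = self.generation.load(Ordering::Acquire);
        let mut entries = self.entries.lock().unwrap();
        if let Some((stored, value)) = entries.get(&key) {
            if *stored == current {
                return Some(value.clone());
            }
        }
        entries.remove(&key);
        None
    }

    /// Called on transaction commit: invalidates every entry in O(1).
    pub fn commit(&self) {
        self.generation.fetch_add(1, Ordering::AcqRel);
    }
}
```

The trade-off is coarse invalidation: a commit that touches one inode still forces every cached entry to be re-fetched on next use.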

**Test plan:**
- Integration test: open rw, take tx, commit, verify generation
  bumped. No on-disk changes (tx was empty).
- `btrfs check` passes after empty commit (proves we don't corrupt
  anything by just opening for write).

### F10 — POSIX write operations

**Scope (one PR per group):**
- Directory ops: `create`, `mkdir`, `unlink`, `rmdir`
- File data: `write` (small inline, large extents), `truncate` (up
  and down)
- Naming: `rename`, `link`, `symlink`
- Metadata: `chmod`, `chown`, `utimens`
- Xattrs: `setxattr`, `removexattr`
- Each op: implementation in `fs/src/write/ops.rs` + test in
  `fs/tests/write.rs` + fuse adapter mapping in `fuse/src/fs.rs`.

**Test plan:**
- Per op: write via `btrfs-fs`, read back via `btrfs-fs`, assert.
- Per op: write via `btrfs-fs`, mount with kernel btrfs, read back,
  assert.
- `btrfs check` passes after every write op.

### F11 — Write ioctls

**Scope:**
- `BTRFS_IOC_SUBVOL_CREATE_V2` — create a new subvolume
- `BTRFS_IOC_SNAP_CREATE_V2` — create a snapshot
- `BTRFS_IOC_SNAP_DESTROY_V2` — delete a subvolume
- `BTRFS_IOC_FICLONE` / `FICLONERANGE` — reflink
- `BTRFS_IOC_DEFRAG_RANGE` — defrag (lower priority; can defer)
- `BTRFS_IOC_SET_FEATURES` — feature flag changes (compat_ro etc.)
- `BTRFS_IOC_SET_RECEIVED_SUBVOL` — used by `btrfs receive`
- inode flags ioctls: `FS_IOC_GETFLAGS` / `SETFLAGS`,
  `FS_IOC_FSGETXATTR` / `FSSETXATTR`

**Test plan:**
- Per ioctl: drive via `btrfs-fs` API in a tokio test;
  cross-validate with kernel btrfs by mounting after.
- End-to-end: `cp --reflink=always` against a fuse mount works
  (uses `FICLONE` under the hood).

### F12 — fsync semantics + cross-validation

**Scope:**
- `Filesystem::fsync(ino: Inode) -> io::Result<()>` /
  `fdatasync(ino)`.
- Tree-log integration if needed for performance (defer until
  benchmarks show it matters).
- Crash-safety harness: write a sequence of ops with an interrupt
  injected at random points; verify post-recovery state is
  consistent (no torn writes, all committed ops visible).
- Acceptance test: a corpus of write sequences (POSIX ops + btrfs
  ioctls), run via `btrfs-fs`, then `btrfs check` — must pass
  every time.
- Same-or-better: mount the resulting filesystem with kernel btrfs,
  read back, compare — must match what `btrfs-fs` would read.

**Out of scope:** O_DIRECT, mmap consistency. Those are FUSE-protocol
concerns; if they matter, they go in F13.

### F13 — Hardening + benchmarks + docs

**Scope:**
- Stress tests: large files, deep dir trees, snapshot during write,
  concurrent rw + snapshot.
- Benchmarks vs kernel btrfs on standard workloads (sequential
  read, random read, sequential write, metadata-heavy).
- Documentation: architecture overview, embedder examples (offline
  image inspection, custom filters, server-side embedder).
- Performance tuning per profiling.

## Time estimate

- F2–F8 (read fully): **6–10 weeks** of focused effort
- F9–F12 (write fully): **3–6 months**, depends heavily on
  `btrfs-transaction` stability
- F13: **as needed**

## Tracking

This file is the source of truth. Update it as phases land:
- Mark phases ✅ when complete
- Adjust scope/test plan if reality diverges
- Add follow-up issues at the bottom as they're discovered

When a phase introduces work outside this crate (e.g. F7 adds an
encoder to `btrfs-stream`), call it out in the phase scope.