btrfs-transaction
Atomic, COW-based modification of btrfs filesystem images.
Warning: This crate is experimental. It is a clean-room reimplementation of btrfs's transaction process. As such, it may not fully match the behaviour of the reference implementation. It also has not been tested extensively.
This crate sits on top of btrfs-disk and provides the write-side
machinery that the parsing crate deliberately doesn't: copy-on-write
of tree blocks, leaf insert/delete/split/balance, a delayed reference
queue, free space tree maintenance, and a Transaction type that
groups all of those into an atomic commit. It works on raw image
files or block devices and does not call any ioctls — it only needs
read/write/seek access to the filesystem bytes.
Part of the btrfsutils project.
What's implemented
Tree-block storage and COW
Filesystem: open an image, resolve tree roots, read tree blocks via the chunk tree, write blocks back to disk on all mirrors with correct CRC32C checksums.ExtentBuffer: in-memory tree block with leaf/node helpers for items, key/data offsets, and slot manipulation. Tracks dirty state and pending writes.cow_block: copy-on-write a tree block, allocating a new bytenr, clearing the WRITTEN/RELOC flags, and queuing add/drop delayed refs against the old and new addresses.
B-tree operations
search_slot: search for a key in any tree, with optional COW of the path. SupportsReadOnly,Insert(size), andDeleteintents; the latter two pre-COW and pre-balance the path so the caller can mutate the leaf in place.next_leaf: cursor advance across leaf boundaries.items::insert_item/del_items/update_item/shrink_item: in-place leaf payload manipulation that maintains the descending data offset invariant.- Split / balance: leaf and node split when an insert won't fit; push-left/push-right/merge to keep nodes within fill thresholds on delete.
BtrfsPath: per-level slot+buffer trace from the root to a target leaf, owned and released explicitly so the caller controls borrow scope.
Allocation
- Block group scan: load every
BLOCK_GROUP_ITEMfrom the block group tree (id 11) when present, otherwise from the extent tree (id 2). - Per-kind bump allocator:
Transaction::alloc_block(kind)with separate cursors forBlockGroupKind::MetadataandBlockGroupKind::System. Cursors are seeded by scanning the extent tree for free gaps inside block groups of the requested kind. Pinned blocks (freed earlier in the same transaction) are skipped to preserve crash consistency. alloc_tree_block(tree_id, level): routes the chunk tree (id 3) to a SYSTEM block group and every other tree to a metadata block group, registers a delayedadd_ref, and for SYSTEM allocations updates the superblock'ssys_chunk_arraybootstrap snippet so the next mount can resolve the new chunk root.
Delayed reference queue
DelayedRefQueue: batches reference count changes keyed byDelayedRefKey::Metadata { bytenr, owner_root, level }orDelayedRefKey::Data { bytenr, owner_root, owner_ino, owner_offset }. Add/drop pairs to the same key cancel.- Metadata flush: creates
METADATA_ITEM(skinny) orEXTENT_ITEMrecords withTREE_BLOCK_REFinline backrefs on positive deltas, deletes them on negative deltas, and threads byte deltas into per-block-groupbytes_usedaccounting. - Data flush (drop side): locates the matching
EXTENT_DATA_REFeither inline inside the parentEXTENT_ITEMor as a standalone item (with hash-collision walk forward), decrements both the inlinecountfield and the parentEXTENT_ITEM.refs, and on hitting zero deletes the entireEXTENT_ITEMand trims overlapping csum tree entries. - Csum tree maintenance: when a data extent is fully freed,
walks the csum tree once to collect every overlapping
EXTENT_CSUMitem, then in a second pass deletes whole-coverage items and re-inserts trimmed head/tail fragments under new keys.
Free space tree
- Incremental update:
update_free_space_treeconsumes the per-block-group range deltas accumulated during the delayed-ref flush and applies them to the on-disk FST, deleting and re-insertingFREE_SPACE_EXTENTitems and updating theFREE_SPACE_INFO.extent_count. Bitmap-layout block groups are refused with a clear error. FREE_SPACE_TREE_VALIDstays set: the convergence loop inTransaction::commitrunsflush_delayed_refs → update_root_items → snapshot_roots → update_free_space_treein that order so the FST root change is captured by the next pass and the on-disk FST stays consistent with the extent tree.
Transaction lifecycle
Transaction::start: bumps the in-memory generation, snapshots the current root pointers, and seeds the metadata allocator cursor.Transaction::commit: force-COWs the root tree (so every commit advancesheader.generation), runs the convergence loop to drain delayed refs, root item updates, and FST updates until stable, flushes every dirty block to disk via the chunk tree (writing all DUP/RAID1 mirrors), updates superblock fields and the rotating backup roots, and writes the superblock to all mirrors.Transaction::abort: restores the in-memory root pointer snapshot so the next transaction reads consistent on-disk state.
What's not yet implemented
- Data extent ref additions:
flush_delayed_refsonly handles the drop side for data refs (todo!on positive deltas). No current caller produces data ref additions. - New SYSTEM chunk allocation: if no existing SYSTEM block
group has free space,
ensure_in_sys_chunk_arraycannot carve out a new one. Bails cleanly. - Multi-device replication for chunk tree COW: tested only on single-device filesystems.
- Bitmap-layout free space tree: refused with a clear error.
Testing
Unit tests cover the leaf manipulation primitives, balance logic,
delayed ref merging, search/next_leaf, and the chunk array
bootstrap helpers (in btrfs-disk).
Integration tests build real filesystem images via mkfs.btrfs,
exercise the full COW pipeline through Transaction::commit, and
verify the result both by reopening the image with Filesystem
and by running btrfs check --readonly on it. The fixture image
under cli/tests/commands/fixture.img.gz is also used for
read-path and data-ref drop tests.
A regression suite hits historical bugs (cow pinning, sibling
generation, leaf data compaction, cascading splits, ...). A
proptest-based playground tree harness exercises insert/delete
sequences against an in-memory model and verifies that
Transaction::commit produces a structurally valid filesystem
on every step.
License
Licensed under either of Apache License, Version 2.0 or MIT license at your option.