btrfs-transaction
Atomic, COW-based modification of btrfs filesystem images.
This is a pre-1.0, experimental crate. It is a clean-room reimplementation of btrfs's read-write tree machinery and may have edge cases that testing doesn't cover. Do not use it on filesystems you care about without taking a backup first.
This crate sits on top of btrfs-disk and provides the write-side
machinery that the parsing crate deliberately doesn't: copy-on-write
of tree blocks, leaf insert/delete/split/balance, a delayed reference
queue, free space tree maintenance, and a Transaction type that
groups all of those into an atomic commit. It works on raw image
files or block devices and does not call any ioctls — it only needs
read/write/seek access to the filesystem bytes.
Part of the btrfsutils project.
What's implemented
Tree-block storage and COW
Filesystem: open an image, resolve tree roots, read tree blocks via the chunk tree, write blocks back to disk on all mirrors with correct CRC32C checksums.ExtentBuffer: in-memory tree block with leaf/node helpers for items, key/data offsets, and slot manipulation. Tracks dirty state and pending writes.cow_block: copy-on-write a tree block, allocating a new bytenr, clearing the WRITTEN/RELOC flags, and queuing add/drop delayed refs against the old and new addresses.
B-tree operations
search_slot: search for a key in any tree, with optional COW of the path. SupportsReadOnly,Insert(size), andDeleteintents; the latter two pre-COW and pre-balance the path so the caller can mutate the leaf in place.next_leaf: cursor advance across leaf boundaries.items::insert_item/del_items/update_item/shrink_item: in-place leaf payload manipulation that maintains the descending data offset invariant.- Split / balance: leaf and node split when an insert won't fit; push-left/push-right/merge to keep nodes within fill thresholds on delete.
BtrfsPath: per-level slot+buffer trace from the root to a target leaf, owned and released explicitly so the caller controls borrow scope.
Allocation
- Block group scan: load every
BLOCK_GROUP_ITEMfrom the block group tree (id 11) when present, otherwise from the extent tree (id 2). - Per-kind bump allocator:
Transaction::alloc_block(kind)with separate cursors forBlockGroupKind::Metadata,BlockGroupKind::System, andBlockGroupKind::Data. Cursors are seeded by scanning the extent tree for free gaps inside block groups of the requested kind. Pinned blocks (freed earlier in the same transaction) are skipped to preserve crash consistency. alloc_tree_block(tree_id, level): routes the chunk tree (id 3) to a SYSTEM block group and every other tree to a metadata block group, registers a delayedadd_ref, and for SYSTEM allocations updates the superblock'ssys_chunk_arraybootstrap snippet so the next mount can resolve the new chunk root.
Delayed reference queue
DelayedRefQueue: batches reference count changes keyed byDelayedRefKey::Metadata { bytenr, owner_root, level }orDelayedRefKey::Data { bytenr, owner_root, owner_ino, owner_offset }. Add/drop pairs to the same key cancel.- Metadata flush: creates
METADATA_ITEM(skinny) orEXTENT_ITEMrecords withTREE_BLOCK_REFinline backrefs on positive deltas, deletes them on negative deltas, and threads byte deltas into per-block-groupbytes_usedaccounting. - Data flush (drop side): locates the matching
EXTENT_DATA_REFeither inline inside the parentEXTENT_ITEMor as a standalone item (with hash-collision walk forward), decrements both the inlinecountfield and the parentEXTENT_ITEM.refs, and on hitting zero deletes the entireEXTENT_ITEMand trims overlapping csum tree entries. - Csum tree maintenance: when a data extent is fully freed,
walks the csum tree once to collect every overlapping
EXTENT_CSUMitem, then in a second pass deletes whole-coverage items and re-inserts trimmed head/tail fragments under new keys. - Data flush (add side): on a positive data delta, inserts a
EXTENT_ITEM(24-byte header + inline 29-byteEXTENT_DATA_REF= 53 bytes) and updates the per-block-groupbytes_useddelta plus the FST's allocated-range record.
Data extent write path
Transaction::alloc_data_extent(data, root, ino, file_offset): finds free space in aBlockGroupKind::Datablock group with sectorsize alignment, zero-padsdatato sectorsize, writes it viaBlockReader::write_block(fanning out to every stripe copy), queues a+1EXTENT_DATA_REFdelayed ref, and returns the allocated logical address.Transaction::insert_file_extent(tree, ino, offset, payload): inserts anEXTENT_DATAitem at(ino, EXTENT_DATA, offset).payloadis a serializedFileExtentItem(useto_bytes_regularfor non-inline extents orto_bytes_inlinefor inline-tail extents).Transaction::insert_csums(logical, on_disk_data): computes per-sector standard CRC32C of the on-disk bytes and insertsEXTENT_CSUMitems into the csum tree, splitting payloads that would exceed leaf capacity into multiple keyed items. Errors on non-CRC32C filesystems.Transaction::update_inode_nbytes(tree, ino, delta): signed in-place patch of the inode'snbytesfield at the fixed struct offset. Preserves all other fields (includingflags,rdev,sequence) that round-tripping viaInodeItemArgswould lose.Transaction::insert_inline_extent(tree, ino, file_offset, data): embedsdatadirectly in the FS tree leaf as an inlineEXTENT_DATAitem. No extent-tree entry, no csum entries.INODE.nbytesis bumped by the unaligned payload length.Transaction::write_file_data(tree, ino, file_offset, data, nodatasum, compression): high-level helper that picks inline vs regular based ondata.len()and the per-filesystem inline threshold (max_inline_data_size). For regular extents, splitsdatainto ≤1 MiB chunks, optionally compresses each chunk (zlib, zstd, or LZO; per-chunk fallback to raw when compression doesn't shrink), allocates each, inserts theEXTENT_DATAitem, computes csums (unlessnodatasum), and bumps the inode'snbytesby the logical sector-aligned size.try_compress(data, algorithm): free function that returns the inline-framed compressed bytes only when they shrink. For LZO this produces the inline framing format[4B total_len LE] [4B seg_len LE] [lzo bytes]; for zlib and zstd the raw compressor output is returned.
Inode and directory entry helpers
InodeArgs(crate::inode): full-fields counterpart tobtrfs_disk::items::InodeItemArgscarrying every on-disk inode field (flags,rdev,sequence, four distinct timestamps).InodeArgs::new(transid, mode)provides sensible defaults;with_uniform_time(ts)sets all four stamps for tests.Transaction::create_inode(tree, ino, args): insert anINODE_ITEMat(ino, INODE_ITEM, 0)from anInodeArgs.Transaction::link_dir_entry(tree, parent, child, name, file_type, dir_index, time): insert the three records that make a directory entry visible —INODE_REF,DIR_ITEM,DIR_INDEX— bump the parent dir'ssizeby2 * name.len()per the directory-isize convention, refreshtransid, and updatectime/mtime. Whenparentis the canonical subvolume root dir (BTRFS_FIRST_FREE_OBJECTID= 256), also mirror thesizeupdate into the matchingROOT_ITEM's embedded inode sobtrfs check's root-tree consistency check passes.Transaction::link_subvol_entry(tree, parent, subvol_id, name, dir_index, time): likelink_dir_entrybut for the parent's entry pointing at a subvolume root —DIR_ITEM+DIR_INDEXwith aROOT_ITEMlocation, noINODE_REF(subvol parent linkage usesROOT_REF/ROOT_BACKREFin the root tree instead). Pair withinsert_root_ref.Transaction::set_xattr(tree, ino, name, value): insert anXATTR_ITEMat(ino, XATTR_ITEM, name_hash(name))carrying(name, value). Same on-disk format asDIR_ITEMbut withFT_XATTRand a non-empty value.Transaction::set_inode_nlink(tree, ino, nlink): in-place patch of the inode'snlinkfield. Used by mkfs's hardlink fixup pass.try_compress_regular(data, algorithm, sectorsize): variant for the regular-extent write path. For LZO produces the per-sector framing format[4B total_len LE] { [4B seg_len LE] [lzo bytes] [zero pad] }*with sector-boundary padding and an early-exit heuristic that abandons after 4 sectors if the framed buffer is already past 3 sectors. For zlib and zstd delegates totry_compress.
Whole-tree creation and conversion
Transaction::create_empty_tree(tree_id): allocate one metadata block, initialise it as an empty level-0 leaf with the inherited fsid and chunk_tree_uuid, register the new(tree_id -> bytenr)mapping, and insert aROOT_ITEMinto the root tree. Foundation primitive for whole-tree creation (FS, csum, FST, BGT, UUID, quota, user subvolumes). Refuses bootstrap ids 0..=3.Transaction::insert_root_ref(parent_root, child_root, dirid, sequence, name): insert pairedROOT_REF(parent → child) andROOT_BACKREF(child → parent) records for a subvolume linkage.Transaction::set_root_readonly(tree_id): ORRootItemFlags::RDONLYinto aROOT_ITEM's flags. Used bymkfs --rootdir --subvol ro:.Transaction::set_default_subvol(subvol_id): upsert the"default"DIR_ITEMunderBTRFS_ROOT_TREE_DIR_OBJECTID(id 6). Used bymkfs --rootdir --subvol default:.Transaction::set_device_total_bytes(devid, new_total): patch a device'sDEV_ITEM.total_bytesin the chunk tree and mirror into the superblock's embeddeddev_item. Used bymkfs --rootdir --shrinkandrescue fix-device-size.convert::convert_to_free_space_tree/convert::convert_to_block_group_tree: top-level conversions invoked bybtrfs-tune. Per-step helpersconvert::seed_free_space_treeandconvert::create_block_group_treeare also exposed and used by mkfs'spost_bootstrap.
Free space tree
- Incremental update:
update_free_space_treeconsumes the per-block-group range deltas accumulated during the delayed-ref flush and applies them to the on-disk FST, deleting and re-insertingFREE_SPACE_EXTENTitems and updating theFREE_SPACE_INFO.extent_count. Bitmap-layout block groups are refused with a clear error. FREE_SPACE_TREE_VALIDstays set: the convergence loop inTransaction::commitrunsflush_delayed_refs → update_root_items → snapshot_roots → update_free_space_treein that order so the FST root change is captured by the next pass and the on-disk FST stays consistent with the extent tree.
Transaction lifecycle
Transaction::start: bumps the in-memory generation, snapshots the current root pointers, and seeds the metadata allocator cursor.Transaction::commit: force-COWs the root tree (so every commit advancesheader.generation), runs the convergence loop to drain delayed refs, root item updates, and FST updates until stable, flushes every dirty block to disk via the chunk tree (writing all DUP/RAID1 mirrors), updates superblock fields and the rotating backup roots, and writes the superblock to all mirrors.Transaction::abort: restores the in-memory root pointer snapshot so the next transaction reads consistent on-disk state.
What's not yet implemented
- Striped writes larger than
stripe_lenfor RAID0 / RAID5 / RAID6 data extents:BlockReader::write_blockroutes per-stripe via the chunk cache'splan_writeand handles every RAID profile correctly (parity-aware for RAID5 / RAID6, mirror pairs for RAID10), but a single buffer that spans more than one row would need per-stripe slicing in the executor. Tree blocks (always nodesize ≤ stripe_len) fit in one row; data extents larger thanstripe_lendo not. Defer until there's a concrete consumer. - New SYSTEM chunk allocation: if no existing SYSTEM block
group has free space,
ensure_in_sys_chunk_arraycannot carve out a new one. Bails cleanly. - Bitmap-layout free space tree: refused with a clear error.
Filesystem::create: a from-scratch primitive to replace mkfs's hand-built bootstrap. Would let mkfs delete its remainingLeafBuilder/TreeBuilder/BlockLayout/build_*machinery.
Testing
Unit tests cover the leaf manipulation primitives, balance logic,
delayed ref merging, search/next_leaf, and the chunk array
bootstrap helpers (in btrfs-disk).
Integration tests build real filesystem images via mkfs.btrfs,
exercise the full COW pipeline through Transaction::commit, and
verify the result both by reopening the image with Filesystem
and by running btrfs check --readonly on it. The fixture image
under cli/tests/commands/fixture.img.gz is also used for
read-path and data-ref drop tests.
A regression suite hits historical bugs (cow pinning, sibling
generation, leaf data compaction, cascading splits, ...). A
proptest-based playground tree harness exercises insert/delete
sequences against an in-memory model and verifies that
Transaction::commit produces a structurally valid filesystem
on every step.
License
Licensed under either of Apache License, Version 2.0 or MIT license at your option.