btrfs-transaction 0.13.0

Userspace transaction infrastructure for modifying btrfs filesystems
Documentation
# btrfs-transaction

Atomic, COW-based modification of btrfs filesystem images.

> This is a pre-1.0, experimental crate. It is a clean-room
> reimplementation of btrfs's read-write tree machinery and may
> have edge cases that testing doesn't cover. Do not use it on
> filesystems you care about without taking a backup first.

This crate sits on top of `btrfs-disk` and provides the write-side
machinery that the parsing crate deliberately doesn't: copy-on-write
of tree blocks, leaf insert/delete/split/balance, a delayed reference
queue, free space tree maintenance, and a `Transaction` type that
groups all of those into an atomic commit. It works on raw image
files or block devices and does not call any ioctls — it only needs
read/write/seek access to the filesystem bytes.

Part of the [btrfsutils](https://github.com/rustutils/btrfsutils) project.

## What's implemented

### Tree-block storage and COW

- **`Filesystem`**: open an image, resolve tree roots, read tree
  blocks via the chunk tree, write blocks back to disk on all
  mirrors with correct CRC32C checksums.
- **`ExtentBuffer`**: in-memory tree block with leaf/node helpers
  for items, key/data offsets, and slot manipulation. Tracks dirty
  state and pending writes.
- **`cow_block`**: copy-on-write a tree block, allocating a new
  bytenr, clearing the WRITTEN/RELOC flags, and queuing add/drop
  delayed refs against the old and new addresses.

### B-tree operations

- **`search_slot`**: search for a key in any tree, with optional
  COW of the path. Supports `ReadOnly`, `Insert(size)`, and
  `Delete` intents; the latter two pre-COW and pre-balance the
  path so the caller can mutate the leaf in place.
- **`next_leaf`**: cursor advance across leaf boundaries.
- **`items::insert_item` / `del_items` / `update_item` / `shrink_item`**:
  in-place leaf payload manipulation that maintains the descending
  data offset invariant.
- **Split / balance**: leaf and node split when an insert won't fit;
  push-left/push-right/merge to keep nodes within fill thresholds
  on delete.
- **`BtrfsPath`**: per-level slot+buffer trace from the root to a
  target leaf, owned and released explicitly so the caller controls
  borrow scope.

### Allocation

- **Block group scan**: load every `BLOCK_GROUP_ITEM` from the
  block group tree (id 11) when present, otherwise from the
  extent tree (id 2).
- **Per-kind bump allocator**: `Transaction::alloc_block(kind)`
  with separate cursors for `BlockGroupKind::Metadata`,
  `BlockGroupKind::System`, and `BlockGroupKind::Data`. Cursors
  are seeded by scanning the extent tree for free gaps inside
  block groups of the requested kind. Pinned blocks (freed
  earlier in the same transaction) are skipped to preserve
  crash consistency.
- **`alloc_tree_block(tree_id, level)`**: routes the chunk tree
  (id 3) to a SYSTEM block group and every other tree to a
  metadata block group, registers a delayed `add_ref`, and for
  SYSTEM allocations updates the superblock's `sys_chunk_array`
  bootstrap snippet so the next mount can resolve the new chunk
  root.

### Delayed reference queue

- **`DelayedRefQueue`**: batches reference count changes keyed by
  `DelayedRefKey::Metadata { bytenr, owner_root, level }` or
  `DelayedRefKey::Data { bytenr, owner_root, owner_ino, owner_offset }`.
  Add/drop pairs to the same key cancel.
- **Metadata flush**: creates `METADATA_ITEM` (skinny) or
  `EXTENT_ITEM` records with `TREE_BLOCK_REF` inline backrefs on
  positive deltas, deletes them on negative deltas, and threads
  byte deltas into per-block-group `bytes_used` accounting.
- **Data flush** (drop side): locates the matching
  `EXTENT_DATA_REF` either inline inside the parent `EXTENT_ITEM`
  or as a standalone item (with hash-collision walk forward),
  decrements both the inline `count` field and the parent
  `EXTENT_ITEM.refs`, and on hitting zero deletes the entire
  `EXTENT_ITEM` and trims overlapping csum tree entries.
- **Csum tree maintenance**: when a data extent is fully freed,
  walks the csum tree once to collect every overlapping
  `EXTENT_CSUM` item, then in a second pass deletes whole-coverage
  items and re-inserts trimmed head/tail fragments under new keys.
- **Data flush** (add side): on a positive data delta, inserts a
  `EXTENT_ITEM` (24-byte header + inline 29-byte `EXTENT_DATA_REF`
  = 53 bytes) and updates the per-block-group `bytes_used`
  delta plus the FST's allocated-range record.

### Data extent write path

- **`Transaction::alloc_data_extent(data, root, ino, file_offset)`**:
  finds free space in a `BlockGroupKind::Data` block group with
  sectorsize alignment, zero-pads `data` to sectorsize, writes it
  via `BlockReader::write_block` (fanning out to every stripe
  copy), queues a `+1` `EXTENT_DATA_REF` delayed ref, and returns
  the allocated logical address.
- **`Transaction::insert_file_extent(tree, ino, offset, payload)`**:
  inserts an `EXTENT_DATA` item at `(ino, EXTENT_DATA, offset)`.
  `payload` is a serialized `FileExtentItem` (use
  `to_bytes_regular` for non-inline extents or `to_bytes_inline`
  for inline-tail extents).
- **`Transaction::insert_csums(logical, on_disk_data)`**: computes
  per-sector standard CRC32C of the on-disk bytes and inserts
  `EXTENT_CSUM` items into the csum tree, splitting payloads that
  would exceed leaf capacity into multiple keyed items. Errors on
  non-CRC32C filesystems.
- **`Transaction::update_inode_nbytes(tree, ino, delta)`**: signed
  in-place patch of the inode's `nbytes` field at the fixed struct
  offset. Preserves all other fields (including `flags`, `rdev`,
  `sequence`) that round-tripping via `InodeItemArgs` would lose.
- **`Transaction::insert_inline_extent(tree, ino, file_offset,
  data)`**: embeds `data` directly in the FS tree leaf as an inline
  `EXTENT_DATA` item. No extent-tree entry, no csum entries.
  `INODE.nbytes` is bumped by the unaligned payload length.
- **`Transaction::write_file_data(tree, ino, file_offset, data,
  nodatasum, compression)`**: high-level helper that picks inline
  vs regular based on `data.len()` and the per-filesystem inline
  threshold (`max_inline_data_size`). For regular extents, splits
  `data` into ≤1 MiB chunks, optionally compresses each chunk
  (zlib, zstd, or LZO; per-chunk fallback to raw when compression
  doesn't shrink), allocates each, inserts the `EXTENT_DATA`
  item, computes csums (unless `nodatasum`), and bumps the
  inode's `nbytes` by the logical sector-aligned size.
- **`try_compress(data, algorithm)`**: free function that returns
  the inline-framed compressed bytes only when they shrink. For
  LZO this produces the inline framing format
  `[4B total_len LE] [4B seg_len LE] [lzo bytes]`; for zlib and
  zstd the raw compressor output is returned.

### Inode and directory entry helpers

- **`InodeArgs`** (`crate::inode`): full-fields counterpart to
  `btrfs_disk::items::InodeItemArgs` carrying every on-disk inode
  field (`flags`, `rdev`, `sequence`, four distinct timestamps).
  `InodeArgs::new(transid, mode)` provides sensible defaults;
  `with_uniform_time(ts)` sets all four stamps for tests.
- **`Transaction::create_inode(tree, ino, args)`**: insert an
  `INODE_ITEM` at `(ino, INODE_ITEM, 0)` from an `InodeArgs`.
- **`Transaction::link_dir_entry(tree, parent, child, name,
  file_type, dir_index, time)`**: insert the three records that
  make a directory entry visible — `INODE_REF`, `DIR_ITEM`,
  `DIR_INDEX` — bump the parent dir's `size` by `2 * name.len()`
  per the directory-isize convention, refresh `transid`, and
  update `ctime`/`mtime`. When `parent` is the canonical
  subvolume root dir (`BTRFS_FIRST_FREE_OBJECTID` = 256), also
  mirror the `size` update into the matching `ROOT_ITEM`'s
  embedded inode so `btrfs check`'s root-tree consistency check
  passes.
- **`Transaction::link_subvol_entry(tree, parent, subvol_id,
  name, dir_index, time)`**: like `link_dir_entry` but for the
  parent's entry pointing at a subvolume root — `DIR_ITEM` +
  `DIR_INDEX` with a `ROOT_ITEM` location, no `INODE_REF`
  (subvol parent linkage uses `ROOT_REF` / `ROOT_BACKREF` in the
  root tree instead). Pair with `insert_root_ref`.
- **`Transaction::set_xattr(tree, ino, name, value)`**: insert an
  `XATTR_ITEM` at `(ino, XATTR_ITEM, name_hash(name))` carrying
  `(name, value)`. Same on-disk format as `DIR_ITEM` but with
  `FT_XATTR` and a non-empty value.
- **`Transaction::set_inode_nlink(tree, ino, nlink)`**: in-place
  patch of the inode's `nlink` field. Used by mkfs's hardlink
  fixup pass.
- **`try_compress_regular(data, algorithm, sectorsize)`**: variant
  for the regular-extent write path. For LZO produces the
  per-sector framing format
  `[4B total_len LE] { [4B seg_len LE] [lzo bytes] [zero pad] }*`
  with sector-boundary padding and an early-exit heuristic that
  abandons after 4 sectors if the framed buffer is already past
  3 sectors. For zlib and zstd delegates to `try_compress`.

### Whole-tree creation and conversion

- **`Transaction::create_empty_tree(tree_id)`**: allocate one
  metadata block, initialise it as an empty level-0 leaf with the
  inherited fsid and chunk_tree_uuid, register the new
  `(tree_id -> bytenr)` mapping, and insert a `ROOT_ITEM` into
  the root tree. Foundation primitive for whole-tree creation
  (FS, csum, FST, BGT, UUID, quota, user subvolumes). Refuses
  bootstrap ids 0..=3.
- **`Transaction::insert_root_ref(parent_root, child_root,
  dirid, sequence, name)`**: insert paired `ROOT_REF` (parent →
  child) and `ROOT_BACKREF` (child → parent) records for a
  subvolume linkage.
- **`Transaction::set_root_readonly(tree_id)`**: OR
  `RootItemFlags::RDONLY` into a `ROOT_ITEM`'s flags. Used by
  `mkfs --rootdir --subvol ro:`.
- **`Transaction::set_default_subvol(subvol_id)`**: upsert the
  `"default"` `DIR_ITEM` under `BTRFS_ROOT_TREE_DIR_OBJECTID`
  (id 6). Used by `mkfs --rootdir --subvol default:`.
- **`Transaction::set_device_total_bytes(devid, new_total)`**:
  patch a device's `DEV_ITEM.total_bytes` in the chunk tree and
  mirror into the superblock's embedded `dev_item`. Used by
  `mkfs --rootdir --shrink` and `rescue fix-device-size`.
- **`convert::convert_to_free_space_tree`** /
  **`convert::convert_to_block_group_tree`**: top-level
  conversions invoked by `btrfs-tune`. Per-step helpers
  `convert::seed_free_space_tree` and
  `convert::create_block_group_tree` are also exposed and used
  by mkfs's `post_bootstrap`.

### Free space tree

- **Incremental update**: `update_free_space_tree` consumes the
  per-block-group range deltas accumulated during the delayed-ref
  flush and applies them to the on-disk FST, deleting and
  re-inserting `FREE_SPACE_EXTENT` items and updating the
  `FREE_SPACE_INFO.extent_count`. Bitmap-layout block groups are
  refused with a clear error.
- **`FREE_SPACE_TREE_VALID` stays set**: the convergence loop in
  `Transaction::commit` runs `flush_delayed_refs →
  update_root_items → snapshot_roots → update_free_space_tree`
  in that order so the FST root change is captured by the next
  pass and the on-disk FST stays consistent with the extent tree.

### Transaction lifecycle

- **`Transaction::start`**: bumps the in-memory generation,
  snapshots the current root pointers, and seeds the metadata
  allocator cursor.
- **`Transaction::commit`**: force-COWs the root tree (so every
  commit advances `header.generation`), runs the convergence loop
  to drain delayed refs, root item updates, and FST updates until
  stable, flushes every dirty block to disk via the chunk tree
  (writing all DUP/RAID1 mirrors), updates superblock fields and
  the rotating backup roots, and writes the superblock to all
  mirrors.
- **`Transaction::abort`**: restores the in-memory root pointer
  snapshot so the next transaction reads consistent on-disk state.

## What's not yet implemented

- **Striped writes larger than `stripe_len` for RAID0 / RAID5 /
  RAID6 data extents**: `BlockReader::write_block` routes
  per-stripe via the chunk cache's `plan_write` and handles every
  RAID profile correctly (parity-aware for RAID5 / RAID6, mirror
  pairs for RAID10), but a single buffer that spans more than
  one row would need per-stripe slicing in the executor. Tree
  blocks (always nodesize ≤ stripe_len) fit in one row; data
  extents larger than `stripe_len` do not. Defer until there's a
  concrete consumer.
- **New SYSTEM chunk allocation**: if no existing SYSTEM block
  group has free space, `ensure_in_sys_chunk_array` cannot carve
  out a new one. Bails cleanly.
- **Bitmap-layout free space tree**: refused with a clear error.
- **`Filesystem::create`**: a from-scratch primitive to replace
  mkfs's hand-built bootstrap. Would let mkfs delete its
  remaining `LeafBuilder` / `TreeBuilder` / `BlockLayout` /
  `build_*` machinery.

## Testing

Unit tests cover the leaf manipulation primitives, balance logic,
delayed ref merging, search/next_leaf, and the chunk array
bootstrap helpers (in `btrfs-disk`).

Integration tests build real filesystem images via `mkfs.btrfs`,
exercise the full COW pipeline through `Transaction::commit`, and
verify the result both by reopening the image with `Filesystem`
and by running `btrfs check --readonly` on it. The fixture image
under `cli/tests/commands/fixture.img.gz` is also used for
read-path and data-ref drop tests.

A regression suite hits historical bugs (cow pinning, sibling
generation, leaf data compaction, cascading splits, ...). A
proptest-based playground tree harness exercises insert/delete
sequences against an in-memory model and verifies that
`Transaction::commit` produces a structurally valid filesystem
on every step.

## License

Licensed under either of [Apache License, Version 2.0](LICENSE-APACHE) or
[MIT license](LICENSE-MIT) at your option.