tarzan 🌿
Tar Archive with Random-access Zstd And iNdex
tarzan is a command-line tool for creating and extracting .tar.zst archives that
are fully seekable and self-indexed. It divides the archive into independently
compressed chunks — with chunk boundaries and size tunable to balance compression ratio
against random-access granularity — and embeds a table of contents (TOC) directly
inside the compressed stream as a zstd skippable frame. The underlying tar data is
preserved bit-for-bit; the archive can be decompressed by standard zstd tools, though
doing so discards the indexing and seekability that tarzan provides.
# Wrap any existing tar stream — drop-in for gzip or zstd
|
# List contents instantly — no decompression, reads TOC only
# Extract a single file — decompresses only the relevant chunks
The CLI follows tar's flag conventions where they overlap: -f/--file
names the archive, -v is verbose, -C selects a directory. Subcommands
have tar-style short aliases (tarzan t for list). See What we don't
copy from tar for the bits we leave behind.
Why tarzan?
Standard .tar.gz and .tar.zst archives are sequential. To find a file near the
end, you decompress everything before it. For large archives this is slow, wasteful,
and makes random access effectively impossible without external tooling.
tarzan solves this with three ideas:
1. Tunable chunk compression. The archive is divided into independently compressed zstd frames at configurable chunk boundaries. Chunk size is a tunable tradeoff: smaller chunks mean finer-grained random access but a lower compression ratio (less cross-chunk redundancy); larger chunks compress better but require decompressing more data to reach a given file. The default of 4MB is a reasonable starting point; the right value depends on your workload and access patterns, so benchmark with your own archive contents.
2. Embedded TOC. A table of contents — containing filenames, permissions,
ownership, sizes, and per-chunk byte offsets — is stored in a zstd skippable frame
appended to the archive. Any compliant zstd decoder silently ignores skippable frames,
so the archive is fully readable by zstd -d | tar x with no special support.
3. Leading identity frame. The first bytes of every tarzan archive are a small
zstd skippable frame containing the ASCII identifier TRZN followed by a format
version byte. This allows file(1) and other format sniffers to identify tarzan
archives unambiguously, distinct from plain .tar.zst or other zstd-based formats.
Standard zstd tools skip this frame silently.
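Ideas 2 and 3 both rest on the zstd skippable-frame layout: a 4-byte little-endian magic in the range 0x184D2A50 to 0x184D2A5F, a 4-byte little-endian payload length, then the payload. The Rust sketch below builds such a frame and recognizes one at the start of a buffer. The function names are illustrative only and are not part of tarzan's API; the real frame contents are defined by the format specification.

```rust
/// Lowest of the sixteen zstd skippable-frame magic numbers
/// (0x184D2A50 through 0x184D2A5F).
const SKIPPABLE_MAGIC_BASE: u32 = 0x184D_2A50;

/// Build a skippable frame: magic (LE), payload length (LE), payload.
/// `variant` selects the low nibble of the magic number.
fn skippable_frame(variant: u8, payload: &[u8]) -> Vec<u8> {
    assert!(variant < 16, "variant must fit in the low nibble of the magic");
    assert!(payload.len() <= u32::MAX as usize, "payload too large for one frame");
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(SKIPPABLE_MAGIC_BASE | u32::from(variant)).to_le_bytes());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// If `buf` starts with a complete skippable frame, return its payload and
/// the total frame length, which is how far a decoder jumps to skip it.
fn read_skippable(buf: &[u8]) -> Option<(&[u8], usize)> {
    if buf.len() < 8 {
        return None;
    }
    let magic = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    if (magic & 0xFFFF_FFF0) != SKIPPABLE_MAGIC_BASE {
        return None; // a regular zstd frame or something else entirely
    }
    let len = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]) as usize;
    let end = 8usize.checked_add(len)?;
    let payload = buf.get(8..end)?; // None if the frame is truncated
    Some((payload, end))
}
```

A decoder that does not care about the payload only needs the second return value: it advances by that many bytes and continues with the next frame.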
The result is an archive where:
- The original tar data is stored bit-for-bit intact inside the compressed stream
- Standard tools (zstd -d | tar x, tar --zstd -xf) can decompress it fully, but do so as a sequential scan, losing the indexing and random-access benefits
- Tools that understand the tarzan format can list contents without decompression and extract individual files by seeking directly to their chunks
Installation
tarzan is a single crate that provides both the tarzan command-line binary
and the embeddable library (see Library usage).
From crates.io
From source
# binary at ./target/release/tarzan
Pre-built binaries
Pre-built binaries for Linux (x86_64, aarch64), macOS (x86_64, Apple Silicon), and Windows (x86_64) are available on the releases page.
Windows builds are provided but untested, and have two known limitations:
extracting an archive that contains symlink members fails on those entries, and
Unix permission bits are not restored. (list -v also shows timestamps in UTC
rather than local time on Windows.) Linux and macOS are the tested platforms.
Usage
tarzan wrap — compress an existing tar stream
The primary entry point for pipeline use. Reads a raw tar stream from stdin (or a
file) and writes a tarzan-formatted .tar.zst to stdout (or -f).
The input tar is a positional argument; the output archive is -f/--file,
mirroring tar -cf out.tar. Use - (or omit) for stdin/stdout.
# From stdin to stdout
|
# From a file to a file
# With explicit output path
|
# Control chunk size (default: 4MB)
|
# Set zstd compression level (default: 3)
|
# git archive integration
|
# Remote backup
|
# Verbose: list each member to stderr as it is wrapped
|
For safety, wrap refuses to write the binary archive directly to a
terminal: if -f is omitted and stdout is a TTY, it errors out. Pipe
the output, redirect to a file, or pass -f.
Creating archives from files
tarzan does not implement its own filesystem walker. Use the system
tar to produce the tar stream, and pipe it into tarzan wrap:
# A whole directory
|
# Multiple paths
|
# Change source directory, like `tar -C`
|
# Exclude patterns (tar's own --exclude)
|
# git archive integration
|
# Remote backup
|
This composition is deliberate: real tar handles hard links, sparse
files, xattrs, ACLs, long path/link names (PAX/GNU extensions), and
device files correctly. Re-implementing that surface inside tarzan would
either replicate tar poorly or shell out to it anyway, so we lean on
the canonical tar | tarzan wrap pipeline instead.
tarzan list — list contents
Reads only the TOC skippable frame. Fast regardless of archive size.
Aliased as tarzan t (tar style) and tarzan ls.
# Paths only, one per line
# tar-style short alias
# Long format: mode, owner/group, size, mtime, path — like `tar -tvf`.
# Symlink and hard-link entries show their target as `path -> target`.
# Show -v timestamps in UTC instead of local time, like `tar --utc -tvf`
# Filter by directory prefix, exact path, or shell glob (positional args)
# Machine-readable JSON (respects positional filters)
Long-format output:
drwxr-xr-x 1000/1000 0 B 2024-11-03 14:20 ./
-rw-r--r-- 1000/1000 4.2 KB 2024-11-03 14:22 src/main.rs
-rw-r--r-- 1000/1000 12.1 KB 2024-11-03 14:22 src/lib.rs
lrwxrwxrwx 1000/1000 0 B 2024-11-03 14:22 src/current -> main.rs
-rw-r--r-- 1000/1000 1.1 KB 2024-11-03 14:20 Cargo.toml
Owner is shown numerically (uid/gid) rather than as resolved names —
the TOC stores numbers, and resolving them against the reader's
/etc/passwd would be misleading.
Timestamps are shown in local time, like tar -tvf; pass --utc for
UTC. The stored mtime is a timezone-independent Unix timestamp, so only
the display differs.
--json emits the TOC as a pretty-printed JSON array. Each entry
carries path, type, size, mode, uid, gid, mtime, optional link target,
and chunk offsets:
Each entry in chunks locates one member's bytes inside a compressed
frame. A member larger than the chunk size spans several chunks; small
members are packed together to share a frame, and frame_offset
(omitted when zero) then gives the member's offset within that frame's
decompressed data.
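As a rough illustration of how a reader might use these entries, the Rust sketch below reassembles one member from frames that have already been decompressed. Only frame_offset is a name taken from the description above; ChunkRef, frame, len, and member_bytes are hypothetical stand-ins, and the actual TOC schema is the one documented on docs.rs.

```rust
/// One entry of a member's `chunks` list (simplified, hypothetical layout).
struct ChunkRef {
    /// Index of the compressed frame holding this piece (hypothetical field).
    frame: usize,
    /// Offset within that frame's decompressed data; 0 when omitted.
    frame_offset: usize,
    /// Number of member bytes stored in this frame (hypothetical field).
    len: usize,
}

/// Concatenate a member's pieces in order. `frames[i]` is the decompressed
/// content of frame `i`. Returns None if any entry points outside its frame.
fn member_bytes(chunks: &[ChunkRef], frames: &[Vec<u8>]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for c in chunks {
        let data = frames.get(c.frame)?;
        let end = c.frame_offset.checked_add(c.len)?;
        out.extend_from_slice(data.get(c.frame_offset..end)?);
    }
    Some(out)
}
```

A small member packed into a shared frame has a single entry with a non-zero frame_offset; a large member has several entries, typically each starting at offset zero of its own frame.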
Pipe through jq to slice out fields you don't want (for example
jq 'map(del(.chunks))').
tarzan extract — extract files
Aliased as tarzan x (tar style). Refuses to write members whose path is
absolute or contains .., so extraction always stays inside the
destination directory.
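The path rule can be expressed with the standard library's path components. This is a minimal sketch of such a check, not tarzan's actual implementation; is_safe_member_path is a hypothetical name.

```rust
use std::path::{Component, Path};

/// True when `member` is a non-empty relative path with no `..` component.
/// Absolute paths (and Windows drive prefixes) contain a root or prefix
/// component, so they are rejected as well.
fn is_safe_member_path(member: &str) -> bool {
    !member.is_empty()
        && Path::new(member)
            .components()
            .all(|c| matches!(c, Component::Normal(_) | Component::CurDir))
}
```

Checking components rather than searching the string for ".." avoids rejecting legitimate names such as "notes..txt".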
# Extract everything to the current directory
# Extract to a specific directory
# Extract specific files (decompresses only relevant chunks)
# Extract a directory subtree
# Drop leading path components, like `tar --strip-components`
# Skip members by shell-glob pattern (repeatable)
# Print each member as it is extracted
# Do not restore recorded mtimes (extracted files get the current time)
Restored on extract: file contents, directory hierarchy, Unix permission
bits, symlinks (Unix only), hard links, and mtime on files, symlinks, and
directories. Directory mtimes are applied in a deferred pass after all
children are written, so creating a child doesn't bump the parent's
timestamp back; hard links are likewise reconstructed in a second pass
once their target file is on disk. If a hard link's target member is not
part of the extraction — for example a path filter selects the link but
not its target — the link is skipped with a warning. --no-mtime skips
timestamp restoration entirely. Character/block devices and FIFOs are
still skipped with a warning.
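The ordering described above can be pictured as a three-pass plan. The Rust sketch below only computes the order of operations and touches no files; Kind, Step, and plan are hypothetical names chosen for illustration, and it assumes for simplicity that a hard link's target must be a non-link member in the selected set.

```rust
use std::collections::HashSet;

/// Member types relevant to ordering (simplified).
enum Kind<'a> {
    File,
    Dir,
    HardLink(&'a str), // path of the member this link points at
}

#[derive(Debug, PartialEq)]
enum Step<'a> {
    Write(&'a str),          // pass 1: create the file or directory
    Link(&'a str, &'a str),  // pass 2: create hard link (path, target)
    SkipLink(&'a str),       // pass 2: target not selected, warn and skip
    SetDirMtime(&'a str),    // pass 3: restore the directory timestamp
}

fn plan<'a>(selected: &[(&'a str, Kind<'a>)]) -> Vec<Step<'a>> {
    // Paths that will exist on disk after pass 1.
    let present: HashSet<&str> = selected
        .iter()
        .filter(|(_, k)| !matches!(k, Kind::HardLink(_)))
        .map(|(p, _)| *p)
        .collect();
    let mut steps = Vec::new();
    // Pass 1: everything except hard links.
    for (path, kind) in selected {
        if !matches!(kind, Kind::HardLink(_)) {
            steps.push(Step::Write(*path));
        }
    }
    // Pass 2: hard links, now that their targets exist (or are known missing).
    for (path, kind) in selected {
        if let Kind::HardLink(target) = kind {
            if present.contains(*target) {
                steps.push(Step::Link(*path, *target));
            } else {
                steps.push(Step::SkipLink(*path));
            }
        }
    }
    // Pass 3: directory mtimes last, so child writes cannot bump them again.
    for (path, kind) in selected {
        if matches!(kind, Kind::Dir) {
            steps.push(Step::SetDirMtime(*path));
        }
    }
    steps
}
```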
For workflows that need full fidelity — device files, FIFOs, xattrs/ACLs, sparse files — fall back to standard tooling. Every tarzan archive is a valid zstd stream:
|
# or
You give up tarzan's random-access seeking but get real tar's full
coverage of the long tail. The trade is: tarzan extract is the fast
path for the common case; tar --zstd -xf is the complete path.
tarzan cat — stream a single file to stdout
Seeks directly to the file using the TOC; decompresses only its chunks.
# Pipe into another tool
|
Only regular-file entries work — hard-link entries reference another member rather than holding their own bytes, and will error. For full-fidelity single-file extraction via standard tools:
That path scans sequentially rather than seeking, but resolves hard links the way real tar does.
tarzan info — show archive metadata
Reads only the TOC frame, so its running time depends on the size of the TOC rather than on the size of the archive.
# Machine-readable JSON object
Format: tarzan v1
File: archive.tar.zst
Size: 487.2 MB
Uncompressed: 2.3 GB
Ratio: 21.1% (archive / uncompressed)
Data frames: 486.4 MB (sum of compressed frames)
Members: 1847
Chunks: 4203
Avg chunk size: 574.5 KB (uncompressed)
Identity frame: TRZN v1
TOC frame: 312.0 KB at offset 487204816
With --json, the same data is emitted as an object (ratio and
avg_chunk_size_bytes are null for an empty archive):
Some fields the legacy README example referenced are intentionally
omitted: the archive does not record a creation timestamp, and the
chunk-size argument is a wrap-time tunable rather than archive metadata
(use Avg chunk size as an observed proxy).
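The two derived fields are simple quotients, which is why they have no value for an empty archive. The Rust sketch below shows one way to compute them, with None standing in for JSON null. The function names and the choice of f64 are assumptions made for illustration; only the field names ratio and avg_chunk_size_bytes and their definitions come from the output shown above.

```rust
/// Archive size divided by uncompressed size; None for an empty archive.
fn ratio(archive_bytes: u64, uncompressed_bytes: u64) -> Option<f64> {
    if uncompressed_bytes == 0 {
        None
    } else {
        Some(archive_bytes as f64 / uncompressed_bytes as f64)
    }
}

/// Uncompressed size divided by chunk count; None when there are no chunks.
fn avg_chunk_size_bytes(uncompressed_bytes: u64, chunks: u64) -> Option<f64> {
    if chunks == 0 {
        None
    } else {
        Some(uncompressed_bytes as f64 / chunks as f64)
    }
}
```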
tarzan verify — verify chunk checksums
Silent on success by default; exits non-zero on mismatch. Pass -v
to also print an OK line per verified member.
# Verify all chunk SHA-256s
# Verify a specific file
# Show per-member OK lines
File format and Rust API
The file format specification (frame layout, magic numbers, TOC schema, zstd compatibility) and the Rust library API are documented in the crate module documentation on docs.rs.
Identifying tarzan archives
The identity frame occupies the first 14 bytes of every tarzan archive.
xxd -l 14 reveals it without any special tooling:
# 00000000: 542a 4d18 0600 0000 5452 5a4e 0101 T*M.....TRZN..
# └── 0x184D2A54 ──┘ └TRZN┘
# zstd skippable magic tarzan identifier at offset 8
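The same check is easy to do programmatically. This Rust sketch inspects the first 14 bytes and returns the byte that follows TRZN, which the format description calls the version byte. It assumes the exact magic (0x184D2A54) and frame size (6) seen in the dump above, and it leaves the final byte unchecked because its meaning is not described here; tarzan_version is a hypothetical helper, not a function from the crate.

```rust
/// Return the format version if `head` starts with a tarzan identity frame.
/// `head` should hold at least the first 14 bytes of the file.
fn tarzan_version(head: &[u8]) -> Option<u8> {
    if head.len() < 14 {
        return None;
    }
    let magic = u32::from_le_bytes([head[0], head[1], head[2], head[3]]);
    let size = u32::from_le_bytes([head[4], head[5], head[6], head[7]]);
    // Assumed from the hex dump: skippable magic 0x184D2A54, 6-byte payload.
    if magic != 0x184D_2A54 || size != 6 || &head[8..12] != b"TRZN" {
        return None;
    }
    Some(head[12])
}
```

A plain .tar.zst starts with the regular zstd frame magic instead, so the function returns None for it.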
A file(1) magic pattern is also distributed at
contrib/tarzan.magic. Use the MAGIC= environment
variable rather than -m — on macOS, -m augments the compiled system magic
database, which then wins on strength over the tarzan pattern:
MAGIC=contrib/tarzan.magic
# archive.tar.zst: tarzan archive v1
What we don't copy from tar
tarzan borrows tar's flag conventions where they overlap, but deliberately skips a few of its older ergonomics:
- Bundled short flags (-xvf). tar lets you mash mode and option letters together as a single argument; modern argument parsers don't, and the form is widely considered tar's most arcane bit. tarzan accepts -x -v -f style spacing only.
- Mode-flag entry point (tar -cf). tar selects its operation with a flag letter on the root command. tarzan uses subcommands (tarzan wrap, tarzan list, ...) for better discoverability and shell tab-completion; tar-style short aliases (tarzan t) cover the muscle-memory case.
- A separate create verb / filesystem walker. wrap reads an existing tar stream and adds the tarzan envelope; the canonical archive-creation workflow is tar -cf - ... | tarzan wrap -f out.tar.zst. We do not re-implement tar -c ourselves: real tar already handles hard links, sparse files, xattrs, long path names, and device files correctly, and a partial in-tree walker would silently mishandle those long-tail cases. See Creating archives from files.
- Compression-format flags (-z, -j, -J, --zstd). A tarzan archive is always zstd, so a compression selector would only ever take one value.
- Mandatory archive flag with no positional fallback. GNU tar accepts tar tf archive.tar only because of bundling; without bundling, an archive always needs -f. tarzan uses -f/--file uniformly, but with subcommands the form stays consistent rather than depending on whether you remembered to merge letters.
Comparison
| | tar.gz | tar.zst | tarzan | zip |
|---|---|---|---|---|
| List without full decompress | ✗ | ✗ | ✓ | ✓ |
| Extract one file efficiently | ✗ | ✗ | ✓ | ✓ |
| Streamable creation | ✓ | ✓ | ✓ | ✗ |
| Standard tool compatible | ✓ | ✓ | ✓ | ✓ |
| Compression ratio | good | better | good† | ok |
| Decompression speed | slow | fast | fast | ok |
| Self-describing format | ✗ | ✗ | ✓ | ✓ |
| Per-file integrity checksums | ✗ | ✗ | ✓ | optional |
†Slightly lower than monolithic .tar.zst due to per-frame independent compression,
which loses redundancy across frame boundaries. Small members are packed together so
redundancy is still captured within a frame; for most archives the difference is under 5%.
Library usage
The tarzan crate exposes a library API for embedding tarzan support in other
tools. Add it to your Cargo.toml:
[dependencies]
tarzan = "0.1"
Full API documentation — including format details and usage examples — is on docs.rs/tarzan.
Relationship to zstd:chunked
tarzan is inspired by the zstd:chunked format used by the container ecosystem
(Podman, CRI-O, Fedora container images). That format solves the same core problem —
seekable, indexed, compressed tar archives — but is designed around OCI container image
layers and is not officially documented outside its reference implementation in
containers/storage.
tarzan takes the same architectural approach — independent chunk compression, JSON TOC in a skippable frame, full backward compatibility — and applies it to general-purpose archiving with a clean, documented, versioned format specification.
tarzan archives are not wire-compatible with zstd:chunked, but the ideas are directly borrowed from that project. Credit to Giuseppe Scrivano and the containers/storage contributors.
Design decisions
TOC sidecar mode (considered, deferred)
A natural extension of the embedded TOC is to also serialize it as a standalone file
(e.g. archive.tar.toc) that accompanies a plain .tar — enabling random access
without the zstd wrapper, including for tape workflows. This is intentionally
deferred from v1.
Why deferred:
- Drift. Sidecar files get separated from their data through copy, move, or transfer. A stale sidecar fails silently unless every read verifies a whole-tar hash, which is an O(n) scan that partly defeats the point of having an index.
- Schema bifurcation. Per-member offsets mean different things in embedded mode (compressed chunk offsets) vs. sidecar mode (uncompressed tar byte offsets). The format would have to express "this field is valid only in mode X" rules and ship two parsing paths.
- Crowded prior art. ratarmount already ships a SQLite-based tar index. Users who want random access to plain tar have a deployed solution; introducing a competing format needs a stronger motivation than "we could."
- Pitch dilution. tarzan's value proposition is "drop-in seekable .tar.zst, standard tools still work." A sidecar mode reframes tarzan as a generic tar index format and pulls it into a different and more crowded design space.
- Tape is not really solved by a TOC file alone. Useful tape random access needs blocking-factor and (for multi-volume) volume-boundary metadata, not just member offsets. Claiming tape support without that would be misleading.
Forward-compatibility reservations. The v1 TOC schema is nevertheless designed so a sidecar variant remains feasible later without breaking v1 readers:
- Every member entry carries tar_offset (uncompressed byte offset of the member header in the tar stream). This is independently useful for verification and is the field any future sidecar would need.
- A top-level target field (default "embedded") is reserved. Readers must reject unknown values, so adding "sidecar" later is not a breaking change.
- Top-level tar_sha256 and tar_size are reserved as optional fields, to be populated by future sidecars so readers can detect drift loudly rather than silently using stale offsets.
No file extension or on-disk sidecar layout is specified at this time — once documented, it has to be supported.
Why not GNU tar's --index-file
tar --index-file=FILE is sometimes proposed as the natural sidecar format, but it
is the wrong reference point. It redirects the -v listing to a file — bare paths
at -v, ls -l-style lines at -vv:
drwxr-xr-x andrew/wheel 0 2026-05-18 16:29 ./
-rw-r--r-- andrew/wheel 10 2026-05-18 16:29 ./b.txt
-rw-r--r-- andrew/wheel 6 2026-05-18 16:29 ./sub/c.txt
There are no byte offsets, no checksums, no schema, no versioning, and no extension hook. The file tells you what is in the archive, not where, so it cannot serve as a seek index. Reusing the format would either ship a sidecar that does not actually enable seeking, or extend it past the point of any compatibility with GNU tar. ratarmount's SQLite index is the closest existing format that actually solves the random-access problem and is the better reference if a sidecar mode is ever revisited.
Releasing
Releases are managed by release-plz and cargo-dist.
How it fits together
- release-plz opens a "Release PR" on every push to main, bumps Cargo.toml, regenerates CHANGELOG.md, publishes to crates.io, and pushes a semver git tag.
- cargo-dist watches for semver tag pushes and builds the platform binaries, then creates the GitHub Release with them attached.
The critical detail: GitHub Actions will not trigger a workflow run from
events (including tag pushes) that are caused by the built-in GITHUB_TOKEN.
release-plz must therefore use a Personal Access Token (PAT) to push the tag so
that GitHub treats it as a real user event and wakes up cargo-dist.
Required secrets
| Secret | Purpose |
|---|---|
| RELEASE_PLZ_TOKEN | PAT with contents: write and pull-requests: write, used by release-plz so its tag push triggers cargo-dist |
| CARGO_REGISTRY_TOKEN | crates.io API token for publishing |
Normal release flow
Step 1 — merge conventional commits to main.
Every push to main triggers the release-plz workflow, which opens (or
updates) a Release PR.
Step 2 — merge the Release PR.
release-plz publishes to crates.io and
pushes a semver git tag (e.g. v0.2.0) authenticated with RELEASE_PLZ_TOKEN.
Step 3 — binaries build automatically. The tag push triggers the cargo-dist Release workflow, which cross-compiles and uploads pre-built archives for:
| Target | Archive |
|---|---|
| Linux x86_64 | tarzan-x86_64-unknown-linux-gnu.tar.gz |
| Linux aarch64 | tarzan-aarch64-unknown-linux-gnu.tar.gz |
| macOS x86_64 | tarzan-x86_64-apple-darwin.tar.gz |
| macOS Apple Silicon | tarzan-aarch64-apple-darwin.tar.gz |
| Windows x86_64 | tarzan-x86_64-pc-windows-msvc.zip |
All archives include the binary, README.md, LICENSE-MIT, LICENSE-APACHE,
and THIRD-PARTY-LICENSES. The completed release appears on the
releases page.
Recovering a release that reached crates.io but has no GitHub Release
This happens when release-plz pushed the tag using GITHUB_TOKEN (before the
PAT was configured) — cargo-dist never saw the event. The tag already exists on
the remote, so a plain push is rejected. Delete and re-push it to re-trigger:
Replace v0.1.1 with the actual tag name (git ls-remote --tags origin lists
what is there).
Contributing
Contributions are welcome. Please read CONTRIBUTING.md before opening a pull request.
Areas of particular interest:
- Windows support (currently untested)
- Ratarmount backend using the embedded TOC
- Benchmarks against pixz, zip, and plain tar.zst on realistic workloads
- Submission of the magic pattern to the upstream file database
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
tarzan binaries statically include the zstd C library. The zstd C library is under a dual BSD/GPLv2 license. Full license texts for zstd and every other dependency compiled into tarzan are in THIRD-PARTY-LICENSES, which is bundled in every release archive.