kache 0.6.0

Zero-copy, content-addressed Rust build cache. No copies, no wasted disk — just hardlinks locally and S3 for sharing.
---
title: Deduplication
description: How content-addressed storage and zero-copy restores eliminate redundant disk usage.
---

# Deduplication

Rust's build system can compile the same crate several times in a workspace: once per unique combination of features, targets, and profiles. Without a cache, each compilation writes a fresh copy of the output files. kache avoids this through two mechanisms: content-addressed storage and zero-copy restores (reflink where the filesystem supports it, hardlink or copy otherwise).

## Content-addressed blobs

Every compiled artifact stored in kache's local store is named after its blake3 hash. If two different crate compilations produce identical output — which happens frequently for crates that appear in multiple feature sets — they share a single blob file. Only one physical copy exists on disk, no matter how many cache entries reference it.

The store directory layout looks like this:

```
~/.cache/kache/
└── store/
    └── blobs/
        ├── ab/
        │   └── abcdef1234...   ← a blob, named by hash
        └── cd/
            └── cdabef5678...
```

The cache directory is `~/.cache/kache` on Linux, `~/Library/Caches/kache` on macOS, and `%LOCALAPPDATA%\kache` on Windows; override it with `KACHE_CACHE_DIR`. Blobs live two levels deep, under `store/blobs/<first-2-hex>/<hash>`; cache-entry directories (`<cache_key>/meta.json`) sit directly under `store/`.
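As a minimal sketch of the layout described above, the function below derives a blob's location from the cache directory and a hex-encoded hash. The function name `blob_path` and its validation rules are illustrative assumptions, not kache's actual code.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: builds `<cache_dir>/store/blobs/<first-2-hex>/<hash>`
/// for a hex-encoded hash. Returns `None` if the hash is shorter than two
/// characters or contains anything other than ASCII hex digits.
fn blob_path(cache_dir: &Path, hex_hash: &str) -> Option<PathBuf> {
    if hex_hash.len() < 2 || !hex_hash.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // The first two hex characters select the shard directory.
    let shard = &hex_hash[..2];
    Some(cache_dir.join("store").join("blobs").join(shard).join(hex_hash))
}
```

Sharding by the first two hex characters keeps any single directory from accumulating every blob in the store.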

The SQLite index (`index.db`, a sibling of `store/`, not inside it) tracks which blobs belong to which cache entries. When an entry is evicted by GC, the blob is removed only if no other entry references it.
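The eviction rule can be illustrated with an in-memory model. This is only a sketch: the real index is a SQLite database, and the `Index` type, its methods, and the key and hash strings below are hypothetical stand-ins.

```rust
use std::collections::{HashMap, HashSet};

/// In-memory stand-in for the index: cache key -> blob hashes it references.
/// (Assumption: the real index stores an equivalent mapping in SQLite.)
#[derive(Default)]
struct Index {
    entries: HashMap<String, HashSet<String>>,
}

impl Index {
    /// Records a cache entry and the blobs it references.
    fn insert(&mut self, key: &str, blobs: &[&str]) {
        let set = blobs.iter().map(|b| b.to_string()).collect();
        self.entries.insert(key.to_string(), set);
    }

    /// Evicts one entry and returns, in sorted order, the blobs that no
    /// remaining entry references and that can therefore be deleted.
    fn evict(&mut self, key: &str) -> Vec<String> {
        let Some(removed) = self.entries.remove(key) else {
            return Vec::new();
        };
        let mut orphaned: Vec<String> = removed
            .into_iter()
            .filter(|blob| !self.entries.values().any(|set| set.contains(blob)))
            .collect();
        orphaned.sort();
        orphaned
    }
}
```

In this model, evicting an entry whose blobs are all shared with other entries frees no disk space; the blobs only go away with their last referencing entry.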

## Zero-copy restores

When kache restores a cache hit, it avoids copying blob files into `target/` wherever it can. On every platform it first tries a **reflink**: a copy-on-write clone that is zero-copy *and* gives the restored file an independent inode. Reflinks work where the filesystem supports them (APFS on macOS, btrfs, XFS with reflink enabled), which covers the common development setups. Mutations to a reflinked file never propagate back to the cache blob.

On filesystems without copy-on-write (e.g. ext4, tmpfs, NTFS), kache falls back per artifact type:

- **Hardlink** for immutable artifacts (`.rlib` / `.rmeta` / `.d`): an additional directory entry pointing to the same inode, so the data exists once on disk. A hardlink that fails — for example across mount points — falls back to a plain copy.
- **Copy** for mutable, OS-loaded outputs (`bin`, `dylib`, `cdylib`, `proc-macro`): the build may modify the file after linking (codesigning, stripping), so kache never hardlinks these. Where reflink is available they are still reflinked (then made executable); only the non-CoW fallback is a byte copy.

The shared-inode property holds only for the hardlink fallback. Reflinks also store the data once, via copy-on-write, but each restored file has its own inode.
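The fallback order can be sketched as a small decision function plus the hardlink-then-copy step. The type and function names are hypothetical, and the reflink call itself is left out because the Rust standard library has no reflink API; only `std::fs::hard_link` and `std::fs::copy` are real calls here.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq)]
enum ArtifactKind {
    Immutable, // .rlib / .rmeta / .d
    Mutable,   // bin, dylib, cdylib, proc-macro
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Strategy {
    Reflink,
    Hardlink,
    Copy,
}

/// Picks the first strategy to try for a restore, following the rules above.
fn plan_restore(kind: ArtifactKind, reflink_supported: bool) -> Strategy {
    match (reflink_supported, kind) {
        (true, _) => Strategy::Reflink,
        (false, ArtifactKind::Immutable) => Strategy::Hardlink,
        (false, ArtifactKind::Mutable) => Strategy::Copy,
    }
}

/// Hardlink fallback for immutable artifacts: try to link, and fall back to
/// a plain copy if linking fails (for example across mount points).
/// Returns the strategy that actually succeeded.
fn hardlink_or_copy(blob: &Path, dest: &Path) -> io::Result<Strategy> {
    match fs::hard_link(blob, dest) {
        Ok(()) => Ok(Strategy::Hardlink),
        Err(_) => fs::copy(blob, dest).map(|_| Strategy::Copy),
    }
}
```

Mutable outputs never reach `hardlink_or_copy` in this sketch, which mirrors the rule that they are never hardlinked.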

## The dedup line in the monitor

The monitor header carries a single dedup line, for example:

```
Dedup: 1.2 GiB saved (38.4%)    Blobs: 1.9 GiB physical    Hardlinks: 240 MiB via 312 hardlinks    Scan: idle
```

The pieces map to two independent measurements plus a scan status:

- **`Dedup: <bytes> saved (<pct>%)` / `Blobs: <bytes> physical`** — content-addressed blob savings: how much is saved by storing each unique output once (logical − physical blob bytes), and the physical size of the blob store. This figure is cross-platform.
- **`Hardlinks: <bytes> via <N> hardlinks`** — the inode-link scan: bytes that the hardlink fallback keeps from being separate copies, summed from blob link counts. When no blob has extra links this reads `none — restores prefer reflink/CoW` — the expected state on copy-on-write filesystems (APFS, btrfs, XFS-with-reflink), where restores are reflinks with independent inodes (so `nlink` stays 1) and the real savings are the cross-platform `Dedup` figure above. This scan is Unix-only; on Windows it always reports zero, because Windows does not expose a portable inode link count.
- **`Scan: <calculating | idle | not scanned>`** — `calculating` means the background scan is running right now, `idle` means it finished and is waiting for the next interval, `not scanned` means it hasn't run yet. A freshly built project may not be reflected until the next scan.

The dedup line is informational — it doesn't affect behavior and doesn't need to reach any particular value to confirm kache is working correctly.
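The two measurements can be approximated as follows. This is a sketch under stated assumptions: the percentage is taken relative to logical bytes, and each link beyond a blob's own directory entry is counted as one avoided copy. Neither formula is confirmed by kache's source, and `BlobStat`, `dedup_saved`, and `hardlink_saved` are hypothetical names.

```rust
/// Size and link count of one blob, as a scan might read them from metadata.
struct BlobStat {
    size: u64,
    nlink: u64,
}

/// Content-addressed savings: logical minus physical blob bytes, plus the
/// saving as a percentage of logical bytes (assumed denominator).
fn dedup_saved(logical: u64, physical: u64) -> (u64, f64) {
    let saved = logical.saturating_sub(physical);
    let pct = if logical == 0 {
        0.0
    } else {
        saved as f64 * 100.0 / logical as f64
    };
    (saved, pct)
}

/// Hardlink savings: every link beyond the blob's own directory entry is
/// treated as a restore that did not need a separate copy (assumption).
/// Returns (bytes saved, number of extra links).
fn hardlink_saved(blobs: &[BlobStat]) -> (u64, u64) {
    blobs
        .iter()
        .filter(|b| b.nlink > 1)
        .fold((0, 0), |(bytes, links), b| {
            let extra = b.nlink - 1;
            (bytes + b.size * extra, links + extra)
        })
}
```

With every blob at `nlink == 1`, as on a copy-on-write filesystem, `hardlink_saved` returns zero for both values, which corresponds to the `none` reading described above.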

## `dup` events are different

The monitor's `dup` outcome is not the same thing as the dedup storage
line above. A `dup` event means a cache key missed, the compiler ran, and the
hashes of the compiled outputs were already present in the store before the
new entry was written. That usually points to equivalent builds under
different keys, caused by over-specific arguments, paths, or environment
inputs. It still counts as compiled work.
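
The distinction between the outcomes can be sketched as a classification function. The names are hypothetical, and the sketch assumes a build counts as `dup` only when every output hash was already stored; the source does not say whether kache requires all outputs or only some.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Outcome {
    Hit,
    Miss,
    Dup,
}

/// Classifies one compilation. `store` holds the blob hashes present before
/// the new outputs are written; `output_hashes` only matters on a key miss.
fn classify(key_hit: bool, output_hashes: &[&str], store: &HashSet<String>) -> Outcome {
    if key_hit {
        return Outcome::Hit;
    }
    // The compiler ran. If every output was already stored, an equivalent
    // build must exist under a different key (assumed "all outputs" rule).
    let all_present = !output_hashes.is_empty()
        && output_hashes.iter().all(|h| store.contains(*h));
    if all_present {
        Outcome::Dup
    } else {
        Outcome::Miss
    }
}
```

Both `Miss` and `Dup` involve a compiler run; `Dup` additionally signals that the cache key was more specific than the build output required.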