canon-archive 0.2.2

A CLI tool for organizing large media libraries into a canonical archive

# Source vs. Object

The relationship between sources and objects underpins how Canon handles deduplication and archive tracking.

## Sources Are Locations

When a [root](roots.md) is scanned, Canon indexes every file it finds as a **source**. Each source represents a specific file at a specific path.

## Objects Are Content

When a source is hashed, Canon links it to an **object**, creating that object if it does not exist yet. An object represents the underlying content, independent of where it was found.

```
Source A: /backup1/photos/IMG_001.jpg  ─┐
Source B: /backup2/old/IMG_001.jpg     ─┼─► Object (hash: abc123...)
Source C: /downloads/photo.jpg         ─┘
```

All three sources above have identical content, so they reference the same object.
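
The grouping in the diagram can be sketched in a few lines of Python. This is only an illustration, not Canon's implementation: the use of SHA-256, the function names, the in-memory byte strings, and the extra `/downloads/other.jpg` entry are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_hash(data: bytes) -> str:
    """Hex digest of the content. SHA-256 is an assumption, not necessarily Canon's hash."""
    return hashlib.sha256(data).hexdigest()


def group_sources_by_object(sources: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Map each content hash (one object) to the sorted source paths that carry it."""
    objects = defaultdict(list)
    for path, data in sources.items():
        objects[content_hash(data)].append(path)
    return {digest: sorted(paths) for digest, paths in objects.items()}


# Hypothetical sources: three copies of the same bytes plus one different file.
sources = {
    "/backup1/photos/IMG_001.jpg": b"same image bytes",
    "/backup2/old/IMG_001.jpg": b"same image bytes",
    "/downloads/photo.jpg": b"same image bytes",
    "/downloads/other.jpg": b"different bytes",
}

objects = group_sources_by_object(sources)
for digest, paths in objects.items():
    print(digest[:12], paths)
```

Four sources collapse into two objects here, because the path plays no part in the hash.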

## Fact Sharing

When a source is linked to an object:

- **Content facts** (like EXIF metadata) can be stored on the object and become available to all sources with that hash
- **Source facts** (like file path) remain specific to each source

This allows metadata to flow between different copies of the same content. Import a fact once, and it's available everywhere that content exists.
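
As a rough mental model, fact sharing can be pictured as each source merging its own facts with those of the object it points to. The class names, the `exif.camera` fact key, and the rule that a source fact wins on a key collision are hypothetical choices for this sketch, not Canon's data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ContentObject:
    """One unit of content, identified by its hash (hypothetical model)."""
    digest: str
    facts: Dict[str, str] = field(default_factory=dict)


@dataclass
class Source:
    """One file at one path, optionally linked to an object (hypothetical model)."""
    path: str
    obj: Optional[ContentObject] = None
    facts: Dict[str, str] = field(default_factory=dict)

    def visible_facts(self) -> Dict[str, str]:
        # Start from the shared content facts, then overlay source-specific facts.
        # Letting source facts win on a key collision is an assumption of this sketch.
        merged = dict(self.obj.facts) if self.obj else {}
        merged.update(self.facts)
        return merged


shared = ContentObject(digest="abc123")
a = Source("/backup1/photos/IMG_001.jpg", shared, {"path": "/backup1/photos/IMG_001.jpg"})
c = Source("/downloads/photo.jpg", shared, {"path": "/downloads/photo.jpg"})

# Import a content fact once, on the object...
shared.facts["exif.camera"] = "ExampleCam"

# ...and it is visible through every source linked to that object.
print(a.visible_facts())
print(c.visible_facts())
```

Both sources report the same `exif.camera` value while keeping their own `path`.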

## Archive Tracking

Canon uses the source-object relationship to track archiving progress:

- When you archive a file, Canon copies it to an archive root and records the object's hash
- Any source with that same hash is now considered "archived"
- The `coverage` command shows how many of your sources exist in an archive
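
The coverage idea reduces to a set-membership check on hashes. The sketch below assumes every source has already been hashed; the function name and the sample data are invented for illustration and do not reflect the actual output of the `coverage` command.

```python
from typing import Dict, Set, Tuple


def archive_coverage(source_hashes: Dict[str, str], archived_hashes: Set[str]) -> Tuple[int, int]:
    """Return (archived_count, total_count) for a path-to-hash mapping of sources."""
    archived = sum(1 for digest in source_hashes.values() if digest in archived_hashes)
    return archived, len(source_hashes)


# Hypothetical index: three copies of one object and one unrelated file.
source_hashes = {
    "/backup1/photos/IMG_001.jpg": "abc123",
    "/backup2/old/IMG_001.jpg": "abc123",
    "/downloads/photo.jpg": "abc123",
    "/downloads/other.jpg": "def456",
}

# Archiving any one copy records the object's hash once.
archived_hashes = {"abc123"}

archived, total = archive_coverage(source_hashes, archived_hashes)
print(f"{archived}/{total} sources archived")  # prints "3/4 sources archived"
```

Archiving a single copy marks all three sources with that hash as archived, which is why deduplicated content only needs to be copied once.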

## Hashing

By default, Canon hashes all files during scanning. Since hashing can be time-consuming for large collections, you can:

- Use `--no-hash` during scan to skip hashing initially
- Hash selectively via the [enrichment pipeline](../commands/enrich/index.md), targeting specific file types

Unhashed sources cannot be linked to objects, so they cannot be deduplicated or tracked for archive coverage.
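
The effect of selective hashing can be illustrated with a sketch in which only chosen file types receive a hash and everything else stays unlinked. Filtering by file suffix, the SHA-256 digest, and all names below are assumptions for the example; they do not describe how the enrichment pipeline selects files.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Dict, Optional, Set


def hash_selected(files: Dict[str, bytes], suffixes: Set[str]) -> Dict[str, Optional[str]]:
    """Hash only files whose suffix is in `suffixes`; all others map to None (unhashed)."""
    result: Dict[str, Optional[str]] = {}
    for path, data in files.items():
        if PurePosixPath(path).suffix.lower() in suffixes:
            result[path] = hashlib.sha256(data).hexdigest()
        else:
            result[path] = None
    return result


# Hypothetical scan results held in memory instead of on disk.
files = {
    "/photos/a.jpg": b"image bytes",
    "/photos/copy_of_a.JPG": b"image bytes",
    "/docs/notes.txt": b"some text",
}

hashes = hash_selected(files, {".jpg"})

# Only hashed sources can be linked to an object and counted for coverage.
linkable = sorted(path for path, digest in hashes.items() if digest is not None)
print(linkable)
```

Here `/docs/notes.txt` stays unhashed, so it cannot be matched against other copies or counted as archived until it is hashed.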