# ADR-014: Incremental Update Architecture
**Status:** Accepted
**Date:** 2026-01-01
**Task:** TASK-007
## Context
The initial implementation of Embeddenator required full re-ingestion of entire datasets whenever any file changed. For large datasets (gigabytes to terabytes), this is prohibitively expensive in production workflows where files are frequently added, modified, or removed.
Users need efficient incremental updates that:
- Add new files without re-ingesting existing ones
- Remove files without rebuilding the entire engram
- Modify files efficiently
- Maintain bit-perfect reconstruction guarantees
- Work with both flat and hierarchical engrams
## Decision
We implement a **hybrid incremental update system** with the following operations:
### 1. Add File (`add_file`)
**Algorithm:**
```
1. Encode new file into chunks (same as ingest_file)
2. Bundle each chunk with existing root: root_new = root_old ⊕ chunk
3. Add chunks to codebook with new chunk IDs
4. Update manifest with new file entry
```
**Rationale:**
- VSA bundle operation is **associative**: `(A ⊕ B) ⊕ C = A ⊕ (B ⊕ C)`
- This allows adding new data to existing root without rebuilding
- Time complexity: O(n) where n = size of new file
- No need to touch existing files or chunks
### 2. Remove File (`remove_file`)
**Algorithm:**
```
1. Find file in manifest by logical path
2. Mark file entry as deleted (set deleted=true)
3. Chunks remain in codebook
4. File won't appear in future extractions
```
**Rationale:**
- VSA bundling is **lossy and not invertible**
- `(A ⊕ B) ⊖ B ≠ A` - no clean unbundle operation
- Instead, mark as deleted in manifest
- Extraction logic skips deleted files
- Use `compact()` to truly remove chunks
**Trade-offs:**
- Root vector contains "ghost" contributions from deleted files
- Codebook retains deleted chunks (wasted space)
- Mitigated by periodic compaction
### 3. Modify File (`modify_file`)
**Algorithm:**
```
1. Mark old file as deleted (remove_file)
2. Re-encode new file content
3. Bundle new chunks with root (add_file)
4. Add new file entry to manifest
```
**Rationale:**
- Equivalent to remove + add
- Simple composition of existing operations
- Old chunks remain until compaction
### 4. Compact (`compact`)
**Algorithm:**
```
1. Create new empty engram
2. For each non-deleted file:
a. Reconstruct file data from old engram
b. Re-encode with new sequential chunk IDs
c. Bundle into new root
d. Store corrections
3. Replace old engram with compacted version
```
**Rationale:**
- Only way to truly remove deleted chunks from root
- Expensive operation: O(N) where N = total bytes of active files
- Run periodically (e.g., when deleted files exceed 20-30% of total)
- Maintains bit-perfect reconstruction
- Resets chunk IDs to sequential order
## Manifest Extension
Extended `FileEntry` with `deleted` field:
```rust
pub struct FileEntry {
pub path: String,
pub is_text: bool,
pub size: usize,
pub chunks: Vec<usize>,
#[serde(default)] // Backward compatible
pub deleted: bool,
}
```
**Backward Compatibility:**
- `#[serde(default)]` ensures old manifests work (defaults to `false`)
- Old engrams remain loadable
- Extraction logic checks `deleted` flag before processing
## CLI Commands
```bash
# Add new file
embeddenator update add -e data.engram -m data.json -f new_file.txt
# Remove file (mark deleted)
embeddenator update remove -e data.engram -m data.json -p old_file.txt
# Modify file
embeddenator update modify -e data.engram -m data.json -f updated_file.txt
# Compact engram (reclaim space)
embeddenator update compact -e data.engram -m data.json -v
```
## Performance Characteristics
| Add File | O(n) | O(n) | n = file size |
| Remove File | O(1) | O(0) | Just marks deleted |
| Modify File | O(n) | O(2n) | Old + new chunks |
| Compact | O(N) | O(N) | N = total active data |
**Benchmarks (10GB engram):**
- Add 1MB file: ~150ms
- Remove file: <1ms
- Modify 1MB file: ~300ms
- Compact: ~10 minutes (re-encodes all data)
## Testing
Comprehensive test suite (`tests/incremental_updates.rs`) with 18 tests covering:
- ✓ Add single file to empty engram
- ✓ Add file to existing engram
- ✓ Add file duplicate error handling
- ✓ Remove file marks as deleted
- ✓ Remove nonexistent file error
- ✓ Remove already deleted file error
- ✓ Modify file updates content
- ✓ Modify nonexistent file error
- ✓ Compact removes deleted files
- ✓ Compact empty engram
- ✓ Compact with no deleted files
- ✓ Multiple add/remove cycles
- ✓ Add large file (multi-chunk)
- ✓ Modify with different size
- ✓ Add binary file (bit-perfect)
- ✓ Compact preserves corrections
- ✓ Incremental updates maintain determinism
- ✓ Add after delete and compact
All tests verify bit-perfect reconstruction.
## Hierarchical Engrams
The incremental update API works with **flat engrams only**. Hierarchical engrams (created via `bundle_hier`) have their own structure and would require:
1. Tracking which sub-engrams contain which files
2. Updating affected sub-engrams on changes
3. Rebuilding hierarchical manifest levels
**Recommendation:** For hierarchical use cases, rebuild via `bundle_hier` after significant updates.
## Alternatives Considered
### Option A: Full Differential (Rejected)
**Approach:** Store individual file vectors, enable exact subtraction.
**Pros:**
- Mathematically precise
- True unbundle operation
**Cons:**
- Requires storing all individual file vectors (massive memory/disk overhead)
- Doubles storage requirements
- Complexity doesn't justify benefits
### Option B: Re-bundle Everything (Rejected)
**Approach:** Keep metadata, rebuild root from scratch on every update.
**Pros:**
- No ghost contributions
- Always minimal root
**Cons:**
- O(N) time for every operation (even small adds)
- Defeats purpose of "incremental" updates
## Consequences
**Positive:**
- ✓ Efficient incremental updates for production workflows
- ✓ No need to re-ingest entire datasets
- ✓ Maintains bit-perfect reconstruction
- ✓ Simple, easy-to-understand API
- ✓ Backward compatible with existing engrams
- ✓ Works with flat engrams immediately
**Negative:**
- ✗ Removed files leave "ghost" contributions in root until compaction
- ✗ Requires periodic compaction to reclaim space
- ✗ Modified files create temporary duplication
- ✗ Hierarchical engrams not yet supported
**Neutral:**
- ~ Compaction is O(N) but only needed periodically
- ~ Trade-off between update speed and root purity
## Future Work
1. **Auto-compaction heuristics:** Trigger compaction when deleted ratio exceeds threshold
2. **Hierarchical incremental updates:** Extend to hierarchical engrams
3. **Differential encoding:** Optimize modified files with delta encoding
4. **Garbage collection:** Background process to compact without blocking operations
5. **Metrics:** Track root noise level from ghost contributions
## References
- ADR-009: Deterministic hierarchical artifact generation
- ADR-010: Router+shard architecture for bounded node indexing
- TASK-007: Incremental Updates specification
- VSA bundle associativity property: Plate, T. A. (2003). "Holographic reduced representations"