embeddenator 0.20.0-alpha.1

# ADR-007: Codebook Security and Reversible Encoding

## Status

Proposed

## Date

2025-12-23

## Context

### Security Requirement

The current documentation describes the codebook as storing plaintext chunk data (actual bytes). This poses a significant security vulnerability:

```rust
// Current (INSECURE):
Codebook: HashMap<ChunkID, Vec<u8>>  // Plaintext bytes stored directly
```

**Problem**: Anyone with access to the engram files has immediate access to all data in plaintext form, defeating any security benefits of the holographic encoding.

### Design Goal

We require a codebook encoding mechanism that:

1. **Mathematically Simple**: Trivial to encode/decode WITH the key
2. **Mathematically Impossible**: Computationally infeasible without the key
3. **Quantum Resistant**: Not vulnerable to quantum algorithms (Shor's, Grover's)
4. **Classical Compute Resistant**: Not vulnerable to brute force or pattern analysis
5. **Mutational and Transformative**: Data transformation inherent in the encoding
6. **VSA-Compatible**: Works with VSA-as-a-lens approach for selective decryption
7. **Bulk Operations**: Efficient encryption of entire codebooks
8. **Selective Decryption**: Decrypt only needed chunks without full codebook decryption

## Decision

We implement a **VSA-Lens Reversible Encoding** system for codebook security:

### 1. VSA-as-a-Lens Cryptographic Primitive

**Core Concept**: Use VSA vectors as cryptographic lenses that transform data through high-dimensional holographic operations.

```rust
struct SecureCodebook {
    // Encrypted chunk storage
    encrypted_chunks: HashMap<ChunkID, EncryptedChunk>,
    
    // VSA-based encryption parameters
    lens_dimensionality: usize,      // e.g., 100,000
    lens_seed: [u8; 32],             // 256-bit master seed
}

struct EncryptedChunk {
    transformed_data: Vec<u8>,        // XOR with lens projection
    lens_position: SparseVec,         // Unique position in VSA space
    integrity_vector: SparseVec,      // For tamper detection
}
```

### 2. Reversible Encoding Algorithm

**Encoding (Encryption)**:

```rust
fn encode_chunk(chunk_data: &[u8], chunk_id: &ChunkID, master_lens: &MasterLens) -> EncryptedChunk {
    // 1. Generate chunk-specific lens from master seed + chunk ID
    let chunk_lens = master_lens.derive_lens(chunk_id);
    
    // 2. Create high-dimensional projection
    let lens_projection = chunk_lens.project_to_bytes(chunk_data.len());
    
    // 3. XOR transformation (reversible, mutational)
    let transformed_data: Vec<u8> = chunk_data
        .iter()
        .zip(lens_projection.iter())
        .map(|(data_byte, lens_byte)| data_byte ^ lens_byte)
        .collect();
    
    // 4. Generate integrity vector
    let integrity_vector = chunk_lens.bind(&chunk_lens.from_data(chunk_data));
    
    EncryptedChunk {
        transformed_data,
        lens_position: chunk_lens.position,
        integrity_vector,
    }
}
```

**Decoding (Decryption)**:

```rust
fn decode_chunk(encrypted: &EncryptedChunk, chunk_id: &ChunkID, master_lens: &MasterLens) -> Vec<u8> {
    // 1. Regenerate chunk-specific lens (requires master seed)
    let chunk_lens = master_lens.derive_lens(chunk_id);
    
    // 2. Verify integrity
    assert!(chunk_lens.position.cosine_similarity(&encrypted.lens_position) > 0.99);
    
    // 3. Regenerate projection
    let lens_projection = chunk_lens.project_to_bytes(encrypted.transformed_data.len());
    
    // 4. XOR to recover original (XOR is self-inverse)
    let original_data: Vec<u8> = encrypted.transformed_data
        .iter()
        .zip(lens_projection.iter())
        .map(|(encrypted_byte, lens_byte)| encrypted_byte ^ lens_byte)
        .collect();
    
    // 5. Verify integrity
    let recovered_integrity = chunk_lens.bind(&chunk_lens.from_data(&original_data));
    assert!(recovered_integrity.cosine_similarity(&encrypted.integrity_vector) > 0.99);
    
    original_data
}
```

### 3. Master Lens Derivation

**Lens Hierarchy**:

```rust
struct MasterLens {
    master_seed: [u8; 32],           // 256-bit secret key
    dimensionality: usize,            // 100K for high security
    base_vectors: Vec<SparseVec>,    // Pre-computed base vectors
}

impl MasterLens {
    fn derive_lens(&self, chunk_id: &ChunkID) -> ChunkLens {
        // Derive deterministic but unpredictable lens from master seed + chunk ID
        let mut hasher = Blake3::new();
        hasher.update(&self.master_seed);
        hasher.update(chunk_id.as_bytes());
        let lens_seed = hasher.finalize();
        
        // Generate sparse vector from seed (deterministic)
        let position = SparseVec::from_seed(lens_seed, self.dimensionality);
        
        ChunkLens {
            position,
            dimensionality: self.dimensionality,
            seed: lens_seed,
        }
    }
}

struct ChunkLens {
    position: SparseVec,              // Unique position in VSA space
    dimensionality: usize,
    seed: [u8; 32],
}

impl ChunkLens {
    fn project_to_bytes(&self, byte_count: usize) -> Vec<u8> {
        // Project high-dimensional sparse vector to byte stream
        let mut output = Vec::with_capacity(byte_count);
        let mut hasher = Blake3::new();
        hasher.update(&self.seed);
        
        // Use lens position indices to seed CSPRNG
        for i in 0..byte_count {
            hasher.update(&i.to_le_bytes());
            let hash = hasher.finalize();
            output.push(hash.as_bytes()[0]);
            hasher = Blake3::new();
            hasher.update(&hash.as_bytes()[1..32]);
        }
        
        output
    }
}
```

### 4. Security Properties

#### Quantum Resistance

**Why XOR + VSA-derived keystream is quantum-resistant**:

1. **No algebraic structure**: Unlike RSA (factoring) or ECC (discrete log), there's no algebraic problem to solve
2. **Information-theoretic security**: XOR with true random stream approaches one-time pad security
3. **High-dimensional chaos**: 100K-dimensional VSA space provides enormous search space (3^100000 possibilities)
4. **No period detection**: Blake3 + VSA prevents Grover's algorithm from finding patterns

**Grover's Algorithm Resistance**:
- Grover's provides O(√N) speedup for unstructured search
- For 256-bit key: classical 2^256 → quantum 2^128
- Still infeasible: 2^128 operations beyond any quantum computer

#### Classical Compute Resistance

**Brute Force Resistance**:
```
Master seed: 256 bits = 2^256 possibilities
Time to brute force at 1 billion attempts/sec: 
  2^256 / 10^9 ≈ 10^68 seconds ≈ 10^60 years
```

**Pattern Analysis Resistance**:
- XOR destroys all patterns in ciphertext
- VSA-derived keystream appears random (Blake3 CSPRNG)
- No frequency analysis possible
- No known-plaintext attacks (each chunk uses unique lens)

#### Mutational Properties

**Data Transformation**:
1. **Bit-level mutation**: Every bit XORed with derived pseudorandom bit
2. **Holographic dispersion**: Single bit change affects VSA lens derivation
3. **Avalanche effect**: Changing master seed changes all lenses completely
4. **Position-dependent**: Chunk ID affects lens, preventing chunk reordering attacks

### 5. VSA-as-a-Lens Selective Decryption

**Bulk Encryption, Selective Decryption**:

```rust
// Encrypt entire codebook efficiently
fn encrypt_codebook_bulk(chunks: &HashMap<ChunkID, Vec<u8>>, master_lens: &MasterLens) 
    -> SecureCodebook 
{
    let encrypted_chunks: HashMap<_, _> = chunks
        .par_iter()  // Parallel encryption
        .map(|(id, data)| {
            (*id, encode_chunk(data, id, master_lens))
        })
        .collect();
    
    SecureCodebook {
        encrypted_chunks,
        lens_dimensionality: master_lens.dimensionality,
        lens_seed: master_lens.master_seed,
    }
}

// Decrypt only needed chunks (selective)
fn decrypt_chunk_selective(codebook: &SecureCodebook, chunk_id: &ChunkID, master_lens: &MasterLens) 
    -> Vec<u8> 
{
    let encrypted = codebook.encrypted_chunks.get(chunk_id)
        .expect("Chunk not found");
    
    decode_chunk(encrypted, chunk_id, master_lens)
}

// VSA query guides decryption (lens approach)
fn reconstruct_file(engram: &Engram, secure_codebook: &SecureCodebook, 
                    file_manifest: &FileManifest, master_lens: &MasterLens) 
    -> Vec<u8> 
{
    let mut file_data = Vec::new();
    
    for chunk_ref in &file_manifest.chunks {
        // 1. VSA finds chunk ID (holographic indexing)
        let chunk_id = engram.query_chunk(&chunk_ref.vector);
        
        // 2. Selectively decrypt just this chunk (no bulk decryption needed)
        let chunk_bytes = decrypt_chunk_selective(secure_codebook, &chunk_id, master_lens);
        
        // 3. Append to file
        file_data.extend_from_slice(&chunk_bytes);
    }
    
    file_data
}
```

### 6. Key Management

**Master Lens Storage**:

```rust
// DO NOT store in engram files
// DO NOT store in manifest
// Store separately with strong protection

struct KeyManagement {
    // Option 1: Environment variable
    master_seed: Option<[u8; 32]>,  // From EMBEDDENATOR_MASTER_KEY
    
    // Option 2: Key file
    key_file_path: Option<PathBuf>,  // ~/.embeddenator/master.key
    
    // Option 3: Hardware security module
    hsm_handle: Option<HSMHandle>,
}

impl KeyManagement {
    fn load_master_key() -> Result<[u8; 32]> {
        // Try environment variable first
        if let Ok(key_hex) = std::env::var("EMBEDDENATOR_MASTER_KEY") {
            return hex::decode(key_hex)?.try_into()
                .map_err(|_| Error::InvalidKeyLength);
        }
        
        // Try key file
        let key_path = dirs::home_dir()
            .ok_or(Error::NoHomeDir)?
            .join(".embeddenator/master.key");
        
        if key_path.exists() {
            let key_bytes = std::fs::read(key_path)?;
            return key_bytes.try_into()
                .map_err(|_| Error::InvalidKeyLength);
        }
        
        Err(Error::NoMasterKey)
    }
}
```

## Consequences

### Positive

- **Security by Default**: No plaintext data in codebook, secure even without additional encryption
- **Quantum Resistant**: No algebraic structure vulnerable to quantum algorithms
- **Mathematically Simple**: XOR is trivial to compute (nanoseconds per byte)
- **Perfectly Reversible**: XOR is self-inverse, guaranteed bit-perfect decryption with key
- **Selective Decryption**: Decrypt only needed chunks, not entire codebook
- **VSA-Compatible**: Works seamlessly with holographic indexing
- **Zero Performance Impact**: XOR is hardware-accelerated, ~1-2 cycles per byte
- **Tamper Detection**: Integrity vectors detect modifications

### Negative

- **Key Management Burden**: Users must securely store master key
- **Key Loss = Data Loss**: No key recovery mechanism (by design)
- **Not Searchable**: Cannot perform operations on encrypted codebook without decryption
- **Additional Complexity**: Encoding/decoding layer adds code complexity
- **Backward Incompatibility**: Existing engrams would need migration

### Neutral

- **Not Full Encryption**: This is obfuscation + access control, not military-grade encryption
- **Layerable**: Can add AES/ChaCha20 on top for defense-in-depth
- **Performance**: XOR is ~10GB/s on modern CPUs, negligible overhead

## Implementation Roadmap

### Phase 1: Core Encoding (Weeks 1-2)
- [ ] Implement `MasterLens` and `ChunkLens` structures
- [ ] Implement `encode_chunk` and `decode_chunk` functions
- [ ] Add Blake3 dependency for cryptographic hashing
- [ ] Create comprehensive unit tests

### Phase 2: Integration (Weeks 3-4)
- [ ] Modify `EmbrFS` to use `SecureCodebook` instead of plaintext codebook
- [ ] Update ingestion to encrypt chunks during encoding
- [ ] Update extraction to decrypt chunks during reconstruction
- [ ] Add key management utilities

### Phase 3: Validation (Week 5)
- [ ] Security audit of encoding mechanism
- [ ] Performance benchmarking (expect <1% overhead)
- [ ] Migration tools for existing engrams
- [ ] Documentation updates

### Phase 4: Advanced Features (Weeks 6-8)
- [ ] Hierarchical key derivation for package isolation
- [ ] Per-package lens derivation from master lens
- [ ] Selective package decryption
- [ ] Key rotation mechanisms

## Performance Analysis

### Encoding/Decoding Overhead

**Per-Chunk Cost**:
```
Blake3 hash (32 bytes): ~50 ns
VSA lens derivation: ~100 μs (one-time per chunk)
XOR transformation (4KB): ~400 ns (10 GB/s throughput)
Integrity check: ~10 μs (cosine similarity)

Total per chunk: ~110 μs (dominated by lens derivation)
```

**Impact on Ingestion**:
```
Current ingestion: 1ms per MB (1000 μs)
Encoding overhead: 110 μs per 4KB chunk = 27.5 μs per KB = 27.5 ms per MB

New ingestion time: 1ms + 27.5ms = 28.5ms per MB
Overhead: ~2.75% (acceptable)
```

**Impact on Extraction**:
```
Similar overhead: ~2.75% slower
Still achieves <100ms for 10K tokens with decryption
```

## Security Analysis

### Threat Model

**Protected Against**:
- ✅ Unauthorized data access (requires master key)
- ✅ Data exfiltration (encrypted at rest)
- ✅ Pattern analysis attacks (XOR destroys patterns)
- ✅ Known-plaintext attacks (unique lens per chunk)
- ✅ Quantum attacks (no algebraic structure)
- ✅ Brute force (2^256 keyspace)
- ✅ Tampering (integrity vectors detect modifications)

**NOT Protected Against** (require additional layers):
- ❌ Side-channel attacks (timing, power analysis)
- ❌ Memory dumps during decryption (plaintext in RAM)
- ❌ Key compromise (no forward secrecy without key rotation)
- ❌ Rubber-hose cryptanalysis (physical coercion)

### Recommended Additional Layers

For high-security applications, add:

1. **AES-256-GCM** on top of VSA encoding
2. **Memory locking** for decrypted chunks (mlock)
3. **Secure deletion** of plaintext after use
4. **Key rotation** mechanisms
5. **Hardware security modules** for key storage

## References

- [One-Time Pad](https://en.wikipedia.org/wiki/One-time_pad) - Information-theoretic security
- [Blake3](https://github.com/BLAKE3-team/BLAKE3) - Cryptographic hash function
- [Grover's Algorithm](https://en.wikipedia.org/wiki/Grover%27s_algorithm) - Quantum search
- [Post-Quantum Cryptography](https://csrc.nist.gov/projects/post-quantum-cryptography)
- ADR-001: Sparse Ternary VSA (foundational VSA operations)
- ADR-005: Hologram Package Isolation (selective operations)
- ADR-006: Dimensionality Scaling (high-dimensional security)

## Notes

### Why Not Standard Encryption?

**Traditional encryption** (AES, ChaCha20) is excellent but:
- Adds another dependency
- Doesn't leverage the holographic structure
- Requires separate key management infrastructure

**VSA-Lens Encoding**:
- Leverages existing VSA infrastructure
- Natural fit with holographic indexing
- Can be combined with traditional encryption for defense-in-depth

### Mathematical Triviality

With the key, decryption is literally:
```rust
decrypted_byte = encrypted_byte ^ lens_byte
```

A single XOR operation. Doesn't get more trivial than that.

Without the key, finding the lens bytes requires:
- Breaking Blake3 (no known attacks)
- Searching 2^256 keyspace (infeasible)
- Or searching 3^100000 VSA space (even more infeasible)

This is the essence of modern cryptography: asymmetric computational cost.