self_encryption 0.35.0

Self encrypting files (convergent encryption plus obfuscation)
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Development Commands

### Building and Testing

```bash
# Format code (MANDATORY before commits)
cargo fmt --all

# Run clippy linter with strict settings
cargo clippy --all-features -- -D warnings

# Run all Rust tests
cargo test --release

# Run comprehensive test script (includes Python tests)
./scripts/test.sh

# Build Python package with maturin
maturin develop --features python

# Run Python tests
pytest tests/ -v

# Run benchmarks
cargo bench

# Check for unused dependencies
cargo udeps --all-targets

# Publish dry run
cargo publish --dry-run
```

### Single Test Execution

```bash
# Run a specific Rust test
cargo test test_name --release

# Run a specific Python test
pytest tests/test_file.py::test_name -v

# Run tests with output
cargo test -- --nocapture
```

## Architecture Overview

### Core Encryption Process

The `self_encryption` crate implements convergent encryption with obfuscation in three stages:

1. **Content Chunking**: Files are split into chunks (up to 1MB each)
2. **Per-Chunk Processing**:
   - Compression (Brotli with configurable quality)
   - Encryption (AES-256-CBC)
   - XOR obfuscation
3. **Key Derivation**: Each chunk's encryption keys are derived from a circular dependency pattern:
   - Chunks 0 and 1 have special handling due to circular dependencies
   - For chunk N (where N ≥ 2): uses hashes from chunks N, (N+1) % total, (N+2) % total
   - Creates interdependency where modifying any chunk affects multiple others
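
The index pattern and the XOR step above can be sketched in a few lines. This is a minimal illustration that takes the `N, (N+1) % total, (N+2) % total` pattern literally from the list above; `key_source_indices` and `xor_with_pad` are hypothetical names, the repeating pad is an assumption, and the crate's own helpers in `src/utils.rs` may order or offset the neighbouring chunks differently.

```rust
/// Indices of the chunks whose hashes feed chunk `n`'s keys, following the
/// pattern stated in the text: N, (N+1) % total, (N+2) % total.
/// Hypothetical helper, not the crate's API.
fn key_source_indices(n: usize, total: usize) -> [usize; 3] {
    // Assumption of this sketch: at least three chunks exist.
    assert!(total >= 3, "this sketch assumes at least three chunks");
    assert!(n < total, "chunk index out of range");
    [n, (n + 1) % total, (n + 2) % total]
}

/// XOR `data` with a repeating `pad` (assumed pad handling).
/// Applying the same pad twice restores the input.
fn xor_with_pad(data: &[u8], pad: &[u8]) -> Vec<u8> {
    assert!(!pad.is_empty(), "pad must not be empty");
    data.iter().zip(pad.iter().cycle()).map(|(d, p)| d ^ p).collect()
}

fn main() {
    let total = 5;
    for n in 0..total {
        // The last chunks wrap around to the first ones.
        println!("chunk {n} -> {:?}", key_source_indices(n, total));
    }

    let pad = [0xAA, 0x55, 0x0F];
    let obfuscated = xor_with_pad(b"hello", &pad);
    assert_eq!(xor_with_pad(&obfuscated, &pad), b"hello".to_vec());
}
```

Because every chunk's keys draw on its neighbours' hashes, changing one chunk's content changes the keys of each chunk that lists it as a source.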

### Key Components

- **`src/lib.rs`**: Main library interface, exports public API including `encrypt`, `decrypt_full_set`
- **`src/encrypt.rs`**: Core encryption logic, handles chunk processing and key generation
- **`src/decrypt.rs`**: Decryption logic, reverses the encryption process
- **`src/data_map.rs`**: DataMap structure that stores chunk metadata (src/dst hashes, sizes, indices)
- **`src/stream.rs`**: Streaming encryption/decryption for memory-efficient large file handling
- **`src/chunk.rs`**: Chunk data structures (`EncryptedChunk`, `ChunkInfo`) and validation
- **`src/aes.rs`**: AES encryption implementation using CBC mode
- **`src/utils.rs`**: Utility functions for key derivation, hash extraction, chunk size calculation
- **`src/python.rs`**: PyO3 bindings for Python interface
- **`src/error.rs`**: Error types and handling

### Storage Backend Design

The library abstracts storage behind caller-supplied functions (bounded by the `Fn` traits), so any backend can be plugged in:
- Store functions: `Fn(XorName, Bytes) -> Result<()>`
- Retrieve functions: `Fn(XorName) -> Result<Bytes>`
- Supports memory, disk, or custom storage implementations
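
A minimal sketch of an in-memory backend built from two closures with the shapes listed above. To stay dependency-free it uses stand-in types: `Name` in place of `XorName`, `Blob` in place of `Bytes`, and a `String` error. The `round_trip` helper is hypothetical and only exists to exercise the closures.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Stand-ins for the crate's `XorName` and `Bytes` types (assumptions, std only).
type Name = [u8; 32];
type Blob = Vec<u8>;
type StoreResult<T> = Result<T, String>;

/// Writes every chunk through `store`, then reads each one back through `retrieve`.
fn round_trip<S, R>(chunks: &[(Name, Blob)], store: S, retrieve: R) -> StoreResult<Vec<Blob>>
where
    S: Fn(Name, Blob) -> StoreResult<()>,
    R: Fn(Name) -> StoreResult<Blob>,
{
    for (name, data) in chunks {
        store(*name, data.clone())?;
    }
    chunks.iter().map(|(name, _)| retrieve(*name)).collect()
}

fn main() {
    // In-memory backend: `RefCell` lets an `Fn` closure mutate the map.
    let backend: RefCell<HashMap<Name, Blob>> = RefCell::new(HashMap::new());

    let store = |name: Name, data: Blob| -> StoreResult<()> {
        backend.borrow_mut().insert(name, data);
        Ok(())
    };
    let retrieve = |name: Name| -> StoreResult<Blob> {
        backend
            .borrow()
            .get(&name)
            .cloned()
            .ok_or_else(|| "chunk not found".to_string())
    };

    let chunks = vec![([1u8; 32], b"first".to_vec()), ([2u8; 32], b"second".to_vec())];
    let restored = round_trip(&chunks, store, retrieve).expect("round trip failed");
    assert_eq!(restored, vec![b"first".to_vec(), b"second".to_vec()]);
}
```

A disk or network backend would keep the same two closure shapes and swap only the bodies.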

### DataMap Hierarchy

For large files, DataMaps can be shrunk hierarchically:
- Serialize large DataMap → Encrypt as data → Create new smaller DataMap
- The process repeats until the DataMap reaches a manageable size
- The `child` field tracks the hierarchy level
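
The shrinking loop can be illustrated with a toy model that tracks only entry counts. Everything here is hypothetical: `ToyDataMap`, the 100-byte serialized entry size, the three-entry threshold, and the three-chunk floor are assumptions chosen to make the loop visible, not values taken from the crate.

```rust
/// Toy stand-in for a DataMap: only the entry count and hierarchy level.
#[derive(Debug, Clone, PartialEq)]
struct ToyDataMap {
    entries: usize,
    child: Option<usize>,
}

const BYTES_PER_ENTRY: usize = 100; // hypothetical serialized size per entry
const MAX_CHUNK_SIZE: usize = 1024 * 1024; // "1MB" from the text, read as 1 MiB
const MAX_ENTRIES: usize = 3; // hypothetical "manageable" threshold

/// Repeatedly "serializes" the map, re-chunks it, and wraps it in a smaller map.
fn shrink(mut map: ToyDataMap) -> ToyDataMap {
    while map.entries > MAX_ENTRIES {
        let serialized_len = map.entries * BYTES_PER_ENTRY;
        // Encrypting the serialized map as data yields ceil(len / chunk size)
        // chunks; this sketch assumes a floor of three chunks.
        let chunks = ((serialized_len + MAX_CHUNK_SIZE - 1) / MAX_CHUNK_SIZE).max(3);
        let level = map.child.map_or(0, |l| l + 1);
        map = ToyDataMap { entries: chunks, child: Some(level) };
    }
    map
}

fn main() {
    let big = ToyDataMap { entries: 1_000_000, child: None };
    println!("{:?}", shrink(big));
}
```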

## Critical Constraints

- **Minimum file size**: 3072 bytes (3 * MIN_CHUNK_SIZE) for self-encryption
- **Chunk size**: Maximum 1MB per chunk
- **Key security**: The returned secret key from encryption requires secure handling
- **Hash verification**: All chunks are self-validating through SHA3-256 hashes
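
The size constraints above translate into a simple pre-flight check. This is a sketch under stated assumptions: `MIN_CHUNK_SIZE = 1024` is inferred from "3072 bytes (3 * MIN_CHUNK_SIZE)", "1MB" is read as 1 MiB, and `check_encryptable`, `min_chunk_count`, and `SizeError` are hypothetical names rather than the crate's API.

```rust
const MIN_CHUNK_SIZE: usize = 1024; // inferred from 3072 = 3 * MIN_CHUNK_SIZE
const MIN_ENCRYPTABLE_BYTES: usize = 3 * MIN_CHUNK_SIZE;
const MAX_CHUNK_SIZE: usize = 1024 * 1024; // "1MB", assumed to mean 1 MiB

#[derive(Debug, PartialEq)]
enum SizeError {
    TooSmall { len: usize, min: usize },
}

/// Rejects inputs below the minimum self-encryptable size.
fn check_encryptable(len: usize) -> Result<(), SizeError> {
    if len < MIN_ENCRYPTABLE_BYTES {
        return Err(SizeError::TooSmall { len, min: MIN_ENCRYPTABLE_BYTES });
    }
    Ok(())
}

/// Lower bound on the number of chunks implied by the maximum chunk size.
fn min_chunk_count(len: usize) -> usize {
    (len + MAX_CHUNK_SIZE - 1) / MAX_CHUNK_SIZE
}

fn main() {
    for len in [100usize, 3072, 5 * 1024 * 1024] {
        match check_encryptable(len) {
            Ok(()) => println!("{len} bytes: ok, at least {} chunk(s)", min_chunk_count(len)),
            Err(e) => println!("{len} bytes: rejected ({e:?})"),
        }
    }
}
```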

## Python Bindings

The Python interface is built with PyO3 and maturin:
- CLI tool: `self-encryption` command
- Module: `self_encryption` Python package
- Supports both in-memory and streaming operations

## CI/CD Workflow

- **PR checks**: Format, clippy, tests, coverage, unused deps
- **Warnings as errors**: `RUSTFLAGS="-D warnings"` enforced in CI
- **Code coverage**: Uses cargo-llvm-cov and reports to coveralls.io
- **32-bit testing**: Includes i686 target testing
- **Python package**: Automated publishing via GitHub Actions

## Performance Considerations

- Parallel chunk processing via rayon in the standard implementation
- Streaming APIs for memory efficiency with large files
- Benchmarks in `benches/lib.rs` for tracking performance
- Tuned Brotli compression settings
- Chunk size chosen according to file size

## StreamSelfEncryptor Implementation Notes

The streaming implementation differs from the standard implementation in several important ways:

### Design Differences

1. **Memory Usage**:
   - Standard: Loads the entire file into memory and processes all chunks at once
   - Streaming: Processes one chunk at a time, so memory use is bounded by the chunk size rather than the file size

2. **API Pattern**:
   - Standard: Functional approach with `encrypt(bytes) -> (DataMap, Vec<EncryptedChunk>)`
   - Streaming: Stateful object with `next_encryption()` returning chunks incrementally

3. **Chunk Processing**:
   - Standard: Special handling for chunks 0 and 1 (deferred processing due to circular dependencies)
   - Streaming: Processes all chunks uniformly (a potential issue; see Known Issues with StreamSelfEncryptor)
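
The two API shapes can be contrasted with a toy chunker that omits compression and encryption entirely. `split_all` and `ChunkStream` are hypothetical names used only for this illustration; the crate's `encrypt` and `next_encryption()` carry additional state and return types not shown here.

```rust
use std::io::{self, Cursor, Read};

/// "Standard" shape: the whole input is in memory and all chunks come back at once.
fn split_all(data: &[u8], chunk_size: usize) -> Vec<Vec<u8>> {
    assert!(chunk_size > 0, "chunk size must be positive");
    data.chunks(chunk_size).map(|c| c.to_vec()).collect()
}

/// "Streaming" shape: a stateful object that yields one chunk per call.
struct ChunkStream<R: Read> {
    reader: R,
    chunk_size: usize,
}

impl<R: Read> ChunkStream<R> {
    fn new(reader: R, chunk_size: usize) -> Self {
        assert!(chunk_size > 0, "chunk size must be positive");
        Self { reader, chunk_size }
    }

    /// Returns the next chunk, or `None` once the reader is exhausted.
    fn next_chunk(&mut self) -> io::Result<Option<Vec<u8>>> {
        let mut buf = Vec::with_capacity(self.chunk_size);
        // `take` caps the read at one chunk, so only one chunk is buffered at a time.
        self.reader
            .by_ref()
            .take(self.chunk_size as u64)
            .read_to_end(&mut buf)?;
        Ok(if buf.is_empty() { None } else { Some(buf) })
    }
}

fn main() -> io::Result<()> {
    let data: Vec<u8> = (0..10).collect();
    let all = split_all(&data, 4);

    let mut stream = ChunkStream::new(Cursor::new(data.clone()), 4);
    let mut streamed = Vec::new();
    while let Some(chunk) = stream.next_chunk()? {
        streamed.push(chunk);
    }

    // Both shapes yield the same chunk boundaries.
    assert_eq!(all, streamed);
    Ok(())
}
```

The functional form is simpler to call, while the stateful form lets the caller store or forward each chunk before the next one is read.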

### Known Issues with StreamSelfEncryptor

1. **First Two Chunks**: Does not implement the special handling for chunks 0 and 1 that the standard implementation uses. This could lead to incorrect encryption in edge cases.

2. **Error Handling**: Error handling is less robust than in the standard implementation, particularly around chunk validation.

3. **File System Dependency**: StreamSelfDecryptor relies heavily on temporary files, which adds complexity and potential failure points.

### When to Use Each Implementation

- **Standard Implementation**: Use for files that fit comfortably in memory (< 1GB)
- **Streaming Implementation**: Use for large files where memory usage is a concern
- **Note**: Both implementations produce compatible output when working correctly

### Potential Improvements Needed

1. **Unify Chunk Processing**: Align StreamSelfEncryptor's chunk processing with the standard implementation, especially for chunks 0 and 1
2. **Error Handling**: Improve error handling in the streaming implementation to match the standard implementation's robustness
3. **Reduce File System Operations**: Consider memory-mapping or buffering strategies for StreamSelfDecryptor
4. **Progress Callbacks**: Add progress reporting capabilities to streaming implementation
5. **Test Coverage**: Ensure streaming implementation has comprehensive tests for edge cases
6. **API Consistency**: Consider refactoring to provide more consistent APIs between implementations