
# ETH.id Architecture

## Version 1.0.0

---

## Design Philosophy

ETH.id combines Zero-Knowledge Proofs with Large Language Models to create a new verification primitive: **cryptographically provable answers to natural language questions about documents, without exposing the documents themselves**.

The key insight: ZK proofs establish mathematical relationships, while LLMs interpret semantic meaning. Together, they cover both the deterministic and the semantic side of document verification.

---

## Technology Stack

### Core Language: Rust

**Why Rust**:
- Memory safety without garbage collection
- Zero-cost abstractions for performance
- Excellent cryptography ecosystem
- `zeroize` crate for secure memory handling
- Strong type system prevents entire classes of bugs

**Alternatives considered**:
- Go: GC pauses, no deterministic control over when sensitive memory is released
- Python: Too slow for crypto operations, no memory safety
- C++: Memory safety issues, complex build systems

---

### Zero-Knowledge: Noir + Barretenberg

**Why Noir**:
- Rust-like syntax (low learning curve)
- PLONK backend (efficient proofs)
- Off-chain verification (no gas costs)
- Mature tooling (nargo CLI)
- Aztec Network backing (long-term support)

**Why not Circom**:
- Different syntax (JavaScript-like)
- Groth16 requires trusted setup per circuit
- Less ergonomic for Rust developers

**Why not Halo2**:
- Steeper learning curve
- Less mature tooling
- Smaller ecosystem

**Off-chain first**: ETH.id verifies proofs locally by default. On-chain publishing is optional and future work.

---

### LLM Providers

#### 1. Claude (Anthropic)

**Default for semantic claims**:
- Best instruction following
- Strong reasoning capabilities
- JSON mode reliability
- Constitutional AI alignment

**API**: `https://api.anthropic.com/v1/messages`

#### 2. OpenAI (GPT-4)

**Alternative provider**:
- Widely available
- Good performance
- Established ecosystem

**API**: `https://api.openai.com/v1/chat/completions`

#### 3. Ollama (Local)

**Privacy-first option**:
- Runs entirely offline
- No API keys needed
- Full data sovereignty
- Models: llama3.2, mistral, etc.

**API**: `http://localhost:11434/api/generate`

**Why Ollama is first-class**: Privacy is not a feature, it's a requirement. Offline-first ensures ETH.id works without trusting any external provider.

---

## System Architecture

### High-Level Flow

```
┌─────────────────────────────────────────────────────────────┐
│                         User's Machine                       │
│                                                              │
│  ┌──────────────┐                                           │
│  │   Document   │                                           │
│  │   (PDF/IMG)  │                                           │
│  └──────┬───────┘                                           │
│         │                                                    │
│         ▼                                                    │
│  ┌──────────────┐      ┌─────────────────┐                │
│  │   Parser     │─────▶│ ParsedDocument  │                │
│  │  (Offline)   │      │  (in memory)    │                │
│  └──────────────┘      └────────┬────────┘                │
│                                  │                          │
│                                  ▼                          │
│                         ┌────────────────┐                 │
│                         │ Claim Engine   │                 │
│                         │ (NLP → Types)  │                 │
│                         └────────┬───────┘                 │
│                                  │                          │
│                                  ▼                          │
│                         ┌────────────────┐                 │
│                         │ Privacy Filter │                 │
│                         │ (Minimize data)│                 │
│                         └────────┬───────┘                 │
│                                  │                          │
│                    ┌─────────────┴──────────────┐          │
│                    ▼                            ▼          │
│           ┌─────────────────┐         ┌──────────────┐    │
│           │  ZK Circuit     │         │ LLM Verifier │    │
│           │  (Deterministic)│         │  (Semantic)  │    │
│           └────────┬────────┘         └──────┬───────┘    │
│                    │                          │            │
│                    │         ┌────────────────┘            │
│                    │         │  (Filtered data only)       │
│                    ▼         ▼                             │
│           ┌──────────────────────┐                        │
│           │  Verification Result │                        │
│           │   (Boolean + Proof)  │                        │
│           └──────────┬───────────┘                        │
│                      │                                     │
│         ┌────────────┴────────────┐                       │
│         ▼                         ▼                        │
│  ┌─────────────┐         ┌──────────────┐                │
│  │ Attestation │         │  Audit Log   │                │
│  │   Bundle    │         │ (Hash only)  │                │
│  └─────────────┘         └──────────────┘                │
│                                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              │ (Only if LLM mode)
                              ▼
                    ┌──────────────────┐
                    │  LLM Provider    │
                    │ (Claude/OpenAI/  │
                    │    Ollama)       │
                    └──────────────────┘
```

---

## Module Architecture

### 1. Parser Module (`src/parser/`)

**Responsibility**: Extract structured data from documents

**Submodules**:
- `pdf.rs`: PDF text extraction via `pdf-extract`
- `image.rs`: Image metadata + base64 encoding (OCR optional)
- `json.rs`: Structured JSON parsing
- `text.rs`: Plain text parsing

**Key Design Decision**: Parser is 100% offline. No network calls, no external dependencies beyond file I/O.

**Data Flow**:
```
File → bytes → Parser → ParsedDocument (in-memory struct)
```

**Security**: `ParsedDocument` implements `Drop` with `zeroize` to clear sensitive data from memory.
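
As a rough illustration of the zeroize-on-drop idea, the sketch below uses only the standard library so that it compiles without dependencies. The `text` field and the `wipe` method are hypothetical names. The real `ParsedDocument` presumably has more fields and, as stated above, relies on the `zeroize` crate rather than hand-written volatile writes.

```rust
use std::ptr;
use std::sync::atomic::{compiler_fence, Ordering};

/// Simplified stand-in for the parser output (the field name is an assumption).
pub struct ParsedDocument {
    pub text: String,
}

impl ParsedDocument {
    /// Overwrite the text buffer with zero bytes in place.
    pub fn wipe(&mut self) {
        // SAFETY: a run of NUL bytes is valid UTF-8, so the String stays well-formed.
        let bytes = unsafe { self.text.as_bytes_mut() };
        for byte in bytes.iter_mut() {
            // Volatile writes keep the compiler from eliding these "dead" stores.
            unsafe { ptr::write_volatile(byte as *mut u8, 0) };
        }
        // Keep the compiler from moving later code ahead of the wipe.
        compiler_fence(Ordering::SeqCst);
    }
}

impl Drop for ParsedDocument {
    fn drop(&mut self) {
        // Clear the buffer before the allocation is returned to the allocator.
        self.wipe();
    }
}
```

Note that this only clears the buffer's current allocation. Copies left behind by earlier reallocations or moves are not covered, which is one reason to prefer a dedicated crate in real code.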

---

### 2. Claims Module (`src/claims/`)

**Responsibility**: Transform natural language claims into typed queries

**Submodules**:
- `types.rs`: Claim type definitions (DateClaim, IdentityClaim, etc.)
- `engine.rs`: NLP parsing with regex patterns
- `validator.rs`: Claim validation logic

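**Key Design Decision**: Claims are NOT free-form strings internally. They are parsed into Rust enums with validation, so input that does not match a known claim pattern is rejected before it can reach a prompt. This limits the prompt-injection surface at the type level.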

**Claim Types**:
```rust
pub enum ClaimQuery {
    Date(DateClaim),        // Age, expiry, date ranges
    Identity(IdentityClaim), // CPF, RG, name matching
    Amount(AmountClaim),     // Salary, balance thresholds
    Signature(SignatureClaim), // Document signing
    Presence(PresenceClaim),  // Field existence
    Comparative(ComparativeClaim), // Field comparisons
}
```

**Example Parsing**:
```
Input: "maior de 18 anos"
Output: ClaimQuery::Date(DateClaim {
    operation: AgeGreaterThan,
    age_threshold: Some(18),
    ...
})
```
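
The parsing step above might look roughly like the following. This is a simplified sketch: the types carry only the two fields shown in the example (the elided fields are left out), `parse_claim` is a hypothetical function name, and plain prefix matching stands in for the regex patterns that `engine.rs` is described as using.

```rust
#[derive(Debug, PartialEq)]
pub enum DateOperation {
    AgeGreaterThan,
}

/// Reduced version of `DateClaim` with only the fields shown in the example.
#[derive(Debug, PartialEq)]
pub struct DateClaim {
    pub operation: DateOperation,
    pub age_threshold: Option<u32>,
}

#[derive(Debug, PartialEq)]
pub enum ClaimQuery {
    Date(DateClaim),
}

/// Parse an age claim such as "maior de 18 anos" or "over 21 years old".
/// Anything that does not match a known pattern yields `None`.
pub fn parse_claim(input: &str) -> Option<ClaimQuery> {
    let lower = input.trim().to_lowercase();
    // Hypothetical patterns: a Portuguese and an English prefix.
    let rest = lower
        .strip_prefix("maior de ")
        .or_else(|| lower.strip_prefix("over "))?;
    // Take the leading run of digits as the threshold.
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    let age: u32 = digits.parse().ok()?;
    Some(ClaimQuery::Date(DateClaim {
        operation: DateOperation::AgeGreaterThan,
        age_threshold: Some(age),
    }))
}
```

Because the function returns a typed value or nothing, free text such as "ignore previous instructions" never becomes a claim.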

---

### 3. Privacy Module (`src/privacy/`)

**Responsibility**: Minimize data exposure before LLM/ZK processing

**Submodules**:
- `filter.rs`: Main privacy filter logic
- `minimizer.rs`: Extract only relevant fields
- `virtualizer.rs`: Compute results locally (e.g., age from birth date)

**Filter Modes**:

1. **Virtualization**: Compute locally, send only result
   - Use case: Age verification
   - Input: Birth date
   - Output: Boolean (age > threshold)

2. **Hash Partial**: Mask sensitive parts
   - Use case: CPF verification
   - Input: 123.456.789-00
   - Output: 123.***.***-00

3. **Minimization**: Extract only relevant fields
   - Use case: Amount verification
   - Input: Full document
   - Output: Single "amount" field

**Key Design Decision**: Privacy Filter runs BEFORE any external call. The intent is that it cannot be bypassed, because it is enforced by the structure of the program rather than by configuration.
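
Two of the filter modes above are simple enough to sketch with the standard library alone. The function names and the `(year, month, day)` tuple representation are assumptions made for illustration, not the crate's actual API.

```rust
/// Age in whole years on `today`, given `(year, month, day)` tuples.
pub fn age_on(birth: (i32, u32, u32), today: (i32, u32, u32)) -> i32 {
    let mut age = today.0 - birth.0;
    // Subtract one year if this year's birthday has not happened yet.
    if (today.1, today.2) < (birth.1, birth.2) {
        age -= 1;
    }
    age
}

/// Virtualization: only this boolean leaves the filter, never the birth date.
/// Mirrors the "age > threshold" wording used above (strictly greater).
pub fn exceeds_age(birth: (i32, u32, u32), today: (i32, u32, u32), threshold: i32) -> bool {
    age_on(birth, today) > threshold
}

/// Hash Partial: mask the middle digits of a CPF written as NNN.NNN.NNN-NN.
/// Returns `None` if the input does not have exactly that shape.
pub fn mask_cpf(cpf: &str) -> Option<String> {
    let bytes = cpf.as_bytes();
    if bytes.len() != 14 {
        return None;
    }
    for (i, &c) in bytes.iter().enumerate() {
        let ok = match i {
            3 | 7 => c == b'.',
            11 => c == b'-',
            _ => c.is_ascii_digit(),
        };
        if !ok {
            return None;
        }
    }
    // All bytes are ASCII at this point, so byte slicing is safe.
    Some(format!("{}.***.***-{}", &cpf[..3], &cpf[12..]))
}
```

For the CPF from the example, `mask_cpf("123.456.789-00")` returns `Some("123.***.***-00")`.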

---

### 4. Verifier Module (`src/verifier/`)

**Responsibility**: Execute verification via LLM or ZK

**Submodules**:
- `openai.rs`: OpenAI GPT-4 integration
- `claude.rs`: Anthropic Claude integration
- `ollama.rs`: Local Ollama integration

**System Prompt Strategy**:
```
You are a zero-knowledge verification system.
Answer ONLY with JSON: {"answer": bool, "confidence": float, "reasoning": string}
Never request full documents.
If data is unclear, answer false.
```

**Response Parsing**: Strict JSON parsing with fallback to regex extraction. Invalid responses are errors, not warnings.
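
To illustrate "errors, not warnings", here is a hedged sketch of the validation that could follow JSON parsing. It assumes the response has already been deserialized into a struct (or that parsing failed and produced `None`), and it assumes `confidence` is meant to lie in `[0, 1]`, which the prompt does not state explicitly. All names are hypothetical.

```rust
/// Parsed form of the JSON shape requested in the system prompt.
#[derive(Debug, PartialEq)]
pub struct LlmAnswer {
    pub answer: bool,
    pub confidence: f64,
    pub reasoning: String,
}

#[derive(Debug, PartialEq)]
pub enum VerifyError {
    /// The provider's output could not be parsed into `LlmAnswer`.
    Unparseable,
    /// Assumption: confidence must be a finite value in [0, 1].
    ConfidenceOutOfRange(f64),
}

/// Fail closed: a missing or malformed response is an error, never a default.
pub fn validate(parsed: Option<LlmAnswer>) -> Result<LlmAnswer, VerifyError> {
    let answer = parsed.ok_or(VerifyError::Unparseable)?;
    // NaN fails this range check as well, because every comparison with NaN is false.
    if !(0.0..=1.0).contains(&answer.confidence) {
        return Err(VerifyError::ConfidenceOutOfRange(answer.confidence));
    }
    Ok(answer)
}
```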

---

### 5. Attestation Module (`src/attestation/`)

**Responsibility**: Generate cryptographic attestation bundles

**Bundle Structure**:
```json
{
  "version": "1.0.0",
  "session_id": "uuid",
  "timestamp": "ISO8601",
  "document_hash": "SHA256",
  "claim": "natural language",
  "result": {"answer": bool, "confidence": float},
  "proof_type": "ZK" | "LLM",
  "bundle_hash": "SHA256 of all other fields"
}
```

**Integrity Verification**:
```rust
pub fn verify_integrity(&self) -> bool {
    let computed_hash = self.compute_hash();
    computed_hash == self.bundle_hash
}
```

**Key Design Decision**: Bundles are immutable and self-verifying. Tampering is detectable via hash mismatch.
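
The self-verifying property can be sketched as follows. To stay dependency-free, this example uses the standard library's `DefaultHasher` as a stand-in digest. It is NOT cryptographic and is not SHA-256; real code would substitute a SHA-256 implementation. The flattened field layout, the `seal` method, and the length-prefixed encoding are assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in digest (SipHash via DefaultHasher). Not cryptographic:
/// a real bundle would use SHA-256 here.
fn digest_hex(data: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    hasher.write(data);
    format!("{:016x}", hasher.finish())
}

pub struct AttestationBundle {
    pub version: String,
    pub session_id: String,
    pub timestamp: String,
    pub document_hash: String,
    pub claim: String,
    pub answer: bool,
    pub confidence: f64,
    pub proof_type: String,
    pub bundle_hash: String,
}

impl AttestationBundle {
    /// Hash every field except `bundle_hash` itself.
    pub fn compute_hash(&self) -> String {
        let answer = self.answer.to_string();
        let confidence = self.confidence.to_string();
        let fields: [&str; 8] = [
            &self.version, &self.session_id, &self.timestamp, &self.document_hash,
            &self.claim, &answer, &confidence, &self.proof_type,
        ];
        let mut buf = Vec::new();
        for field in fields {
            // Length-prefix each field so field boundaries are unambiguous.
            buf.extend_from_slice(&(field.len() as u64).to_be_bytes());
            buf.extend_from_slice(field.as_bytes());
        }
        digest_hex(&buf)
    }

    /// Fill in `bundle_hash` once all other fields are final.
    pub fn seal(mut self) -> Self {
        self.bundle_hash = self.compute_hash();
        self
    }

    pub fn verify_integrity(&self) -> bool {
        self.compute_hash() == self.bundle_hash
    }
}
```

A hash alone detects accidental or naive tampering. Anyone who can rewrite the bundle can also recompute the hash, so authenticity would additionally require a signature or a proof bound to the bundle.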

---

### 6. Audit Module (`src/audit/`)

**Responsibility**: Maintain append-only verification log

**Log Location**: `~/.eth-id/audit/audit.json`

**Entry Format**:
```json
{
  "session_id": "uuid",
  "timestamp": "ISO8601",
  "document_hash": "SHA256",
  "claim": "string",
  "result": bool,
  "proof_type": "zk" | "llm"
}
```

**Key Design Decision**: The audit log never contains document content. The document appears only as its SHA-256 hash, alongside the claim text and the result, so the log can be shared for compliance purposes.
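
A minimal sketch of appending an entry in the format above. It assumes one JSON object per line, which is a guess; the on-disk layout of `audit.json` is not specified here. The helper names are hypothetical, and the hand-rolled JSON escaping only stands in for a proper serializer.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

pub struct AuditEntry {
    pub session_id: String,
    pub timestamp: String,
    pub document_hash: String,
    pub claim: String,
    pub result: bool,
    pub proof_type: String,
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl AuditEntry {
    /// Serialize the entry as a single-line JSON object.
    pub fn to_json_line(&self) -> String {
        format!(
            "{{\"session_id\":{},\"timestamp\":{},\"document_hash\":{},\"claim\":{},\"result\":{},\"proof_type\":{}}}",
            json_string(&self.session_id),
            json_string(&self.timestamp),
            json_string(&self.document_hash),
            json_string(&self.claim),
            self.result,
            json_string(&self.proof_type),
        )
    }
}

/// Open the log in append mode so existing entries are never rewritten.
pub fn append_entry(path: &Path, entry: &AuditEntry) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{}", entry.to_json_line())
}
```

Append mode only prevents this code path from overwriting earlier entries. It does not stop another process from editing the file.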

---

## Decision Log

### Why ZK for Deterministic Claims?

**Deterministic claims** (age > 18, amount > 5000) have a single correct answer given the input. ZK provides:
- Mathematical proof (not probabilistic)
- Complete privacy (zero-knowledge property)
- Verifiable by anyone (proof is public)
- No trust in LLM needed

**Cost**: Circuit development time, proving overhead (~seconds)

**Benefit**: Cryptographic guarantee, perfect for legal/compliance use cases

---

### Why LLM for Semantic Claims?

**Semantic claims** (document is signed, clause is present) require understanding context and meaning. LLM provides:
- Natural language understanding
- Flexible reasoning
- Handles edge cases humans would understand

**Cost**: Privacy trade-off (filtered data sent to provider), probabilistic answers

**Benefit**: Handles claims that are impossible to encode in circuits
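
The split between the two paths can be expressed as a small routing function. This is only a sketch: `ClaimKind`, `ProofType`, and `route` are hypothetical names, and while the Date/Amount and Signature/Presence assignments follow the examples given above, the routing of Identity and Comparative claims is an assumption.

```rust
/// Payload-free mirror of the `ClaimQuery` variants, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ClaimKind {
    Date,
    Identity,
    Amount,
    Signature,
    Presence,
    Comparative,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProofType {
    Zk,
    Llm,
}

/// Choose the verification path for a claim kind.
pub fn route(kind: ClaimKind) -> ProofType {
    match kind {
        // Deterministic: a single correct answer given the input (age > 18, amount > 5000).
        ClaimKind::Date | ClaimKind::Amount => ProofType::Zk,
        // Semantic: requires understanding context (document is signed, clause is present).
        ClaimKind::Signature | ClaimKind::Presence => ProofType::Llm,
        // Assumption: without a dedicated circuit, these fall back to the LLM path.
        ClaimKind::Identity | ClaimKind::Comparative => ProofType::Llm,
    }
}
```

The exhaustive `match` means that adding a new claim variant forces a routing decision at compile time.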

---

### Why Offline-First?

**Philosophy**: Privacy should be the default, not an option.

**Implementation**:
- All document parsing is local
- ZK circuits run locally (Barretenberg)
- Ollama runs locally
- `--offline` flag blocks network entirely

**Trade-off**: Users must run Ollama locally (1-2GB download)

**Benefit**: Complete data sovereignty, no trust in external parties
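
One way to picture the `--offline` guard is a check that refuses remote providers before any request is built. The endpoints below are the ones listed under LLM Providers; the `Provider` enum and `resolve_endpoint` function are hypothetical names, and this sketch only covers provider selection, not socket-level blocking.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    Claude,
    OpenAi,
    Ollama,
}

impl Provider {
    /// API endpoint for each provider, as listed in the Technology Stack section.
    pub fn endpoint(self) -> &'static str {
        match self {
            Provider::Claude => "https://api.anthropic.com/v1/messages",
            Provider::OpenAi => "https://api.openai.com/v1/chat/completions",
            Provider::Ollama => "http://localhost:11434/api/generate",
        }
    }

    /// Only Ollama runs on the user's machine.
    pub fn is_local(self) -> bool {
        matches!(self, Provider::Ollama)
    }
}

/// Return the endpoint to call, or an error if offline mode forbids the provider.
pub fn resolve_endpoint(provider: Provider, offline: bool) -> Result<&'static str, String> {
    if offline && !provider.is_local() {
        return Err(format!(
            "{:?} is a remote provider and cannot be used with --offline",
            provider
        ));
    }
    Ok(provider.endpoint())
}
```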

---

### Why Rust Over Other Languages?

**Memory Safety**: Documents contain PII. Use-after-free or out-of-bounds reads could expose sensitive data. Safe Rust rules these out through the borrow checker and bounds-checked indexing.

**Performance**: Cryptographic operations (hashing, ZK proving) are CPU-intensive. Rust's zero-cost abstractions provide C-level performance.

**Ecosystem**: `pdf-extract`, `image`, `sha2`, `zeroize` are mature and well-maintained.

**Type Safety**: Claim types prevent entire classes of bugs. Rust's enum system is perfect for modeling claim variants.

---

## Limitations and Future Work

### Current Limitations

1. **OCR Not Implemented**: Image documents require manual text extraction
2. **Limited Date Formats**: Only common formats (DD/MM/YYYY, YYYY-MM-DD) supported
3. **No Circuit Compilation**: ZK circuits are placeholders; compiling them requires the Noir toolchain
4. **Single Document**: Batch verification not yet implemented
5. **No Revocation**: Attestations cannot be revoked once created

### Future Enhancements

1. **OCR Integration**: Tesseract for scanned documents
2. **More ZK Circuits**: hash_match, date_range, field_presence
3. **On-Chain Publishing**: Optional Ethereum attestation publishing
4. **Multi-Document**: Verify claims across multiple documents
5. **Revocation Lists**: CRL-style attestation revocation
6. **Mobile Support**: iOS/Android apps via Rust FFI

---

## Performance Characteristics

### Document Parsing
- **PDF (1MB)**: ~100-200ms
- **Image (5MB)**: ~300-500ms
- **JSON (100KB)**: ~10-20ms

### Privacy Filter
- **Virtualization**: ~1-5ms (local computation)
- **Hash Partial**: ~1ms (regex + masking)
- **Minimization**: ~5-10ms (field extraction)

### LLM Verification
- **OpenAI**: ~1-3s (network latency + inference)
- **Claude**: ~1-2s (network latency + inference)
- **Ollama (local)**: ~500ms-2s (depends on model size)

### ZK Proving (Estimated)
- **age_check**: ~2-5s (circuit compilation + proving)
- **amount_threshold**: ~1-3s
- **Verification**: ~10-50ms (very fast)

---

## Security Properties

### Cryptographic

1. **SHA-256 Hashing**: 128-bit collision resistance, 256-bit preimage resistance
2. **PLONK Proofs**: Knowledge soundness in the algebraic group model, under a discrete-log-type assumption
3. **TLS**: All remote LLM API calls use HTTPS (Ollama is reached over HTTP on localhost)

### Memory Safety

1. **No Buffer Overflows**: Safe Rust bounds-checks slice and array access
2. **No Use-After-Free**: Borrow checker enforces
3. **Zeroization**: Sensitive data cleared on drop

### Privacy

1. **No Document Persistence**: Document content is never written to disk
2. **Minimal Disclosure**: Privacy Filter enforces
3. **Hash-Only Logs**: Audit trail is privacy-preserving

---

## Testing Strategy

### Unit Tests
- Claim parsing (regex patterns)
- Privacy Filter modes
- CPF validation
- Date parsing

### Integration Tests
- End-to-end verification flow
- Attestation bundle creation
- Audit log integrity

### Adversarial Tests
- Prompt injection attempts
- Privacy Filter bypass attempts
- Log reconstruction attacks

---

## Deployment Modes

### Local Development
```bash
cargo run -- verify --doc test.pdf --claim "over 18" --debug
```

### Production (with OpenAI)
```bash
export OPENAI_API_KEY=sk-...
eth verify --doc passport.pdf --claim "over 21 years old"
```

### Maximum Privacy (Offline)
```bash
# Start Ollama
ollama serve

# Run verification
eth verify --doc id.pdf --claim "over 18" --offline --provider ollama
```

---

## Comparison with Alternatives

### vs Traditional KYC
- **ETH.id**: Document stays local, minimal disclosure
- **KYC**: Full document uploaded, stored indefinitely

### vs Manual Verification
- **ETH.id**: Cryptographic proof, auditable
- **Manual**: No proof, trust-based

### vs Blockchain Identity
- **ETH.id**: Off-chain, no gas costs, instant
- **Blockchain**: On-chain, expensive, slow

---

## Version History

- **1.0.0** (2026-02-24): Initial architecture