# ETH.id Architecture
## Version 1.0.0
---
## Design Philosophy
ETH.id combines Zero-Knowledge Proofs with Large Language Models to create a new verification primitive: **cryptographically provable answers to natural language questions about documents, without exposing the documents themselves**.
The key insight: ZK proofs establish mathematical relationships, while LLMs interpret semantic meaning. Together, they cover both deterministic and semantic document verification needs.
---
## Technology Stack
### Core Language: Rust
**Why Rust**:
- Memory safety without garbage collection
- Zero-cost abstractions for performance
- Excellent cryptography ecosystem
- `zeroize` crate for secure memory handling
- Strong type system prevents entire classes of bugs
**Alternatives considered**:
- Go: Memory-safe, but garbage-collected (GC pauses, less control over when secrets are cleared from memory)
- Python: Too slow for crypto operations, no memory safety
- C++: Memory safety issues, complex build systems
---
### Zero-Knowledge: Noir + Barretenberg
**Why Noir**:
- Rust-like syntax (low learning curve)
- PLONK backend (efficient proofs)
- Off-chain verification (no gas costs)
- Mature tooling (nargo CLI)
- Aztec Network backing (long-term support)
**Why not Circom**:
- Different syntax (JavaScript-like)
- Groth16 requires trusted setup per circuit
- Less ergonomic for Rust developers
**Why not Halo2**:
- Steeper learning curve
- Less mature tooling
- Smaller ecosystem
**Off-chain first**: ETH.id verifies proofs locally by default. On-chain publishing is optional and future work.
---
### LLM Providers
#### 1. Claude (Anthropic)
**Default for semantic claims**:
- Best instruction following
- Strong reasoning capabilities
- Reliable structured JSON output
- Constitutional AI alignment
**API**: `https://api.anthropic.com/v1/messages`
#### 2. OpenAI (GPT-4)
**Alternative provider**:
- Widely available
- Good performance
- Established ecosystem
**API**: `https://api.openai.com/v1/chat/completions`
#### 3. Ollama (Local)
**Privacy-first option**:
- Runs entirely offline
- No API keys needed
- Full data sovereignty
- Models: llama3.2, mistral, etc.
**API**: `http://localhost:11434/api/generate`
**Why Ollama is first-class**: Privacy is not a feature, it's a requirement. Offline-first ensures ETH.id works without trusting any external provider.
---
## System Architecture
### High-Level Flow
```
┌─────────────────────────────────────────────────────────────┐
│ User's Machine │
│ │
│ ┌──────────────┐ │
│ │ Document │ │
│ │ (PDF/IMG) │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ Parser │─────▶│ ParsedDocument │ │
│ │ (Offline) │ │ (in memory) │ │
│ └──────────────┘ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Claim Engine │ │
│ │ (NLP → Types) │ │
│ └────────┬───────┘ │
│ │ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Privacy Filter │ │
│ │ (Minimize data)│ │
│ └────────┬───────┘ │
│ │ │
│ ┌─────────────┴──────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌──────────────┐ │
│ │ ZK Circuit │ │ LLM Verifier │ │
│ │ (Deterministic)│ │ (Semantic) │ │
│ └────────┬────────┘ └──────┬───────┘ │
│ │ │ │
│ │ ┌────────────────┘ │
│ │ │ (Filtered data only) │
│ ▼ ▼ │
│ ┌──────────────────────┐ │
│ │ Verification Result │ │
│ │ (Boolean + Proof) │ │
│ └──────────┬───────────┘ │
│ │ │
│ ┌────────────┴────────────┐ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌──────────────┐ │
│ │ Attestation │ │ Audit Log │ │
│ │ Bundle │ │ (Hash only) │ │
│ └─────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
│
│ (Only if LLM mode)
▼
┌──────────────────┐
│ LLM Provider │
│ (Claude/OpenAI/ │
│ Ollama) │
└──────────────────┘
```
---
## Module Architecture
### 1. Parser Module (`src/parser/`)
**Responsibility**: Extract structured data from documents
**Submodules**:
- `pdf.rs`: PDF text extraction via `pdf-extract`
- `image.rs`: Image metadata + base64 encoding (OCR optional)
- `json.rs`: Structured JSON parsing
- `text.rs`: Plain text parsing
**Key Design Decision**: Parser is 100% offline: no network calls and no external services, only local file I/O and in-process crates such as `pdf-extract`.
**Data Flow**:
```
File → bytes → Parser → ParsedDocument (in-memory struct)
```
**Security**: `ParsedDocument` implements `Drop` with `zeroize` to clear sensitive data from memory.
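A minimal, std-only sketch of the wipe-on-drop idea. The `raw_text` field and the `wipe` helper are hypothetical names chosen for illustration; the actual struct layout is not shown in this document, and the `zeroize` crate would replace the manual loop below.

```rust
use std::sync::atomic::{compiler_fence, Ordering};

/// Hypothetical shape of the parser output; the real fields may differ.
pub struct ParsedDocument {
    pub raw_text: Vec<u8>,
}

impl ParsedDocument {
    /// Overwrite the buffer with zeros. Volatile writes keep the compiler
    /// from discarding the stores as dead code.
    pub fn wipe(&mut self) {
        for byte in self.raw_text.iter_mut() {
            // SAFETY: `byte` is a valid, aligned, exclusive reference.
            unsafe { std::ptr::write_volatile(byte, 0) };
        }
        compiler_fence(Ordering::SeqCst);
    }
}

impl Drop for ParsedDocument {
    /// Clear the contents whenever the document goes out of scope.
    fn drop(&mut self) {
        self.wipe();
    }
}
```

This only clears the buffer the struct still owns. Copies made earlier (for example by a reallocating `Vec`) are out of its reach, which is one reason to size buffers up front.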
---
### 2. Claims Module (`src/claims/`)
**Responsibility**: Transform natural language claims into typed queries
**Submodules**:
- `types.rs`: Claim type definitions (DateClaim, IdentityClaim, etc.)
- `engine.rs`: NLP parsing with regex patterns
- `validator.rs`: Claim validation logic
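**Key Design Decision**: Claims are NOT free-form strings internally. They are parsed into Rust enums with validation, so input that does not match a known claim shape is rejected before it reaches a verifier. This limits prompt injection through the claim text at the type level.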
**Claim Types**:
```rust
pub enum ClaimQuery {
Date(DateClaim), // Age, expiry, date ranges
Identity(IdentityClaim), // CPF, RG, name matching
Amount(AmountClaim), // Salary, balance thresholds
Signature(SignatureClaim), // Document signing
Presence(PresenceClaim), // Field existence
Comparative(ComparativeClaim), // Field comparisons
}
```
**Example Parsing**:
```
Input: "maior de 18 anos"
Output: ClaimQuery::Date(DateClaim {
operation: AgeGreaterThan,
age_threshold: Some(18),
...
})
```
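The parsing step above could be sketched as follows. This is a reduced illustration, not the engine itself: `DateClaim` carries only the two fields shown in the example, `parse_claim` is a hypothetical function name, and plain string matching stands in for the regex patterns of `engine.rs` to keep the sketch dependency-free.

```rust
#[derive(Debug, PartialEq)]
pub enum DateOperation {
    AgeGreaterThan,
}

/// Reduced version of the claim struct: only the fields shown above.
#[derive(Debug, PartialEq)]
pub struct DateClaim {
    pub operation: DateOperation,
    pub age_threshold: Option<u32>,
}

#[derive(Debug, PartialEq)]
pub enum ClaimQuery {
    Date(DateClaim),
}

/// Map a natural-language age claim to a typed query.
/// Anything that does not match a known pattern yields `None`.
pub fn parse_claim(input: &str) -> Option<ClaimQuery> {
    let text = input.trim().to_lowercase();
    // Accept the Portuguese and English phrasings used in this document.
    let rest = text
        .strip_prefix("maior de ")
        .or_else(|| text.strip_prefix("over "))?;
    // Take the leading digits as the age threshold.
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    let age: u32 = digits.parse().ok()?;
    Some(ClaimQuery::Date(DateClaim {
        operation: DateOperation::AgeGreaterThan,
        age_threshold: Some(age),
    }))
}
```

Because the result is an enum rather than a string, text such as "ignore previous instructions" never becomes a query at all.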
---
### 3. Privacy Module (`src/privacy/`)
**Responsibility**: Minimize data exposure before LLM/ZK processing
**Submodules**:
- `filter.rs`: Main privacy filter logic
- `minimizer.rs`: Extract only relevant fields
- `virtualizer.rs`: Compute results locally (e.g., age from birth date)
**Filter Modes**:
1. **Virtualization**: Compute locally, send only result
- Use case: Age verification
- Input: Birth date
- Output: Boolean (age > threshold)
2. **Hash Partial**: Mask sensitive parts
- Use case: CPF verification
- Input: 123.456.789-00
- Output: 123.***.***-00
3. **Minimization**: Extract only relevant fields
- Use case: Amount verification
- Input: Full document
- Output: Single "amount" field
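Two of the modes above are small enough to sketch directly. The function names and the tuple-based date representation are assumptions made for this example, and the comparison follows the `age > threshold` rule stated in the list.

```rust
/// Hash Partial: mask the middle of a punctuated CPF,
/// e.g. "123.456.789-00" becomes "123.***.***-00".
pub fn mask_cpf(cpf: &str) -> Option<String> {
    let bytes = cpf.as_bytes();
    // Expect the 14-character layout ddd.ddd.ddd-dd.
    if bytes.len() != 14 {
        return None;
    }
    let layout_ok = bytes.iter().enumerate().all(|(i, b)| match i {
        3 | 7 => *b == b'.',
        11 => *b == b'-',
        _ => b.is_ascii_digit(),
    });
    if !layout_ok {
        return None;
    }
    // Safe to slice by byte index: the input was verified to be ASCII.
    Some(format!("{}.***.***-{}", &cpf[0..3], &cpf[12..14]))
}

/// Virtualization: compute the age locally and release only a boolean.
/// Dates are (year, month, day) tuples.
pub fn is_older_than(birth: (i32, u32, u32), today: (i32, u32, u32), threshold: u32) -> bool {
    let (birth_year, birth_month, birth_day) = birth;
    let (year, month, day) = today;
    let mut age = year - birth_year;
    // Subtract one if this year's birthday has not happened yet.
    if (month, day) < (birth_month, birth_day) {
        age -= 1;
    }
    age > threshold as i32
}
```

In both cases the caller receives strictly less than the input: a masked string or a single boolean.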
**Key Design Decision**: Privacy Filter operates BEFORE any external call. Enforcement is structural rather than a convention callers must remember, so unfiltered data should have no code path to a provider.
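One way such structural enforcement could look in Rust is a payload type that only the privacy module can construct. The names `FilteredPayload`, `minimize`, and `build_request` are hypothetical, and the `field: value` line format is an assumption for the example.

```rust
mod privacy {
    /// Data cleared for release. The field is private, so code outside this
    /// module cannot build a payload without going through a filter.
    pub struct FilteredPayload {
        content: String,
    }

    impl FilteredPayload {
        pub fn as_str(&self) -> &str {
            &self.content
        }
    }

    /// Minimization: keep only the first `field: value` line of the document.
    pub fn minimize(document: &str, field: &str) -> Option<FilteredPayload> {
        let prefix = format!("{field}:");
        document
            .lines()
            .map(str::trim)
            .find(|line| line.starts_with(prefix.as_str()))
            .map(|line| FilteredPayload { content: line.to_string() })
    }
}

/// Hypothetical provider boundary: it accepts filtered data only,
/// so raw document text cannot be passed in by mistake.
pub fn build_request(payload: &privacy::FilteredPayload, claim: &str) -> String {
    format!("claim: {claim}\ndata: {}", payload.as_str())
}
```

With this shape, forgetting to call the filter is a compile error rather than a privacy incident.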
---
### 4. Verifier Module (`src/verifier/`)
**Responsibility**: Execute verification via LLM or ZK
**Submodules**:
- `openai.rs`: OpenAI GPT-4 integration
- `claude.rs`: Anthropic Claude integration
- `ollama.rs`: Local Ollama integration
**System Prompt Strategy**:
```
You are a zero-knowledge verification system.
Answer ONLY with JSON: {"answer": bool, "confidence": float, "reasoning": string}
Never request full documents.
If data is unclear, answer false.
```
**Response Parsing**: Strict JSON parsing with fallback to regex extraction. Invalid responses are errors, not warnings.
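The "errors, not warnings" rule can be illustrated with a validation step that runs after the JSON has been decoded (decoding itself is not shown). The type names and the `[0, 1]` confidence range are assumptions; the document does not specify a range.

```rust
#[derive(Debug, PartialEq)]
pub enum ResponseError {
    MissingAnswer,
    InvalidConfidence,
}

#[derive(Debug, PartialEq)]
pub struct Verdict {
    pub answer: bool,
    pub confidence: f64,
}

/// Validate the decoded fields of a provider response.
/// A missing or malformed field is a hard error; nothing is defaulted.
pub fn validate(answer: Option<bool>, confidence: Option<f64>) -> Result<Verdict, ResponseError> {
    let answer = answer.ok_or(ResponseError::MissingAnswer)?;
    // Assumed range: confidence must lie in [0, 1]. NaN fails this check.
    let confidence = confidence
        .filter(|c| (0.0..=1.0).contains(c))
        .ok_or(ResponseError::InvalidConfidence)?;
    Ok(Verdict { answer, confidence })
}
```

Returning `Result` forces every caller to handle the failure case explicitly instead of logging it and continuing.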
---
### 5. Attestation Module (`src/attestation/`)
**Responsibility**: Generate cryptographic attestation bundles
**Bundle Structure**:
```json
{
"version": "1.0.0",
"session_id": "uuid",
"timestamp": "ISO8601",
"document_hash": "SHA256",
"claim": "natural language",
"result": {"answer": bool, "confidence": float},
"proof_type": "ZK" | "LLM",
"bundle_hash": "SHA256 of all other bundle fields"
}
```
**Integrity Verification**:
```rust
pub fn verify_integrity(&self) -> bool {
let computed_hash = self.compute_hash();
computed_hash == self.bundle_hash
}
```
**Key Design Decision**: Bundles are immutable and self-verifying. Tampering is detectable via hash mismatch.
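The self-verifying property can be sketched with a reduced bundle. Two loud assumptions: the struct keeps only a few of the fields listed above, and `DefaultHasher` from the standard library stands in for SHA-256 so the example has no dependencies. `DefaultHasher` is NOT cryptographic and must not be used for real attestations. The `seal` method is a hypothetical helper.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Reduced bundle: a subset of the fields in the structure above.
pub struct AttestationBundle {
    pub document_hash: String,
    pub claim: String,
    pub answer: bool,
    pub proof_type: String,
    pub bundle_hash: String,
}

impl AttestationBundle {
    /// Hash every field except `bundle_hash` itself.
    /// Stand-in hasher only; a real implementation would use SHA-256.
    pub fn compute_hash(&self) -> String {
        let mut hasher = DefaultHasher::new();
        (&self.document_hash, &self.claim, self.answer, &self.proof_type).hash(&mut hasher);
        format!("{:016x}", hasher.finish())
    }

    /// Store the hash once all other fields are final.
    pub fn seal(&mut self) {
        self.bundle_hash = self.compute_hash();
    }

    /// Recompute the hash and compare it with the stored value.
    pub fn verify_integrity(&self) -> bool {
        self.compute_hash() == self.bundle_hash
    }
}
```

A bare hash detects accidental or naive modification. Someone who can rewrite the whole bundle can also recompute the hash, so authenticity would need a signature or proof on top.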
---
### 6. Audit Module (`src/audit/`)
**Responsibility**: Maintain append-only verification log
**Log Location**: `~/.eth-id/audit/audit.json`
**Entry Format**:
```json
{
"session_id": "uuid",
"timestamp": "ISO8601",
"document_hash": "SHA256",
"claim": "string",
"result": bool,
"proof_type": "zk" | "llm"
}
```
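**Key Design Decision**: The audit log references documents ONLY by hash and never stores document content, which makes it suitable for sharing for compliance purposes. The claim text and result are recorded as-is, so claims should not embed personal data.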
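A sketch of appending one entry in the format above. It assumes one JSON object per line, which is a choice made for this example (the on-disk layout of `audit.json` is not specified here), and it hand-writes a minimal JSON escaper to stay dependency-free.

```rust
use std::io::{self, Write};

pub struct AuditEntry {
    pub session_id: String,
    pub timestamp: String,
    pub document_hash: String,
    pub claim: String,
    pub result: bool,
    pub proof_type: String,
}

/// Minimal JSON string escaping: quotes, backslashes, control characters.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Append a single entry as one line. Existing content is never rewritten.
pub fn append_entry<W: Write>(log: &mut W, entry: &AuditEntry) -> io::Result<()> {
    let line = format!(
        "{{\"session_id\":\"{}\",\"timestamp\":\"{}\",\"document_hash\":\"{}\",\"claim\":\"{}\",\"result\":{},\"proof_type\":\"{}\"}}\n",
        escape_json(&entry.session_id),
        escape_json(&entry.timestamp),
        escape_json(&entry.document_hash),
        escape_json(&entry.claim),
        entry.result,
        escape_json(&entry.proof_type),
    );
    log.write_all(line.as_bytes())
}
```

For a file-backed log, the writer could be a `File` opened with `OpenOptions::new().create(true).append(true)`, which makes the operating system enforce append-only writes for that handle.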
---
## Decision Log
### Why ZK for Deterministic Claims?
**Deterministic claims** (age > 18, amount > 5000) have a single correct answer given the input. ZK provides:
- Mathematical proof (not probabilistic)
- Complete privacy (zero-knowledge property)
- Verifiable by anyone (proof is public)
- No trust in LLM needed
**Cost**: Circuit development time, proving overhead (~seconds)
**Benefit**: Cryptographic guarantee, perfect for legal/compliance use cases
---
### Why LLM for Semantic Claims?
**Semantic claims** (document is signed, clause is present) require understanding context and meaning. LLM provides:
- Natural language understanding
- Flexible reasoning
- Handles edge cases humans would understand
**Cost**: Privacy trade-off (filtered data sent to provider), probabilistic answers
**Benefit**: Handles claims that are impossible to encode in circuits
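The split between the two decisions above amounts to a routing rule. In this sketch, `ClaimKind` mirrors the variants of `ClaimQuery`, and `Backend` and `select_backend` are hypothetical names. The routing of `Identity` and `Comparative` is an assumption, since the text classifies only the date, amount, signature, and presence examples.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ClaimKind {
    Date,
    Identity,
    Amount,
    Signature,
    Presence,
    Comparative,
}

#[derive(Debug, PartialEq)]
pub enum Backend {
    Zk,
    Llm,
}

/// Pick a verification backend for a claim kind.
pub fn select_backend(kind: ClaimKind) -> Backend {
    match kind {
        // Deterministic: one correct answer for a given input.
        ClaimKind::Date | ClaimKind::Amount => Backend::Zk,
        // Semantic: requires interpreting context and meaning.
        ClaimKind::Signature | ClaimKind::Presence => Backend::Llm,
        // Assumption: not classified in the text; routed to the LLM here.
        ClaimKind::Identity | ClaimKind::Comparative => Backend::Llm,
    }
}
```

An exhaustive `match` means that adding a new claim kind fails to compile until someone decides which backend handles it.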
---
### Why Offline-First?
**Philosophy**: Privacy should be the default, not an option.
**Implementation**:
- All document parsing is local
- ZK circuits run locally (Barretenberg)
- Ollama runs locally
- `--offline` flag blocks network entirely
**Trade-off**: Users must run Ollama locally (1-2GB download)
**Benefit**: Complete data sovereignty, no trust in external parties
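A sketch of how an offline switch might gate provider selection. The endpoint URLs are the ones listed under LLM Providers; `Provider`, `OfflineViolation`, and `resolve_endpoint` are hypothetical names, and how the real `--offline` flag is implemented is not described in this document.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Provider {
    Claude,
    OpenAi,
    Ollama,
}

impl Provider {
    /// Endpoint for each provider, as listed in this document.
    pub fn endpoint(self) -> &'static str {
        match self {
            Provider::Claude => "https://api.anthropic.com/v1/messages",
            Provider::OpenAi => "https://api.openai.com/v1/chat/completions",
            Provider::Ollama => "http://localhost:11434/api/generate",
        }
    }

    /// Only Ollama runs on the user's machine.
    pub fn is_local(self) -> bool {
        matches!(self, Provider::Ollama)
    }
}

/// Returned when a remote provider is requested in offline mode.
#[derive(Debug, PartialEq)]
pub struct OfflineViolation(pub Provider);

/// Resolve the endpoint to call, refusing remote providers when offline.
pub fn resolve_endpoint(provider: Provider, offline: bool) -> Result<&'static str, OfflineViolation> {
    if offline && !provider.is_local() {
        return Err(OfflineViolation(provider));
    }
    Ok(provider.endpoint())
}
```

Failing with an error, rather than silently falling back to a local model, keeps the user aware of which provider actually answered.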
---
### Why Rust Over Other Languages?
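**Memory Safety**: Documents contain PII. Buffer over-reads or use-after-free bugs could expose sensitive data. Safe Rust rules these out, largely at compile time. Memory leaks are not prevented by the language, which is why sensitive buffers are also zeroized on drop.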
**Performance**: Cryptographic operations (hashing, ZK proving) are CPU-intensive. Rust's zero-cost abstractions provide C-level performance.
**Ecosystem**: `pdf-extract`, `image`, `sha2`, `zeroize` are mature and well-maintained.
**Type Safety**: Claim types prevent entire classes of bugs. Rust's enum system is perfect for modeling claim variants.
---
## Limitations and Future Work
### Current Limitations
1. **OCR Not Implemented**: Image documents require manual text extraction
2. **Limited Date Formats**: Only common formats (DD/MM/YYYY, YYYY-MM-DD) supported
3. **No Circuit Compilation**: ZK circuits are placeholders, require Noir toolchain
4. **Single Document**: Batch verification not yet implemented
5. **No Revocation**: Attestations cannot be revoked once created
### Future Enhancements
1. **OCR Integration**: Tesseract for scanned documents
2. **More ZK Circuits**: hash_match, date_range, field_presence
3. **On-Chain Publishing**: Optional Ethereum attestation publishing
4. **Multi-Document**: Verify claims across multiple documents
5. **Revocation Lists**: CRL-style attestation revocation
6. **Mobile Support**: iOS/Android apps via Rust FFI
---
## Performance Characteristics
### Document Parsing
- **PDF (1MB)**: ~100-200ms
- **Image (5MB)**: ~300-500ms
- **JSON (100KB)**: ~10-20ms
### Privacy Filter
- **Virtualization**: ~1-5ms (local computation)
- **Hash Partial**: ~1ms (regex + masking)
- **Minimization**: ~5-10ms (field extraction)
### LLM Verification
- **OpenAI**: ~1-3s (network latency + inference)
- **Claude**: ~1-2s (network latency + inference)
- **Ollama (local)**: ~500ms-2s (depends on model size)
### ZK Proving (Estimated)
- **age_check**: ~2-5s (circuit compilation + proving)
- **amount_threshold**: ~1-3s
- **Verification**: ~10-50ms (very fast)
---
## Security Properties
### Cryptographic
1. **SHA-256 Hashing**: 128-bit collision resistance, 256-bit preimage resistance
2. **PLONK Proofs**: Soundness under discrete-log-type assumptions, relying on a universal trusted setup when KZG commitments are used
3. **TLS**: All remote LLM API calls (Claude, OpenAI) use HTTPS; Ollama traffic stays on localhost
### Memory Safety
1. **No Buffer Overflows**: Safe Rust bounds-checks accesses (at runtime where the compiler cannot prove them in range)
2. **No Use-After-Free**: Borrow checker enforces
3. **Zeroization**: Sensitive data cleared on drop
### Privacy
1. **No Disk Writes**: Documents never persisted
2. **Minimal Disclosure**: Privacy Filter enforces
3. **Hash-Only Logs**: Audit trail references documents by hash only
---
## Testing Strategy
### Unit Tests
- Claim parsing (regex patterns)
- Privacy Filter modes
- CPF validation
- Date parsing
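As an illustration of what the CPF validation tests could exercise, here is a sketch of the standard CPF check-digit rule. The function name `is_valid_cpf` is hypothetical and the project's own validator is not shown in this document.

```rust
/// Validate a CPF by its two check digits.
/// Non-digit characters (dots, dashes) are ignored.
pub fn is_valid_cpf(input: &str) -> bool {
    let digits: Vec<u32> = input.chars().filter_map(|c| c.to_digit(10)).collect();
    // A CPF has 11 digits; sequences of one repeated digit are invalid.
    if digits.len() != 11 || digits.iter().all(|&d| d == digits[0]) {
        return false;
    }
    // Check digit over the first `len` digits, with weights len+1 down to 2.
    let check = |len: usize| -> u32 {
        let sum: u32 = digits[..len]
            .iter()
            .enumerate()
            .map(|(i, &d)| d * (len as u32 + 1 - i as u32))
            .sum();
        // A remainder of 10 maps to the digit 0.
        (sum * 10) % 11 % 10
    };
    check(9) == digits[9] && check(10) == digits[10]
}
```

A unit test would pair a known-valid number with near misses, such as a wrong final digit, a repeated-digit sequence, and a truncated input.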
### Integration Tests
- End-to-end verification flow
- Attestation bundle creation
- Audit log integrity
### Adversarial Tests
- Prompt injection attempts
- Privacy Filter bypass attempts
- Log reconstruction attacks
---
## Deployment Modes
### Local Development
```bash
cargo run -- verify --doc test.pdf --claim "over 18" --debug
```
### Production (with OpenAI)
```bash
export OPENAI_API_KEY=sk-...
eth verify --doc passport.pdf --claim "over 21 years old"
```
### Maximum Privacy (Offline)
```bash
# Start Ollama
ollama serve
# Run verification
eth verify --doc id.pdf --claim "over 18" --offline --provider ollama
```
---
## Comparison with Alternatives
### vs Traditional KYC
- **ETH.id**: Document stays local, minimal disclosure
- **KYC**: Full document uploaded, stored indefinitely
### vs Manual Verification
- **ETH.id**: Cryptographic proof, auditable
- **Manual**: No proof, trust-based
### vs Blockchain Identity
- **ETH.id**: Off-chain, no gas costs, instant
- **Blockchain**: On-chain, expensive, slow
---
## Version History
- **1.0.0** (2026-02-24): Initial architecture