oxify-authz 0.1.0

# Architecture Decision Records (ADR)

This document captures key architectural decisions made during the development of oxify-authz, including context, rationale, and consequences.

---

## ADR-001: Adopt Zanzibar-Style ReBAC Model

**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Core Team

### Context

Traditional Role-Based Access Control (RBAC) struggles with:
- Complex organizational hierarchies (e.g., nested teams, inherited permissions)
- Fine-grained resource-level permissions (e.g., sharing specific documents)
- Dynamic permission relationships (e.g., "editors of documents in folder X")

### Decision

Implement Google Zanzibar-style Relationship-Based Access Control (ReBAC):
- Store permissions as relation tuples: `(namespace, object_id, relation, subject)`
- Support transitive relationships via graph traversal
- Enable hierarchical permissions through parent resource references

### Rationale

1. **Flexibility**: ReBAC handles complex scenarios RBAC cannot
   - Example: "user:alice is a viewer of doc:123 because she's a member of team:eng which has viewer access"

2. **Proven at Scale**: Google Zanzibar serves billions of requests/sec
   - Paper: https://research.google/pubs/pub48190/

3. **Industry Adoption**: SpiceDB, Ory Keto, Auth0 FGA all use ReBAC

### Consequences

**Positive:**
- Handles complex organizational structures naturally
- Fine-grained permissions without explosion of roles
- Easier to audit ("who can access X?" vs "what roles allow X?")

**Negative:**
- More complex than simple RBAC (learning curve)
- Requires careful index optimization for performance
- Potential for circular dependencies (mitigated by cycle detection)

**Mitigation:**
- Provide migration guide from RBAC → ReBAC
- Implement Leopard index for O(1) transitive checks
- Add cycle detection in permission graph

---

## ADR-002: Hybrid Storage Architecture (PostgreSQL + In-Memory Cache)

**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Core Team

### Context

Authorization checks are in the critical path of every request. Latency requirements:
- p99 < 100μs for cached checks
- p99 < 5ms for uncached checks
- Must handle 100M+ tuples for enterprise customers

### Decision

Implement multi-tier hybrid storage:
- **L0: Leopard Reachability Index** - O(1) materialized transitive closures
- **L1: In-Memory LRU Cache** - Process-local, 10K-100K entries
- **L2: Redis Distributed Cache** - Shared across API servers
- **L3: PostgreSQL** - Source of truth, durability

### Rationale

1. **Performance**: 99.9% of checks hit L1/L2 cache (<100μs)
2. **Durability**: PostgreSQL ensures no permission data loss
3. **Scalability**: Redis enables horizontal scaling of API servers
4. **Cost-Effective**: Cheaper than pure in-memory database

### Consequences

**Positive:**
- <100μs p99 latency for cached checks ✅
- Can scale to billions of tuples
- No single point of failure (cache miss → PostgreSQL)

**Negative:**
- Cache invalidation complexity (write amplification)
- Eventual consistency between cache layers
- Higher operational complexity

**Mitigation:**
- Implement cache warming on startup
- Use Bloom filters to reduce unnecessary cache checks
- Provide cache monitoring and metrics

---

## ADR-003: Multi-Tenancy via Logical Partitioning

**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Core Team

### Context

SaaS authorization service must support thousands of tenants with:
- Strong isolation (no cross-tenant data leakage)
- Per-tenant quotas (prevent noisy neighbor problem)
- Efficient resource utilization (shared infrastructure)

### Decision

Use logical partitioning with tenant_id column:
```sql
CREATE TABLE relation_tuples (
    tenant_id VARCHAR(255) NOT NULL,
    namespace VARCHAR(255) NOT NULL,
    object_id VARCHAR(255) NOT NULL,
    relation VARCHAR(255) NOT NULL,
    subject_type VARCHAR(50) NOT NULL,
    subject_id VARCHAR(255) NOT NULL,
    PRIMARY KEY (tenant_id, namespace, object_id, relation, subject_type, subject_id)
);

CREATE INDEX idx_tenant_tuples ON relation_tuples(tenant_id);
```

### Rationale

**Considered Alternatives:**
1. **Physical DB per tenant** - Expensive, doesn't scale to 10K+ tenants
2. **Schema per tenant** - PostgreSQL has limits, migration complexity
3. **Logical partitioning** - ✅ Chosen for balance of isolation & efficiency

**Why Logical Partitioning:**
- Scales to millions of tenants
- Shared query planning and optimization
- Easy to add tenant-level quotas
- Can upgrade to physical partitions if needed

### Consequences

**Positive:**
- Cost-effective multi-tenancy
- Query optimizer benefits all tenants
- Simple operational model

**Negative:**
- Risk of query bugs causing cross-tenant leaks (must validate tenant_id in all queries)
- Noisy neighbor (mitigated by quotas)

**Mitigation:**
- Tenant-aware API that auto-injects tenant_id
- Comprehensive integration tests for tenant isolation
- Per-tenant rate limiting and quota enforcement

---

## ADR-004: Leopard Reachability Index for Transitive Checks

**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Performance Team

### Context

Transitive permission checks (e.g., "is alice a member of org:acme through team:eng?") require graph traversal:
- Naive BFS/DFS: O(V + E) per check, too slow
- Pre-computation: Trade-off between freshness and performance

### Decision

Implement Leopard-style reachability index:
- Materialize transitive closures for common paths
- Dual indexing: by_subject and by_object for fast lookups
- Incremental updates on tuple writes

**Algorithm:**
```rust
// Pre-compute: "Who can access X?"
// Store: object_id → Set<subject_id>
leopard_index.insert(object_id, reachable_subjects);

// Check: O(1) lookup instead of graph traversal
fn check(object_id, subject_id) -> bool {
    leopard_index.get(object_id).contains(subject_id)
}
```

### Rationale

1. **Performance**: O(1) vs O(V+E) - 100x-1000x faster
2. **Predictable Latency**: No variance from graph complexity
3. **Battery Included**: Zanzibar paper describes similar optimization

**Benchmarks:**
- Without Leopard: 5-50ms for deep hierarchies
- With Leopard: <1ms for all depths ✅

### Consequences

**Positive:**
- Sub-millisecond transitive checks
- Scales to deep organizational hierarchies (>10 levels)

**Negative:**
- Higher memory usage (trade-off: speed vs memory)
- Index update latency on writes (100-500ms)

**Mitigation:**
- Configurable index depth limits
- Async background index updates
- Statistics tracking for monitoring

---

## ADR-005: Conditional Permissions with Request Context

**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Security Team

### Context

Modern authorization requires context-aware policies:
- Time-based: "Access only during business hours"
- Location-based: "Access only from corporate network"
- Attribute-based: "Access only if MFA verified"

### Decision

Extend tuples with optional conditions:
```rust
pub struct RelationTuple {
    pub namespace: String,
    pub object_id: String,
    pub relation: String,
    pub subject: Subject,
    pub condition: Option<RelationshipCondition>, // NEW
}

pub enum RelationshipCondition {
    TimeWindow { not_before, not_after },
    IpAddress { allowed_ips },
    Attribute { key, value },
    All { conditions },  // AND
    Any { conditions },  // OR
}
```

**Check with Context:**
```rust
let allowed = engine.check(CheckRequest {
    namespace: "document",
    object_id: "sensitive",
    relation: "view",
    subject: Subject::User("alice"),
    context: Some(RequestContext::new()
        .with_client_ip(client_ip)
        .with_attribute("mfa_verified", "true")
    ),
}).await?;
```

### Rationale

1. **Zero-Trust Security**: Verify identity, device, location, time
2. **Compliance**: SOC 2, HIPAA, GDPR require contextual access control
3. **Flexibility**: Combine conditions with AND/OR logic

### Consequences

**Positive:**
- Enables advanced security policies
- Single authorization check (no need for separate policy engine)
- Composable conditions (reusable building blocks)

**Negative:**
- Slightly higher complexity
- Condition evaluation overhead (~10μs per check)

**Mitigation:**
- Conditions are optional (backward compatible)
- Cache condition evaluation results
- Provide condition templates for common use cases

---

## ADR-006: Audit Logging with Sampling

**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Compliance Team

### Context

Compliance requirements (SOC 2, HIPAA, GDPR) mandate audit trails:
- "Who accessed what data and when?"
- Immutable, tamper-proof logs
- Long retention (7+ years for HIPAA)

**Challenge:** Logging 1M checks/sec → 86B events/day → storage explosion

### Decision

Implement configurable audit sampling:
```rust
pub struct AuditConfig {
    pub sample_rate: f64,        // 0.0-1.0 (e.g., 0.1 = 10%)
    pub log_denied: bool,        // Always log denials
    pub log_mutations: bool,     // Always log writes/deletes
    pub log_sensitive_resources: HashSet<String>,  // Always log these
}
```

**Strategy:**
- Sample routine checks (e.g., 10%)
- **Always log:**
  - ❌ Denied access attempts (security)
  - ✍️ Permission mutations (audit trail)
  - 🔒 Sensitive resource access (compliance)

### Rationale

1. **Cost-Effective**: 10% sampling → 90% storage reduction
2. **Security-First**: All security events captured
3. **Compliance**: Meets regulatory requirements

**Research:**
- Google's production systems use sampling
- Industry standard: 1-10% for non-sensitive, 100% for sensitive

### Consequences

**Positive:**
- Affordable audit logging at scale
- Complete security event capture
- Tamper-proof with integrity hashing

**Negative:**
- Sampled data less useful for debugging (mitigated: increase rate in dev)

**Mitigation:**
- Default 10% sampling (configurable)
- Provide query helpers for compliance reports
- Export audit logs to SIEM systems

---

## ADR-007: AI-Powered Anomaly Detection

**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Security Team

### Context

Security threats require real-time detection:
- Privilege escalation attempts (user probing for admin access)
- Unusual access patterns (3 AM access from unusual IP)
- Account compromise (burst of failed checks)

Traditional signature-based detection misses novel attacks.

### Decision

Implement ML-powered anomaly detection:
```rust
pub struct AnomalyDetector {
    // Learn baseline behavior per user
    subject_stats: HashMap<String, SubjectStats>,

    // Detect deviations using statistical methods
    config: AnomalyConfig {
        zscore_threshold: 2.5,      // Statistical outlier
        max_access_rate: 100/min,   // Rate limit
        enable_temporal: true,      // Time-of-day analysis
        enable_privesc: true,       // Denial rate tracking
    }
}
```

**Detection Methods:**
1. **Frequency Anomaly**: Z-score analysis (unusual access frequency)
2. **Temporal Anomaly**: Time-of-day analysis (3 AM access from 9-5 user)
3. **Resource Anomaly**: First-time access to sensitive resource
4. **Privilege Escalation**: High denial rate (>30%)
5. **Rate Limiting**: Burst detection

### Rationale

1. **Proactive Security**: Detect attacks before damage
2. **Reduces Alert Fatigue**: ML reduces false positives vs rule-based
3. **Adaptive**: Learns normal behavior per user

**Research:**
- UEBA (User and Entity Behavior Analytics) industry standard
- Similar to AWS GuardDuty, Azure Sentinel

### Consequences

**Positive:**
- Early detection of compromised accounts
- Automated response (e.g., temporary account freeze)
- Security team gets actionable alerts

**Negative:**
- False positives during baseline building period
- Requires tuning per deployment

**Mitigation:**
- Configurable baseline period (default: 100 events)
- Severity scoring (only alert on high severity)
- Integration with existing SIEM systems

---

## ADR-008: Permission Recommendations Engine

**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Product Team

### Context

Permission sprawl is common in growing organizations:
- Unused permissions (granted but never used)
- Over-permissive access (admin when viewer would suffice)
- Redundant tuples (can be simplified via hierarchy)

Manual review of 100K+ tuples is infeasible.

### Decision

Implement automated recommendation engine:
```rust
pub struct RecommendationEngine {
    // Track tuple usage over time
    tuple_usage: HashMap<String, TupleUsage>,

    // Analyze patterns
    access_patterns: HashMap<(String, String), usize>,
}

pub enum RecommendationType {
    UnusedPermission,      // <10% usage → revoke?
    HierarchicalRedundancy, // Parent already grants access
    Consolidation,          // 5+ similar tuples → create role
    RoleSuggestion,         // 3+ users with identical permissions
    Conflict,               // Multiple access levels (review)
}
```

**Priority System:**
- 🔴 **Critical**: Security risk (over-permissive)
- 🟠 **High**: Unused permissions (attack surface)
- 🟡 **Medium**: Optimization opportunity (consolidation)
- 🟢 **Low**: Nice-to-have (aesthetic improvements)

### Rationale

1. **Least Privilege**: Automated identification of over-permissions
2. **Cost Savings**: Reduce tuple count → lower storage costs
3. **Security Posture**: Smaller attack surface

**Research:**
- AWS Access Analyzer (similar concept)
- Microsoft Entra Permission Management

### Consequences

**Positive:**
- Continuous permission optimization
- Actionable insights for security teams
- Reduced operational burden

**Negative:**
- Recommendations require human judgment (can't auto-apply)
- Usage tracking adds minimal overhead

**Mitigation:**
- Clear prioritization (focus on high/critical)
- Detailed impact estimates ("removes X tuples")
- Integration with workflow tools (Jira, ServiceNow)

---

## ADR-009: gRPC API for High-Performance RPC

**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** API Team

### Context

Authorization checks are latency-sensitive:
- REST/JSON overhead: 50-200μs serialization
- HTTP/1.1: Head-of-line blocking
- Need streaming for watch API (real-time permission updates)

### Decision

Provide gRPC API alongside REST:
```protobuf
service AuthorizationService {
  rpc Check(CheckRequest) returns (CheckResponse);
  rpc BatchCheck(BatchCheckRequest) returns (BatchCheckResponse);
  rpc Write(WriteRequest) returns (WriteResponse);
  rpc Watch(WatchRequest) returns (stream WatchResponse);  // Streaming!
}
```

**Protocol Buffers:**
- Binary encoding (5-10x smaller than JSON)
- Code generation for type safety
- Backward/forward compatibility

**HTTP/2:**
- Multiplexing (no head-of-line blocking)
- Stream compression
- Server push for watch API

### Rationale

1. **Performance**: 30-50% lower latency than REST
2. **Streaming**: Watch API for real-time updates
3. **Type Safety**: Protobuf schemas prevent runtime errors
4. **Industry Standard**: Kubernetes, etcd, Istio all use gRPC

### Consequences

**Positive:**
- Lower latency (especially for batch operations)
- Streaming watch API enables reactive UIs
- Cross-language clients (code generation)

**Negative:**
- Less human-readable than JSON (debugging)
- Requires HTTP/2 (older proxies may not support)

**Mitigation:**
- Provide both gRPC and REST APIs
- gRPC-Web for browser compatibility
- grpcurl for debugging

---

## ADR-010: Edge Computing Support with CRDTs

**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Infrastructure Team

### Context

Global applications need low-latency authorization:
- CDN edge locations (Cloudflare Workers, AWS Lambda@Edge)
- Central database adds 50-200ms latency
- Need eventual consistency for distributed writes

### Decision

Implement lightweight edge engine with CRDT-based synchronization:
```rust
pub struct EdgeEngine {
    // Embedded authorization engine (no DB dependency)
    local_tuples: HashMap<String, CRDTuple>,

    // Background sync with central database
    sync_interval: Duration,

    // Conflict resolution via CRDT
    crdt_resolver: LWWResolver,  // Last-Write-Wins
}
```

**Architecture:**
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  Edge US    │◄────►│  Central DB │◄────►│  Edge EU    │
│  (read/write)│      │ (PostgreSQL)│      │  (read/write)│
└─────────────┘      └─────────────┘      └─────────────┘
       ▲                                           ▲
       └───────────CRDT Merge (eventual)──────────┘
```

**CRDT Strategy:**
- Last-Write-Wins (LWW) for simplicity
- Tombstones for deletions
- Garbage collection after sync

### Rationale

1. **Latency**: <10ms authorization at edge (vs 100ms to central DB)
2. **Availability**: Edge works even if central DB is down
3. **Cost**: Reduce cross-region DB traffic

**Use Cases:**
- CDN edge authorization (Cloudflare, Fastly)
- Mobile offline-first apps
- IoT devices with intermittent connectivity

### Consequences

**Positive:**
- Sub-10ms latency worldwide
- High availability (eventual consistency)
- Works offline

**Negative:**
- Eventual consistency (stale reads possible)
- CRDT merge conflicts (rare with LWW)

**Mitigation:**
- Configurable sync interval (default: 30s)
- Critical permissions bypass cache (read-through)
- Monitoring for replication lag

---

## Summary of Key Decisions

| ADR | Decision | Status | Impact |
|-----|----------|--------|--------|
| 001 | Zanzibar ReBAC Model | ✅ | Foundation |
| 002 | Hybrid Storage (PG + Cache) | ✅ | <100μs latency |
| 003 | Logical Multi-Tenancy | ✅ | Scalable SaaS |
| 004 | Leopard Reachability Index | ✅ | O(1) transitive |
| 005 | Conditional Permissions | ✅ | Zero-trust |
| 006 | Audit Logging (Sampling) | ✅ | Compliance |
| 007 | AI Anomaly Detection | ✅ | Proactive security |
| 008 | Permission Recommendations | ✅ | Least privilege |
| 009 | gRPC API | ✅ | Low latency RPC |
| 010 | Edge Computing (CRDT) | ✅ | Global <10ms |

---

**Document Version:** 1.0
**Last Updated:** 2026-01-19
**Maintained By:** OxiFY Authorization Team

## References

- [Google Zanzibar Paper](https://research.google/pubs/pub48190/)
- [SpiceDB Architecture](https://github.com/authzed/spicedb)
- [AWS IAM Best Practices](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html)
- [NIST Zero Trust Architecture](https://www.nist.gov/publications/zero-trust-architecture)