# Architecture Decision Records (ADR)
This document captures key architectural decisions made during the development of oxify-authz, including context, rationale, and consequences.
---
## ADR-001: Adopt Zanzibar-Style ReBAC Model
**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Core Team
### Context
Traditional Role-Based Access Control (RBAC) struggles with:
- Complex organizational hierarchies (e.g., nested teams, inherited permissions)
- Fine-grained resource-level permissions (e.g., sharing specific documents)
- Dynamic permission relationships (e.g., "editors of documents in folder X")
### Decision
Implement Google Zanzibar-style Relationship-Based Access Control (ReBAC):
- Store permissions as relation tuples: `(namespace, object_id, relation, subject)`
- Support transitive relationships via graph traversal
- Enable hierarchical permissions through parent resource references
### Rationale
1. **Flexibility**: ReBAC handles complex scenarios RBAC cannot
- Example: "user:alice is a viewer of doc:123 because she's a member of team:eng which has viewer access"
2. **Proven at Scale**: Google Zanzibar serves billions of requests/sec
- Paper: https://research.google/pubs/pub48190/
3. **Industry Adoption**: SpiceDB, Ory Keto, Auth0 FGA all use ReBAC
### Consequences
**Positive:**
- Handles complex organizational structures naturally
- Fine-grained permissions without explosion of roles
- Easier to audit ("who can access X?" vs "what roles allow X?")
**Negative:**
- More complex than simple RBAC (learning curve)
- Requires careful index optimization for performance
- Potential for circular dependencies (mitigated by cycle detection)
**Mitigation:**
- Provide migration guide from RBAC → ReBAC
- Implement Leopard index for O(1) transitive checks
- Add cycle detection in permission graph
---
## ADR-002: Hybrid Storage Architecture (PostgreSQL + In-Memory Cache)
**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Core Team
### Context
Authorization checks are in the critical path of every request. Latency requirements:
- p99 < 100μs for cached checks
- p99 < 5ms for uncached checks
- Must handle 100M+ tuples for enterprise customers
### Decision
Implement multi-tier hybrid storage:
- **L0: Leopard Reachability Index** - O(1) materialized transitive closures
- **L1: In-Memory LRU Cache** - Process-local, 10K-100K entries
- **L2: Redis Distributed Cache** - Shared across API servers
- **L3: PostgreSQL** - Source of truth, durability
### Rationale
1. **Performance**: 99.9% of checks hit L1/L2 cache (<100μs)
2. **Durability**: PostgreSQL ensures no permission data loss
3. **Scalability**: Redis enables horizontal scaling of API servers
4. **Cost-Effective**: Cheaper than pure in-memory database
### Consequences
**Positive:**
- <100μs p99 latency for cached checks ✅
- Can scale to billions of tuples
- No single point of failure (cache miss → PostgreSQL)
**Negative:**
- Cache invalidation complexity (write amplification)
- Eventual consistency between cache layers
- Higher operational complexity
**Mitigation:**
- Implement cache warming on startup
- Use Bloom filters to reduce unnecessary cache checks
- Provide cache monitoring and metrics
---
## ADR-003: Multi-Tenancy via Logical Partitioning
**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Core Team
### Context
SaaS authorization service must support thousands of tenants with:
- Strong isolation (no cross-tenant data leakage)
- Per-tenant quotas (prevent noisy neighbor problem)
- Efficient resource utilization (shared infrastructure)
### Decision
Use logical partitioning with tenant_id column:
```sql
CREATE TABLE relation_tuples (
tenant_id VARCHAR(255) NOT NULL,
namespace VARCHAR(255) NOT NULL,
object_id VARCHAR(255) NOT NULL,
relation VARCHAR(255) NOT NULL,
subject_type VARCHAR(50) NOT NULL,
subject_id VARCHAR(255) NOT NULL,
PRIMARY KEY (tenant_id, namespace, object_id, relation, subject_type, subject_id)
);
CREATE INDEX idx_tenant_tuples ON relation_tuples(tenant_id);
```
### Rationale
**Considered Alternatives:**
1. **Physical DB per tenant** - Expensive, doesn't scale to 10K+ tenants
2. **Schema per tenant** - PostgreSQL has limits, migration complexity
3. **Logical partitioning** - ✅ Chosen for balance of isolation & efficiency
**Why Logical Partitioning:**
- Scales to millions of tenants
- Shared query planning and optimization
- Easy to add tenant-level quotas
- Can upgrade to physical partitions if needed
### Consequences
**Positive:**
- Cost-effective multi-tenancy
- Query optimizer benefits all tenants
- Simple operational model
**Negative:**
- Risk of query bugs causing cross-tenant leaks (must validate tenant_id in all queries)
- Noisy neighbor (mitigated by quotas)
**Mitigation:**
- Tenant-aware API that auto-injects tenant_id
- Comprehensive integration tests for tenant isolation
- Per-tenant rate limiting and quota enforcement
---
## ADR-004: Leopard Reachability Index for Transitive Checks
**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Performance Team
### Context
Transitive permission checks (e.g., "is alice a member of org:acme through team:eng?") require graph traversal:
- Naive BFS/DFS: O(V + E) per check, too slow
- Pre-computation: Trade-off between freshness and performance
### Decision
Implement Leopard-style reachability index:
- Materialize transitive closures for common paths
- Dual indexing: by_subject and by_object for fast lookups
- Incremental updates on tuple writes
**Algorithm:**
```rust
// Pre-compute: "Who can access X?"
// Store: object_id → Set<subject_id>
leopard_index.insert(object_id, reachable_subjects);
// Check: O(1) lookup instead of graph traversal
fn check(object_id, subject_id) -> bool {
leopard_index.get(object_id).contains(subject_id)
}
```
### Rationale
1. **Performance**: O(1) vs O(V+E) - 100x-1000x faster
2. **Predictable Latency**: No variance from graph complexity
3. **Battery Included**: Zanzibar paper describes similar optimization
**Benchmarks:**
- Without Leopard: 5-50ms for deep hierarchies
- With Leopard: <1ms for all depths ✅
### Consequences
**Positive:**
- Sub-millisecond transitive checks
- Scales to deep organizational hierarchies (>10 levels)
**Negative:**
- Higher memory usage (trade-off: speed vs memory)
- Index update latency on writes (100-500ms)
**Mitigation:**
- Configurable index depth limits
- Async background index updates
- Statistics tracking for monitoring
---
## ADR-005: Conditional Permissions with Request Context
**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Security Team
### Context
Modern authorization requires context-aware policies:
- Time-based: "Access only during business hours"
- Location-based: "Access only from corporate network"
- Attribute-based: "Access only if MFA verified"
### Decision
Extend tuples with optional conditions:
```rust
pub struct RelationTuple {
pub namespace: String,
pub object_id: String,
pub relation: String,
pub subject: Subject,
pub condition: Option<RelationshipCondition>, // NEW
}
pub enum RelationshipCondition {
TimeWindow { not_before, not_after },
IpAddress { allowed_ips },
Attribute { key, value },
All { conditions }, // AND
Any { conditions }, // OR
}
```
**Check with Context:**
```rust
let allowed = engine.check(CheckRequest {
namespace: "document",
object_id: "sensitive",
relation: "view",
subject: Subject::User("alice"),
context: Some(RequestContext::new()
.with_client_ip(client_ip)
.with_attribute("mfa_verified", "true")
),
}).await?;
```
### Rationale
1. **Zero-Trust Security**: Verify identity, device, location, time
2. **Compliance**: SOC 2, HIPAA, GDPR require contextual access control
3. **Flexibility**: Combine conditions with AND/OR logic
### Consequences
**Positive:**
- Enables advanced security policies
- Single authorization check (no need for separate policy engine)
- Composable conditions (reusable building blocks)
**Negative:**
- Slightly higher complexity
- Condition evaluation overhead (~10μs per check)
**Mitigation:**
- Conditions are optional (backward compatible)
- Cache condition evaluation results
- Provide condition templates for common use cases
---
## ADR-006: Audit Logging with Sampling
**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Compliance Team
### Context
Compliance requirements (SOC 2, HIPAA, GDPR) mandate audit trails:
- "Who accessed what data and when?"
- Immutable, tamper-proof logs
- Long retention (7+ years for HIPAA)
**Challenge:** Logging 1M checks/sec → 86B events/day → storage explosion
### Decision
Implement configurable audit sampling:
```rust
pub struct AuditConfig {
pub sample_rate: f64, // 0.0-1.0 (e.g., 0.1 = 10%)
pub log_denied: bool, // Always log denials
pub log_mutations: bool, // Always log writes/deletes
pub log_sensitive_resources: HashSet<String>, // Always log these
}
```
**Strategy:**
- Sample routine checks (e.g., 10%)
- **Always log:**
- ❌ Denied access attempts (security)
- ✍️ Permission mutations (audit trail)
- 🔒 Sensitive resource access (compliance)
### Rationale
1. **Cost-Effective**: 10% sampling → 90% storage reduction
2. **Security-First**: All security events captured
3. **Compliance**: Meets regulatory requirements
**Research:**
- Google's production systems use sampling
- Industry standard: 1-10% for non-sensitive, 100% for sensitive
### Consequences
**Positive:**
- Affordable audit logging at scale
- Complete security event capture
- Tamper-proof with integrity hashing
**Negative:**
- Sampled data less useful for debugging (mitigated: increase rate in dev)
**Mitigation:**
- Default 10% sampling (configurable)
- Provide query helpers for compliance reports
- Export audit logs to SIEM systems
---
## ADR-007: AI-Powered Anomaly Detection
**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Security Team
### Context
Security threats require real-time detection:
- Privilege escalation attempts (user probing for admin access)
- Unusual access patterns (3 AM access from unusual IP)
- Account compromise (burst of failed checks)
Traditional signature-based detection misses novel attacks.
### Decision
Implement ML-powered anomaly detection:
```rust
pub struct AnomalyDetector {
// Learn baseline behavior per user
subject_stats: HashMap<String, SubjectStats>,
// Detect deviations using statistical methods
config: AnomalyConfig {
zscore_threshold: 2.5, // Statistical outlier
max_access_rate: 100/min, // Rate limit
enable_temporal: true, // Time-of-day analysis
enable_privesc: true, // Denial rate tracking
}
}
```
**Detection Methods:**
1. **Frequency Anomaly**: Z-score analysis (unusual access frequency)
2. **Temporal Anomaly**: Time-of-day analysis (3 AM access from 9-5 user)
3. **Resource Anomaly**: First-time access to sensitive resource
4. **Privilege Escalation**: High denial rate (>30%)
5. **Rate Limiting**: Burst detection
### Rationale
1. **Proactive Security**: Detect attacks before damage
2. **Reduces Alert Fatigue**: ML reduces false positives vs rule-based
3. **Adaptive**: Learns normal behavior per user
**Research:**
- UEBA (User and Entity Behavior Analytics) industry standard
- Similar to AWS GuardDuty, Azure Sentinel
### Consequences
**Positive:**
- Early detection of compromised accounts
- Automated response (e.g., temporary account freeze)
- Security team gets actionable alerts
**Negative:**
- False positives during baseline building period
- Requires tuning per deployment
**Mitigation:**
- Configurable baseline period (default: 100 events)
- Severity scoring (only alert on high severity)
- Integration with existing SIEM systems
---
## ADR-008: Permission Recommendations Engine
**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Product Team
### Context
Permission sprawl is common in growing organizations:
- Unused permissions (granted but never used)
- Over-permissive access (admin when viewer would suffice)
- Redundant tuples (can be simplified via hierarchy)
Manual review of 100K+ tuples is infeasible.
### Decision
Implement automated recommendation engine:
```rust
pub struct RecommendationEngine {
// Track tuple usage over time
tuple_usage: HashMap<String, TupleUsage>,
// Analyze patterns
access_patterns: HashMap<(String, String), usize>,
}
pub enum RecommendationType {
UnusedPermission, // <10% usage → revoke?
HierarchicalRedundancy, // Parent already grants access
Consolidation, // 5+ similar tuples → create role
RoleSuggestion, // 3+ users with identical permissions
Conflict, // Multiple access levels (review)
}
```
**Priority System:**
- 🔴 **Critical**: Security risk (over-permissive)
- 🟠 **High**: Unused permissions (attack surface)
- 🟡 **Medium**: Optimization opportunity (consolidation)
- 🟢 **Low**: Nice-to-have (aesthetic improvements)
### Rationale
1. **Least Privilege**: Automated identification of over-permissions
2. **Cost Savings**: Reduce tuple count → lower storage costs
3. **Security Posture**: Smaller attack surface
**Research:**
- AWS Access Analyzer (similar concept)
- Microsoft Entra Permission Management
### Consequences
**Positive:**
- Continuous permission optimization
- Actionable insights for security teams
- Reduced operational burden
**Negative:**
- Recommendations require human judgment (can't auto-apply)
- Usage tracking adds minimal overhead
**Mitigation:**
- Clear prioritization (focus on high/critical)
- Detailed impact estimates ("removes X tuples")
- Integration with workflow tools (Jira, ServiceNow)
---
## ADR-009: gRPC API for High-Performance RPC
**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** API Team
### Context
Authorization checks are latency-sensitive:
- REST/JSON overhead: 50-200μs serialization
- HTTP/1.1: Head-of-line blocking
- Need streaming for watch API (real-time permission updates)
### Decision
Provide gRPC API alongside REST:
```protobuf
service AuthorizationService {
rpc Check(CheckRequest) returns (CheckResponse);
rpc BatchCheck(BatchCheckRequest) returns (BatchCheckResponse);
rpc Write(WriteRequest) returns (WriteResponse);
rpc Watch(WatchRequest) returns (stream WatchResponse); // Streaming!
}
```
**Protocol Buffers:**
- Binary encoding (5-10x smaller than JSON)
- Code generation for type safety
- Backward/forward compatibility
**HTTP/2:**
- Multiplexing (no head-of-line blocking)
- Stream compression
- Server push for watch API
### Rationale
1. **Performance**: 30-50% lower latency than REST
2. **Streaming**: Watch API for real-time updates
3. **Type Safety**: Protobuf schemas prevent runtime errors
4. **Industry Standard**: Kubernetes, etcd, Istio all use gRPC
### Consequences
**Positive:**
- Lower latency (especially for batch operations)
- Streaming watch API enables reactive UIs
- Cross-language clients (code generation)
**Negative:**
- Less human-readable than JSON (debugging)
- Requires HTTP/2 (older proxies may not support)
**Mitigation:**
- Provide both gRPC and REST APIs
- gRPC-Web for browser compatibility
- grpcurl for debugging
---
## ADR-010: Edge Computing Support with CRDTs
**Date:** 2026-01-19
**Status:** ✅ Accepted
**Deciders:** Infrastructure Team
### Context
Global applications need low-latency authorization:
- CDN edge locations (Cloudflare Workers, AWS Lambda@Edge)
- Central database adds 50-200ms latency
- Need eventual consistency for distributed writes
### Decision
Implement lightweight edge engine with CRDT-based synchronization:
```rust
pub struct EdgeEngine {
// Embedded authorization engine (no DB dependency)
local_tuples: HashMap<String, CRDTuple>,
// Background sync with central database
sync_interval: Duration,
// Conflict resolution via CRDT
crdt_resolver: LWWResolver, // Last-Write-Wins
}
```
**Architecture:**
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Edge US │◄────►│ Central DB │◄────►│ Edge EU │
│ (read/write)│ │ (PostgreSQL)│ │ (read/write)│
└─────────────┘ └─────────────┘ └─────────────┘
▲ ▲
└───────────CRDT Merge (eventual)──────────┘
```
**CRDT Strategy:**
- Last-Write-Wins (LWW) for simplicity
- Tombstones for deletions
- Garbage collection after sync
### Rationale
1. **Latency**: <10ms authorization at edge (vs 100ms to central DB)
2. **Availability**: Edge works even if central DB is down
3. **Cost**: Reduce cross-region DB traffic
**Use Cases:**
- CDN edge authorization (Cloudflare, Fastly)
- Mobile offline-first apps
- IoT devices with intermittent connectivity
### Consequences
**Positive:**
- Sub-10ms latency worldwide
- High availability (eventual consistency)
- Works offline
**Negative:**
- Eventual consistency (stale reads possible)
- CRDT merge conflicts (rare with LWW)
**Mitigation:**
- Configurable sync interval (default: 30s)
- Critical permissions bypass cache (read-through)
- Monitoring for replication lag
---
## Summary of Key Decisions
| 001 | Zanzibar ReBAC Model | ✅ | Foundation |
| 002 | Hybrid Storage (PG + Cache) | ✅ | <100μs latency |
| 003 | Logical Multi-Tenancy | ✅ | Scalable SaaS |
| 004 | Leopard Reachability Index | ✅ | O(1) transitive |
| 005 | Conditional Permissions | ✅ | Zero-trust |
| 006 | Audit Logging (Sampling) | ✅ | Compliance |
| 007 | AI Anomaly Detection | ✅ | Proactive security |
| 008 | Permission Recommendations | ✅ | Least privilege |
| 009 | gRPC API | ✅ | Low latency RPC |
| 010 | Edge Computing (CRDT) | ✅ | Global <10ms |
---
**Document Version:** 1.0
**Last Updated:** 2026-01-19
**Maintained By:** OxiFY Authorization Team
## References
- [Google Zanzibar Paper](https://research.google/pubs/pub48190/)
- [SpiceDB Architecture](https://github.com/authzed/spicedb)
- [AWS IAM Best Practices](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html)
- [NIST Zero Trust Architecture](https://www.nist.gov/publications/zero-trust-architecture)