torsh-core 0.1.1

Core types and traits for ToRSh deep learning framework
# API Design Rationale - torsh-core

This document explains the key design decisions, trade-offs, and rationale behind the torsh-core API design.

## Table of Contents

- [Core Design Principles](#core-design-principles)
- [Type System Design](#type-system-design)
- [Shape System Design](#shape-system-design)
- [Error Handling Strategy](#error-handling-strategy)
- [Device Abstraction](#device-abstraction)
- [Memory Management](#memory-management)
- [Performance vs Safety Trade-offs](#performance-vs-safety-trade-offs)
- [API Stability Considerations](#api-stability-considerations)

## Core Design Principles

### 1. Zero-Cost Abstractions

**Rationale**: Deep learning frameworks are performance-critical. Users should not pay runtime costs for abstractions they don't use.

**Implementation**:
- Phantom types for compile-time device tracking with zero runtime overhead
- `#[inline]` annotations on small, hot-path functions
- Const generics for compile-time shape validation
- Static dispatch where possible

**Example**:
```rust
// Zero-cost device tracking at compile time
struct Tensor<D: PhantomDevice, T: TensorElement> {
    data: Storage,
    _phantom: PhantomData<(D, T)>,  // Zero size at runtime
}
```

**Trade-offs**:
- **Pro**: Maximum performance, no runtime overhead
- **Con**: More complex type signatures, longer compile times
- **Decision**: Worth it for production ML workloads where runtime performance is critical
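
The zero-size claim is easy to verify directly. A minimal sketch, with hypothetical `Cpu`/`Cuda` marker types standing in for the crate's actual phantom devices:

```rust
use std::marker::PhantomData;
use std::mem::size_of;

// Hypothetical device markers standing in for the crate's phantom types
struct Cpu;
struct Cuda;

// A tensor-like wrapper tagged with a device marker
struct Tagged<D> {
    value: f32,
    _device: PhantomData<D>, // contributes zero bytes to the layout
}

fn main() {
    // The phantom marker adds no bytes: Tagged<D> is exactly the size of f32
    assert_eq!(size_of::<Tagged<Cpu>>(), size_of::<f32>());
    assert_eq!(size_of::<Tagged<Cuda>>(), size_of::<f32>());
    println!("ok");
}
```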

### 2. Type Safety Over Convenience

**Rationale**: Catch errors at compile time rather than runtime. Silent bugs in ML systems can lead to incorrect model training.

**Implementation**:
- Strong typing for devices, dtypes, and shapes
- No implicit conversions between incompatible types
- Explicit error handling with Result types

**Example**:
```rust
// This won't compile - device mismatch caught at compile time
let cpu_tensor: Tensor<CpuDevice, F32> = ...;
let gpu_tensor: Tensor<CudaDevice, F32> = ...;
// let result = cpu_tensor + gpu_tensor; // ❌ Compile error!
```

**Trade-offs**:
- **Pro**: Prevents entire classes of runtime errors
- **Con**: More verbose code, steeper learning curve
- **Decision**: Safety is more important than convenience in production systems

### 3. SciRS2 Integration First

**Rationale**: Leverage the existing Rust scientific computing ecosystem rather than reinventing the wheel.

**Implementation**:
- All external dependencies go through scirs2-core
- Unified access patterns (ndarray, random, numeric)
- Zero-copy conversions where possible

**Example**:
```rust
// ✅ CORRECT: Use scirs2-core abstractions
use scirs2_core::ndarray::{Array, array};
use scirs2_core::random::{thread_rng, Normal};

// ❌ WRONG: Direct external dependencies
// use ndarray::{Array, array};  // POLICY VIOLATION
```

**Trade-offs**:
- **Pro**: Consistent APIs, centralized maintenance, better integration
- **Con**: Extra abstraction layer, dependency on scirs2 ecosystem
- **Decision**: Long-term maintainability outweighs short-term convenience

## Type System Design

### DType Enum vs Trait-Based Design

**Decision**: Use an enum for DType with trait implementations for specific types.

**Rationale**:
1. **Pattern Matching**: Enum allows exhaustive pattern matching
2. **Runtime Type Information**: Need to know dtype at runtime for operations
3. **Serialization**: Enum is easier to serialize/deserialize
4. **Type Promotion**: Centralized promotion rules in one place

**Alternative Considered**: Trait-based system with generic parameters
```rust
// Alternative (NOT chosen):
trait DType {
    fn size(&self) -> usize;
    fn is_float(&self) -> bool;
}
struct F32Type;
impl DType for F32Type { ... }
```

**Why Rejected**:
- Would lose runtime type information
- Pattern matching becomes impossible
- Type promotion rules would be scattered

**Example**:
```rust
// ✅ CHOSEN: Enum with traits
pub enum DType {
    F32, F64, I32, I64,
    C64, C128,  // Complex types
    QInt8, QUInt8,  // Quantized types
}

// Trait for actual element types
pub trait TensorElement: Copy + Send + Sync {
    const DTYPE: DType;
    fn to_dtype() -> DType { Self::DTYPE }
}
```

### Type Promotion System

**Decision**: Automatic type promotion with explicit rules.

**Rationale**:
1. **User Convenience**: Mixed-precision operations "just work"
2. **NumPy Compatibility**: Matches expectations from Python users
3. **Safety**: Explicit promotion rules prevent precision loss surprises

**Implementation**:
```rust
impl DType {
    pub fn promote_with(&self, other: DType) -> DType {
        use DType::*;
        // Explicit promotion matrix; complex arms come first so that
        // mixing a real and a complex type always yields a complex result
        match (self, other) {
            (C128, _) | (_, C128) => C128,   // Widest complex takes precedence
            (C64, F64) | (F64, C64) => C128, // Complex + f64 widens both
            (C64, _) | (_, C64) => C64,
            (F64, _) | (_, F64) => F64,
            (F32, _) | (_, F32) => F32,
            // ... explicit rules for the remaining type combinations
        }
    }
}
```

**Trade-offs**:
- **Pro**: Intuitive for users, prevents common errors
- **Con**: Potential for unexpected precision changes
- **Mitigation**: Comprehensive documentation and warning system
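
A self-contained sketch of such a promotion table, covering only four dtypes (the real matrix would enumerate every pair):

```rust
// Minimal dtype promotion sketch, not the crate's full matrix
#[derive(Clone, Copy, PartialEq, Debug)]
enum DType { F32, F64, C64, C128 }

fn promote(a: DType, b: DType) -> DType {
    use DType::*;
    match (a, b) {
        (C128, _) | (_, C128) => C128,   // widest complex wins
        (C64, F64) | (F64, C64) => C128, // complex + f64 needs width AND complexity
        (C64, _) | (_, C64) => C64,
        (F64, _) | (_, F64) => F64,
        _ => F32,
    }
}

fn main() {
    use DType::*;
    assert_eq!(promote(F32, F64), F64);
    assert_eq!(promote(F64, C64), C128); // NumPy-style: both width and complexity preserved
    assert_eq!(promote(F32, C64), C64);
    // Promotion is commutative
    assert_eq!(promote(C128, F32), promote(F32, C128));
    println!("ok");
}
```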

## Shape System Design

### Immutable Shapes with Caching

**Decision**: Shapes are immutable value types with cached stride computation.

**Rationale**:
1. **Thread Safety**: Immutable shapes are automatically thread-safe
2. **Functional Style**: Encourages immutable data transformations
3. **Caching**: Computed strides can be safely cached and shared
4. **Hash Keys**: Immutable shapes work well as HashMap keys

**Implementation**:
```rust
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct Shape {
    dims: Arc<[usize]>,  // Immutable, shared
    // Cached strides accessed via STRIDE_CACHE
}

// Thread-local cache for hot paths
thread_local! {
    static STRIDE_CACHE: RefCell<HashMap<Shape, Vec<usize>>> = ...;
}

// Global LRU cache for cross-thread sharing
static GLOBAL_STRIDE_CACHE: Lazy<Mutex<LruCache<...>>> = ...;
```

**Alternative Considered**: Mutable shapes with internal mutability
```rust
// Alternative (NOT chosen):
pub struct Shape {
    dims: Vec<usize>,
    cached_strides: Cell<Option<Vec<usize>>>,
}
```

**Why Rejected**:
- Not thread-safe without synchronization
- Cannot be used as HashMap keys
- Harder to reason about ownership and borrowing
- Memory overhead for each Shape instance

**Trade-offs**:
- **Pro**: Thread-safe, functional, efficient caching
- **Con**: Creating new shapes on modification (mitigated by Arc sharing)
- **Decision**: Immutability aligns with Rust's ownership model
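
A minimal illustration of why immutable, hashable shapes matter — a stripped-down `Shape` stand-in, not the crate's actual type:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stripped-down immutable shape: cloning only bumps the Arc refcount
#[derive(Clone, PartialEq, Eq, Hash)]
struct Shape {
    dims: Arc<[usize]>,
}

impl Shape {
    fn new(dims: &[usize]) -> Self {
        Shape { dims: dims.into() }
    }
}

fn main() {
    let s = Shape::new(&[2, 3, 4]);
    let s2 = s.clone(); // cheap: shared Arc, no copy of dims

    // Because Shape is Eq + Hash, it works directly as a cache key
    let mut stride_cache: HashMap<Shape, Vec<usize>> = HashMap::new();
    stride_cache.insert(s, vec![12, 4, 1]);
    assert_eq!(stride_cache.get(&s2), Some(&vec![12, 4, 1]));
    println!("ok");
}
```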

### Stride Computation Strategy

**Decision**: Two-tier caching (thread-local + global LRU).

**Rationale**:
1. **Hot Path Optimization**: Thread-local cache has no synchronization overhead
2. **Cross-Thread Sharing**: Global cache prevents redundant computation
3. **Memory Efficiency**: LRU eviction prevents unbounded growth

**Performance Characteristics**:
- Thread-local hit: ~1-2 ns (raw HashMap lookup)
- Global cache hit: ~50-100 ns (mutex + LRU)
- Cache miss: ~500-1000 ns (computation + insertion)

**Trade-offs**:
- **Pro**: Excellent performance for repeated shapes
- **Con**: Memory overhead for cache storage
- **Decision**: Performance gain justifies memory cost in ML workloads
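
The cached quantity itself is the standard row-major stride vector; a sketch of the underlying computation:

```rust
// Row-major (C-contiguous) strides: each stride is the product of all
// dimensions to its right, so the last stride is always 1
fn contiguous_strides(dims: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; dims.len()];
    for i in (0..dims.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * dims[i + 1];
    }
    strides
}

fn main() {
    assert_eq!(contiguous_strides(&[2, 3, 4]), vec![12, 4, 1]);
    assert_eq!(contiguous_strides(&[5]), vec![1]);
    assert!(contiguous_strides(&[]).is_empty());
    println!("ok");
}
```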

## Error Handling Strategy

### Modular Error Types with Unified Enum

**Decision**: Specialized error modules unified through TorshError enum.

**Rationale**:
1. **Organization**: Errors grouped by domain (shape, index, general)
2. **Extensibility**: Easy to add new error categories
3. **Backward Compatibility**: Unified enum provides stable API
4. **Context-Rich**: Each error type can have specialized fields

**Implementation**:
```rust
pub enum TorshError {
    Shape(ShapeError),
    Index(IndexError),
    General(GeneralError),
    // Legacy compatibility variants
    ShapeMismatch { expected: Vec<usize>, got: Vec<usize> },
    // ...
}
```

**Alternative Considered**: Single flat error enum
```rust
// Alternative (NOT chosen):
pub enum TorshError {
    ShapeMismatch,
    IndexOutOfBounds,
    DeviceError,
    // ... all errors at same level
}
```

**Why Rejected**:
- Hard to organize as error types grow
- No logical grouping of related errors
- Difficult to add error-specific methods

### Source Location Tracking

**Decision**: Automatic location tracking using `std::panic::Location`.

**Rationale**:
1. **Debugging**: Know exactly where errors originated
2. **Zero Cost**: Only captured when errors occur
3. **Automatic**: No manual annotation required

**Implementation**:
```rust
#[track_caller]
pub fn new_error(msg: &str) -> TorshError {
    let location = std::panic::Location::caller();
    TorshError::WithLocation {
        message: msg.to_string(),
        file: location.file(),
        line: location.line(),
    }
}
```

**Trade-offs**:
- **Pro**: Excellent debugging experience
- **Con**: Slight overhead on error paths (acceptable since errors are rare)
- **Decision**: Developer experience worth the cost

### Standard Error Codes for FFI

**Decision**: Provide POSIX-compatible error codes alongside Rust errors.

**Rationale**:
1. **C/C++ Interop**: FFI boundaries need integer error codes
2. **Tooling**: Standard codes work with existing error handling tools
3. **Portability**: errno-compatible codes are universally understood

**Implementation**:
```rust
pub enum StandardErrorCode {
    InvalidArgument = 22,  // EINVAL
    OutOfMemory = 12,      // ENOMEM
    // Custom codes for framework-specific errors
    ShapeMismatch = 1001,
    DTypeMismatch = 1011,
}
```
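
At an FFI boundary, each Rust error would be flattened to one of these integer codes. A hedged sketch — the variant names mirror the document's earlier error enum, not a published API:

```rust
// Illustrative error-to-code mapping for an FFI boundary
#[derive(Debug, PartialEq)]
enum TorshError {
    InvalidArgument(String),
    OutOfMemory,
    ShapeMismatch { expected: Vec<usize>, got: Vec<usize> },
}

fn to_error_code(err: &TorshError) -> i32 {
    match err {
        TorshError::InvalidArgument(_) => 22,     // EINVAL
        TorshError::OutOfMemory => 12,            // ENOMEM
        TorshError::ShapeMismatch { .. } => 1001, // framework-specific range
    }
}

fn main() {
    let err = TorshError::ShapeMismatch { expected: vec![2, 3], got: vec![3, 2] };
    assert_eq!(to_error_code(&err), 1001);
    assert_eq!(to_error_code(&TorshError::OutOfMemory), 12);
    println!("ok");
}
```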

## Device Abstraction

### Trait-Based Device System

**Decision**: Device trait with phantom type markers.

**Rationale**:
1. **Extensibility**: Easy to add new device backends
2. **Type Safety**: Phantom types catch device mismatches at compile time
3. **Dynamic Dispatch**: Trait objects allow runtime device selection
4. **Zero Cost**: Phantom types have no runtime overhead

**Implementation**:
```rust
pub trait Device: Send + Sync {
    fn device_type(&self) -> DeviceType;
    fn is_available(&self) -> bool;
    fn synchronize(&self) -> Result<()>;
}

// Phantom type markers for compile-time tracking
pub trait PhantomDevice: 'static {
    fn device_type_static() -> DeviceType;
}

pub struct PhantomCpu;
impl PhantomDevice for PhantomCpu {
    fn device_type_static() -> DeviceType { DeviceType::Cpu }
}
```

**Trade-offs**:
- **Pro**: Flexible, type-safe, zero-cost
- **Con**: Complex type system with phantom types
- **Decision**: Type safety worth the complexity

### Device Capability System

**Decision**: Rich capability queries with performance scoring.

**Rationale**:
1. **Automatic Selection**: Choose best device for workload
2. **Graceful Degradation**: Fall back when features unavailable
3. **Future-Proof**: Easy to add new capabilities

**Implementation**:
```rust
pub struct DeviceCapabilities {
    pub compute_capability: ComputeCapability,
    pub memory_gb: f32,
    pub supports_half_precision: bool,
    pub supports_double_precision: bool,
    pub simd_features: SimdFeatures,
    pub performance_score: f32,
}

impl DeviceCapabilities {
    pub fn score_for_workload(&self, workload: &WorkloadProfile) -> f32 {
        // Heuristic scoring based on workload requirements
        match workload.workload_type {
            WorkloadType::Training => self.training_score(),
            WorkloadType::Inference => self.inference_score(),
            WorkloadType::DataProcessing => self.data_processing_score(),
        }
    }
}
```

**Trade-offs**:
- **Pro**: Intelligent device selection, better resource utilization
- **Con**: Heuristics may not always be optimal
- **Mitigation**: Allow manual device override
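
One plausible shape for such a heuristic, with weights invented purely for illustration:

```rust
// Hypothetical capability scoring; the weights are illustrative only
struct DeviceCapabilities {
    memory_gb: f32,
    supports_half_precision: bool,
    performance_score: f32,
}

enum WorkloadType { Training, Inference }

fn score(caps: &DeviceCapabilities, workload: &WorkloadType) -> f32 {
    match workload {
        // Training is memory-hungry: weight capacity heavily
        WorkloadType::Training => caps.performance_score + caps.memory_gb * 0.5,
        // Inference benefits most from half-precision throughput
        WorkloadType::Inference => {
            caps.performance_score + if caps.supports_half_precision { 10.0 } else { 0.0 }
        }
    }
}

fn main() {
    let gpu = DeviceCapabilities { memory_gb: 24.0, supports_half_precision: true, performance_score: 80.0 };
    let cpu = DeviceCapabilities { memory_gb: 64.0, supports_half_precision: false, performance_score: 20.0 };
    // The GPU wins for inference despite having less memory
    assert!(score(&gpu, &WorkloadType::Inference) > score(&cpu, &WorkloadType::Inference));
    println!("ok");
}
```

Keeping the heuristic in one function per workload type is what makes the manual-override mitigation cheap: callers can skip `score` entirely and name a device.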

## Memory Management

### Storage Abstraction with Registry Pattern

**Decision**: Pluggable storage backends with automatic selection.

**Rationale**:
1. **Flexibility**: Different workloads need different memory strategies
2. **Extensibility**: Users can provide custom allocators
3. **Automatic Selection**: System chooses best allocator for use case

**Implementation**:
```rust
pub trait Storage: Send + Sync {
    fn allocate(&self, size: usize, alignment: usize) -> Result<*mut u8>;
    fn deallocate(&self, ptr: *mut u8, size: usize, alignment: usize);
}

// Registry pattern for allocator management
pub struct AllocatorRegistry {
    allocators: HashMap<String, Box<dyn Storage>>,
    metadata: HashMap<String, AllocatorMetadata>,
}

impl AllocatorRegistry {
    pub fn find_best_for_backend(&self, backend: BackendType) -> Option<&dyn Storage> {
        // Automatic selection based on backend requirements
    }
}
```

**Alternative Considered**: Single global allocator
```rust
// Alternative (NOT chosen):
#[global_allocator]
static GLOBAL_ALLOCATOR: std::alloc::System = std::alloc::System;
```

**Why Rejected**:
- No flexibility for specialized allocators
- Cannot optimize for specific use cases
- Difficult to support NUMA, pinned memory, etc.

### Memory Pooling Strategy

**Decision**: Size-class based pooling for small allocations.

**Rationale**:
1. **Performance**: Reduces allocation overhead by 10-100x
2. **Fragmentation**: Size classes reduce external fragmentation
3. **Thread-Local**: Minimize synchronization overhead

**Implementation**:
```rust
thread_local! {
    static MEMORY_POOL: RefCell<SizeClassPool> = RefCell::new(
        SizeClassPool::new(&[64, 256, 1024, 4096])
    );
}

pub struct SizeClassPool {
    pools: Vec<Vec<*mut u8>>,  // One pool per size class
    size_classes: Vec<usize>,
}
```

**Performance Impact**:
- Small allocations (< 4KB): 10-50x faster than system malloc
- Large allocations: Fallback to system allocator
- Memory overhead: ~10% for pool bookkeeping

**Trade-offs**:
- **Pro**: Significant performance improvement for small tensors
- **Con**: Memory overhead, complexity
- **Decision**: Performance gain justifies overhead in ML workloads
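
The core of such a pool is the size-class lookup: round each request up to the smallest class that fits, and fall through to the system allocator otherwise. A sketch (assuming classes sorted ascending, as in the pool above):

```rust
// Round a request up to the smallest size class that fits;
// requests larger than every class fall through to the system allocator
fn size_class_for(request: usize, classes: &[usize]) -> Option<usize> {
    classes.iter().copied().find(|&c| c >= request)
}

fn main() {
    let classes = [64, 256, 1024, 4096];
    assert_eq!(size_class_for(100, &classes), Some(256));
    assert_eq!(size_class_for(64, &classes), Some(64));
    assert_eq!(size_class_for(5000, &classes), None); // system allocator fallback
    println!("ok");
}
```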

### NUMA Awareness

**Decision**: Optional NUMA-aware allocation with multiple policies.

**Rationale**:
1. **Large Systems**: Critical for multi-socket servers
2. **Flexibility**: Different policies for different workloads
3. **Opt-In**: No overhead for single-socket systems

**Policies**:
- **LocalPreferred**: Try local node, fall back to remote
- **LocalOnly**: Fail if local node unavailable
- **Interleave**: Round-robin across nodes
- **Bind**: Explicitly bind to specific node

**Trade-offs**:
- **Pro**: Better performance on NUMA systems (2-5x for memory-bound ops)
- **Con**: Additional complexity, platform-specific code
- **Decision**: Essential for high-performance computing workloads
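
The fallback behaviour of a subset of these policies can be sketched as a small decision function; the node IDs and the availability set here are hypothetical:

```rust
// Hypothetical NUMA policy resolution: given the local node and the set
// of nodes that currently have free memory, decide where to allocate
#[derive(Debug, PartialEq)]
enum NumaPolicy {
    LocalPreferred,
    LocalOnly,
    Interleave,
}

fn resolve_node(policy: &NumaPolicy, local: usize, available: &[usize], counter: usize) -> Option<usize> {
    match policy {
        // Try the local node first, otherwise any remote node
        NumaPolicy::LocalPreferred => {
            if available.contains(&local) { Some(local) } else { available.first().copied() }
        }
        // Fail outright if the local node has no memory
        NumaPolicy::LocalOnly => available.contains(&local).then_some(local),
        // Round-robin across whatever is available
        NumaPolicy::Interleave => {
            if available.is_empty() { None } else { Some(available[counter % available.len()]) }
        }
    }
}

fn main() {
    // Local node 0 exhausted: LocalPreferred falls back, LocalOnly fails
    assert_eq!(resolve_node(&NumaPolicy::LocalPreferred, 0, &[1, 2], 0), Some(1));
    assert_eq!(resolve_node(&NumaPolicy::LocalOnly, 0, &[1, 2], 0), None);
    assert_eq!(resolve_node(&NumaPolicy::Interleave, 0, &[1, 2], 3), Some(2));
    println!("ok");
}
```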

## Performance vs Safety Trade-offs

### Bounds Checking Strategy

**Decision**: Bounds checking in debug, unchecked in release (with opt-in).

**Rationale**:
1. **Development**: Catch errors during development
2. **Production**: Maximum performance in release builds
3. **Flexibility**: Users can enable checks via RuntimeConfig

**Implementation**:
```rust
#[inline]
pub fn get(&self, index: usize) -> f32 {
    #[cfg(debug_assertions)]
    assert!(index < self.len(), "Index out of bounds");

    unsafe { *self.data.add(index) }
}

// Optional runtime checking
pub fn get_checked(&self, index: usize) -> Result<f32> {
    if index >= self.len() {
        return Err(TorshError::IndexOutOfBounds { index, size: self.len() });
    }
    Ok(unsafe { *self.data.add(index) })
}
```

**Trade-offs**:
- **Pro**: Maximum performance in production, safety in development
- **Con**: Different behavior in debug/release
- **Mitigation**: Comprehensive test suite catches issues

### SIMD Optimization Trade-offs

**Decision**: Platform-specific SIMD with portable fallback.

**Rationale**:
1. **Performance**: 2-8x speedup for element-wise operations
2. **Portability**: Fallback ensures correctness on all platforms
3. **Maintainability**: Separate implementations are easier to optimize

**Implementation**:
```rust
#[cfg(target_feature = "avx2")]
pub fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    simd_avx2::add(a, b)
}

#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
pub fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    simd_neon::add(a, b)
}

#[cfg(not(any(target_feature = "avx2", target_feature = "neon")))]
pub fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b.iter()).map(|(x, y)| x + y).collect()
}
```

**Trade-offs**:
- **Pro**: Significant performance gains on modern CPUs
- **Con**: More code to maintain, platform-specific testing
- **Decision**: Performance critical for ML workloads

## API Stability Considerations

### Deprecation Strategy

**Decision**: Soft deprecation with migration guides.

**Rationale**:
1. **User Experience**: Gradual migration is less disruptive
2. **Compatibility**: Old code continues to work
3. **Guidance**: Clear migration paths reduce friction

**Implementation**:
```rust
#[deprecated(
    since = "0.2.0",
    note = "Use `Shape::new()` instead. See migration guide: ..."
)]
pub fn create_shape(dims: Vec<usize>) -> Shape {
    Shape::new(dims)
}
```

**Process**:
1. **Soft Deprecation** (1-2 releases): Mark as deprecated, provide migration guide
2. **Hard Deprecation** (2-3 releases): Remove from documentation
3. **Removal** (Major version): Remove from codebase

### Semantic Versioning

**Decision**: Strict semver with stability guarantees.

**Rationale**:
1. **Predictability**: Users know when breaking changes occur
2. **Trust**: Builds confidence in the framework
3. **Ecosystem**: Compatible with Cargo's dependency resolution

**Guarantees**:
- **Patch** (0.1.x): Bug fixes only, no API changes
- **Minor** (0.x.0): New features, deprecations, no breaking changes
- **Major** (x.0.0): Breaking changes allowed

## Future-Proofing

### Extension Points

**Design Decision**: Provide clear extension points for:
1. Custom data types via `TensorElement` trait
2. Custom devices via `Device` trait
3. Custom allocators via `Storage` trait
4. Custom error types via `From` implementations

**Rationale**: Cannot predict all future use cases, must allow extension.
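
For example, a user-defined element type would plug in through the `TensorElement` trait sketched earlier; the trait and enum are re-declared locally here for illustration, and the `BF16` type is hypothetical:

```rust
// Local re-declarations of the document's sketched trait and enum
#[derive(Clone, Copy, PartialEq, Debug)]
enum DType { F32, BF16 }

trait TensorElement: Copy + Send + Sync {
    const DTYPE: DType;
}

impl TensorElement for f32 {
    const DTYPE: DType = DType::F32;
}

// A hypothetical user-defined bfloat16 type stored as raw bits
#[derive(Clone, Copy)]
struct BF16(u16);

impl TensorElement for BF16 {
    const DTYPE: DType = DType::BF16;
}

// Generic code now works with both built-in and user-defined types
fn dtype_of<T: TensorElement>() -> DType { T::DTYPE }

fn main() {
    assert_eq!(dtype_of::<f32>(), DType::F32);
    assert_eq!(dtype_of::<BF16>(), DType::BF16);
    println!("ok");
}
```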

### Feature Flags

**Decision**: Granular feature flags for optional functionality.

**Implementation**:
```toml
[features]
default = ["std"]
std = []
parallel = ["rayon"]
simd = []
cuda = ["cuda-sys"]
metal = ["metal-rs"]
serialize = ["serde"]
```

**Rationale**:
1. **Binary Size**: Only include what's needed
2. **Compilation Time**: Faster builds with fewer features
3. **Dependencies**: Avoid unnecessary dependencies

## Conclusion

These design decisions prioritize:
1. **Safety**: Catch errors at compile time
2. **Performance**: Zero-cost abstractions, SIMD, caching
3. **Flexibility**: Extensible through traits and registries
4. **Maintainability**: Clear separation of concerns
5. **Integration**: Deep SciRS2 integration

Trade-offs are made consciously with production ML workloads in mind. The result is a framework that is both safe and fast, with clear paths for future enhancement.

---

*Last Updated: 2025-10-23*
*Version: 0.1.0*