torsh-core 0.1.1

Core types and traits for ToRSh deep learning framework
# torsh-core Architecture

This document describes the architecture of the `torsh-core` crate, the foundational layer of the ToRSh deep learning framework.

## Table of Contents

- [Overview](#overview)
- [Core Principles](#core-principles)
- [Module Organization](#module-organization)
- [Component Relationships](#component-relationships)
- [Key Design Patterns](#key-design-patterns)
- [Extension Points](#extension-points)
- [Performance Considerations](#performance-considerations)

## Overview

`torsh-core` provides the fundamental building blocks for the ToRSh framework:

- **Type System**: DType, Shape, and type promotion
- **Device Abstraction**: Platform-independent device representation
- **Error Handling**: Comprehensive error system with context
- **Memory Management**: Efficient memory allocation and pooling
- **Storage Backends**: Unified interface for different memory layouts
- **Debugging Tools**: Runtime introspection and profiling

### Design Philosophy

1. **Zero-cost abstractions**: Performance-critical paths have minimal overhead
2. **Type safety**: Compile-time and runtime validation
3. **Extensibility**: Easy to add new devices, dtypes, and backends
4. **SciRS2 Integration**: Deep integration with the scirs2 ecosystem
5. **Production-ready**: Comprehensive error handling and debugging tools

## Core Principles

### 1. Modular Design

Each major component is isolated in its own module with clear interfaces:

```
torsh-core/
├── dtype/          # Data type system
├── shape/          # Tensor shape management
├── device/         # Device abstraction
├── error/          # Error handling
├── storage/        # Memory management
└── ...
```

### 2. Layered Architecture

```
┌─────────────────────────────────────────┐
│     High-Level APIs & Utilities         │  Examples, profiling, debugging
├─────────────────────────────────────────┤
│        Core Abstractions                │  DType, Shape, Device
├─────────────────────────────────────────┤
│      Memory & Storage Layer             │  Allocators, pooling, NUMA
├─────────────────────────────────────────┤
│     Platform-Specific Backends          │  CPU, CUDA, Metal, WebGPU
└─────────────────────────────────────────┘
```

### 3. Separation of Concerns

- **Types** (DType, Shape) are pure data structures
- **Devices** provide computational capabilities
- **Storage** manages memory allocation
- **Errors** handle all failure modes
- **Utilities** add debugging and profiling

## Module Organization

### Core Types Module Graph

```
dtype.rs ──────┐
              ├──> TensorElement ──> Operations
shape.rs ─────┤
              └──> Validation ──────> Error Handling
device.rs ─────────────────────────> Backend Selection
```

### Data Type System (`dtype/`)

```rust
pub enum DType {
    // Integer types
    U8, I8, I16, I32, I64,
    // Float types
    F16, BF16, F32, F64,
    // Complex types
    C64, C128,
    // Quantized types
    QInt8, QUInt8,
}
```

**Key Features:**
- Type promotion system for mixed-precision operations
- IEEE 754 compliance checking
- Custom data type support through traits
- Automatic type conversion with safety checks

**Dependencies:**
- Uses `scirs2_core::numeric` for numerical traits
- Integrates with `scirs2_core::ndarray` for array operations

### Shape Management (`shape/`)

```
┌────────────────┐
│  Shape (Core)  │
└────────┬───────┘
    ┌────┴────┬──────────┬─────────────┐
    │         │          │             │
┌───▼───┐ ┌──▼──┐  ┌────▼─────┐  ┌───▼────┐
│Stride │ │Cache│  │Validation│  │ Utils  │
│       │ │     │  │          │  │        │
└───────┘ └─────┘  └──────────┘  └────────┘
```

**Components:**
- `shape.rs`: Core shape representation with dimension tracking
- `shape_utils.rs`: Common shape operations and patterns
- `shape_validation.rs`: Validation with visual error messages
- `shape_debug.rs`: ASCII visualization and debugging

**Design Decisions:**
- Shapes are immutable for thread safety
- Stride caching for performance (thread-local + global)
- Symbolic shape support for dynamic graphs
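The first two design decisions can be sketched together: an immutable shape computes its row-major strides once at construction and never mutates them afterward, so they can be shared freely across threads. The struct and method names here are illustrative, not the real torsh-core definitions.

```rust
// Hypothetical sketch: an immutable shape with strides cached at construction.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Shape {
    dims: Vec<usize>,
    strides: Vec<usize>, // computed once; Shape is never mutated afterward
}

impl Shape {
    fn new(dims: Vec<usize>) -> Self {
        // Row-major (C-contiguous) strides: the last dimension varies fastest.
        let mut strides = vec![1; dims.len()];
        for i in (0..dims.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * dims[i + 1];
        }
        Shape { dims, strides }
    }

    fn numel(&self) -> usize {
        self.dims.iter().product()
    }
}

fn main() {
    let s = Shape::new(vec![2, 3, 4]);
    assert_eq!(s.strides, vec![12, 4, 1]);
    assert_eq!(s.numel(), 24);
}
```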

### Device Abstraction (`device/`)

```
                    ┌─────────────┐
                    │   Device    │
                    │   (Trait)   │
                    └──────┬──────┘
         ┌─────────────────┼─────────────────┐
         │                 │                 │
    ┌────▼────┐      ┌─────▼────┐     ┌────▼─────┐
    │  CPU    │      │   CUDA   │     │  Metal   │
    │         │      │          │     │          │
    └─────────┘      └──────────┘     └──────────┘
```

**Submodules:**
- `device/core.rs`: Device trait and base implementations
- `device/capabilities.rs`: Feature detection and scoring
- `device/discovery.rs`: Automatic device selection
- `device/management.rs`: Device pools and health monitoring
- `device/phantom.rs`: Type-level device tracking

**Phantom Types for Compile-Time Safety:**

```rust
// Compile-time device type checking
let tensor: Tensor<CpuDevice, F32> = ...;
let gpu_tensor: Tensor<CudaDevice, F32> = ...;

// This won't compile:
// let result = tensor + gpu_tensor; // Error: device mismatch!

// Type-safe device groups
let devices: DeviceGroup<CudaDevice, 4> = ...;
```
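The snippet above is illustrative; a minimal compilable version of the same idea follows. The device markers and `add` function are assumptions for demonstration: because `add` is only defined for two tensors with the *same* type-level device `D`, mixing a CPU tensor with a CUDA tensor is rejected by the compiler, with zero runtime cost.

```rust
use std::marker::PhantomData;

// Hypothetical zero-sized device markers (names illustrative).
struct Cpu;
#[allow(dead_code)]
struct Cuda;

// Tensor carrying its device only at the type level.
struct Tensor<D> {
    data: Vec<f32>,
    _device: PhantomData<D>,
}

impl<D> Tensor<D> {
    fn new(data: Vec<f32>) -> Self {
        Tensor { data, _device: PhantomData }
    }
}

// Only defined for two tensors on the same device D.
fn add<D>(a: &Tensor<D>, b: &Tensor<D>) -> Tensor<D> {
    let data = a.data.iter().zip(&b.data).map(|(x, y)| x + y).collect();
    Tensor::new(data)
}

fn main() {
    let a: Tensor<Cpu> = Tensor::new(vec![1.0, 2.0]);
    let b: Tensor<Cpu> = Tensor::new(vec![3.0, 4.0]);
    let c = add(&a, &b);
    assert_eq!(c.data, vec![4.0, 6.0]);
    // let g: Tensor<Cuda> = Tensor::new(vec![0.0]);
    // add(&a, &g); // does not compile: Cpu != Cuda
}
```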

### Error Handling (`error/`)

```
                  ┌──────────────┐
                  │  TorshError  │
                  └───────┬──────┘
        ┌─────────────────┼─────────────────┐
        │                 │                 │
   ┌────▼─────┐     ┌─────▼────┐    ┌──────▼──────┐
   │  Shape   │     │  Index   │    │   General   │
   │  Error   │     │  Error   │    │    Error    │
   └──────────┘     └──────────┘    └─────────────┘
```

**Features:**
- Modular error types (shape, index, general)
- Rich error context with stack traces
- Standard error codes for FFI interoperability
- Error recovery mechanisms
- Source location tracking

**Error Code Mapping:**

```rust
// Sketch: mapping TorshError variants to integer codes
// (variants shown without payloads for illustration).
fn error_code(err: &TorshError) -> i32 {
    match err {
        // Standard POSIX-like codes for FFI interoperability
        TorshError::OutOfMemory     => 12, // ENOMEM
        TorshError::InvalidArgument => 22, // EINVAL
        TorshError::NotImplemented  => 38, // ENOSYS
        // Custom codes for framework-specific errors
        TorshError::ShapeMismatch   => 1001,
        TorshError::DTypeMismatch   => 1011,
        TorshError::DeviceError     => 1021,
    }
}
```

### Storage System (`storage/`)

```
┌──────────────────────────────────┐
│   Storage Trait (Abstract)       │
└────────────┬─────────────────────┘
    ┌────────┴────────┬────────────┬──────────┐
    │                 │            │          │
┌───▼────┐   ┌───────▼──┐   ┌─────▼───┐  ┌──▼─────┐
│Aligned │   │  NUMA    │   │ Mapped  │  │  Pool  │
│        │   │          │   │ Storage │  │        │
└────────┘   └──────────┘   └─────────┘  └────────┘
```

**Memory Management Strategies:**

1. **Aligned Storage**: SIMD-friendly memory alignment
2. **NUMA-Aware**: Optimize for multi-socket systems
3. **Memory-Mapped**: Lazy loading for large tensors
4. **Memory Pooling**: Reduce allocation overhead
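Aligned storage, the first strategy above, can be demonstrated with the standard library alone: `std::alloc::Layout` lets the caller request an explicit alignment. The 64-byte target below is an assumption chosen to match a typical cache line / AVX-512 vector width.

```rust
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    // Request 1 KiB at a 64-byte alignment (assumed SIMD-friendly target).
    let layout = Layout::from_size_align(1024, 64).expect("valid layout");
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failed");

    // The returned pointer honors the requested alignment.
    assert_eq!(ptr as usize % 64, 0);

    unsafe { dealloc(ptr, layout) };
}
```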

**Registry Pattern:**

```rust
// Register custom allocators
registry.register(
    "custom_allocator",
    AllocatorMetadata { ... },
    Box::new(MyAllocator::new())
);

// Automatic allocator selection
let allocator = registry.find_best_for_backend(backend_type);
```

## Component Relationships

### Data Flow: Tensor Operation

```
┌──────────┐
│   User   │
└────┬─────┘
     │ operation()
┌────────────────┐
│  Validation    │  ◄── Shape, DType checks
└────┬───────────┘
     │ validated
┌────────────────┐
│ Device Select  │  ◄── Device capabilities
└────┬───────────┘
     │ device chosen
┌────────────────┐
│ Memory Alloc   │  ◄── Storage backend
└────┬───────────┘
     │ memory ready
┌────────────────┐
│  Computation   │  ◄── Backend execution
└────┬───────────┘
     │ result
┌────────────────┐
│    Return      │
└────────────────┘
```

### Type Promotion Flow

```
Operation(tensor_f32, tensor_i32)
┌─────────────────────┐
│  Type Compatibility │
│       Check         │
└──────────┬──────────┘
           ▼ (incompatible)
┌─────────────────────┐
│   Type Promotion    │
│   f32 + i32 → f32   │
└──────────┬──────────┘
┌─────────────────────┐
│  Execute Operation  │
└─────────────────────┘
```
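The promotion step in the diagram (`f32 + i32 → f32`) can be sketched as a small lattice walk: floats dominate integers, and wider types dominate narrower ones. This is a hedged illustration over four dtypes; the real promotion table in torsh-core covers the full `DType` enum and may differ in detail.

```rust
// Reduced dtype set for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DType { I32, I64, F32, F64 }

// Sketch of a promotion rule: any float beats any integer,
// and the wider type beats the narrower one.
fn promote(a: DType, b: DType) -> DType {
    use DType::*;
    if a == b {
        return a;
    }
    match (a, b) {
        (F64, _) | (_, F64) => F64,
        (F32, _) | (_, F32) => F32,
        (I64, _) | (_, I64) => I64,
        _ => I32,
    }
}

fn main() {
    // The case from the flow diagram above:
    assert_eq!(promote(DType::F32, DType::I32), DType::F32);
    assert_eq!(promote(DType::I32, DType::I64), DType::I64);
    assert_eq!(promote(DType::F32, DType::F64), DType::F64);
}
```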

### Device Discovery & Selection

```
┌─────────────────┐
│ Discover Devices│
└────────┬────────┘
┌─────────────────────┐
│  Query Capabilities │  ◄── SIMD, memory, etc.
└────────┬────────────┘
┌─────────────────────┐
│  Score Performance  │  ◄── Workload profile
└────────┬────────────┘
┌─────────────────────┐
│  Select Best Device │
└─────────────────────┘
```
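The discover → score → select pipeline above can be sketched as follows. The capability fields and scoring weights here are assumptions for illustration; the real `device/capabilities.rs` and `device/discovery.rs` modules use richer profiles.

```rust
// Hypothetical device description (fields illustrative).
struct DeviceInfo {
    name: &'static str,
    memory_gb: f64,
    has_simd: bool,
}

// Toy workload profile: weight available memory, bonus for SIMD support.
fn score(d: &DeviceInfo) -> f64 {
    d.memory_gb + if d.has_simd { 10.0 } else { 0.0 }
}

// Pick the highest-scoring device from the discovered set.
fn select_best(devices: &[DeviceInfo]) -> &DeviceInfo {
    devices
        .iter()
        .max_by(|a, b| score(a).partial_cmp(&score(b)).unwrap())
        .expect("at least one device discovered")
}

fn main() {
    let devices = [
        DeviceInfo { name: "cpu", memory_gb: 64.0, has_simd: true },
        DeviceInfo { name: "gpu0", memory_gb: 16.0, has_simd: true },
    ];
    // With this toy profile the large-memory CPU wins.
    assert_eq!(select_best(&devices).name, "cpu");
}
```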

## Key Design Patterns

### 1. Builder Pattern

Used extensively for configuration:

```rust
let config = RuntimeConfig::builder()
    .debug_level(DebugLevel::Verbose)
    .validation_level(ValidationLevel::Strict)
    .enable_profiling(true)
    .build();
```
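A minimal runnable version of that builder follows. The field names mirror the call chain above, but the struct layout and level representation are assumptions; the real `RuntimeConfig` uses the `DebugLevel`/`ValidationLevel` enums described later in this document.

```rust
// Hypothetical config struct (simplified field types for illustration).
#[derive(Debug, Default)]
struct RuntimeConfig {
    debug_level: u8,
    strict_validation: bool,
    profiling: bool,
}

#[derive(Default)]
struct RuntimeConfigBuilder {
    cfg: RuntimeConfig,
}

impl RuntimeConfig {
    fn builder() -> RuntimeConfigBuilder {
        RuntimeConfigBuilder::default()
    }
}

impl RuntimeConfigBuilder {
    // Each setter consumes and returns the builder, enabling chaining.
    fn debug_level(mut self, level: u8) -> Self {
        self.cfg.debug_level = level;
        self
    }
    fn strict_validation(mut self, on: bool) -> Self {
        self.cfg.strict_validation = on;
        self
    }
    fn enable_profiling(mut self, on: bool) -> Self {
        self.cfg.profiling = on;
        self
    }
    fn build(self) -> RuntimeConfig {
        self.cfg
    }
}

fn main() {
    let config = RuntimeConfig::builder()
        .debug_level(3)
        .strict_validation(true)
        .enable_profiling(true)
        .build();
    assert_eq!(config.debug_level, 3);
    assert!(config.strict_validation && config.profiling);
}
```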

### 2. Registry Pattern

For extensible component registration:

```rust
// Device registry
DeviceRegistry::register(device_type, factory);

// Allocator registry
AllocatorRegistry::register(name, metadata, allocator);
```

### 3. Phantom Types

For compile-time type safety:

```rust
struct Tensor<D: PhantomDevice, T: DType> {
    data: Storage,
    _phantom: PhantomData<(D, T)>,
}
```

### 4. Strategy Pattern

For algorithm selection:

```rust
trait AllocationStrategy {
    fn allocate(&self, size: usize) -> Result<*mut u8>;
}

// Different strategies: NUMA, pooled, aligned
```
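Two interchangeable strategies behind one trait can be sketched like this. The signature is deliberately simplified versus the raw-pointer trait above (it returns an owned buffer), and the pooled strategy's reuse policy is an assumption for illustration.

```rust
use std::cell::RefCell;

// Simplified strategy trait for illustration (owned buffers, no unsafe).
trait AllocationStrategy {
    fn allocate(&self, size: usize) -> Vec<u8>;
    fn name(&self) -> &'static str;
}

// Always allocates a fresh zeroed buffer.
struct FreshStrategy;
impl AllocationStrategy for FreshStrategy {
    fn allocate(&self, size: usize) -> Vec<u8> {
        vec![0; size]
    }
    fn name(&self) -> &'static str { "fresh" }
}

// Reuses a previously freed buffer of the right size when available.
struct PooledStrategy {
    pool: RefCell<Vec<Vec<u8>>>,
}
impl AllocationStrategy for PooledStrategy {
    fn allocate(&self, size: usize) -> Vec<u8> {
        let mut pool = self.pool.borrow_mut();
        if let Some(i) = pool.iter().position(|b| b.len() == size) {
            pool.swap_remove(i) // cache hit: no new allocation
        } else {
            vec![0; size]
        }
    }
    fn name(&self) -> &'static str { "pooled" }
}

fn main() {
    // Callers only see the trait object; the strategy is chosen at runtime.
    let strategies: Vec<Box<dyn AllocationStrategy>> = vec![
        Box::new(FreshStrategy),
        Box::new(PooledStrategy { pool: RefCell::new(vec![vec![0; 128]]) }),
    ];
    for s in &strategies {
        let buf = s.allocate(128);
        assert_eq!(buf.len(), 128);
        println!("{} strategy returned {} bytes", s.name(), buf.len());
    }
}
```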

### 5. Observer Pattern

For monitoring and telemetry:

```rust
// Performance profiler observes operations
profiler.record_operation("matmul", duration);

// Memory debugger tracks allocations
debugger.record_allocation(size, layout);
```

### 6. Flyweight Pattern

For shape stride caching:

```rust
// Reuse computed strides across tensors
let strides = STRIDE_CACHE.get_or_compute(shape);
```

## Extension Points

### Adding a New Data Type

1. Define the type in `dtype/extended.rs`
2. Implement `TensorElement` trait
3. Add to `DType` enum
4. Implement type promotion rules
5. Add test cases
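Step 2 above can be sketched as follows. The `TensorElement`-style trait shown here is illustrative, not the real torsh-core definition, and `Fixed16` is a toy fixed-point type standing in for the new dtype.

```rust
// Illustrative stand-in for the TensorElement trait (not the real definition).
trait TensorElement: Copy {
    const NAME: &'static str;
    const SIZE_BYTES: usize;
    fn zero() -> Self;
}

// A toy 16-bit fixed-point type as the "new" dtype: value = raw / 256.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Fixed16(i16);

impl TensorElement for Fixed16 {
    const NAME: &'static str = "fixed16";
    const SIZE_BYTES: usize = 2;
    fn zero() -> Self {
        Fixed16(0)
    }
}

fn main() {
    assert_eq!(Fixed16::NAME, "fixed16");
    assert_eq!(Fixed16::SIZE_BYTES, 2);
    assert_eq!(Fixed16::zero(), Fixed16(0));
}
```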

### Adding a New Device Backend

1. Implement `Device` trait in `device/implementations.rs`
2. Add device capabilities
3. Register device factory
4. Implement memory allocator
5. Add backend-specific optimizations

### Adding Custom Storage

1. Implement `Storage` trait
2. Register allocator in registry
3. Specify allocation requirements
4. Add metadata for discovery

## Performance Considerations

### Hot Paths

1. **Tensor indexing**: Uses raw pointers, bounds checking only in debug
2. **Shape validation**: Cached strides, thread-local caches
3. **Type promotion**: Compile-time when possible, minimal runtime overhead
4. **Memory allocation**: Pooled for small tensors, aligned for SIMD

### SIMD Optimization

```rust
// std::arch::x86_64 and std::arch::aarch64 only exist on their respective
// architectures, so each gate pairs target_arch with the feature flag:
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
fn simd_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    use std::arch::x86_64::*;
    // AVX2 vectorized implementation
}

#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
fn simd_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    use std::arch::aarch64::*;
    // NEON vectorized implementation
}

// Scalar fallback so the symbol exists on every target.
#[cfg(not(any(
    all(target_arch = "x86_64", target_feature = "avx2"),
    all(target_arch = "aarch64", target_feature = "neon")
)))]
fn simd_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}
```

### Memory Layout Optimization

- **C-contiguous**: Default, best for row-major operations
- **F-contiguous**: Better for column-major operations
- **Strided**: Flexible but slower
- **Aligned**: 32/64-byte alignment for SIMD

### Cache Efficiency

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Thread-local stride cache: no locking on the fast path
thread_local! {
    static STRIDE_CACHE: RefCell<HashMap<Shape, Vec<usize>>> =
        RefCell::new(HashMap::new());
}

// Global LRU cache with eviction
// (`Lazy` from the `once_cell` crate, `LruCache` from the `lru` crate)
static GLOBAL_STRIDE_CACHE: Lazy<Mutex<LruCache<...>>> = ...;
```

## Runtime Configuration

### Debug Levels

```rust
pub enum DebugLevel {
    None,       // No debug output
    Essential,  // Critical errors only
    Standard,   // Normal debug info
    Verbose,    // Detailed debug info
    Paranoid,   // Everything, including internals
}
```

### Validation Levels

```rust
pub enum ValidationLevel {
    Essential,  // Only check critical invariants
    Standard,   // Normal validation
    Strict,     // Thorough validation
    Maximum,    // Every possible check
}
```

### Configuration Presets

- **Development**: Verbose debugging, strict validation
- **Testing**: Standard debugging, strict validation
- **Production**: Essential debugging, essential validation
- **Profiling**: Minimal debugging, standard validation
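The preset table above can be sketched as a simple mapping. The pairings mirror the bullet list; representing the levels as strings (rather than the `DebugLevel`/`ValidationLevel` enums) is a simplification for illustration.

```rust
// Presets as listed in this section.
#[derive(Debug, PartialEq)]
enum Preset { Development, Testing, Production, Profiling }

// Returns (debug level, validation level) as strings, mirroring the bullets above.
fn preset_levels(p: Preset) -> (&'static str, &'static str) {
    match p {
        Preset::Development => ("verbose", "strict"),
        Preset::Testing => ("standard", "strict"),
        Preset::Production => ("essential", "essential"),
        Preset::Profiling => ("minimal", "standard"),
    }
}

fn main() {
    assert_eq!(preset_levels(Preset::Production), ("essential", "essential"));
    assert_eq!(preset_levels(Preset::Testing), ("standard", "strict"));
}
```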

## Testing Strategy

### Unit Tests

- Per-module tests in `#[cfg(test)]` blocks
- Cover edge cases and error conditions
- Property-based testing with `proptest`

### Integration Tests

- Backend integration tests
- Cross-module interaction tests
- SciRS2 integration verification

### Benchmark Tests

- Criterion benchmarks in `benches/`
- Performance regression detection
- Platform-specific optimizations

### Fuzz Testing

- Cargo-fuzz targets for shape operations
- Random input generation
- Invariant checking

## Future Directions

### Planned Enhancements

1. **Graph-based shape inference** for optimization
2. **Automatic memory layout optimization**
3. **Distributed tensor metadata management**
4. **Enhanced compile-time type checking**
5. **WebGPU compute shader integration**

### Research Topics

1. Cache-oblivious algorithms for shape operations
2. Tensor expression templates for optimization
3. Type-level automatic differentiation
4. Neuromorphic computing data structures

## References

- [PyTorch Tensor Implementation](https://pytorch.org/)
- [TensorFlow Core](https://www.tensorflow.org/)
- [ndarray Rust Crate](https://docs.rs/ndarray/)
- [SciRS2 Documentation](https://github.com/cool-japan/scirs)
- [IEEE 754 Floating-Point Standard](https://en.wikipedia.org/wiki/IEEE_754)

## Contributing

When contributing to torsh-core, please:

1. Follow the module organization patterns
2. Add comprehensive tests for new features
3. Update this architecture document
4. Maintain zero-cost abstractions
5. Ensure SciRS2 POLICY compliance

---

*Last Updated: 2025-10-23*
*Version: 0.1.0*