torsh-core 0.1.1

Core types and traits for ToRSh deep learning framework
# torsh-core Architecture

This document describes the architecture of the `torsh-core` crate, the foundational layer of the ToRSh deep learning framework.

## Table of Contents

- [Overview](#overview)
- [Core Principles](#core-principles)
- [Module Organization](#module-organization)
- [Component Relationships](#component-relationships)
- [Key Design Patterns](#key-design-patterns)
- [Extension Points](#extension-points)
- [Performance Considerations](#performance-considerations)

## Overview

`torsh-core` provides the fundamental building blocks for the ToRSh framework:

- **Type System**: DType, Shape, and type promotion
- **Device Abstraction**: Platform-independent device representation
- **Error Handling**: Comprehensive error system with context
- **Memory Management**: Efficient memory allocation and pooling
- **Storage Backends**: Unified interface for different memory layouts
- **Debugging Tools**: Runtime introspection and profiling

### Design Philosophy

1. **Zero-cost abstractions**: Performance-critical paths have minimal overhead
2. **Type safety**: Compile-time and runtime validation
3. **Extensibility**: Easy to add new devices, dtypes, and backends
4. **SciRS2 Integration**: Deep integration with the scirs2 ecosystem
5. **Production-ready**: Comprehensive error handling and debugging tools

## Core Principles

### 1. Modular Design

Each major component is isolated in its own module with clear interfaces:

```
torsh-core/
├── dtype/          # Data type system
├── shape/          # Tensor shape management
├── device/         # Device abstraction
├── error/          # Error handling
├── storage/        # Memory management
└── ...
```

### 2. Layered Architecture

```
┌─────────────────────────────────────────┐
│     High-Level APIs & Utilities         │  Examples, profiling, debugging
├─────────────────────────────────────────┤
│        Core Abstractions                │  DType, Shape, Device
├─────────────────────────────────────────┤
│      Memory & Storage Layer             │  Allocators, pooling, NUMA
├─────────────────────────────────────────┤
│     Platform-Specific Backends          │  CPU, CUDA, Metal, WebGPU
└─────────────────────────────────────────┘
```

### 3. Separation of Concerns

- **Types** (DType, Shape) are pure data structures
- **Devices** provide computational capabilities
- **Storage** manages memory allocation
- **Errors** handle all failure modes
- **Utilities** add debugging and profiling

## Module Organization

### Core Types Module Graph

```
dtype.rs ──────┐
              ├──> TensorElement ──> Operations
shape.rs ─────┤
              └──> Validation ──────> Error Handling
device.rs ─────────────────────────> Backend Selection
```

### Data Type System (`dtype/`)

```rust
pub enum DType {
    // Integer types
    U8, I8, I16, I32, I64,
    // Float types
    F16, BF16, F32, F64,
    // Complex types
    C64, C128,
    // Quantized types
    QInt8, QUInt8,
}
```

**Key Features:**
- Type promotion system for mixed-precision operations
- IEEE 754 compliance checking
- Custom data type support through traits
- Automatic type conversion with safety checks

**Dependencies:**
- Uses `scirs2_core::numeric` for numerical traits
- Integrates with `scirs2_core::ndarray` for array operations

### Shape Management (`shape/`)

```
┌────────────────┐
│  Shape (Core)  │
└────────┬───────┘
    ┌────┴────┬──────────┬─────────────┐
    │         │          │             │
┌───▼───┐ ┌──▼──┐  ┌────▼─────┐  ┌───▼────┐
│Stride │ │Cache│  │Validation│  │ Utils  │
│       │ │     │  │          │  │        │
└───────┘ └─────┘  └──────────┘  └────────┘
```

**Components:**
- `shape.rs`: Core shape representation with dimension tracking
- `shape_utils.rs`: Common shape operations and patterns
- `shape_validation.rs`: Validation with visual error messages
- `shape_debug.rs`: ASCII visualization and debugging

**Design Decisions:**
- Shapes are immutable for thread safety
- Stride caching for performance (thread-local + global)
- Symbolic shape support for dynamic graphs
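The first two design decisions can be sketched together: an immutable shape computes its row-major strides once at construction and never mutates them afterward, so they can be shared freely across threads. The struct and method names here are illustrative, not the real torsh-core definitions.

```rust
// Hypothetical sketch: an immutable shape with strides cached at construction.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Shape {
    dims: Vec<usize>,
    strides: Vec<usize>, // computed once; Shape is never mutated afterward
}

impl Shape {
    fn new(dims: Vec<usize>) -> Self {
        // Row-major (C-contiguous) strides: the last dimension varies fastest.
        let mut strides = vec![1; dims.len()];
        for i in (0..dims.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * dims[i + 1];
        }
        Shape { dims, strides }
    }

    fn numel(&self) -> usize {
        self.dims.iter().product()
    }
}

fn main() {
    let s = Shape::new(vec![2, 3, 4]);
    assert_eq!(s.strides, vec![12, 4, 1]);
    assert_eq!(s.numel(), 24);
}
```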

### Device Abstraction (`device/`)

```
                    ┌─────────────┐
                    │   Device    │
                    │   (Trait)   │
                    └──────┬──────┘
         ┌─────────────────┼─────────────────┐
         │                 │                 │
    ┌────▼────┐      ┌─────▼────┐     ┌────▼─────┐
    │  CPU    │      │   CUDA   │     │  Metal   │
    │         │      │          │     │          │
    └─────────┘      └──────────┘     └──────────┘
```

**Submodules:**
- `device/core.rs`: Device trait and base implementations
- `device/capabilities.rs`: Feature detection and scoring
- `device/discovery.rs`: Automatic device selection
- `device/management.rs`: Device pools and health monitoring
- `device/phantom.rs`: Type-level device tracking

**Phantom Types for Compile-Time Safety:**

```rust
// Compile-time device type checking
let tensor: Tensor<CpuDevice, F32> = ...;
let gpu_tensor: Tensor<CudaDevice, F32> = ...;

// This won't compile:
// let result = tensor + gpu_tensor; // Error: device mismatch!

// Type-safe device groups
let devices: DeviceGroup<CudaDevice, 4> = ...;
```
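The snippet above is illustrative; a minimal compilable version of the same idea follows. The device markers and `add` function are assumptions for demonstration: because `add` is only defined for two tensors with the *same* type-level device `D`, mixing a CPU tensor with a CUDA tensor is rejected by the compiler, with zero runtime cost.

```rust
use std::marker::PhantomData;

// Hypothetical zero-sized device markers (names illustrative).
struct Cpu;
#[allow(dead_code)]
struct Cuda;

// Tensor carrying its device only at the type level.
struct Tensor<D> {
    data: Vec<f32>,
    _device: PhantomData<D>,
}

impl<D> Tensor<D> {
    fn new(data: Vec<f32>) -> Self {
        Tensor { data, _device: PhantomData }
    }
}

// Only defined for two tensors on the same device D.
fn add<D>(a: &Tensor<D>, b: &Tensor<D>) -> Tensor<D> {
    let data = a.data.iter().zip(&b.data).map(|(x, y)| x + y).collect();
    Tensor::new(data)
}

fn main() {
    let a: Tensor<Cpu> = Tensor::new(vec![1.0, 2.0]);
    let b: Tensor<Cpu> = Tensor::new(vec![3.0, 4.0]);
    let c = add(&a, &b);
    assert_eq!(c.data, vec![4.0, 6.0]);
    // let g: Tensor<Cuda> = Tensor::new(vec![0.0]);
    // add(&a, &g); // does not compile: Cpu != Cuda
}
```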

### Error Handling (`error/`)

```
                  ┌──────────────┐
                  │  TorshError  │
                  └───────┬──────┘
        ┌─────────────────┼─────────────────┐
        │                 │                 │
   ┌────▼─────┐     ┌─────▼────┐    ┌──────▼──────┐
   │  Shape   │     │  Index   │    │   General   │
   │  Error   │     │  Error   │    │    Error    │
   └──────────┘     └──────────┘    └─────────────┘
```

**Features:**
- Modular error types (shape, index, general)
- Rich error context with stack traces
- Standard error codes for FFI interoperability
- Error recovery mechanisms
- Source location tracking

**Error Code Mapping:**

```rust
// Sketch: mapping TorshError variants to integer codes
// (variants shown without payloads for illustration).
fn error_code(err: &TorshError) -> i32 {
    match err {
        // Standard POSIX-like codes for FFI interoperability
        TorshError::OutOfMemory     => 12, // ENOMEM
        TorshError::InvalidArgument => 22, // EINVAL
        TorshError::NotImplemented  => 38, // ENOSYS
        // Custom codes for framework-specific errors
        TorshError::ShapeMismatch   => 1001,
        TorshError::DTypeMismatch   => 1011,
        TorshError::DeviceError     => 1021,
    }
}
```

### Storage System (`storage/`)

```
┌──────────────────────────────────┐
│   Storage Trait (Abstract)       │
└────────────┬─────────────────────┘
    ┌────────┴────────┬────────────┬──────────┐
    │                 │            │          │
┌───▼────┐   ┌───────▼──┐   ┌─────▼───┐  ┌──▼─────┐
│Aligned │   │  NUMA    │   │ Mapped  │  │  Pool  │
│        │   │          │   │ Storage │  │        │
└────────┘   └──────────┘   └─────────┘  └────────┘
```

**Memory Management Strategies:**

1. **Aligned Storage**: SIMD-friendly memory alignment
2. **NUMA-Aware**: Optimize for multi-socket systems
3. **Memory-Mapped**: Lazy loading for large tensors
4. **Memory Pooling**: Reduce allocation overhead
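Aligned storage, the first strategy above, can be demonstrated with the standard library alone: `std::alloc::Layout` lets the caller request an explicit alignment. The 64-byte target below is an assumption chosen to match a typical cache line / AVX-512 vector width.

```rust
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    // Request 1 KiB at a 64-byte alignment (assumed SIMD-friendly target).
    let layout = Layout::from_size_align(1024, 64).expect("valid layout");
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failed");

    // The returned pointer honors the requested alignment.
    assert_eq!(ptr as usize % 64, 0);

    unsafe { dealloc(ptr, layout) };
}
```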

**Registry Pattern:**

```rust
// Register custom allocators
registry.register(
    "custom_allocator",
    AllocatorMetadata { ... },
    Box::new(MyAllocator::new())
);

// Automatic allocator selection
let allocator = registry.find_best_for_backend(backend_type);
```

## Component Relationships

### Data Flow: Tensor Operation

```
┌──────────┐
│   User   │
└────┬─────┘
     │ operation()
┌────────────────┐
│  Validation    │  ◄── Shape, DType checks
└────┬───────────┘
     │ validated
┌────────────────┐
│ Device Select  │  ◄── Device capabilities
└────┬───────────┘
     │ device chosen
┌────────────────┐
│ Memory Alloc   │  ◄── Storage backend
└────┬───────────┘
     │ memory ready
┌────────────────┐
│  Computation   │  ◄── Backend execution
└────┬───────────┘
     │ result
┌────────────────┐
│    Return      │
└────────────────┘
```

### Type Promotion Flow

```
Operation(tensor_f32, tensor_i32)
┌─────────────────────┐
│  Type Compatibility │
│       Check         │
└──────────┬──────────┘
           ▼ (incompatible)
┌─────────────────────┐
│   Type Promotion    │
│   f32 + i32 → f32   │
└──────────┬──────────┘
┌─────────────────────┐
│  Execute Operation  │
└─────────────────────┘
```
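The promotion step in the diagram (`f32 + i32 → f32`) can be sketched as a small lattice walk: floats dominate integers, and wider types dominate narrower ones. This is a hedged illustration over four dtypes; the real promotion table in torsh-core covers the full `DType` enum and may differ in detail.

```rust
// Reduced dtype set for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DType { I32, I64, F32, F64 }

// Sketch of a promotion rule: any float beats any integer,
// and the wider type beats the narrower one.
fn promote(a: DType, b: DType) -> DType {
    use DType::*;
    if a == b {
        return a;
    }
    match (a, b) {
        (F64, _) | (_, F64) => F64,
        (F32, _) | (_, F32) => F32,
        (I64, _) | (_, I64) => I64,
        _ => I32,
    }
}

fn main() {
    // The case from the flow diagram above:
    assert_eq!(promote(DType::F32, DType::I32), DType::F32);
    assert_eq!(promote(DType::I32, DType::I64), DType::I64);
    assert_eq!(promote(DType::F32, DType::F64), DType::F64);
}
```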

### Device Discovery & Selection

```
┌─────────────────┐
│ Discover Devices│
└────────┬────────┘
┌─────────────────────┐
│  Query Capabilities │  ◄── SIMD, memory, etc.
└────────┬────────────┘
┌─────────────────────┐
│  Score Performance  │  ◄── Workload profile
└────────┬────────────┘
┌─────────────────────┐
│  Select Best Device │
└─────────────────────┘
```
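The discover → score → select pipeline above can be sketched as follows. The capability fields and scoring weights here are assumptions for illustration; the real `device/capabilities.rs` and `device/discovery.rs` modules use richer profiles.

```rust
// Hypothetical device description (fields illustrative).
struct DeviceInfo {
    name: &'static str,
    memory_gb: f64,
    has_simd: bool,
}

// Toy workload profile: weight available memory, bonus for SIMD support.
fn score(d: &DeviceInfo) -> f64 {
    d.memory_gb + if d.has_simd { 10.0 } else { 0.0 }
}

// Pick the highest-scoring device from the discovered set.
fn select_best(devices: &[DeviceInfo]) -> &DeviceInfo {
    devices
        .iter()
        .max_by(|a, b| score(a).partial_cmp(&score(b)).unwrap())
        .expect("at least one device discovered")
}

fn main() {
    let devices = [
        DeviceInfo { name: "cpu", memory_gb: 64.0, has_simd: true },
        DeviceInfo { name: "gpu0", memory_gb: 16.0, has_simd: true },
    ];
    // With this toy profile the large-memory CPU wins.
    assert_eq!(select_best(&devices).name, "cpu");
}
```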

## Key Design Patterns

### 1. Builder Pattern

Used extensively for configuration:

```rust
let config = RuntimeConfig::builder()
    .debug_level(DebugLevel::Verbose)
    .validation_level(ValidationLevel::Strict)
    .enable_profiling(true)
    .build();
```
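A minimal runnable version of that builder follows. The field names mirror the call chain above, but the struct layout and level representation are assumptions; the real `RuntimeConfig` uses the `DebugLevel`/`ValidationLevel` enums described later in this document.

```rust
// Hypothetical config struct (simplified field types for illustration).
#[derive(Debug, Default)]
struct RuntimeConfig {
    debug_level: u8,
    strict_validation: bool,
    profiling: bool,
}

#[derive(Default)]
struct RuntimeConfigBuilder {
    cfg: RuntimeConfig,
}

impl RuntimeConfig {
    fn builder() -> RuntimeConfigBuilder {
        RuntimeConfigBuilder::default()
    }
}

impl RuntimeConfigBuilder {
    // Each setter consumes and returns the builder, enabling chaining.
    fn debug_level(mut self, level: u8) -> Self {
        self.cfg.debug_level = level;
        self
    }
    fn strict_validation(mut self, on: bool) -> Self {
        self.cfg.strict_validation = on;
        self
    }
    fn enable_profiling(mut self, on: bool) -> Self {
        self.cfg.profiling = on;
        self
    }
    fn build(self) -> RuntimeConfig {
        self.cfg
    }
}

fn main() {
    let config = RuntimeConfig::builder()
        .debug_level(3)
        .strict_validation(true)
        .enable_profiling(true)
        .build();
    assert_eq!(config.debug_level, 3);
    assert!(config.strict_validation && config.profiling);
}
```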

### 2. Registry Pattern

For extensible component registration:

```rust
// Device registry
DeviceRegistry::register(device_type, factory);

// Allocator registry
AllocatorRegistry::register(name, metadata, allocator);
```

### 3. Phantom Types

For compile-time type safety:

```rust
struct Tensor<D: PhantomDevice, T: DType> {
    data: Storage,
    _phantom: PhantomData<(D, T)>,
}
```

### 4. Strategy Pattern

For algorithm selection:

```rust
trait AllocationStrategy {
    fn allocate(&self, size: usize) -> Result<*mut u8>;
}

// Different strategies: NUMA, pooled, aligned
```
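Two interchangeable strategies behind one trait can be sketched like this. The signature is deliberately simplified versus the raw-pointer trait above (it returns an owned buffer), and the pooled strategy's reuse policy is an assumption for illustration.

```rust
use std::cell::RefCell;

// Simplified strategy trait for illustration (owned buffers, no unsafe).
trait AllocationStrategy {
    fn allocate(&self, size: usize) -> Vec<u8>;
    fn name(&self) -> &'static str;
}

// Always allocates a fresh zeroed buffer.
struct FreshStrategy;
impl AllocationStrategy for FreshStrategy {
    fn allocate(&self, size: usize) -> Vec<u8> {
        vec![0; size]
    }
    fn name(&self) -> &'static str { "fresh" }
}

// Reuses a previously freed buffer of the right size when available.
struct PooledStrategy {
    pool: RefCell<Vec<Vec<u8>>>,
}
impl AllocationStrategy for PooledStrategy {
    fn allocate(&self, size: usize) -> Vec<u8> {
        let mut pool = self.pool.borrow_mut();
        if let Some(i) = pool.iter().position(|b| b.len() == size) {
            pool.swap_remove(i) // cache hit: no new allocation
        } else {
            vec![0; size]
        }
    }
    fn name(&self) -> &'static str { "pooled" }
}

fn main() {
    // Callers only see the trait object; the strategy is chosen at runtime.
    let strategies: Vec<Box<dyn AllocationStrategy>> = vec![
        Box::new(FreshStrategy),
        Box::new(PooledStrategy { pool: RefCell::new(vec![vec![0; 128]]) }),
    ];
    for s in &strategies {
        let buf = s.allocate(128);
        assert_eq!(buf.len(), 128);
        println!("{} strategy returned {} bytes", s.name(), buf.len());
    }
}
```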

### 5. Observer Pattern

For monitoring and telemetry:

```rust
// Performance profiler observes operations
profiler.record_operation("matmul", duration);

// Memory debugger tracks allocations
debugger.record_allocation(size, layout);
```

### 6. Flyweight Pattern

For shape stride caching:

```rust
// Reuse computed strides across tensors
let strides = STRIDE_CACHE.get_or_compute(shape);
```

## Extension Points

### Adding a New Data Type

1. Define the type in `dtype/extended.rs`
2. Implement `TensorElement` trait
3. Add to `DType` enum
4. Implement type promotion rules
5. Add test cases
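Step 2 above can be sketched as follows. The `TensorElement`-style trait shown here is illustrative, not the real torsh-core definition, and `Fixed16` is a toy fixed-point type standing in for the new dtype.

```rust
// Illustrative stand-in for the TensorElement trait (not the real definition).
trait TensorElement: Copy {
    const NAME: &'static str;
    const SIZE_BYTES: usize;
    fn zero() -> Self;
}

// A toy 16-bit fixed-point type as the "new" dtype: value = raw / 256.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Fixed16(i16);

impl TensorElement for Fixed16 {
    const NAME: &'static str = "fixed16";
    const SIZE_BYTES: usize = 2;
    fn zero() -> Self {
        Fixed16(0)
    }
}

fn main() {
    assert_eq!(Fixed16::NAME, "fixed16");
    assert_eq!(Fixed16::SIZE_BYTES, 2);
    assert_eq!(Fixed16::zero(), Fixed16(0));
}
```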

### Adding a New Device Backend

1. Implement `Device` trait in `device/implementations.rs`
2. Add device capabilities
3. Register device factory
4. Implement memory allocator
5. Add backend-specific optimizations

### Adding Custom Storage

1. Implement `Storage` trait
2. Register allocator in registry
3. Specify allocation requirements
4. Add metadata for discovery

## Performance Considerations

### Hot Paths

1. **Tensor indexing**: Uses raw pointers, bounds checking only in debug
2. **Shape validation**: Cached strides, thread-local caches
3. **Type promotion**: Compile-time when possible, minimal runtime overhead
4. **Memory allocation**: Pooled for small tensors, aligned for SIMD

### SIMD Optimization

```rust
// std::arch::x86_64 and std::arch::aarch64 only exist on their respective
// architectures, so each gate pairs target_arch with the feature flag:
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
fn simd_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    use std::arch::x86_64::*;
    // AVX2 vectorized implementation
}

#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
fn simd_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    use std::arch::aarch64::*;
    // NEON vectorized implementation
}

// Scalar fallback so the symbol exists on every target.
#[cfg(not(any(
    all(target_arch = "x86_64", target_feature = "avx2"),
    all(target_arch = "aarch64", target_feature = "neon")
)))]
fn simd_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}
```

### Memory Layout Optimization

- **C-contiguous**: Default, best for row-major operations
- **F-contiguous**: Better for column-major operations
- **Strided**: Flexible but slower
- **Aligned**: 32/64-byte alignment for SIMD

### Cache Efficiency

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Thread-local stride cache: no locking on the fast path
thread_local! {
    static STRIDE_CACHE: RefCell<HashMap<Shape, Vec<usize>>> =
        RefCell::new(HashMap::new());
}

// Global LRU cache with eviction
// (`Lazy` from the `once_cell` crate, `LruCache` from the `lru` crate)
static GLOBAL_STRIDE_CACHE: Lazy<Mutex<LruCache<...>>> = ...;
```

## Runtime Configuration

### Debug Levels

```rust
pub enum DebugLevel {
    None,       // No debug output
    Essential,  // Critical errors only
    Standard,   // Normal debug info
    Verbose,    // Detailed debug info
    Paranoid,   // Everything, including internals
}
```

### Validation Levels

```rust
pub enum ValidationLevel {
    Essential,  // Only check critical invariants
    Standard,   // Normal validation
    Strict,     // Thorough validation
    Maximum,    // Every possible check
}
```

### Configuration Presets

- **Development**: Verbose debugging, strict validation
- **Testing**: Standard debugging, strict validation
- **Production**: Essential debugging, essential validation
- **Profiling**: Minimal debugging, standard validation
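The preset table above can be sketched as a simple mapping. The pairings mirror the bullet list; representing the levels as strings (rather than the `DebugLevel`/`ValidationLevel` enums) is a simplification for illustration.

```rust
// Presets as listed in this section.
#[derive(Debug, PartialEq)]
enum Preset { Development, Testing, Production, Profiling }

// Returns (debug level, validation level) as strings, mirroring the bullets above.
fn preset_levels(p: Preset) -> (&'static str, &'static str) {
    match p {
        Preset::Development => ("verbose", "strict"),
        Preset::Testing => ("standard", "strict"),
        Preset::Production => ("essential", "essential"),
        Preset::Profiling => ("minimal", "standard"),
    }
}

fn main() {
    assert_eq!(preset_levels(Preset::Production), ("essential", "essential"));
    assert_eq!(preset_levels(Preset::Testing), ("standard", "strict"));
}
```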

## Testing Strategy

### Unit Tests

- Per-module tests in `#[cfg(test)]` blocks
- Cover edge cases and error conditions
- Property-based testing with `proptest`

### Integration Tests

- Backend integration tests
- Cross-module interaction tests
- SciRS2 integration verification

### Benchmark Tests

- Criterion benchmarks in `benches/`
- Performance regression detection
- Platform-specific optimizations

### Fuzz Testing

- Cargo-fuzz targets for shape operations
- Random input generation
- Invariant checking

## Future Directions

### Planned Enhancements

1. **Graph-based shape inference** for optimization
2. **Automatic memory layout optimization**
3. **Distributed tensor metadata management**
4. **Enhanced compile-time type checking**
5. **WebGPU compute shader integration**

### Research Topics

1. Cache-oblivious algorithms for shape operations
2. Tensor expression templates for optimization
3. Type-level automatic differentiation
4. Neuromorphic computing data structures

## References

- [PyTorch Tensor Implementation](https://pytorch.org/)
- [TensorFlow Core](https://www.tensorflow.org/)
- [ndarray Rust Crate](https://docs.rs/ndarray/)
- [SciRS2 Documentation](https://github.com/cool-japan/scirs)
- [IEEE 754 Floating-Point Standard](https://en.wikipedia.org/wiki/IEEE_754)

## Contributing

When contributing to torsh-core, please:

1. Follow the module organization patterns
2. Add comprehensive tests for new features
3. Update this architecture document
4. Maintain zero-cost abstractions
5. Ensure SciRS2 POLICY compliance

---

*Last Updated: 2025-10-23*
*Version: 0.1.0*