# Dispatch Registry Integration Guide
**Version:** 0.1.0
**Audience:** TenfloweRS Core Contributors
**Status:** Active Reference Document
## Table of Contents
1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Architecture](#architecture)
4. [Registration Patterns](#registration-patterns)
5. [Kernel Implementation](#kernel-implementation)
6. [Best Practices](#best-practices)
7. [Testing](#testing)
8. [Common Pitfalls](#common-pitfalls)
9. [Migration Checklist](#migration-checklist)
10. [References](#references)
## Overview
The TenfloweRS dispatch registry provides a unified system for registering and executing tensor operations across multiple backends (CPU, SIMD, GPU, BLAS, etc.). It eliminates duplicated per-module dispatch logic and selects a backend automatically based on device, dtype, and availability.
### Key Benefits
- **Unified Interface**: Single dispatch path for all operations
- **Automatic Backend Selection**: Choose optimal implementation at runtime
- **Feature Gating**: Conditional compilation for optional backends
- **Type Safety**: Type-specific registries prevent runtime errors
- **Extensibility**: Easy to add new operations and backends
### Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│                       Operation Call                        │
│               (e.g., tensor.abs(), add(a, b))               │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────┐
│                   Dispatch Registry Query                   │
│        get_registry::<T>().dispatch_unary("abs", x)         │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────┐
│                   Backend Selection Logic                   │
│   1. Check device → preferred backend                       │
│   2. Filter available backends                              │
│   3. Select highest priority                                │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────┐
│                      Kernel Execution                       │
│               Backend-specific implementation               │
│          (CPU, SIMD, GPU, BLAS, CUDA, Metal, etc.)          │
└─────────────────────────────────────────────────────────────┘
```
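In code, this flow boils down to a lookup, a selection, and a call. The following is a simplified sketch only, not the actual `dispatch_registry.rs` source; the field and method names (`kernels`, `priority`, the error constructors) are illustrative assumptions:

```rust
// Simplified sketch; names like `kernels`, `priority`, and the error
// constructors are illustrative, not the real dispatch_registry.rs API.
impl<T> DispatchRegistry<T> {
    pub fn dispatch_unary(&self, op: &str, x: &Tensor<T>) -> Result<Tensor<T>> {
        // 1. Look up the kernels registered for this operation.
        let kernels = self
            .kernels
            .get(op)
            .ok_or_else(|| TensorError::operation_not_registered(op))?;
        // 2. Pick the available backend with the highest priority.
        let backend = kernels
            .keys()
            .max_by_key(|b| b.priority())
            .ok_or_else(|| TensorError::no_backend_available(op))?;
        // 3. Execute the backend-specific kernel.
        match &kernels[backend] {
            KernelImplementation::Unary(f) => f(x),
            _ => Err(TensorError::kernel_arity_mismatch(op)),
        }
    }
}
```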
## Quick Start
### 1. Basic Unary Operation Registration
```rust
use crate::dispatch_registry::{
    BackendType, KernelImplementation, OperationDescriptor, F32_REGISTRY,
};
use crate::{DType, Tensor, Result};

// Step 1: Define CPU kernel
fn sqrt_f32_cpu(x: &Tensor<f32>) -> Result<Tensor<f32>> {
    let data = x.data();
    let result: Vec<f32> = data.iter().map(|v| v.sqrt()).collect();
    let array = scirs2_autograd::ndarray::ArrayD::from_shape_vec(
        x.shape().dims(),
        result,
    )?;
    Ok(Tensor::from_array(array))
}

// Step 2: Register operation
pub fn register_sqrt() {
    let desc = OperationDescriptor::new("sqrt", "unary")
        .with_dtypes(vec![DType::Float32])
        .with_broadcast();
    F32_REGISTRY.register_operation(desc).unwrap();

    // Step 3: Register CPU kernel
    F32_REGISTRY.register_kernel(
        "sqrt",
        KernelImplementation::unary(BackendType::Cpu, sqrt_f32_cpu),
    ).unwrap();
}

// Step 4: Use in operation
pub fn sqrt(x: &Tensor<f32>) -> Result<Tensor<f32>> {
    F32_REGISTRY.dispatch_unary("sqrt", x)
}
```
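Call sites stay backend-agnostic; the registry picks the kernel. A quick check in the style of the unit tests later in this guide:

```rust
#[test]
fn sqrt_quick_check() {
    register_sqrt(); // once per process; see Pitfall 3 on duplicates
    let x = Tensor::from_array(
        scirs2_autograd::ndarray::array![1.0f32, 4.0, 9.0].into_dyn()
    );
    let y = sqrt(&x).unwrap();
    assert_eq!(y.data(), &[1.0f32, 2.0, 3.0]);
}
```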
### 2. Basic Binary Operation Registration
```rust
use crate::TensorError; // for shape_mismatch; other imports as in the sqrt example

// Step 1: Define CPU kernel
fn add_f32_cpu(a: &Tensor<f32>, b: &Tensor<f32>) -> Result<Tensor<f32>> {
    if a.shape() != b.shape() {
        return Err(TensorError::shape_mismatch("add", a.shape(), b.shape()));
    }
    let a_data = a.data();
    let b_data = b.data();
    let result: Vec<f32> = a_data.iter()
        .zip(b_data.iter())
        .map(|(x, y)| x + y)
        .collect();
    let array = scirs2_autograd::ndarray::ArrayD::from_shape_vec(
        a.shape().dims(),
        result,
    )?;
    Ok(Tensor::from_array(array))
}

// Step 2: Register operation
pub fn register_add() {
    let desc = OperationDescriptor::new("add", "binary")
        .with_dtypes(vec![DType::Float32])
        .with_broadcast();
    F32_REGISTRY.register_operation(desc).unwrap();

    // Step 3: Register CPU kernel
    F32_REGISTRY.register_kernel(
        "add",
        KernelImplementation::binary(BackendType::Cpu, add_f32_cpu),
    ).unwrap();
}

// Step 4: Use in operation
pub fn add(a: &Tensor<f32>, b: &Tensor<f32>) -> Result<Tensor<f32>> {
    F32_REGISTRY.dispatch_binary("add", a, b)
}
```
## Architecture
### Type-Specific Registries
TenfloweRS uses type-specific registries to ensure type safety:
```rust
// Global registries, declared with lazy_static! (initializers illustrative)
lazy_static! {
    pub static ref F32_REGISTRY: DispatchRegistry<f32> = DispatchRegistry::new();
    pub static ref F64_REGISTRY: DispatchRegistry<f64> = DispatchRegistry::new();
    pub static ref I32_REGISTRY: DispatchRegistry<i32> = DispatchRegistry::new();
}
// Access via helper function
let registry = get_registry::<f32>().unwrap();
```
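How `get_registry::<T>()` finds the matching global is an implementation detail of `dispatch_registry.rs`; one plausible sketch ties each supported element type to its registry through a trait. The names here are illustrative assumptions, not the confirmed API:

```rust
// Hypothetical sketch; see dispatch_registry.rs for the real mechanism.
pub trait HasRegistry: Sized {
    fn registry() -> &'static DispatchRegistry<Self>;
}

impl HasRegistry for f32 {
    fn registry() -> &'static DispatchRegistry<f32> { &F32_REGISTRY }
}

impl HasRegistry for f64 {
    fn registry() -> &'static DispatchRegistry<f64> { &F64_REGISTRY }
}

pub fn get_registry<T: HasRegistry>() -> Option<&'static DispatchRegistry<T>> {
    Some(T::registry())
}
```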
### Backend Types
```rust
pub enum BackendType {
Cpu, // Always available
#[cfg(feature = "simd")]
SimdCpu, // SIMD-optimized CPU
#[cfg(feature = "blas")]
Blas, // BLAS library
#[cfg(feature = "gpu")]
Gpu, // WebGPU
#[cfg(feature = "cuda")]
Cuda, // NVIDIA CUDA
#[cfg(feature = "metal")]
Metal, // Apple Metal
#[cfg(feature = "rocm")]
Rocm, // AMD ROCm
}
```
### Backend Priority
Backends are selected by priority (higher = preferred):
| Backend | Priority | Typical Use Case         |
|---------|----------|--------------------------|
| Cpu     | 0        | Fallback, small tensors  |
| SimdCpu | 10       | Medium tensors, CPU-only |
| Blas    | 20       | Linear algebra on CPU    |
| Gpu     | 30       | General GPU (WebGPU)     |
| Cuda    | 40       | NVIDIA GPUs              |
| Rocm    | 40       | AMD GPUs                 |
| Metal   | 50       | Apple Silicon            |
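A sketch of how this table might translate into code; the method name `priority` is an assumption, but the values match the table above:

```rust
impl BackendType {
    /// Higher value = preferred. Mirrors the priority table above.
    pub fn priority(&self) -> u8 {
        match self {
            BackendType::Cpu => 0,
            #[cfg(feature = "simd")]
            BackendType::SimdCpu => 10,
            #[cfg(feature = "blas")]
            BackendType::Blas => 20,
            #[cfg(feature = "gpu")]
            BackendType::Gpu => 30,
            #[cfg(feature = "cuda")]
            BackendType::Cuda => 40,
            #[cfg(feature = "rocm")]
            BackendType::Rocm => 40,
            #[cfg(feature = "metal")]
            BackendType::Metal => 50,
        }
    }
}
```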
## Registration Patterns
### Pattern 1: Multi-Backend Registration
Register an operation for multiple backends:
```rust
pub fn register_matmul() {
    let desc = OperationDescriptor::new("matmul", "linalg")
        .with_dtypes(vec![DType::Float32])
        .with_rank_range(Some(2), None); // At least 2D
    F32_REGISTRY.register_operation(desc).unwrap();

    // CPU implementation (always available)
    F32_REGISTRY.register_kernel(
        "matmul",
        KernelImplementation::binary(BackendType::Cpu, matmul_f32_cpu),
    ).unwrap();

    // BLAS implementation (if available)
    #[cfg(feature = "blas")]
    F32_REGISTRY.register_kernel(
        "matmul",
        KernelImplementation::binary(BackendType::Blas, matmul_f32_blas),
    ).unwrap();

    // GPU implementation (if available)
    #[cfg(feature = "gpu")]
    F32_REGISTRY.register_kernel(
        "matmul",
        KernelImplementation::binary(BackendType::Gpu, matmul_f32_gpu),
    ).unwrap();
}
```
### Pattern 2: Multi-Type Registration
Register an operation for multiple data types:
```rust
pub fn register_abs_all_types() {
    // F32
    {
        let desc = OperationDescriptor::new("abs", "unary")
            .with_dtypes(vec![DType::Float32]);
        F32_REGISTRY.register_operation(desc).unwrap();
        F32_REGISTRY.register_kernel(
            "abs",
            KernelImplementation::unary(BackendType::Cpu, abs_f32_cpu),
        ).unwrap();
    }
    // F64
    {
        let desc = OperationDescriptor::new("abs", "unary")
            .with_dtypes(vec![DType::Float64]);
        F64_REGISTRY.register_operation(desc).unwrap();
        F64_REGISTRY.register_kernel(
            "abs",
            KernelImplementation::unary(BackendType::Cpu, abs_f64_cpu),
        ).unwrap();
    }
    // I32
    {
        let desc = OperationDescriptor::new("abs", "unary")
            .with_dtypes(vec![DType::Int32]);
        I32_REGISTRY.register_operation(desc).unwrap();
        I32_REGISTRY.register_kernel(
            "abs",
            KernelImplementation::unary(BackendType::Cpu, abs_i32_cpu),
        ).unwrap();
    }
}
```
### Pattern 3: Lazy Registration with Macros
Use macros to simplify registration:
```rust
// In your module
pub fn init_operations() {
    register_operation!(F32_REGISTRY, "abs", "unary",
        dtypes: [DType::Float32]);
    register_unary_kernel!(F32_REGISTRY, "abs",
        BackendType::Cpu, abs_f32_cpu);

    #[cfg(feature = "simd")]
    register_unary_kernel!(F32_REGISTRY, "abs",
        BackendType::SimdCpu, abs_f32_simd);
}
```
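If these macros are not yet available in your module, a minimal `macro_rules!` sketch of what they could expand to follows. This is hypothetical; check `dispatch_registry.rs` for the real definitions:

```rust
// Hypothetical expansions; the real macros live in dispatch_registry.rs.
macro_rules! register_operation {
    // Without a dtype list (as used in Pitfall 3 below)
    ($registry:expr, $name:expr, $category:expr) => {
        $registry
            .register_operation(OperationDescriptor::new($name, $category))
            .unwrap();
    };
    // With an explicit dtype list
    ($registry:expr, $name:expr, $category:expr, dtypes: [$($dt:expr),*]) => {
        $registry
            .register_operation(
                OperationDescriptor::new($name, $category)
                    .with_dtypes(vec![$($dt),*]),
            )
            .unwrap();
    };
}

macro_rules! register_unary_kernel {
    ($registry:expr, $name:expr, $backend:expr, $kernel:expr) => {
        $registry
            .register_kernel($name, KernelImplementation::unary($backend, $kernel))
            .unwrap();
    };
}
```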
## Kernel Implementation
### CPU Kernels
CPU kernels should be simple and correct; they serve as the reference implementations that other backends are validated against:
```rust
// Generic template for a unary CPU kernel; adapt the signature for binary ops.
fn operation_cpu<T: Clone>(input: &Tensor<T>) -> Result<Tensor<T>> {
    // 1. Validate inputs (shapes, constraints)
    // 2. Extract data
    let data = input.data();
    // 3. Perform the computation
    let result: Vec<T> = data.iter()
        .map(|v| /* operation */ v.clone())
        .collect();
    // 4. Package the result
    let array = ArrayD::from_shape_vec(input.shape().dims(), result)?;
    Ok(Tensor::from_array(array))
}
```
### SIMD Kernels
SIMD kernels should use scirs2_core SIMD abstractions:
```rust
#[cfg(feature = "simd")]
fn operation_simd(input: &Tensor<f32>) -> Result<Tensor<f32>> {
use scirs2_core::simd::{SimdArray, SimdOps};
// Use SIMD operations from scirs2_core
// If not available, fallback to CPU
operation_cpu(input) // Fallback for now
}
```
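Until explicit SIMD is in place, writing the kernel as a tight loop over contiguous data keeps it friendly to compiler auto-vectorization. A sketch for `abs`:

```rust
#[cfg(feature = "simd")]
fn abs_f32_simd(input: &Tensor<f32>) -> Result<Tensor<f32>> {
    // A branch-free elementwise loop like this is typically
    // auto-vectorized by LLVM even without explicit SIMD intrinsics.
    let data = input.data();
    let result: Vec<f32> = data.iter().map(|v| v.abs()).collect();
    let array = scirs2_autograd::ndarray::ArrayD::from_shape_vec(
        input.shape().dims(),
        result,
    )?;
    Ok(Tensor::from_array(array))
}
```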
### GPU Kernels
GPU kernels should use WebGPU compute shaders:
```rust
#[cfg(feature = "gpu")]
fn operation_gpu(input: &Tensor<f32>) -> Result<Tensor<f32>> {
    use crate::gpu::{GpuContext, execute_kernel};
    // 1. Get GPU context
    let context = GpuContext::get_or_create()?;
    // 2. Create (or fetch a cached) shader
    let shader = context.get_or_create_shader("operation", SHADER_SOURCE)?;
    // 3. Execute kernel
    execute_kernel(&context, &shader, input)
}
```
### BLAS Kernels
BLAS kernels should leverage optimized libraries:
```rust
#[cfg(feature = "blas")]
fn matmul_f32_blas(a: &Tensor<f32>, b: &Tensor<f32>) -> Result<Tensor<f32>> {
    // Delegate to the BLAS-backed implementation. The wrapper is named
    // matmul_f32_blas (as registered in Pattern 1) so it does not shadow
    // the crate::ops::lapack::matmul_blas function it calls.
    crate::ops::lapack::matmul_blas(a, b)
}
```
## Best Practices
### 1. Always Provide CPU Fallback
Every operation MUST have a CPU implementation:
```rust
// ✅ GOOD: CPU kernel registered unconditionally; GPU kernel feature-gated
F32_REGISTRY.register_kernel("op",
    KernelImplementation::unary(BackendType::Cpu, op_cpu)).unwrap();

#[cfg(feature = "gpu")]
F32_REGISTRY.register_kernel("op",
    KernelImplementation::unary(BackendType::Gpu, op_gpu)).unwrap();

// ❌ BAD: GPU-only registration
#[cfg(feature = "gpu")]
F32_REGISTRY.register_kernel("op",
    KernelImplementation::unary(BackendType::Gpu, op_gpu)).unwrap();
// No CPU fallback!
```
### 2. Use Shape Error Taxonomy
Use standardized error messages:
```rust
use crate::shape_error_taxonomy::{ShapeErrorBuilder, ShapeErrorCategory};
// ✅ GOOD: structured, categorized error
return Err(ShapeErrorBuilder::new("matmul", ShapeErrorCategory::MatMulIncompatible)
    .expected("(..., m, k) and (..., k, n)")
    .got(&format!("({:?}) and ({:?})", a.shape(), b.shape()))
    .detail(&format!("Inner dimensions must match: {} != {}", k1, k2))
    .build());

// ❌ BAD: unstructured message
return Err(TensorError::invalid_argument("matmul shapes don't match"));
```
### 3. Validate Inputs Early
Check preconditions before computation:
```rust
fn operation(a: &Tensor<f32>, b: &Tensor<f32>) -> Result<Tensor<f32>> {
    // ✅ Validate early
    if a.shape() != b.shape() {
        return Err(ShapeErrorBuilder::new("op",
                ShapeErrorCategory::ElementwiseMismatch)
            .expected(&format!("{:?}", a.shape()))
            .got(&format!("{:?}", b.shape()))
            .build());
    }
    // Now dispatch
    F32_REGISTRY.dispatch_binary("op", a, b)
}
```
### 4. Register at Module Initialization
Register operations when the module loads:
```rust
// In ops/mod.rs or specific operation module
pub fn register_all_operations() {
    register_unary_ops();
    register_binary_ops();
    register_reduction_ops();
    // ...
}

// Call this in lib.rs or at first use
lazy_static! {
    static ref INIT: () = {
        crate::ops::register_all_operations();
    };
}
```
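An equivalent pattern using `std::sync::Once` from the standard library, if you prefer an explicit entry point over a sentinel static; a minimal sketch:

```rust
use std::sync::Once;

static REGISTER: Once = Once::new();

/// Cheap to call from every public entry point; the closure runs once.
pub fn ensure_registered() {
    REGISTER.call_once(|| {
        crate::ops::register_all_operations();
    });
}
```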
### 5. Document Operation Constraints
Use OperationDescriptor to document constraints:
```rust
let desc = OperationDescriptor::new("conv2d", "nn")
.with_dtypes(vec![DType::Float32, DType::Float64])
.with_rank_range(Some(4), Some(4)) // Exactly 4D
.with_broadcast()
.with_inplace(); // If in-place is possible
```
## Testing
### Unit Tests for Kernels
Test each kernel implementation:
```rust
#[cfg(test)]
mod tests {
    use super::*;
    use scirs2_autograd::ndarray::array;

    #[test]
    fn test_abs_f32_cpu() {
        let input = Tensor::from_array(
            array![-1.0f32, 2.0, -3.0].into_dyn()
        );
        let result = abs_f32_cpu(&input).unwrap();
        assert_eq!(result.data(), &[1.0f32, 2.0, 3.0]);
    }

    #[test]
    fn test_abs_dispatch() {
        crate::ops::register_all_operations();
        let input = Tensor::from_array(
            array![-1.0f32, 2.0, -3.0].into_dyn()
        );
        let result = F32_REGISTRY.dispatch_unary("abs", &input).unwrap();
        assert_eq!(result.data(), &[1.0f32, 2.0, 3.0]);
    }
}
```
### Cross-Backend Consistency Tests
Ensure all backends produce results that agree within tolerance:
```rust
#[test]
#[cfg(all(feature = "simd", feature = "gpu"))]
fn test_cross_backend_consistency() {
    let input = Tensor::from_array(/* ... */);

    let cpu_result = F32_REGISTRY.dispatch_unary_on_backend(
        "op", &input, BackendType::Cpu,
    ).unwrap();
    let simd_result = F32_REGISTRY.dispatch_unary_on_backend(
        "op", &input, BackendType::SimdCpu,
    ).unwrap();
    let gpu_result = F32_REGISTRY.dispatch_unary_on_backend(
        "op", &input, BackendType::Gpu,
    ).unwrap();

    // Allow small numerical differences
    assert_tensors_close(&cpu_result, &simd_result, 1e-6);
    assert_tensors_close(&cpu_result, &gpu_result, 1e-5);
}
```
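`assert_tensors_close` is used above but not defined in this guide; a minimal sketch comparing elementwise absolute differences:

```rust
#[cfg(test)]
fn assert_tensors_close(a: &Tensor<f32>, b: &Tensor<f32>, tol: f32) {
    assert_eq!(a.shape(), b.shape(), "shape mismatch");
    for (i, (x, y)) in a.data().iter().zip(b.data().iter()).enumerate() {
        assert!(
            (x - y).abs() <= tol,
            "element {i} differs: {x} vs {y} (tol = {tol})"
        );
    }
}
```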
### Performance Tests
Benchmark different backends:
```rust
// Requires the nightly `test` crate (#![feature(test)] and `extern crate test;`).
use test::Bencher;

#[bench]
fn bench_abs_cpu(b: &mut Bencher) {
    let input = Tensor::<f32>::randn(&[1000, 1000]);
    b.iter(|| {
        F32_REGISTRY.dispatch_unary_on_backend(
            "abs", &input, BackendType::Cpu,
        )
    });
}
```
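`#[bench]` requires the nightly `test` crate. On stable toolchains, a criterion benchmark covers the same ground; a sketch assuming a `criterion` dev-dependency:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_abs(c: &mut Criterion) {
    // Assumes operations were registered during initialization (Best Practice 4).
    let input = Tensor::<f32>::randn(&[1000, 1000]);
    c.bench_function("abs_cpu", |b| {
        b.iter(|| {
            F32_REGISTRY
                .dispatch_unary_on_backend("abs", &input, BackendType::Cpu)
                .unwrap()
        })
    });
}

criterion_group!(benches, bench_abs);
criterion_main!(benches);
```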
## Common Pitfalls
### Pitfall 1: Forgetting to Register
```rust
// ❌ BAD: dispatching without ever registering
pub fn sqrt(x: &Tensor<f32>) -> Result<Tensor<f32>> {
    F32_REGISTRY.dispatch_unary("sqrt", x) // Not registered!
}

// ✅ GOOD: force registration before the first dispatch
lazy_static! {
    static ref INIT: () = { register_sqrt(); };
}

pub fn sqrt(x: &Tensor<f32>) -> Result<Tensor<f32>> {
    let _ = *INIT; // Ensure registration
    F32_REGISTRY.dispatch_unary("sqrt", x)
}
```
### Pitfall 2: Wrong Registry for Type
```rust
// ❌ BAD: f64 tensors sent to the f32 registry
pub fn abs(x: &Tensor<f64>) -> Result<Tensor<f64>> {
    F32_REGISTRY.dispatch_unary("abs", x) // Wrong registry!
}

// ✅ GOOD: registry matches the element type
pub fn abs(x: &Tensor<f64>) -> Result<Tensor<f64>> {
    F64_REGISTRY.dispatch_unary("abs", x)
}
```
### Pitfall 3: Duplicate Registration
```rust
// ❌ BAD: the second registration returns an error, and the unwrap panics
register_operation!(F32_REGISTRY, "add", "binary");
register_operation!(F32_REGISTRY, "add", "binary"); // Error!

// ✅ GOOD: register exactly once
register_operation!(F32_REGISTRY, "add", "binary");
```
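If the same registration path can legitimately run twice (library init plus tests, say), tolerate the duplicate instead of unwrapping. A sketch; the `is_already_registered` predicate is an assumed helper on the error type, not a confirmed API:

```rust
// Hypothetical: tolerate benign double-registration instead of panicking.
match F32_REGISTRY.register_operation(desc) {
    Ok(()) => {}
    Err(e) if e.is_already_registered() => {} // already registered: benign
    Err(e) => panic!("failed to register 'add': {e}"),
}
```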
### Pitfall 4: Missing Feature Gates
```rust
// ❌ BAD: BackendType::Gpu only exists when the "gpu" feature is enabled,
// so this fails to compile in default builds
F32_REGISTRY.register_kernel("op",
    KernelImplementation::unary(BackendType::Gpu, op_gpu)).unwrap();

// ✅ GOOD: gate the registration behind the same feature as the variant
#[cfg(feature = "gpu")]
F32_REGISTRY.register_kernel("op",
    KernelImplementation::unary(BackendType::Gpu, op_gpu)).unwrap();
```
## Migration Checklist
When migrating an existing operation to use the dispatch registry:
- [ ] Create CPU kernel implementation
- [ ] Register operation with OperationDescriptor
- [ ] Register CPU kernel
- [ ] Add SIMD kernel (if applicable)
- [ ] Add GPU kernel (if applicable)
- [ ] Add BLAS kernel (if applicable)
- [ ] Update public API to use dispatch
- [ ] Add unit tests for each backend
- [ ] Add cross-backend consistency test
- [ ] Add performance benchmark
- [ ] Update documentation
- [ ] Remove old dispatch code
## References
- [dispatch_registry.rs](/src/dispatch_registry.rs) - Core registry implementation
- [dispatch_registry_examples.rs](/src/dispatch_registry_examples.rs) - Example registrations
- [shape_error_taxonomy.rs](/src/shape_error_taxonomy.rs) - Error message standards
- [GPU Kernel Priorities](GPU_KERNEL_PRIORITIES.md) - GPU development roadmap
---
**Questions?** Ask in #tenflowers-dev or file an issue
**Contributions:** Please follow this guide when adding new operations
**Last Updated:** 2026-03-20