torsh-vision 0.1.2

# ToRSh Vision - Best Practices Guide

This guide collects best practices for running ToRSh Vision in production computer vision applications. It covers performance optimization, memory management, hardware utilization, error handling, data pipelines, model development, deployment, testing, code organization, and monitoring.

## Table of Contents

1. [Performance Optimization](#performance-optimization)
2. [Memory Management](#memory-management)
3. [Hardware Utilization](#hardware-utilization)
4. [Error Handling and Debugging](#error-handling-and-debugging)
5. [Data Pipeline Design](#data-pipeline-design)
6. [Model Development](#model-development)
7. [Production Deployment](#production-deployment)
8. [Testing and Validation](#testing-and-validation)
9. [Code Organization](#code-organization)
10. [Monitoring and Profiling](#monitoring-and-profiling)

## Performance Optimization

### 1. Choose Optimal Data Types

```rust
// Use appropriate data types for your use case
let image_f32 = Tensor::zeros([3, 224, 224], DType::F32, Device::Cpu);  // Training
let image_f16 = Tensor::zeros([3, 224, 224], DType::F16, Device::Cpu);  // Inference (if supported)
let image_u8 = Tensor::zeros([3, 224, 224], DType::U8, Device::Cpu);    // Storage/I/O

// Convert when necessary
let training_tensor = image_u8.to(DType::F32)? / 255.0;  // Normalize to [0,1]
```
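
The `u8` to `f32` conversion above can be illustrated without any tensor library. The following is a minimal sketch in plain Rust; `u8_to_unit_f32` is a hypothetical helper (not a ToRSh Vision function) and it works on a flat pixel slice rather than a `Tensor`.

```rust
/// Hypothetical helper, not part of ToRSh Vision:
/// maps 8-bit pixel values to f32 values in the range [0, 1].
fn u8_to_unit_f32(pixels: &[u8]) -> Vec<f32> {
    pixels.iter().map(|&p| f32::from(p) / 255.0).collect()
}
```

Storing images as `u8` and converting only when a batch is assembled keeps the stored data at a quarter of the size of its `f32` form.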

### 2. Optimize Transform Pipelines

```rust
// ✅ Good: Batch operations when possible
fn efficient_batch_processing(images: &[Tensor]) -> Result<Vec<Tensor>> {
    let transforms = TransformBuilder::new()
        .resize((224, 224))
        .imagenet_normalize()
        .build();
    
    // Stack into a batch; this requires every image to already share the same shape
    let batch = Tensor::stack(images, 0)?;
    let processed_batch = transforms.forward(&batch)?;
    
    // Unstack back to individual tensors
    (0..images.len()).map(|i| processed_batch.select(0, i))
        .collect::<Result<Vec<_>>>()
}

// ❌ Avoid: Processing images one by one when batch processing is possible
fn inefficient_processing(images: &[Tensor]) -> Result<Vec<Tensor>> {
    let transforms = TransformBuilder::new()
        .resize((224, 224))
        .imagenet_normalize()
        .build();
    
    images.iter()
        .map(|img| transforms.forward(img))
        .collect()
}
```

### 3. Use Hardware-Aware Transforms

```rust
// Automatically select optimal implementation
fn create_optimal_pipeline() -> Result<Box<dyn Transform>> {
    let hardware = HardwareContext::auto_detect()?;
    
    if hardware.cuda_available() {
        // Use GPU-accelerated transforms
        Ok(Box::new(create_gpu_pipeline()?))
    } else {
        // Use CPU-optimized transforms
        Ok(Box::new(create_cpu_pipeline()?))
    }
}

// Unified transform API for automatic optimization
fn use_unified_transforms(image: &Tensor) -> Result<Tensor> {
    let context = TransformContext::auto_optimize()?;
    let resize = UnifiedResize::new((224, 224));
    
    // Automatically uses best available hardware
    let processed = resize.apply(image, &context)?;
    Ok(processed)
}
```

### 4. Efficient Memory Access Patterns

```rust
// ✅ Good: Sequential memory access
fn efficient_image_processing(image: &Tensor) -> Result<Tensor> {
    // Process channels sequentially for better cache locality
    let mut result = image.clone();
    
    for channel in 0..3 {
        let channel_data = result.select(0, channel)?;
        // Process channel data...
    }
    
    Ok(result)
}

// ✅ Good: Reuse tensors when possible
struct ImageProcessor {
    temp_buffer: Tensor,
}

impl ImageProcessor {
    fn new() -> Result<Self> {
        Ok(Self {
            temp_buffer: Tensor::zeros([3, 1024, 1024], DType::F32, Device::Cpu),
        })
    }
    
    fn process(&mut self, image: &Tensor) -> Result<&Tensor> {
        // Reuse temp_buffer instead of allocating new tensors
        self.temp_buffer.copy_from(image)?;
        // Process temp_buffer...
        // Return a borrow: cloning here may allocate a new tensor and defeat the reuse
        Ok(&self.temp_buffer)
    }
}
```

## Memory Management

### 1. Configure Global Memory Settings

```rust
use torsh_vision::memory::*;

// Configure at application startup
fn setup_memory_management() -> Result<()> {
    let settings = MemorySettings {
        enable_pooling: true,
        max_pool_size: 1000,              // Pool up to 1000 tensors
        max_batch_memory_mb: 4096,        // 4GB max per batch
        enable_profiling: true,           // Monitor memory usage
        auto_optimization: true,          // Automatic optimizations
    };
    
    configure_global_memory(settings);
    
    // Monitor memory usage
    let profiler = MemoryProfiler::new();
    profiler.start_monitoring()?;
    
    Ok(())
}
```

### 2. Use Tensor Pooling for Training Loops

```rust
fn memory_efficient_training(num_epochs: usize, num_batches: usize) -> Result<()> {
    let mut tensor_pool = TensorPool::new(200);  // Pool for 200 tensors
    
    for epoch in 0..num_epochs {
        for batch_idx in 0..num_batches {
            // Get tensors from pool
            let batch_tensor = tensor_pool.get_tensor(&[32, 3, 224, 224])?;
            let label_tensor = tensor_pool.get_tensor(&[32])?;
            
            // Use tensors for training
            // ... training code ...
            
            // Return to pool when done
            tensor_pool.return_tensor(batch_tensor)?;
            tensor_pool.return_tensor(label_tensor)?;
        }
        
        // Monitor pool efficiency
        let stats = tensor_pool.stats();
        println!("Pool reuse rate: {:.1}%", stats.reuse_rate * 100.0);
    }
    
    Ok(())
}
```
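
To show what a pool does and how a reuse rate can be measured, here is a minimal sketch using only the standard library. `BufferPool` is a hypothetical stand-in that recycles `Vec<f32>` buffers keyed by length; it is not the `TensorPool` type used above and makes no claim about how that type is implemented.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a tensor pool: recycles `Vec<f32>` buffers by length.
struct BufferPool {
    free: HashMap<usize, Vec<Vec<f32>>>,
    max_buffers: usize,
    pooled: usize,
    requests: usize,
    reused: usize,
}

impl BufferPool {
    fn new(max_buffers: usize) -> Self {
        Self { free: HashMap::new(), max_buffers, pooled: 0, requests: 0, reused: 0 }
    }

    /// Returns a zeroed buffer of `len` elements, reusing a pooled one when available.
    fn get(&mut self, len: usize) -> Vec<f32> {
        self.requests += 1;
        if let Some(mut buf) = self.free.get_mut(&len).and_then(|bufs| bufs.pop()) {
            self.pooled -= 1;
            self.reused += 1;
            buf.fill(0.0);
            return buf;
        }
        vec![0.0; len]
    }

    /// Hands a buffer back to the pool; it is dropped if the pool is already full.
    fn put(&mut self, buf: Vec<f32>) {
        if self.pooled < self.max_buffers {
            self.pooled += 1;
            self.free.entry(buf.len()).or_default().push(buf);
        }
    }

    /// Fraction of `get` calls that were served from the pool.
    fn reuse_rate(&self) -> f64 {
        if self.requests == 0 {
            0.0
        } else {
            self.reused as f64 / self.requests as f64
        }
    }
}
```

In a steady-state training loop where every iteration requests the same shapes, the reuse rate of such a pool approaches 100% after the first iteration.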

### 3. Optimize Batch Sizes

```rust
use torsh_vision::memory::MemoryOptimizer;

fn calculate_optimal_batch_size(image_shape: &[usize], available_memory_gb: f32) -> Result<usize> {
    let memory_optimizer = MemoryOptimizer::new();
    
    let optimal_batch = memory_optimizer.calculate_optimal_batch_size(
        image_shape,
        (available_memory_gb * 1024.0) as usize,  // Convert to MB
        0.8  // Use 80% of available memory
    )?;
    
    println!("Optimal batch size for {}GB memory: {}", available_memory_gb, optimal_batch);
    Ok(optimal_batch)
}

// Dynamic batch size adjustment
fn adaptive_batch_processing() -> Result<()> {
    let mut current_batch_size: usize = 32;
    let max_batch_size: usize = 128;
    
    // In a real pipeline, also break out of this loop once the input is exhausted
    loop {
        match process_batch_with_size(current_batch_size) {
            Ok(_) => {
                // Successful - try a ~20% larger batch, capped at the maximum
                if current_batch_size < max_batch_size {
                    current_batch_size = (current_batch_size * 6 / 5).min(max_batch_size);
                }
            }
            Err(VisionError::OutOfMemory(_)) => {
                // Out of memory - shrink by ~20% (integer math: usize cannot be multiplied by a float)
                current_batch_size = current_batch_size * 4 / 5;
                if current_batch_size < 1 {
                    break;
                }
            }
            Err(e) => return Err(e),
        }
    }
    
    Ok(())
}
```
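
The arithmetic behind a memory-based batch size can be sketched in a few lines. This is an illustration under simplifying assumptions, not the logic of `MemoryOptimizer`: the hypothetical `max_batch_size` below counts only the raw input data and ignores activations, gradients, and optimizer state, which usually dominate during training.

```rust
/// Bytes needed to store one sample of the given shape.
fn sample_bytes(shape: &[usize], bytes_per_element: usize) -> usize {
    shape.iter().product::<usize>() * bytes_per_element
}

/// Hypothetical estimate: the largest batch whose raw input data fits into
/// `fraction` of a budget given in MB. Returns 0 for an empty shape product.
fn max_batch_size(
    shape: &[usize],
    bytes_per_element: usize,
    budget_mb: usize,
    fraction: f64,
) -> usize {
    let budget_bytes = (budget_mb as f64 * 1024.0 * 1024.0 * fraction) as usize;
    let per_sample = sample_bytes(shape, bytes_per_element);
    if per_sample == 0 {
        0
    } else {
        budget_bytes / per_sample
    }
}
```

For example, a `[3, 224, 224]` image in `f32` takes 602,112 bytes, so 80% of a 1 GB budget holds 1,426 such images. Treat a number like this as an upper bound and confirm the real limit empirically.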

### 4. Clean Up Resources

```rust
// Implement proper cleanup
struct VisionPipeline {
    tensor_pool: TensorPool,
    memory_profiler: MemoryProfiler,
}

impl Drop for VisionPipeline {
    fn drop(&mut self) {
        // Clean up resources
        self.tensor_pool.clear();
        let final_stats = self.memory_profiler.summary();
        println!("Peak memory usage: {:.2} GB", 
                 final_stats.peak_usage_bytes as f64 / (1024.0 * 1024.0 * 1024.0));
    }
}
```

## Hardware Utilization

### 1. Automatic Hardware Detection

```rust
fn setup_hardware_context() -> Result<TransformContext> {
    let hardware = HardwareContext::auto_detect()?;
    
    println!("Hardware capabilities:");
    println!("  CUDA available: {}", hardware.cuda_available());
    println!("  Mixed precision: {}", hardware.supports_mixed_precision());
    println!("  Tensor cores: {}", hardware.has_tensor_cores());
    println!("  Memory: {:.1} GB", hardware.memory_gb());
    
    let context = TransformContext::new(hardware)
        .with_auto_optimization(true)
        .with_mixed_precision_enabled(true);
    
    Ok(context)
}
```

### 2. GPU Memory Management

```rust
fn gpu_memory_best_practices() -> Result<()> {
    let hardware = HardwareContext::auto_detect()?;
    
    if hardware.cuda_available() {
        // Pre-allocate GPU memory to avoid fragmentation
        let gpu_pool = GpuTensorPool::new(hardware.memory_gb() * 0.8)?;  // Use 80% of GPU memory
        
        // Use pinned memory for faster CPU-GPU transfers
        let cpu_tensor = Tensor::zeros_pinned([32, 3, 224, 224], DType::F32)?;
        let gpu_tensor = cpu_tensor.to_device(Device::Cuda(0))?;
        
        // Batch GPU operations
        let batch_operations = vec![
            gpu_resize,
            gpu_normalize,
            gpu_augment,
        ];
        
        let result = batch_gpu_operations(&gpu_tensor, &batch_operations)?;
    }
    
    Ok(())
}
```

### 3. Mixed Precision Training

```rust
fn mixed_precision_best_practices() -> Result<()> {
    let hardware = HardwareContext::auto_detect()?;
    
    if hardware.supports_mixed_precision() {
        let mut mixed_precision = MixedPrecisionTraining::new()?;
        
        // Use f16 for forward pass to save memory
        let image_f16 = image.to(DType::F16)?;
        let features_f16 = model.forward(&image_f16)?;
        
        // Convert to f32 for loss computation (for numerical stability)
        let features_f32 = features_f16.to(DType::F32)?;
        let loss = compute_loss(&features_f32, &targets)?;
        
        // Scale loss for f16 gradient computation
        let scaled_loss = mixed_precision.scale_loss(&loss)?;
        
        // Backward pass and unscale gradients
        scaled_loss.backward()?;
        mixed_precision.unscale_gradients(&optimizer)?;
        mixed_precision.step(&optimizer)?;
    }
    
    Ok(())
}
```
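
The loss scaling step above exists because small `f16` gradients underflow to zero. A common approach is dynamic scaling: multiply the loss by a large factor, divide the gradients by the same factor before the optimizer step, halve the factor when gradients overflow, and grow it again after a run of clean steps. The sketch below shows that control logic on plain `f32` values; `LossScaler` is hypothetical and is not the `MixedPrecisionTraining` type used above.

```rust
/// Hypothetical dynamic loss scaler operating on plain f32 values.
struct LossScaler {
    scale: f32,
    growth_interval: u32,
    good_steps: u32,
}

impl LossScaler {
    fn new(initial_scale: f32, growth_interval: u32) -> Self {
        Self { scale: initial_scale, growth_interval, good_steps: 0 }
    }

    /// Multiplies the loss by the current scale before the backward pass.
    fn scale_loss(&self, loss: f32) -> f32 {
        loss * self.scale
    }

    /// Unscales gradients in place. Returns `false` and halves the scale if any
    /// gradient is non-finite; the caller should then skip the optimizer step.
    fn unscale_and_check(&mut self, grads: &mut [f32]) -> bool {
        if grads.iter().any(|g| !g.is_finite()) {
            self.scale = (self.scale / 2.0).max(1.0);
            self.good_steps = 0;
            return false;
        }
        for g in grads.iter_mut() {
            *g /= self.scale;
        }
        self.good_steps += 1;
        // Double the scale after `growth_interval` consecutive clean steps.
        if self.good_steps >= self.growth_interval {
            self.scale *= 2.0;
            self.good_steps = 0;
        }
        true
    }
}
```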

## Error Handling and Debugging

### 1. Comprehensive Error Handling

```rust
use torsh_vision::error_handling::*;

// Use enhanced error types for better debugging
fn robust_image_processing(image_path: &str) -> Result<Tensor> {
    let image = VisionIO::load_image(image_path)
        .map_err(|e| EnhancedVisionError::IoError {
            path: image_path.to_string(),
            operation: "load_image".to_string(),
            message: e.to_string(),
            suggestions: vec![
                "Check file exists and is readable".to_string(),
                "Verify image format is supported".to_string(),
            ],
        })?;
    
    // Validate image before processing
    validate_image_tensor(&image)
        .map_err(|e| EnhancedVisionError::ValidationError {
            tensor_shape: image.shape().dims().to_vec(),
            expected_constraints: "CHW format with C=1 or C=3".to_string(),
            message: e.to_string(),
            suggestions: vec![
                "Convert to RGB if grayscale".to_string(),
                "Check image loading was successful".to_string(),
            ],
        })?;
    
    Ok(image)
}

// Error recovery strategies
fn process_with_fallbacks(image: &Tensor) -> Result<Tensor> {
    // Try GPU processing first
    if let Ok(result) = try_gpu_processing(image) {
        return Ok(result);
    }
    
    println!("GPU processing failed, falling back to CPU");
    
    // Fallback to CPU processing
    if let Ok(result) = try_cpu_processing(image) {
        return Ok(result);
    }
    
    println!("CPU processing failed, using minimal processing");
    
    // Minimal fallback
    Ok(image.clone())
}
```
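
The fallback chain above discards the error from each failed attempt. When those errors matter for diagnosis, one option is to collect them and return them only if every strategy fails. The sketch below is a generic, standard-library-only illustration; `first_success` is a hypothetical helper, not a ToRSh Vision function.

```rust
/// Hypothetical helper: runs named strategies in order and returns the first
/// success together with its name. If all strategies fail, returns every
/// error in the order the strategies were tried.
fn first_success<'a, T, E>(
    strategies: Vec<(&'a str, Box<dyn Fn() -> Result<T, E> + 'a>)>,
) -> Result<(&'a str, T), Vec<(&'a str, E)>> {
    let mut failures = Vec::new();
    for (name, run) in strategies {
        match run() {
            Ok(value) => return Ok((name, value)),
            Err(err) => failures.push((name, err)),
        }
    }
    Err(failures)
}
```

Returning the name of the strategy that succeeded also lets the caller log which path was taken instead of printing from inside the helper.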

### 2. Debugging Utilities

```rust
// Debug transform pipelines
fn debug_transform_pipeline() -> Result<()> {
    let mut debug_pipeline = DebugTransformPipeline::new();
    
    debug_pipeline
        .add(Box::new(Resize::new((224, 224))))
        .add(Box::new(RandomHorizontalFlip::new(0.5)))
        .add(Box::new(Normalize::new(imagenet_mean(), imagenet_std())));
    
    let input = Tensor::randn([3, 512, 512], DType::F32, Device::Cpu);
    
    // Process with debugging
    let result = debug_pipeline.forward_with_debug(&input)?;
    
    // Print debug information
    for (i, debug_info) in debug_pipeline.debug_info().iter().enumerate() {
        println!("Step {}: {} -> {}", 
                 i, 
                 debug_info.input_shape_str(), 
                 debug_info.output_shape_str());
        println!("  Time: {:.2}ms", debug_info.execution_time_ms());
        println!("  Memory: {:.1}MB", debug_info.memory_usage_mb());
    }
    
    Ok(())
}

// Tensor inspection utilities
fn inspect_tensor(tensor: &Tensor, name: &str) {
    println!("Tensor '{}' inspection:", name);
    println!("  Shape: {:?}", tensor.shape());
    println!("  DType: {:?}", tensor.dtype());
    println!("  Device: {:?}", tensor.device());
    
    if let Ok(stats) = calculate_tensor_stats(tensor) {
        println!("  Mean: {:.4}", stats.mean);
        println!("  Std: {:.4}", stats.std);
        println!("  Min: {:.4}", stats.min);
        println!("  Max: {:.4}", stats.max);
    }
}
```
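
For reference, the statistics printed by `inspect_tensor` can be computed as follows on a flat slice. This is only a sketch of what a helper like `calculate_tensor_stats` might compute; `slice_stats` and `Stats` are hypothetical names, and the choice of the population standard deviation (dividing by `n`) is an assumption.

```rust
/// Hypothetical summary statistics for a flat slice of values.
#[derive(Debug)]
struct Stats {
    mean: f64,
    std: f64,
    min: f64,
    max: f64,
}

/// Computes mean, population standard deviation, min and max.
/// Returns `None` for empty input instead of producing NaN.
fn slice_stats(values: &[f32]) -> Option<Stats> {
    if values.is_empty() {
        return None;
    }
    let n = values.len() as f64;
    // Accumulate in f64 to limit rounding error on large tensors.
    let mean = values.iter().map(|&v| f64::from(v)).sum::<f64>() / n;
    let var = values.iter().map(|&v| (f64::from(v) - mean).powi(2)).sum::<f64>() / n;
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    Some(Stats { mean, std: var.sqrt(), min: f64::from(min), max: f64::from(max) })
}
```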

### 3. Assertion and Validation

```rust
// Use assertions for debugging
fn validate_training_data(image: &Tensor, label: &Tensor) -> Result<()> {
    // Shape assertions
    debug_assert_eq!(image.ndim(), 3, "Image must be 3D (CHW)");
    debug_assert_eq!(image.shape()[0], 3, "Image must have 3 channels");
    debug_assert_eq!(label.ndim(), 0, "Label must be scalar");
    
    // Value range assertions
    if cfg!(debug_assertions) {
        let image_min = image.min()?;
        let image_max = image.max()?;
        assert!(image_min >= 0.0, "Image values must be non-negative");
        assert!(image_max <= 1.0, "Image values must be <= 1.0 for normalized data");
    }
    
    Ok(())
}

// Production validation
fn production_validate_batch(images: &Tensor, labels: &Tensor) -> Result<()> {
    // Validate shapes
    if images.ndim() != 4 {
        return Err(VisionError::InvalidShape(
            format!("Expected 4D batch tensor, got {}D", images.ndim())
        ));
    }
    
    let batch_size = images.shape()[0];
    if labels.shape()[0] != batch_size {
        return Err(VisionError::InvalidShape(
            format!("Batch size mismatch: images={}, labels={}", 
                    batch_size, labels.shape()[0])
        ));
    }
    
    // Validate data ranges
    let image_stats = calculate_tensor_stats(images)?;
    if image_stats.min < -3.0 || image_stats.max > 3.0 {
        return Err(VisionError::InvalidInput(
            "Image values outside expected range [-3, 3]".to_string()
        ));
    }
    
    Ok(())
}
```

## Data Pipeline Design

### 1. Efficient Data Loading

```rust
fn create_production_data_pipeline() -> Result<DataPipeline> {
    let config = DatasetConfig {
        cache_size_mb: 8192,           // 8GB cache
        prefetch_size: 256,            // Prefetch 256 samples
        max_workers: num_cpus::get(),  // Use all CPU cores
        enable_validation: true,       // Validate data integrity
        compression: true,             // Compress cached data
        memory_mapping: false,         // Use regular I/O for now
    };
    
    // Create optimized dataset
    let dataset = OptimizedImageDataset::new("data/train", config)?
        .with_transforms(create_training_transforms()?)
        .with_error_recovery(true);   // Handle corrupted images gracefully
    
    // Create parallel data loader
    let data_loader = ParallelDataLoader::new(dataset, 8)?  // 8 worker threads
        .with_batch_size(64)
        .with_shuffle(true)
        .with_pin_memory(true)        // For GPU training
        .with_drop_last(true);        // Consistent batch sizes
    
    Ok(DataPipeline::new(data_loader))
}
```

### 2. Separate Training and Validation Pipelines

```rust
struct TrainingPipeline {
    train_loader: ParallelDataLoader<OptimizedImageDataset>,
    val_loader: ParallelDataLoader<OptimizedImageDataset>,
}

impl TrainingPipeline {
    fn new(data_root: &str) -> Result<Self> {
        // Training pipeline with augmentation
        let train_transforms = TransformBuilder::new()
            .resize((256, 256))
            .random_resized_crop((224, 224))
            .random_horizontal_flip(0.5)
            .color_jitter(0.4, 0.4, 0.4, 0.1)
            .random_erasing(0.25)
            .imagenet_normalize()
            .build();
        
        // Validation pipeline without augmentation
        let val_transforms = TransformBuilder::new()
            .resize((256, 256))
            .center_crop((224, 224))
            .imagenet_normalize()
            .build();
        
        let train_dataset = OptimizedImageDataset::new(
            &format!("{}/train", data_root),
            DatasetConfig::default_training()
        )?.with_transforms(train_transforms);
        
        let val_dataset = OptimizedImageDataset::new(
            &format!("{}/val", data_root),
            DatasetConfig::default_validation()
        )?.with_transforms(val_transforms);
        
        Ok(Self {
            train_loader: ParallelDataLoader::new(train_dataset, 8)?,
            val_loader: ParallelDataLoader::new(val_dataset, 4)?,
        })
    }
}
```

### 3. Progressive Loading

```rust
use std::thread;

// Load data progressively for large datasets
struct ProgressiveDataLoader {
    current_loader: ParallelDataLoader<OptimizedImageDataset>,
    next_batch_prepared: bool,
    background_loader: Option<thread::JoinHandle<()>>,
}

impl ProgressiveDataLoader {
    fn new(dataset: OptimizedImageDataset) -> Result<Self> {
        let loader = ParallelDataLoader::new(dataset, 4)?;
        
        Ok(Self {
            current_loader: loader,
            next_batch_prepared: false,
            background_loader: None,
        })
    }
    
    fn prepare_next_batch(&mut self) -> Result<()> {
        if !self.next_batch_prepared {
            // Start background loading for next batch
            // Implementation details...
            self.next_batch_prepared = true;
        }
        Ok(())
    }
}
```
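
The background loading step that the struct above leaves open can be sketched with a bounded channel from the standard library. This is one possible design, not the implementation behind `ProgressiveDataLoader`; `prefetch` and its `make_batch` callback are hypothetical stand-ins for loading and transforming one batch.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Hypothetical prefetcher: a producer thread prepares batches ahead of the
/// consumer, with at most `depth` finished batches waiting in the queue.
fn prefetch<T, F>(num_batches: usize, depth: usize, make_batch: F) -> Receiver<T>
where
    T: Send + 'static,
    F: Fn(usize) -> T + Send + 'static,
{
    let (tx, rx) = sync_channel(depth);
    thread::spawn(move || {
        for i in 0..num_batches {
            // `send` blocks while the queue is full and fails once the
            // receiver has been dropped, which stops the producer.
            if tx.send(make_batch(i)).is_err() {
                break;
            }
        }
    });
    rx
}
```

The consumer iterates over the returned receiver; iteration ends when the producer has sent its last batch. The bounded queue caps memory use because the producer blocks as soon as `depth` batches are waiting.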

## Model Development

### 1. Model Architecture Best Practices

```rust
// Use builder pattern for complex models
struct ResNetBuilder {
    layers: Vec<usize>,
    num_classes: usize,
    dropout_rate: Option<f32>,
    batch_norm: bool,
}

impl ResNetBuilder {
    fn new() -> Self {
        Self {
            layers: vec![2, 2, 2, 2],  // ResNet-18 default
            num_classes: 1000,
            dropout_rate: None,
            batch_norm: true,
        }
    }
    
    fn layers(mut self, layers: Vec<usize>) -> Self {
        self.layers = layers;
        self
    }
    
    fn num_classes(mut self, num_classes: usize) -> Self {
        self.num_classes = num_classes;
        self
    }
    
    fn dropout(mut self, rate: f32) -> Self {
        self.dropout_rate = Some(rate);
        self
    }
    
    fn build(self) -> Result<ResNet> {
        ResNet::new(self.layers, self.num_classes)
            .with_dropout(self.dropout_rate)
            .with_batch_norm(self.batch_norm)
    }
}

// Usage
let model = ResNetBuilder::new()
    .layers(vec![3, 4, 6, 3])  // ResNet-34
    .num_classes(100)          // CIFAR-100
    .dropout(0.5)
    .build()?;
```

### 2. Model Initialization

```rust
// Proper weight initialization
fn initialize_model(model: &mut dyn Module) -> Result<()> {
    for (name, param) in model.named_parameters() {
        match name.as_str() {
            name if name.contains("conv") && name.contains("weight") => {
                // Kaiming initialization for convolutional layers
                kaiming_normal_(param, 0.0, "fan_out", "relu")?;
            }
            name if name.contains("bn") && name.contains("weight") => {
                // BatchNorm weights to 1
                constant_(param, 1.0)?;
            }
            name if name.contains("bn") && name.contains("bias") => {
                // BatchNorm bias to 0
                constant_(param, 0.0)?;
            }
            name if name.contains("fc") && name.contains("weight") => {
                // Xavier initialization for fully connected layers
                xavier_normal_(param)?;
            }
            _ => {
                // Default initialization
                normal_(param, 0.0, 0.01)?;
            }
        }
    }
    Ok(())
}
```
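
The two initialization schemes named above reduce to simple formulas for the standard deviation of the sampling distribution. The sketch below states them in plain Rust, assuming the usual definitions: Kaiming normal in `fan_out` mode with a ReLU gain of `sqrt(2)`, and Xavier normal with a gain of 1. The function names are hypothetical.

```rust
/// Kaiming (He) normal, "fan_out" mode, ReLU gain: std = sqrt(2 / fan_out),
/// where fan_out = out_channels * kernel_h * kernel_w for a 2D convolution.
fn kaiming_std_fan_out(out_channels: usize, kernel_h: usize, kernel_w: usize) -> f64 {
    let fan_out = (out_channels * kernel_h * kernel_w) as f64;
    (2.0 / fan_out).sqrt()
}

/// Xavier (Glorot) normal: std = sqrt(2 / (fan_in + fan_out)).
fn xavier_std(fan_in: usize, fan_out: usize) -> f64 {
    (2.0 / (fan_in + fan_out) as f64).sqrt()
}
```

For example, a convolution with 8 output channels and a 4x4 kernel has `fan_out = 128`, giving a standard deviation of 0.125.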

### 3. Model Validation

```rust
fn validate_model_architecture(model: &dyn Module) -> Result<()> {
    // Check parameter count
    let total_params = model.parameters().iter()
        .map(|p| p.numel())
        .sum::<usize>();
    
    println!("Total parameters: {:.2}M", total_params as f64 / 1e6);
    
    // Check for gradient flow
    let sample_input = Tensor::randn([1, 3, 224, 224], DType::F32, Device::Cpu);
    let output = model.forward(&sample_input)?;
    
    println!("Model output shape: {:?}", output.shape());
    
    // Validate output shape matches expected
    let expected_classes = 1000;  // ImageNet
    if output.shape()[1] != expected_classes {
        return Err(VisionError::InvalidShape(
            format!("Expected {} classes, got {}", expected_classes, output.shape()[1])
        ));
    }
    
    Ok(())
}
```

## Production Deployment

### 1. Model Optimization for Inference

```rust
// Optimize model for production inference
fn optimize_for_inference(model: &mut dyn Module) -> Result<()> {
    // Set to evaluation mode
    model.eval();
    
    // Fuse batch normalization with convolution
    fuse_conv_bn(model)?;
    
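    // Quantize model if supported
    // (#[cfg] removes the block at compile time; with cfg!() the call would
    // still have to compile when the feature is disabled)
    #[cfg(feature = "quantization")]
    {
        quantize_model(model, QuantizationConfig::default())?;
    }
    
    // JIT compile if available
    #[cfg(feature = "jit")]
    {
        let _jit_model = jit_compile(model)?;
        // Use the compiled model for inference
    }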
    
    Ok(())
}
```
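
Fusing batch normalization into a convolution works because both are affine at inference time. For one output channel with BatchNorm parameters `gamma`, `beta`, running `mean`, running `var` and `eps`, every weight of that channel is multiplied by `gamma / sqrt(var + eps)` and the bias becomes `(bias - mean) * gamma / sqrt(var + eps) + beta`. The sketch below shows that arithmetic for a single channel; `fuse_conv_bn_channel` is a hypothetical helper and does not describe how `fuse_conv_bn` is implemented.

```rust
/// Folds BatchNorm statistics into the preceding convolution for one output
/// channel. Returns `(scale, fused_bias)`: multiply every weight of the
/// channel by `scale` and replace the channel's bias with `fused_bias`.
fn fuse_conv_bn_channel(
    conv_bias: f32,
    gamma: f32,
    beta: f32,
    mean: f32,
    var: f32,
    eps: f32,
) -> (f32, f32) {
    let scale = gamma / (var + eps).sqrt();
    (scale, (conv_bias - mean) * scale + beta)
}
```

After fusion the BatchNorm layer can be dropped, which removes one pass over the activations per fused layer.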

### 2. Batch Processing for Throughput

```rust
struct InferenceEngine {
    model: Box<dyn Module>,
    batch_size: usize,
    preprocessing: Box<dyn Transform>,
    hardware_context: TransformContext,
}

impl InferenceEngine {
    fn new(model: Box<dyn Module>, batch_size: usize) -> Result<Self> {
        let preprocessing = TransformBuilder::new()
            .resize((224, 224))
            .imagenet_normalize()
            .build();
        
        let hardware_context = TransformContext::auto_optimize()?;
        
        Ok(Self {
            model,
            batch_size,
            preprocessing,
            hardware_context,
        })
    }
    
    fn predict_batch(&self, images: &[Tensor]) -> Result<Vec<Tensor>> {
        let mut results = Vec::new();
        
        for chunk in images.chunks(self.batch_size) {
            // Preprocess batch
            let mut preprocessed = Vec::new();
            for image in chunk {
                let processed = self.preprocessing.forward(image)?;
                preprocessed.push(processed);
            }
            
            // Stack into batch tensor
            let batch = Tensor::stack(&preprocessed, 0)?;
            
            // Inference
            let batch_output = self.model.forward(&batch)?;
            
            // Split back to individual results
            for i in 0..chunk.len() {
                results.push(batch_output.select(0, i)?);
            }
        }
        
        Ok(results)
    }
    
    async fn predict_async(&self, image: Tensor) -> Result<Tensor> {
        // Async inference for web services
        let processed = self.preprocessing.forward(&image)?;
        let output = self.model.forward(&processed.unsqueeze(0)?)?;
        Ok(output.squeeze(0)?)
    }
}
```

### 3. Model Serving

```rust
// HTTP service for model inference
use std::sync::Arc;
use warp::Filter;

struct ModelService {
    engine: Arc<InferenceEngine>,
}

impl ModelService {
    fn new(model_path: &str) -> Result<Self> {
        let model = load_model(model_path)?;
        let engine = Arc::new(InferenceEngine::new(model, 32)?);
        
        Ok(Self { engine })
    }
    
    fn routes(&self) -> impl Filter<Extract = impl warp::Reply, Error = warp::Rejection> + Clone {
        let engine = Arc::clone(&self.engine);
        
        warp::path("predict")
            .and(warp::post())
            .and(warp::body::bytes())
            .and(warp::any().map(move || Arc::clone(&engine)))
            .and_then(Self::handle_prediction)
    }
    
    async fn handle_prediction(
        body: bytes::Bytes,
        engine: Arc<InferenceEngine>,
    ) -> Result<impl warp::Reply, warp::Rejection> {
        // Decode image from bytes
        let image = decode_image_bytes(&body)
            .map_err(|_| warp::reject::custom(InvalidImage))?;
        
        // Run inference
        let result = engine.predict_async(image).await
            .map_err(|_| warp::reject::custom(InferenceError))?;
        
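        // Encode result as JSON; map the error explicitly because `?` cannot
        // convert a vision error into a warp::Rejection on its own
        let predictions = tensor_to_predictions(&result)
            .map_err(|_| warp::reject::custom(InferenceError))?;
        let response = serde_json::json!({
            "predictions": predictions,
        });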
        
        Ok(warp::reply::json(&response))
    }
}
```

## Testing and Validation

### 1. Unit Testing

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use approx::assert_relative_eq;
    
    #[test]
    fn test_transform_output_shape() {
        let transform = Resize::new((224, 224));
        let input = Tensor::zeros([3, 512, 512], DType::F32, Device::Cpu);
        
        let output = transform.forward(&input).unwrap();
        assert_eq!(output.shape().dims(), &[3, 224, 224]);
    }
    
    #[test]
    fn test_normalization_statistics() {
        let mean = vec![0.485, 0.456, 0.406];
        let std = vec![0.229, 0.224, 0.225];
        let normalize = Normalize::new(mean.clone(), std.clone());
        
        // Create image with known statistics
        let image = Tensor::ones([3, 224, 224], DType::F32, Device::Cpu);
        
        let normalized = normalize.forward(&image).unwrap();
        
        // Check that normalization was applied correctly
        for c in 0..3 {
            let channel = normalized.select(0, c).unwrap();
            let channel_mean = channel.mean().unwrap().to_scalar().unwrap();
            let expected_mean = (1.0 - mean[c]) / std[c];
            assert_relative_eq!(channel_mean, expected_mean, epsilon = 1e-6);
        }
    }
    
    #[test]
    fn test_dataset_consistency() {
        let dataset = create_test_dataset().unwrap();
        
        // Test that dataset length is consistent
        assert!(dataset.len() > 0);
        
        // Test that samples can be loaded
        for i in 0..std::cmp::min(10, dataset.len()) {
            let (image, label) = dataset.get(i).unwrap();
            assert_eq!(image.ndim(), 3);
            assert!(label >= 0);
        }
    }
}
```
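
The expected value in `test_normalization_statistics` follows from the per-channel formula `out = (x - mean[c]) / std[c]`. A plain-Rust sketch of that formula on a flat CHW buffer is shown below; `normalize_chw` is a hypothetical helper, not the `Normalize` transform itself.

```rust
/// Hypothetical helper: normalizes a CHW image stored as a flat slice in
/// place, using one mean and one standard deviation per channel.
fn normalize_chw(data: &mut [f32], mean: &[f32], std: &[f32]) {
    let channels = mean.len();
    assert_eq!(std.len(), channels, "mean and std need one entry per channel");
    assert!(
        channels > 0 && data.len() % channels == 0,
        "data length must be a multiple of the channel count"
    );
    let plane = data.len() / channels;
    if plane == 0 {
        return; // Nothing to normalize.
    }
    // In CHW layout each channel occupies one contiguous plane.
    for (c, channel) in data.chunks_mut(plane).enumerate() {
        for value in channel.iter_mut() {
            *value = (*value - mean[c]) / std[c];
        }
    }
}
```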

### 2. Integration Testing

```rust
#[test]
fn test_end_to_end_pipeline() -> Result<()> {
    // Create test data
    let test_images = create_test_images()?;
    let test_dataset = TestDataset::new(test_images)?;
    
    // Create data loader
    let data_loader = ParallelDataLoader::new(test_dataset, 2)?
        .with_batch_size(4);
    
    // Create model
    let mut model = ResNet::resnet18(10)?;  // 10 classes for test
    
    // Test training step
    for batch in data_loader.iter()?.take(2) {  // Test 2 batches
        let (images, labels) = batch?;
        
        // Forward pass
        let outputs = model.forward(&images)?;
        assert_eq!(outputs.shape()[0], images.shape()[0]);  // Same batch size
        assert_eq!(outputs.shape()[1], 10);                 // 10 classes
        
        // Compute loss
        let loss = cross_entropy_loss(&outputs, &labels)?;
        assert!(loss.to_scalar::<f32>()? > 0.0);
        
        // Backward pass
        loss.backward()?;
        
        // Check gradients exist
        for param in model.parameters() {
            assert!(param.grad().is_some());
        }
    }
    
    Ok(())
}
```

### 3. Performance Testing

```rust
use std::time::Instant;

fn benchmark_transforms() -> Result<()> {
    let transforms = vec![
        ("Resize", Box::new(Resize::new((224, 224))) as Box<dyn Transform>),
        ("RandomHorizontalFlip", Box::new(RandomHorizontalFlip::new(0.5))),
        ("ColorJitter", Box::new(ColorJitter::new().brightness(0.2))),
        ("Normalize", Box::new(Normalize::new(vec![0.5; 3], vec![0.5; 3]))),
    ];
    
    let test_image = Tensor::randn([3, 512, 512], DType::F32, Device::Cpu);
    let iterations = 100;
    
    for (name, transform) in transforms {
        let start = Instant::now();
        
        for _ in 0..iterations {
            let _ = transform.forward(&test_image)?;
        }
        
        let duration = start.elapsed();
        // as_secs_f64 keeps sub-millisecond precision; as_millis truncates to whole ms
        let avg_ms = duration.as_secs_f64() * 1000.0 / iterations as f64;
        
        println!("{}: {:.2} ms/image", name, avg_ms);
        
        // Performance regression test
        match name {
            "Resize" => assert!(avg_ms < 10.0, "Resize too slow: {:.2} ms", avg_ms),
            "RandomHorizontalFlip" => assert!(avg_ms < 1.0, "Flip too slow: {:.2} ms", avg_ms),
            _ => {}
        }
    }
    
    Ok(())
}
```
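
An average can hide occasional slow iterations. If per-iteration timings are recorded, a percentile gives a more robust threshold for regression checks. The sketch below uses the nearest-rank method; `percentile` is a hypothetical helper and the choice of method is an assumption.

```rust
/// Nearest-rank percentile of a set of timings. `p` is expected in [0, 100].
/// Returns `None` when there are no samples.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest value with at least p% of samples at or below it.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}
```

Asserting on the median or the 95th percentile instead of the mean makes a timing test less sensitive to a single slow iteration caused by unrelated system load.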

## Code Organization

### 1. Project Structure

```
src/
├── lib.rs                 # Main library entry point
├── prelude.rs            # Common imports
├── error_handling.rs     # Error types and utilities
├── transforms/           # Transform implementations
│   ├── mod.rs
│   ├── geometric.rs      # Resize, crop, flip, rotate
│   ├── color.rs          # Color transforms
│   ├── augmentation.rs   # Advanced augmentation
│   └── unified.rs        # Unified transform API
├── datasets/             # Dataset implementations
│   ├── mod.rs
│   ├── image_folder.rs
│   ├── cifar.rs
│   ├── mnist.rs
│   └── optimized/        # Optimized datasets
├── models/               # Model architectures
│   ├── mod.rs
│   ├── resnet.rs
│   ├── vgg.rs
│   └── detection/        # Detection models
├── hardware/             # Hardware acceleration
├── memory/               # Memory management
├── visualization/        # Interactive and 3D viz
└── examples/             # Usage examples
```

### 2. Module Organization

```rust
// lib.rs - Clean public API
#![allow(clippy::all)]
#![allow(dead_code)]
#![allow(unused_imports)]

pub mod datasets;
pub mod transforms;
pub mod models;
pub mod hardware;
pub mod memory;
pub mod visualization;
pub mod error_handling;

// Re-export commonly used items
pub use error_handling::{VisionError, Result};
pub use transforms::{Transform, TransformBuilder};
pub use datasets::{Dataset, ImageDataset};

// Prelude for convenience
pub mod prelude {
    pub use crate::{
        VisionError, Result,
        Transform, TransformBuilder,
        Dataset, ImageDataset,
    };
    pub use torsh_tensor::{Tensor, Device, DType};
}
```

### 3. Feature Flags

```toml
# Cargo.toml
[features]
default = ["std", "cpu"]
std = []
cpu = []
cuda = ["torsh-tensor/cuda"]
mkl = ["torsh-tensor/mkl"]
quantization = ["torsh-tensor/quantization"]
visualization = ["image", "plotters"]
video = ["opencv"]
distributed = ["mpi"]
```

```rust
// Conditional compilation
#[cfg(feature = "cuda")]
pub mod cuda_transforms;

#[cfg(feature = "visualization")]
pub mod interactive;

#[cfg(feature = "quantization")]
impl Model {
    pub fn quantize(&mut self) -> Result<()> {
        // Quantization implementation
        todo!("quantization logic is omitted in this guide")
    }
}
```

## Monitoring and Profiling

### 1. Performance Monitoring

```rust
use std::time::Instant;
use std::collections::HashMap;

struct PerformanceMonitor {
    timings: HashMap<String, Vec<f64>>,
    memory_usage: HashMap<String, Vec<usize>>,
}

impl PerformanceMonitor {
    fn new() -> Self {
        Self {
            timings: HashMap::new(),
            memory_usage: HashMap::new(),
        }
    }
    
    fn time_operation<F, R>(&mut self, name: &str, operation: F) -> R
    where
        F: FnOnce() -> R,
    {
        let start = Instant::now();
        let result = operation();
        let duration = start.elapsed().as_secs_f64() * 1000.0;  // ms
        
        self.timings.entry(name.to_string())
            .or_default()
            .push(duration);
        
        result
    }
    
    fn record_memory_usage(&mut self, name: &str, bytes: usize) {
        self.memory_usage.entry(name.to_string())
            .or_default()
            .push(bytes);
    }
    
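    fn report(&self) {
        println!("Performance Report:");
        
        for (name, times) in &self.timings {
            let avg = times.iter().sum::<f64>() / times.len() as f64;
            let min = times.iter().fold(f64::INFINITY, |a, &b| a.min(b));
            let max = times.iter().fold(f64::NEG_INFINITY, |a, &b| a.max(b));
            
            println!("  {}: avg={:.2}ms, min={:.2}ms, max={:.2}ms", 
                     name, avg, min, max);
        }
        
        // Report the memory samples collected by `record_memory_usage`
        for (name, samples) in &self.memory_usage {
            let peak = samples.iter().copied().max().unwrap_or(0);
            println!("  {}: peak={} bytes over {} samples", name, peak, samples.len());
        }
    }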
}
```
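Averages can hide tail latency, so it is often useful to report percentiles alongside the mean, minimum, and maximum. The sketch below computes a percentile with the nearest-rank method over a slice of timings such as those stored in `timings`. The `percentile` function is a hypothetical helper, not part of ToRSh Vision.

```rust
/// Returns the value at quantile `q` (between 0.0 and 1.0 inclusive)
/// using the nearest-rank method. Returns `None` for an empty slice
/// or an out-of-range (or NaN) quantile.
fn percentile(samples: &[f64], q: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }

    // Sort a copy so the caller's data keeps its recording order.
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Nearest rank: the smallest value with at least q * n samples
    // at or below it. Rank is 1-based, so convert to an index.
    let rank = (q * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.saturating_sub(1)])
}
```

Inside `report`, a call such as `percentile(times, 0.95)` could be printed next to the average to show how slow the slowest operations are.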

### 2. Resource Usage Tracking

```rust
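use std::time::Instant;

// `ResourceSummary`, `get_gpu_memory_usage`, and `get_cpu_utilization` are
// assumed to be defined elsewhere in the application.
struct ResourceTracker {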
    gpu_memory_usage: Vec<f64>,
    cpu_utilization: Vec<f64>,
    start_time: Instant,
}

impl ResourceTracker {
    fn new() -> Self {
        Self {
            gpu_memory_usage: Vec::new(),
            cpu_utilization: Vec::new(),
            start_time: Instant::now(),
        }
    }
    
    fn sample_resources(&mut self) -> Result<()> {
        #[cfg(feature = "cuda")]
        {
            let gpu_memory = get_gpu_memory_usage()?;
            self.gpu_memory_usage.push(gpu_memory);
        }
        
        let cpu_usage = get_cpu_utilization()?;
        self.cpu_utilization.push(cpu_usage);
        
        Ok(())
    }
    
    fn get_summary(&self) -> ResourceSummary {
        // Average of a sample list; 0.0 when nothing was recorded (for example,
        // GPU samples in a build without the `cuda` feature), which avoids NaN.
        fn mean(samples: &[f64]) -> f64 {
            if samples.is_empty() {
                0.0
            } else {
                samples.iter().sum::<f64>() / samples.len() as f64
            }
        }
        
        ResourceSummary {
            total_time: self.start_time.elapsed().as_secs_f64(),
            avg_gpu_memory: mean(&self.gpu_memory_usage),
            peak_gpu_memory: self.gpu_memory_usage.iter().fold(0.0_f64, |a, &b| a.max(b)),
            avg_cpu_utilization: mean(&self.cpu_utilization),
        }
    }
}
```
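A tracker like this only produces useful data if something calls the sampling method at regular intervals. One option is a background thread that samples until it is told to stop. The sketch below uses only the standard library; `BackgroundSampler` and its `probe` closure are hypothetical stand-ins for whatever measurement function the application provides.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Collects readings from `probe` on a background thread (hypothetical helper).
struct BackgroundSampler {
    stop: Arc<AtomicBool>,
    samples: Arc<Mutex<Vec<f64>>>,
    handle: Option<JoinHandle<()>>,
}

impl BackgroundSampler {
    /// Spawns the sampling thread. `probe` is called once immediately and
    /// then once per `interval` until `finish` is called.
    fn start<F>(interval: Duration, mut probe: F) -> Self
    where
        F: FnMut() -> f64 + Send + 'static,
    {
        let stop = Arc::new(AtomicBool::new(false));
        let samples = Arc::new(Mutex::new(Vec::new()));
        let (stop_t, samples_t) = (Arc::clone(&stop), Arc::clone(&samples));

        let handle = thread::spawn(move || loop {
            samples_t.lock().unwrap().push(probe());
            if stop_t.load(Ordering::Relaxed) {
                break;
            }
            thread::sleep(interval);
        });

        Self { stop, samples, handle: Some(handle) }
    }

    /// Signals the thread to stop, waits for it, and returns all readings.
    fn finish(mut self) -> Vec<f64> {
        self.stop.store(true, Ordering::Relaxed);
        if let Some(handle) = self.handle.take() {
            handle.join().expect("sampler thread panicked");
        }
        let collected = std::mem::take(&mut *self.samples.lock().unwrap());
        collected
    }
}
```

Because the loop takes a reading before it checks the stop flag, `finish` always returns at least one sample, which keeps later averages well defined.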

### 3. Logging and Telemetry

```rust
use log::{debug, error, info, warn};
use torsh_vision::VisionError;

// Structured logging
fn log_training_metrics(epoch: usize, loss: f32, accuracy: f32, lr: f32) {
    info!(
        "Training metrics - Epoch: {}, Loss: {:.4}, Accuracy: {:.2}%, LR: {:.6}",
        epoch, loss, accuracy * 100.0, lr
    );
}

// Performance logging (`processing_time` is in seconds)
fn log_batch_processing_time(batch_size: usize, processing_time: f64) {
    let throughput = batch_size as f64 / processing_time;
    
    debug!(
        "Batch processing - Size: {}, Time: {:.2}ms, Throughput: {:.1} images/sec",
        batch_size, processing_time * 1000.0, throughput
    );
    
    if processing_time > 1.0 {  // More than 1 second per batch
        warn!("Slow batch processing detected: {:.2}s for {} images", 
              processing_time, batch_size);
    }
}

// Error logging with context
fn log_error_with_context(error: &VisionError, context: &str) {
    error!("Error in {}: {}", context, error);
    
    // Log additional debugging information
    match error {
        VisionError::InvalidShape(msg) => {
            debug!("Shape error details: {}", msg);
        }
        VisionError::IoError(io_err) => {
            debug!("IO error details: {}", io_err);
        }
        _ => {}
    }
}
```
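Warnings raised on every slow batch can flood the logs when a whole run is slow. One way to limit this is to throttle repeated messages and report how many were suppressed. The sketch below is a standard-library-only illustration; `LogThrottle` is a hypothetical helper, and the current time is passed in explicitly so the behavior is easy to test.

```rust
use std::time::{Duration, Instant};

/// Allows at most one log message per `min_interval` (hypothetical helper).
struct LogThrottle {
    min_interval: Duration,
    last_emitted: Option<Instant>,
    suppressed: u64,
}

impl LogThrottle {
    fn new(min_interval: Duration) -> Self {
        Self {
            min_interval,
            last_emitted: None,
            suppressed: 0,
        }
    }

    /// Returns `Some(n)` when the caller should log now, where `n` is the
    /// number of events suppressed since the previous emission.
    /// Returns `None` when the event falls inside the quiet interval.
    fn should_emit(&mut self, now: Instant) -> Option<u64> {
        match self.last_emitted {
            Some(last) if now.duration_since(last) < self.min_interval => {
                self.suppressed += 1;
                None
            }
            _ => {
                self.last_emitted = Some(now);
                // Hand back the suppressed count and reset it to zero.
                Some(std::mem::take(&mut self.suppressed))
            }
        }
    }
}
```

A caller would check `should_emit(Instant::now())` before issuing `warn!` and include the suppressed count in the message so that no slow batches go unreported.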

## Conclusion

Following these best practices will help you:

- **Maximize Performance**: Efficient memory usage, hardware acceleration, and optimized data pipelines
- **Ensure Reliability**: Comprehensive error handling, validation, and testing
- **Maintain Code Quality**: Clean architecture, proper documentation, and organized code structure
- **Monitor Production Systems**: Performance tracking, resource monitoring, and proper logging
- **Scale Effectively**: Memory-aware processing, batch optimization, and hardware utilization

Remember to:
- Profile your specific use case to identify bottlenecks
- Test thoroughly before deploying to production
- Monitor resource usage in production environments
- Keep error handling comprehensive but not verbose
- Use appropriate hardware acceleration for your deployment target

For more specific examples and advanced techniques, refer to the [examples module](./src/examples.rs) and other documentation files.