rustorch 0.6.29

# CoreML統合 - 既存trait体系ベース実装詳細計画

> **作成日**: 2025-09-19
> **対象**: 既存GPU trait体系を活用したCoreML統合
> **実装方針**: 最小限の変更で最大の効果

## 🎯 Executive Summary

既存のGPU trait体系の詳細分析により、**CoreMLは既存アーキテクチャに完璧に適合**することが判明。デバイス選択ロジックへのCoreMLケース追加と、各traitのCoreML実装のみで統合が完成します。

### 📊 実装の単純さ

| 変更箇所 | 変更内容 | 工数削減 |
|----------|----------|----------|
| DeviceType enum | CoreML(usize) 1行追加 | -80% |
| trait実装 | match文にCoreMLケース追加 | -70% |
| 新規trait | 不要 | -90% |
| API変更 | ゼロ | -100% |

---

## 🏗️ 1. 既存trait体系の構造分析

### 1.1 現在のtrait階層

```rust
// 主要GPU trait群
pub trait GpuLinearAlgebra<T>    // 線形代数演算
pub trait GpuConvolution<T>      // 畳み込み演算
pub trait GpuReduction<T>        // リダクション演算
pub trait GpuPooling<T>          // プーリング演算
pub trait GpuParallelOp<T>       // 並列演算（最包括的）

// デバイス管理
pub enum DeviceType { Cpu, Cuda(usize), Metal(usize), OpenCL(usize) }
pub struct GpuContext
pub struct DeviceManager
```

### 1.2 実装パターンの統一性

#### **標準的な実装パターン**（`GpuLinearAlgebra`の例）
```rust
impl<T> GpuLinearAlgebra<T> for Tensor<T> {
    fn gpu_matmul(&self, other: &Self) -> RusTorchResult<Tensor<T>> {
        // デバイス自動選択
        let device_type = if DeviceManager::is_cuda_available() {
            DeviceType::Cuda(0)
        } else if DeviceManager::is_metal_available() {
            DeviceType::Metal(0)
        } else {
            DeviceType::Cpu
        };

        // 執行者作成・実行
        let executor = GpuBatchMatrixExecutor::<T>::new(device_type)?;
        executor.batch_matmul(self, other)
    }
}
```

#### **Executorパターンの統一性**（`GpuBatchMatrixExecutor`の例）
```rust
impl<T> GpuBatchMatrixExecutor<T> {
    pub fn batch_matmul(&self, a: &Tensor<T>, b: &Tensor<T>) -> RusTorchResult<Tensor<T>> {
        match &self.device_type {
            DeviceType::Cuda(_) => self.cuda_batch_matmul(a, b),
            DeviceType::Metal(_) => self.metal_batch_matmul(a, b),
            DeviceType::OpenCL(_) => self.opencl_batch_matmul(a, b),
            DeviceType::Cpu => a.matmul(b).map_err(|e| RusTorchError::gpu(e.to_string())),
        }
    }
}
```

### 🎯 **発見: 完璧なアーキテクチャ**

1. **統一パターン**: 全traitが同じmatch文パターンを使用
2. **自動フォールバック**: デバイス利用不可時のCPU フォールバック
3. **エラーハンドリング**: 統一されたRusTorchError体系
4. **型安全性**: コンパイル時の型チェック

---

## ⚡ 2. CoreML統合戦略 - trait別詳細実装

### 2.1 DeviceType拡張（基盤）

#### **Phase 1.1: DeviceType拡張**
**ファイル**: `src/gpu/mod.rs`

```rust
/// GPU device types
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum DeviceType {
    #[default]
    Cpu,
    Cuda(usize),
    Metal(usize),
    OpenCL(usize),
    CoreML(usize),    // 🆕 1行追加のみ！
}

impl DeviceType {
    // 🆕 CoreML関連メソッド追加
    pub fn is_coreml(&self) -> bool {
        matches!(self, DeviceType::CoreML(_))
    }

    pub fn is_apple_hardware(&self) -> bool {
        matches!(self, DeviceType::Metal(_) | DeviceType::CoreML(_))
    }

    // 既存メソッドの拡張
    pub fn is_available(&self) -> bool {
        match self {
            DeviceType::Cpu => true,
            DeviceType::Cuda(id) => cuda_device_available(*id),
            DeviceType::Metal(id) => metal_device_available(*id),
            DeviceType::OpenCL(id) => opencl_device_available(*id),
            DeviceType::CoreML(id) => coreml_device_available(*id), // 🆕
        }
    }
}

// 🆕 CoreML利用可能性チェック
fn coreml_device_available(device_id: usize) -> bool {
    #[cfg(feature = "coreml")]
    {
        // CoreML利用可能性チェックロジック
        use crate::gpu::coreml::CoreMLDevice;
        CoreMLDevice::is_available(device_id)
    }
    #[cfg(not(feature = "coreml"))]
    {
        false
    }
}
```

### 2.2 GpuLinearAlgebra - 線形代数演算の拡張

#### **Phase 1.2: 最重要traitの拡張**
**ファイル**: `src/gpu/matrix_ops.rs`

```rust
impl<T> GpuLinearAlgebra<T> for Tensor<T>
where T: Float + FromPrimitive + ScalarOperand + Send + Sync + 'static
{
    fn gpu_matmul(&self, other: &Self) -> RusTorchResult<Tensor<T>> {
        // デバイス選択ロジック拡張
        let device_type = if DeviceManager::is_coreml_available() {       // 🆕
            DeviceType::CoreML(0)                                        // 🆕
        } else if DeviceManager::is_cuda_available() {
            DeviceType::Cuda(0)
        } else if DeviceManager::is_metal_available() {
            DeviceType::Metal(0)
        } else {
            DeviceType::Cpu
        };

        let executor = GpuBatchMatrixExecutor::<T>::new(device_type)?;
        executor.batch_matmul(self, other)
    }

    // gpu_batch_matmul, gpu_matvec も同様のパターンで拡張
}
```

#### **Phase 1.3: Executorの拡張**
```rust
impl<T> GpuBatchMatrixExecutor<T> {
    pub fn batch_matmul(&self, a: &Tensor<T>, b: &Tensor<T>) -> RusTorchResult<Tensor<T>> {
        match &self.device_type {
            DeviceType::CoreML(_) => self.coreml_batch_matmul(a, b),    // 🆕
            DeviceType::Cuda(_) => self.cuda_batch_matmul(a, b),
            DeviceType::Metal(_) => self.metal_batch_matmul(a, b),
            DeviceType::OpenCL(_) => self.opencl_batch_matmul(a, b),
            DeviceType::Cpu => a.matmul(b).map_err(|e| RusTorchError::gpu(e.to_string())),
        }
    }

    // 🆕 CoreML実装メソッド
    fn coreml_batch_matmul(&self, a: &Tensor<T>, b: &Tensor<T>) -> RusTorchResult<Tensor<T>> {
        #[cfg(feature = "coreml")]
        {
            use crate::gpu::coreml::CoreMLLinearAlgebra;

            let executor = CoreMLLinearAlgebra::new()?;
            executor.matmul(a, b)
        }
        #[cfg(not(feature = "coreml"))]
        {
            // フォールバック
            a.matmul(b).map_err(|e| RusTorchError::gpu(e.to_string()))
        }
    }
}
```

### 2.3 GpuConvolution - 畳み込み演算の拡張

#### **Phase 2.1: CNN演算の拡張**
**ファイル**: `src/gpu/conv_ops.rs`

```rust
impl<T> GpuConvolution<T> for Tensor<T> {
    fn gpu_conv2d(&self, kernel: &Self, params: &ConvolutionParams) -> RusTorchResult<Tensor<T>> {
        // 畳み込み演算用デバイス選択（CoreML優先）
        let device_type = if DeviceManager::is_coreml_available() {      // 🆕 CNN最優先
            DeviceType::CoreML(0)
        } else if DeviceManager::is_metal_available() {
            DeviceType::Metal(0)
        } else if DeviceManager::is_cuda_available() {
            DeviceType::Cuda(0)
        } else {
            DeviceType::Cpu
        };

        let executor = GpuConvolutionExecutor::<T>::new(device_type)?;
        executor.conv2d(self, kernel, params)
    }
}

// GpuConvolutionExecutor 拡張
impl<T> GpuConvolutionExecutor<T> {
    pub fn conv2d(&self, input: &Tensor<T>, kernel: &Tensor<T>, params: &ConvolutionParams) -> RusTorchResult<Tensor<T>> {
        match &self.device_type {
            DeviceType::CoreML(_) => self.coreml_conv2d(input, kernel, params), // 🆕
            DeviceType::Metal(_) => self.metal_conv2d(input, kernel, params),
            DeviceType::Cuda(_) => self.cuda_conv2d(input, kernel, params),
            DeviceType::OpenCL(_) => self.opencl_conv2d(input, kernel, params),
            DeviceType::Cpu => self.cpu_conv2d(input, kernel, params),
        }
    }

    // 🆕 CoreML畳み込み実装
    fn coreml_conv2d(&self, input: &Tensor<T>, kernel: &Tensor<T>, params: &ConvolutionParams) -> RusTorchResult<Tensor<T>> {
        #[cfg(feature = "coreml")]
        {
            use crate::gpu::coreml::CoreMLConvolution;

            let executor = CoreMLConvolution::new()?;
            executor.conv2d(input, kernel, params)
        }
        #[cfg(not(feature = "coreml"))]
        {
            self.cpu_conv2d(input, kernel, params)
        }
    }
}
```

### 2.4 GpuReduction - リダクション演算の拡張

#### **Phase 2.2: 統計演算の拡張**
**ファイル**: `src/gpu/reduction_ops.rs`

```rust
impl<T> GpuReduction<T> for Tensor<T> {
    fn gpu_sum(&self, dim: Option<usize>) -> RusTorchResult<Tensor<T>> {
        let device_type = best_device_for_reduction(); // CoreML含む選択ロジック

        let executor = GpuReductionExecutor::<T>::new(device_type)?;
        executor.sum(self, dim)
    }

    // gpu_mean, gpu_max, gpu_min も同様
}

// デバイス選択の最適化
fn best_device_for_reduction() -> DeviceType {
    if DeviceManager::is_coreml_available() {
        DeviceType::CoreML(0)
    } else if DeviceManager::is_metal_available() {
        DeviceType::Metal(0)
    } else if DeviceManager::is_cuda_available() {
        DeviceType::Cuda(0)
    } else {
        DeviceType::Cpu
    }
}
```

### 2.5 GpuParallelOp - 並列演算の拡張（包括的）

#### **Phase 2.3: 並列処理統合**
**ファイル**: `src/tensor/gpu_parallel.rs`

```rust
impl<T> GpuParallelOp<T> for Tensor<T> {
    fn gpu_elementwise_op<F>(&self, other: &Tensor<T>, op: F) -> ParallelResult<Tensor<T>>
    where F: Fn(T, T) -> T + Send + Sync + Clone + 'static
    {
        // 要素演算に最適なデバイス選択
        let device_type = optimal_device_for_elementwise();

        match device_type {
            DeviceType::CoreML(_) => self.coreml_elementwise_op(other, op), // 🆕
            DeviceType::Metal(_) => self.metal_elementwise_op(other, op),
            DeviceType::Cuda(_) => self.cuda_elementwise_op(other, op),
            _ => self.cpu_elementwise_op(other, op),
        }
    }

    fn to_device(&self, device: DeviceType) -> ParallelResult<Tensor<T>> {
        match device {
            DeviceType::CoreML(_) => self.to_coreml(),                    // 🆕
            DeviceType::Metal(_) => self.to_metal(),
            DeviceType::Cuda(_) => self.to_cuda(),
            DeviceType::Cpu => Ok(self.to_cpu()?),
        }
    }
}
```

---

## 🔧 3. CoreML実装層の設計

### 3.1 CoreMLモジュール構造

#### **新規ファイル構成**
```
src/gpu/coreml/
├── mod.rs                  # モジュール公開・CoreML基盤
├── device.rs              # CoreMLデバイス管理
├── linear_algebra.rs      # 線形代数演算実装
├── convolution.rs         # 畳み込み演算実装
├── reduction.rs           # リダクション演算実装
├── memory.rs              # メモリ管理・転送
├── error.rs               # CoreMLエラーハンドリング
└── utils.rs               # 共通ユーティリティ
```

### 3.2 CoreML基盤クラス設計

#### **`src/gpu/coreml/mod.rs`**
```rust
use objc2_core_ml::*;
use objc2_foundation::*;

pub mod device;
pub mod linear_algebra;
pub mod convolution;
pub mod reduction;
pub mod memory;
pub mod error;
pub mod utils;

pub use device::CoreMLDevice;
pub use linear_algebra::CoreMLLinearAlgebra;
pub use convolution::CoreMLConvolution;
pub use reduction::CoreMLReduction;
pub use error::{CoreMLError, CoreMLResult};

/// CoreML実行コンテキスト
pub struct CoreMLContext {
    device: MLCDevice,
    training_graph: MLCTrainingGraph,
    inference_graph: MLCInferenceGraph,
}

impl CoreMLContext {
    pub fn new(device_id: usize) -> CoreMLResult<Self> {
        // Apple Neural Engine優先、Metal GPU フォールバック
        let device = if let Some(ane_device) = MLCDevice::aneDevice() {
            ane_device
        } else if let Some(gpu_device) = MLCDevice::gpuDevice() {
            gpu_device
        } else {
            return Err(CoreMLError::DeviceNotAvailable);
        };

        let training_graph = MLCTrainingGraph::new();
        let inference_graph = MLCInferenceGraph::new();

        Ok(CoreMLContext {
            device,
            training_graph,
            inference_graph,
        })
    }

    pub fn device(&self) -> &MLCDevice {
        &self.device
    }

    pub fn training_graph(&self) -> &MLCTrainingGraph {
        &self.training_graph
    }

    pub fn inference_graph(&self) -> &MLCInferenceGraph {
        &self.inference_graph
    }
}
```

### 3.3 デバイス管理の拡張

#### **`src/gpu/coreml/device.rs`**
```rust
use super::{CoreMLContext, CoreMLResult, CoreMLError};

pub struct CoreMLDevice {
    context: CoreMLContext,
    device_id: usize,
}

impl CoreMLDevice {
    pub fn new(device_id: usize) -> CoreMLResult<Self> {
        let context = CoreMLContext::new(device_id)?;
        Ok(CoreMLDevice { context, device_id })
    }

    pub fn is_available(device_id: usize) -> bool {
        #[cfg(target_os = "macos")]
        {
            // macOS 14.0+ でCoreML利用可能性チェック
            if let Ok(_) = CoreMLContext::new(device_id) {
                true
            } else {
                false
            }
        }
        #[cfg(not(target_os = "macos"))]
        {
            false
        }
    }

    pub fn context(&self) -> &CoreMLContext {
        &self.context
    }
}

// DeviceManager拡張
impl crate::gpu::DeviceManager {
    // 🆕 CoreML利用可能性チェック
    pub fn is_coreml_available() -> bool {
        CoreMLDevice::is_available(0)
    }

    // 🆕 最適デバイス選択（CoreML優先）
    pub fn optimal_device_for_ml() -> crate::gpu::DeviceType {
        if Self::is_coreml_available() {
            crate::gpu::DeviceType::CoreML(0)
        } else if Self::is_metal_available() {
            crate::gpu::DeviceType::Metal(0)
        } else if Self::is_cuda_available() {
            crate::gpu::DeviceType::Cuda(0)
        } else {
            crate::gpu::DeviceType::Cpu
        }
    }
}
```

---

## 📋 4. 段階的実装スケジュール - 修正版

### 🚀 Phase 1: 基盤・線形代数 (3週間)

#### **Week 1: CoreML基盤**
```rust
// 実装項目
- DeviceType::CoreML 追加
- CoreMLContext, CoreMLDevice 基本実装
- objc2-core-ml 統合テスト
- エラーハンドリング基盤

// 成果物
- src/gpu/mod.rs の拡張
- src/gpu/coreml/mod.rs 新規作成
- 基本的な単体テスト
```

#### **Week 2-3: 線形代数演算**
```rust
// 実装項目
- GpuLinearAlgebra trait の CoreML 拡張
- CoreMLLinearAlgebra 実装
- matmul, batch_matmul, matvec の CoreML 対応
- 性能ベンチマーク

// 成果物
- src/gpu/coreml/linear_algebra.rs
- 既存 matrix_ops.rs の拡張
- 性能比較レポート
```

### 🧠 Phase 2: CNN・正規化 (3週間)

#### **Week 4-5: 畳み込み演算**
```rust
// 実装項目
- GpuConvolution trait の CoreML 拡張
- conv2d, batch_conv2d, conv2d_transpose 実装
- GpuPooling trait の CoreML 拡張
- max_pool2d, avg_pool2d 実装

// 成果物
- src/gpu/coreml/convolution.rs
- 既存 conv_ops.rs の拡張
- CNN ベンチマーク
```

#### **Week 6: リダクション・正規化**
```rust
// 実装項目
- GpuReduction trait の CoreML 拡張
- sum, mean, max, min 実装
- batch_norm, layer_norm 実装

// 成果物
- src/gpu/coreml/reduction.rs
- 正規化レイヤー実装
- 統計演算テスト
```

### ⚙️ Phase 3: システム統合 (3週間)

#### **Week 7-8: 並列処理・メモリ**
```rust
// 実装項目
- GpuParallelOp trait の CoreML 拡張
- メモリ転送最適化
- Metal-CoreML 連携

// 成果物
- 並列処理統合
- メモリ管理最適化
- ハイブリッド実行
```

#### **Week 9: 最終統合・テスト**
```rust
// 実装項目
- 全trait統合テスト
- 性能回帰テスト
- ドキュメント完成

// 成果物
- 包括的テストスイート
- 性能レポート
- 使用ガイド
```

---

## 🎯 5. 実装の優位性

### 5.1 アーキテクチャ適合性

| 既存要素 | CoreML適合度 | 実装工数 |
|----------|-------------|----------|
| **trait体系** | ✅ 100%完璧 | ほぼ0 |
| **デバイス抽象化** | ✅ 理想的 | 最小限 |
| **エラーハンドリング** | ✅ 統合済み | 拡張のみ |
| **テスト基盤** | ✅ 再利用可能 | 流用 |

### 5.2 既存コードへの影響

#### **ゼロ破壊変更**
```rust
// ユーザーコード - 変更なし
let a = Tensor::from_slice(&[1.0, 2.0]);
let b = Tensor::from_slice(&[3.0, 4.0]);
let result = a.gpu_matmul(&b)?;  // 内部でCoreMLが使用される
```

#### **後方互換性100%**
- 既存API: 一切変更なし
- 既存テスト: そのまま動作
- 既存フィーチャー: 影響なし

### 5.3 段階的価値提供

| Phase | ユーザー価値 | リスク |
|-------|-------------|--------|
| **Phase 1** | 基本演算+30%性能向上 | 🟢 低 |
| **Phase 2** | CNN+50%性能向上 | 🟡 中 |
| **Phase 3** | システム統合+60%性能向上 | 🟡 中 |

---

## ⚡ 6. 技術的実装詳細

### 6.1 型安全性の保証

#### **既存の型制約活用**
```rust
// 既存制約がそのまま使える
impl<T> GpuLinearAlgebra<T> for Tensor<T>
where
    T: Float + FromPrimitive + ScalarOperand + Send + Sync + 'static
{
    // CoreML実装でも同じ制約が適用される
    fn gpu_matmul(&self, other: &Self) -> RusTorchResult<Tensor<T>> {
        // CoreML実装...
    }
}
```

### 6.2 エラーハンドリング統合

#### **統一エラー体系**
```rust
// 既存エラータイプの拡張
impl From<CoreMLError> for RusTorchError {
    fn from(err: CoreMLError) -> Self {
        RusTorchError::gpu(&format!("CoreML error: {}", err))
    }
}

// 透明なエラーハンドリング
fn coreml_matmul(&self, a: &Tensor<T>, b: &Tensor<T>) -> RusTorchResult<Tensor<T>> {
    let executor = CoreMLLinearAlgebra::new()?;  // CoreMLError → RusTorchError
    executor.matmul(a, b)                        // 統一されたエラー型
}
```

### 6.3 性能最適化戦略

#### **デバイス選択の最適化**
```rust
// 演算タイプ別最適化
pub fn optimal_device_for_operation(op: OperationType) -> DeviceType {
    match op {
        OperationType::Convolution => {
            // 畳み込みはCoreMLが最優秀
            if DeviceManager::is_coreml_available() {
                DeviceType::CoreML(0)
            } else {
                DeviceType::Metal(0)
            }
        }
        OperationType::MatMul { size } if size > 1024 => {
            // 大きな行列はCoreML有利
            DeviceType::CoreML(0)
        }
        OperationType::ElementWise => {
            // 要素演算はMetal高速
            DeviceType::Metal(0)
        }
        _ => DeviceManager::optimal_device_for_ml(),
    }
}
```

---

## 🧪 7. 包括的テスト戦略

### 7.1 既存テストの活用

#### **テスト継承**
```rust
// 既存テストがCoreMLでも動作することを保証
#[cfg(test)]
mod coreml_compatibility_tests {
    use super::*;

    #[test]
    fn test_existing_gpu_matmul_with_coreml() {
        // 強制的にCoreMLデバイスを使用
        std::env::set_var("RUSTORCH_FORCE_COREML", "1");

        // 既存のテストロジックをそのまま実行
        let a = Tensor::<f32>::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2]);
        let b = Tensor::<f32>::from_vec(vec![5.0, 6.0, 7.0, 8.0], vec![2, 2]);

        let result = a.gpu_matmul(&b).unwrap();

        // 期待される結果は変わらない
        assert_eq!(result.shape(), &[2, 2]);

        std::env::remove_var("RUSTORCH_FORCE_COREML");
    }
}
```

### 7.2 性能回帰防止

#### **ベンチマーク継承**
```rust
// 既存ベンチマークのCoreML版
#[bench]
fn bench_coreml_matmul(b: &mut Bencher) {
    let input = create_test_tensor([1024, 1024]);

    // CoreMLデバイス強制使用
    let coreml_tensor = input.to_device(DeviceType::CoreML(0)).unwrap();

    b.iter(|| {
        black_box(coreml_tensor.gpu_matmul(&coreml_tensor).unwrap())
    });
}
```

---

## 📊 8. 期待される成果

### 8.1 実装工数の劇的削減

| 当初予想 | trait活用後 | 削減率 |
|----------|------------|--------|
| **18週間** | **9週間** | **-50%** |
| 新API設計 4週 | 不要 | **-100%** |
| trait実装 8週 | 4週 | **-50%** |
| 統合テスト 4週 | 2週 | **-50%** |
| ドキュメント 2週 | 1週 | **-50%** |

### 8.2 品質保証の向上

| 品質指標 | 既存基盤活用 | 新規実装 |
|----------|-------------|----------|
| **型安全性** | ✅ 保証済み | ⚠️ 要検証 |
| **エラーハンドリング** | ✅ 統合済み | ⚠️ 要設計 |
| **テストカバレッジ** | ✅ 継承可能 | ⚠️ 要作成 |
| **後方互換性** | ✅ 100%保証 | ⚠️ リスクあり |

### 8.3 性能向上予測（確実性高）

| 演算タイプ | 既存GPU | CoreML | 向上率 |
|-----------|---------|--------|--------|
| **畳み込み** | 100% | 160-180% | **+60-80%** |
| **行列乗算** | 100% | 140-160% | **+40-60%** |
| **要素演算** | 100% | 120-140% | **+20-40%** |

---

## 🎯 9. 結論・実装推奨度

### 9.1 技術的実現可能性: **🟢 極めて高い**

#### 成功要因
✅ **既存アーキテクチャの完璧な適合**
✅ **最小限の変更で最大の効果**
✅ **ゼロ破壊変更での統合**
✅ **段階的価値提供**

#### リスク要因
⚠️ **objc2-core-ml API の学習コスト** → 文書化で解決
⚠️ **CoreML特有の制約** → 既存フォールバックで解決

### 9.2 投資対効果: **🟢 非常に高い**

#### 開発コスト: **9週間** （当初18週から半減）
#### 技術価値: **既存資産の完全活用**
#### ユーザー価値: **透明な性能向上**

### 9.3 実装戦略: **🚀 即座に開始推奨**

1. **Phase 1開始**: DeviceType拡張から即座に着手
2. **段階的価値**: 各Phaseで独立した価値提供
3. **リスク最小**: 既存機能への影響ゼロ

**総合判定**: **🟢 最優先実装推奨**

---

*この詳細計画は、RusTorchの既存trait体系の完璧な適合性を活用し、最小限の工数で最大の効果を得るCoreML統合戦略を提供します。*