mecha10-cli 0.1.47

# Automatic Model Quantization Proposal

## Overview

Integrate automatic INT8 quantization into the model download pipeline, enabled per model through the catalog configuration.

## Changes Required

### 1. Add Quantization Config to Model Schema

```rust
// In model_service.rs, ModelCatalogEntry:
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelCatalogEntry {
    pub name: String,
    pub description: String,
    pub task: String,
    pub repo: String,
    pub filename: String,

    // NEW: Quantization configuration
    #[serde(default)]
    pub quantize: Option<QuantizeConfig>,
    // ... (remaining fields unchanged)
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct QuantizeConfig {
    /// Enable automatic quantization after download
    pub enabled: bool,
    /// Quantization method. Only "dynamic_int8" is supported for now;
    /// other values (e.g. "static_int8") are rejected at quantization time.
    pub method: String,
}
```
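Because `method` is a free-form string, a typo in the catalog only surfaces when quantization runs. One possible alternative, sketched below, is to parse the string into a typed value up front. `QuantizeMethod` and `UnsupportedMethod` are hypothetical names introduced for illustration and are not part of the existing code; the sketch uses only the standard library.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical typed replacement for the `method: String` field.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuantizeMethod {
    /// Corresponds to the "dynamic_int8" catalog value.
    DynamicInt8,
}

/// Hypothetical error carrying the rejected method string.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UnsupportedMethod(pub String);

impl fmt::Display for UnsupportedMethod {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Unsupported quantization method: {}", self.0)
    }
}

impl std::error::Error for UnsupportedMethod {}

impl FromStr for QuantizeMethod {
    type Err = UnsupportedMethod;

    /// Accept only the method names the proposal lists as supported.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dynamic_int8" => Ok(QuantizeMethod::DynamicInt8),
            other => Err(UnsupportedMethod(other.to_string())),
        }
    }
}
```

With this shape, `"dynamic_int8".parse::<QuantizeMethod>()` succeeds and any other string is rejected with a descriptive error. If the enum were deserialized directly with serde, an explicit `#[serde(rename = "dynamic_int8")]` on the variant would keep the catalog format unchanged.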

### 2. Modify ModelService::pull() Method

```rust
pub async fn pull(&self, name: &str, progress: Option<&ProgressBar>) -> Result<PathBuf> {
    let entry = self.get_catalog_entry(name)
        .with_context(|| format!("Model '{}' not found in catalog", name))?;

    // Create model directory
    let model_dir = self.models_dir.join(name);
    fs::create_dir_all(&model_dir).await?;

    // Download FP32 model
    let model_path = self
        .pull_from_repo(&entry.repo, &entry.filename, name, progress)
        .await?;

    // Download labels...
    self.pull_labels_from_repo(entry, name, progress).await?;

    // Generate config...
    self.generate_model_config(entry, &model_path, progress).await?;

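    // NEW: Auto-quantize if configured. A failure here is non-fatal:
    // the FP32 model is already on disk, so warn and fall back to it.
    if let Some(quantize_config) = &entry.quantize {
        if quantize_config.enabled {
            if let Err(err) = self.quantize_model(&model_path, quantize_config, progress).await {
                eprintln!("Warning: quantization failed, using FP32 model: {err:#}");
            }
        }
    }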

    Ok(model_path)
}
```

### 3. Add Quantization Method

```rust
impl ModelService {
    /// Quantize a model to INT8 (calls Python script)
    async fn quantize_model(
        &self,
        model_path: &Path,
        config: &QuantizeConfig,
        progress: Option<&ProgressBar>,
    ) -> Result<PathBuf> {
        let int8_path = model_path.with_file_name("model-int8.onnx");

        // Skip if already quantized
        if int8_path.exists() {
            if let Some(pb) = progress {
                pb.set_message("INT8 model already cached");
            }
            return Ok(int8_path);
        }

        if let Some(pb) = progress {
            pb.set_message("Quantizing model to INT8...");
        }

        match config.method.as_str() {
            "dynamic_int8" => {
                self.quantize_dynamic_int8(model_path, &int8_path).await?;
            }
            _ => {
                anyhow::bail!("Unsupported quantization method: {}", config.method);
            }
        }

        if let Some(pb) = progress {
            pb.set_message("✅ INT8 model ready");
        }

        Ok(int8_path)
    }

    /// Perform dynamic INT8 quantization using ONNX Runtime tools
    async fn quantize_dynamic_int8(&self, input: &Path, output: &Path) -> Result<()> {
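        // Locate a Python interpreter (this does not verify that onnxruntime is installed)
        let python = self.find_python()?;

        // Embed quantization script in binary and write it to a per-process temp file
        let script = include_str!("../../scripts/quantize_int8.py");
        let script_path = std::env::temp_dir()
            .join(format!("quantize_int8-{}.py", std::process::id()));
        fs::write(&script_path, script).await?;

        // Run Python quantization script
        let result = tokio::process::Command::new(&python)
            .arg(&script_path)
            .arg(input)
            .arg(output)
            .status()
            .await;

        // Cleanup temp script before inspecting the result, so it is removed on failure too
        let _ = fs::remove_file(&script_path).await;

        let status = result.context("Failed to launch Python quantization script")?;
        if !status.success() {
            // Remove any partial output so it is not mistaken for a cached INT8 model
            let _ = fs::remove_file(output).await;
            anyhow::bail!("Quantization failed with exit code: {:?}", status.code());
        }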

        Ok(())
    }

    /// Find Python 3 executable
    fn find_python(&self) -> Result<String> {
        for candidate in &["python3", "python"] {
            if which::which(candidate).is_ok() {
                return Ok(candidate.to_string());
            }
        }
        anyhow::bail!(
            "Python 3 not found. Install with: brew install python3 (macOS) or apt install python3 (Linux)"
        )
    }
}
```
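The temp script above needs to be removed on every exit path, including early returns. One way to get that guarantee is a small RAII guard that deletes the file when it goes out of scope. The sketch below is illustrative only: `TempScript` is a hypothetical type, not part of the existing code, and it uses blocking `std::fs` calls, so in the async `ModelService` it would need `tokio::fs` or `spawn_blocking` instead.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical guard: owns a script file in the temp directory and
/// deletes it when dropped.
pub struct TempScript {
    path: PathBuf,
}

/// Per-process counter so two guards never share a file name.
static COUNTER: AtomicU64 = AtomicU64::new(0);

impl TempScript {
    /// Write `contents` to a uniquely named file in the system temp directory.
    pub fn create(contents: &str) -> io::Result<Self> {
        let n = COUNTER.fetch_add(1, Ordering::Relaxed);
        let name = format!("quantize_int8-{}-{}.py", std::process::id(), n);
        let path = std::env::temp_dir().join(name);
        fs::write(&path, contents)?;
        Ok(Self { path })
    }

    /// Path to pass to the Python interpreter.
    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempScript {
    fn drop(&mut self) {
        // Best-effort cleanup; a failure to delete is ignored.
        let _ = fs::remove_file(&self.path);
    }
}
```

Including the process ID and a counter in the file name also avoids two concurrent `pull` invocations overwriting each other's script.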

### 4. Update Model Catalog

```toml
# model_catalog.toml
[[models]]
name = "yolov8n"
description = "YOLOv8 Nano - Fast object detection"
task = "object-detection"
repo = "deepghs/yolos"
filename = "yolov8n/model.onnx"
preprocessing_preset = "yolo"

# NEW: Enable automatic quantization
[models.quantize]
enabled = true
method = "dynamic_int8"
```

## Benefits

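- ✅ **No manual quantization steps** - Quantization runs automatically during `mecha10 models pull`
- ✅ **Config-driven** - Enabled per model via the catalog
- ✅ **Cached** - The INT8 model is cached alongside the FP32 model
- ✅ **Portable** - Works on any machine with Python 3 and the `onnx` and `onnxruntime` packages
- ✅ **Fallback** - The FP32 model is downloaded first, so it remains available if quantization fails
- ✅ **Progressive enhancement** - Catalog entries without a `quantize` section work unchanged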

## User Experience

### Before:
```bash
mecha10 models pull yolov8n
# Wait for download...
cd models/yolov8n/
pip install onnx onnxruntime
python quantize_to_int8.py
# Edit config: "use_int8": true
mecha10 dev
```

### After:
```bash
mecha10 models pull yolov8n
# Wait for download...
# Automatically quantizes to INT8!
# Edit config: "use_int8": true
mecha10 dev
```
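The `use_int8` setting implies that something at load time chooses between the two files. A minimal sketch of that choice follows, assuming the FP32 file is named `model.onnx` and the INT8 file `model-int8.onnx` in the same directory, as in the snippets above. `select_model_file` is a hypothetical helper, not an existing function.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: pick the model file to load.
/// Returns the INT8 file only when it was requested and actually exists;
/// otherwise falls back to the FP32 file.
pub fn select_model_file(model_dir: &Path, use_int8: bool) -> PathBuf {
    let int8 = model_dir.join("model-int8.onnx");
    if use_int8 && int8.is_file() {
        int8
    } else {
        model_dir.join("model.onnx")
    }
}
```

Falling back silently keeps a config with `"use_int8": true` working on machines where quantization was skipped or failed.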

## Dependencies

- **Python 3** (already required for Godot scripts)
- **pip packages:** `onnx` and `onnxruntime` (the CLI should either install these automatically or warn when they are missing)
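
For the "warn" option, the CLI could probe the interpreter before attempting quantization. The sketch below is one possible approach, not the project's implementation: `python_has_modules` is a hypothetical helper that runs `python -c "import ..."` and treats any launch failure or non-zero exit as "not available".

```rust
use std::process::{Command, Stdio};

/// Hypothetical helper: return true if `python` can import every module
/// in `modules`. A missing interpreter or a failed import yields false.
pub fn python_has_modules(python: &str, modules: &[&str]) -> bool {
    // Builds e.g. `import onnx, onnxruntime`.
    let import_stmt = format!("import {}", modules.join(", "));
    Command::new(python)
        .arg("-c")
        .arg(import_stmt)
        // Silence Python's own output; only the exit status matters here.
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}
```

A caller might run `python_has_modules("python3", &["onnx", "onnxruntime"])` and, on `false`, print the `pip install onnx onnxruntime` hint instead of starting the quantization step.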

## Rollout Plan

1. **Phase 1:** Add optional quantization to ModelService
2. **Phase 2:** Enable for yolov8n in catalog
3. **Phase 3:** Document in GETTING_STARTED.md
4. **Phase 4:** Enable for all object detection models