mecha10-cli 0.1.47

# Automatic INT8 Model Quantization

Mecha10 supports **automatic INT8 quantization** for ONNX models, enabled per model in the model catalog. When quantization is enabled, `mecha10 models pull` converts the model to INT8 right after downloading it, with no manual steps.

## Overview

**INT8 quantization** reduces model size and increases inference speed by converting 32-bit floating-point weights to 8-bit integers. This typically results in:
- **2x faster inference** (50-70ms → 25-35ms)
- **4x smaller model size** (8 MB → 2 MB)
- **Minimal accuracy loss** (<1% for most models)
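
To make the size and accuracy trade-off concrete, here is a minimal pure-Python sketch of affine 8-bit quantization. It illustrates the general idea only and is not the code Mecha10 or ONNX Runtime runs; the helper names `quantize_uint8` and `dequantize` are hypothetical.

```python
import struct

def quantize_uint8(weights):
    """Map floats onto 0..255 with an affine scale and zero point."""
    lo = min(weights + [0.0])  # the representable range must include 0.0
    hi = max(weights + [0.0])
    scale = (hi - lo) / 255 or 1.0
    zero_point = round(-lo / scale)
    q = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the 8-bit values."""
    return [(v - zero_point) * scale for v in q]

weights = [-0.42, 0.0, 0.13, 0.87, -0.05]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)

fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))  # 4 bytes per weight
int8_bytes = len(bytes(q))                                   # 1 byte per weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(fp32_bytes, int8_bytes, max_error)
```

Each weight shrinks from four bytes to one, which is where the roughly 4x size reduction comes from, and the round-trip error here stays within half a quantization step.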

## How It Works

### Automatic Quantization Flow

```
mecha10 models pull yolov8n
    ↓
1. Download model.onnx from HuggingFace
    ↓
2. Download labels and generate config
    ↓
3. Check model catalog for quantize.enabled
    ↓
4. [IF ENABLED] Automatically quantize to model-int8.onnx
    ↓
5. Cache both FP32 and INT8 models
    ↓
✅ Model ready to use
```
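
The flow above, together with the skip-if-cached behaviour described under Architecture, can be sketched as a small planning function. This is an illustrative Python model of the decision logic, not the CLI's Rust implementation; `plan_pull` and the step strings are hypothetical names.

```python
import tempfile
from pathlib import Path

def plan_pull(model_dir, quantize_enabled):
    """Return the steps a pull would still need to run for this model directory."""
    model_dir = Path(model_dir)
    steps = []
    if not (model_dir / "model.onnx").exists():
        steps.append("download model.onnx")
    # Quantize only when the catalog enables it and no cached INT8 file exists.
    if quantize_enabled and not (model_dir / "model-int8.onnx").exists():
        steps.append("quantize to model-int8.onnx")
    return steps

with tempfile.TemporaryDirectory() as tmp:
    first_pull = plan_pull(tmp, quantize_enabled=True)
    # Simulate the files a completed pull leaves behind.
    (Path(tmp) / "model.onnx").write_bytes(b"fp32")
    (Path(tmp) / "model-int8.onnx").write_bytes(b"int8")
    second_pull = plan_pull(tmp, quantize_enabled=True)

print(first_pull)
print(second_pull)
```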

### Model Selection at Runtime

The object detector node automatically selects the appropriate model:

```rust
// If use_int8: true in node config
if config.use_int8 {
    // Load models/yolov8n/model-int8.onnx
} else {
    // Load models/yolov8n/model.onnx
}
```
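
As a rough illustration of that selection, the sketch below derives the INT8 file name from the configured `model_path`. It is written in Python for brevity and assumes the `-int8` suffix rule implied by the file layout in this document; `resolve_model_path` is a hypothetical helper, not the node's actual code.

```python
from pathlib import Path

def resolve_model_path(model_path, use_int8):
    """Pick the FP32 file, or its '-int8' sibling when use_int8 is set."""
    path = Path(model_path)
    if not use_int8:
        return path
    # model.onnx -> model-int8.onnx, in the same directory
    return path.with_name(f"{path.stem}-int8{path.suffix}")

fp32_path = resolve_model_path("models/yolov8n/model.onnx", use_int8=False)
int8_path = resolve_model_path("models/yolov8n/model.onnx", use_int8=True)
print(fp32_path)
print(int8_path)
```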

## Enabling Quantization

### For Catalog Models

Quantization is controlled in `model_catalog.toml`:

```toml
[[models]]
name = "yolov8n"
description = "YOLOv8 Nano - Fast object detection"
task = "object-detection"
repo = "deepghs/yolos"
filename = "yolov8n/model.onnx"
preprocessing_preset = "yolo"

# Enable automatic INT8 quantization
[models.quantize]
enabled = true
method = "dynamic_int8"
```

### For Custom Models

To quantize a custom model:

1. Pull the model without quantization
2. Use the embedded quantization script:

```bash
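# The CLI writes its embedded script to a temp directory during model pull.
# Alternatively, run the equivalent standalone call:
python3 -c "
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization needs no calibration data.
# Older onnxruntime releases also accepted optimize_model=True;
# recent releases no longer take that argument, so it is omitted here.
quantize_dynamic(
    model_input='models/custom/model.onnx',
    model_output='models/custom/model-int8.onnx',
    weight_type=QuantType.QUInt8,
    per_channel=False,
)
"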
```

3. Enable in node config: `"use_int8": true`

## Using INT8 Models

### Enable in Node Configuration

Edit `configs/common/nodes/object-detector/config.json`:

```json
{
  "model_path": "models/yolov8n/model.onnx",
  "use_int8": true,
  "input_size": 320,
  ...
}
```

### Verify INT8 Model Exists

```bash
ls -lh models/yolov8n/
# Should see both:
#   model.onnx       (FP32, ~8 MB)
#   model-int8.onnx  (INT8, ~2 MB)
```
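
The same check can be scripted. The sketch below compares the two file sizes and reports their ratio; `size_report` is a hypothetical helper, and the demo uses small placeholder files rather than real models.

```python
import tempfile
from pathlib import Path

def size_report(model_dir):
    """Compare FP32 and INT8 file sizes in one model directory."""
    model_dir = Path(model_dir)
    fp32 = (model_dir / "model.onnx").stat().st_size
    int8 = (model_dir / "model-int8.onnx").stat().st_size
    return {"fp32_bytes": fp32, "int8_bytes": int8, "ratio": fp32 / int8}

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder files standing in for an 8 MB and a 2 MB model.
    (Path(tmp) / "model.onnx").write_bytes(b"\0" * 8000)
    (Path(tmp) / "model-int8.onnx").write_bytes(b"\0" * 2000)
    report = size_report(tmp)

print(report)
```

For a real check, call `size_report("models/yolov8n")`; a ratio well below 4 may indicate that only part of the model was quantized.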

### Check Inference Performance

```bash
mecha10 dev

# Watch object detector logs for inference timing:
# [INFO] Inference: 35ms (FP32)
# vs
# [INFO] Inference: 18ms (INT8)
```
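
To measure latency outside the node, a small timing harness is enough. The sketch below is generic: `median_latency_ms` is a hypothetical helper that times any callable, and the ONNX Runtime usage in the trailing comment assumes `onnxruntime` and `numpy` are installed.

```python
import statistics
import time

def median_latency_ms(fn, warmup=3, runs=20):
    """Run fn a few times to warm up, then return the median wall time in ms."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

calls = []
latency = median_latency_ms(lambda: calls.append(1), warmup=2, runs=5)
print(f"{latency:.4f} ms over {len(calls)} calls")

# With ONNX Runtime (not executed here):
#   import numpy as np, onnxruntime as ort
#   session = ort.InferenceSession("models/yolov8n/model-int8.onnx")
#   name = session.get_inputs()[0].name
#   x = np.zeros((1, 3, 320, 320), dtype=np.float32)  # input shape is an assumption
#   print(median_latency_ms(lambda: session.run(None, {name: x})))
```

Using the median rather than the mean keeps one slow outlier run from skewing the comparison between FP32 and INT8.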

## Requirements

### Python Environment

Automatic quantization requires:
- Python 3 (`python3` or `python` in PATH)
- `onnx` and `onnxruntime` packages

Install dependencies:

```bash
# macOS
brew install python3
pip3 install onnx onnxruntime

# Ubuntu/Debian
apt install python3 python3-pip
pip3 install onnx onnxruntime

# Windows
# Install Python from python.org
pip install onnx onnxruntime
```

### Fallback Behavior

If Python or required packages are missing:
- Quantization will fail with a helpful error message
- FP32 model will still be available and functional
- You can install dependencies and re-pull the model
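
Before pulling a model, the requirements can be checked up front. The following sketch is an assumption about how such a preflight could look, not the CLI's own check; `missing_requirements` is a hypothetical helper, and it inspects the interpreter running the script, which may differ from the one the CLI invokes.

```python
import importlib.util
import shutil

def missing_requirements(packages=("onnx", "onnxruntime")):
    """List what appears to be missing for automatic quantization."""
    problems = []
    if shutil.which("python3") is None and shutil.which("python") is None:
        problems.append("no python interpreter on PATH")
    for name in packages:
        # find_spec returns None when a top-level package cannot be imported.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    return problems

# Demo with one stdlib module and one name that should not exist.
report = missing_requirements(("json", "surely_not_installed_pkg_0"))
print(report)
```

Running `missing_requirements()` with the defaults returns an empty list when both `onnx` and `onnxruntime` are importable.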

## Architecture

### Implementation Details

1. **Catalog Schema**: `ModelCatalogEntry` includes optional `QuantizeConfig`
2. **Embedded Script**: Python quantization script is embedded in CLI binary
3. **Automatic Execution**: Called by `ModelService::pull()` after model download
4. **Caching**: INT8 models are cached like FP32 (skip if already exists)
5. **Runtime Selection**: Node loads INT8 or FP32 based on `use_int8` config

### File Structure

```
models/yolov8n/
├── model.onnx         # FP32 model (from HuggingFace)
├── model-int8.onnx    # INT8 quantized (auto-generated, gitignored)
├── labels.txt         # Class labels
└── config.json        # Model metadata
```

### Quantization Method

**Dynamic INT8 Quantization**:
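- Quantizes weights to 8-bit integers ahead of time, when the model is pulled
- Activations stay FP32 in the graph; their quantization parameters are computed on the fly at inference time (hence "dynamic")
- No calibration data required
- Works with most ONNX models (see Troubleshooting for exceptions)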
- ~2x speedup with minimal accuracy loss
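
The "dynamic" part can be illustrated with a toy dot product. The sketch below is a simplification that uses symmetric scaling and plain Python integers, whereas ONNX Runtime uses its own quantized operators; `quantize_symmetric` and `dynamic_dot` are hypothetical names.

```python
def quantize_symmetric(values):
    """Return (integers in -127..127, scale) using the largest magnitude as range."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

# Ahead of time: weights are quantized once and stored as integers.
weights = [0.5, -0.25, 0.75]
w_q, w_scale = quantize_symmetric(weights)

def dynamic_dot(activations):
    # At inference: the activation scale is derived from this input alone,
    # which is why no calibration dataset is needed.
    a_q, a_scale = quantize_symmetric(activations)
    acc = sum(a * w for a, w in zip(a_q, w_q))  # integer multiply-accumulate
    return acc * a_scale * w_scale              # rescale the result to float

activations = [0.9, 2.0, -1.1]
exact = sum(a * w for a, w in zip(activations, weights))
approx = dynamic_dot(activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The integer result tracks the floating-point one closely, which is the intuition behind the small accuracy loss quoted above.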

## Performance Benchmarks

| Model | Input Size | FP32 (ms) | INT8 (ms) | Speedup |
|-------|-----------|-----------|-----------|---------|
| YOLOv8n | 320x320 | 35 | 18 | **1.9x** |
| YOLOv8n | 640x640 | 140 | 70 | **2.0x** |
| YOLOv8s | 320x320 | 65 | 32 | **2.0x** |

*Tested on M3 MacBook Pro with CoreML acceleration*

## Troubleshooting

### "Python 3 not found"

```bash
# Install Python
brew install python3  # macOS
apt install python3   # Ubuntu

# Verify
python3 --version
```

### "onnxruntime not found"

```bash
# Install dependencies
pip3 install onnx onnxruntime

# Verify
python3 -c "import onnxruntime; print(onnxruntime.__version__)"
```

### "Quantization failed"

Check the Python script's output:

```bash
# Re-pull model with verbose logging
RUST_LOG=debug mecha10 models pull yolov8n
```

Common issues:
- Corrupt ONNX model: Re-download with `mecha10 models remove yolov8n && mecha10 models pull yolov8n`
- Unsupported model format: Some models can't be quantized (use FP32)
- Disk space: the INT8 copy is stored alongside the FP32 model and needs roughly a quarter of its size (~2 MB for YOLOv8n)

### INT8 Model Not Being Used

1. Check node config: `"use_int8": true`
2. Verify INT8 model exists: `ls models/yolov8n/model-int8.onnx`
3. Check logs for model loading: `mecha10 dev | grep "Loading model"`
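
The first two checks can be automated with a short script. This is a sketch under the assumptions that the node config is plain JSON with `model_path` and `use_int8` keys, and that the INT8 file sits next to the FP32 file; `diagnose_int8` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

def diagnose_int8(config_path, project_root="."):
    """Return reasons why the INT8 model would not be picked up."""
    config = json.loads(Path(config_path).read_text())
    findings = []
    if config.get("use_int8") is not True:
        findings.append('"use_int8" is not true in node config')
    fp32_path = Path(project_root) / config["model_path"]
    int8_path = fp32_path.with_name("model-int8.onnx")
    if not int8_path.exists():
        findings.append(f"missing file: {int8_path.name}")
    return findings

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "models" / "yolov8n").mkdir(parents=True)
    cfg = root / "config.json"
    cfg.write_text(json.dumps({"model_path": "models/yolov8n/model.onnx",
                               "use_int8": True}))
    before = diagnose_int8(cfg, project_root=root)
    (root / "models" / "yolov8n" / "model-int8.onnx").write_bytes(b"int8")
    after = diagnose_int8(cfg, project_root=root)

print(before)
print(after)
```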

## Best Practices

1. **Development**: Use FP32 for maximum accuracy during development
2. **Production**: Enable INT8 for speed in production deployments
3. **Testing**: Compare FP32 vs INT8 accuracy on your specific use case
4. **Input Size**: Combine with reduced input size (320x320) for maximum speed
5. **Hardware Acceleration**: Use INT8 with CoreML/CUDA for best performance
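
For the FP32 vs INT8 comparison in item 3, one simple starting point is to run both models on the same input and measure how far the raw outputs drift apart. The sketch below works on flat lists of numbers; `output_drift` is a hypothetical helper, and a real evaluation should also compare task metrics such as detection accuracy on your own data.

```python
import math

def output_drift(fp32_out, int8_out):
    """Return (max absolute difference, cosine similarity) of two output vectors."""
    max_abs = max(abs(a - b) for a, b in zip(fp32_out, int8_out))
    dot = sum(a * b for a, b in zip(fp32_out, int8_out))
    norm = math.sqrt(sum(a * a for a in fp32_out)) * math.sqrt(sum(b * b for b in int8_out))
    cosine = dot / norm if norm else 1.0
    return max_abs, cosine

# Made-up numbers standing in for flattened model outputs.
fp32_scores = [0.91, 0.05, 0.02, 0.77]
int8_scores = [0.90, 0.06, 0.02, 0.75]
max_abs, cosine = output_drift(fp32_scores, int8_scores)
print(f"max_abs={max_abs:.3f} cosine={cosine:.5f}")

# With ONNX Runtime, each list could come from something like
#   session.run(None, {input_name: x})[0].ravel().tolist()
```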

## See Also

- [Model Service API](../src/services/model_service.rs)
- [Object Detector Configuration](../../nodes/object-detector/src/config.rs)
- [Model Catalog](../model_catalog.toml)
- [ONNX Runtime Quantization Docs](https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html)