# Automatic INT8 Model Quantization
Mecha10 supports **automatic INT8 quantization** for ONNX models, enabled per-model in the model catalog. When quantization is enabled for a model, `mecha10 models pull` converts it to INT8 right after the download finishes, with no manual steps.
## Overview
**INT8 quantization** reduces model size and increases inference speed by converting 32-bit floating-point weights to 8-bit integers. This typically results in:
- **2x faster inference** (50-70ms → 25-35ms)
- **4x smaller model size** (8 MB → 2 MB)
- **Minimal accuracy loss** (<1% for most models)
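
To make the size and accuracy trade-off concrete, here is a minimal sketch of per-tensor affine quantization in plain Python. It illustrates the general technique only; it is not the code Mecha10 or ONNX Runtime runs, and `quantize_uint8` and `dequantize` are hypothetical names chosen for this example.

```python
import struct

def quantize_uint8(weights):
    """Map floats to 0..255 using one scale and zero point for the whole tensor."""
    lo, hi = min(weights), max(weights)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # include 0 so that 0.0 maps to an exact integer
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 for an all-zero tensor
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the 8-bit integers."""
    return [(v - zero_point) * scale for v in q]

weights = [-0.42, 0.0, 0.13, 0.91, -1.27, 0.58]   # illustrative values only
q, scale, zp = quantize_uint8(weights)
restored = dequantize(q, scale, zp)

fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))  # 4 bytes per weight
int8_bytes = len(bytes(q))                                    # 1 byte per weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{fp32_bytes} bytes -> {int8_bytes} bytes, max error {max_error:.4f}")
```

Each weight shrinks from four bytes to one, which is where the roughly 4x size reduction comes from, and the rounding error per weight stays on the order of one quantization step (`scale`).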
## How It Works
### Automatic Quantization Flow
```
mecha10 models pull yolov8n
↓
1. Download model.onnx from HuggingFace
↓
2. Download labels and generate config
↓
3. Check model catalog for quantize.enabled
↓
4. [IF ENABLED] Automatically quantize to model-int8.onnx
↓
5. Cache both FP32 and INT8 models
↓
✅ Model ready to use
```
### Model Selection at Runtime
The object detector node automatically selects the appropriate model:
```rust
// If use_int8: true in node config
if config.use_int8 {
    // Load models/yolov8n/model-int8.onnx
} else {
    // Load models/yolov8n/model.onnx
}
```
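
The selection rule can be sketched as a small path helper. This is a Python illustration under an assumption: that the INT8 file always sits next to the FP32 file and is named by inserting `-int8` before the extension, as in the file layout described in this document. The node's actual Rust implementation is not shown here, and `resolve_model_path` is a hypothetical name.

```python
from pathlib import Path

def resolve_model_path(model_path: str, use_int8: bool) -> Path:
    """Return the INT8 sibling of model_path when use_int8 is set, else model_path."""
    fp32 = Path(model_path)
    if not use_int8:
        return fp32
    # model.onnx -> model-int8.onnx, in the same directory
    return fp32.with_name(f"{fp32.stem}-int8{fp32.suffix}")

print(resolve_model_path("models/yolov8n/model.onnx", use_int8=True))
print(resolve_model_path("models/yolov8n/model.onnx", use_int8=False))
```

Deriving the INT8 path from `model_path` means the node config keeps pointing at the FP32 file, and switching precision only requires flipping `use_int8`.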
## Enabling Quantization
### For Catalog Models
Quantization is controlled in `model_catalog.toml`:
```toml
[[models]]
name = "yolov8n"
description = "YOLOv8 Nano - Fast object detection"
task = "object-detection"
repo = "deepghs/yolos"
filename = "yolov8n/model.onnx"
preprocessing_preset = "yolo"
# Enable automatic INT8 quantization
[models.quantize]
enabled = true
method = "dynamic_int8"
```
### For Custom Models
To quantize a custom model:
1. Pull the model without quantization
2. Use the embedded quantization script:
```bash
# Find the script in temp directory during model pull
# Or use the standalone Python script:
python3 -c "
from onnxruntime.quantization import quantize_dynamic, QuantType

# Older onnxruntime releases also accepted optimize_model=True here.
# Newer releases reject that argument, so it is omitted.
quantize_dynamic(
    model_input='models/custom/model.onnx',
    model_output='models/custom/model-int8.onnx',
    weight_type=QuantType.QUInt8,
    per_channel=False,
)
"
```
3. Enable in node config: `"use_int8": true`
## Using INT8 Models
### Enable in Node Configuration
Edit `configs/common/nodes/object-detector/config.json`:
```json
{
  "model_path": "models/yolov8n/model.onnx",
  "use_int8": true,
  "input_size": 320,
  ...
}
```
### Verify INT8 Model Exists
```bash
ls -lh models/yolov8n/
# Should see both:
# model.onnx (FP32, ~8 MB)
# model-int8.onnx (INT8, ~2 MB)
```
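
The same check can be scripted, for example in a deployment step. The sketch below assumes the two file names shown above; `size_ratio` is a hypothetical helper, not part of the Mecha10 CLI.

```python
from pathlib import Path

def size_ratio(model_dir: str) -> float:
    """Return FP32 file size divided by INT8 file size for the models in model_dir."""
    fp32 = Path(model_dir) / "model.onnx"
    int8 = Path(model_dir) / "model-int8.onnx"
    for path in (fp32, int8):
        if not path.is_file():
            raise FileNotFoundError(f"missing {path}; re-run the model pull")
    return fp32.stat().st_size / int8.stat().st_size

try:
    print(f"FP32 is {size_ratio('models/yolov8n'):.1f}x the size of INT8")
except FileNotFoundError as err:
    print(err)
```

A ratio far below 4 may mean that only part of the model was quantized, which is worth investigating before relying on the INT8 file.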
### Check Inference Performance
```bash
mecha10 dev
# Watch object detector logs for inference timing:
# [INFO] Inference: 35ms (FP32)
# vs
# [INFO] Inference: 18ms (INT8)
```
## Requirements
### Python Environment
Automatic quantization requires:
- Python 3 (`python3` or `python` in PATH)
- `onnx` and `onnxruntime` packages
Install dependencies:
```bash
# macOS
brew install python3
pip3 install onnx onnxruntime
# Ubuntu/Debian
apt install python3 python3-pip
pip3 install onnx onnxruntime
# Windows
# Install Python from python.org
pip install onnx onnxruntime
```
### Fallback Behavior
If Python or required packages are missing:
- Quantization will fail with a helpful error message
- FP32 model will still be available and functional
- You can install dependencies and re-pull the model
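
Before re-pulling, you can check the dependencies yourself with a short preflight script. This is a sketch, not the check the CLI performs: it inspects the interpreter that runs the script, which may differ from the one Mecha10 invokes, and the helper names are hypothetical.

```python
import importlib.util
import shutil

REQUIRED_PACKAGES = ("onnx", "onnxruntime")

def missing_packages(names=REQUIRED_PACKAGES):
    """Return the packages from names that the current interpreter cannot import."""
    return [n for n in names if importlib.util.find_spec(n) is None]

def find_python():
    """Return the first of python3/python found on PATH, or None."""
    return shutil.which("python3") or shutil.which("python")

if find_python() is None:
    print("Python 3 not found on PATH")

missing = missing_packages()
if missing:
    print("Missing packages; install with: pip3 install " + " ".join(missing))
else:
    print("Quantization dependencies are available")
```

Run it with the same `python3` that is on your PATH so the result reflects the environment the quantization step is likely to see.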
## Architecture
### Implementation Details
1. **Catalog Schema**: `ModelCatalogEntry` includes optional `QuantizeConfig`
2. **Embedded Script**: Python quantization script is embedded in CLI binary
3. **Automatic Execution**: Called by `ModelService::pull()` after model download
4. **Caching**: INT8 models are cached like FP32 (skip if already exists)
5. **Runtime Selection**: Node loads INT8 or FP32 based on `use_int8` config
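
The caching behavior in steps 3 and 4 can be sketched as follows. The real logic lives in the Rust `ModelService::pull()` and is not reproduced here; this Python illustration only mirrors the described "skip if already exists" rule, and `plan_pull` is a hypothetical name.

```python
import tempfile
from pathlib import Path

def plan_pull(model_dir: Path, quantize_enabled: bool) -> list[str]:
    """List the steps a pull would still need to run, skipping cached files."""
    steps = []
    if not (model_dir / "model.onnx").exists():
        steps.append("download")
    if quantize_enabled and not (model_dir / "model-int8.onnx").exists():
        steps.append("quantize")
    return steps

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    print(plan_pull(model_dir, quantize_enabled=True))   # nothing cached yet
    (model_dir / "model.onnx").touch()
    print(plan_pull(model_dir, quantize_enabled=True))   # FP32 cached, INT8 missing
    (model_dir / "model-int8.onnx").touch()
    print(plan_pull(model_dir, quantize_enabled=True))   # both cached
```

Under this rule, re-pulling after installing the Python dependencies only needs to run the quantization step, because the FP32 download is already cached.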
### File Structure
```
models/yolov8n/
├── model.onnx # FP32 model (from HuggingFace)
├── model-int8.onnx # INT8 quantized (auto-generated, gitignored)
├── labels.txt # Class labels
└── config.json # Model metadata
```
### Quantization Method
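**Dynamic INT8 Quantization**:
- Weights are quantized to INT8 once, when the model is converted (not at load time)
- Activations stay FP32 in the model file; their quantization parameters are computed on the fly at inference time (hence "dynamic")
- No calibration data required
- Works with most ONNX models; a model that fails to quantize can still run as FP32
- ~2x speedup with minimal accuracy loss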
## Performance Benchmarks
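| Model | Input Size | FP32 (ms) | INT8 (ms) | Speedup |
|-------|------------|-----------|-----------|---------|
| YOLOv8n | 320x320 | 35 | 18 | **1.9x** |
| YOLOv8n | 640x640 | 140 | 70 | **2.0x** |
| YOLOv8s | 320x320 | 65 | 32 | **2.0x** |
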
*Tested on M3 MacBook Pro with CoreML acceleration*
## Troubleshooting
### "Python 3 not found"
```bash
# Install Python
brew install python3 # macOS
apt install python3 # Ubuntu
# Verify
python3 --version
```
### "onnxruntime not found"
```bash
# Install dependencies
pip3 install onnx onnxruntime
# Verify
python3 -c "import onnxruntime; print(onnxruntime.__version__)"
```
### "Quantization failed"
Check Python script output:
```bash
# Re-pull model with verbose logging
RUST_LOG=debug mecha10 models pull yolov8n
```
Common issues:
- Corrupt ONNX model: Re-download with `mecha10 models remove yolov8n && mecha10 models pull yolov8n`
- Unsupported model format: Some models can't be quantized (use FP32)
- Disk space: INT8 model requires additional ~2 MB per model
### INT8 Model Not Being Used
1. Check node config: `"use_int8": true`
2. Verify INT8 model exists: `ls models/yolov8n/model-int8.onnx`
3. Check logs for model loading: `mecha10 dev | grep "Loading model"`
## Best Practices
1. **Development**: Use FP32 for maximum accuracy during development
2. **Production**: Enable INT8 for speed in production deployments
3. **Testing**: Compare FP32 vs INT8 accuracy on your specific use case
4. **Input Size**: Combine with reduced input size (320x320) for maximum speed
5. **Hardware Acceleration**: Use INT8 with CoreML/CUDA for best performance
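
For the testing step, one simple way to compare the two models is to run both on the same inputs and measure how often they agree on the top-scoring class. The sketch below is a simplified illustration with made-up scores; a full detector evaluation would compare boxes as well (for example with mAP), and `top1_agreement` is a hypothetical helper.

```python
def top1_agreement(fp32_scores, int8_scores):
    """Fraction of samples where both models rank the same class highest."""
    if len(fp32_scores) != len(int8_scores) or not fp32_scores:
        raise ValueError("score lists must be non-empty and the same length")

    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(argmax(a) == argmax(b) for a, b in zip(fp32_scores, int8_scores))
    return matches / len(fp32_scores)

# Hypothetical per-image class scores from each model (illustrative numbers only).
fp32 = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
int8 = [[0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.4, 0.3, 0.3]]
print(f"top-1 agreement: {top1_agreement(fp32, int8):.2f}")
```

If agreement on your own data falls well below what your application tolerates, staying on FP32 for that model is the safer choice.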
## See Also
- [Model Service API](../src/services/model_service.rs)
- [Object Detector Configuration](../../nodes/object-detector/src/config.rs)
- [Model Catalog](../model_catalog.toml)
- [ONNX Runtime Quantization Docs](https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html)