kenosis-cli 1.2.1

CLI for ONNX model quantization, casting, inspection, and comparison — powered by kenosis-core
<div align="center">

# Kenosis

**Production-grade ONNX model quantization. Zero Python. Single Native Binary.**

[![CI](https://github.com/CoreEpoch/kenosis/actions/workflows/ci.yml/badge.svg)](https://github.com/CoreEpoch/kenosis/actions)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache--2.0-blue.svg)](LICENSE)

</div>

---

Kenosis is a Rust CLI toolkit for quantizing, validating, inspecting, and comparing ONNX models. Its flagship feature is **static INT8 quantization** (enabled by default) with fusion-aware QDQ placement. On stock ONNX Runtime, with no custom operators, it achieves up to **2.43× speedup** over FP32 baselines and up to **63% faster inference** than models produced by the ORT Python quantizer. Evaluated on **1,000 real-world images** (500 from ImageNet-1K and 500 from MS COCO val2017), Kenosis INT8 achieves **92.0–95.6% Top-1 prediction agreement** with FP32 baselines on standard classifiers (ResNet50 v2 and MobileNetV2).

> **How it works:** [Fusion-Aware QDQ Placement: Achieving Native Kernel Fusion in ONNX via Graph Reordering](https://coreepoch.dev/research/kenosis-fusion-aware-qdq-placement.html)

## Benchmark Results

Kenosis quantizes **PP-YOLOE+ object detection models**, which use an anchor-free architecture designed for efficient edge deployment:

| Model | Resolution | Cosine | Latency (INT8 vs FP32) | Speedup | INT8 Size |
|-------|-----------|--------|---------|---------|------|
| PP-YOLOE+ Small | 320×320 | 0.998 | **23ms** vs 44ms | **1.89×** | 7.9 MB (3.9× smaller) |
| PP-YOLOE+ Small | 416×416 | 0.998 | **43ms** vs 77ms | **1.80×** | 7.9 MB (3.9× smaller) |
| PP-YOLOE+ Small | 640×640 | 0.999 | **111ms** vs 187ms | **1.68×** | 7.9 MB (3.8× smaller) |


### Classifier Benchmarks (Kenosis PT/PC vs FP32 Baseline)

Fidelity is evaluated on **500 images** from the ImageNet-1K validation set with model-specific preprocessing. PT and PC denote per-tensor and per-channel weight quantization, respectively.

| Model | Config | Cosine | Top-1 Agree | Latency | Speedup | INT8 Size |
|-------|--------|--------|-------------|---------|---------|-----------|
| SqueezeNet 1.1 | Kenosis (PT) | 0.9206 | 0.0% | **3.08ms** | **2.01×** | 1.24 MB (3.8× smaller) |
| SqueezeNet 1.1 | Kenosis (PC) | 0.9206 | 0.0% | **3.08ms** | **2.01×** | 1.24 MB (3.8× smaller) |
| ResNet50 v2 | Kenosis (PT) | 0.9804 | **94.8%** | **27.89ms** | **2.43×** | 30.6 MB (3.2× smaller) |
| ResNet50 v2 | Kenosis (PC) | 0.9883 | **95.6%** | **28.02ms** | **2.41×** | 30.6 MB (3.2× smaller) |
| MobileNetV2 | Kenosis (PT) | 0.9700 | **92.6%** | **5.16ms** | **1.35×** | 7.10 MB (1.9× smaller) |
| MobileNetV2 | Kenosis (PC) | 0.9739 | **92.0%** | **5.18ms** | **1.35×** | 7.10 MB (1.9× smaller) |
| EfficientNet-Lite4 | Kenosis (PT) | 0.3658 | 17.6% | **18.53ms** | **1.48×** | 16.5 MB (3.0× smaller) |
| EfficientNet-Lite4 | Kenosis (PC) | 0.3821 | 19.2% | **18.68ms** | **1.47×** | 16.5 MB (3.0× smaller) |

> SqueezeNet's cosine similarity is measured with synthetic Gaussian calibration inputs. Its real-image Top-1 agreement is low because the ONNX Zoo SqueezeNet 1.1 model uses non-standard input preprocessing.

### Direct Comparison (Kenosis PT/PC vs ORT Python Quantizer PT)

| Model | Kenosis (PT) Latency | Kenosis (PC) Latency | ORT Latency | Kenosis Advantage |
|-------|----------------------|----------------------|-------------|-------------------|
| SqueezeNet 1.1 | **3.08ms** | **3.08ms** | 8.41ms | **63.4% faster** |
| ResNet50 v2 | **27.89ms** | **28.02ms** | 44.73ms | **37.4% faster** |
| MobileNetV2 | **5.16ms** | **5.18ms** | 6.79ms | **23.7% faster** |
| EfficientNet-Lite4 | **18.53ms** | **18.68ms** | 23.84ms | **21.6% faster** |

> The ORT Python quantizer produces a SqueezeNet model that is **slower than FP32** (8.41ms INT8 vs 6.19ms FP32), a symptom of broken Conv-ReLU fusion. Kenosis eliminates this regression, delivering a 2.01× speedup over FP32 in both PT and PC modes.
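
The speedup and "faster" figures in these tables can be reproduced from the raw latencies. The sketch below assumes that speedup is the ratio of baseline latency to quantized latency and that "X% faster" means the relative reduction in latency. Under that reading, the Kenosis Advantage column is consistent with the PC latencies. The function names are illustrative and are not part of the Kenosis CLI.

```rust
/// Speedup factor of `fast_ms` relative to `base_ms` (for example FP32 / INT8).
fn speedup(base_ms: f64, fast_ms: f64) -> f64 {
    base_ms / fast_ms
}

/// Percentage by which latency drops when going from `base_ms` to `fast_ms`.
fn percent_faster(base_ms: f64, fast_ms: f64) -> f64 {
    (base_ms - fast_ms) / base_ms * 100.0
}
```

For the SqueezeNet row, `percent_faster(8.41, 3.08)` is roughly 63.4 and `speedup(6.19, 3.08)` is roughly 2.01.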

### Real-World Accuracy (Kenosis vs ORT Python)

Evaluated on **500 ImageNet-1K** images (classifiers) and **500 MS COCO val2017** images (detector):

| Model | Dataset | Kenosis (PT) Cos | Kenosis (PC) Cos | ORT Cos | Kenosis (PT) Agree | Kenosis (PC) Agree | ORT Agree |
|-------|---------|------------------|------------------|---------|--------------------|--------------------|-----------|
| ResNet50 v2 | ImageNet-1K (500) | **0.980** | **0.988** | 0.974 | **94.8%** | **95.6%** | 51.4% |
| MobileNetV2 | ImageNet-1K (500) | **0.970** | **0.974** | 0.951 | **92.6%** | **92.0%** | 7.6% |
| EfficientNet-Lite4 | ImageNet-1K (500) | 0.366 | 0.382 | 0.366 | 17.6% | 19.2% | **23.0%** |

| Model | Dataset | Kenosis (PT) Cos | Kenosis (PC) Cos | ORT Cos | Kenosis (PT) Agree | Kenosis (PC) Agree | ORT Agree |
|-------|---------|------------------|------------------|---------|--------------------|--------------------|-----------|
| PP-YOLOE+ Small | COCO val2017 (500) | 0.833 | **0.850** | 0.817 | **99.8%** | **99.8%** | 98.4% |

> The ORT-quantized MobileNetV2 achieves only **7.6% prediction agreement** with FP32. Kenosis maintains **92.0–95.6% agreement** on ResNet50 v2 and MobileNetV2.

## Key Features

| Feature | Kenosis | ORT Python |
|---------|---------|------------|
| Static INT8 with ReLU-aware QDQ | ✅ | ❌ |
| Detection model mixed-precision | ✅ | ❌ |
| Non-vision tensor protection | ✅ | ❌ |
| Multi-input model calibration | ✅ | ❌ |
| Transformer & MatMul quantization | ✅ | ❌ |
| NLP synthetic calibration data | ✅ | ❌ |
| INT32 bias quantization with DequantizeLinear (DQL) | ✅ | ✅ |
| Per-channel weight quantization | ✅ | ✅ |
| Built-in validation + benchmarking | ✅ | ❌ |
| PaddlePaddle Constant extraction | ✅ | ❌ |
| Zero Python dependency | ✅ | ❌ |
| Cross-platform single binary | ✅ | ❌ |

## Install

```bash
cargo install kenosis-cli
```

Or build from source:

```bash
git clone https://github.com/CoreEpoch/kenosis.git
cd kenosis
cargo build --release
```

## Usage

### Static INT8 Quantization (default)

The primary quantization mode. Produces QDQ-format models that run on stock ONNX Runtime with full INT8 acceleration.

```bash
# Standard vision model (SqueezeNet, ResNet, EfficientNet, etc.)
kenosis quantize model.onnx -o model_int8.onnx

# Per-tensor weights (one scale per tensor, override the default per-channel mode)
kenosis quantize model.onnx -o model_int8.onnx --per-tensor

# PaddlePaddle models (PP-YOLOE+, PP-LCNet, etc.)
kenosis quantize ppyoloe.onnx -o ppyoloe_int8.onnx --extract-constants

# Custom calibration sample count
kenosis quantize model.onnx -o model_int8.onnx --n-calib 40

# External calibration data (raw f32 binary files)
kenosis quantize model.onnx -o model_int8.onnx --calib-dir ./calib_data/
```

### Validate Quantized Models

Compare a quantized model against its FP32 baseline — measures cosine similarity, Top-1 agreement, and latency side-by-side.

```bash
# Basic validation (default samples, 200 timed runs)
kenosis validate model.onnx model_int8.onnx

# Custom sample counts
kenosis validate model.onnx model_int8.onnx -n 500 --timed 500
```

Output:
```
  ════════════════════════════════════════════════════════
  📊  Kenosis Validation Report
  ════════════════════════════════════════════════════════
  ▸ Cosine similarity:  0.999128 (min 0.9986)
  ▸ Top-1 agreement:    83/100 (83%)
  ▸ Latency:            2.82ms vs 6.03ms (2.13× speedup)
  ▸ Size:               1.24 MB vs 4.73 MB (3.8× smaller)
  ▸ Verdict:             EXCELLENT — production ready
  ════════════════════════════════════════════════════════
```
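
The two fidelity metrics in the report follow standard definitions and can be sketched in a few lines of Rust. This is an illustration only: it assumes cosine similarity is taken over each sample's flattened output and that Top-1 agreement compares argmax indices. It is not the code that `kenosis validate` runs, and the function names are hypothetical.

```rust
/// Cosine similarity between two equally sized output vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "outputs must have the same length");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        // Accumulate in f64 to limit rounding error on long vectors.
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // Undefined for zero vectors; report 0 by convention here.
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// Index of the largest value (the first one wins on ties).
fn argmax(v: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in v.iter().enumerate() {
        if x > v[best] {
            best = i;
        }
    }
    best
}

/// Fraction of samples where both models predict the same top class.
fn top1_agreement(fp32: &[Vec<f32>], int8: &[Vec<f32>]) -> f64 {
    let n = fp32.len().min(int8.len());
    if n == 0 {
        return 0.0;
    }
    let same = fp32
        .iter()
        .zip(int8)
        .filter(|(a, b)| argmax(a) == argmax(b))
        .count();
    same as f64 / n as f64
}
```

A high cosine similarity does not guarantee a high Top-1 agreement: two output vectors can point in nearly the same direction while their largest entries differ, which is why the report shows both.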

### Inspect a Model

```bash
# Basic stats — ops, params, size, data types, largest tensors
kenosis inspect model.onnx
```

### Utility Commands

```bash
# Cast to FP16/BF16
kenosis cast model.onnx -o model_fp16.onnx --precision fp16

# Compare two models
kenosis diff model.onnx model_int8.onnx
```

## How Static INT8 Works

Kenosis's static INT8 pipeline applies seven coordinated optimizations:

1. **Self-calibration** — Automatically generates synthetic calibration inputs and runs them through the model via ONNX Runtime to collect per-tensor activation ranges. No external calibration data required. Multi-input models and NLP inputs (token IDs, attention masks) are handled automatically.

2. **Weight quantization** — INT8 symmetric, per-tensor or per-channel. All scale computations are performed in f64 to match ORT's internal precision.

3. **INT32 bias quantization** — `scale = activation_scale × weight_scale`, `zero_point = 0`. The quantized bias is wrapped with DequantizeLinear so that ORT can fuse it into the kernel.

4. **Zero-point nudged activation quantization** — UINT8 asymmetric with post-hoc range adjustment ensuring `float 0.0` maps exactly to the quantized zero. Prevents rounding asymmetry from compounding across layers.

5. **Fusion-aware QDQ placement** — ORT's Python quantizer places QDQ nodes on every Conv/MatMul output independently. Kenosis detects `Conv/MatMul → Activation` pairs (ReLU, LeakyRelu, Clip, HardSwish, Sigmoid) at graph level and places QDQ *after* the activation instead. This gives ORT's runtime optimizer a cleaner pattern that fuses into a single INT8 kernel. Combined with second-pass wrapping of Add, Concat, MaxPool, and AveragePool, this maximizes QLinear fusions.

6. **Non-vision tensor protection** — For multi-input models (detection, segmentation), tensors reachable from non-primary inputs (scale_factor, image_shape) are traced through the graph and excluded from quantization. This prevents metadata paths from being crushed by INT8 range limits.

7. **Model output protection** — Tensors that are direct model outputs are never QDQ-wrapped, preserving full FP32 precision in detection head scores and bounding box coordinates.
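
The arithmetic behind steps 2 to 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the kenosis-core implementation: it assumes the symmetric weight range is [-127, 127], that activation ranges are first widened to include 0.0, and that nudging means recomputing the range from the rounded zero point. All function names are hypothetical.

```rust
/// Symmetric INT8 scale for one tensor or one channel: max|w| maps to 127.
fn symmetric_scale(weights: &[f32]) -> f64 {
    let max_abs = weights.iter().fold(0.0f64, |m, &w| m.max((w as f64).abs()));
    if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 }
}

/// Quantize a single weight with a symmetric scale (zero point is 0).
fn quantize_weight(w: f32, scale: f64) -> i8 {
    (w as f64 / scale).round().clamp(-127.0, 127.0) as i8
}

/// Asymmetric UINT8 parameters, nudged so that 0.0 is exactly representable.
/// Returns (scale, zero_point, nudged_min, nudged_max).
fn nudged_uint8_params(min: f64, max: f64) -> (f64, u8, f64, f64) {
    // Widen the observed range so it always contains 0.0.
    let (min, max) = (min.min(0.0), max.max(0.0));
    let scale = if max > min { (max - min) / 255.0 } else { 1.0 };
    // Round the ideal zero point to an integer, then shift the range to match it.
    let zero_point = (-min / scale).round().clamp(0.0, 255.0);
    let nudged_min = -zero_point * scale;
    let nudged_max = (255.0 - zero_point) * scale;
    (scale, zero_point as u8, nudged_min, nudged_max)
}

/// Quantize one activation value to UINT8.
fn quantize_activation(x: f64, scale: f64, zero_point: u8) -> u8 {
    ((x / scale).round() + zero_point as f64).clamp(0.0, 255.0) as u8
}

/// INT32 bias: scale is the product of activation and weight scales, zero point is 0.
fn quantize_bias(bias: f32, activation_scale: f64, weight_scale: f64) -> i32 {
    let bias_scale = activation_scale * weight_scale;
    (bias as f64 / bias_scale).round() as i32
}
```

In per-channel mode, `symmetric_scale` would be called once per output-channel slice instead of once for the whole tensor. For an observed activation range of [-1.0, 3.0], the sketch yields a zero point of 64, and quantizing 0.0 returns exactly 64, so dequantizing it gives 0.0 with no rounding error.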

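The placement decision in step 5 can be illustrated on a toy graph. The sketch below is an assumption-laden simplification rather than the kenosis-core algorithm: `Node` is a hypothetical stand-in for an ONNX node, the activation list uses ONNX op names, and the rule that the Conv/MatMul output must have exactly one consumer is an assumption made here to keep the example safe. The second pass over Add, Concat, MaxPool, and AveragePool is not modeled.

```rust
use std::collections::HashMap;

/// Minimal stand-in for an ONNX node (hypothetical, not the kenosis-core type).
struct Node {
    op_type: &'static str,
    inputs: Vec<&'static str>,
    outputs: Vec<&'static str>,
}

/// Activations that may follow a Conv/MatMul and still fuse (ONNX op names).
const FUSABLE_ACTIVATIONS: [&str; 5] = ["Relu", "LeakyRelu", "Clip", "HardSwish", "Sigmoid"];

/// Returns the tensor names after which a Q/DQ pair would be inserted.
fn qdq_tensors(nodes: &[Node]) -> Vec<&'static str> {
    // Map each tensor name to the nodes that consume it.
    let mut consumers: HashMap<&str, Vec<&Node>> = HashMap::new();
    for node in nodes {
        for &input in &node.inputs {
            consumers.entry(input).or_default().push(node);
        }
    }

    let mut placed = Vec::new();
    for node in nodes {
        if node.op_type != "Conv" && node.op_type != "MatMul" {
            continue;
        }
        let out = node.outputs[0];
        // If the only consumer is a fusable activation, quantize after it instead.
        let target = match consumers.get(out).map(|c| c.as_slice()) {
            Some([only]) if FUSABLE_ACTIVATIONS.contains(&only.op_type) => only.outputs[0],
            _ => out,
        };
        placed.push(target);
    }
    placed
}
```

For a `Conv → Relu → Conv → Add` chain, the first Conv's Q/DQ pair moves to the Relu output, while the second Conv, which feeds an Add, keeps its Q/DQ pair on its own output.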

## Detection Model Support

Kenosis handles the specific challenges of quantizing object detection models:

- **Multi-input calibration** — Auto-generates appropriate default values for secondary inputs (scale_factor → 1.0, shape tensors → 0.0)
- **PaddlePaddle weight handling** — Extracts inline Constant nodes, deduplicates shared weights (deepcopy tensors), and upgrades opset attributes (Squeeze, Unsqueeze, BatchNorm, Dropout)
- **Mixed-precision detection head** — Backbone and neck are fully INT8; detection head outputs and metadata paths stay FP32
- **Scale factor preservation** — The bounding box rescaling path remains live and dynamic, not frozen to calibration values
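
Keeping metadata paths in FP32 amounts to a forward reachability trace from the secondary inputs. The following is a minimal sketch of that idea, assuming any tensor downstream of a non-primary input is excluded from quantization. `Op` and `protected_tensors` are hypothetical names, not the kenosis-core API.

```rust
use std::collections::HashSet;

/// Minimal stand-in for a graph node: only tensor names matter for this trace.
struct Op {
    inputs: Vec<&'static str>,
    outputs: Vec<&'static str>,
}

/// Tensors reachable from any secondary input, including the inputs themselves.
fn protected_tensors(ops: &[Op], secondary_inputs: &[&'static str]) -> HashSet<&'static str> {
    let mut protected: HashSet<&'static str> = secondary_inputs.iter().copied().collect();
    // Iterate to a fixed point so the result does not depend on node order.
    let mut changed = true;
    while changed {
        changed = false;
        for op in ops {
            if op.inputs.iter().any(|i| protected.contains(i)) {
                for &out in &op.outputs {
                    // `insert` returns true only when the tensor is newly added.
                    changed |= protected.insert(out);
                }
            }
        }
    }
    protected
}
```

In a graph where `image` feeds a backbone producing `feat`, and `scale_factor` is cast to `sf_cast` and then combined with `feat` to produce `boxes`, the trace marks `scale_factor`, `sf_cast`, and `boxes` as protected while `feat` stays eligible for INT8.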


## Architecture

```
kenosis/
├── crates/
│   └── kenosis-core/           # Library: quantization engine
│       └── src/
│           ├── model.rs        # OnnxModel load/save/traversal + Constant extraction
│           ├── static_int8.rs  # Static INT8 QDQ quantization pipeline
│           ├── inspect.rs      # Stats and analysis
│           ├── cast.rs         # FP16/BF16 casting
│           ├── diff.rs         # Model comparison
│           ├── proto.rs        # ONNX protobuf type definitions
│           └── error.rs        # Error types
├── apps/
│   └── kenosis-cli/            # Binary: CLI interface
│       └── src/commands/
│           ├── quantize.rs     # quantize command (static INT8)
│           ├── validate.rs     # validate command (accuracy + latency)
│           ├── inspect.rs      # inspect command
│           ├── cast.rs         # cast command
│           └── diff.rs         # diff command
```

## License

Apache-2.0 — see [LICENSE](LICENSE).