Kenosis
Production-grade ONNX model quantization. Zero Python. Single Native Binary.
Kenosis is a Rust CLI toolkit for quantizing, validating, inspecting, and comparing ONNX models. Its flagship feature is static INT8 quantization (enabled by default) with fusion-aware QDQ placement, which achieves up to a 2.43× speedup over FP32 baselines and up to 63% faster inference than the ORT Python quantizer on stock ONNX Runtime, with no custom operators required. Evaluated on 1,000 real-world images (500 from ImageNet-1K and 500 from MS COCO val2017), Kenosis INT8 achieves 92.0–95.6% Top-1 prediction agreement with FP32 baselines on standard classifiers.
How it works: Fusion-Aware QDQ Placement: Achieving Native Kernel Fusion in ONNX via Graph Reordering
Benchmark Results
Kenosis quantizes the PP-YOLOE+ object detection models, which use an anchor-free architecture designed for efficient edge deployment:
| Model | Resolution | Cosine | Latency (INT8 vs FP32) | Speedup | INT8 Size |
|---|---|---|---|---|---|
| PP-YOLOE+ Small | 320×320 | 0.998 | 23ms vs 44ms | 1.89× | 7.9 MB (3.9× smaller) |
| PP-YOLOE+ Small | 416×416 | 0.998 | 43ms vs 77ms | 1.80× | 7.9 MB (3.9× smaller) |
| PP-YOLOE+ Small | 640×640 | 0.999 | 111ms vs 187ms | 1.68× | 7.9 MB (3.8× smaller) |
Classifier Benchmarks (Kenosis PT/PC vs FP32 Baseline)
Fidelity is evaluated on 500 images from the ImageNet-1K validation set with model-specific preprocessing. PT denotes per-tensor weight quantization and PC denotes per-channel weight quantization.
| Model | Config | Cosine | Top-1 Agree | Latency | Speedup | INT8 Size |
|---|---|---|---|---|---|---|
| SqueezeNet 1.1 | Kenosis (PT) | 0.9206 | 0.0% | 3.08ms | 2.01× | 1.24 MB (3.8× smaller) |
| SqueezeNet 1.1 | Kenosis (PC) | 0.9206 | 0.0% | 3.08ms | 2.01× | 1.24 MB (3.8× smaller) |
| ResNet50 v2 | Kenosis (PT) | 0.9804 | 94.8% | 27.89ms | 2.43× | 30.6 MB (3.2× smaller) |
| ResNet50 v2 | Kenosis (PC) | 0.9883 | 95.6% | 28.02ms | 2.41× | 30.6 MB (3.2× smaller) |
| MobileNetV2 | Kenosis (PT) | 0.9700 | 92.6% | 5.16ms | 1.35× | 7.10 MB (1.9× smaller) |
| MobileNetV2 | Kenosis (PC) | 0.9739 | 92.0% | 5.18ms | 1.35× | 7.10 MB (1.9× smaller) |
| EfficientNet-Lite4 | Kenosis (PT) | 0.3658 | 17.6% | 18.53ms | 1.48× | 16.5 MB (3.0× smaller) |
| EfficientNet-Lite4 | Kenosis (PC) | 0.3821 | 19.2% | 18.68ms | 1.47× | 16.5 MB (3.0× smaller) |
The SqueezeNet cosine is measured on synthetic Gaussian calibration inputs. Its real-image accuracy metrics are low because the ONNX Zoo SqueezeNet 1.1 model uses non-standard input preprocessing.
Direct Comparison (Kenosis PT/PC vs ORT Python Quantizer PT)
| Model | Kenosis (PT) Latency | Kenosis (PC) Latency | ORT Latency | Kenosis (PC) Advantage vs ORT |
|---|---|---|---|---|
| SqueezeNet 1.1 | 3.08ms | 3.08ms | 8.41ms | 63.4% faster |
| ResNet50 v2 | 27.89ms | 28.02ms | 44.73ms | 37.4% faster |
| MobileNetV2 | 5.16ms | 5.18ms | 6.79ms | 23.7% faster |
| EfficientNet-Lite4 | 18.53ms | 18.68ms | 23.84ms | 21.6% faster |
The ORT Python quantizer produces a SqueezeNet model that is slower than FP32 (8.41ms INT8 vs 6.19ms FP32), a symptom of broken Conv-ReLU fusion. Kenosis eliminates this regression entirely, delivering a 2.01× speedup over FP32 in both PT and PC modes.
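The speedup and "faster" figures in these tables can be reproduced with simple arithmetic. The sketch below assumes the speedup is the ratio of latencies and the advantage is the relative latency reduction. These formulas are an assumption, not taken from the Kenosis source, and the function names are illustrative.

```rust
/// Speedup of the fast model over the slow one, as a ratio of latencies.
fn speedup(slow_ms: f64, fast_ms: f64) -> f64 {
    slow_ms / fast_ms
}

/// Relative latency reduction of the fast model, in percent.
fn percent_faster(slow_ms: f64, fast_ms: f64) -> f64 {
    (slow_ms - fast_ms) / slow_ms * 100.0
}

fn main() {
    // SqueezeNet 1.1 latencies quoted above, in milliseconds.
    let (fp32, ort_int8, kenosis_int8) = (6.19, 8.41, 3.08);
    println!("Kenosis vs FP32: {:.2}x", speedup(fp32, kenosis_int8));
    println!("Kenosis vs ORT:  {:.1}% faster", percent_faster(ort_int8, kenosis_int8));
    println!("ORT vs FP32:     {:.2}x", speedup(fp32, ort_int8));
}
```

A ratio below 1.0 on the last line corresponds to the regression described above, where the quantized model is slower than its FP32 baseline.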
Real-World Accuracy (Kenosis vs ORT Python)
Evaluated on 500 ImageNet-1K images (classifiers) and 500 MS COCO val2017 images (detector):
| Model | Dataset | Kenosis (PT) Cos | Kenosis (PC) Cos | ORT Cos | Kenosis (PT) Agree | Kenosis (PC) Agree | ORT Agree |
|---|---|---|---|---|---|---|---|
| ResNet50 v2 | ImageNet-1K (500) | 0.980 | 0.988 | 0.974 | 94.8% | 95.6% | 51.4% |
| MobileNetV2 | ImageNet-1K (500) | 0.970 | 0.974 | 0.951 | 92.6% | 92.0% | 7.6% |
| EfficientNet-Lite4 | ImageNet-1K (500) | 0.366 | 0.382 | 0.366 | 17.6% | 19.2% | 23.0% |
| PP-YOLOE+ Small | COCO val2017 (500) | 0.833 | 0.850 | 0.817 | 99.8% | 99.8% | 98.4% |
ORT's MobileNetV2 achieves only 7.6% prediction agreement with FP32, while Kenosis maintains 92.0–95.6% agreement across the standard classifiers (ResNet50 v2 and MobileNetV2).
Key Features
| Feature | Kenosis | ORT Python |
|---|---|---|
| Static INT8 with ReLU-aware QDQ | ✅ | ❌ |
| Detection model mixed-precision | ✅ | ❌ |
| Non-vision tensor protection | ✅ | ❌ |
| Multi-input model calibration | ✅ | ❌ |
| Transformer & MatMul quantization | ✅ | ❌ |
| NLP synthetic calibration data | ✅ | ❌ |
| INT32 bias quantization with DequantizeLinear | ✅ | ✅ |
| Per-channel weight quantization | ✅ | ✅ |
| Built-in validation + benchmarking | ✅ | ❌ |
| PaddlePaddle Constant extraction | ✅ | ❌ |
| Zero Python dependency | ✅ | ❌ |
| Cross-platform single binary | ✅ | ❌ |
Install
Or build from source:
Usage
Static INT8 Quantization (default)
The primary quantization mode. Produces QDQ-format models that run on stock ONNX Runtime with full INT8 acceleration.
# Standard vision model (SqueezeNet, ResNet, EfficientNet, etc.)
# Per-tensor weights (one scale per tensor, override the default per-channel mode)
# PaddlePaddle models (PP-YOLOE+, PP-LCNet, etc.)
# Custom calibration sample count
# External calibration data (raw f32 binary files)
Validate Quantized Models
Compare a quantized model against its FP32 baseline — measures cosine similarity, Top-1 agreement, and latency side-by-side.
# Basic validation (default samples, 200 timed runs)
# Custom sample counts
Output:
════════════════════════════════════════════════════════
📊 Kenosis Validation Report
════════════════════════════════════════════════════════
▸ Cosine similarity: 0.999128 (min 0.9986)
▸ Top-1 agreement: 83/100 (83%)
▸ Latency: 2.82ms vs 6.03ms (2.13× speedup)
▸ Size: 1.24 MB vs 4.73 MB (3.8× smaller)
▸ Verdict: EXCELLENT — production ready
════════════════════════════════════════════════════════
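For readers unfamiliar with the two fidelity metrics in the report, here is a minimal sketch of how cosine similarity and Top-1 agreement are commonly computed from paired model outputs. It is not the Kenosis implementation; the function names and the sample logits are illustrative assumptions.

```rust
/// Cosine similarity between two equally sized output vectors, accumulated in f64.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0; // undefined for zero vectors; report 0.0 by convention here
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Index of the largest logit, i.e. the Top-1 prediction.
fn argmax(v: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in v.iter().enumerate() {
        if x > v[best] {
            best = i;
        }
    }
    best
}

/// Fraction of samples where both models pick the same Top-1 class.
fn top1_agreement(fp32: &[Vec<f32>], int8: &[Vec<f32>]) -> f64 {
    if fp32.is_empty() {
        return 0.0;
    }
    let matches = fp32
        .iter()
        .zip(int8)
        .filter(|(a, b)| argmax(a.as_slice()) == argmax(b.as_slice()))
        .count();
    matches as f64 / fp32.len() as f64
}

fn main() {
    // Two made-up samples with three classes each.
    let fp32 = vec![vec![0.1f32, 2.0, 0.3], vec![1.5, 0.2, 0.1]];
    let int8 = vec![vec![0.2f32, 1.9, 0.3], vec![0.4, 0.6, 0.1]];
    for (a, b) in fp32.iter().zip(&int8) {
        println!("cosine = {:.6}", cosine_similarity(a, b));
    }
    println!("top-1 agreement = {:.0}%", top1_agreement(&fp32, &int8) * 100.0);
}
```

The two metrics can disagree: a high cosine only says the output vectors point in a similar direction, while Top-1 agreement flips whenever the two largest logits swap order.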
Inspect a Model
# Basic stats — ops, params, size, data types, largest tensors
Utility Commands
# Cast to FP16/BF16
# Compare two models
How Static INT8 Works
Kenosis's static INT8 pipeline applies seven coordinated optimizations:
- Self-calibration: Automatically generates synthetic calibration inputs and runs them through the model via ONNX Runtime to collect per-tensor activation ranges. No external calibration data required. Multi-input models and NLP inputs (token IDs, attention masks) are handled automatically.
- Weight quantization: INT8 symmetric per-tensor or per-channel. All scale computations in f64 to match ORT's internal precision.
- INT32 bias quantization: `scale = activation_scale × weight_scale`, `zero_point = 0`. Wrapped with DequantizeLinear for ORT kernel fusion.
- Zero-point nudged activation quantization: UINT8 asymmetric with post-hoc range adjustment ensuring `float 0.0` maps exactly to the quantized zero. Prevents rounding asymmetry from compounding across layers.
- Fusion-aware QDQ placement: ORT's Python quantizer places QDQ nodes on every Conv/MatMul output independently. Kenosis detects `Conv/MatMul → Activation` pairs (ReLU, LeakyRelu, Clip, HardSwish, Sigmoid) at graph level and places QDQ after the activation instead. This gives ORT's runtime optimizer a cleaner pattern that fuses into a single INT8 kernel. Combined with second-pass wrapping of Add, Concat, MaxPool, and AveragePool, this maximizes QLinear fusions.
- Non-vision tensor protection: For multi-input models (detection, segmentation), tensors reachable from non-primary inputs (scale_factor, image_shape) are traced through the graph and excluded from quantization. This prevents metadata paths from being crushed by INT8 range limits.
- Model output protection: Tensors that are direct model outputs are never QDQ-wrapped, preserving full FP32 precision in detection head scores and bounding box coordinates.
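The quantization arithmetic behind the weight, bias, and activation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `static_int8.rs`: the function names are hypothetical, the symmetric range is assumed to be [-127, 127], and the nudge shown (widen the range to include 0.0, then round the zero point to an integer) is one common technique that may differ from the post-hoc adjustment Kenosis applies.

```rust
/// Symmetric INT8 scale for one tensor or one channel (zero_point is 0).
/// Computed in f64, as the pipeline description suggests.
fn symmetric_scale(values: &[f32]) -> f64 {
    let max_abs = values.iter().fold(0.0f64, |m, &v| m.max((v as f64).abs()));
    if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 }
}

/// Quantize one weight to INT8 with a symmetric scale.
fn quantize_i8(v: f32, scale: f64) -> i8 {
    (v as f64 / scale).round().clamp(-127.0, 127.0) as i8
}

/// Bias is stored as INT32 with scale = activation_scale * weight_scale and zero_point = 0.
fn quantize_bias(b: f32, activation_scale: f64, weight_scale: f64) -> i32 {
    (b as f64 / (activation_scale * weight_scale)).round() as i32
}

/// Asymmetric UINT8 parameters for an observed activation range.
/// The range is widened to include 0.0 and the zero point is rounded to an
/// integer, so float 0.0 quantizes to exactly `zero_point`.
fn nudged_u8_params(min: f64, max: f64) -> (f64, u8) {
    let (min, max) = (min.min(0.0), max.max(0.0));
    if max == min {
        return (1.0, 0); // degenerate all-zero range
    }
    let scale = (max - min) / 255.0;
    let zero_point = (-min / scale).round().clamp(0.0, 255.0) as u8;
    (scale, zero_point)
}

fn main() {
    // Per-channel mode: one symmetric scale per output channel (toy 2-channel weight).
    let weights = vec![vec![-0.5f32, 0.25, 1.27], vec![0.02, -0.04, 0.01]];
    for (c, ch) in weights.iter().enumerate() {
        let s = symmetric_scale(ch);
        let q: Vec<i8> = ch.iter().map(|&w| quantize_i8(w, s)).collect();
        println!("channel {c}: scale={s:.6} q={q:?}");
    }
    // Activation range collected during calibration (made-up values).
    let (act_scale, zp) = nudged_u8_params(-1.0, 3.0);
    println!("activation: scale={act_scale:.6} zero_point={zp}");
    let w_scale = symmetric_scale(&weights[0]);
    println!("bias q = {}", quantize_bias(0.5, act_scale, w_scale));
}
```

Per-tensor mode would call `symmetric_scale` once over all weights instead of once per channel. The second channel in the toy weight shows why per-channel scales can help: its values are far smaller than the first channel's, so a shared scale would leave it with only a few usable quantization levels.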
Detection Model Support
Kenosis handles the specific challenges of quantizing object detection models:
- Multi-input calibration — Auto-generates appropriate default values for secondary inputs (scale_factor → 1.0, shape tensors → 0.0)
- PaddlePaddle weight handling — Extracts inline Constant nodes, deduplicates shared weights (deepcopy tensors), and upgrades opset attributes (Squeeze, Unsqueeze, BatchNorm, Dropout)
- Mixed-precision detection head — Backbone and neck are fully INT8; detection head outputs and metadata paths stay FP32
- Scale factor preservation — The bounding box rescaling path remains live and dynamic, not frozen to calibration values
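Keeping metadata paths in FP32 amounts to a reachability trace from the secondary inputs. The sketch below shows the idea on a toy graph. The `Node` type, the tensor names, and the breadth-first traversal are illustrative assumptions and do not reflect the actual data structures in `kenosis-core`.

```rust
use std::collections::{HashSet, VecDeque};

/// A toy graph node that consumes some tensors and produces others.
struct Node {
    inputs: Vec<&'static str>,
    outputs: Vec<&'static str>,
}

/// Collect every tensor reachable from the given secondary graph inputs.
/// Tensors in the returned set would be excluded from QDQ wrapping.
fn protected_tensors(nodes: &[Node], secondary_inputs: &[&'static str]) -> HashSet<&'static str> {
    let mut seen: HashSet<&'static str> = secondary_inputs.iter().copied().collect();
    let mut queue: VecDeque<&'static str> = secondary_inputs.iter().copied().collect();
    while let Some(tensor) = queue.pop_front() {
        // Visit every node that consumes this tensor and mark its outputs.
        for node in nodes.iter().filter(|n| n.inputs.contains(&tensor)) {
            for &out in &node.outputs {
                if seen.insert(out) {
                    queue.push_back(out);
                }
            }
        }
    }
    seen
}

fn main() {
    // image -> Conv -> feat -> Head -> boxes; boxes / scale_factor -> scaled_boxes
    let graph = vec![
        Node { inputs: vec!["image"], outputs: vec!["feat"] },
        Node { inputs: vec!["feat"], outputs: vec!["boxes"] },
        Node { inputs: vec!["boxes", "scale_factor"], outputs: vec!["scaled_boxes"] },
    ];
    let protected = protected_tensors(&graph, &["scale_factor"]);
    let mut names: Vec<_> = protected.into_iter().collect();
    names.sort();
    println!("kept in FP32: {names:?}");
}
```

In this toy graph the backbone tensor `feat` stays eligible for INT8, while `scale_factor` and everything downstream of it stay in floating point, so the rescaling path keeps responding to the runtime input.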
Architecture
kenosis/
├── crates/
│ └── kenosis-core/ # Library: quantization engine
│ └── src/
│ ├── model.rs # OnnxModel load/save/traversal + Constant extraction
│ ├── static_int8.rs # Static INT8 QDQ quantization pipeline
│ ├── inspect.rs # Stats and analysis
│ ├── cast.rs # FP16/BF16 casting
│ ├── diff.rs # Model comparison
│ ├── proto.rs # ONNX protobuf type definitions
│ └── error.rs # Error types
├── apps/
│ └── kenosis-cli/ # Binary: CLI interface
│ └── src/commands/
│ ├── quantize.rs # quantize command (static INT8)
│ ├── validate.rs # validate command (accuracy + latency)
│ ├── inspect.rs # inspect command
│ ├── cast.rs # cast command
│ └── diff.rs # diff command
License
Apache-2.0 — see LICENSE.