kenosis-cli-1.2.0 is not a library.

Kenosis

Production-grade ONNX model quantization. Zero Python. Single Native Binary.

Kenosis is a Rust CLI toolkit for quantizing, validating, inspecting, and comparing ONNX models. Its flagship feature is static INT8 quantization (enabled by default) with fusion-aware QDQ placement, which achieves up to 2.43× speedup over FP32 baselines and up to 63% lower latency than the ORT Python quantizer on stock ONNX Runtime, with no custom operators required. Evaluated on 1,000 real-world images (500 from ImageNet-1K and 500 from MS COCO val2017), Kenosis INT8 achieves 92.0–95.6% Top-1 prediction agreement with FP32 baselines on ResNet50 v2 and MobileNetV2.

How it works: Fusion-Aware QDQ Placement: Achieving Native Kernel Fusion in ONNX via Graph Reordering

Benchmark Results

Kenosis quantizes the PP-YOLOE+ object detection models, an anchor-free architecture designed for efficient edge deployment:

Model	Resolution	Cosine	Latency (INT8 vs FP32)	Speedup	INT8 Size
PP-YOLOE+ Small	320×320	0.998	23ms vs 44ms	1.89×	7.9 MB (3.9× smaller)
PP-YOLOE+ Small	416×416	0.998	43ms vs 77ms	1.80×	7.9 MB (3.9× smaller)
PP-YOLOE+ Small	640×640	0.999	111ms vs 187ms	1.68×	7.9 MB (3.8× smaller)

Classifier Benchmarks (Kenosis PT/PC vs FP32 Baseline)

Fidelity was evaluated on 500 images from the ImageNet-1K validation set with model-specific preprocessing. PT denotes per-tensor weight quantization and PC denotes per-channel weight quantization.

Model	Config	Cosine	Top-1 Agree	Latency	Speedup	INT8 Size
SqueezeNet 1.1	Kenosis (PT)	0.9206	0.0%	3.08ms	2.01×	1.24 MB (3.8× smaller)
SqueezeNet 1.1	Kenosis (PC)	0.9206	0.0%	3.08ms	2.01×	1.24 MB (3.8× smaller)
ResNet50 v2	Kenosis (PT)	0.9804	94.8%	27.89ms	2.43×	30.6 MB (3.2× smaller)
ResNet50 v2	Kenosis (PC)	0.9883	95.6%	28.02ms	2.41×	30.6 MB (3.2× smaller)
MobileNetV2	Kenosis (PT)	0.9700	92.6%	5.16ms	1.35×	7.10 MB (1.9× smaller)
MobileNetV2	Kenosis (PC)	0.9739	92.0%	5.18ms	1.35×	7.10 MB (1.9× smaller)
EfficientNet-Lite4	Kenosis (PT)	0.3658	17.6%	18.53ms	1.48×	16.5 MB (3.0× smaller)
EfficientNet-Lite4	Kenosis (PC)	0.3821	19.2%	18.68ms	1.47×	16.5 MB (3.0× smaller)

The SqueezeNet cosine similarity is measured under synthetic Gaussian calibration inputs. Its real-image Top-1 agreement is 0.0% because the ONNX Zoo SqueezeNet 1.1 model uses non-standard input preprocessing.

Direct Comparison (Kenosis PT/PC vs ORT Python Quantizer PT)

Model	Kenosis (PT) Latency	Kenosis (PC) Latency	ORT Latency	Kenosis Advantage (PC vs ORT)
SqueezeNet 1.1	3.08ms	3.08ms	8.41ms	63.4% faster
ResNet50 v2	27.89ms	28.02ms	44.73ms	37.4% faster
MobileNetV2	5.16ms	5.18ms	6.79ms	23.7% faster
EfficientNet-Lite4	18.53ms	18.68ms	23.84ms	21.6% faster

The ORT Python quantizer produces a SqueezeNet model that is slower than FP32 (8.41ms INT8 vs 6.19ms FP32) — broken Conv-ReLU fusion in action. Kenosis eliminates this regression entirely, delivering a 2.01× speedup over FP32 in both PT and PC modes.
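The figures quoted for SqueezeNet can be reproduced from the three latencies with simple arithmetic. The sketch below assumes that "speedup" means the ratio of baseline latency to INT8 latency and that "% faster" means relative latency reduction; both readings match the published numbers, but the function names are illustrative and not part of Kenosis.

```rust
/// Speedup of `fast_ms` relative to `slow_ms` (2.0 means twice as fast).
fn speedup(slow_ms: f64, fast_ms: f64) -> f64 {
    slow_ms / fast_ms
}

/// Relative latency reduction in percent: how much shorter `fast_ms` is than `slow_ms`.
fn percent_faster(slow_ms: f64, fast_ms: f64) -> f64 {
    (slow_ms - fast_ms) / slow_ms * 100.0
}

fn main() {
    // SqueezeNet 1.1 latencies quoted above, in milliseconds.
    let (fp32, ort_int8, kenosis_int8) = (6.19, 8.41, 3.08);

    // Kenosis INT8 against the FP32 baseline.
    println!("Kenosis vs FP32: {:.2}x", speedup(fp32, kenosis_int8)); // 2.01x
    // Kenosis INT8 against the ORT Python quantizer output.
    println!("Kenosis vs ORT:  {:.1}% faster", percent_faster(ort_int8, kenosis_int8)); // 63.4% faster
    // A ratio below 1.0 means the ORT INT8 model is slower than FP32.
    println!("ORT vs FP32:     {:.2}x", speedup(fp32, ort_int8)); // 0.74x
}
```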

Real-World Accuracy (Kenosis vs ORT Python)

Evaluated on 500 ImageNet-1K images (classifiers) and 500 MS COCO val2017 images (detector):

Model	Dataset	Kenosis (PT) Cos	Kenosis (PC) Cos	ORT Cos	Kenosis (PT) Agree	Kenosis (PC) Agree	ORT Agree
ResNet50 v2	ImageNet-1K (500)	0.980	0.988	0.974	94.8%	95.6%	51.4%
MobileNetV2	ImageNet-1K (500)	0.970	0.974	0.951	92.6%	92.0%	7.6%
EfficientNet-Lite4	ImageNet-1K (500)	0.366	0.382	0.366	17.6%	19.2%	23.0%

Model	Dataset	Kenosis (PT) Cos	Kenosis (PC) Cos	ORT Cos	Kenosis (PT) Agree	Kenosis (PC) Agree	ORT Agree
PP-YOLOE+ Small	COCO val2017 (500)	0.833	0.850	0.817	99.8%	99.8%	98.4%

ORT's MobileNetV2 achieves only 7.6% prediction agreement with FP32. Kenosis maintains 92.0–95.6% agreement on ResNet50 v2 and MobileNetV2; EfficientNet-Lite4 agreement remains low under both quantizers (17.6–23.0%).

Key Features

Feature	Kenosis	ORT Python
Static INT8 with ReLU-aware QDQ	✅	❌
Detection model mixed-precision	✅	❌
Non-vision tensor protection	✅	❌
Multi-input model calibration	✅	❌
Transformer & MatMul quantization	✅	❌
NLP synthetic calibration data	✅	❌
INT32 bias quantization w/ DequantizeLinear	✅	✅
Per-channel weight quantization	✅	✅
Built-in validation + benchmarking	✅	❌
PaddlePaddle Constant extraction	✅	❌
Zero Python dependency	✅	❌
Cross-platform single binary	✅	❌

Install

cargo install kenosis-cli

Or build from source:

git clone https://github.com/CoreEpoch/kenosis.git
cd kenosis
cargo build --release

Usage

Static INT8 Quantization (default)

The primary quantization mode. Produces QDQ-format models that run on stock ONNX Runtime with full INT8 acceleration.

# Standard vision model (SqueezeNet, ResNet, EfficientNet, etc.)
kenosis quantize model.onnx -o model_int8.onnx

# Per-tensor weights (one scale per tensor, override the default per-channel mode)
kenosis quantize model.onnx -o model_int8.onnx --per-tensor

# PaddlePaddle models (PP-YOLOE+, PP-LCNet, etc.)
kenosis quantize ppyoloe.onnx -o ppyoloe_int8.onnx --extract-constants

# Custom calibration sample count
kenosis quantize model.onnx -o model_int8.onnx --n-calib 40

# External calibration data (raw f32 binary files)
kenosis quantize model.onnx -o model_int8.onnx --calib-dir ./calib_data/
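The difference between the default per-channel mode and `--per-tensor` is easiest to see on a small example. The following is a minimal sketch of symmetric INT8 weight quantization, not Kenosis's actual code: mapping the largest magnitude to 127, clamping to [-127, 127], and the row-major "one row per output channel" layout are all assumptions made for illustration.

```rust
/// Symmetric INT8 scale for a slice of weights: the largest magnitude maps to 127.
/// The arithmetic is done in f64 (assumed here to mirror the f64 scale math described for Kenosis).
fn symmetric_scale(weights: &[f32]) -> f64 {
    let max_abs = weights.iter().fold(0.0_f64, |m, &w| m.max((w as f64).abs()));
    if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 }
}

/// Quantize with a given scale: round to nearest, then clamp to [-127, 127].
fn quantize(weights: &[f32], scale: f64) -> Vec<i8> {
    weights
        .iter()
        .map(|&w| (w as f64 / scale).round().clamp(-127.0, 127.0) as i8)
        .collect()
}

/// Per-channel mode: one scale per output channel, each channel holding `row_len` weights.
fn per_channel_scales(weights: &[f32], row_len: usize) -> Vec<f64> {
    weights.chunks(row_len).map(symmetric_scale).collect()
}

fn main() {
    // Two output channels with very different magnitudes (hypothetical values).
    let w = [0.6_f32, -1.0, 0.25, 0.015, -0.02, 0.004];
    let per_tensor = symmetric_scale(&w);
    let per_channel = per_channel_scales(&w, 3);

    println!("per-tensor scale:   {:.6}", per_tensor);
    println!("per-channel scales: {:?}", per_channel);
    // The small-magnitude channel collapses to a few levels with a shared scale...
    println!("channel 1, per-tensor:  {:?}", quantize(&w[3..], per_tensor)); // [2, -3, 1]
    // ...but uses the full INT8 range when it has its own scale.
    println!("channel 1, per-channel: {:?}", quantize(&w[3..], per_channel[1])); // [95, -127, 25]
}
```

With a single shared scale, the second channel is squeezed into three small integers; with its own scale it spans the full range, which is the usual motivation for per-channel weights.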

Validate Quantized Models

Compare a quantized model against its FP32 baseline. The command reports cosine similarity, Top-1 agreement, latency, and model size side by side.

# Basic validation (default samples, 200 timed runs)
kenosis validate model.onnx model_int8.onnx

# Custom sample counts
kenosis validate model.onnx model_int8.onnx -n 500 --timed 500

Output:

  ════════════════════════════════════════════════════════
  📊  Kenosis Validation Report
  ════════════════════════════════════════════════════════
  ▸ Cosine similarity:  0.999128 (min 0.9986)
  ▸ Top-1 agreement:    83/100 (83%)
  ▸ Latency:            2.82ms vs 6.03ms (2.13× speedup)
  ▸ Size:               1.24 MB vs 4.73 MB (3.8× smaller)
  ▸ Verdict:             EXCELLENT — production ready
  ════════════════════════════════════════════════════════
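The two fidelity metrics in the report are standard and can be computed from raw model outputs. The sketch below shows one way to do so, assuming each model output is a flat vector of logits per sample; the function names are hypothetical and this is not the code behind `kenosis validate`.

```rust
/// Cosine similarity between two equally sized output vectors, accumulated in f64.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "outputs must have the same length");
    let (mut dot, mut na, mut nb) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    // Define the similarity as 0.0 when either vector is all zeros.
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na.sqrt() * nb.sqrt()) }
}

/// Index of the largest value (the Top-1 prediction); None for an empty slice.
fn argmax(v: &[f32]) -> Option<usize> {
    v.iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Fraction of samples on which both models pick the same Top-1 class.
fn top1_agreement(fp32: &[Vec<f32>], int8: &[Vec<f32>]) -> f64 {
    if fp32.is_empty() {
        return 0.0;
    }
    let agree = fp32
        .iter()
        .zip(int8)
        .filter(|(a, b)| argmax(a) == argmax(b))
        .count();
    agree as f64 / fp32.len() as f64
}

fn main() {
    // Hypothetical logits for two samples from each model.
    let fp32 = vec![vec![0.1_f32, 2.0, 0.3], vec![1.5, 0.2, 0.1]];
    let int8 = vec![vec![0.2_f32, 1.9, 0.3], vec![0.4, 0.5, 0.1]];

    for (a, b) in fp32.iter().zip(&int8) {
        println!("cosine: {:.4}", cosine_similarity(a, b));
    }
    // The models agree on the first sample only.
    println!("top-1 agreement: {:.0}%", top1_agreement(&fp32, &int8) * 100.0); // 50%
}
```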

Inspect a Model

# Basic stats — ops, params, size, data types, largest tensors
kenosis inspect model.onnx

Utility Commands

# Cast to FP16/BF16
kenosis cast model.onnx -o model_fp16.onnx --precision fp16

# Compare two models
kenosis diff model.onnx model_int8.onnx

How Static INT8 Works

Kenosis's static INT8 pipeline applies seven coordinated optimizations:

Self-calibration — Automatically generates synthetic calibration inputs and runs them through the model via ONNX Runtime to collect per-tensor activation ranges. No external calibration data required. Multi-input models and NLP inputs (token IDs, attention masks) are handled automatically.
Weight quantization — INT8 symmetric per-tensor or per-channel. All scale computations in f64 to match ORT's internal precision.
INT32 bias quantization — scale = activation_scale × weight_scale, zero_point = 0. Wrapped with DequantizeLinear for ORT kernel fusion.
Zero-point nudged activation quantization — UINT8 asymmetric with post-hoc range adjustment ensuring float 0.0 maps exactly to the quantized zero. Prevents rounding asymmetry from compounding across layers.
Fusion-aware QDQ placement — ORT's Python quantizer places QDQ nodes on every Conv/MatMul output independently. Kenosis detects Conv/MatMul → Activation pairs (ReLU, LeakyRelu, Clip, HardSwish, Sigmoid) at graph level and places QDQ after the activation instead. This gives ORT's runtime optimizer a cleaner pattern that fuses into a single INT8 kernel. Combined with second-pass wrapping of Add, Concat, MaxPool, and AveragePool, this maximizes QLinear fusions.
Non-vision tensor protection — For multi-input models (detection, segmentation), tensors reachable from non-primary inputs (scale_factor, image_shape) are traced through the graph and excluded from quantization. This prevents metadata paths from being crushed by INT8 range limits.
Model output protection — Tensors that are direct model outputs are never QDQ-wrapped, preserving full FP32 precision in detection head scores and bounding box coordinates.
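The fusion-aware placement and model output protection steps above can be illustrated on a toy graph. This is a simplified sketch under stated assumptions, not the Kenosis implementation: the `Node` struct is a hypothetical stand-in for an ONNX node, weight inputs are omitted, and only the "single consumer is a fusable activation" case is handled.

```rust
use std::collections::HashSet;

/// Minimal stand-in for an ONNX node: an op type plus named input/output tensors.
struct Node {
    op_type: &'static str,
    inputs: Vec<&'static str>,
    outputs: Vec<&'static str>,
}

/// ONNX op names for the activations listed above as fusable with Conv/MatMul.
const FUSABLE: [&str; 5] = ["Relu", "LeakyRelu", "Clip", "HardSwish", "Sigmoid"];

/// For each Conv/MatMul, pick the tensor that should receive the Q/DQ pair:
/// the activation output when the node feeds exactly one fusable activation,
/// otherwise the Conv/MatMul output itself. Graph outputs are skipped so they stay FP32.
fn qdq_sites(nodes: &[Node], graph_outputs: &[&str]) -> Vec<&'static str> {
    let protected: HashSet<&str> = graph_outputs.iter().copied().collect();
    let mut sites = Vec::new();
    for node in nodes.iter().filter(|n| matches!(n.op_type, "Conv" | "MatMul")) {
        let produced = node.outputs[0];
        // All nodes that read the Conv/MatMul output.
        let consumers: Vec<&Node> =
            nodes.iter().filter(|n| n.inputs.contains(&produced)).collect();
        let site = match consumers.as_slice() {
            [only] if FUSABLE.contains(&only.op_type) => only.outputs[0],
            _ => produced,
        };
        if !protected.contains(site) {
            sites.push(site);
        }
    }
    sites
}

/// Conv -> Relu -> Conv -> Add -> Conv, with the last Conv producing the graph output.
fn demo_graph() -> Vec<Node> {
    vec![
        Node { op_type: "Conv", inputs: vec!["x"], outputs: vec!["c1"] },
        Node { op_type: "Relu", inputs: vec!["c1"], outputs: vec!["r1"] },
        Node { op_type: "Conv", inputs: vec!["r1"], outputs: vec!["c2"] },
        Node { op_type: "Add", inputs: vec!["c2", "r1"], outputs: vec!["a1"] },
        Node { op_type: "Conv", inputs: vec!["a1"], outputs: vec!["logits"] },
    ]
}

fn main() {
    let sites = qdq_sites(&demo_graph(), &["logits"]);
    // The first Conv defers to its Relu output; the last Conv's output is protected.
    println!("{:?}", sites); // ["r1", "c2"]
}
```

In this toy graph the first Conv's Q/DQ pair lands after the Relu (`r1`) rather than on `c1`, the second Conv keeps its own output because its consumer is an Add, and `logits` is left untouched as a model output.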

Detection Model Support

Kenosis handles the specific challenges of quantizing object detection models:

Multi-input calibration — Auto-generates appropriate default values for secondary inputs (scale_factor → 1.0, shape tensors → 0.0)
PaddlePaddle weight handling — Extracts inline Constant nodes, deduplicates shared weights (deepcopy tensors), and upgrades opset attributes (Squeeze, Unsqueeze, BatchNorm, Dropout)
Mixed-precision detection head — Backbone and neck are fully INT8; detection head outputs and metadata paths stay FP32
Scale factor preservation — The bounding box rescaling path remains live and dynamic, not frozen to calibration values
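Keeping metadata paths in FP32 amounts to a reachability query on the graph: start from the secondary inputs and mark every tensor downstream of them. The sketch below shows that idea with a breadth-first traversal; the `Node` struct and tensor names are hypothetical, and the real tracing in Kenosis may differ in detail.

```rust
use std::collections::{HashSet, VecDeque};

/// Minimal stand-in for an ONNX node: named input and output tensors only.
struct Node {
    inputs: Vec<&'static str>,
    outputs: Vec<&'static str>,
}

/// Collect every tensor reachable from the seed tensors by following
/// node input -> output edges breadth-first. The seeds themselves are included.
fn reachable_from(nodes: &[Node], seeds: &[&'static str]) -> HashSet<&'static str> {
    let mut seen: HashSet<&'static str> = seeds.iter().copied().collect();
    let mut queue: VecDeque<&'static str> = seeds.iter().copied().collect();
    while let Some(tensor) = queue.pop_front() {
        // Every output of a node that consumes a marked tensor is marked too.
        for node in nodes.iter().filter(|n| n.inputs.contains(&tensor)) {
            for &out in &node.outputs {
                if seen.insert(out) {
                    queue.push_back(out);
                }
            }
        }
    }
    seen
}

/// image -> feat -> boxes_raw, then boxes_raw is rescaled by scale_factor into boxes.
fn demo_graph() -> Vec<Node> {
    vec![
        Node { inputs: vec!["image"], outputs: vec!["feat"] },
        Node { inputs: vec!["feat"], outputs: vec!["boxes_raw"] },
        Node { inputs: vec!["boxes_raw", "scale_factor"], outputs: vec!["boxes"] },
    ]
}

fn main() {
    // Tensors downstream of the secondary input are excluded from quantization.
    let excluded = reachable_from(&demo_graph(), &["scale_factor"]);
    let mut names: Vec<_> = excluded.into_iter().collect();
    names.sort();
    println!("{:?}", names); // ["boxes", "scale_factor"]
}
```

Here `feat` and `boxes_raw` depend only on the image input and remain candidates for INT8, while `boxes` is downstream of `scale_factor` and stays in FP32.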

Architecture

kenosis/
├── crates/
│   └── kenosis-core/           # Library: quantization engine
│       └── src/
│           ├── model.rs        # OnnxModel load/save/traversal + Constant extraction
│           ├── static_int8.rs  # Static INT8 QDQ quantization pipeline
│           ├── inspect.rs      # Stats and analysis
│           ├── cast.rs         # FP16/BF16 casting
│           ├── diff.rs         # Model comparison
│           ├── proto.rs        # ONNX protobuf type definitions
│           └── error.rs        # Error types
├── apps/
│   └── kenosis-cli/            # Binary: CLI interface
│       └── src/commands/
│           ├── quantize.rs     # quantize command (static INT8)
│           ├── validate.rs     # validate command (accuracy + latency)
│           ├── inspect.rs      # inspect command
│           ├── cast.rs         # cast command
│           └── diff.rs         # diff command

License

Apache-2.0 — see LICENSE.
