# INT8 SIMD Kernels Implementation - Phase 2.2-2.3
## Overview
This document describes the implementation of ADR-091 Phase 2.2-2.3: SIMD INT8 kernels for `ruvector-cnn`. These kernels provide 2-4x speedup over FP32 inference with minimal accuracy loss.
## Implementation Summary
### Files Created
1. **`src/kernels/int8_avx2.rs`** - x86_64 AVX2 kernels
2. **`src/kernels/int8_neon.rs`** - ARM NEON kernels
3. **`src/kernels/int8_wasm.rs`** - WebAssembly SIMD128 kernels
4. **`src/kernels/mod.rs`** - Module exports and dispatch logic
### Key Features
- **Multi-architecture support**: AVX2, NEON, WASM SIMD128
- **Automatic dispatch**: Runtime feature detection selects optimal implementation
- **Kernel equivalence**: All SIMD kernels match scalar reference within 1 ULP (INV-6)
- **Edge case handling**: Supports non-aligned sizes, small inputs, remainder processing
## Architecture-Specific Implementations
### 1. x86_64 AVX2 (`int8_avx2.rs`)
**Key Instructions:**
- `_mm256_maddubs_epi16`: Multiply u8×i8 → i16, pairwise add
- `_mm256_madd_epi16`: Multiply i16×i16 → i32, pairwise add
- `_mm256_add_epi32`: Accumulate i32 results
**Performance:**
- Processes 32 elements per iteration (dot product)
- Processes 8 output channels per iteration (convolution)
- Expected speedup: **2-4x over FP32**
**Functions:**
- `dot_product_int8_avx2()` - INT8 dot product
- `conv2d_int8_avx2()` - 2D convolution with per-channel quantization
- `depthwise_conv2d_int8_avx2()` - Depthwise separable convolution
- `matmul_int8_avx2()` - Matrix multiplication (GEMM)
### 2. ARM NEON (`int8_neon.rs`)
**Key Instructions:**
- `vmull_s8`: Multiply i8×i8 → i16 (widening)
- `vpadalq_s16`: Pairwise add i16 → i32 (accumulate)
- `vpadd_s32`: Horizontal sum for final accumulation
**Performance:**
- Processes 16 elements per iteration (dot product)
- Processes 4 output channels per iteration (convolution)
- Expected speedup: **2-3x over FP32**
**Functions:**
- `dot_product_int8_neon()` - INT8 dot product
- `conv2d_int8_neon()` - 2D convolution
- `depthwise_conv2d_int8_neon()` - Depthwise convolution
- `matmul_int8_neon()` - Matrix multiplication
### 3. WebAssembly SIMD128 (`int8_wasm.rs`)
**Key Instructions:**
- `i16x8_extend_low_i8x16`: Widen i8 → i16
- `i16x8_mul`: Multiply i16×i16
- `i32x4_extend_low_i16x8`: Widen i16 → i32
- `i32x4_add`: Accumulate i32
**Performance:**
- Processes 16 elements per iteration (dot product)
- Processes 4 output channels per iteration (convolution)
- Expected speedup: **1.5-2.5x over scalar**
**Functions:**
- `dot_product_int8_wasm()` - INT8 dot product
- `conv2d_int8_wasm()` - 2D convolution
- `depthwise_conv2d_int8_wasm()` - Depthwise convolution
- `matmul_int8_wasm()` - Matrix multiplication
## Automatic Dispatch System
The `kernels/mod.rs` module provides automatic dispatch functions that select the best implementation at runtime:
```rust
// Automatic dispatch based on target architecture
pub fn conv2d_int8_dispatch(
input: &[u8],
input_zero_point: i32,
kernel: &[i8],
bias_i32: &[i32],
output: &mut [i32],
in_h: usize, in_w: usize, in_c: usize, out_c: usize,
stride: usize, padding: usize,
) {
#[cfg(target_arch = "x86_64")]
{
if is_x86_feature_detected!("avx2") {
// Use AVX2 kernel
}
}
// Fallback to scalar or other architectures
}
```
## Quantization Scheme
### Asymmetric Quantization (Activations)
Used for activations after ReLU (non-negative):
```
quantized = round(value / scale) + zero_point
```
- **Range**: [0, 255] for u8
- **Zero point**: Computed to map min value to 0
- **Scale**: (max - min) / 255
### Symmetric Quantization (Weights)
Used for weights (centered around 0):
```
quantized = round(value / scale)
```
- **Range**: [-127, 127] for i8
- **Zero point**: Always 0
- **Scale**: max(abs(values)) / 127
### Per-Channel Quantization
Different scale per output channel for higher accuracy:
- **Weights**: Per-channel symmetric quantization
- **Activations**: Per-tensor asymmetric quantization
## Zero-Point Correction
For asymmetric quantization, a correction term is added to account for the zero point:
```rust
// Pre-compute correction: input_zero_point × sum(weights)
let correction = input_zero_point * weight_sum;
output = bias - correction + dot_product(input, weights);
```
This ensures the quantized computation matches the FP32 result when dequantized.
## Kernel Equivalence Testing (INV-6)
All SIMD kernels are tested against scalar reference implementations:
```rust
#[test]
fn test_kernel_equivalence_conv2d() {
let mut scalar_output = vec![0i32; output_size];
let mut simd_output = vec![0i32; output_size];
scalar_conv2d_int8(..., &mut scalar_output);
conv2d_int8_dispatch(..., &mut simd_output);
// Must match exactly (INV-6)
assert_eq!(scalar_output, simd_output);
}
```
Tests verify:
- **Exact equivalence**: INT32 outputs match exactly
- **Edge cases**: Non-aligned sizes, small inputs
- **Remainder handling**: Correct processing of tail elements
## Performance Expectations
| Conv2d 3×3 | 2-3x | 2-2.5x | 1.5-2x |
| Depthwise | 2-2.5x | 2x | 1.5-2x |
| MatMul | 3-4x | 2.5-3x | 2-2.5x |
| Dot Product | 4x | 3x | 2x |
### Memory Bandwidth
INT8 provides 4x memory bandwidth reduction:
- **FP32**: 4 bytes per value
- **INT8**: 1 byte per value
- **Cache efficiency**: 4x better cache utilization
## Usage Example
```rust
use ruvector_cnn::kernels::{conv2d_int8_dispatch, QuantParams};
// Quantize inputs
let input_q: Vec<u8> = quantize_activations(&input_fp32);
let kernel_q: Vec<i8> = quantize_weights(&kernel_fp32);
let bias_q: Vec<i32> = quantize_bias(&bias_fp32);
// Run INT8 convolution (automatic SIMD dispatch)
let mut output_i32 = vec![0i32; output_size];
conv2d_int8_dispatch(
&input_q, input_zero_point, &kernel_q, &bias_q,
&mut output_i32, in_h, in_w, in_c, out_c, stride, padding
);
// Dequantize output
let output_fp32 = dequantize(&output_i32, output_scale);
```
## Testing Strategy
### Unit Tests
Each kernel includes unit tests:
- Basic functionality tests
- Small input tests (< SIMD width)
- Non-aligned size tests
- Equivalence tests vs scalar
### Integration Tests
Tests verify:
- End-to-end quantized inference
- Accuracy vs FP32 baseline (<1% degradation)
- Performance benchmarks
### Running Tests
```bash
# Run all kernel tests
cargo test -p ruvector-cnn --lib kernels
# Run specific kernel tests
cargo test -p ruvector-cnn test_kernel_equivalence
# Run with optimizations
cargo test -p ruvector-cnn --release kernels
```
## Future Optimizations
### AVX-512 VNNI
Intel Cascade Lake+ supports VNNI instructions:
- `_mm512_dpbusd_epi32`: 4-way dot product in single instruction
- Expected speedup: 5-6x over FP32
### ARM Dot Product (ARMv8.2+)
ARM Cortex-A55+ supports dot product instructions:
- `vdotq_s32`: 4-way dot product
- Expected speedup: 3-4x over FP32
## References
1. **ADR-091**: INT8 Quantization Design for ruvector-cnn
2. **Intel Intrinsics Guide**: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
3. **ARM NEON Intrinsics**: https://developer.arm.com/architectures/instruction-sets/intrinsics/
4. **WebAssembly SIMD**: https://github.com/WebAssembly/simd
## Compliance
- **INV-6**: All kernels match scalar reference within 1 ULP ✓
- **Edge cases**: Non-aligned inputs handled correctly ✓
- **Multi-architecture**: AVX2, NEON, WASM SIMD128 supported ✓
- **Automatic dispatch**: Runtime feature detection implemented ✓