ruvector-cnn 2.0.6

# INT8 SIMD Kernels Implementation - Phase 2.2-2.3

## Overview

This document describes the implementation of ADR-091 Phase 2.2-2.3: SIMD INT8 kernels for `ruvector-cnn`. These kernels provide 2-4x speedup over FP32 inference with minimal accuracy loss.

## Implementation Summary

### Files Created

1. **`src/kernels/int8_avx2.rs`** - x86_64 AVX2 kernels
2. **`src/kernels/int8_neon.rs`** - ARM NEON kernels
3. **`src/kernels/int8_wasm.rs`** - WebAssembly SIMD128 kernels
4. **`src/kernels/mod.rs`** - Module exports and dispatch logic

### Key Features

- **Multi-architecture support**: AVX2, NEON, WASM SIMD128
- **Automatic dispatch**: Runtime feature detection selects optimal implementation
- **Kernel equivalence**: All SIMD kernels match scalar reference within 1 ULP (INV-6)
- **Edge case handling**: Supports non-aligned sizes, small inputs, remainder processing

## Architecture-Specific Implementations

### 1. x86_64 AVX2 (`int8_avx2.rs`)

**Key Instructions:**
- `_mm256_maddubs_epi16`: Multiply u8×i8 → i16, pairwise add
- `_mm256_madd_epi16`: Multiply i16×i16 → i32, pairwise add
- `_mm256_add_epi32`: Accumulate i32 results

**Performance:**
- Processes 32 elements per iteration (dot product)
- Processes 8 output channels per iteration (convolution)
- Expected speedup: **2-4x over FP32**

**Functions:**
- `dot_product_int8_avx2()` - INT8 dot product
- `conv2d_int8_avx2()` - 2D convolution with per-channel quantization
- `depthwise_conv2d_int8_avx2()` - Depthwise separable convolution
- `matmul_int8_avx2()` - Matrix multiplication (GEMM)

### 2. ARM NEON (`int8_neon.rs`)

**Key Instructions:**
- `vmull_s8`: Multiply i8×i8 → i16 (widening)
- `vpadalq_s16`: Pairwise add i16 → i32 (accumulate)
- `vpadd_s32`: Horizontal sum for final accumulation

**Performance:**
- Processes 16 elements per iteration (dot product)
- Processes 4 output channels per iteration (convolution)
- Expected speedup: **2-3x over FP32**

**Functions:**
- `dot_product_int8_neon()` - INT8 dot product
- `conv2d_int8_neon()` - 2D convolution
- `depthwise_conv2d_int8_neon()` - Depthwise convolution
- `matmul_int8_neon()` - Matrix multiplication

### 3. WebAssembly SIMD128 (`int8_wasm.rs`)

**Key Instructions:**
- `i16x8_extend_low_i8x16`: Widen i8 → i16
- `i16x8_mul`: Multiply i16×i16
- `i32x4_extend_low_i16x8`: Widen i16 → i32
- `i32x4_add`: Accumulate i32

**Performance:**
- Processes 16 elements per iteration (dot product)
- Processes 4 output channels per iteration (convolution)
- Expected speedup: **1.5-2.5x over scalar**

**Functions:**
- `dot_product_int8_wasm()` - INT8 dot product
- `conv2d_int8_wasm()` - 2D convolution
- `depthwise_conv2d_int8_wasm()` - Depthwise convolution
- `matmul_int8_wasm()` - Matrix multiplication

## Automatic Dispatch System

The `kernels/mod.rs` module provides automatic dispatch functions that select the best implementation at runtime:

```rust
// Automatic dispatch based on target architecture
pub fn conv2d_int8_dispatch(
    input: &[u8],
    input_zero_point: i32,
    kernel: &[i8],
    bias_i32: &[i32],
    output: &mut [i32],
    in_h: usize, in_w: usize, in_c: usize, out_c: usize,
    stride: usize, padding: usize,
) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Use AVX2 kernel
        }
    }
    // Fallback to scalar or other architectures
}
```

## Quantization Scheme

### Asymmetric Quantization (Activations)

Used for activations after ReLU (non-negative):

```
quantized = round(value / scale) + zero_point
```

- **Range**: [0, 255] for u8
- **Zero point**: Computed to map min value to 0
- **Scale**: (max - min) / 255

### Symmetric Quantization (Weights)

Used for weights (centered around 0):

```
quantized = round(value / scale)
```

- **Range**: [-127, 127] for i8
- **Zero point**: Always 0
- **Scale**: max(abs(values)) / 127

### Per-Channel Quantization

Different scale per output channel for higher accuracy:
- **Weights**: Per-channel symmetric quantization
- **Activations**: Per-tensor asymmetric quantization

## Zero-Point Correction

For asymmetric quantization, a correction term is added to account for the zero point:

```rust
// Pre-compute correction: input_zero_point × sum(weights)
let correction = input_zero_point * weight_sum;
output = bias - correction + dot_product(input, weights);
```

This ensures the quantized computation matches the FP32 result when dequantized.

## Kernel Equivalence Testing (INV-6)

All SIMD kernels are tested against scalar reference implementations:

```rust
#[test]
fn test_kernel_equivalence_conv2d() {
    let mut scalar_output = vec![0i32; output_size];
    let mut simd_output = vec![0i32; output_size];

    scalar_conv2d_int8(..., &mut scalar_output);
    conv2d_int8_dispatch(..., &mut simd_output);

    // Must match exactly (INV-6)
    assert_eq!(scalar_output, simd_output);
}
```

Tests verify:
- **Exact equivalence**: INT32 outputs match exactly
- **Edge cases**: Non-aligned sizes, small inputs
- **Remainder handling**: Correct processing of tail elements

## Performance Expectations

| Operation | AVX2 Speedup | NEON Speedup | WASM Speedup |
|-----------|--------------|--------------|--------------|
| Conv2d 3×3 | 2-3x | 2-2.5x | 1.5-2x |
| Depthwise | 2-2.5x | 2x | 1.5-2x |
| MatMul | 3-4x | 2.5-3x | 2-2.5x |
| Dot Product | 4x | 3x | 2x |

### Memory Bandwidth

INT8 provides 4x memory bandwidth reduction:
- **FP32**: 4 bytes per value
- **INT8**: 1 byte per value
- **Cache efficiency**: 4x better cache utilization

## Usage Example

```rust
use ruvector_cnn::kernels::{conv2d_int8_dispatch, QuantParams};

// Quantize inputs
let input_q: Vec<u8> = quantize_activations(&input_fp32);
let kernel_q: Vec<i8> = quantize_weights(&kernel_fp32);
let bias_q: Vec<i32> = quantize_bias(&bias_fp32);

// Run INT8 convolution (automatic SIMD dispatch)
let mut output_i32 = vec![0i32; output_size];
conv2d_int8_dispatch(
    &input_q, input_zero_point, &kernel_q, &bias_q,
    &mut output_i32, in_h, in_w, in_c, out_c, stride, padding
);

// Dequantize output
let output_fp32 = dequantize(&output_i32, output_scale);
```

## Testing Strategy

### Unit Tests

Each kernel includes unit tests:
- Basic functionality tests
- Small input tests (< SIMD width)
- Non-aligned size tests
- Equivalence tests vs scalar

### Integration Tests

Tests verify:
- End-to-end quantized inference
- Accuracy vs FP32 baseline (<1% degradation)
- Performance benchmarks

### Running Tests

```bash
# Run all kernel tests
cargo test -p ruvector-cnn --lib kernels

# Run specific kernel tests
cargo test -p ruvector-cnn test_kernel_equivalence

# Run with optimizations
cargo test -p ruvector-cnn --release kernels
```

## Future Optimizations

### AVX-512 VNNI

Intel Cascade Lake+ supports VNNI instructions:
- `_mm512_dpbusd_epi32`: 4-way dot product in single instruction
- Expected speedup: 5-6x over FP32

### ARM Dot Product (ARMv8.2+)

ARM Cortex-A55+ supports dot product instructions:
- `vdotq_s32`: 4-way dot product
- Expected speedup: 3-4x over FP32

## References

1. **ADR-091**: INT8 Quantization Design for ruvector-cnn
2. **Intel Intrinsics Guide**: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
3. **ARM NEON Intrinsics**: https://developer.arm.com/architectures/instruction-sets/intrinsics/
4. **WebAssembly SIMD**: https://github.com/WebAssembly/simd

## Compliance

- **INV-6**: All kernels match scalar reference within 1 ULP ✓
- **Edge cases**: Non-aligned inputs handled correctly ✓
- **Multi-architecture**: AVX2, NEON, WASM SIMD128 supported ✓
- **Automatic dispatch**: Runtime feature detection implemented ✓