lnmp-spatial 0.5.14

# LNMP-Spatial Optimization Guide

## Overview

This guide provides optimization techniques and best practices for maximizing LNMP-Spatial performance in production systems.

## Core Principles

1. **Zero-Allocation Hot Path**: Reuse buffers
2. **Stack-First Design**: Avoid heap when possible
3. **Delta Efficiency**: Minimize absolute frames
4. **Batch Processing**: Amortize syscall overhead
5. **Prediction Tuning**: Balance smoothness vs accuracy

## Memory Optimization

### 1. Buffer Reuse Pattern

**Problem:** Allocating `Vec<u8>` every frame wastes time.

**Solution:** Pre-allocate and reuse:

```rust
struct SpatialEncoder {
    buffer: Vec<u8>,
}

impl SpatialEncoder {
    fn new() -> Self {
        Self {
            buffer: Vec::with_capacity(1024), // Typical max frame size
        }
    }

    fn encode(&mut self, value: &SpatialValue) -> Result<&[u8], SpatialError> {
        self.buffer.clear(); // O(1), keeps capacity
        encode_spatial(value, &mut self.buffer)?;
        Ok(&self.buffer)
    }
}
```

**Impact:** Eliminates per-frame allocation (~50ns saving).

### 2. Fixed-Size Buffers (Embedded)

For deterministic memory on embedded systems:

```rust
const MAX_FRAME_SIZE: usize = 256;

struct FixedEncoder {
    buffer: [u8; MAX_FRAME_SIZE],
    len: usize,
}

impl FixedEncoder {
    fn encode(&mut self, value: &SpatialValue) -> Result<&[u8], SpatialError> {
        let mut writer = &mut self.buffer[..];
        encode_spatial(value, &mut Vec::from(writer))?; // ⚠️ Still allocates
        // Better: use a custom Write impl
        Ok(&self.buffer[..self.len])
    }
}
```

### 3. Object Pooling (High-Frequency)

For multi-threaded systems:

```rust
use crossbeam::queue::ArrayQueue;

struct FramePool {
    pool: ArrayQueue<Vec<u8>>,
}

impl FramePool {
    fn acquire(&self) -> Vec<u8> {
        self.pool.pop().unwrap_or_else(|| Vec::with_capacity(1024))
    }

    fn release(&self, mut buffer: Vec<u8>) {
        buffer.clear();
        let _ = self.pool.push(buffer); // Ignore if pool full
    }
}
```

**Impact:** Thread-safe, lock-free buffer reuse.

## Bandwidth Optimization

### 1. Adaptive ABS Interval

Dynamically adjust based on drift:

```rust
struct AdaptiveStreamer {
    streamer: SpatialStreamer,
    drift_estimator: DriftEstimator,
}

impl AdaptiveStreamer {
    fn next_frame(&mut self, state: &SpatialState) -> Result<SpatialFrame, SpatialError> {
        let drift = self.drift_estimator.current_drift();
        
        // Increase ABS frequency if drift is high
        let abs_interval = if drift > 0.1 {
            10  // Reset more often
        } else if drift > 0.01 {
            50
        } else {
            100 // Normal
        };

        // Update config if changed
        if abs_interval != self.streamer.config.abs_interval {
            self.streamer = SpatialStreamer::with_config(SpatialStreamerConfig {
                abs_interval,
                ..self.streamer.config
            });
        }

        self.streamer.next_frame(state, get_timestamp_ns())
    }
}
```

### 2. Delta Suppression

Skip frames with negligible changes:

```rust
fn should_send_frame(current: &Position3D, last: &Position3D) -> bool {
    let delta = Position3D::compute_delta(last, current);
    let magnitude = (delta.dx.powi(2) + delta.dy.powi(2) + delta.dz.powi(2)).sqrt();
    
    magnitude > 0.001 // Threshold: 1mm
}
```

**Impact:** Reduces frames by 30-50% in stationary scenarios.

### 3. Compression (Optional)

For network-constrained environments:

```rust
use lz4::EncoderBuilder;

fn compress_frame(frame: &[u8]) -> Vec<u8> {
    let mut encoder = EncoderBuilder::new()
        .level(1) // Fast compression
        .build(Vec::new())
        .unwrap();
    
    encoder.write_all(frame).unwrap();
    let (compressed, _) = encoder.finish();
    compressed
}
```

**Impact:** 20-40% additional size reduction, +10-20ns latency.

## Latency Optimization

### 1. Inline Hot Paths

Mark critical functions for inlining:

```rust
#[inline(always)]
pub fn encode_position3d(pos: &Position3D, buf: &mut Vec<u8>) {
    buf.put_u8(0x02);
    buf.put_f32(pos.x);
    buf.put_f32(pos.y);
    buf.put_f32(pos.z);
}
```

**Impact:** Eliminates function call overhead (~1-2ns).

### 2. Avoid Serialization for Checksums

Current implementation serializes payload for CRC32. Optimize:

```rust
// Instead of: bincode::serialize(&payload)
// Use direct byte access:

fn compute_payload_checksum(value: &SpatialValue) -> u32 {
    match value {
        SpatialValue::S2(pos) => {
            let mut hasher = crc32fast::Hasher::new();
            hasher.update(&[0x02]); // Type
            hasher.update(&pos.x.to_le_bytes());
            hasher.update(&pos.y.to_le_bytes());
            hasher.update(&pos.z.to_le_bytes());
            hasher.finalize()
        }
        // ... other types
        _ => todo!()
    }
}
```

**Impact:** ~8-10ns faster checksum computation.

### 3. SIMD Vectorization (Future)

For batch encoding (requires nightly Rust):

```rust
#![feature(portable_simd)]
use std::simd::f32x4;

fn encode_positions_simd(positions: &[Position3D], buf: &mut Vec<u8>) {
    for chunk in positions.chunks(4) {
        // Load 4 positions into SIMD registers
        let xs = f32x4::from_array([chunk[0].x, chunk[1].x, chunk[2].x, chunk[3].x]);
        // Process in parallel...
    }
}
```

**Estimated Impact:** 2-4× throughput for batch operations.

## CPU Optimization

### 1. Branch Prediction

Order match arms by frequency:

```rust
match frame.header.mode {
    FrameMode::Delta => { /* 99% of frames */ }
    FrameMode::Absolute => { /* 1% of frames */ }
}
```

### 2. Cache-Friendly Data Layout

Keep frequently-accessed fields together:

```rust
#[repr(C)]
struct SpatialFrameHeader {
    mode: FrameMode,          // 1 byte
    sequence_id: u32,         // 4 bytes (aligned)
    timestamp: u64,           // 8 bytes (aligned)
    checksum: u32,            // 4 bytes
    // Total: 17 bytes, fits in 32-byte cache line
}
```

### 3. Prefetching (Advanced)

For predictable access patterns:

```rust
use std::intrinsics::prefetch_read_data;

unsafe fn process_frame_batch(frames: &[SpatialFrame]) {
    for i in 0..frames.len() {
        if i + 1 < frames.len() {
            prefetch_read_data(&frames[i + 1], 3); // Prefetch next frame
        }
        process_frame(&frames[i]);
    }
}
```

## Network Optimization

### 1. Nagle Algorithm Disable

For real-time systems:

```rust
use std::net::TcpStream;

let stream = TcpStream::connect("127.0.0.1:8080")?;
stream.set_nodelay(true)?; // Disable Nagle
```

**Impact:** Reduces latency by 10-40ms in LAN environments.

### 2. UDP + Custom Reliability

For maximum speed:

```rust
use std::net::UdpSocket;

struct ReliableUdp {
    socket: UdpSocket,
    pending: HashMap<u32, SpatialFrame>, // seq_id -> frame
}

impl ReliableUdp {
    fn send(&mut self, frame: SpatialFrame) -> Result<(), Error> {
        let serialized = bincode::serialize(&frame)?;
        self.socket.send(&serialized)?;
        
        // Store for potential retransmission
        if frame.header.mode == FrameMode::Absolute {
            self.pending.insert(frame.header.sequence_id, frame);
        }
        
        Ok(())
    }
}
```

### 3. Batching

Send multiple frames per syscall:

```rust
fn send_batch(socket: &UdpSocket, frames: &[SpatialFrame]) -> Result<(), Error> {
    let mut batch_buffer = Vec::with_capacity(frames.len() * 100);
    
    for frame in frames {
        let serialized = bincode::serialize(frame)?;
        batch_buffer.extend_from_slice(&(serialized.len() as u16).to_le_bytes());
        batch_buffer.extend_from_slice(&serialized);
    }
    
    socket.send(&batch_buffer)?;
    Ok(())
}
```

**Impact:** Reduces syscalls 10-100×.

## Prediction Optimization

### 1. Velocity-Based Prediction

Current implementation uses fixed `dt = 1ms`. Make it dynamic:

```rust
fn predict_next(&self, last_timestamp: u64, current_timestamp: u64) -> Position3D {
    let dt = (current_timestamp - last_timestamp) as f64 / 1e9; // ns to seconds
    
    Position3D {
        x: self.position.x + self.velocity.vx * dt as f32,
        y: self.position.y + self.velocity.vy * dt as f32,
        z: self.position.z + self.velocity.vz * dt as f32,
    }
}
```

### 2. Acceleration-Aware Prediction

For smoother motion:

```rust
fn predict_with_acceleration(&self, dt: f32) -> Position3D {
    // s = s0 + v*t + 0.5*a*t²
    Position3D {
        x: self.position.x + self.velocity.vx * dt + 0.5 * self.acceleration.ax * dt.powi(2),
        y: self.position.y + self.velocity.vy * dt + 0.5 * self.acceleration.ay * dt.powi(2),
        z: self.position.z + self.velocity.vz * dt + 0.5 * self.acceleration.az * dt.powi(2),
    }
}
```

### 3. Kalman Filter Integration

For noisy sensors:

```rust
use nalgebra::{Matrix2, Vector2};

struct KalmanPredictor {
    state: Vector2<f32>,      // [position, velocity]
    covariance: Matrix2<f32>,
}

impl KalmanPredictor {
    fn predict(&mut self, dt: f32) -> f32 {
        // State transition: position += velocity * dt
        let f = Matrix2::new(
            1.0, dt,
            0.0, 1.0,
        );
        
        self.state = f * self.state;
        self.covariance = f * self.covariance * f.transpose();
        
        self.state[0] // Return predicted position
    }

    fn update(&mut self, measurement: f32) {
        // Update with actual measurement when frame arrives
        // ... Kalman update equations
    }
}
```

## Profiling Best Practices

### 1. Benchmark Real Workloads

Don't just benchmark isolated operations:

```rust
#[bench]
fn bench_realistic_telemetry_loop(b: &mut Bencher) {
    let mut streamer = SpatialStreamer::new(100);
    let mut buffer = Vec::with_capacity(1024);
    let mut state = create_initial_state();
    
    b.iter(|| {
        // Update physics
        update_robot_state(&mut state, 0.001);
        
        // Generate frame
        let frame = streamer.next_frame(&state, get_timestamp_ns()).unwrap();
        
        // Encode
        buffer.clear();
        bincode::serialize_into(&mut buffer, &frame).unwrap();
        
        // Simulate network
        black_box(&buffer);
    });
}
```

### 2. Use `perf` Effectively

```bash
# Record with call graph
perf record --call-graph dwarf -F 999 cargo bench

# Find hotspots
perf report --stdio | head -50

# Cache misses
perf stat -e cache-misses,cache-references cargo bench
```

### 3. Flamegraphs

```bash
cargo install flamegraph
cargo flamegraph --bench spatial_bench -- --bench

# Opens flamegraph.svg in browser
```

## Platform-Specific Optimizations

### Real-Time Linux

```rust
// Set real-time priority
use libc::{sched_setscheduler, sched_param, SCHED_FIFO};

unsafe {
    let mut param = sched_param { sched_priority: 99 };
    sched_setscheduler(0, SCHED_FIFO, &param);
}

// Pin to CPU core
use core_affinity;
core_affinity::set_for_current(core_affinity::CoreId { id: 0 });
```

### Windows High-Resolution Timer

```rust
#[cfg(windows)]
use winapi::um::timeapi::{timeBeginPeriod, timeEndPeriod};

unsafe {
    timeBeginPeriod(1); // 1ms resolution
}
```

### Embedded (no_std)

```rust
#![no_std]

// Use fixed-point instead of f32
type FixedPoint = i32; // Q16.16 format

fn to_fixed(f: f32) -> FixedPoint {
    (f * 65536.0) as i32
}

fn from_fixed(fp: FixedPoint) -> f32 {
    fp as f32 / 65536.0
}
```

## Anti-Patterns to Avoid

### ❌ Don't: Allocate in hot loop

```rust
for state in states {
    let delta = compute_delta(&last, &state); // ❌ Returns owned struct
    send_frame(delta);
}
```

### ✅ Do: Reuse storage

```rust
let mut delta = PositionDelta::default();
for state in states {
    compute_delta_into(&last, &state, &mut delta); // ✅ Writes to existing
    send_frame(&delta);
}
```

### ❌ Don't: Clone unnecessarily

```rust
fn process(frame: SpatialFrame) { /* ❌ Takes ownership */ }
```

### ✅ Do: Use references

```rust
fn process(frame: &SpatialFrame) { /* ✅ Borrows */ }
```

### ❌ Don't: Over-predict

```rust
max_prediction_frames: 100 // ❌ Too many, drift accumulates
```

### ✅ Do: Conservative limits

```rust
max_prediction_frames: 3 // ✅ 3ms tolerance at 1kHz
```

## Checklist for Production

- [ ] Buffer pre-allocation and reuse
- [ ] Batch network I/O where possible
- [ ] Profile with realistic workloads
- [ ] Set correct prediction limits
- [ ] Disable Nagle for TCP
- [ ] Consider UDP for lowest latency
- [ ] Use `--release` builds (10-100× faster than debug)
- [ ] Monitor drift and adjust ABS interval
- [ ] Implement backpressure/flow control
- [ ] Test under packet loss scenarios
- [ ] Benchmark on target hardware

## Summary

**Key Optimizations:**
1. 📦 **Reuse buffers** → Eliminate allocations
2. 📊 **Batch operations** → Amortize syscall cost
3. 🎯 **Tune prediction** → Balance smoothness vs accuracy
4. 🚀 **Inline hot paths** → Reduce call overhead
5. 🌐 **Optimize network** → TCP_NODELAY or UDP

**Expected Gains:**
- 2-5× lower latency with buffer reuse
- 10-100× higher throughput with batching
- 20-30% bandwidth savings with delta suppression
- Sub-microsecond jitter with real-time OS

**Remember:** Profile first, optimize second. Target the 20% of code that accounts for 80% of runtime.