spn-native 0.2.0

Native model inference and storage for the SuperNovae ecosystem
# spn-native Implementation Checklist

**Current Version:** v0.1.0 (Storage & Download)
**Next Version:** v0.2.0 (Inference with mistral.rs)
**Prepared by:** Rust Architect Review

---

## Immediate Actions (This Sprint)

### Code Quality & Documentation

- [ ] **Add comprehensive module docs**
  ```rust
  // src/inference/mod.rs
  //! Local inference backend using mistral.rs.
  //!
  //! Requires the `inference` feature:
  //! `spn-native = { version = "0.2", features = ["inference"] }`
  //!
  //! # Features
  //! - Supports all mistral.rs architectures (Llama, Qwen, Gemma, etc.)
  //! - GPU acceleration (Metal on macOS, CUDA on Linux)
  //! - Streaming inference for better UX
  ```

- [ ] **Review error handling gaps**
  - [ ] Add `NativeError::ModelNotLoaded`
  - [ ] Add `NativeError::InferenceFailed(String)`
  - [ ] Add `NativeError::UnsupportedArchitecture(String)`
  - [ ] Add proper error message for missing "inference" feature
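A missing-feature error is most useful when it tells the user how to fix it. A minimal sketch, compiled only when the `inference` feature is off; `load_model_stub` is an illustrative name, not the actual API:

```rust
// Sketch only: `load_model_stub` is hypothetical, not the real API.
// When built without the `inference` feature, point users at the fix.
#[cfg(not(feature = "inference"))]
pub fn load_model_stub(model: &str) -> Result<(), String> {
    Err(format!(
        "cannot load '{model}': spn-native was built without the \
         `inference` feature; rebuild with `--features inference`"
    ))
}

fn main() {
    #[cfg(not(feature = "inference"))]
    println!("{}", load_model_stub("qwen3:8b").unwrap_err());
}
```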

- [ ] **Fix HTTP status code handling**
  ```rust
  // Current: treats 404 and 500 identically
  match response.status() {
      StatusCode::NOT_FOUND => {
          return Err(NativeError::ModelNotFound { repo, filename });
      }
      _ if !response.status().is_success() => {
          return Err(NativeError::Http(
              format!("HTTP {}: {}", response.status(), ...)
          ));
      }
      _ => {}
  }
  ```

- [ ] **Extract magic numbers to constants**
  ```rust
  const BYTES_PER_GB: u64 = 1_073_741_824; // 2^30 (binary GB, i.e. GiB)
  const KB_PER_GB: u64 = 1_048_576; // 2^20 (binary KB per GB)
  const WINDOWS_RAM_DEFAULT_GB: u32 = 16;
  const FALLBACK_RAM_DEFAULT_GB: u32 = 8;
  ```
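As a sanity check, the constants can back a small, testable size formatter; `format_size` here is illustrative, not part of the crate:

```rust
const BYTES_PER_GB: u64 = 1_073_741_824; // 2^30

/// Render a byte count as gigabytes with one decimal place.
fn format_size(bytes: u64) -> String {
    let gb = bytes as f64 / BYTES_PER_GB as f64;
    format!("{gb:.1} GB")
}

fn main() {
    // A typical 8B-parameter Q4 GGUF weighs in around 5 GB on disk.
    println!("{}", format_size(5 * BYTES_PER_GB)); // 5.0 GB
}
```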

- [ ] **Add path traversal safety validation** (optional, low priority)
  ```rust
  fn validate_model_id(model_id: &str) -> Result<()> {
      if model_id.contains("..") {
          return Err(NativeError::InvalidConfig(".. not allowed in model ID".into()));
      }
      Ok(())
  }
  ```
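If the validator is kept, it is cheap to also reject absolute paths and Windows-style separators. A hedged extension of the sketch above, with the error type simplified to `String` so it runs standalone:

```rust
// Broader validation sketch (illustrative only): reject traversal,
// absolute paths, and backslash separators in one pass.
fn validate_model_id(model_id: &str) -> Result<(), String> {
    if model_id.contains("..") {
        return Err("`..` not allowed in model ID".into());
    }
    if model_id.starts_with('/') || model_id.contains('\\') {
        return Err("absolute or backslash paths not allowed in model ID".into());
    }
    Ok(())
}

fn main() {
    assert!(validate_model_id("qwen/qwen3-8b-gguf").is_ok());
    assert!(validate_model_id("../../etc/passwd").is_err());
    assert!(validate_model_id(r"models\evil").is_err());
}
```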

### Testing

- [ ] **Add more comprehensive error tests**
  ```rust
  #[test]
  fn test_checksum_mismatch_cleanup() {
      // Verify corrupted file is deleted
  }

  #[test]
  fn test_interrupted_download_cleanup() {
      // Verify partial downloads don't corrupt cache
  }
  ```

- [ ] **Add property-based tests** (optional)
  ```rust
  use proptest::prelude::*;

  proptest! {
      #[test]
      fn prop_model_path_safe(repo in "[a-z0-9/-]{1,50}") {
          // Verify no path traversal
      }
  }
  ```

- [ ] **Mark integration tests with `#[ignore]`**
  ```rust
  #[tokio::test]
  #[ignore]  // Run with: cargo test -- --ignored --test-threads 1
  async fn test_real_huggingface_download() {
      // Requires network access
  }
  ```

---

## Phase 2 Preparation (Pre-Implementation)

### Architectural Decisions

- [ ] **Confirm mistral.rs version**
  - Target: v0.7.0 or later
  - Check: https://github.com/EricLBuehler/mistral.rs/releases
  - Action: Document minimum supported version

- [ ] **Decide on streaming architecture**
  - Option A: Return `Box<dyn Stream<Item = Token>>`
  - Option B: Use `futures::channel::mpsc` with spawned task
  - Option C: Callback-based (like current `download()`)
  - **Recommendation:** Option B (cleaner for async/await)
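The channel-based shape of Option B can be sketched with the standard library. A real implementation would use `tokio::sync::mpsc` inside a spawned task, but the producer/consumer and cancellation story is the same:

```rust
use std::sync::mpsc;
use std::thread;

// Synchronous analogue of Option B: a producer sends tokens through a
// channel while the consumer drains them as they arrive. The async
// version would swap in tokio::sync::mpsc and tokio::spawn.
fn stream_tokens(tokens: Vec<String>) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for t in tokens {
            if tx.send(t).is_err() {
                break; // receiver dropped: stop generating (cancellation)
            }
        }
    });
    rx
}

fn main() {
    let rx = stream_tokens(vec!["Hel".into(), "lo".into()]);
    let out: String = rx.iter().collect();
    println!("{out}"); // Hello
}
```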

- [ ] **Plan GPU layer allocation strategy**
  - How to auto-detect available VRAM?
  - How to gracefully fall back to CPU?
  - Should we expose this in LoadConfig?
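One possible answer to the questions above, expressed as a pure sizing function so it is easy to unit-test; all names and the bytes-per-layer estimate are hypothetical:

```rust
// Sketch of a GPU-layer fallback policy: offload as many layers as the
// detected VRAM can hold, and fall back to CPU when none fit or no GPU
// is detected. Names here are illustrative, not the crate's API.
#[derive(Debug, PartialEq)]
enum Placement {
    Gpu { layers: u32 },
    Cpu,
}

fn plan_layers(total_layers: u32, vram_bytes: Option<u64>, bytes_per_layer: u64) -> Placement {
    match vram_bytes {
        Some(vram) => {
            let fit = (vram / bytes_per_layer) as u32;
            if fit == 0 {
                Placement::Cpu
            } else {
                Placement::Gpu { layers: fit.min(total_layers) }
            }
        }
        None => Placement::Cpu, // no GPU detected
    }
}

fn main() {
    // 32 layers, 8 GiB VRAM, ~200 MiB per layer: all 32 fit.
    let p = plan_layers(32, Some(8 * 1_073_741_824), 200 * 1_048_576);
    assert_eq!(p, Placement::Gpu { layers: 32 });
}
```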

### Dependency Management

- [ ] **Create feature flag matrix**
  ```toml
  [features]
  default = []
  inference = ["dep:mistral-rs"]
  metal = ["candle-core/metal"]
  cuda = ["candle-core/cuda"]
  all = ["inference", "metal", "cuda"]
  ```

- [ ] **Pin mistral.rs version**
  - Add to Cargo.toml with exact version
  - Document breaking changes if any
  - Create MISTRAL_RS_CHANGELOG.md tracking compatibility

- [ ] **Test feature combinations**
  ```bash
  cargo test --no-default-features
  cargo test --features inference
  cargo test --all-features
  ```

### Documentation

- [ ] **Create MISTRAL_RS_INTEGRATION_PLAN.md** ✅ (Done)

- [ ] **Create GPU_SETUP.md**
  ```markdown
  # GPU Setup Guide

  ## macOS (Metal)
  - Automatic via candle-core/metal feature
  - Works on all Apple Silicon & Intel with Metal support

  ## Linux (CUDA)
  - Requires CUDA 12.0+
  - NVIDIA driver 525.0+
  - cargo build --features cuda

  ## Windows
  - TODO: Document
  ```

- [ ] **Create TROUBLESHOOTING.md**
  ```markdown
  # Troubleshooting

  ## "Model not found" errors
  - Check HuggingFace repo is public
  - Verify exact model name

  ## Out of memory
  - Try lower quantization (Q4 instead of F16)
  - Check system RAM with: spn-native detect-ram
  - Reduce context_size in LoadConfig

  ## GPU not detected
  - macOS: Check System Report > GPU
  - Linux: Run nvidia-smi
  ```

---

## Phase 2 Implementation (Detailed Task Breakdown)

### Module: inference/mod.rs

- [ ] **Define InferenceBackend trait**
  ```rust
  // `async fn` in traits requires Rust 1.75+; use the `async-trait`
  // crate if `dyn InferenceBackend` trait objects are needed.
  pub trait InferenceBackend: Send + Sync {
      async fn load(&mut self, model_path: PathBuf, config: LoadConfig) -> Result<()>;
      async fn unload(&mut self) -> Result<()>;
      fn is_loaded(&self) -> bool;
      fn model_info(&self) -> Option<ModelInfo>;
      async fn infer(&self, prompt: &str, opts: ChatOptions) -> Result<ChatResponse>;
  }
  ```

- [ ] **Re-export public types**
  ```rust
  pub use spn_core::{ChatOptions, ChatResponse, LoadConfig, ModelArchitecture};
  pub use crate::NativeError;

  #[cfg(feature = "inference")]
  pub use self::runtime::NativeRuntime;
  ```

- [ ] **Write module documentation with examples**

### Module: inference/runtime.rs

- [ ] **Create NativeRuntime struct**
  ```rust
  pub struct NativeRuntime {
      model_path: Option<PathBuf>,
      config: LoadConfig,

      #[cfg(feature = "inference")]
      model: Option<GgufModel>,

      #[cfg(feature = "inference")]
      tokenizer: Option<Tokenizer>,

      device: Option<Device>,
      metadata: Option<ModelInfo>,
  }
  ```

- [ ] **Implement load()**
  - [ ] Read GGUF file from disk
  - [ ] Parse GGUF header (detect architecture)
  - [ ] Select device (CPU/GPU)
  - [ ] Create appropriate builder based on architecture
  - [ ] Load tokenizer
  - [ ] Populate metadata
  - [ ] Handle errors gracefully
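For the header-parsing step: GGUF files begin with the 4-byte magic `GGUF` followed by a little-endian `u32` format version, so detection can be sketched without any dependency:

```rust
// Validate the GGUF magic and read the format version from the first
// eight bytes of the file (the full header also carries tensor and
// metadata counts, omitted here).
fn parse_gguf_header(bytes: &[u8]) -> Result<u32, String> {
    if bytes.len() < 8 {
        return Err("file too short for a GGUF header".into());
    }
    if &bytes[..4] != b"GGUF" {
        return Err("not a GGUF file (bad magic)".into());
    }
    let version = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    Ok(version)
}

fn main() {
    let header = [b'G', b'G', b'U', b'F', 3, 0, 0, 0]; // GGUF v3
    assert_eq!(parse_gguf_header(&header), Ok(3));
}
```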

- [ ] **Implement infer()**
  - [ ] Validate model is loaded
  - [ ] Tokenize input
  - [ ] Run forward pass
  - [ ] Decode output
  - [ ] Return ChatResponse

- [ ] **Implement infer_stream()** (Phase 2b)
  - [ ] Return async stream of tokens
  - [ ] Handle interrupts gracefully

- [ ] **Implement unload()**
  - [ ] Clear model from memory
  - [ ] Release GPU resources

### Module: inference/builder.rs (Optional, Phase 2+)

- [ ] **Create builder pattern for NativeRuntime**
  ```rust
  pub struct NativeRuntimeBuilder {
      model_path: Option<PathBuf>,
      config: LoadConfig,
      device: Option<Device>,
  }

  impl NativeRuntimeBuilder {
      pub fn new() -> Self { ... }
      pub fn model_path(mut self, path: PathBuf) -> Self { ... }
      pub fn with_gpu_layers(mut self, layers: i32) -> Self { ... }
      pub fn force_cpu(mut self) -> Self { ... }
      pub fn build(self) -> Result<NativeRuntime> { ... }
  }
  ```

### Error Handling: error.rs

- [ ] **Add InferenceError variants**
  ```rust
  #[derive(Error, Debug)]
  pub enum NativeError {
      // ... existing variants ...

      #[error("Model not loaded")]
      #[cfg(feature = "inference")]
      ModelNotLoaded,

      #[error("Inference failed: {0}")]
      #[cfg(feature = "inference")]
      InferenceFailed(String),

      #[error("Unsupported architecture: {0}")]
      #[cfg(feature = "inference")]
      UnsupportedArchitecture(String),

      #[error("Device error: {0}")]
      #[cfg(feature = "inference")]
      DeviceError(String),

      #[error("Tokenizer error: {0}")]
      #[cfg(feature = "inference")]
      TokenizerError(String),
  }
  ```

- [ ] **Add feature-gated conversions**
  ```rust
  #[cfg(feature = "inference")]
  impl From<mistral_rs::MistralError> for NativeError {
      fn from(err: mistral_rs::MistralError) -> Self {
          NativeError::InferenceFailed(err.to_string())
      }
  }
  ```

### Tests: tests/inference.rs

- [ ] **Create test module with `#[ignore]` tests**
  ```rust
  #[cfg(feature = "inference")]
  mod inference_tests {
      #[tokio::test]
      #[ignore]
      async fn test_load_gguf() { ... }

      #[tokio::test]
      #[ignore]
      async fn test_infer_text() { ... }

      #[tokio::test]
      #[ignore]
      async fn test_unload() { ... }
  }
  ```

- [ ] **Document how to run ignored tests**
  ```bash
  # Run inference tests (requires model file)
  cargo test --test inference -- --ignored --test-threads 1
  ```

### Integration: lib.rs

- [ ] **Update public API**
  ```rust
  mod inference;

  #[cfg(feature = "inference")]
  pub use inference::NativeRuntime;

  #[cfg(feature = "inference")]
  pub use spn_core::{ChatOptions, ChatResponse};
  ```

- [ ] **Update module-level documentation**
  ```rust
  //! # Example: Download and Infer
  //!
  //! ```ignore
  //! #[cfg(feature = "inference")]
  //! #[tokio::main]
  //! async fn main() -> Result<(), Box<dyn std::error::Error>> {
  //!     // Download
  //!     let storage = HuggingFaceStorage::new(default_model_dir());
  //!     let model = find_model("qwen3:8b")?;
  //!     let request = DownloadRequest::curated(model);
  //!     let result = storage.download(&request, |_| {}).await?;
  //!
  //!     // Infer
  //!     let config = LoadConfig::default().with_gpu_layers(-1);
  //!     let mut runtime = NativeRuntime::new();
  //!     runtime.load(result.path, config).await?;
  //!     let response = runtime.infer("Hello", ChatOptions::default()).await?;
  //!     println!("{}", response.content);
  //!
  //!     Ok(())
  //! }
  //! ```
  ```

---

## Phase 2b: Nika Integration

### In nika/Cargo.toml

- [ ] **Add spn-native with inference feature**
  ```toml
  [dependencies]
  spn-native = { path = "../supernovae-cli/crates/spn-native", features = ["inference"] }
  ```

### In nika/src/

- [ ] **Create model_executor module**
  ```rust
  pub mod model_executor {
      use spn_native::NativeRuntime;
      use spn_core::LoadConfig;

      pub struct ModelExecutor {
          runtime: NativeRuntime,
          loaded_model: Option<String>,
      }

      impl ModelExecutor {
          pub async fn load_and_infer(&mut self, prompt: &str) -> Result<String> { ... }
      }
  }
  ```

- [ ] **Wire into Nika's execution engine**
  - [ ] Hook model loading commands
  - [ ] Hook infer commands
  - [ ] Handle errors from NativeRuntime

- [ ] **Update Nika tests to use local inference**
  - [ ] Remove Ollama dependency tests
  - [ ] Add feature guards: `#[cfg(feature = "native-inference")]`

### In nika/docs/

- [ ] **Create NATIVE_INFERENCE.md**
  - [ ] How to enable locally
  - [ ] GPU setup instructions
  - [ ] Troubleshooting guide
  - [ ] Performance expectations

---

## Phase 3: Streaming & Advanced

### Streaming Support

- [ ] **Implement infer_stream() properly**
  - [ ] Return async stream of tokens
  - [ ] Handle cancellation
  - [ ] Test with long outputs

- [ ] **Update Nika TUI to show streaming output**
  - [ ] Display tokens as they arrive
  - [ ] Show estimated tokens/sec
  - [ ] Allow user to stop generation

### Performance Optimization

- [ ] **Implement token caching** (KV cache management)
- [ ] **Add batch processing** for multiple requests
- [ ] **Profile GPU memory usage**
- [ ] **Optimize context window allocation**

---

## Quality Gates: Before Release

### Code Quality

- [ ] **Zero clippy warnings**
  ```bash
  cargo clippy --all-targets --all-features -- -D warnings
  ```

- [ ] **Format check**
  ```bash
  cargo fmt --check
  ```

- [ ] **Documentation tests pass**
  ```bash
  cargo test --doc
  ```

- [ ] **All unit tests pass**
  ```bash
  cargo test --lib
  ```

### Integration Testing

- [ ] **Manual test on macOS (Metal)**
  ```bash
  cargo run --example load-infer --features inference,metal
  ```

- [ ] **Manual test on Linux (CUDA, if available)**
  ```bash
  cargo run --example load-infer --all-features
  ```

- [ ] **Manual test on Windows (CPU fallback)**
  ```bash
  cargo run --example load-infer --features inference
  ```

### Documentation

- [ ] **All public items have doc comments**
  ```bash
  cargo doc --no-deps --open  # spot-check for undocumented public items
  ```

- [ ] **Examples compile and run**
  ```bash
  cargo build --examples --features inference
  ```

- [ ] **README.md is up-to-date**

### Performance Benchmarks

- [ ] **Model load time < 30 seconds** (for typical 8B model)
- [ ] **Inference latency < 2 seconds/token** (on 16GB RAM)
- [ ] **Memory footprint reasonable** (no memory leaks)

---

## Release Checklist

### Pre-Release (v0.2.0)

- [ ] **Bump version in Cargo.toml**
  ```toml
  [package]
  name = "spn-native"
  version = "0.2.0"
  ```

- [ ] **Update CHANGELOG.md**
  ```markdown
  ## [0.2.0] - 2026-MM-DD

  ### Added
  - Inference support via mistral.rs
  - NativeRuntime for local model execution
  - GPU acceleration (Metal, CUDA)
  - Streaming inference support

  ### Changed
  - Re-export ChatOptions, ChatResponse from spn-core

  ### Fixed
  - HTTP status code handling (distinguish 404 vs 500)
  ```

- [ ] **Tag release**
  ```bash
  git tag -a v0.2.0 -m "Add mistral.rs inference support"
  git push origin v0.2.0
  ```

- [ ] **Publish to crates.io**
  ```bash
  cargo publish
  ```

### Post-Release

- [ ] **Update downstream repos**
  - [ ] nika: bump spn-native dependency
  - [ ] docs: add inference examples

---

## Known Limitations & TODOs

### Current (v0.1.0)
- [ ] Windows RAM detection not implemented (uses 16GB default)
- [ ] No retry logic for network failures
- [ ] No streaming downloads

### Phase 2 (v0.2.0)
- [ ] VLMs not yet supported (text models only)
- [ ] No fine-tuning or LoRA support
- [ ] Single model loaded at a time

### Phase 3+
- [ ] Multi-model swapping (hot-swap)
- [ ] Distributed inference across devices
- [ ] Advanced quantization techniques

---

## Success Metrics

### Before Phase 2 Release
- [ ] 100% unit test pass rate
- [ ] Zero clippy warnings
- [ ] Documentation coverage > 95%
- [ ] Inference latency < 2 sec/token on reference hardware

### After Nika Integration
- [ ] Nika can run workflows offline
- [ ] Performance parity with Ollama ± 10%
- [ ] User feedback rating > 4.0/5.0

---

## Quick Reference: Command Summary

```bash
# Development
cargo test                           # Unit tests
cargo test -- --ignored             # Integration tests
cargo clippy -- -D warnings          # Lint
cargo fmt                            # Format
cargo doc --no-deps --open          # View docs

# Build
cargo build --release                # Native
cargo build --features inference     # With inference
cargo build --all-features          # All features

# Publish
cargo publish --dry-run              # Verify
cargo publish                        # Publish to crates.io

# Testing inference (Phase 2+)
cargo test --test inference -- --ignored
cargo run --example load-infer --features inference
```

---

**Last Updated:** 2026-03-10
**Status:** Ready for Phase 2 implementation
**Reviewed by:** Rust Architect (Claude)