realizar 0.8.4

Pure Rust ML inference engine built from scratch, with model serving for GGUF and safetensors formats
//! Weight management methods for CUDA-accelerated inference
//!
//! This module contains weight upload and caching implementations:
//! - `pre_cache_weights_for_batch`: Pre-cache weights for batched forward pass
//! - `preload_weights_gpu`: Upload all layer weights to GPU with indexed lookup
//! - `clear_decode_graph`: Clear CUDA graph state
//! - `supports_gpu_resident`: Check if model supports GPU-resident path

use super::{OwnedQKVWeights, OwnedQuantizedModelCuda};
use crate::error::{RealizarError, Result};

include!("weights_preload_gpu.rs");