Model Runtime Abstraction Layer
Provides a unified interface for hosting multiple model formats (GGUF, ONNX, TensorRT, Safetensors, GGML, CoreML) through a trait-based runtime system.
Architecture:
- Each model format has its own runtime adapter
- All runtimes expose OpenAI-compatible HTTP API
- Maintains 1-hop architecture: Rust → HTTP → Runtime Server
- Automatic format detection from file extension
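The automatic format detection described above can be sketched as a mapping from file extension to format variant. This is an illustrative sketch only: the enum variants, extension mappings, and the `detect_format` function are assumptions and not the crate's actual `FormatDetector` API.

```rust
/// Illustrative subset of the supported model formats (assumed names,
/// not the crate's real `ModelFormat` enum).
#[derive(Debug, PartialEq)]
enum ModelFormat {
    Gguf,
    Onnx,
    TensorRt,
    Safetensors,
    CoreMl,
}

/// Guess the model format from the file extension.
/// Returns `None` when the extension is missing or unrecognized.
fn detect_format(path: &str) -> Option<ModelFormat> {
    // `rsplit('.')` yields the extension first; lowercase for a
    // case-insensitive match.
    let ext = path.rsplit('.').next()?.to_ascii_lowercase();
    match ext.as_str() {
        "gguf" => Some(ModelFormat::Gguf),
        "onnx" => Some(ModelFormat::Onnx),
        // TensorRT serialized engines commonly use these extensions
        // (an assumption; the real detector may differ).
        "engine" | "plan" => Some(ModelFormat::TensorRt),
        "safetensors" => Some(ModelFormat::Safetensors),
        "mlmodel" | "mlpackage" => Some(ModelFormat::CoreMl),
        _ => None,
    }
}
```

Keying detection on the extension keeps the dispatch cheap; a real detector might additionally check magic bytes when extensions are ambiguous.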
Re-exports
pub use runtime_trait::ModelRuntime;
pub use runtime_trait::ModelFormat;
pub use runtime_trait::RuntimeConfig;
pub use runtime_trait::InferenceRequest;
pub use runtime_trait::InferenceResponse;
pub use gguf_runtime::GGUFRuntime;
pub use onnx_runtime::ONNXRuntime;
pub use tensorrt_runtime::TensorRTRuntime;
pub use safetensors_runtime::SafetensorsRuntime;
pub use ggml_runtime::GGMLRuntime;
pub use coreml_runtime::CoreMLRuntime;
pub use format_detector::FormatDetector;
pub use platform_detector::HardwareCapabilities;
pub use runtime_manager::RuntimeManager;
Modules
- coreml_runtime - CoreML Runtime Adapter (macOS only), for Apple Silicon-optimized models
- format_detector - Format Detector
- ggml_runtime - GGML Runtime Adapter (legacy format), similar to GGUF but for older llama.cpp GGML models
- gguf_runtime - GGUF Runtime Adapter
- onnx_runtime - ONNX Runtime Adapter
- platform_detector - Platform and Hardware Detection
- runtime_manager - Runtime Manager
- runtime_trait - Core trait and types for model runtime abstraction
- safetensors_runtime - Safetensors Runtime Adapter
- tensorrt_runtime - TensorRT Runtime Adapter