Ferrum unified compute kernels for high-performance inference.
Provides the Backend trait and implementations for CUDA, Metal, and CPU.
On CUDA builds, kernels are compiled to PTX during cargo build and loaded
on demand at runtime.
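One way to picture the backend-generic design described above is a trait that owns a buffer type and a small set of kernels, with weight-bearing layers written once against that trait. The following is a minimal sketch under stated assumptions: the method names `from_slice`, `to_vec`, and `gemv`, the `CpuBackend` type, and the `DenseLinear` struct are all hypothetical illustrations, not Ferrum's actual API.

```rust
// Hypothetical sketch: every name and signature here is an illustrative
// assumption, not Ferrum's real `Backend` trait.

/// A compute backend that owns its buffer type and implements the kernels.
pub trait Backend {
    type Buffer;

    /// Upload host data into a backend buffer.
    fn from_slice(data: &[f32]) -> Self::Buffer;

    /// Download a backend buffer back to the host.
    fn to_vec(buf: &Self::Buffer) -> Vec<f32>;

    /// Compute y = W x, with W stored row-major as [rows, cols].
    fn gemv(w: &Self::Buffer, x: &Self::Buffer, rows: usize, cols: usize) -> Self::Buffer;
}

/// Reference backend that keeps everything in host memory.
pub struct CpuBackend;

impl Backend for CpuBackend {
    type Buffer = Vec<f32>;

    fn from_slice(data: &[f32]) -> Self::Buffer {
        data.to_vec()
    }

    fn to_vec(buf: &Self::Buffer) -> Vec<f32> {
        buf.clone()
    }

    fn gemv(w: &Self::Buffer, x: &Self::Buffer, rows: usize, cols: usize) -> Self::Buffer {
        assert_eq!(w.len(), rows * cols, "weight shape mismatch");
        assert_eq!(x.len(), cols, "input length mismatch");
        (0..rows)
            .map(|r| {
                // Dot product of row r with the input vector.
                w[r * cols..(r + 1) * cols]
                    .iter()
                    .zip(x)
                    .map(|(a, b)| a * b)
                    .sum::<f32>()
            })
            .collect()
    }
}

/// A weight-bearing projection written once and reused for any backend.
pub struct DenseLinear<B: Backend> {
    weight: B::Buffer,
    out_features: usize,
    in_features: usize,
}

impl<B: Backend> DenseLinear<B> {
    pub fn new(weight: &[f32], out_features: usize, in_features: usize) -> Self {
        assert_eq!(weight.len(), out_features * in_features);
        Self {
            weight: B::from_slice(weight),
            out_features,
            in_features,
        }
    }

    pub fn forward(&self, x: &B::Buffer) -> B::Buffer {
        B::gemv(&self.weight, x, self.out_features, self.in_features)
    }
}
```

In this sketch, supporting another device would mean adding another `Backend` impl; `DenseLinear` itself would not change.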
Re-exports
pub use linear::Linear;
pub use stacked_expert::StackedExpertGgufLinear;
pub use marlin_expert_stack::MarlinExpertStack;
Modules
- attention
  ferrum-attention: Fused flash attention and transformer for Metal, CUDA, and CPU.
- backend
  Unified Backend trait for CUDA, Metal, and CPU compute.
- linear
  Linear<B> trait: weight-bearing projection abstraction.
- marlin_expert_stack
  MarlinExpertStack<B>: abstraction for “N MoE experts’ Marlin GPTQ-INT4 tiles stored contiguously, dispatched as bucketed batched GEMM or vLLM fused MoE kernel”.
- moe_host
  Backend-agnostic MoE host-side helpers, used by all backends and across all builds (no cfg(feature = "metal") gate).
- quant_linear
  Concrete Linear<B> impls for quantized weights.
- stacked_expert
  StackedExpertGgufLinear<B>: abstraction for “N MoE experts’ GGUF quantized weights stored contiguously, dispatched as one batched MoE GEMV/GEMM kernel”.
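
The "stored contiguously, dispatched as one batched kernel" idea behind the stacked-expert modules can be illustrated on the CPU. This is a rough sketch, assuming unquantized `f32` weights and one expert per token; the `StackedExperts` type and its `batched_gemv` method are hypothetical and say nothing about how Ferrum lays out GGUF or Marlin tiles.

```rust
// Hypothetical sketch, not Ferrum's implementation: plain f32 weights on
// the CPU stand in for the quantized, device-resident expert tiles.

/// Weights for all experts in one contiguous buffer, laid out row-major
/// as [n_experts, out_features, in_features].
pub struct StackedExperts {
    weights: Vec<f32>,
    n_experts: usize,
    out_features: usize,
    in_features: usize,
}

impl StackedExperts {
    pub fn new(
        weights: Vec<f32>,
        n_experts: usize,
        out_features: usize,
        in_features: usize,
    ) -> Self {
        assert!(in_features > 0, "in_features must be non-zero");
        assert_eq!(weights.len(), n_experts * out_features * in_features);
        Self { weights, n_experts, out_features, in_features }
    }

    /// Borrow one expert's [out_features, in_features] slab by offset;
    /// no per-expert allocation or lookup table is needed.
    fn expert(&self, e: usize) -> &[f32] {
        let stride = self.out_features * self.in_features;
        &self.weights[e * stride..(e + 1) * stride]
    }

    /// A single call covers every token. `xs` is [n_tokens, in_features],
    /// `expert_ids` names one expert per token, and the result is
    /// [n_tokens, out_features].
    pub fn batched_gemv(&self, xs: &[f32], expert_ids: &[usize]) -> Vec<f32> {
        assert_eq!(xs.len(), expert_ids.len() * self.in_features);
        let mut out = Vec::with_capacity(expert_ids.len() * self.out_features);
        for (x, &e) in xs.chunks_exact(self.in_features).zip(expert_ids) {
            assert!(e < self.n_experts, "expert id out of range");
            for row in self.expert(e).chunks_exact(self.in_features) {
                // Dot product of one weight row with this token's input.
                out.push(row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>());
            }
        }
        out
    }
}
```

Because each expert sits at a fixed stride from the start of the buffer, a GPU version of this sketch could pass a single base pointer plus the expert ids to one kernel launch rather than launching once per expert.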