Crate ferrum_kernels

Ferrum unified compute kernels for high-performance inference.

Provides the Backend trait and implementations for CUDA, Metal, and CPU. On CUDA builds, kernels are compiled to PTX during cargo build and loaded on demand at runtime.
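As a rough illustration of what a unified backend trait makes possible, the sketch below defines a hypothetical `Backend` trait with a single CPU implementation, plus a projection type that is generic over it. Every name and signature here (`Buffer`, `from_host`, `to_host`, `gemv`, `CpuBackend`, `DenseLinear`) is an assumption made for the example and is not taken from the real `ferrum_kernels` API.

```rust
// Hypothetical sketch only: these items are illustrative assumptions,
// not the actual ferrum_kernels::backend API.

/// A compute backend that owns a buffer type and exposes kernels on it.
pub trait Backend {
    /// Device-side storage (a plain Vec on CPU, a device allocation on GPU).
    type Buffer;

    /// Copy host data into a backend buffer.
    fn from_host(data: &[f32]) -> Self::Buffer;

    /// Copy a backend buffer back to host memory.
    fn to_host(buf: &Self::Buffer) -> Vec<f32>;

    /// Compute y = W x, with W stored row-major as [rows x cols].
    fn gemv(w: &Self::Buffer, x: &Self::Buffer, rows: usize, cols: usize) -> Self::Buffer;
}

/// Reference CPU backend: buffers are ordinary vectors.
pub struct CpuBackend;

impl Backend for CpuBackend {
    type Buffer = Vec<f32>;

    fn from_host(data: &[f32]) -> Self::Buffer {
        data.to_vec()
    }

    fn to_host(buf: &Self::Buffer) -> Vec<f32> {
        buf.clone()
    }

    fn gemv(w: &Self::Buffer, x: &Self::Buffer, rows: usize, cols: usize) -> Self::Buffer {
        assert!(cols > 0, "cols must be non-zero");
        assert_eq!(w.len(), rows * cols, "weight shape mismatch");
        assert_eq!(x.len(), cols, "input length mismatch");
        // One dot product per weight row.
        w.chunks_exact(cols)
            .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
            .collect()
    }
}

/// A weight-bearing projection written once against the trait,
/// so the same code runs on any backend that implements it.
pub struct DenseLinear<B: Backend> {
    weight: B::Buffer,
    rows: usize,
    cols: usize,
}

impl<B: Backend> DenseLinear<B> {
    pub fn new(weight: &[f32], rows: usize, cols: usize) -> Self {
        assert_eq!(weight.len(), rows * cols, "weight shape mismatch");
        Self { weight: B::from_host(weight), rows, cols }
    }

    pub fn forward(&self, x: &B::Buffer) -> B::Buffer {
        B::gemv(&self.weight, x, self.rows, self.cols)
    }
}
```

With this shape, switching from `DenseLinear::<CpuBackend>` to a GPU-backed type would only change the type parameter, which is the usual motivation for putting kernels behind a single trait.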

Re-exports§

pub use linear::Linear;
pub use stacked_expert::StackedExpertGgufLinear;
pub use marlin_expert_stack::MarlinExpertStack;

Modules§

attention: ferrum-attention: Fused flash attention and transformer for Metal, CUDA, and CPU.
backend: Unified Backend trait for CUDA, Metal, and CPU compute.
linear: Linear<B> trait — weight-bearing projection abstraction.
marlin_expert_stack: MarlinExpertStack<B> — abstraction for “N MoE experts’ Marlin GPTQ-INT4 tiles stored contiguously, dispatched as bucketed batched GEMM or vLLM fused MoE kernel”.
moe_host: Backend-agnostic MoE host-side helpers — used by all backends and across all builds (no cfg(feature = "metal") gate).
quant_linear: Concrete Linear<B> impls for quantized weights.
stacked_expert: StackedExpertGgufLinear<B> — abstraction for “N MoE experts’ GGUF quantized weights stored contiguously, dispatched as one batched MoE GEMV/GEMM kernel”.
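The "stored contiguously, dispatched as one batched kernel" idea behind the stacked-expert modules can be sketched on the CPU as follows. This is a simplified illustration under stated assumptions: weights are unquantized `f32` rather than GGUF or Marlin GPTQ-INT4 tiles, each token is routed to exactly one expert, and the names `StackedExperts`, `from_experts`, `expert`, and `batched_gemv` are hypothetical rather than part of the crate's API.

```rust
// Hypothetical sketch only: plain f32 weights stand in for quantized tiles,
// and none of these names come from ferrum_kernels.

/// N expert weight matrices, each [rows x cols] row-major, packed back to back
/// in a single allocation so one dispatch can index any expert by offset.
pub struct StackedExperts {
    data: Vec<f32>,
    num_experts: usize,
    rows: usize,
    cols: usize,
}

impl StackedExperts {
    /// Concatenate per-expert weights into one contiguous buffer.
    pub fn from_experts(experts: &[Vec<f32>], rows: usize, cols: usize) -> Self {
        assert!(rows > 0 && cols > 0, "expert shape must be non-empty");
        let mut data = Vec::with_capacity(experts.len() * rows * cols);
        for w in experts {
            assert_eq!(w.len(), rows * cols, "expert shape mismatch");
            data.extend_from_slice(w);
        }
        Self { data, num_experts: experts.len(), rows, cols }
    }

    /// Borrow expert `e`'s weights as a slice of the stacked buffer.
    pub fn expert(&self, e: usize) -> &[f32] {
        assert!(e < self.num_experts, "expert id out of range");
        let stride = self.rows * self.cols;
        &self.data[e * stride..(e + 1) * stride]
    }

    /// For each token i, compute y_i = W[expert_ids[i]] * x_i in a single pass.
    /// `xs` is [tokens x cols] row-major; the result is [tokens x rows].
    pub fn batched_gemv(&self, xs: &[f32], expert_ids: &[usize]) -> Vec<f32> {
        assert_eq!(xs.len(), expert_ids.len() * self.cols, "input shape mismatch");
        let mut out = Vec::with_capacity(expert_ids.len() * self.rows);
        for (x, &e) in xs.chunks_exact(self.cols).zip(expert_ids) {
            for row in self.expert(e).chunks_exact(self.cols) {
                out.push(row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>());
            }
        }
        out
    }
}
```

Because every expert lives at a fixed stride inside one buffer, a GPU version of `batched_gemv` could presumably pass a single base pointer plus the expert ids to one kernel launch instead of launching once per expert.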

Functions§

configure_native_profile_sink
