Crate ferrum_kernels

Ferrum unified compute kernels for high-performance inference.

Provides the Backend trait and implementations for CUDA, Metal, and CPU. On CUDA builds, kernels are compiled to PTX during cargo build and loaded on demand at runtime.
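
As a rough illustration of what a unified backend trait makes possible, the sketch below writes one projection routine against a trait and runs it on a CPU implementation. The names `ComputeBackend`, `CpuBackend`, and `project`, along with every method signature, are hypothetical and are not taken from ferrum_kernels; see the backend module for the real `Backend` trait.

```rust
/// Hypothetical stand-in for a unified backend trait.
/// This is NOT the crate's real `Backend`; the methods are assumptions.
trait ComputeBackend {
    /// Device-side buffer type (a plain Vec<f32> on the CPU).
    type Buffer;

    fn name(&self) -> &'static str;
    fn upload(&self, host: &[f32]) -> Self::Buffer;
    fn download(&self, buf: &Self::Buffer) -> Vec<f32>;

    /// out[m, n] = a[m, k] * b[k, n], all row-major.
    fn matmul(&self, a: &Self::Buffer, b: &Self::Buffer, m: usize, k: usize, n: usize)
        -> Self::Buffer;
}

/// Reference CPU implementation of the hypothetical trait.
struct CpuBackend;

impl ComputeBackend for CpuBackend {
    type Buffer = Vec<f32>;

    fn name(&self) -> &'static str {
        "cpu"
    }

    fn upload(&self, host: &[f32]) -> Self::Buffer {
        host.to_vec()
    }

    fn download(&self, buf: &Self::Buffer) -> Vec<f32> {
        buf.clone()
    }

    fn matmul(&self, a: &Self::Buffer, b: &Self::Buffer, m: usize, k: usize, n: usize)
        -> Self::Buffer
    {
        assert_eq!(a.len(), m * k);
        assert_eq!(b.len(), k * n);
        let mut out = vec![0.0f32; m * n];
        for i in 0..m {
            for p in 0..k {
                let a_ip = a[i * k + p];
                for j in 0..n {
                    out[i * n + j] += a_ip * b[p * n + j];
                }
            }
        }
        out
    }
}

/// Model code written once, generic over whichever backend is selected.
fn project<B: ComputeBackend>(
    backend: &B,
    x: &[f32],
    w: &[f32],
    m: usize,
    k: usize,
    n: usize,
) -> Vec<f32> {
    let xb = backend.upload(x);
    let wb = backend.upload(w);
    backend.download(&backend.matmul(&xb, &wb, m, k, n))
}
```

Calling `project(&CpuBackend, x, w, m, k, n)` runs the CPU path. In this sketch a GPU backend would only need to supply its own `Buffer` type and `matmul`, leaving `project` unchanged.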

Re-exports

pub use linear::Linear;
pub use stacked_expert::StackedExpertGgufLinear;
pub use marlin_expert_stack::MarlinExpertStack;

Modules

attention
ferrum-attention: Fused flash attention and transformer for Metal, CUDA, and CPU.
backend
Unified Backend trait for CUDA, Metal, and CPU compute.
linear
Linear<B> trait — weight-bearing projection abstraction.
marlin_expert_stack
MarlinExpertStack<B> — abstraction for “N MoE experts’ Marlin GPTQ-INT4 tiles stored contiguously, dispatched as bucketed batched GEMM or vLLM fused MoE kernel”.
moe_host
Backend-agnostic MoE host-side helpers — used by all backends and across all builds (no cfg(feature = "metal") gate).
quant_linear
Concrete Linear<B> impls for quantized weights.
stacked_expert
StackedExpertGgufLinear<B> — abstraction for “N MoE experts’ GGUF quantized weights stored contiguously, dispatched as one batched MoE GEMV/GEMM kernel”.
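
The two expert-stack modules above describe storing all experts' weights contiguously and dispatching tokens per expert. The sketch below illustrates that general idea with plain f32 weights, under stated assumptions: `StackedExperts`, `bucket_by_expert`, and `moe_forward` are hypothetical names, and the real types operate on quantized GGUF or Marlin GPTQ-INT4 data through backend kernels, not on host-side f32 slices.

```rust
/// Hypothetical f32 stand-in for a stacked expert weight tensor.
/// Layout: [num_experts, out_features, in_features], row-major, one allocation.
struct StackedExperts {
    data: Vec<f32>,
    num_experts: usize,
    out_features: usize,
    in_features: usize,
}

impl StackedExperts {
    fn new(data: Vec<f32>, num_experts: usize, out_features: usize, in_features: usize) -> Self {
        assert_eq!(data.len(), num_experts * out_features * in_features);
        Self { data, num_experts, out_features, in_features }
    }

    /// Borrow one expert's weight matrix as a slice of the shared buffer.
    fn expert(&self, e: usize) -> &[f32] {
        assert!(e < self.num_experts);
        let stride = self.out_features * self.in_features;
        &self.data[e * stride..(e + 1) * stride]
    }

    /// y = W_e * x for a single token routed to expert `e`.
    fn gemv(&self, e: usize, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.in_features);
        self.expert(e)
            .chunks_exact(self.in_features)
            .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
            .collect()
    }
}

/// Group token indices by routed expert so each expert processes one batch.
fn bucket_by_expert(expert_ids: &[usize], num_experts: usize) -> Vec<Vec<usize>> {
    let mut buckets: Vec<Vec<usize>> = vec![Vec::new(); num_experts];
    for (token, &e) in expert_ids.iter().enumerate() {
        buckets[e].push(token);
    }
    buckets
}

/// Run every token through its expert, visiting experts bucket by bucket.
fn moe_forward(
    experts: &StackedExperts,
    tokens: &[Vec<f32>],
    expert_ids: &[usize],
) -> Vec<Vec<f32>> {
    assert_eq!(tokens.len(), expert_ids.len());
    let mut out: Vec<Vec<f32>> = vec![Vec::new(); tokens.len()];
    let buckets = bucket_by_expert(expert_ids, experts.num_experts);
    for (e, bucket) in buckets.iter().enumerate() {
        for &t in bucket {
            out[t] = experts.gemv(e, &tokens[t]);
        }
    }
    out
}
```

Keeping every expert in one buffer means an expert is selected by offset arithmetic instead of a separate allocation. Bucketing lets each expert's tokens be processed together, which is where a batched kernel would replace the inner loop of this sketch.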

Functions

configure_native_profile_sink