Core interface definitions for the Ferrum inference framework
This crate defines all the stable trait interfaces that different components of Ferrum implement. It provides a clean abstraction layer that allows for pluggable implementations of tokenizers, model executors, schedulers, cache managers, and other core components.
The interfaces are designed following the principles outlined in the refactoring documentation:
- Single responsibility with stable boundaries
- Zero-copy and handle semantics
- Capability-discovery-driven design
- Performance-first API design
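Capability discovery means callers ask a component what it supports and branch on the answer, rather than downcasting to concrete types. A minimal sketch of the idea, using illustrative types (`Capability` and `Backend` here are simplified stand-ins, not this crate's `BackendCapabilities`/`ComputeBackend` definitions):

```rust
// Illustrative capability-discovery pattern: a backend reports what it
// supports; callers query capabilities instead of downcasting.
#[derive(Debug, PartialEq)]
enum Capability {
    FlashAttention,
    PagedKvCache,
    Int8Quantization,
}

trait Backend {
    fn name(&self) -> &str;
    fn capabilities(&self) -> Vec<Capability>;
}

struct CpuBackend;

impl Backend for CpuBackend {
    fn name(&self) -> &str {
        "cpu"
    }
    fn capabilities(&self) -> Vec<Capability> {
        vec![Capability::Int8Quantization]
    }
}

fn supports(b: &dyn Backend, cap: Capability) -> bool {
    b.capabilities().contains(&cap)
}

fn main() {
    let b = CpuBackend;
    assert!(supports(&b, Capability::Int8Quantization));
    assert!(!supports(&b, Capability::FlashAttention));
    println!("{} backend probed ok", b.name());
}
```

The trade-off versus trait downcasting is that capability sets can be inspected at startup, so unsupported configurations fail fast instead of at first use.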
Re-exports

pub use backend::BackendCapabilities;
pub use backend::ComputeBackend;
pub use backend::WeightLoader;
pub use decode_backend::DecodeBackend;
pub use engine::InferenceEngine;
pub use kv_cache::AllocationRequest;
pub use kv_cache::BlockTable;
pub use kv_cache::CacheHandleStats;
pub use kv_cache::KvCacheHandle;
pub use kv_cache::KvCacheManager;
pub use memory::DeviceMemoryManager;
pub use memory::MemoryHandle;
pub use memory::StreamHandle;
pub use model_builder::BuildOptions;
pub use model_builder::ModelBuilder;
pub use model_executor::DecodeInput;
pub use model_executor::DecodeOutput;
pub use model_executor::ModelExecutor;
pub use model_executor::PrefillInput;
pub use model_executor::PrefillOutput;
pub use sampler::LogitsProcessor;
pub use sampler::Sampler;
pub use sampler::SamplingConfig;
pub use sampler::SamplingContext;
pub use scheduler::BatchHint;
pub use scheduler::BatchPlan;
pub use scheduler::Scheduler as SchedulerInterface;
pub use tensor::TensorFactory;
pub use tensor::TensorLike;
pub use tensor::TensorOps;
pub use tensor::TensorRef;
pub use tokenizer::IncrementalTokenizer;
pub use tokenizer::Tokenizer;
pub use tokenizer::TokenizerFactory;
pub use tokenizer::TokenizerInfo;
pub use transformer::TransformerConfig;
pub use transformer::TransformerWeights;
pub use kernel_ops::ActivationOps;
pub use kernel_ops::AttentionOps;
pub use kernel_ops::AttentionParams;
pub use kernel_ops::KernelOps;
pub use kernel_ops::KernelOpsDispatch;
pub use kernel_ops::LinearOps;
pub use kernel_ops::NormOps;
pub use kernel_ops::PositionOps;
pub use kernel_ops::QuantScheme;
pub use kernel_ops::RoPEConfig;
pub use kernel_ops::SamplingOps;
pub use kernel_ops::SamplingParams as KernelSamplingParams;
Modules

- backend - Backend abstraction split into compute and weight-loading concerns
- decode_backend - Decode backend abstraction
- engine - Inference engine interface with streaming and batch support
- kernel_ops - Kernel backend abstraction layer for LLM-specific fused operations
- kv_cache - KV-cache abstraction with handle semantics and block management
- memory - Memory management interfaces for device memory operations
- model_builder - Model builder interface for constructing model executors
- model_executor - Model execution interface with clear prefill/decode separation
- sampler - Sampling and logits-processing interfaces
- scheduler - Unified scheduler interface with resource awareness and SLA support
- tensor - Tensor abstraction with zero-copy and device-aware semantics
- tokenizer - Tokenizer interface for text encoding/decoding
- transformer - Transformer model weight abstraction
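The handle semantics promised by the `kv_cache` module mean that callers hold an opaque handle plus a block table, never raw device pointers. A minimal sketch of how a paged block table translates token positions into physical blocks (the `BlockId`/`KvHandle` types below are simplified stand-ins, not the crate's actual `KvCacheHandle`/`BlockTable` API):

```rust
// Simplified paged KV-cache handle: a logical-to-physical block mapping
// plus a fixed block size, as in paged-attention-style cache managers.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BlockId(u32);

struct KvHandle {
    blocks: Vec<BlockId>, // logical block index -> physical block id
    block_size: usize,    // tokens stored per block
}

impl KvHandle {
    /// Translate a token position into (physical block, offset within block).
    /// Returns None if the position falls past the allocated blocks.
    fn locate(&self, token_pos: usize) -> Option<(BlockId, usize)> {
        let block = self.blocks.get(token_pos / self.block_size)?;
        Some((*block, token_pos % self.block_size))
    }
}

fn main() {
    let h = KvHandle {
        blocks: vec![BlockId(7), BlockId(3)],
        block_size: 16,
    };
    assert_eq!(h.locate(5), Some((BlockId(7), 5)));  // first block
    assert_eq!(h.locate(20), Some((BlockId(3), 4))); // second block
    assert_eq!(h.locate(32), None);                  // out of range
}
```

Because the handle owns only indices, the cache manager can move or swap the underlying device memory without invalidating callers, which is the point of the handle boundary.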
Structs

- BackendConfig - Backend configuration
- BatchId - Batch identifier
- ClientId - Client identifier for multi-tenancy
- ComponentHealth - Individual component health snapshot
- ComponentStatus - Aggregated component health map
- EngineConfig - Engine configuration
- EngineMetrics - Aggregated engine metrics
- EngineStatus - Engine status information
- HealthStatus - Health check status
- InferenceRequest - Inference request
- InferenceResponse - Inference response
- MemoryUsage - Memory usage statistics
- ModelId - Model identifier
- ModelInfo - Model information and metadata
- RequestId - Request identifier
- SamplingParams - Sampling parameters for generation
- SchedulerConfig - Scheduler configuration
- SchedulerStats - Scheduler statistics
- SessionId - Session identifier for stateful interactions
- SpecialTokens - Special tokens configuration
- StreamChunk - Streaming response chunk
- TaskId - Task identifier for execution tasks
- TokenId - Token identifier used across the inference pipeline
- TokenizerConfig - Tokenizer configuration
Enums

- DataType - Data type for tensors
- Device - Device type for computation
- FerrumError - Main error type for Ferrum operations
- FinishReason - Reason for completion
- ModelSource - Model loading source specification
- ModelType - Model type enumeration
- Priority - Request priority levels
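An enum like `FinishReason` is typically consumed by exhaustive matching at the end of a generation loop. A sketch of that pattern with guessed variant names (the crate's actual variants may differ):

```rust
// Illustrative FinishReason-style enum and exhaustive handling; variant
// names here are assumptions, not the crate's real definition.
#[derive(Debug, PartialEq)]
enum FinishReason {
    MaxTokens, // hit the generation length limit
    StopToken, // model emitted a stop/EOS token
    Cancelled, // request was cancelled mid-stream
}

fn describe(reason: &FinishReason) -> &'static str {
    // Exhaustive match: adding a new variant forces every caller to
    // handle it, which is why completion reasons are an enum here.
    match reason {
        FinishReason::MaxTokens => "length limit reached",
        FinishReason::StopToken => "natural stop",
        FinishReason::Cancelled => "cancelled by client",
    }
}

fn main() {
    assert_eq!(describe(&FinishReason::StopToken), "natural stop");
    assert_eq!(describe(&FinishReason::MaxTokens), "length limit reached");
}
```

Modeling completion reasons as a closed enum (rather than strings) keeps downstream logic such as retry or billing decisions exhaustive and typo-proof.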