Core interface definitions for the Ferrum inference framework
This crate carries the stable, GPU-free trait contracts shared across
the workspace: model execution, scheduling, KV cache management,
tokenization, sampling, and the lifecycle/modality engine traits.
Hardware backends live in ferrum-kernels (the Backend<B> trait
and its supertraits); only types that compile without GPU features
belong here.
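The split described above, with contracts in one crate and hardware code in another, can be sketched as follows. This is a minimal illustration and not Ferrum's actual API: `Executor`, `CpuDoubler`, and `run_with` are hypothetical names invented for the example. The point is only that a trait contract can compile and be exercised with no GPU dependency, while implementations live elsewhere.

```rust
// Hypothetical sketch: none of these names come from the Ferrum crates.

/// A hardware-neutral contract, of the kind an interface crate could define.
/// It mentions only plain Rust types, so it compiles without GPU features.
pub trait Executor {
    fn name(&self) -> &str;
    fn run(&self, input: &[f32]) -> Vec<f32>;
}

/// A CPU reference implementation. In a real workspace this would sit in a
/// separate backend crate that depends on the interface crate.
pub struct CpuDoubler;

impl Executor for CpuDoubler {
    fn name(&self) -> &str {
        "cpu-doubler"
    }

    fn run(&self, input: &[f32]) -> Vec<f32> {
        // Double every element; stands in for real model execution.
        input.iter().map(|x| x * 2.0).collect()
    }
}

/// Caller code depends on the trait only, never on a concrete backend.
pub fn run_with(exec: &dyn Executor, input: &[f32]) -> Vec<f32> {
    exec.run(input)
}

fn main() {
    let out = run_with(&CpuDoubler, &[1.0, 2.5]);
    println!("{} -> {:?}", CpuDoubler.name(), out);
}
```

Because `run_with` takes `&dyn Executor`, a GPU-backed implementation could be swapped in later without changing the caller.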
Re-exports
pub use engine::InferenceEngine;
pub use kv_cache::AllocationRequest;
pub use kv_cache::BlockTable;
pub use kv_cache::CacheHandleStats;
pub use kv_cache::KvCacheHandle;
pub use kv_cache::KvCacheManager;
pub use kv_dtype::KvBf16;
pub use kv_dtype::KvDtypeKind;
pub use kv_dtype::KvFp16;
pub use kv_dtype::KvFp8;
pub use kv_dtype::KvInt8;
pub use model_executor::DecodeInput;
pub use model_executor::DecodeOutput;
pub use model_executor::ModelExecutor;
pub use model_executor::PrefillInput;
pub use model_executor::PrefillOutput;
pub use sampler::LogitsProcessor;
pub use sampler::Sampler;
pub use sampler::SamplingConfig;
pub use sampler::SamplingContext;
pub use scheduler::BatchHint;
pub use scheduler::BatchPlan;
pub use scheduler::Scheduler as SchedulerInterface;
pub use tensor::TensorFactory;
pub use tensor::TensorLike;
pub use tensor::TensorOps;
pub use tensor::TensorRef;
pub use tokenizer::IncrementalTokenizer;
pub use tokenizer::Tokenizer;
pub use tokenizer::TokenizerFactory;
pub use tokenizer::TokenizerInfo;
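One re-export in this list is renamed: `scheduler::Scheduler` is exposed at the crate root as `SchedulerInterface`. The sketch below shows how such an aliased re-export behaves in plain Rust. Only the `pub use ... as ...` line mirrors the listing; the trait method `next_batch_size` and the `FixedCap` type are hypothetical, since this page does not show the trait's real methods.

```rust
mod scheduler {
    /// Stand-in trait. The method below is an assumption made for this
    /// example and is not the real Ferrum scheduler contract.
    pub trait Scheduler {
        fn next_batch_size(&self, pending: usize) -> usize;
    }
}

// Same shape as the re-export in the listing: the trait keeps its name
// inside the module but is visible at the root under an alias.
pub use scheduler::Scheduler as SchedulerInterface;

/// Hypothetical implementor that caps each batch at a fixed size.
pub struct FixedCap(pub usize);

// Implementing through the alias is the same as implementing
// `scheduler::Scheduler`; both paths name one trait.
impl SchedulerInterface for FixedCap {
    fn next_batch_size(&self, pending: usize) -> usize {
        pending.min(self.0)
    }
}

fn main() {
    let sched = FixedCap(8);
    println!("batch size for 20 pending: {}", sched.next_batch_size(20));
}
```

A rename like this is commonly used to keep a root-level name from colliding with another item, although the page does not state why it was chosen here.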
Modules
- engine
- Inference engine interfaces — split per modality.
- kv_cache
- KV-Cache abstraction with handle semantics and block management
- kv_dtype
- KV cache element-type markers (Dim 5 of the 5-dimension architecture).
- model_executor
- Model execution interface with clear prefill/decode separation
- sampler
- Sampling and logits processing interfaces
- scheduler
- Unified scheduler interface with resource awareness and SLA support
- tensor
- Tensor abstraction with zero-copy and device-aware semantics
- tokenizer
- Tokenizer interface for text encoding/decoding
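The `kv_dtype` module is described as holding element-type markers for the KV cache. A common Rust pattern for this is a set of zero-sized marker types that implement a shared trait, so the element type is fixed at compile time. The sketch below illustrates that general pattern under stated assumptions: `ElemKind`, `Fp16Marker`, `Int8Marker`, and `Block` are hypothetical names, and nothing here is taken from the real `kv_dtype` definitions.

```rust
use std::marker::PhantomData;

/// Hypothetical trait implemented by each marker type.
pub trait ElemKind {
    /// Size of one cache element in bytes.
    const BYTES: usize;
    /// Human-readable name of the element type.
    const NAME: &'static str;
}

/// Zero-sized markers; they carry no data, only type-level information.
pub struct Fp16Marker;
pub struct Int8Marker;

impl ElemKind for Fp16Marker {
    const BYTES: usize = 2;
    const NAME: &'static str = "fp16";
}

impl ElemKind for Int8Marker {
    const BYTES: usize = 1;
    const NAME: &'static str = "int8";
}

/// A cache block for one attention head, parameterised by its element kind.
pub struct Block<K: ElemKind> {
    tokens: usize,
    head_dim: usize,
    _kind: PhantomData<K>,
}

impl<K: ElemKind> Block<K> {
    pub fn new(tokens: usize, head_dim: usize) -> Self {
        Self { tokens, head_dim, _kind: PhantomData }
    }

    /// Bytes needed to hold keys and values (hence the factor of 2).
    pub fn bytes(&self) -> usize {
        2 * self.tokens * self.head_dim * K::BYTES
    }
}

fn main() {
    let half = Block::<Fp16Marker>::new(16, 128);
    let quant = Block::<Int8Marker>::new(16, 128);
    println!("{}: {} bytes", Fp16Marker::NAME, half.bytes());
    println!("{}: {} bytes", Int8Marker::NAME, quant.bytes());
}
```

With this pattern, mixing a block of one element kind into code that expects another is a compile-time type error and needs no runtime check.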
Structs
- BackendConfig
- Backend configuration
- BatchId
- Batch identifier
- ClientId
- Client identifier for multi-tenancy
- ComponentHealth
- Individual component health snapshot
- ComponentStatus
- Aggregated component health map
- EngineConfig
- Engine configuration
- EngineMetrics
- Aggregated engine metrics
- EngineStatus
- Engine status information
- HealthStatus
- Health check status
- InferenceRequest
- Inference request
- InferenceResponse
- Inference response
- MemoryUsage
- Memory usage statistics
- ModelId
- Model identifier
- ModelInfo
- Model information and metadata
- RequestId
- Request identifier
- SamplingParams
- Sampling parameters for generation
- SchedulerConfig
- Scheduler configuration
- SchedulerStats
- Scheduler statistics
- SessionId
- Session identifier for stateful interactions
- SpecialTokens
- Special tokens configuration
- StreamChunk
- Streaming response chunk
- TaskId
- Task identifier for execution tasks
- TokenId
- Token identifier used across the inference pipeline.
- TokenizerConfig
- Tokenizer configuration
Enums
- DataType
- Data type for tensors
- Device
- Device type for computation
- FerrumError
- Main error type for Ferrum operations
- FinishReason
- Reason for completion
- ModelSource
- Model loading source specification
- ModelType
- Model type enumeration
- Priority
- Request priority levels