Model execution interface with a clear prefill/decode separation
This module provides the ModelExecutor trait, which replaces the "fat" Model interface. It focuses purely on tensor operations, leaving tokenization and sampling to other components.
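As a rough illustration of the prefill/decode split, here is a minimal sketch of what such a trait might look like. The struct fields, method signatures, and the toy "tensor math" below are assumptions for illustration only; the actual types in this module carry real tensors and richer metadata.

```rust
// Hypothetical sketch of a prefill/decode executor trait.
// All names and fields here are simplified assumptions, not the module's real API.

/// Input for the prefill phase: the full prompt token ids.
struct PrefillInput { token_ids: Vec<u32> }
/// Output of prefill (simplified to a single scalar "logit").
struct PrefillOutput { last_logit: f32 }
/// Input for the decode phase: one token at a time.
struct DecodeInput { token_id: u32 }
/// Output of one decode step.
struct DecodeOutput { logit: f32 }

trait ModelExecutor {
    /// Process the entire prompt once, populating internal state (e.g. a KV cache).
    fn prefill(&mut self, input: PrefillInput) -> PrefillOutput;
    /// Generate one step, reusing the state built during prefill.
    fn decode(&mut self, input: DecodeInput) -> DecodeOutput;
}

/// Toy executor whose "state" is just a running sum, standing in for tensor ops.
struct ToyExecutor { state: f32 }

impl ModelExecutor for ToyExecutor {
    fn prefill(&mut self, input: PrefillInput) -> PrefillOutput {
        self.state = input.token_ids.iter().map(|&t| t as f32).sum();
        PrefillOutput { last_logit: self.state }
    }
    fn decode(&mut self, input: DecodeInput) -> DecodeOutput {
        self.state += input.token_id as f32;
        DecodeOutput { logit: self.state }
    }
}

fn main() {
    let mut exec = ToyExecutor { state: 0.0 };
    let p = exec.prefill(PrefillInput { token_ids: vec![1, 2, 3] });
    println!("prefill logit = {}", p.last_logit); // prints 6
    let d = exec.decode(DecodeInput { token_id: 4 });
    println!("decode logit = {}", d.logit); // prints 10
}
```

The key design point is that prefill runs once over the whole prompt while decode runs repeatedly with single-token inputs, so the two phases can be scheduled and optimized independently.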
Structs
- DecodeInput - Input for decode phase (generating one token at a time)
- DecodeOutput - Output from decode phase
- ExecutorAttentionConfig - Runtime attention configuration for model executor
- ExecutorCapabilities - Executor capabilities and configuration
- ExecutorConfig - Executor configuration
- ExecutorMemoryConfig - Memory configuration for executor
- ExecutorMemoryUsage - Executor memory usage
- ExecutorMetrics - Executor performance metrics
- ExecutorStatus - Executor status information
- MemoryRequirements - Memory requirements for model execution
- OptimizationConfig - Optimization configuration
- PrefillInput - Input for prefill phase (processing the initial prompt)
- PrefillOutput - Output from prefill phase
- SpeculativeDecodeOutput - Output from speculative decoding
Enums
- AttentionType - Attention mechanism types
- ExecutorState - Executor state
- ExecutorType - Supported executor types
Traits
- BatchModelExecutor - Batch model executor for processing multiple requests efficiently
- ExecutorRegistry - Executor registry for managing multiple executors
- ModelExecutor - Core model executor trait focusing on tensor operations
- ModelExecutorFactory - Model executor factory
- SpeculativeExecutor - Speculative execution support
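To illustrate the registry idea, here is a minimal sketch of how a registry managing multiple executors could be structured. The Executor trait, CpuExecutor type, and Registry methods below are assumptions for illustration; the module's ExecutorRegistry trait will differ in its actual API.

```rust
use std::collections::HashMap;

// Hypothetical sketch of an executor registry keyed by name.
// Names and signatures are illustrative assumptions, not the real ExecutorRegistry API.

trait Executor {
    /// Human-readable identifier for this executor.
    fn name(&self) -> &str;
}

/// Stand-in executor implementation for the sketch.
struct CpuExecutor;
impl Executor for CpuExecutor {
    fn name(&self) -> &str { "cpu" }
}

/// Registry owning boxed trait objects so heterogeneous executors can coexist.
struct Registry {
    executors: HashMap<String, Box<dyn Executor>>,
}

impl Registry {
    fn new() -> Self {
        Registry { executors: HashMap::new() }
    }
    fn register(&mut self, key: &str, exec: Box<dyn Executor>) {
        self.executors.insert(key.to_string(), exec);
    }
    fn get(&self, key: &str) -> Option<&dyn Executor> {
        self.executors.get(key).map(|b| b.as_ref())
    }
}

fn main() {
    let mut registry = Registry::new();
    registry.register("cpu", Box::new(CpuExecutor));
    if let Some(exec) = registry.get("cpu") {
        println!("found executor: {}", exec.name()); // prints: found executor: cpu
    }
}
```

Storing `Box<dyn Executor>` trait objects lets one registry hold executors of different concrete types behind a single lookup interface.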