docs.rs failed to build pmetal-models-0.1.0
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.
pmetal-models
LLM architecture implementations with dynamic dispatch.
Overview
This crate provides implementations of popular LLM architectures optimized for Apple Silicon. It includes a dynamic dispatch system that automatically detects and loads models based on their configuration.
Supported Architectures
| Family | Variants | Status |
|---|---|---|
| Llama | 2, 3, 3.1, 3.2, 3.3, 4 | Production |
| Qwen | 2, 2.5, 3, 3-MoE | Production |
| DeepSeek | V3, V3.2, V3.2-Speciale | Production |
| Mistral | 7B, 8x7B (MoE) | Production |
| Gemma | 2, 3 | Production |
| Phi | 3, 4 | Production |
| GPT-OSS | 20B, 120B | Production |
| Granite | 3.0, 3.1 | Production |
| Cohere | Command R | Production |
Vision Models
| Family | Variants | Status |
|---|---|---|
| Pixtral | 12B | Inference |
| Qwen2-VL | 2B, 7B | Inference |
| MLlama | 3.2-Vision | Inference |
Features
- Dynamic Model Loading: Auto-detect architecture from `config.json`
- Unified Generation API: Common interface for all models
- Advanced Sampling: Temperature, top-k, top-p, repetition penalty
- Metal-Accelerated Sampling: Fused GPU sampler kernel
- KV Cache Management: Efficient inference with caching
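
To make the sampling options above concrete, here is a minimal CPU-side sketch of how temperature, repetition penalty, top-k, and top-p are commonly applied to a logits vector before a token is drawn. It does not reproduce this crate's fused Metal kernel, and every function name in it is hypothetical; it only illustrates the usual filtering steps.

```rust
/// Hypothetical helpers (not part of pmetal-models) operating on raw logits.

/// Divide logits by the temperature; values below 1.0 sharpen the distribution.
/// A temperature of 0 should be handled by the caller as greedy decoding (argmax).
fn apply_temperature(logits: &mut [f32], temperature: f32) {
    for l in logits.iter_mut() {
        *l /= temperature;
    }
}

/// Penalize each previously generated token once:
/// positive logits are divided by the penalty, negative ones multiplied.
fn apply_repetition_penalty(logits: &mut [f32], history: &[usize], penalty: f32) {
    let mut seen = vec![false; logits.len()];
    for &tok in history {
        if tok < logits.len() && !seen[tok] {
            seen[tok] = true;
            let l = logits[tok];
            logits[tok] = if l > 0.0 { l / penalty } else { l * penalty };
        }
    }
}

/// Keep the k highest logits and mask the rest (k = 0 disables the filter).
/// Ties at the threshold are all kept.
fn apply_top_k(logits: &mut [f32], k: usize) {
    if k == 0 || k >= logits.len() {
        return;
    }
    let mut sorted = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a));
    let threshold = sorted[k - 1];
    for l in logits.iter_mut() {
        if *l < threshold {
            *l = f32::NEG_INFINITY;
        }
    }
}

/// Numerically stable softmax.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Nucleus filter: keep the smallest set of most likely tokens whose
/// cumulative probability reaches p, and mask everything else.
fn apply_top_p(logits: &mut [f32], p: f32) {
    let probs = softmax(logits);
    let mut order: Vec<usize> = (0..logits.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    let mut keep = vec![false; logits.len()];
    let mut cumulative = 0.0;
    for &i in &order {
        keep[i] = true;
        cumulative += probs[i];
        if cumulative >= p {
            break;
        }
    }
    for (l, kept) in logits.iter_mut().zip(keep) {
        if !kept {
            *l = f32::NEG_INFINITY;
        }
    }
}

/// Greedy choice: index of the largest logit.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best
}
```

After filtering, a sampler would draw from `softmax(&logits)`; masked entries contribute zero probability. The order of the filters is a design choice and may differ from what this crate does.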
Usage
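```rust
// Import paths and call arguments are assumed; the source shows only the method names.
use pmetal_models::{generate, DynamicModel, GenerationConfig};

// Load model with auto-detection
let model = DynamicModel::from_pretrained(model_path)?;

// Configure generation
let config = GenerationConfig::sampling(max_tokens, temperature)
    .with_top_k(top_k)
    .with_top_p(top_p);

// Generate tokens
let output = generate(&model, &input_ids, &config)?;
```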
Architecture Detection
The DynamicModel automatically detects model architecture:
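```rust
use pmetal_models::ModelArchitecture;

// The argument to `detect` is not shown in the source; a model directory path is assumed.
let arch = ModelArchitecture::detect(model_path)?;
// Returns: Llama, Qwen3, Mistral, Gemma, Phi, etc.
```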
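
As an illustration of what detection from `config.json` can look like, the sketch below reads the `model_type` field that Hugging Face configs carry and maps it to an architecture. The `Arch` enum and all three functions are hypothetical stand-ins, not this crate's `ModelArchitecture` API, and the string extraction is deliberately naive; real code would use a JSON parser such as `serde_json`.

```rust
/// Hypothetical stand-in for an architecture enum (not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Arch {
    Llama,
    Qwen3,
    Mistral,
    Gemma,
    Phi,
}

/// Naively pull the string value of `"model_type"` out of raw JSON text.
/// Assumes the value is a plain string without escaped quotes.
fn extract_model_type(config_json: &str) -> Option<&str> {
    let key = "\"model_type\"";
    let key_pos = config_json.find(key)?;
    let rest = &config_json[key_pos + key.len()..];
    let rest = rest.trim_start().strip_prefix(':')?.trim_start();
    let rest = rest.strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(&rest[..end])
}

/// Map a `model_type` string to an architecture family.
fn arch_from_model_type(model_type: &str) -> Option<Arch> {
    match model_type {
        "llama" => Some(Arch::Llama),
        "qwen3" => Some(Arch::Qwen3),
        "mistral" => Some(Arch::Mistral),
        "gemma" | "gemma2" | "gemma3" => Some(Arch::Gemma),
        "phi3" => Some(Arch::Phi),
        _ => None,
    }
}

/// Detect the architecture from the contents of a config.json file.
fn detect(config_json: &str) -> Result<Arch, String> {
    let model_type = extract_model_type(config_json)
        .ok_or_else(|| "config.json has no string `model_type` field".to_string())?;
    arch_from_model_type(model_type)
        .ok_or_else(|| format!("unsupported model_type: {model_type}"))
}
```

A loader can then `match` on the detected value to construct the concrete model and hand it back behind a shared trait object, which is one common way to implement dynamic dispatch.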
Generation Configuration
| Parameter | Description | Default |
|---|---|---|
| `max_tokens` | Maximum tokens to generate | Required |
| `temperature` | Sampling temperature (0 = greedy) | Model default |
| `top_k` | Top-k sampling (0 = disabled) | Model default |
| `top_p` | Nucleus sampling threshold | Model default |
| `repetition_penalty` | Penalty for repeated tokens | 1.0 |
| `stop_tokens` | Tokens that stop generation | EOS |
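
The interaction between `max_tokens` and `stop_tokens` can be sketched as a simple decoding loop. The `GenConfig` struct and `generate_with` function below are hypothetical and only model these two parameters; the `next_token` closure stands in for a model forward pass plus sampling.

```rust
/// Hypothetical config holding only the two loop-control parameters.
struct GenConfig {
    max_tokens: usize,
    stop_tokens: Vec<u32>,
}

/// Generate until `max_tokens` tokens are produced or a stop token appears.
/// In this sketch the stop token itself is not included in the output.
fn generate_with<F>(prompt: &[u32], config: &GenConfig, mut next_token: F) -> Vec<u32>
where
    F: FnMut(&[u32]) -> u32,
{
    let mut context = prompt.to_vec();
    let mut output = Vec::new();
    while output.len() < config.max_tokens {
        let token = next_token(&context);
        if config.stop_tokens.contains(&token) {
            break;
        }
        context.push(token);
        output.push(token);
    }
    output
}
```

Whether the stop token is returned to the caller is a design choice; this sketch drops it, and the crate may behave differently.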
Modules
| Module | Description |
|---|---|
| `architectures/` | Model implementations (Llama, Qwen, etc.) |
| `dispatcher` | Dynamic model loading and dispatch |
| `generation` | Token generation with sampling |
| `loader` | HuggingFace model loading |
| `sampling/` | Sampling strategy implementations |
| `traits` | `CausalLMModel`, `Quantizable` traits |
License
MIT OR Apache-2.0