pmetal-models 0.3.12

LLM model architectures for PMetal

pmetal-models

LLM architecture implementations with dynamic dispatch.

Overview

This crate provides implementations of popular LLM architectures optimized for Apple Silicon. It includes a dynamic dispatch system that automatically detects and loads models based on their configuration.

Supported Architectures

Dispatched Models (via DynamicModel)

These architectures are wired into the ModelArchitecture dispatcher and can be loaded automatically from config.json:

| Architecture | Family Variants |
|---|---|
| Llama | Llama 2, 3, 3.1, 3.2, 3.3 |
| Llama4 | Llama 4 Scout, Maverick |
| Qwen2 | Qwen 2, 2.5 |
| Qwen3 | Qwen 3 |
| Qwen3MoE | Qwen 3-MoE |
| Qwen3Next | Qwen 3.5 (Next) |
| DeepSeek | DeepSeek V3, V3.2, V3.2-Speciale |
| Mistral | Mistral 7B, Mixtral 8x7B (MoE) |
| Gemma | Gemma 2, 3 |
| Phi | Phi 3, 3.5 |
| Phi4 | Phi 4 |
| Cohere | Cohere Command R |
| Granite | Granite 3.0, 3.1, Hybrid MoE |
| NemotronH | NemotronH Hybrid (Mamba+Attention) |
| StarCoder2 | StarCoder2 3B, 7B, 15B |
| RecurrentGemma | RecurrentGemma Griffin |
| Jamba | Jamba 1.5 |
| Flux | Flux 1-dev, 1-schnell (diffusion) |

Architecture Modules (Not Dispatched)

These have implementations but are not wired into DynamicModel — use their types directly:

| Module | Family | Notes |
|---|---|---|
| gpt_oss | GPT-OSS | 20B, 120B MoE |
| pixtral | Pixtral | 12B vision-language |
| qwen2_vl | Qwen2-VL | 2B, 7B vision-language |
| mllama | MLlama | 3.2-Vision |
| clip | CLIP | ViT-L/14 vision encoder |
| whisper | Whisper | Base, Small, Medium, Large |
| t5 | T5 | Encoder-decoder |

Features

  • Dynamic Model Loading: Auto-detect architecture from config.json
  • Unified Generation API: Common interface for all models
  • Advanced Sampling: Temperature, top-k, top-p, repetition penalty
  • Metal-Accelerated Sampling: Fused GPU sampler kernel
  • KV Cache Management: Efficient inference with caching
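To illustrate what the sampling stages above compute (this is a plain CPU sketch of the math, not the crate's fused Metal kernel), temperature scaling followed by top-k filtering can be written as:

```rust
/// Illustrative CPU sketch of temperature scaling + top-k logit
/// filtering. Not pmetal-models' actual sampler; it only shows the math.
fn filter_logits(logits: &[f32], temperature: f32, top_k: usize) -> Vec<f32> {
    // Temperature scaling: lower temperature sharpens the distribution.
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();

    // Top-k: keep the k largest logits, mask the rest to -inf so they
    // get zero probability after softmax.
    let mut sorted = scaled.clone();
    sorted.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let threshold = sorted[top_k.min(sorted.len()) - 1];

    scaled
        .iter()
        .map(|&l| if l >= threshold { l } else { f32::NEG_INFINITY })
        .collect()
}
```

Top-p (nucleus) sampling works analogously but thresholds on cumulative probability mass rather than rank.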

Usage

```rust
use pmetal_models::{DynamicModel, GenerationConfig, generate};

// Load model with auto-detection
let model = DynamicModel::from_pretrained("unsloth/Llama-3.2-1B")?;

// Configure generation
let config = GenerationConfig::sampling(256, 0.7)
    .with_top_k(40)
    .with_top_p(0.95);

// Generate tokens
let output = generate(
    |input| model.forward(input, None),
    &input_tokens,
    config,
)?;
```

Architecture Detection

The DynamicModel automatically detects model architecture:

```rust
use pmetal_models::ModelArchitecture;

let arch = ModelArchitecture::detect("path/to/model")?;
// Returns: Llama, Qwen3, Mistral, Gemma, Phi, etc.
```
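Detection of this kind typically keys off fields such as `model_type` or `architectures` in the model's `config.json`. As a simplified, self-contained illustration (the crate's real dispatcher may inspect more fields and cover more families), the core mapping can be sketched as:

```rust
/// Hypothetical sketch: map a config.json `model_type` string to an
/// architecture name, roughly how auto-detection can work. This is not
/// pmetal-models' actual matching logic.
fn detect_architecture(model_type: &str) -> Option<&'static str> {
    match model_type {
        "llama" => Some("Llama"),
        "qwen2" => Some("Qwen2"),
        "qwen3" => Some("Qwen3"),
        "mistral" | "mixtral" => Some("Mistral"),
        "gemma" | "gemma2" => Some("Gemma"),
        "phi3" => Some("Phi"),
        _ => None, // unrecognized architectures are rejected at load time
    }
}
```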

Generation Configuration

| Parameter | Description | Default |
|---|---|---|
| max_tokens | Maximum tokens to generate | Required |
| temperature | Sampling temperature (0 = greedy) | Model default |
| top_k | Top-k sampling (0 = disabled) | Model default |
| top_p | Nucleus sampling threshold | Model default |
| repetition_penalty | Penalty for repeated tokens | 1.0 |
| stop_tokens | Tokens that stop generation | EOS |
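A `repetition_penalty` of 1.0 leaves logits unchanged. One common formulation, used by several inference stacks (the crate's exact rule may differ), divides positive logits and multiplies negative ones for tokens already present in the context:

```rust
/// Illustrative repetition-penalty application (a common formulation;
/// not necessarily pmetal-models' exact rule). `seen` holds token ids
/// that already appeared in the context.
fn apply_repetition_penalty(logits: &mut [f32], seen: &[usize], penalty: f32) {
    for &tok in seen {
        let l = logits[tok];
        // Dividing positive logits and multiplying negative ones makes
        // seen tokens less likely regardless of the logit's sign.
        logits[tok] = if l > 0.0 { l / penalty } else { l * penalty };
    }
}
```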

Modules

| Module | Description |
|---|---|
| architectures/ | Model implementations (Llama, Qwen, etc.) |
| dispatcher | Dynamic model loading and dispatch |
| generation | Token generation with sampling |
| loader | HuggingFace model loading |
| sampling/ | Sampling strategy implementations |
| traits | CausalLMModel, Quantizable traits |

License

MIT OR Apache-2.0