pmetal-models 0.3.12

LLM model architectures for PMetal

pmetal-models

LLM architecture implementations with dynamic dispatch.

Overview

This crate provides implementations of popular LLM architectures optimized for Apple Silicon. It includes a dynamic dispatch system that automatically detects and loads models based on their configuration.

Supported Architectures

Dispatched Models (via DynamicModel)

These architectures are wired into the ModelArchitecture dispatcher and can be loaded automatically from config.json:

| Architecture | Family Variants |
|---|---|
| Llama | Llama 2, 3, 3.1, 3.2, 3.3 |
| Llama4 | Llama 4 Scout, Maverick |
| Qwen2 | Qwen 2, 2.5 |
| Qwen3 | Qwen 3 |
| Qwen3MoE | Qwen 3-MoE |
| Qwen3Next | Qwen 3.5 (Next) |
| DeepSeek | DeepSeek V3, V3.2, V3.2-Speciale |
| Mistral | Mistral 7B, Mixtral 8x7B (MoE) |
| Gemma | Gemma 2, 3 |
| Phi | Phi 3, 3.5 |
| Phi4 | Phi 4 |
| Cohere | Cohere Command R |
| Granite | Granite 3.0, 3.1, Hybrid MoE |
| NemotronH | NemotronH Hybrid (Mamba+Attention) |
| StarCoder2 | StarCoder2 3B, 7B, 15B |
| RecurrentGemma | RecurrentGemma Griffin |
| Jamba | Jamba 1.5 |
| Flux | Flux 1-dev, 1-schnell (diffusion) |

Architecture Modules (Not Dispatched)

These have implementations but are not wired into DynamicModel — use their types directly:

| Module | Family | Notes |
|---|---|---|
| gpt_oss | GPT-OSS | 20B, 120B MoE |
| pixtral | Pixtral | 12B vision-language |
| qwen2_vl | Qwen2-VL | 2B, 7B vision-language |
| mllama | MLlama | 3.2-Vision |
| clip | CLIP | ViT-L/14 vision encoder |
| whisper | Whisper | Base, Small, Medium, Large |
| t5 | T5 | Encoder-decoder |

Features

  • Dynamic Model Loading: Auto-detect architecture from config.json
  • Unified Generation API: Common interface for all models
  • Advanced Sampling: Temperature, top-k, top-p, repetition penalty
  • Metal-Accelerated Sampling: Fused GPU sampler kernel
  • KV Cache Management: Efficient inference with caching
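To illustrate what the sampling stages above compute (this is a plain CPU sketch of the math, not the crate's fused Metal kernel), temperature scaling followed by top-k filtering can be written as:

```rust
/// Illustrative CPU sketch of temperature scaling + top-k logit
/// filtering. Not pmetal-models' actual sampler; it only shows the math.
fn filter_logits(logits: &[f32], temperature: f32, top_k: usize) -> Vec<f32> {
    // Temperature scaling: lower temperature sharpens the distribution.
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();

    // Top-k: keep the k largest logits, mask the rest to -inf so they
    // get zero probability after softmax.
    let mut sorted = scaled.clone();
    sorted.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let threshold = sorted[top_k.min(sorted.len()) - 1];

    scaled
        .iter()
        .map(|&l| if l >= threshold { l } else { f32::NEG_INFINITY })
        .collect()
}
```

Top-p (nucleus) sampling works analogously but thresholds on cumulative probability mass rather than rank.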

Usage

```rust
use pmetal_models::{DynamicModel, GenerationConfig, generate};

// Load model with auto-detection
let model = DynamicModel::from_pretrained("unsloth/Llama-3.2-1B")?;

// Configure generation
let config = GenerationConfig::sampling(256, 0.7)
    .with_top_k(40)
    .with_top_p(0.95);

// Generate tokens
let output = generate(
    |input| model.forward(input, None),
    &input_tokens,
    config,
)?;
```

Architecture Detection

The DynamicModel automatically detects model architecture:

```rust
use pmetal_models::ModelArchitecture;

let arch = ModelArchitecture::detect("path/to/model")?;
// Returns: Llama, Qwen3, Mistral, Gemma, Phi, etc.
```
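Detection of this kind typically keys off fields such as `model_type` or `architectures` in the model's `config.json`. As a simplified, self-contained illustration (the crate's real dispatcher may inspect more fields and cover more families), the core mapping can be sketched as:

```rust
/// Hypothetical sketch: map a config.json `model_type` string to an
/// architecture name, roughly how auto-detection can work. This is not
/// pmetal-models' actual matching logic.
fn detect_architecture(model_type: &str) -> Option<&'static str> {
    match model_type {
        "llama" => Some("Llama"),
        "qwen2" => Some("Qwen2"),
        "qwen3" => Some("Qwen3"),
        "mistral" | "mixtral" => Some("Mistral"),
        "gemma" | "gemma2" => Some("Gemma"),
        "phi3" => Some("Phi"),
        _ => None, // unrecognized architectures are rejected at load time
    }
}
```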

Generation Configuration

| Parameter | Description | Default |
|---|---|---|
| max_tokens | Maximum tokens to generate | Required |
| temperature | Sampling temperature (0 = greedy) | Model default |
| top_k | Top-k sampling (0 = disabled) | Model default |
| top_p | Nucleus sampling threshold | Model default |
| repetition_penalty | Penalty for repeated tokens | 1.0 |
| stop_tokens | Tokens that stop generation | EOS |
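A `repetition_penalty` of 1.0 leaves logits unchanged. One common formulation, used by several inference stacks (the crate's exact rule may differ), divides positive logits and multiplies negative ones for tokens already present in the context:

```rust
/// Illustrative repetition-penalty application (a common formulation;
/// not necessarily pmetal-models' exact rule). `seen` holds token ids
/// that already appeared in the context.
fn apply_repetition_penalty(logits: &mut [f32], seen: &[usize], penalty: f32) {
    for &tok in seen {
        let l = logits[tok];
        // Dividing positive logits and multiplying negative ones makes
        // seen tokens less likely regardless of the logit's sign.
        logits[tok] = if l > 0.0 { l / penalty } else { l * penalty };
    }
}
```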

Modules

| Module | Description |
|---|---|
| architectures/ | Model implementations (Llama, Qwen, etc.) |
| dispatcher | Dynamic model loading and dispatch |
| generation | Token generation with sampling |
| loader | HuggingFace model loading |
| sampling/ | Sampling strategy implementations |
| traits | CausalLMModel, Quantizable traits |

License

MIT OR Apache-2.0