Why mistral.rs?
- Any HuggingFace model, zero config: Just `mistralrs run -m user/model`. Auto-detects architecture, quantization, and chat template.
- True multimodality: Vision, audio, speech generation, image generation, embeddings.
- Not another model registry: Use HuggingFace models directly. No converting, no uploading to a separate service.
- Full quantization control: Choose the precise quantization you want to use, or make your own UQFF with `mistralrs quantize`.
- Built-in web UI: `mistralrs serve --ui` gives you a web interface instantly.
- Hardware-aware: `mistralrs tune` benchmarks your system and picks optimal quantization + device mapping.
- Flexible SDKs: Python package and Rust crate to build your projects.
Quick Start
Install
Linux/macOS:
curl -fsSL https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh
Windows (PowerShell):
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex
Manual installation & other platforms
Run Your First Model
# Interactive chat
mistralrs run -m user/model

# Or start a server with web UI
mistralrs serve --ui -m user/model
Then visit http://localhost:1234/ui for the web chat interface.
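The server exposes OpenAI-compatible endpoints, so any OpenAI client can talk to it. Below is a minimal sketch using the `openai` Python package; the `/v1` base path, placeholder API key, and model name are assumptions you may need to adjust for your setup.

```python
from openai import OpenAI

# Point an OpenAI-compatible client at the local mistral.rs server.
# Base path, API key, and model name below are assumptions; adjust as needed.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="user/model",  # the model you passed to `mistralrs serve`
    messages=[{"role": "user", "content": "Summarize Rust in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```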
The mistralrs CLI
The CLI is designed to be zero-config: just point it at a model and go.
- Auto-detection: Automatically detects model architecture, quantization format, and chat template
- All-in-one: Single binary for chat, server, benchmarks, and web UI (`run`, `serve`, `bench`)
- Hardware tuning: Run `mistralrs tune` to automatically benchmark and configure optimal settings for your hardware
- Format-agnostic: Works seamlessly with Hugging Face models, GGUF files, and UQFF quantizations
# Auto-tune for your hardware and emit a config file
mistralrs tune

# Run using the generated config

# Diagnose system issues (CUDA, Metal, HuggingFace connectivity)
What Makes It Fast
Performance
- CUDA with FlashAttention V2/V3, Metal, multi-GPU tensor parallelism
- PagedAttention for high throughput, prefix caching (including multimodal)
Quantization (full docs)
- In-situ quantization (ISQ) of any Hugging Face model
- GGUF (2-8 bit), GPTQ, AWQ, HQQ, FP8, BNB support
- ⭐ Per-layer topology: Fine-tune quantization per layer for optimal quality/speed
- ⭐ Auto-select fastest quant method for your hardware
Flexibility
- LoRA & X-LoRA with weight merging
- AnyMoE: Create mixture-of-experts on any base model
- Multiple models: Load/unload at runtime
Agentic Features
- Integrated tool calling with Python/Rust callbacks (see the sketch after this list)
- ⭐ Web search integration
- ⭐ MCP client: Connect to external tools automatically
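Because the server is OpenAI-compatible, tool calling can also be exercised with standard OpenAI-style requests (the Python/Rust callback interfaces are covered in the docs). A hedged sketch; the base URL, model name, and example function schema are assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# Illustrative tool schema; the model decides whether to emit a tool call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="user/model",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```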
Supported Models
- Granite 4.0
- SmolLM 3
- DeepSeek V3
- GPT-OSS
- DeepSeek V2
- Qwen 3 MoE
- Phi 3.5 MoE
- Qwen 3
- GLM 4
- GLM-4.7-Flash
- GLM-4.7 (MoE)
- Gemma 2
- Qwen 2
- Starcoder 2
- Phi 3
- Mixtral
- Phi 2
- Gemma
- Llama
- Mistral
- Qwen 3-VL
- Gemma 3n
- Llama 4
- Gemma 3
- Mistral 3
- Phi 4 multimodal
- Qwen 2.5-VL
- MiniCPM-O
- Llama 3.2 Vision
- Qwen 2-VL
- Idefics 3
- Idefics 2
- LLaVA Next
- LLaVA
- Phi 3V
- Dia
- FLUX
- Embedding Gemma
- Qwen 3 Embedding
Request a new model | Full compatibility tables
Python SDK
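A minimal sketch of the Python SDK's `Runner`-based API. Class and parameter names here (`Which.Plain`, `in_situ_quant`, `send_chat_completion_request`) follow the `mistralrs` package but may differ between releases, so treat it as illustrative and check the examples linked below.

```python
from mistralrs import ChatCompletionRequest, Runner, Which

# Load a Hugging Face model in-process; names follow the mistralrs Python
# package's Runner API and may vary between releases.
runner = Runner(
    which=Which.Plain(model_id="user/model"),
    # in_situ_quant="Q4K",  # optional ISQ; accepted values depend on the version
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="user/model",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=64,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
```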
Python SDK | Installation | Examples | Cookbook
Docker
For quick containerized deployment, Docker images are available. For production use, we recommend installing the CLI directly for maximum flexibility.
Documentation
For complete documentation, see the Documentation.
Quick Links:
- CLI Reference - All commands and options
- HTTP API - OpenAI-compatible endpoints
- Quantization - ISQ, GGUF, GPTQ, and more
- Device Mapping - Multi-GPU and CPU offloading
- MCP Integration - Using the built-in MCP client to connect to external tools
- Troubleshooting - Common issues and solutions
- Configuration - Environment variables for configuration
Contributing
Contributions welcome! Please open an issue to discuss new features or report bugs. If you want to add a new model, please contact us via an issue and we can coordinate.
Credits
This project would not be possible without the excellent work at Candle. Thank you to all contributors!
mistral.rs is not affiliated with Mistral AI.