This crate is the Rust SDK for mistral.rs, providing an asynchronous interface for LLM inference.
To get started loading a model, check out the following builders:
- TextModelBuilder
- LoraModelBuilder
- XLoraModelBuilder
- GgufModelBuilder
- GgufLoraModelBuilder
- GgufXLoraModelBuilder
- VisionModelBuilder
- AnyMoeModelBuilder
For loading multiple models simultaneously, use MultiModelBuilder. The returned Model supports _with_model method variants for targeting a specific model by ID, as well as runtime model management (unload/reload).
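Purely as an illustrative sketch of that multi-model flow (the add_model, send_chat_request_with_model, unload_model, and reload_model names below are assumptions, not taken from this page; consult the MultiModelBuilder and Model documentation for the actual method names), usage might look roughly like:
use anyhow::Result;
use mistralrs::{IsqType, MultiModelBuilder, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // NOTE: hypothetical sketch. `add_model`, `send_chat_request_with_model`,
    // `unload_model`, and `reload_model` are assumed names used for illustration only.
    let models = MultiModelBuilder::new()
        .add_model(
            "phi",
            TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
                .with_isq(IsqType::Q8_0),
        )
        .add_model(
            "mistral",
            TextModelBuilder::new("mistralai/Mistral-7B-Instruct-v0.3".to_string())
                .with_isq(IsqType::Q8_0),
        )
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello! Which model am I talking to?");

    // A `_with_model` variant routes the request to a specific model by ID.
    let response = models.send_chat_request_with_model(messages, "phi").await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());

    // Runtime model management: free one model's memory, then bring it back later.
    models.unload_model("mistral")?;
    models.reload_model("mistral")?;
    Ok(())
}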
§Example
use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write a generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );
    Ok(())
}
§Streaming example
use anyhow::Result;
use mistralrs::{
    ChatCompletionChunkResponse, ChunkChoice, Delta, IsqType, PagedAttentionMetaBuilder,
    Response, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write a generic binary search function in Rust.",
        );

    let mut stream = model.stream_chat_request(messages).await?;

    while let Some(chunk) = stream.next().await {
        if let Response::Chunk(ChatCompletionChunkResponse { choices, .. }) = chunk {
            if let Some(ChunkChoice {
                delta:
                    Delta {
                        content: Some(content),
                        ..
                    },
                ..
            }) = choices.first()
            {
                print!("{}", content);
            };
        }
    }
    Ok(())
}
§MCP example
The MCP client integrates seamlessly with mistral.rs model builders:
use mistralrs::{TextModelBuilder, IsqType, McpClientConfig, McpServerConfig, McpServerSource};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let mcp_config = McpClientConfig {
        servers: vec![/* your server configs */],
        auto_register_tools: true,
        tool_timeout_secs: Some(30),
        max_concurrent_calls: Some(5),
    };

    let model = TextModelBuilder::new("path/to/model".to_string())
        .with_isq(IsqType::Q8_0)
        .with_mcp_client(mcp_config) // MCP tools automatically registered
        .build()
        .await?;

    // MCP tools are now available for automatic tool calling
    Ok(())
}
Re-exports§
pub use model_builder_trait::AnyModelBuilder;
pub use model_builder_trait::MultiModelBuilder;
pub use mistralrs_core::llguidance;
pub use schemars;
Modules§
- core: Low-level types and internals re-exported from mistralrs_core.
- model_builder_trait
- speech_utils
Structs§
- Agent: An agent that runs an agentic loop with tool calling
- AgentBuilder: Builder for creating agents with a fluent API (see the sketch after this list)
- AgentConfig: Configuration for the agentic loop
- AgentResponse: Final response from the agent
- AgentStep: Represents a single step in the agent execution
- AgentStream: Stream of agent events during execution
- AnyMoeConfig
- AnyMoeModelBuilder
- AudioInput: Raw audio input consisting of PCM samples and a sample rate.
- CalledFunction: Called function with name and arguments
- ChatCompletionChunkResponse: Chat completion streaming request chunk.
- ChatCompletionResponse: An OpenAI compatible chat completion response.
- Choice: Chat completion choice.
- ChunkChoice: Chat completion streaming chunk choice.
- CompletionResponse: An OpenAI compatible completion response.
- Delta: Delta in content for streaming response.
- DiffusionGenerationParams
- DiffusionModelBuilder: Configure a diffusion model with the various parameters for loading, running, and other inference behaviors.
- DrySamplingParams
- EmbeddingModelBuilder: Configure an embedding model with the various parameters for loading, running, and other inference behaviors.
- EmbeddingRequest: A validated embedding request constructed via EmbeddingRequestBuilder.
- EmbeddingRequestBuilder: Builder for configuring embedding requests.
- Function: Function definition for a tool
- GgufLoraModelBuilder: Wrapper of GgufModelBuilder for LoRA models.
- GgufModelBuilder: Configure a text GGUF model with the various parameters for loading, running, and other inference behaviors.
- GgufXLoraModelBuilder: Wrapper of GgufModelBuilder for X-LoRA models.
- LayerTopology
- Logprobs: Logprobs per token.
- LoraModelBuilder: Wrapper of TextModelBuilder for LoRA models.
- McpClient: MCP client that manages connections to multiple MCP servers
- McpClientConfig: Configuration for MCP client integration
- McpServerConfig: Configuration for an individual MCP server
- McpToolInfo: Information about a tool discovered from an MCP server
- MistralRs: The MistralRs struct handles sending requests to multiple engines. It is the core multi-threaded component of mistral.rs, and uses mpsc Sender and Receiver primitives to send and receive requests to the appropriate engine based on model ID.
- Model: The object used to interact with the model. This can be used with many varieties of models, and as such may be created with one of the builders listed above.
- NormalRequest: A normal request to the MistralRs.
- PagedAttentionConfig: All memory counts are in MB. The default block size is 32.
- PagedAttentionMetaBuilder: Builder for PagedAttention metadata.
- RequestBuilder: A way to add messages with finer control.
- ResponseMessage: Chat completion response message.
- SamplingParams: Sampling params are used to control sampling.
- SearchFunctionParameters
- SearchResult
- SpeculativeConfig: Metadata for a speculative pipeline
- SpeechModelBuilder: Configure a speech model with the various parameters for loading, running, and other inference behaviors.
- Tensor: The core struct for manipulating tensors.
- TextMessages: Plain text (chat) messages.
- TextModelBuilder: Configure a text model with the various parameters for loading, running, and other inference behaviors.
- TextSpeculativeBuilder
- Tool: Tool definition
- ToolCallResponse
- ToolResult: Result of a tool execution
- TopLogprob: Top-n logprobs element
- Topology
- UqffEmbeddingModelBuilder: Configure a UQFF embedding model with the various parameters for loading, running, and other inference behaviors. This wraps and implements DerefMut for the EmbeddingModelBuilder, so users should take care to not call UQFF-related methods.
- UqffTextModelBuilder: Configure a UQFF text model with the various parameters for loading, running, and other inference behaviors. This wraps and implements DerefMut for the TextModelBuilder, so users should take care to not call UQFF-related methods.
- UqffVisionModelBuilder: Configure a UQFF vision model with the various parameters for loading, running, and other inference behaviors. This wraps and implements DerefMut for the VisionModelBuilder, so users should take care to not call UQFF-related methods.
- Usage: OpenAI compatible (superset) usage during a request.
- VisionMessages: Text (chat) messages with images and/or audio.
- VisionModelBuilder: Configure a vision model with the various parameters for loading, running, and other inference behaviors.
- WebSearchOptions
- XLoraModelBuilder: Wrapper of TextModelBuilder for X-LoRA models.
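The Agent family above (Agent, AgentBuilder, AgentConfig, AgentResponse, AgentStream) drives an agentic tool-calling loop on top of a built Model. As a hypothetical sketch only (AgentBuilder::new, with_max_steps, run, and final_text are assumed names chosen for illustration, not taken from this page), usage could look roughly like:
use anyhow::Result;
use mistralrs::{AgentBuilder, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_logging()
        .build()
        .await?;

    // Hypothetical: wrap the model in an agent that can call registered tools in a loop.
    // `AgentBuilder::new`, `with_max_steps`, `run`, and `final_text` are assumed names.
    let agent = AgentBuilder::new(model)
        .with_max_steps(8) // assumed: bound the number of tool-calling iterations
        .build();

    let result = agent
        .run("List the files in the current directory and summarize them.")
        .await?;
    println!("{}", result.final_text);
    Ok(())
}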
Enums§
- AgentEvent: Events yielded during agent streaming
- AgentStopReason: Reason why the agent stopped executing
- AnyMoeExpertType
- AutoDeviceMapParams
- Constraint: Control the constraint with llguidance.
- DType: The different types of elements allowed in tensors.
- DefaultSchedulerMethod: The scheduler method controls how sequences are scheduled during each step of the engine. For a given scheduling step it is consulted unless there are only running sequences, only waiting sequences, or none at all; when it is used, it decides which waiting sequences are allowed to run.
- Device: Cpu, Cuda, or Metal
- DeviceMapSetting
- DiffusionLoaderType: The architecture to load the diffusion model as.
- EmbeddingRequestInput: An individual embedding input.
- ImageGenerationResponseFormat: Image generation response format
- IsqType
- McpServerSource: Supported MCP server transport sources
- MemoryGpuConfig
- ModelCategory: Category of the model. This can also be used to extract model-category specific tools, such as the vision model prompt prefixer.
- ModelDType: DType for the model.
- Request: A request to the Engine, encapsulating the various parameters as well as the mpsc response Sender used to return the Response.
- RequestMessage: Message or messages for a Request.
- Response: The response enum contains 3 types of variants.
- ResponseOk
- SchedulerConfig
- SearchEmbeddingModel: Embedding model used for ranking web search results internally.
- SpeechLoaderType
- StopTokens: Stop sequences or ids.
- TextMessageRole: A chat message role.
- TokenSource: The source of the HF token.
- ToolCallType
- ToolCallbackType: Unified tool callback that can be sync or async
- ToolChoice
- ToolType: Type of tool
Traits§
- CustomLogitsProcessor: Customizable logits processor.
- RequestLike: A type which can be used as a chat request.
Functions§
- best_device: Gets the best device: CPU, CUDA if compiled with CUDA, or Metal.
- cross_entropy_loss: The cross-entropy loss.
- initialize_logging: This should be called to initialize the debug flag and logging. This should not be called in mistralrs-core code due to Rust usage.
- paged_attn_supported: true if built with CUDA (requires Unix) or Metal.
- parse_isq_value: Parse ISQ value.
Type Aliases§
- AsyncToolCallback: Async tool callback type for native async tool support
- LlguidanceGrammar
- MessageContent
- Result
- SearchCallback: Callback used to override how search results are gathered. The returned vector must be sorted in decreasing order of relevance.
- ToolCallback: Callback used for custom tool functions. Receives the called function (name and JSON arguments) and returns the tool output as a string.
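Following the ToolCallback description above, a callback maps a called function (name plus JSON arguments) to a string result. A minimal sketch, assuming CalledFunction exposes name and arguments fields (an assumption based on its "name and arguments" description, not verified here):
use anyhow::Result;
use mistralrs::CalledFunction;

// Hypothetical sketch of a function usable as a tool callback. The `name` and
// `arguments` field names are assumptions based on the CalledFunction description above.
fn echo_tool(called: &CalledFunction) -> Result<String> {
    match called.name.as_str() {
        // Echo the raw JSON arguments back to the model.
        "echo" => Ok(called.arguments.to_string()),
        other => Ok(format!("unknown tool: {other}")),
    }
}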
Attribute Macros§
- tool: The #[tool] attribute macro for defining tools.
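A minimal, hypothetical sketch of applying the attribute; how the generated tool schema is registered with a model builder is not covered on this page and is treated as an assumption here:
use mistralrs::tool;

/// Look up the population of a city.
/// (Hypothetical example: the supported argument and return types, and the generated
/// registration glue, are assumptions; see the #[tool] macro documentation for the
/// actual contract.)
#[tool]
fn city_population(city: String) -> String {
    // A real implementation would query a data source; this is a stub.
    format!("No population data available for {city}.")
}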