This crate is the Rust SDK for mistral.rs, providing an asynchronous interface for LLM inference.
To get started loading a model, check out the following builders:
- TextModelBuilder
- LoraModelBuilder
- XLoraModelBuilder
- GgufModelBuilder
- GgufLoraModelBuilder
- GgufXLoraModelBuilder
- VisionModelBuilder
- AnyMoeModelBuilder
For loading multiple models simultaneously, use MultiModelBuilder. The returned Model supports _with_model method variants for targeting a specific model by ID, as well as runtime model management (unload/reload).
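Purely as an illustrative sketch of that multi-model flow (the add_model, send_chat_request_with_model, unload_model, and reload_model names below are assumptions, not taken from this page; consult the MultiModelBuilder and Model documentation for the actual method names), usage might look roughly like:
use anyhow::Result;
use mistralrs::{IsqType, MultiModelBuilder, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // NOTE: hypothetical sketch. `add_model`, `send_chat_request_with_model`,
    // `unload_model`, and `reload_model` are assumed names used for illustration only.
    let models = MultiModelBuilder::new()
        .add_model(
            "phi",
            TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
                .with_isq(IsqType::Q8_0),
        )
        .add_model(
            "mistral",
            TextModelBuilder::new("mistralai/Mistral-7B-Instruct-v0.3".to_string())
                .with_isq(IsqType::Q8_0),
        )
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello! Which model am I talking to?");

    // A `_with_model` variant routes the request to a specific model by ID.
    let response = models.send_chat_request_with_model(messages, "phi").await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());

    // Runtime model management: free one model's memory, then bring it back later.
    models.unload_model("mistral")?;
    models.reload_model("mistral")?;
    Ok(())
}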
§Example
use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write a generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );
    Ok(())
}
§Streaming example
use anyhow::Result;
use mistralrs::{
    ChatCompletionChunkResponse, ChunkChoice, Delta, IsqType, PagedAttentionMetaBuilder,
    Response, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write a generic binary search function in Rust.",
        );

    let mut stream = model.stream_chat_request(messages).await?;

    while let Some(chunk) = stream.next().await {
        if let Response::Chunk(ChatCompletionChunkResponse { choices, .. }) = chunk {
            if let Some(ChunkChoice {
                delta:
                    Delta {
                        content: Some(content),
                        ..
                    },
                ..
            }) = choices.first()
            {
                print!("{}", content);
            };
        }
    }
    Ok(())
}
§MCP example
The MCP client integrates seamlessly with mistral.rs model builders:
use mistralrs::{TextModelBuilder, IsqType, McpClientConfig, McpServerConfig, McpServerSource};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let mcp_config = McpClientConfig {
        servers: vec![/* your server configs */],
        auto_register_tools: true,
        tool_timeout_secs: Some(30),
        max_concurrent_calls: Some(5),
    };

    let model = TextModelBuilder::new("path/to/model".to_string())
        .with_isq(IsqType::Q8_0)
        .with_mcp_client(mcp_config) // MCP tools automatically registered
        .build()
        .await?;

    // MCP tools are now available for automatic tool calling
    Ok(())
}
Re-exports§
pub use model_builder_trait::AnyModelBuilder;
pub use model_builder_trait::MultiModelBuilder;
pub use mistralrs_core::llguidance;
pub use schemars;
Modules§
- core: Low-level types and internals re-exported from mistralrs_core.
- model_builder_trait
- speech_utils
Structs§
- Agent: An agent that runs an agentic loop with tool calling
- AgentBuilder: Builder for creating agents with a fluent API (see the sketch after this list)
- AgentConfig: Configuration for the agentic loop
- AgentResponse: Final response from the agent
- AgentStep: Represents a single step in the agent execution
- AgentStream: Stream of agent events during execution
- AnyMoeConfig
- AnyMoeModelBuilder
- AudioInput: Raw audio input consisting of PCM samples and a sample rate.
- CalledFunction: Called function with name and arguments
- ChatCompletionChunkResponse: Chat completion streaming request chunk.
- ChatCompletionResponse: An OpenAI compatible chat completion response.
- Choice: Chat completion choice.
- ChunkChoice: Chat completion streaming chunk choice.
- CompletionResponse: An OpenAI compatible completion response.
- Delta: Delta in content for streaming response.
- DiffusionGenerationParams
- DiffusionModelBuilder: Configure a diffusion model with the various parameters for loading, running, and other inference behaviors.
- DrySamplingParams
- EmbeddingModelBuilder: Configure an embedding model with the various parameters for loading, running, and other inference behaviors.
- EmbeddingRequest: A validated embedding request constructed via EmbeddingRequestBuilder.
- EmbeddingRequestBuilder: Builder for configuring embedding requests.
- Function: Function definition for a tool
- GgufLoraModelBuilder: Wrapper of GgufModelBuilder for LoRA models.
- GgufModelBuilder: Configure a text GGUF model with the various parameters for loading, running, and other inference behaviors.
- GgufXLoraModelBuilder: Wrapper of GgufModelBuilder for X-LoRA models.
- LayerTopology
- Logprobs: Logprobs per token.
- LoraModelBuilder: Wrapper of TextModelBuilder for LoRA models.
- McpClient: MCP client that manages connections to multiple MCP servers
- McpClientConfig: Configuration for MCP client integration
- McpServerConfig: Configuration for an individual MCP server
- McpToolInfo: Information about a tool discovered from an MCP server
- MistralRs: The MistralRs struct handles sending requests to multiple engines. It is the core multi-threaded component of mistral.rs, and uses mpsc Sender and Receiver primitives to send and receive requests to the appropriate engine based on model ID.
- Model: The object used to interact with the model. This can be used with many varieties of models, and as such may be created with one of the builders listed above.
- NormalRequest: A normal request to the MistralRs.
- PagedAttentionConfig: All memory counts are in MB. The default block size is 32.
- PagedAttentionMetaBuilder: Builder for PagedAttention metadata.
- RequestBuilder: A way to add messages with finer control.
- ResponseMessage: Chat completion response message.
- SamplingParams: Sampling params are used to control sampling.
- SearchFunctionParameters
- SearchResult
- SpeculativeConfig: Metadata for a speculative pipeline
- SpeechModelBuilder: Configure a speech model with the various parameters for loading, running, and other inference behaviors.
- Tensor: The core struct for manipulating tensors.
- TextMessages: Plain text (chat) messages.
- TextModelBuilder: Configure a text model with the various parameters for loading, running, and other inference behaviors.
- TextSpeculativeBuilder
- Tool: Tool definition
- ToolCallResponse
- ToolResult: Result of a tool execution
- TopLogprob: Top-n logprobs element
- Topology
- UqffEmbeddingModelBuilder: Configure a UQFF embedding model with the various parameters for loading, running, and other inference behaviors. This wraps and implements DerefMut for the EmbeddingModelBuilder, so users should take care to not call UQFF-related methods.
- UqffTextModelBuilder: Configure a UQFF text model with the various parameters for loading, running, and other inference behaviors. This wraps and implements DerefMut for the TextModelBuilder, so users should take care to not call UQFF-related methods.
- UqffVisionModelBuilder: Configure a UQFF vision model with the various parameters for loading, running, and other inference behaviors. This wraps and implements DerefMut for the VisionModelBuilder, so users should take care to not call UQFF-related methods.
- Usage: OpenAI compatible (superset) usage during a request.
- VisionMessages: Text (chat) messages with images and/or audio.
- VisionModelBuilder: Configure a vision model with the various parameters for loading, running, and other inference behaviors.
- WebSearchOptions
- XLoraModelBuilder: Wrapper of TextModelBuilder for X-LoRA models.
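The Agent family above (Agent, AgentBuilder, AgentConfig, AgentResponse, AgentStream) drives an agentic tool-calling loop on top of a built Model. As a hypothetical sketch only (AgentBuilder::new, with_max_steps, run, and final_text are assumed names chosen for illustration, not taken from this page), usage could look roughly like:
use anyhow::Result;
use mistralrs::{AgentBuilder, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_logging()
        .build()
        .await?;

    // Hypothetical: wrap the model in an agent that can call registered tools in a loop.
    // `AgentBuilder::new`, `with_max_steps`, `run`, and `final_text` are assumed names.
    let agent = AgentBuilder::new(model)
        .with_max_steps(8) // assumed: bound the number of tool-calling iterations
        .build();

    let result = agent
        .run("List the files in the current directory and summarize them.")
        .await?;
    println!("{}", result.final_text);
    Ok(())
}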
Enums§
- AgentEvent: Events yielded during agent streaming
- AgentStopReason: Reason why the agent stopped executing
- AnyMoeExpertType
- AutoDeviceMapParams
- Constraint: Control the constraint with llguidance.
- DType: The different types of elements allowed in tensors.
- DefaultSchedulerMethod: The scheduler method controls how sequences are scheduled during each step of the engine. For a given scheduling step it is consulted unless there are only running sequences, only waiting sequences, or none at all; when it is used, it decides which waiting sequences are allowed to run.
- Device: Cpu, Cuda, or Metal
- DeviceMapSetting
- DiffusionLoaderType: The architecture to load the diffusion model as.
- EmbeddingRequestInput: An individual embedding input.
- ImageGenerationResponseFormat: Image generation response format
- IsqType
- McpServerSource: Supported MCP server transport sources
- MemoryGpuConfig
- ModelCategory: Category of the model. This can also be used to extract model-category specific tools, such as the vision model prompt prefixer.
- ModelDType: DType for the model.
- Request: A request to the Engine, encapsulating the various parameters as well as the mpsc response Sender used to return the Response.
- RequestMessage: Message or messages for a Request.
- Response: The response enum contains 3 types of variants.
- ResponseOk
- SchedulerConfig
- SearchEmbeddingModel: Embedding model used for ranking web search results internally.
- SpeechLoaderType
- StopTokens: Stop sequences or ids.
- TextMessageRole: A chat message role.
- TokenSource: The source of the HF token.
- ToolCallType
- ToolCallbackType: Unified tool callback that can be sync or async
- ToolChoice
- ToolType: Type of tool
Traits§
- CustomLogitsProcessor: Customizable logits processor.
- RequestLike: A type which can be used as a chat request.
Functions§
- best_device: Gets the best device: CPU, CUDA if compiled with CUDA, or Metal.
- cross_entropy_loss: The cross-entropy loss.
- initialize_logging: This should be called to initialize the debug flag and logging. This should not be called in mistralrs-core code due to Rust usage.
- paged_attn_supported: true if built with CUDA (requires Unix) or Metal.
- parse_isq_value: Parse ISQ value.
Type Aliases§
- AsyncToolCallback: Async tool callback type for native async tool support
- LlguidanceGrammar
- MessageContent
- Result
- SearchCallback: Callback used to override how search results are gathered. The returned vector must be sorted in decreasing order of relevance.
- ToolCallback: Callback used for custom tool functions. Receives the called function (name and JSON arguments) and returns the tool output as a string.
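Following the ToolCallback description above, a callback maps a called function (name plus JSON arguments) to a string result. A minimal sketch, assuming CalledFunction exposes name and arguments fields (an assumption based on its "name and arguments" description, not verified here):
use anyhow::Result;
use mistralrs::CalledFunction;

// Hypothetical sketch of a function usable as a tool callback. The `name` and
// `arguments` field names are assumptions based on the CalledFunction description above.
fn echo_tool(called: &CalledFunction) -> Result<String> {
    match called.name.as_str() {
        // Echo the raw JSON arguments back to the model.
        "echo" => Ok(called.arguments.to_string()),
        other => Ok(format!("unknown tool: {other}")),
    }
}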
Attribute Macros§
- tool: The #[tool] attribute macro for defining tools.
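A minimal, hypothetical sketch of applying the attribute; how the generated tool schema is registered with a model builder is not covered on this page and is treated as an assumption here:
use mistralrs::tool;

/// Look up the population of a city.
/// (Hypothetical example: the supported argument and return types, and the generated
/// registration glue, are assumptions; see the #[tool] macro documentation for the
/// actual contract.)
#[tool]
fn city_population(city: String) -> String {
    // A real implementation would query a data source; this is a stub.
    format!("No population data available for {city}.")
}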