
lmonade-runtime

High-performance actor-based runtime for LLM inference orchestration.

Overview

This crate provides the core runtime infrastructure for Lmonade:

  • Actor-based model management using Tokio
  • Efficient request routing and scheduling
  • Resource orchestration (GPU/CPU allocation)
  • Model lifecycle management (loading, caching, eviction)
  • Concurrent inference with automatic batching (see the sketch after this list)
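
As a mental model for the batching behavior, the sketch below drains concurrent requests from a queue into a batch using Tokio channels. The QueuedRequest type, the next_batch helper, and the limits used here are illustrative only and are not part of this crate's API.

use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

// Illustrative request shape; not a type from this crate.
struct QueuedRequest {
    prompt: String,
}

// Drain the queue into a batch: flush once `max_batch` requests are collected
// or once no new request arrives within `max_wait`.
async fn next_batch(
    rx: &mut mpsc::Receiver<QueuedRequest>,
    max_batch: usize,
    max_wait: Duration,
) -> Vec<QueuedRequest> {
    let mut batch = Vec::new();

    // Wait for the first request (returns an empty batch if the channel closed).
    match rx.recv().await {
        Some(first) => batch.push(first),
        None => return batch,
    }

    // Keep adding requests until the batch is full or the queue goes quiet.
    while batch.len() < max_batch {
        match timeout(max_wait, rx.recv()).await {
            Ok(Some(req)) => batch.push(req),
            _ => break, // timed out or channel closed
        }
    }
    batch
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel(64);

    // Simulate a few concurrent callers submitting prompts.
    for i in 0..5 {
        let tx = tx.clone();
        tokio::spawn(async move {
            let _ = tx.send(QueuedRequest { prompt: format!("request {i}") }).await;
        });
    }
    drop(tx);

    let batch = next_batch(&mut rx, 8, Duration::from_millis(5)).await;
    println!("batched {} prompts", batch.len());
}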

Key Components

  • Actor System: Erlang-inspired actor model for fault tolerance (src/actor/)
  • Model Hub: Central orchestrator for multi-model serving (src/actor/model_hub.rs)
  • Resource Management: GPU allocation and memory management (src/resources/)
  • LLM Engine: Core inference engine integration (src/llm_engine.rs)
  • Model Runner: Execution layer for model inference (src/model_runner.rs)

Architecture

The runtime uses an actor-based architecture where:

  • Each model runs in its own actor with isolated state
  • The ModelHub orchestrates routing and resource allocation
  • Supervisors provide fault tolerance and recovery
  • Message passing ensures thread-safe communication (see the sketch below)
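
The sketch below illustrates this pattern with plain Tokio primitives: a per-model actor task owns its state and processes messages from a channel one at a time, replying to callers over one-shot channels. The message enum and actor loop are illustrative of the general pattern only; the crate's actual messages and hub types live in src/actor/.

use tokio::sync::{mpsc, oneshot};

// Illustrative message type; the crate's real messages live in actor::messages.
enum ModelMsg {
    Generate {
        prompt: String,
        respond_to: oneshot::Sender<String>,
    },
}

// Each model actor owns its state and handles messages sequentially,
// so the model itself never needs locking.
async fn model_actor(mut inbox: mpsc::Receiver<ModelMsg>) {
    while let Some(msg) = inbox.recv().await {
        match msg {
            ModelMsg::Generate { prompt, respond_to } => {
                // Placeholder for real inference work.
                let output = format!("echo: {prompt}");
                let _ = respond_to.send(output);
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // A hub would keep one sender per loaded model and route requests to it.
    let (tx, rx) = mpsc::channel(32);
    tokio::spawn(model_actor(rx));

    // Callers attach a one-shot channel so the actor can reply to them directly.
    let (reply_tx, reply_rx) = oneshot::channel();
    tx.send(ModelMsg::Generate {
        prompt: "Hello, world!".to_string(),
        respond_to: reply_tx,
    })
    .await
    .expect("actor is running");

    println!("{}", reply_rx.await.expect("actor replied"));
}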

Usage

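A minimal end-to-end example is shown below; the Tokio main function and boxed error type are example scaffolding, not requirements of the crate.
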
use lmonade_runtime::actor::model_hub::ModelHub;
use lmonade_runtime::actor::messages::{LoadModelRequest, GenerateRequest};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create and start the model hub with default configuration
    let hub = ModelHub::new(Default::default()).await?;

    // Load a model
    hub.load_model(LoadModelRequest {
        model_id: "tinyllama".to_string(),
        model_path: "/path/to/model".into(),
        config: Default::default(),
    }).await?;

    // Generate text
    let response = hub.generate(GenerateRequest {
        model_id: "tinyllama".to_string(),
        prompt: "Hello, world!".to_string(),
        params: Default::default(),
    }).await?;

    println!("{response:?}"); // assumes the response type implements Debug

    Ok(())
}

Documentation

For detailed documentation, see the module-level documentation for the components listed under Key Components.

Status

The runtime is under active development; basic TinyLlama inference is working. Production features such as distributed serving and advanced scheduling are in progress.