aha 0.2.5

aha is a model inference library. It currently supports Qwen (2.5VL/3/3VL/3.5/ASR/3Embedding/3Reranker), MiniCPM4, VoxCPM/1.5, DeepSeek-OCR/2, Hunyuan-OCR, PaddleOCR-VL/1.5, RMBG2.0, GLM (ASR-Nano-2512/OCR), Fun-ASR-Nano-2512, and LFM (2/2.5/2VL/2.5VL).

aha

Lightweight AI Inference Engine - All-in-one Solution for Text, Vision, Speech, and OCR

aha is a high-performance, cross-platform AI inference engine built with Rust and the Candle framework. It brings state-of-the-art AI models to your local machine - no API keys, no cloud dependencies, just pure, fast AI running directly on your hardware.

Supported Models

Category Models
Text Qwen3, MiniCPM4, LFM2, LFM2.5
Vision Qwen2.5-VL, Qwen3-VL, Qwen3.5, LFM2.5-VL, LFM2-VL
OCR DeepSeek-OCR, DeepSeek-OCR-2, PaddleOCR-VL, PaddleOCR-VL1.5, Hunyuan-OCR, GLM-OCR
ASR GLM-ASR-Nano, Fun-ASR-Nano, Qwen3-ASR
TTS VoxCPM, VoxCPM1.5
Image RMBG-2.0 (background removal)
Embedding Qwen3-Embedding, all-MiniLM-L6-v2
Reranker Qwen3-Reranker

Why aha?

  • 🚀 High-Performance Inference - Powered by the Candle framework for efficient tensor computation and model inference
  • 🔧 Unified Interface - One tool for text, vision, speech, and OCR
  • 📦 Local-First - All processing runs locally; no data leaves your machine
  • 🎯 Cross-Platform - Works on Linux, macOS, and Windows
  • ⚡ GPU Accelerated - Optional CUDA support for faster inference
  • 🛡️ Memory Safe - Built with Rust for reliability
  • 🧠 Attention Optimization - Optional Flash Attention support for faster long-sequence processing

Changelog

0.2.5 (2026-04-06)

  • Add Qwen3-Embedding, Qwen3-Reranker, and all-MiniLM-L6-v2

2026-04-03

  • CLI update: a subcommand must now be specified
  • Add repeat_penalty and repeat_last_n to ChatCompletionParameters
  • Add repeat-penalty handling to generate

2026-04-02

  • Refactor generate code
  • Return <think>...</think> chain-of-thought content in the reasoning_content field
  • Add timing info to chat responses

2026-04-01

  • Refactor deepseek_ocr and fun_asr_nano generate code

2026-03-31

  • Add server and cli modules
  • Replace aha model names with ModelScope IDs
  • Update WhichModel
  • Add timing info to Usage
  • Remove the aha_openai_dive and chrono dependencies

2026-03-30

  • Add LFM2.5VL-1.6B
  • Add LFM2VL-1.6B

v0.2.4 (2026-03-23)

  • Add LFM2.5-1.2B-Instruct
  • Add LFM2-1.2B

View full changelog →

Quick Start

Installation

git clone https://github.com/jhqxxx/aha.git
cd aha
cargo build --release

Optional Features:

# CUDA (NVIDIA GPU acceleration)
cargo build --release --features cuda

# Metal (Apple GPU acceleration for macOS)
cargo build --release --features metal

# Flash Attention (faster inference)
cargo build --release --features cuda,flash-attn

# FFmpeg (multimedia processing)
cargo build --release --features ffmpeg

CLI Quick Reference


# List all supported models
aha list

# Download model only
aha download -m Qwen/Qwen3-ASR-0.6B

# Download model and start service
aha cli -m Qwen/Qwen3-ASR-0.6B

# Run inference directly (without starting service)
aha run -m Qwen/Qwen3-ASR-0.6B -i "audio.wav"

# Run local all-MiniLM-L6-v2 embedding (native safetensors)
aha run -m all-minilm-l6-v2 -i "Rust embedding test" --weight-path D:\model_download\all-MiniLM-L6-v2

# Run local all-MiniLM-L6-v2 embedding (GGUF)
aha run -m all-minilm-l6-v2 -i "Rust embedding test" --artifact-format gguf --gguf-path D:\model_download\All-MiniLM-L6-v2-Embedding-GGUF --tokenizer-dir D:\model_download\all-MiniLM-L6-v2

# Run local all-MiniLM-L6-v2 embedding (ONNX)
aha run -m all-minilm-l6-v2 -i "Rust embedding test" --artifact-format onnx --onnx-path D:\model_download\all-MiniLM-L6-v2\onnx --tokenizer-dir D:\model_download\all-MiniLM-L6-v2

# Run local GLM-OCR (GGUF)
aha run -m glm-ocr -i .\assets\img\ocr_test1.png --artifact-format gguf --gguf-path D:\model_download\GLM-OCR-GGUF

# Run local GLM-OCR (ONNX)
aha run -m glm-ocr -i .\assets\img\ocr_test1.png --artifact-format onnx --onnx-path D:\model_download\GLM-OCR-ONNX --tokenizer-dir D:\model_download\GLM-OCR-ONNX

# Start service only (model already downloaded)
aha serv -m Qwen/Qwen3-ASR-0.6B -p 10100
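
The embedding commands above produce one vector per input; downstream, such vectors are usually compared with cosine similarity. A minimal std-only sketch, independent of aha's API (the toy vectors below are made up for illustration; real all-MiniLM-L6-v2 embeddings are 384-dimensional):

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy 4-dimensional "embeddings" standing in for real model output.
    let query = [0.1_f32, 0.3, 0.5, 0.2];
    let doc_a = [0.1_f32, 0.3, 0.5, 0.2]; // same direction as the query
    let doc_b = [0.9_f32, -0.1, 0.0, 0.1];
    println!("sim(query, doc_a) = {:.3}", cosine_similarity(&query, &doc_a));
    println!("sim(query, doc_b) = {:.3}", cosine_similarity(&query, &doc_b));
}
```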

Chat

aha serv -m Qwen/Qwen3-0.6B -p 10100

Then use the unified (OpenAI-compatible) API:

curl http://localhost:10100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
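
A successful call returns an OpenAI-style chat completion body. The shape below is a sketch based on the standard OpenAI response format plus the reasoning_content and timing additions mentioned in the changelog; exact field names and values may differ by aha version:

```json
{
  "id": "chatcmpl-...",
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?",
        "reasoning_content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 9, "completion_tokens": 12, "total_tokens": 21 }
}
```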

Documentation

Document Description
Getting Started First steps with aha
Installation Detailed installation guide
CLI Reference Command-line interface
API Documentation Library & REST API
Supported Models Available AI models
Concepts Architecture & design
Development Contributing guide
Changelog Version history

Development

Using aha as a Library

cargo add aha

// VoxCPM example
use aha::models::voxcpm::generate::VoxCPMGenerate;
use aha::utils::audio_utils::save_wav;
use anyhow::Result;

fn main() -> Result<()> {
    // Path to a locally downloaded VoxCPM model directory (xxx is a placeholder).
    let model_path = "xxx/openbmb/VoxCPM-0.5B/";

    let mut voxcpm_generate = VoxCPMGenerate::init(model_path, None, None)?;

    let generate = voxcpm_generate.generate(
        "The sun is shining bright, flowers smile at me, birds say early early early".to_string(),
        None,
        None,
        2,
        100,
        10,
        2.0,
        false,
        6.0,
    )?;

    let _ = save_wav(&generate, "voxcpm.wav")?;
    Ok(())
}

Adding New Models

  • Create a new model file in src/models/
  • Export it in src/models/mod.rs
  • Add CLI inference support in src/exec/
  • Add tests and examples in tests/
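
The steps above can be sketched as a skeleton module. Everything here is illustrative: the struct name, constructor, and method only mirror the init/generate shape of the VoxCPMGenerate example above and are not the actual aha API; a real implementation would load Candle weights in init and run the model's forward pass in generate.

```rust
// Hypothetical skeleton for a new model under src/models/.
pub struct MyModelGenerate {
    model_path: String,
}

impl MyModelGenerate {
    /// Validate the model path; a real init would load weights here.
    pub fn init(model_path: &str) -> Result<Self, String> {
        if model_path.is_empty() {
            return Err("model path must not be empty".to_string());
        }
        Ok(Self {
            model_path: model_path.to_string(),
        })
    }

    /// Stub: a real model would tokenize, run inference, and decode here.
    pub fn generate(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("[{}] {}", self.model_path, prompt))
    }
}

fn main() {
    let model = MyModelGenerate::init("models/my-model").expect("init failed");
    println!("{}", model.generate("hello").unwrap());
}
```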

Features

  • High-performance inference via Candle framework
  • Multi-modal model support (vision, language, speech)
  • Clean, easy-to-use API design
  • Minimal dependencies, compact binaries
  • Flash Attention support for long sequences
  • FFmpeg support for multimedia processing

License

Apache-2.0 - see LICENSE for details.

Acknowledgments

  • Candle - Excellent Rust ML framework
  • All model authors and contributors
