aha

Lightweight AI Inference Engine — All-in-one Solution for Text, Vision, Speech, and OCR

aha is a high-performance, cross-platform AI inference engine built with Rust and the Candle framework. It brings state-of-the-art AI models to your local machine—no API keys, no cloud dependencies, just pure, fast AI running directly on your hardware.

Changelog

v0.2.0 (2026-02-05)

Added Qwen3-ASR speech recognition model

v0.1.9 (2026-01-31)

Added CLI list subcommand to show supported models
Added CLI subcommand structure support (cli, serv, download, run)
Fixed Qwen3VL thinking startswith bug
Fixed aha run multiple inputs bug

v0.1.8 (2026-01-17)

Added Qwen3 text model support
Added Fun-ASR-Nano-2512 speech recognition model
Fixed ModelScope Fun-ASR-Nano model load error
Updated audio resampling with rubato

v0.1.7 (2026-01-07)

Added GLM-ASR-Nano-2512 speech recognition model
Merged Metal (GPU) support for Apple Silicon
Added dynamic home directory and model download script

View full changelog →

Quick Start

Installation

git clone https://github.com/jhqxxx/aha.git
cd aha
cargo build --release

Optional Features:

# CUDA (NVIDIA GPU acceleration)
cargo build --release --features cuda

# Metal (Apple GPU acceleration for macOS)
cargo build --release --features metal

# Flash Attention (faster inference)
cargo build --release --features cuda,flash-attn

# FFmpeg (multimedia processing)
cargo build --release --features ffmpeg

CLI Quick Reference


# List all supported models
aha list

# Download model only
aha download -m qwen3asr-0.6b

# Download model and start service
aha -m qwen3asr-0.6b

# Run inference directly (without starting service)
aha run -m qwen3asr-0.6b -i "audio.wav"

# Start service only (model already downloaded)
aha serv -m qwen3asr-0.6b -p 10100

Chat

aha serv -m qwen3-0.6b -p 10100

Then use the unified (OpenAI-compatible) API:

curl http://localhost:10100/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-0.6b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }
'

Supported Models

Category	Models
Text	Qwen3, MiniCPM4
Vision	Qwen2.5-VL, Qwen3-VL
OCR	DeepSeek-OCR, Hunyuan-OCR, PaddleOCR-VL
ASR	GLM-ASR-Nano, Fun-ASR-Nano, Qwen3-ASR
Audio	VoxCPM, VoxCPM1.5
Image	RMBG-2.0 (background removal)

Documentation

Document	Description
Getting Started	First steps with aha
Installation	Detailed installation guide
CLI Reference	Command-line interface
API Documentation	Library & REST API
Supported Models	Available AI models
Concepts	Architecture & design
Development	Contributing guide
Changelog	Version history

Why aha?

🚀 High-Performance Inference - Powered by Candle framework for efficient tensor computation and model inference
🔧 Unified Interface — One tool for text, vision, speech, and OCR
📦 Local-First — All processing runs locally, no data leaves your machine
🎯 Cross-Platform — Works on Linux, macOS, and Windows
⚡ GPU Accelerated — Optional CUDA support for faster inference
🛡️ Memory Safe — Built with Rust for reliability
🧠 Attention Optimization - Optional Flash Attention support for optimized long sequence processing

Development

Using aha as a Library

cargo add aha

# VoxCPM example
use aha::models::voxcpm::generate::VoxCPMGenerate;
use aha::utils::audio_utils::save_wav;
use anyhow::Result;

fn main() -> Result<()> {
    let model_path = "xxx/openbmb/VoxCPM-0.5B/";

    let mut voxcpm_generate = VoxCPMGenerate::init(model_path, None, None)?;

    let generate = voxcpm_generate.generate(
        "The sun is shining bright, flowers smile at me, birds say early early early".to_string(),
        None,
        None,
        2,
        100,
        10,
        2.0,
        false,
        6.0,
    )?;

    let _ = save_wav(&generate, "voxcpm.wav")?;
    Ok(())
}

Extending New Models

Create new model file in src/models/
Export in src/models/mod.rs
Add support for CLI model inference in src/exec/
Add tests and examples in tests/

Features

High-performance inference via Candle framework
Multi-modal model support (vision, language, speech)
Clean, easy-to-use API design
Minimal dependencies, compact binaries
Flash Attention support for long sequences
FFmpeg support for multimedia processing

License

Apache-2.0 — See LICENSE for details.

Acknowledgments

Candle - Excellent Rust ML framework
All model authors and contributors

aha 0.2.0

aha