aha
Lightweight AI Inference Engine – All-in-one Solution for Text, Vision, Speech, and OCR
aha is a high-performance, cross-platform AI inference engine built with Rust and the Candle framework. It brings state-of-the-art AI models to your local machine – no API keys, no cloud dependencies, just pure, fast AI running directly on your hardware.
Changelog
v0.2.4 (2026-03-23)
- Added LFM2.5-1.2B-Instruct
- Added LFM2-1.2B
v0.2.3 (2026-03-18)
- Added DeepSeek-OCR-2
2026-03-17
- Added PaddleOCR-VL1.5 model
- Fixed a Qwen3.5 position_ids creation bug
- Added CLI parameters:
  - gguf_path: local GGUF model weight path (required when loading GGUF models)
  - mmproj_path: local path to mmproj GGUF weights (required for multimodal GGUF loading)
- Added qwen3.5-gguf to WhichModel
2026-03-16
- Added Qwen3.5 mmproj
2026-03-14
- Updated the Rust version
- Added Qwen3.5 GGUF support; the 4B model still has issues to be resolved
v0.2.2 (2026-03-07)
- Added GLM-OCR model
v0.2.1 (2026-03-05)
- Added Qwen3.5 model
Quick Start
Installation
Optional Features:
```shell
# CUDA (NVIDIA GPU acceleration)
# Metal (Apple GPU acceleration for macOS)
# Flash Attention (faster inference)
# FFmpeg (multimedia processing)
```
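If the crate publishes its CLI as a Cargo binary and follows Candle-style feature naming, enabling the optional features might look like the sketch below. The feature flag names (`cuda`, `metal`, `flash-attn`, `ffmpeg`) are assumptions; check the project's Cargo.toml for the exact names.

```shell
# Base install (CPU only)
cargo install aha

# Hypothetical feature flags, assuming Candle-style naming:
cargo install aha --features cuda        # NVIDIA GPU acceleration
cargo install aha --features metal       # Apple GPU acceleration (macOS)
cargo install aha --features flash-attn  # Flash Attention
cargo install aha --features ffmpeg      # multimedia processing
```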
CLI Quick Reference
```shell
# List all supported models
# Download model only
# Download model and start service
# Run inference directly (without starting service)
# Start service only (model already downloaded)
```
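The concrete commands were lost from the block above. A sketch of what the workflow could look like, assuming subcommand names such as `list`, `download`, `serve`, and `run` (these names are illustrative guesses; run `aha --help` for the real interface):

```shell
# Illustrative only -- verify subcommand names with `aha --help`.
aha list                          # list all supported models
aha download qwen3                # download a model
aha serve qwen3                   # download (if needed) and start the service
aha run qwen3 --prompt "Hello"    # run inference directly, no service
```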
Chat
Then use the unified (OpenAI-compatible) API:
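Since the API is OpenAI-compatible, a standard `/v1/chat/completions` request should work against the running service. The port and model name below are assumptions for illustration:

```shell
# Port and model name are assumptions; adjust to your deployment.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Hello, aha!"}]
  }'
```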
Supported Models
| Category | Models |
|---|---|
| Text | Qwen3, MiniCPM4, LFM2-1.2B, LFM2.5-1.2B-Instruct |
| Vision | Qwen2.5-VL, Qwen3-VL, Qwen3.5 |
| OCR | DeepSeek-OCR, DeepSeek-OCR-2, Hunyuan-OCR, PaddleOCR-VL, PaddleOCR-VL1.5 |
| ASR | GLM-ASR-Nano, Fun-ASR-Nano, Qwen3-ASR |
| Audio | VoxCPM, VoxCPM1.5 |
| Image | RMBG-2.0 (background removal) |
Documentation
| Document | Description |
|---|---|
| Getting Started | First steps with aha |
| Installation | Detailed installation guide |
| CLI Reference | Command-line interface |
| API Documentation | Library & REST API |
| Supported Models | Available AI models |
| Concepts | Architecture & design |
| Development | Contributing guide |
| Changelog | Version history |
Why aha?
- High-Performance Inference – Powered by the Candle framework for efficient tensor computation and model inference
- Unified Interface – One tool for text, vision, speech, and OCR
- Local-First – All processing runs locally; no data leaves your machine
- Cross-Platform – Works on Linux, macOS, and Windows
- GPU Accelerated – Optional CUDA support for faster inference
- Memory Safe – Built with Rust for reliability
- Attention Optimization – Optional Flash Attention support for optimized long-sequence processing
Development
Using aha as a Library
```shell
cargo add aha
```
```rust
// VoxCPM example -- the import paths below assume crate-root
// re-exports; check the crate documentation for exact module paths.
use aha::VoxCPMGenerate;
use aha::save_wav;
use aha::Result;
```
Extending New Models
- Create a new model file in src/models/
- Export it in src/models/mod.rs
- Add CLI inference support in src/exec/
- Add tests and examples in tests/
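As a rough sketch of steps 1 and 3, the pattern looks like the following. All struct, enum, and function names here are hypothetical placeholders, not aha's actual internal API:

```rust
// Step 1: a new model type (would live in src/models/my_model.rs).
// "MyModel" is a hypothetical placeholder.
pub struct MyModel;

impl MyModel {
    // A real model would load weights and run inference; this stub
    // just echoes the prompt to keep the sketch self-contained.
    pub fn generate(&self, prompt: &str) -> String {
        format!("echo: {prompt}")
    }
}

// Step 3: wire the model into CLI selection (would live in src/exec/).
pub enum WhichModel {
    MyModel,
}

pub fn run(which: WhichModel, prompt: &str) -> String {
    match which {
        WhichModel::MyModel => MyModel.generate(prompt),
    }
}

fn main() {
    println!("{}", run(WhichModel::MyModel, "hi"));
}
```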
Features
- High-performance inference via Candle framework
- Multi-modal model support (vision, language, speech)
- Clean, easy-to-use API design
- Minimal dependencies, compact binaries
- Flash Attention support for long sequences
- FFmpeg support for multimedia processing
License
Apache-2.0 – See LICENSE for details.
Acknowledgments
- Candle - Excellent Rust ML framework
- All model authors and contributors
