Features
- Real-time streaming — partial transcription results via WebSocket as you speak
- On-device inference — no cloud APIs, no API keys, zero cost, full privacy
- 5.3% WER on Russian — GigaAM v3 e2e_rnnt, 3-4× better accuracy than Whisper-large-v3 on Russian benchmarks
- CoreML & Neural Engine — Conformer encoder optimized for Apple Silicon via CoreML acceleration
- CUDA acceleration — Linux x86_64 with NVIDIA GPU support via CUDA 12+
- Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC support for file transcription
- INT8 quantization — reduced memory footprint and faster inference
- Automatic punctuation — end-to-end model includes text normalization
- Docker ready — containerized deployment with configurable host/port binding
- Auto-download — model fetched from HuggingFace on first run (~850MB)
Quick Start
Cargo
# Listening on ws://127.0.0.1:9876/ws
Docker
# CPU image (any platform)
# CUDA image (Linux x86_64, requires NVIDIA GPU + CUDA 12+ drivers on host)
# Model auto-downloaded on first run (~850MB)
CLI Usage
Start STT Server
# Options:
# --port 9876 (default: 9876)
# --host 127.0.0.1 (default: 127.0.0.1, use 0.0.0.0 for Docker)
# --model-dir ~/.gigastt/models
By default the server binds only to the loopback address (127.0.0.1). Pass --host 0.0.0.0 (for example, inside Docker) to accept external connections.
Transcribe Audio File (Offline)
# Outputs transcribed Russian text to stdout
# Supported: WAV, M4A, MP3, OGG, FLAC (mono or auto-mixed to mono)
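As an illustration of what "auto-mixed to mono" usually means, the sketch below averages interleaved PCM16 channels into a single channel. It shows the general technique only and is not taken from the server's source; the function name `downmix_to_mono` and the little-endian sample layout are assumptions.

```python
import struct


def downmix_to_mono(pcm16: bytes, channels: int) -> bytes:
    """Average interleaved little-endian PCM16 channels into one mono channel.

    Hypothetical helper for illustration; the server's own mixing may differ.
    """
    if channels < 1:
        raise ValueError("channels must be >= 1")
    if len(pcm16) % (2 * channels) != 0:
        raise ValueError("buffer does not hold a whole number of sample frames")

    count = len(pcm16) // 2
    samples = struct.unpack(f"<{count}h", pcm16)

    # Each group of `channels` consecutive samples is one point in time.
    mono = [
        int(sum(samples[i:i + channels]) / channels)  # truncates toward zero
        for i in range(0, count, channels)
    ]
    return struct.pack(f"<{len(mono)}h", *mono)
```

Averaging keeps the result inside the 16-bit range, so no clipping step is needed.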
Download Model Only
# Downloads to ~/.gigastt/models/ (~850MB)
WebSocket API
Connection & Message Flow
Connect to ws://127.0.0.1:9876/ws and send PCM16 mono audio as binary frames. The default input sample rate is 48kHz; send a configure message to change it. The server resamples to 16kHz internally.
Client Server
│ │
├──────── connect ────────────→ │
│ │
│ ←────── Ready message ─────── │
│ {type:"ready", version:"1.0"} │
│ │
├────── binary frames ────────→ │
│ (PCM16, 48kHz) │
│ │
│ ←────── Partial results ────── │
│ {type:"partial", text:"что"} │
│ │
│ ←─────── Final result ──────── │
│ {type:"final", text:"Что?"} │
│ │
└───────── close ──────────────→ │
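The binary frames in the flow above carry raw PCM16 samples. A minimal sketch of preparing them on the client side follows. The helper names, the 20 ms frame length, and the little-endian byte order are assumptions made for illustration; the protocol description here does not state a required frame size.

```python
import struct
from typing import Iterator, Sequence


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def iter_frames(pcm16: bytes, sample_rate: int = 48000,
                frame_ms: int = 20) -> Iterator[bytes]:
    """Split a mono PCM16 buffer into fixed-duration binary frames.

    frame_ms is an assumed value, not a documented requirement.
    The last frame may be shorter than the others.
    """
    bytes_per_frame = (sample_rate * frame_ms // 1000) * 2  # 2 bytes per sample
    for offset in range(0, len(pcm16), bytes_per_frame):
        yield pcm16[offset:offset + bytes_per_frame]
```

Each yielded chunk would be sent as one binary WebSocket message after the ready message arrives.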
Message Types
Full protocol documentation in docs/asyncapi.yaml.
| Direction | Type | Fields | Notes |
|---|---|---|---|
| Server | ready | model, sample_rate, version | Sent on connection. Includes protocol v1.0. |
| Server | partial | text, timestamp, words | Interim transcription (may change with more audio) |
| Server | final | text, timestamp, words | Complete utterance with punctuation |
| Server | error | message, code | Error occurred; connection may close |
| Client | stop | — | Request finalization of buffered audio |
| Client | configure | sample_rate, diarization | Set input sample rate (8000/16000/24000/44100/48000) and optionally enable speaker diarization. Send before first audio frame. |
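The message types above can be handled with a small amount of client code. The sketch below builds a configure message and parses server messages. It assumes that client messages use the same `type` key as the server messages shown in the flow diagram and that `diarization` is a boolean; neither is confirmed here, so check docs/asyncapi.yaml for the exact schema. The names `build_configure`, `parse_server_message`, and `ServerMessage` are hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

ALLOWED_SAMPLE_RATES = (8000, 16000, 24000, 44100, 48000)
SERVER_TYPES = {"ready", "partial", "final", "error"}


def build_configure(sample_rate: int, diarization: bool = False) -> str:
    """Serialize a configure message (the JSON shape is an assumption)."""
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    return json.dumps({
        "type": "configure",
        "sample_rate": sample_rate,
        "diarization": diarization,
    })


@dataclass
class ServerMessage:
    type: str
    text: str = ""
    fields: Dict[str, Any] = field(default_factory=dict)


def parse_server_message(raw: str) -> ServerMessage:
    """Decode one JSON text message received from the server."""
    data = json.loads(raw)
    kind = data.get("type")
    if kind not in SERVER_TYPES:
        raise ValueError(f"unknown message type: {kind!r}")
    return ServerMessage(type=kind, text=data.get("text", ""), fields=data)
```

A client would typically replace its displayed text on every partial message and commit it when a final message arrives.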
Example Session
// ... send PCM16 audio at 8kHz ...
REST API
The server exposes HTTP endpoints on the same port as the WebSocket endpoint.
GET /health
Returns server status.
# {"status":"ok"}
POST /v1/transcribe
Transcribe an audio file (WAV, M4A, MP3, OGG, FLAC). Returns the full transcript when complete.
# {"text":"Что такое Node.js?","words":[],"duration":3.5}
POST /v1/transcribe/stream
Transcribe an audio file with streaming Server-Sent Events (SSE). Returns partial results as they arrive.
# data: {"type":"partial","text":"что такое"}
# data: {"type":"partial","text":"что такое Node"}
# data: {"type":"final","text":"Что такое Node.js?"}
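To consume the streaming endpoint, a client needs to split the response body into SSE events. The sketch below parses `data:` lines into JSON objects, assuming standard SSE framing (events separated by a blank line) and JSON payloads shaped like the samples above. The function name `iter_sse_events` is hypothetical, and the request itself (how the audio file is uploaded) is left out because its format is not described here.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per SSE event.

    `lines` is any iterable of text lines, e.g. a decoded HTTP response body.
    Only the `data` field is handled; comments and other fields are ignored.
    """
    data_lines: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE strips a single leading space
            data_lines.append(value)
        elif line == "" and data_lines:
            # A blank line terminates the current event.
            yield json.loads("\n".join(data_lines))
            data_lines = []
    if data_lines:
        # Flush an event that was not followed by a blank line.
        yield json.loads("\n".join(data_lines))
```

The last event with type "final" carries the complete transcript; earlier partial events can be shown as progress.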
Client Examples
See examples/ for ready-to-use WebSocket clients:
- Python: python examples/python_client.py recording.wav
- JavaScript: node examples/js_client.mjs recording.wav
Performance
Benchmarks
| Metric | v0.2 |
|---|---|
| WER (Russian) | 5.3% |
| WER vs Whisper-large-v3 | 3-4× lower |
| Latency (16s audio) | ~800ms (M1) |
| Memory | ~500MB |
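The latency row can be restated as a real-time factor (processing time divided by audio duration). The short sketch below does that arithmetic with the table's own numbers; the function name is hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures from the table: ~800 ms to process 16 s of audio on an M1.
rtf = real_time_factor(0.8, 16.0)  # about 0.05
speedup = 1.0 / rtf                # about 20x faster than real time
```

A factor below 1.0 means the model keeps up with live audio, which is what makes streaming transcription practical.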
Acceleration
- CoreML — Conformer encoder optimized via ONNX Runtime's CoreML execution provider (macOS ARM64)
- Neural Engine — INT8 quantization leverages Apple Neural Engine for 2-3× speedup (macOS ARM64)
- CUDA — ONNX Runtime CUDA execution provider for NVIDIA GPUs on Linux x86_64; falls back to CPU at runtime if no GPU is available
- Streaming — stateful decoder persists across chunks; no full-audio re-inference needed
Relative throughput: CPU < CUDA < CoreML (Apple Silicon).
Architecture
┌─────────────────────────────────────┐
│ Audio Input (PCM16, 48/16kHz) │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Mel Spectrogram (64 bins) │
│ FFT=320, hop=160, HTK │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Conformer Encoder (ONNX) │
│ 16 layers, d=768, 240M params │
│ ┌─ CoreML execution (M1/M2/M3/M4) │
│ ├─ CUDA execution (Linux x86_64) │
│ └─ INT8 quantized │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ RNN-T Decoder + Joiner (ONNX) │
│ ┌─ Stateful: h/c persisted │
│ └─ Per-chunk processing │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ BPE Tokenizer (1025 tokens) │
│ + Automatic Punctuation │
└──────────────┬──────────────────────┘
│
▼
Final Russian Text
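The mel spectrogram parameters in the diagram translate into frame timing as follows. This sketch only does the arithmetic: it assumes the FFT size of 320 is also the analysis window length and that the 16kHz internal rate applies. The padding convention (`center`) is an assumption, so exact frame counts in the real pipeline may differ by a frame or two.

```python
SAMPLE_RATE = 16_000  # internal rate after resampling
N_FFT = 320           # window length in samples (assumed equal to FFT size)
HOP = 160             # step between consecutive windows, in samples
N_MELS = 64           # mel bins per frame


def window_ms() -> float:
    """Duration of one analysis window in milliseconds."""
    return N_FFT * 1000 / SAMPLE_RATE


def hop_ms() -> float:
    """Time between consecutive frames in milliseconds."""
    return HOP * 1000 / SAMPLE_RATE


def num_frames(num_samples: int, center: bool = False) -> int:
    """Number of spectrogram frames for a signal of num_samples samples.

    center=True models padding of N_FFT // 2 samples on each side;
    center=False counts only windows that fit entirely inside the signal.
    """
    if center:
        return 1 + num_samples // HOP
    if num_samples < N_FFT:
        return 0
    return 1 + (num_samples - N_FFT) // HOP
```

With these values each frame covers 20 ms of audio and a new frame starts every 10 ms, so the encoder input has shape (64, num_frames).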
Model
GigaAM v3 e2e_rnnt — Conformer-based RNN-T ASR by SberDevices:
| Property | Value |
|---|---|
| Architecture | RNN-T (encoder + decoder + joiner) |
| Encoder | 16-layer Conformer, 768-dim, 240M params |
| Training Data | 700K+ hours of Russian speech |
| Vocabulary | 1025 BPE tokens |
| Input | 16kHz mono PCM16 |
| Quantization | INT8 (v0.2+) |
| License | MIT |
| Download Size | ~850MB (encoder 844MB, decoder 4.4MB, joiner 2.6MB) |
Requirements
| | macOS ARM64 | Linux x86_64 |
|---|---|---|
| OS | macOS 14+ (Sonoma) | Any modern Linux distro |
| CPU | Apple Silicon (M1–M4) | x86_64 |
| GPU | — | NVIDIA GPU with CUDA 12+ (optional) |
| Disk | ~1.5GB (model + binary) | ~1.5GB (model + binary) |
| RAM | ~500MB during inference | ~500MB during inference |
| Rust | 1.85+ (edition 2024) | 1.85+ (edition 2024) |
Installation
From crates.io
From source
Build & Development
# Features are mutually exclusive — do not combine coreml and cuda.
# Download model (required for integration tests, ~850MB)
License
MIT — see LICENSE
Acknowledgments
- GigaAM by SberDevices — the speech recognition model
- onnx-asr by @istupakov — ONNX model export and reference implementation
- ONNX Runtime — inference engine with CoreML & Neural Engine support
- ort — Rust bindings for ONNX Runtime