gigastt turns any machine into a real-time Russian speech recognition server. One binary, one command, state-of-the-art accuracy — everything runs locally.
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API: http://127.0.0.1:9876/v1/transcribe
Why gigastt?
| | gigastt | whisper.cpp | faster-whisper | Vosk | sherpa-onnx | Cloud APIs |
|---|---|---|---|---|---|---|
| Model | GigaAM v3 | Whisper large-v3 | Whisper large-v3 | Vosk models | varies | vendor |
| WER (Russian) | 11.37% | ~18% | ~18% | ~20%+ | model-dependent | 5–10% |
| Languages | Russian | 99 | 99 | 20+ | 10+ | 100+ |
| Streaming | real-time WebSocket | — | — | WebSocket + gRPC | WebSocket + TCP | varies |
| Latency (16s, M1) | ~700ms | ~4s | ~2s | ~3s | ~1.5s | network |
| Privacy | 100% local | 100% local | 100% local | 100% local | 100% local | data leaves device |
| Setup | `cargo install` | cmake + make | `pip install` | `pip install` | cmake or pip | API key + billing |
| Implementation | Rust | C/C++ | Python/C++ | C++/Java | C++ | N/A |
| Bindings | Rust, C FFI | C, Python, Go, JS… | Python | Python, Java, JS, Go… | C, Python, Java, Swift… | SDK per vendor |
| INT8 quantization | auto, 0% WER loss | GGML quant | CTranslate2 quant | — | — | N/A |
| Concurrent sessions | configurable pool | 1 | 1 | 1 | 1 | provider limits |
| Cost | free | free | free | free | free | $0.006/min+ |
Trade-off: gigastt supports Russian only. If you need multilingual recognition, consider whisper.cpp or sherpa-onnx. If you need the best Russian accuracy running locally — gigastt is the only Rust-native option built on GigaAM v3, the current SOTA for Russian ASR. Trained on 700K+ hours of Russian speech. WER measured on 9 994 Golos crowd-sourced samples (50 394 words).
Who is this for?
- Real-time voice assistants — WebSocket streaming with sub-second latency
- Call-center transcription — speaker diarization + REST batch processing
- Offline document processing — transcribe meeting recordings without cloud upload
- Privacy-first mobile apps — embed via C-ABI FFI on Android with on-device inference
- Research & ML pipelines — standalone `gigastt-core` library for Rust ML stacks
Features
- Real-time streaming — partial transcription via WebSocket as you speak
- REST API + SSE — file transcription with instant or streaming response
- Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
- INT8 quantization — 4x smaller model, 43% faster inference
- Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
- Speaker diarization — identify who said what (optional feature)
- Automatic punctuation — GigaAM v3 model produces punctuated, normalized text
- Auto-download — model fetched from HuggingFace on first run (~850 MB)
- Docker ready — CPU and CUDA images with multi-stage builds
- Hardened — connection limits, frame caps, idle timeouts, sanitized errors
Quick Start
Install & Run
# Homebrew (macOS ARM64 / Linux x86_64)
# From crates.io (requires `protoc` on PATH: `brew install protobuf` / `apt install protobuf-compiler`)
# From source
The model (~850 MB) downloads automatically on first run.
Docker
# CPU — model auto-downloads on first run (~850 MB)
# CUDA (Linux, requires NVIDIA Container Toolkit)
# Baked image — model included at build time, zero cold-start (~1.1 GB)
Transcribe a File
# CLI
# REST API
# {"text":"Привет, как дела?","words":[],"duration":3.5}
API
WebSocket — Real-time Streaming
Connect to ws://127.0.0.1:9876/v1/ws, send PCM16 audio frames, receive transcription in real time.
Client Server
| |
|-------- connect --------------> |
| |
| <------- ready ----------------- |
| {type:"ready", version:"1.0"} |
| |
|------- configure (optional) --> |
| {type:"configure", |
| sample_rate:16000} |
| |
|-------- binary PCM16 --------> |
| |
| <------- partial --------------- |
| {type:"partial", text:"привет"} |
| |
| <------- final ----------------- |
| {type:"final", |
| text:"Привет, как дела?"} |
Supported sample rates: 8, 16, 24, 44.1, 48 kHz (default 48 kHz, resampled to 16 kHz internally).
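The exchange above can be sketched without committing to a particular WebSocket library. The helpers below are illustrative and not part of gigastt: `configure_message`, `float_to_pcm16`, and `describe` are hypothetical names, the message fields are copied from the diagram, and little-endian byte order is an assumption (the protocol description only says PCM16).

```python
import json
import struct

# Sample rates listed above, in Hz.
SUPPORTED_RATES = (8000, 16000, 24000, 44100, 48000)


def configure_message(sample_rate: int) -> str:
    """Build the optional `configure` text frame."""
    if sample_rate not in SUPPORTED_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    return json.dumps({"type": "configure", "sample_rate": sample_rate})


def float_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM (little-endian assumed)."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def describe(raw: str) -> str:
    """Turn a server text frame into a printable line."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "ready":
        return f"server ready (protocol {msg.get('version')})"
    if kind in ("partial", "final"):
        return f"{kind}: {msg.get('text', '')}"
    return f"unhandled message type: {kind}"
```

A client would send `configure_message(16000)` as a text frame after `ready`, then send the output of `float_to_pcm16` as binary frames, and pass every incoming text frame to `describe`.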
REST API
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check (`{"status":"ok"}`) |
| `/ready` | GET | Readiness probe (200 when engine pool is ready) |
| `/v1/models` | GET | Model info (encoder type, pool size, capabilities) |
| `/v1/transcribe` | POST | File transcription, full JSON response |
| `/v1/transcribe/stream` | POST | File transcription with SSE streaming |
| `/v1/ws` | GET | WebSocket upgrade for real-time streaming |
| `/metrics` | GET | Prometheus metrics (enabled with `--metrics`) |
SSE streaming example:
# data: {"type":"partial","text":"привет как"}
# data: {"type":"partial","text":"привет как дела"}
# data: {"type":"final","text":"Привет, как дела?"}
Full protocol spec: docs/asyncapi.yaml
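On the client side, the `data:` lines shown in the SSE example can be decoded with a few lines of standard-library Python. This is a minimal sketch, not gigastt's own client: `parse_sse` is a hypothetical helper, it only handles `data:` fields, and it assumes each event carries one JSON object as in the example output.

```python
import json


def parse_sse(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines."""
    data_parts = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format allows one optional space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line terminates the current event.
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # Any other line (comments, `event:`, `id:`) is ignored in this sketch.
    if data_parts:
        # Design choice: tolerate a stream that ends without a trailing blank line.
        yield json.loads("\n".join(data_parts))
```

Any HTTP client that exposes the response body line by line can feed this generator; the request format for `/v1/transcribe/stream` is not reproduced here because it is defined in the protocol spec rather than in this section.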
Error Responses
| HTTP | Code | When |
|---|---|---|
| 400 | `bad_request` | Invalid audio format or malformed request |
| 413 | `payload_too_large` | File exceeds `--body-limit-bytes` (default 50 MiB) |
| 429 | `rate_limit_exceeded` | Per-IP token bucket exhausted; `Retry-After` header included |
| 503 | `pool_saturated` | All inference sessions busy; `Retry-After: 30` |
| 503 | `pool_closed` | Server is shutting down, pool closed to new checkouts |
// Example: pool saturation
HTTP/1.1 503 Service Unavailable
Retry-After: 30
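A client can use the status code and `Retry-After` header to decide how long to back off. The function below is a sketch under stated assumptions: `retry_delay` is a hypothetical helper, the 60-second cap and the exponential fallback are illustrative choices, and only the numeric-seconds form of `Retry-After` is parsed.

```python
RETRYABLE = {429, 503}  # statuses the table above pairs with backpressure


def retry_delay(status: int, headers: dict, attempt: int, cap: float = 60.0):
    """Return seconds to wait before retrying, or None if retrying will not help."""
    if status not in RETRYABLE:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("retry-after")
    if value is not None and value.strip().isdigit():
        return min(float(value.strip()), cap)
    # Header missing or not a plain number of seconds: exponential fallback.
    return min(2.0 ** attempt, cap)
```

Note that `pool_closed` also arrives as a 503 but means the server is shutting down, so a real client would probably inspect the error code in the body and stop retrying in that case.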
Client Libraries
Ready-to-use WebSocket clients in examples/:
Python
Bun (TypeScript)
Go
# go mod init gigastt-client && go get github.com/gorilla/websocket
Kotlin
# See header in KotlinClient.kt for Gradle/Maven deps
Performance
| Metric | Value |
|---|---|
| WER (Russian) | 11.37% (9 994 Golos crowd samples, 50 394 words, 95% CI [10.9%, 11.9%]) |
| INT8 vs FP32 | 0% WER degradation (11.37% vs 11.46% on 9 994 samples) |
| Latency (16s audio, M1) | ~700 ms (encoder 667 ms + decode 31 ms) |
| Memory (RSS) | ~560 MB |
| Model size | 851 MB (FP32) / 222 MB (INT8) |
| Concurrent sessions | up to 4 (configurable via --pool-size) |
Cross-ASR Comparison (9 994 samples, Golos crowd, CPU)
| Engine | Model | WER | RTF | Size |
|---|---|---|---|---|
| Vosk | vosk-model-ru-0.42 | 4.27% | 0.107x | 1.3 GB |
| gigastt | GigaAM v3 (INT8) | 11.37% | 0.335x | 230 MB |
| whisper.cpp | Large v3 | 14.96% | 1.108x | ~3 GB |
| faster-whisper | Large v3 (INT8) | 15.73% | 1.224x | ~3 GB |
Note: Vosk leads on this clean-speech subset (Golos crowd) with excellent accuracy, but requires 1.3 GB. gigastt targets streaming real-time use-cases with sub-200ms latency, hardware acceleration (CoreML/CUDA/NNAPI), and a 6× smaller model footprint. Full reproducible benchmark: benchmark/README.md. Raw results: benchmark-results-local.
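For readers unfamiliar with the two metrics in these tables, here is how they are conventionally computed. This is a generic sketch, not the benchmark's own code: it assumes WER is word-level edit distance divided by reference length and RTF is processing time divided by audio duration, and it applies no text normalization (the benchmark's normalization rules are documented in benchmark/README.md, not here).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean the engine runs faster than real time."""
    return processing_seconds / audio_seconds
```

Under this definition, processing 16 s of audio in about 0.7 s corresponds to an RTF of roughly 0.044. A corpus-level WER would sum the edit distances over all utterances and divide by the total reference word count rather than averaging per-utterance rates.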
Hardware Acceleration
| Platform | Feature flag | Execution Provider |
|---|---|---|
| macOS ARM64 (M1-M4) | `--features coreml` | CoreML + Neural Engine |
| Linux x86_64 + NVIDIA | `--features cuda` | CUDA 12+ |
| Android / ARM64 | `--features nnapi` | NNAPI (NPU/DSP) |
| Any platform | (default) | CPU |
coreml and cuda are mutually exclusive; nnapi can be combined with either.
How the CoreML path works. The Conformer encoder has a dynamic time axis, and CoreML cannot reliably execute partitions compiled with dynamic shapes — they fail at prediction time (issue #42). gigastt therefore compiles the model in MLProgram format and restricts CoreML to statically-shaped subgraphs: the heavy convolution/matmul blocks run on the Neural Engine, dynamic-shape ops stay on the CPU EP. Measured on an Apple M1 Pro (INT8 encoder, release build, median of 5 runs): ~3× faster encoder inference on a 4 s WAV (~210 ms vs ~690 ms) and ~5.6× faster on a 2-minute file (~5.5 s vs ~31 s) vs the pure-CPU build.
Automatic CPU fallback. On startup the engine runs a ~1 s silent warmup probe through the full pipeline. If CoreML fails to load or fails the probe, the engine logs a warning (falling back to CPU execution provider) and transparently rebuilds its sessions on the CPU EP — a broken CoreML stack degrades performance instead of crashing. CoreML support remains model-dependent: a future model revision may shift more (or fewer) ops onto the Neural Engine.
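The probe-then-fall-back behaviour described above follows a common pattern, sketched here in Python rather than the engine's Rust. Everything in this block is hypothetical: `build_with_fallback`, `build`, and `probe` are illustrative names, and the real engine's error handling and logging will differ.

```python
def build_with_fallback(providers, build, probe):
    """Return (name, session) for the first provider that builds and passes the probe.

    `build(name)` creates a session for one execution provider and `probe(session)`
    runs a short warmup through it; both are supplied by the caller.
    """
    failures = []
    for name in providers:
        try:
            session = build(name)
            probe(session)  # e.g. push about 1 s of silence through the pipeline
            return name, session
        except Exception as exc:  # a real engine would catch narrower error types
            failures.append(f"{name}: {exc}")
            print(f"warning: {name} failed ({exc}), trying next execution provider")
    raise RuntimeError("no execution provider available: " + "; ".join(failures))
```

Calling it with a provider list such as `["coreml", "cpu"]` returns the CPU session whenever the CoreML build or probe raises, which mirrors the "degrade instead of crash" behaviour.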
INT8 Quantization
Quantized encoder: 4x smaller, ~43% faster, 0% WER degradation (verified on 9 994 Golos samples / 50 394 words). Auto-detected at runtime.
Since v0.9.0, quantization is always compiled in and runs automatically on first download or serve, so no feature flag or manual step is needed. The quantize Cargo feature is retained as a no-op for backward compatibility.
# Automatic (recommended)
# Opt out of auto-quantization (FP32 only)
# or: GIGASTT_SKIP_QUANTIZE=1 gigastt serve
# Manual re-quantization
Project Structure
gigastt is organized as a 3-crate Cargo workspace:
| Crate | Type | Purpose |
|---|---|---|
| `gigastt-core` | lib (rlib) | Inference engine, model download, quantization, protocol types |
| `gigastt-ffi` | lib (cdylib) | C-ABI FFI layer for Android / mobile embedding |
| `gigastt` | bin | Server binary (axum HTTP/WS) + CLI |
gigastt-core has no server dependencies — embed inference in any Rust project with gigastt-core = "2.0".
Architecture
Audio Input
(PCM16, multi-rate)
|
v
+-----------------+
| Mel Spectrogram | 64 bins, FFT=320, hop=160
+-----------------+
|
v
+------------------------+
| Conformer Encoder | 16 layers, 768-dim, 240M params
| (ONNX Runtime) | CoreML | CUDA | CPU
+------------------------+
|
v
+------------------------+
| RNN-T Decoder + Joiner | Stateful: h/c persisted
| (ONNX Runtime) | across streaming chunks
+------------------------+
|
v
+------------------------+
| BPE Tokenizer | 1025 tokens
| + Auto-punctuation |
+------------------------+
|
v
Russian Text
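The front-end parameters in the diagram translate into frame timing as follows. The arithmetic uses the 16 kHz model input rate stated in this README; the frame-count formula assumes no padding at the edges, which is an assumption for illustration (the actual front end may pad or center its windows).

```python
SAMPLE_RATE = 16_000  # Hz, model input rate
N_FFT = 320           # samples per analysis window
HOP = 160             # samples between consecutive windows
N_MELS = 64           # mel bins per frame


def window_ms() -> float:
    """Duration of one analysis window in milliseconds."""
    return N_FFT * 1000 / SAMPLE_RATE


def hop_ms() -> float:
    """Time step between consecutive frames in milliseconds."""
    return HOP * 1000 / SAMPLE_RATE


def frame_count(num_samples: int) -> int:
    """Number of full windows that fit, assuming no edge padding."""
    if num_samples < N_FFT:
        return 0
    return 1 + (num_samples - N_FFT) // HOP
```

With these values each frame covers 20 ms of audio and frames advance by 10 ms, so one second of audio yields roughly 100 frames of 64 mel bins each.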
Android / FFI
gigastt can be embedded into Android applications via a C-ABI FFI layer (no HTTP server, no JNI boilerplate required).
# Build libgigastt_ffi.so for Android (arm64)
| Function | Purpose |
|---|---|
| `gigastt_engine_new(model_dir)` | Load engine (default pool_size = 4) |
| `gigastt_engine_new_with_pool_size(model_dir, pool_size)` | Load engine with custom RAM budget |
| `gigastt_transcribe_file(engine, wav_path)` | Synchronous file transcription |
| `gigastt_stream_new(engine)` | Start a real-time streaming session |
| `gigastt_stream_process_chunk(...)` | Feed PCM16 audio, get JSON segments |
| `gigastt_stream_flush(...)` | Finalize stream |
The nnapi feature on gigastt-ffi pulls in ort/nnapi for NPU/DSP acceleration on Android: cargo ndk ... build -p gigastt-ffi --features nnapi.
For pool sizing on mobile: use pool_size = 1 to stay within ~350 MB RAM.
Full integration guide: ANDROID.md
Kotlin bridge: ffi/android/GigasttBridge.kt
CLI Reference
Key flags for the most common commands. Every flag also has an environment variable — see the full CLI reference.
# Start server
# Transcribe a file
# Re-quantize encoder (native Rust, ~2 min one-time)
| Flag | Default | Description |
|---|---|---|
| `--port` | 9876 | Listen port |
| `--host` | 127.0.0.1 | Bind address (loopback-only by default) |
| `--bind-all` | — | Allow non-loopback bind |
| `--pool-size` | 4 | Concurrent inference sessions |
| `--metrics` | — | Expose Prometheus at /metrics |
| `--idle-timeout-secs` | 300 | WebSocket idle timeout |
| `--max-session-secs` | 3600 | Wall-clock session cap |
| `--rate-limit-per-minute` | 0 | Per-IP rate limit (0 = off) |
| `--skip-quantize` | — | Skip INT8 quantization on first run |
Model
GigaAM v3 e2e_rnnt by SberDevices:
| Property | Value |
|---|---|
| Architecture | RNN-T (Conformer encoder + LSTM decoder + joiner) |
| Encoder | 16-layer Conformer, 768-dim, 240M params |
| Training data | 700K+ hours of Russian speech |
| Vocabulary | 1025 BPE tokens |
| Input | 16 kHz mono PCM16 |
| Quantization | INT8 available (v0.2+) |
| License | MIT |
| Download | ~850 MB (encoder 844 MB, decoder 4.4 MB, joiner 2.6 MB) |
Requirements
| | macOS ARM64 | Linux x86_64 |
|---|---|---|
| OS | macOS 14+ (Sonoma) | Any modern distro |
| CPU | Apple Silicon (M1-M4) | x86_64 |
| GPU | (integrated, via CoreML) | NVIDIA + CUDA 12+ (optional) |
| Disk | ~1.5 GB | ~1.5 GB |
| RAM | ~560 MB | ~560 MB |
| Rust | 1.87+ | 1.87+ |
Security
- Loopback-only bind. The server refuses to listen on anything other than `127.0.0.1` / `::1` / `localhost` unless the operator explicitly passes `--bind-all` (or sets `GIGASTT_ALLOW_BIND_ANY=1`). Prevents accidental public exposure behind a reverse proxy or stray port forward.
- Cross-origin requests denied by default. A browser page at `https://evil.example.com` cannot drive-by connect to the local WebSocket / REST API. Loopback origins are always allowed; extra origins must be added via `--allow-origin https://app.example.com` (repeatable). Legacy `Access-Control-Allow-Origin: *` behaviour is opt-in via `--cors-allow-any`.
- Retry-After on backpressure. Pool saturation returns HTTP 503 with a `Retry-After: 30` header; WebSocket `error` payloads include `retry_after_ms: 30000` so clients can back off without guessing.
- WebSocket frame limit: 512 KB.
- Session pool: max 4 concurrent sessions (configurable via `--pool-size`).
- Audio buffer cap: 5 s (streaming) / 10 min (file upload).
- Internal errors sanitized: no path or model leakage to clients.
- Idle connection timeout: 300 s.
- Per-IP rate limiting (optional, off by default): `--rate-limit-per-minute N` enables a token-bucket limiter on all `/v1/*` endpoints; `/health` is exempt. Returns HTTP 429 when the bucket is exhausted. Privacy-first default: disabled.
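The per-IP token bucket mentioned in the last item can be illustrated with a short sketch. This is not gigastt's implementation: the class names are hypothetical, the burst capacity is assumed to equal the per-minute limit, and the clock is injectable only so the behaviour can be tested deterministically.

```python
import time


class TokenBucket:
    """Refills continuously at `per_minute / 60` tokens per second."""

    def __init__(self, per_minute: int, clock=time.monotonic):
        self.capacity = float(per_minute)  # assumed burst size
        self.rate = per_minute / 60.0
        self.tokens = self.capacity
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would answer with HTTP 429


class PerIpLimiter:
    """One bucket per client IP; 0 disables limiting; only /v1/* paths are limited."""

    def __init__(self, per_minute: int, clock=time.monotonic):
        self.per_minute = per_minute
        self.clock = clock
        self.buckets = {}

    def allow(self, ip: str, path: str) -> bool:
        if self.per_minute == 0 or not path.startswith("/v1/"):
            return True
        if ip not in self.buckets:
            self.buckets[ip] = TokenBucket(self.per_minute, self.clock)
        return self.buckets[ip].allow()
```

A production limiter would also evict idle buckets so the map cannot grow without bound, which this sketch omits for brevity.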
Remote deployment (TLS + reverse proxy): see docs/deployment.md.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| `protoc` not found during build | Missing Protocol Buffers compiler | `brew install protobuf` (macOS) or `apt install protobuf-compiler` (Debian/Ubuntu) |
| Model download hangs or fails | Network / HuggingFace availability | Retry `gigastt download`; check `~/.gigastt/models/` permissions |
| `Cannot quantize: FP32 encoder not found` | Partial download | Delete `~/.gigastt/models/` and re-run `gigastt download` |
| OOM on startup | Pool size too large for available RAM | Lower `--pool-size` (default 4); each session loads the full encoder |
| CoreML not used on macOS | Built without `--features coreml` | Re-build: `cargo build --release --features coreml` |
| `falling back to CPU execution provider` in logs | CoreML failed to compile or execute the model on this macOS/model combo | Transcription still works on CPU; clear `~/.gigastt/models/coreml_cache/` and retry, or file an issue with the warning text |
| CUDA not available on Linux | Built without `--features cuda` or missing CUDA 12+ | Re-build: `cargo build --release --features cuda`; verify `nvidia-smi` |
| WebSocket closes with 1008 | Session exceeded `--max-session-secs` | Increase `--max-session-secs` or send shorter streams |
| 429 Too Many Requests | Rate limiter enabled and bucket exhausted | Wait for `Retry-After` interval, or disable with `--rate-limit-per-minute 0` |
| Empty transcription for noisy audio | Input too quiet or wrong format | Ensure 16-bit PCM; normalize audio level; check supported formats |
Testing
240+ unit tests (including property-based via proptest) + 33 e2e/load/soak tests + WER benchmark + 4 cargo-fuzz targets + 3 criterion micro-benchmarks:
# E2E tests (require model, serial to avoid OOM)
# Load & soak (local only)
# Fuzzing (nightly) & micro-benchmarks (no model needed)
Cross-ASR Benchmark
A reproducible benchmark comparing gigastt against whisper.cpp, faster-whisper, and Vosk on Russian speech (Golos crowd dataset):
- Methodology & Docker: benchmark/README.md
- Self-hosted runners (CoreML / CUDA): docs/self-hosted-runner.md
- Live results: benchmark-results-local branch
Contributing
See CONTRIBUTING.md — development setup, PR guidelines, and release checklist.
License
MIT — see LICENSE
Acknowledgments
- GigaAM by SberDevices — the speech recognition model
- onnx-asr by @istupakov — ONNX model export and reference
- ONNX Runtime — inference engine
- ort — Rust bindings for ONNX Runtime