<p align="center">
<h1 align="center">gigastt</h1>
<p align="center"><strong>On-device Russian speech recognition with 10.4% WER</strong></p>
<p align="center">Local STT server powered by GigaAM v3 — no cloud, no API keys, full privacy</p>
<p align="center">
<a href="https://github.com/ekhodzitsky/gigastt/actions"><img src="https://github.com/ekhodzitsky/gigastt/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
<a href="https://crates.io/crates/gigastt"><img src="https://img.shields.io/crates/v/gigastt.svg" alt="crates.io"></a>
<a href="https://github.com/ekhodzitsky/gigastt/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="MIT License"></a>
<a href="https://github.com/ekhodzitsky/gigastt/blob/main/CHANGELOG.md"><img src="https://img.shields.io/badge/changelog-Keep%20a%20Changelog-orange" alt="Changelog"></a>
<p align="center"><b>English</b> | <a href="README_RU.md">Русский</a></p>
</p>
<p align="center">
<sub>Latest: <b>v0.9.4</b> — dependency rollup, zero functional changes. See <a href="CHANGELOG.md">CHANGELOG</a>.</sub>
</p>
---
**gigastt** turns any machine into a real-time Russian speech recognition server. One binary, one command, state-of-the-art accuracy — everything runs locally.
```sh
cargo install gigastt && gigastt serve
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API: http://127.0.0.1:9876/v1/transcribe
```
### Demo
```sh
$ gigastt transcribe recording.wav
Привет, как дела?
$ curl -X POST http://127.0.0.1:9876/v1/transcribe \
-H "Content-Type: application/octet-stream" \
--data-binary @recording.wav
{"text":"Привет, как дела?","words":[],"duration":3.5}
```
## Why gigastt?
| **WER (Russian)** | **10.4%** | ~18% | 5-10% |
| **Latency (16s audio, M1)** | **~700ms** | ~4s | network-dependent |
| **Streaming** | real-time WebSocket | batch only | varies |
| **Privacy** | 100% local | 100% local | data leaves device |
| **Cost** | free forever | free | $0.006/min+ |
| **Setup** | `cargo install` | Python + deps | API key + billing |
| **Binary size** | single binary | Python runtime | N/A |
| **INT8 quantization** | auto, 0% WER loss | manual | N/A |
| **Concurrent sessions** | 4 (configurable) | 1 | provider limits |
> GigaAM v3 was trained on **700K+ hours** of Russian speech. It delivers better accuracy than Whisper-large-v3 on Russian benchmarks while running faster on Apple Silicon and NVIDIA GPUs. WER measured on 993 Golos crowd-sourced samples (4991 words).
## Features
- **Real-time streaming** — partial transcription via WebSocket as you speak
- **REST API + SSE** — file transcription with instant or streaming response
- **Hardware acceleration** — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
- **INT8 quantization** — 4x smaller model, 43% faster inference
- **Multi-format audio** — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
- **Speaker diarization** — identify who said what (optional feature)
- **Automatic punctuation** — GigaAM v3 model produces punctuated, normalized text
- **Auto-download** — model fetched from HuggingFace on first run (~850 MB)
- **Docker ready** — CPU and CUDA images with multi-stage builds
- **Hardened** — connection limits, frame caps, idle timeouts, sanitized errors
## Quick Start
### Install & Run
```sh
# Homebrew (macOS ARM64 / Linux x86_64)
brew tap ekhodzitsky/gigastt https://github.com/ekhodzitsky/gigastt
brew install gigastt
gigastt serve
# From crates.io (requires `protoc` on PATH: `brew install protobuf` / `apt install protobuf-compiler`)
cargo install gigastt
gigastt serve
# From source
git clone https://github.com/ekhodzitsky/gigastt
cd gigastt
cargo run --release -- serve
```
The model (~850 MB) downloads automatically on first run.
### Docker
```sh
# CPU — model auto-downloads on first run (~850 MB)
docker build -t gigastt .
docker run -p 9876:9876 gigastt
# CUDA (Linux, requires NVIDIA Container Toolkit)
docker build -f Dockerfile.cuda -t gigastt-cuda .
docker run --gpus all -p 9876:9876 gigastt-cuda
# Baked image — model included at build time, zero cold-start (~1.1 GB)
docker build --build-arg GIGASTT_BAKE_MODEL=1 -t gigastt:baked .
```
### Transcribe a File
```sh
# CLI
gigastt transcribe recording.wav
# REST API
curl -X POST http://127.0.0.1:9876/v1/transcribe \
-H "Content-Type: application/octet-stream" \
--data-binary @recording.wav
# {"text":"Привет, как дела?","words":[],"duration":3.5}
```
## API
### WebSocket — Real-time Streaming
Connect to `ws://127.0.0.1:9876/v1/ws` (canonical; `ws://…/ws` is a deprecated alias), send PCM16 audio frames, receive transcription in real time.
```
Client Server
| |
| <------- ready ----------------- |
| {type:"ready", version:"1.0"} |
| |
|------- configure (optional) --> |
| {type:"configure", |
| sample_rate:16000} |
| |
|-------- binary PCM16 --------> |
| |
| <------- partial --------------- |
| {type:"partial", text:"привет"} |
| |
| <------- final ----------------- |
| {type:"final", |
| text:"Привет, как дела?"} |
```
**Supported sample rates:** 8, 16, 24, 44.1, 48 kHz (default 48 kHz, resampled to 16 kHz internally).
### REST API
| `/health` | GET | Health check (`{"status":"ok"}`) |
| `/v1/models` | GET | Model info (encoder type, pool size, capabilities) |
| `/v1/transcribe` | POST | File transcription, full JSON response |
| `/v1/transcribe/stream` | POST | File transcription with SSE streaming |
| `/v1/ws` | GET | WebSocket upgrade for real-time streaming (canonical) |
| `/ws` | GET | Deprecated alias for `/v1/ws` — removal planned for v1.0 |
| `/metrics` | GET | Prometheus metrics (enabled with `--metrics`). Returns 404 otherwise |
**SSE streaming example:**
```sh
curl -X POST http://127.0.0.1:9876/v1/transcribe/stream \
-H "Content-Type: application/octet-stream" \
--data-binary @recording.wav
# data: {"type":"partial","text":"привет как"}
# data: {"type":"partial","text":"привет как дела"}
# data: {"type":"final","text":"Привет, как дела?"}
```
Full protocol spec: [`docs/asyncapi.yaml`](docs/asyncapi.yaml)
#### Error Responses
| 400 | `bad_request` | Invalid audio format or malformed request |
| 413 | `payload_too_large` | File exceeds `--body-limit-bytes` (default 50 MiB) |
| 429 | `rate_limit_exceeded` | Per-IP token bucket exhausted; `Retry-After` header included |
| 503 | `pool_saturated` | All inference sessions busy; `Retry-After: 30` |
| 503 | `pool_closed` | Server is shutting down, pool closed to new checkouts |
```json
// Example: pool saturation
HTTP/1.1 503 Service Unavailable
Retry-After: 30
{"code":"pool_saturated","message":"All inference sessions are busy"}
```
### Client Libraries
Ready-to-use WebSocket clients in [`examples/`](examples/):
#### Python
```sh
pip install websockets
python examples/python_client.py recording.wav
```
#### Bun (TypeScript)
```sh
bun examples/bun_client.ts recording.wav
```
#### Go
```sh
# go mod init gigastt-client && go get github.com/gorilla/websocket
go run examples/go_client.go recording.wav
```
#### Kotlin
```sh
# See header in KotlinClient.kt for Gradle/Maven deps
kotlinc examples/KotlinClient.kt -include-runtime -d client.jar
java -jar client.jar recording.wav
```
## Performance
| **WER (Russian)** | 10.4% (993 Golos crowd samples, 4991 words) |
| **INT8 vs FP32** | 0% WER degradation (10.4% vs 10.5% on 993 samples) |
| **Latency (16s audio, M1)** | ~700 ms (encoder 667 ms + decode 31 ms) |
| **Memory (RSS)** | ~560 MB |
| **Model size** | 851 MB (FP32) / 222 MB (INT8) |
| **Concurrent sessions** | up to 4 (configurable via `--pool-size`) |
### Hardware Acceleration
| macOS ARM64 (M1-M4) | `--features coreml` | CoreML + Neural Engine |
| Linux x86_64 + NVIDIA | `--features cuda` | CUDA 12+ |
| Any platform | _(default)_ | CPU |
```sh
cargo build --release --features coreml # macOS: CoreML + Neural Engine
cargo build --release --features cuda # Linux: NVIDIA CUDA 12+
cargo build --release # CPU (any platform)
```
Features are compile-time and mutually exclusive.
### INT8 Quantization
Quantized encoder: 4x smaller, ~43% faster, 0% WER degradation (verified on 993 Golos samples / 4991 words). Auto-detected at runtime.
Since v0.9.0 quantization is always compiled in and auto-invoked on first `download` or `serve` — no feature flag and no manual steps needed. The `quantize` Cargo feature is retained as a no-op for backward compat.
```sh
# Automatic (recommended)
cargo install gigastt
gigastt serve # downloads model + auto-quantizes on first run
# Opt out of auto-quantization (FP32 only)
gigastt serve --skip-quantize
# or: GIGASTT_SKIP_QUANTIZE=1 gigastt serve
# Manual re-quantization
gigastt quantize # native Rust quantization
gigastt quantize --force # re-quantize even if INT8 model exists
```
## Architecture
```
Audio Input
(PCM16, multi-rate)
|
v
+-----------------+
| Mel Spectrogram | 64 bins, FFT=320, hop=160
+-----------------+
|
v
+------------------------+
| Conformer Encoder | 16 layers, 768-dim, 240M params
| (ONNX Runtime) | CoreML | CUDA | CPU
+------------------------+
|
v
+------------------------+
| RNN-T Decoder + Joiner | Stateful: h/c persisted
| (ONNX Runtime) | across streaming chunks
+------------------------+
|
v
+------------------------+
| BPE Tokenizer | 1025 tokens
| + Auto-punctuation |
+------------------------+
|
v
Russian Text
```
## Android / FFI
gigastt can be embedded into Android applications via a C-ABI FFI layer (no HTTP server, no JNI boilerplate required).
```sh
# Build libgigastt.so for Android (arm64)
cargo ndk -t arm64-v8a -o ./jniLibs build --release \
--no-default-features --features ffi
```
| `gigastt_engine_new(model_dir)` | Load engine (default pool_size = 4) |
| `gigastt_engine_new_with_pool_size(model_dir, pool_size)` | Load engine with custom RAM budget |
| `gigastt_transcribe_file(engine, wav_path)` | Synchronous file transcription |
| `gigastt_stream_new(engine)` | Start a real-time streaming session |
| `gigastt_stream_process_chunk(...)` | Feed PCM16 audio, get JSON segments |
| `gigastt_stream_flush(...)` | Finalize stream |
The `ffi` feature pulls in `ort/nnapi` for NPU/DSP acceleration on Android.
For pool sizing on mobile: use `pool_size = 1` to stay within ~350 MB RAM.
Full integration guide: [`ANDROID.md`](ANDROID.md)
Kotlin bridge: [`ffi/android/GigasttBridge.kt`](ffi/android/GigasttBridge.kt)
## CLI Reference
Key flags for the most common commands. Every flag also has an environment variable — see the [full CLI reference](docs/cli.md).
```sh
# Start server
gigastt serve --port 9876 --bind-all --metrics
# Transcribe a file
gigastt transcribe recording.wav
# Re-quantize encoder (native Rust, ~2 min one-time)
gigastt quantize --force
```
| Flag | Default | Description |
|---|---|---|
| `--port` | 9876 | Listen port |
| `--host` | 127.0.0.1 | Bind address (loopback-only by default) |
| `--bind-all` | — | Allow non-loopback bind |
| `--pool-size` | 4 | Concurrent inference sessions |
| `--metrics` | — | Expose Prometheus at `/metrics` |
| `--idle-timeout-secs` | 300 | WebSocket idle timeout |
| `--max-session-secs` | 3600 | Wall-clock session cap |
| `--rate-limit-per-minute` | 0 | Per-IP rate limit (0 = off) |
| `--skip-quantize` | — | Skip INT8 quantization on first run |
## Model
[**GigaAM v3 e2e_rnnt**](https://huggingface.co/istupakov/gigaam-v3-onnx) by [SberDevices](https://github.com/salute-developers/GigaAM):
| Property | Value |
|---|---|
| Architecture | RNN-T (Conformer encoder + LSTM decoder + joiner) |
| Encoder | 16-layer Conformer, 768-dim, 240M params |
| Training data | 700K+ hours of Russian speech |
| Vocabulary | 1025 BPE tokens |
| Input | 16 kHz mono PCM16 |
| Quantization | INT8 available (v0.2+) |
| License | MIT |
| Download | ~850 MB (encoder 844 MB, decoder 4.4 MB, joiner 2.6 MB) |
## Requirements
| | macOS ARM64 | Linux x86_64 |
|---|---|---|
| **OS** | macOS 14+ (Sonoma) | Any modern distro |
| **CPU** | Apple Silicon (M1-M4) | x86_64 |
| **GPU** | _(integrated, via CoreML)_ | NVIDIA + CUDA 12+ (optional) |
| **Disk** | ~1.5 GB | ~1.5 GB |
| **RAM** | ~560 MB | ~560 MB |
| **Rust** | 1.85+ | 1.85+ |
## Security
- **Loopback-only bind.** The server refuses to listen on anything other than
`127.0.0.1` / `::1` / `localhost` unless the operator explicitly passes
`--bind-all` (or sets `GIGASTT_ALLOW_BIND_ANY=1`). Prevents accidental public
exposure behind a reverse proxy or stray port forward.
- **Cross-origin requests denied by default.** A browser page at
`https://evil.example.com` cannot drive-by connect to the local WebSocket /
REST API. Loopback origins are always allowed; extra origins must be added
via `--allow-origin https://app.example.com` (repeatable). Legacy
`Access-Control-Allow-Origin: *` behaviour is opt-in via
`--cors-allow-any`.
- **Retry-After on backpressure.** Pool saturation returns HTTP 503 with a
`Retry-After: 30` header; WebSocket `error` payloads include
`retry_after_ms: 30000` so clients can back off without guessing.
- **WebSocket frame limit:** 512 KB.
- **Session pool:** max 4 concurrent sessions (configurable via `--pool-size`).
- **Audio buffer cap:** 5 s (streaming) / 10 min (file upload).
- **Internal errors sanitized** — no path or model leakage to clients.
- **Idle connection timeout:** 300 s.
- **Per-IP rate limiting** (optional, off by default): `--rate-limit-per-minute N`
enables a token-bucket limiter on all `/v1/*` endpoints; `/health` is exempt.
Returns HTTP 429 when the bucket is exhausted. Privacy-first default: disabled.
Remote deployment (TLS + reverse proxy): see [`docs/deployment.md`](docs/deployment.md).
## Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| `protoc` not found during build | Missing Protocol Buffers compiler | `brew install protobuf` (macOS) or `apt install protobuf-compiler` (Debian/Ubuntu) |
| Model download hangs or fails | Network / HuggingFace availability | Retry `gigastt download`; check `~/.gigastt/models/` permissions |
| `Cannot quantize: FP32 encoder not found` | Partial download | Delete `~/.gigastt/models/` and re-run `gigastt download` |
| OOM on startup | Pool size too large for available RAM | Lower `--pool-size` (default 4); each session loads the full encoder |
| CoreML not used on macOS | Built without `--features coreml` | Re-build: `cargo build --release --features coreml` |
| CUDA not available on Linux | Built without `--features cuda` or missing CUDA 12+ | Re-build: `cargo build --release --features cuda`; verify `nvidia-smi` |
| WebSocket closes with 1008 | Session exceeded `--max-session-secs` | Increase `--max-session-secs` or send shorter streams |
| 429 Too Many Requests | Rate limiter enabled and bucket exhausted | Wait for `Retry-After` interval, or disable with `--rate-limit-per-minute 0` |
| Empty transcription for noisy audio | Input too quiet or wrong format | Ensure 16-bit PCM; normalize audio level; check supported formats |
## Testing
125 unit tests + 30 e2e tests + load & soak tests:
```sh
cargo test # 125 unit tests (no model needed)
cargo clippy # Lint (zero warnings)
# E2E tests (require model, serial to avoid OOM)
cargo run -- download
cargo test --test e2e_rest --test e2e_ws --test e2e_errors --test e2e_shutdown -- --ignored --test-threads=1
# Load & soak (local only)
cargo test --test load_test -- --ignored
cargo test --test soak_test -- --ignored
```
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) — development setup, PR guidelines, and release checklist.
## License
MIT — see [LICENSE](LICENSE)
## Acknowledgments
- [**GigaAM**](https://github.com/salute-developers/GigaAM) by [SberDevices](https://github.com/salute-developers) — the speech recognition model
- [**onnx-asr**](https://github.com/istupakov/onnx-asr) by [@istupakov](https://github.com/istupakov) — ONNX model export and reference
- [**ONNX Runtime**](https://github.com/microsoft/onnxruntime) — inference engine
- [**ort**](https://github.com/pykeio/ort) — Rust bindings for ONNX Runtime