gigastt 2.0.14

Local STT server powered by GigaAM v3 e2e_rnnt — on-device Russian speech recognition via ONNX Runtime
Documentation

gigastt turns any machine into a real-time Russian speech recognition server. One binary, one command, state-of-the-art accuracy — everything runs locally.

cargo install gigastt && gigastt serve
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API:  http://127.0.0.1:9876/v1/transcribe

Demo

$ gigastt transcribe recording.wav
Привет, как дела?

$ curl -X POST http://127.0.0.1:9876/v1/transcribe \
    -H "Content-Type: application/octet-stream" \
    --data-binary @recording.wav
{"text":"Привет, как дела?","words":[],"duration":3.5}

Why gigastt?

gigastt whisper.cpp faster-whisper Vosk sherpa-onnx Cloud APIs
Model GigaAM v3 Whisper large-v3 Whisper large-v3 Vosk models varies vendor
WER (Russian) 11.37% ~18% ~18% ~20%+ model-dependent 5–10%
Languages Russian 99 99 20+ 10+ 100+
Streaming real-time WebSocket WebSocket + gRPC WebSocket + TCP varies
Latency (16s, M1) ~700ms ~4s ~2s ~3s ~1.5s network
Privacy 100% local 100% local 100% local 100% local 100% local data leaves device
Setup cargo install cmake + make pip install pip install cmake or pip API key + billing
Implementation Rust C/C++ Python/C++ C++/Java C++ N/A
Bindings Rust, C FFI C, Python, Go, JS… Python Python, Java, JS, Go… C, Python, Java, Swift… SDK per vendor
INT8 quantization auto, 0% WER loss GGML quant CTranslate2 quant N/A
Concurrent sessions configurable pool 1 1 1 1 provider limits
Cost free free free free free $0.006/min+

Trade-off: gigastt supports Russian only. If you need multilingual recognition, consider whisper.cpp or sherpa-onnx. If you need the best Russian accuracy running locally — gigastt is the only Rust-native option built on GigaAM v3, the current SOTA for Russian ASR. Trained on 700K+ hours of Russian speech. WER measured on 9 994 Golos crowd-sourced samples (50 394 words).

Who is this for?

  • Real-time voice assistants — WebSocket streaming with sub-second latency
  • Call-center transcription — speaker diarization + REST batch processing
  • Offline document processing — transcribe meeting recordings without cloud upload
  • Privacy-first mobile apps — embed via C-ABI FFI on Android with on-device inference
  • Research & ML pipelines — standalone gigastt-core library for Rust ML stacks

Features

  • Real-time streaming — partial transcription via WebSocket as you speak
  • REST API + SSE — file transcription with instant or streaming response
  • Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
  • INT8 quantization — 4x smaller model, 43% faster inference
  • Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
  • Speaker diarization — identify who said what (optional feature)
  • Automatic punctuation — GigaAM v3 model produces punctuated, normalized text
  • Auto-download — model fetched from HuggingFace on first run (~850 MB)
  • Docker ready — CPU and CUDA images with multi-stage builds
  • Hardened — connection limits, frame caps, idle timeouts, sanitized errors

Quick Start

Install & Run

# Homebrew (macOS ARM64 / Linux x86_64)
brew tap ekhodzitsky/gigastt https://github.com/ekhodzitsky/gigastt
brew install gigastt
gigastt serve

# From crates.io (requires `protoc` on PATH: `brew install protobuf` / `apt install protobuf-compiler`)
cargo install gigastt
gigastt serve

# From source
git clone https://github.com/ekhodzitsky/gigastt
cd gigastt
cargo run --release -- serve

The model (~850 MB) downloads automatically on first run.

Docker

# CPU — model auto-downloads on first run (~850 MB)
docker build -t gigastt .
docker run -p 9876:9876 gigastt

# CUDA (Linux, requires NVIDIA Container Toolkit)
docker build -f Dockerfile.cuda -t gigastt-cuda .
docker run --gpus all -p 9876:9876 gigastt-cuda

# Baked image — model included at build time, zero cold-start (~1.1 GB)
docker build --build-arg GIGASTT_BAKE_MODEL=1 -t gigastt:baked .

Transcribe a File

# CLI
gigastt transcribe recording.wav

# REST API
curl -X POST http://127.0.0.1:9876/v1/transcribe \
  -H "Content-Type: application/octet-stream" \
  --data-binary @recording.wav
# {"text":"Привет, как дела?","words":[],"duration":3.5}

API

WebSocket — Real-time Streaming

Connect to ws://127.0.0.1:9876/v1/ws, send PCM16 audio frames, receive transcription in real time.

Client                            Server
  |                                 |
  |-------- connect --------------> |
  |                                 |
  | <------- ready ----------------- |
  | {type:"ready", version:"1.0"}  |
  |                                 |
  |------- configure (optional) --> |
  | {type:"configure",              |
  |  sample_rate:16000}             |
  |                                 |
  |-------- binary PCM16 --------> |
  |                                 |
  | <------- partial --------------- |
  | {type:"partial", text:"привет"} |
  |                                 |
  | <------- final ----------------- |
  | {type:"final",                  |
  |  text:"Привет, как дела?"}      |

Supported sample rates: 8, 16, 24, 44.1, 48 kHz (default 48 kHz, resampled to 16 kHz internally).

REST API

Endpoint Method Description
/health GET Health check ({"status":"ok"})
/ready GET Readiness probe (200 when engine pool is ready)
/v1/models GET Model info (encoder type, pool size, capabilities)
/v1/transcribe POST File transcription, full JSON response
/v1/transcribe/stream POST File transcription with SSE streaming
/v1/ws GET WebSocket upgrade for real-time streaming
/metrics GET Prometheus metrics (enabled with --metrics)

SSE streaming example:

curl -X POST http://127.0.0.1:9876/v1/transcribe/stream \
  -H "Content-Type: application/octet-stream" \
  --data-binary @recording.wav
# data: {"type":"partial","text":"привет как"}
# data: {"type":"partial","text":"привет как дела"}
# data: {"type":"final","text":"Привет, как дела?"}

Full protocol spec: docs/asyncapi.yaml

Error Responses

HTTP Code When
400 bad_request Invalid audio format or malformed request
413 payload_too_large File exceeds --body-limit-bytes (default 50 MiB)
429 rate_limit_exceeded Per-IP token bucket exhausted; Retry-After header included
503 pool_saturated All inference sessions busy; Retry-After: 30
503 pool_closed Server is shutting down, pool closed to new checkouts
// Example: pool saturation
HTTP/1.1 503 Service Unavailable
Retry-After: 30

{"code":"pool_saturated","message":"All inference sessions are busy"}

Client Libraries

Ready-to-use WebSocket clients in examples/:

Python

pip install websockets
python examples/python_client.py recording.wav

Bun (TypeScript)

bun examples/bun_client.ts recording.wav

Go

# go mod init gigastt-client && go get github.com/gorilla/websocket
go run examples/go_client.go recording.wav

Kotlin

# See header in KotlinClient.kt for Gradle/Maven deps
kotlinc examples/KotlinClient.kt -include-runtime -d client.jar
java -jar client.jar recording.wav

Performance

Metric Value
WER (Russian) 11.37% (9 994 Golos crowd samples, 50 394 words, 95% CI [10.9%, 11.9%])
INT8 vs FP32 0% WER degradation (11.37% vs 11.46% on 9 994 samples)
Latency (16s audio, M1) ~700 ms (encoder 667 ms + decode 31 ms)
Memory (RSS) ~560 MB
Model size 851 MB (FP32) / 222 MB (INT8)
Concurrent sessions up to 4 (configurable via --pool-size)

Cross-ASR Comparison (9 994 samples, Golos crowd, CPU)

Engine Model WER RTF Size
Vosk vosk-model-ru-0.42 4.27% 0.107x 1.3 GB
gigastt GigaAM v3 (INT8) 11.37% 0.335x 230 MB
whisper.cpp Large v3 14.96% 1.108x ~3 GB
faster-whisper Large v3 (INT8) 15.73% 1.224x ~3 GB

Note: Vosk leads on this clean-speech subset (Golos crowd) with excellent accuracy, but requires 1.3 GB. gigastt targets streaming real-time use-cases with sub-200ms latency, hardware acceleration (CoreML/CUDA/NNAPI), and a 6× smaller model footprint. Full reproducible benchmark: benchmark/README.md. Raw results: benchmark-results-local.

Hardware Acceleration

Platform Feature flag Execution Provider
macOS ARM64 (M1-M4) --features coreml CoreML + Neural Engine
Linux x86_64 + NVIDIA --features cuda CUDA 12+
Android / ARM64 --features nnapi NNAPI (NPU/DSP)
Any platform (default) CPU
cargo build --release --features coreml   # macOS: CoreML + Neural Engine
cargo build --release --features cuda     # Linux: NVIDIA CUDA 12+
cargo build --release                     # CPU (any platform)

coreml and cuda are mutually exclusive; nnapi can be combined with either.

How the CoreML path works. The Conformer encoder has a dynamic time axis, and CoreML cannot reliably execute partitions compiled with dynamic shapes — they fail at prediction time (issue #42). gigastt therefore compiles the model in MLProgram format and restricts CoreML to statically-shaped subgraphs: the heavy convolution/matmul blocks run on the Neural Engine, dynamic-shape ops stay on the CPU EP. Measured on an Apple M1 Pro (INT8 encoder, release build, median of 5 runs): ~3× faster encoder inference on a 4 s WAV (~210 ms vs ~690 ms) and ~5.6× faster on a 2-minute file (~5.5 s vs ~31 s) vs the pure-CPU build.

Automatic CPU fallback. On startup the engine runs a ~1 s silent warmup probe through the full pipeline. If CoreML fails to load or fails the probe, the engine logs a warning (falling back to CPU execution provider) and transparently rebuilds its sessions on the CPU EP — a broken CoreML stack degrades performance instead of crashing. CoreML support remains model-dependent: a future model revision may shift more (or fewer) ops onto the Neural Engine.

INT8 Quantization

Quantized encoder: 4x smaller, ~43% faster, 0% WER degradation (verified on 9 994 Golos samples / 50 394 words). Auto-detected at runtime.

Since v0.9.0 quantization is always compiled in and auto-invoked on first download or serve — no feature flag and no manual steps needed. The quantize Cargo feature is retained as a no-op for backward compat.

# Automatic (recommended)
cargo install gigastt
gigastt serve           # downloads model + auto-quantizes on first run

# Opt out of auto-quantization (FP32 only)
gigastt serve --skip-quantize
# or: GIGASTT_SKIP_QUANTIZE=1 gigastt serve

# Manual re-quantization
gigastt quantize                     # native Rust quantization
gigastt quantize --force             # re-quantize even if INT8 model exists

Project Structure

gigastt is organized as a 3-crate Cargo workspace:

Crate Type Purpose
gigastt-core lib (rlib) Inference engine, model download, quantization, protocol types
gigastt-ffi lib (cdylib) C-ABI FFI layer for Android / mobile embedding
gigastt bin Server binary (axum HTTP/WS) + CLI

gigastt-core has no server dependencies — embed inference in any Rust project with gigastt-core = "2.0".

Architecture

                    Audio Input
                   (PCM16, multi-rate)
                        |
                        v
               +-----------------+
               | Mel Spectrogram |  64 bins, FFT=320, hop=160
               +-----------------+
                        |
                        v
            +------------------------+
            |   Conformer Encoder    |  16 layers, 768-dim, 240M params
            |  (ONNX Runtime)        |  CoreML | CUDA | CPU
            +------------------------+
                        |
                        v
            +------------------------+
            | RNN-T Decoder + Joiner |  Stateful: h/c persisted
            |  (ONNX Runtime)        |  across streaming chunks
            +------------------------+
                        |
                        v
            +------------------------+
            |   BPE Tokenizer        |  1025 tokens
            |   + Auto-punctuation   |
            +------------------------+
                        |
                        v
                  Russian Text

Android / FFI

gigastt can be embedded into Android applications via a C-ABI FFI layer (no HTTP server, no JNI boilerplate required).

# Build libgigastt_ffi.so for Android (arm64)
cargo ndk -t arm64-v8a -o ./jniLibs build --release -p gigastt-ffi
Function Purpose
gigastt_engine_new(model_dir) Load engine (default pool_size = 4)
gigastt_engine_new_with_pool_size(model_dir, pool_size) Load engine with custom RAM budget
gigastt_transcribe_file(engine, wav_path) Synchronous file transcription
gigastt_stream_new(engine) Start a real-time streaming session
gigastt_stream_process_chunk(...) Feed PCM16 audio, get JSON segments
gigastt_stream_flush(...) Finalize stream

The nnapi feature on gigastt-ffi pulls in ort/nnapi for NPU/DSP acceleration on Android: cargo ndk ... build -p gigastt-ffi --features nnapi. For pool sizing on mobile: use pool_size = 1 to stay within ~350 MB RAM.

Full integration guide: ANDROID.md
Kotlin bridge: ffi/android/GigasttBridge.kt

CLI Reference

Key flags for the most common commands. Every flag also has an environment variable — see the full CLI reference.

# Start server
gigastt serve --port 9876 --bind-all --metrics

# Transcribe a file
gigastt transcribe recording.wav

# Re-quantize encoder (native Rust, ~2 min one-time)
gigastt quantize --force
Flag Default Description
--port 9876 Listen port
--host 127.0.0.1 Bind address (loopback-only by default)
--bind-all Allow non-loopback bind
--pool-size 4 Concurrent inference sessions
--metrics Expose Prometheus at /metrics
--idle-timeout-secs 300 WebSocket idle timeout
--max-session-secs 3600 Wall-clock session cap
--rate-limit-per-minute 0 Per-IP rate limit (0 = off)
--skip-quantize Skip INT8 quantization on first run

Model

GigaAM v3 e2e_rnnt by SberDevices:

Property Value
Architecture RNN-T (Conformer encoder + LSTM decoder + joiner)
Encoder 16-layer Conformer, 768-dim, 240M params
Training data 700K+ hours of Russian speech
Vocabulary 1025 BPE tokens
Input 16 kHz mono PCM16
Quantization INT8 available (v0.2+)
License MIT
Download ~850 MB (encoder 844 MB, decoder 4.4 MB, joiner 2.6 MB)

Requirements

macOS ARM64 Linux x86_64
OS macOS 14+ (Sonoma) Any modern distro
CPU Apple Silicon (M1-M4) x86_64
GPU (integrated, via CoreML) NVIDIA + CUDA 12+ (optional)
Disk ~1.5 GB ~1.5 GB
RAM ~560 MB ~560 MB
Rust 1.87+ 1.87+

Security

  • Loopback-only bind. The server refuses to listen on anything other than 127.0.0.1 / ::1 / localhost unless the operator explicitly passes --bind-all (or sets GIGASTT_ALLOW_BIND_ANY=1). Prevents accidental public exposure behind a reverse proxy or stray port forward.
  • Cross-origin requests denied by default. A browser page at https://evil.example.com cannot drive-by connect to the local WebSocket / REST API. Loopback origins are always allowed; extra origins must be added via --allow-origin https://app.example.com (repeatable). Legacy Access-Control-Allow-Origin: * behaviour is opt-in via --cors-allow-any.
  • Retry-After on backpressure. Pool saturation returns HTTP 503 with a Retry-After: 30 header; WebSocket error payloads include retry_after_ms: 30000 so clients can back off without guessing.
  • WebSocket frame limit: 512 KB.
  • Session pool: max 4 concurrent sessions (configurable via --pool-size).
  • Audio buffer cap: 5 s (streaming) / 10 min (file upload).
  • Internal errors sanitized — no path or model leakage to clients.
  • Idle connection timeout: 300 s.
  • Per-IP rate limiting (optional, off by default): --rate-limit-per-minute N enables a token-bucket limiter on all /v1/* endpoints; /health is exempt. Returns HTTP 429 when the bucket is exhausted. Privacy-first default: disabled.

Remote deployment (TLS + reverse proxy): see docs/deployment.md.

Troubleshooting

Symptom Cause Fix
protoc not found during build Missing Protocol Buffers compiler brew install protobuf (macOS) or apt install protobuf-compiler (Debian/Ubuntu)
Model download hangs or fails Network / HuggingFace availability Retry gigastt download; check ~/.gigastt/models/ permissions
Cannot quantize: FP32 encoder not found Partial download Delete ~/.gigastt/models/ and re-run gigastt download
OOM on startup Pool size too large for available RAM Lower --pool-size (default 4); each session loads the full encoder
CoreML not used on macOS Built without --features coreml Re-build: cargo build --release --features coreml
falling back to CPU execution provider in logs CoreML failed to compile or execute the model on this macOS/model combo Transcription still works on CPU; clear ~/.gigastt/models/coreml_cache/ and retry, or file an issue with the warning text
CUDA not available on Linux Built without --features cuda or missing CUDA 12+ Re-build: cargo build --release --features cuda; verify nvidia-smi
WebSocket closes with 1008 Session exceeded --max-session-secs Increase --max-session-secs or send shorter streams
429 Too Many Requests Rate limiter enabled and bucket exhausted Wait for Retry-After interval, or disable with --rate-limit-per-minute 0
Empty transcription for noisy audio Input too quiet or wrong format Ensure 16-bit PCM; normalize audio level; check supported formats

Testing

240+ unit tests (including property-based via proptest) + 33 e2e/load/soak tests + WER benchmark + 4 cargo-fuzz targets + 3 criterion micro-benchmarks:

cargo test --workspace               # 240+ unit tests (no model needed)
cargo clippy --workspace --all-targets  # Lint (zero warnings)

# E2E tests (require model, serial to avoid OOM)
cargo run -p gigastt -- download
cargo test -p gigastt --test e2e_rest --test e2e_ws --test e2e_errors --test e2e_shutdown --test e2e_rate_limit -- --ignored --test-threads=1

# Load & soak (local only)
cargo test -p gigastt --test load_test -- --ignored
cargo test -p gigastt --test soak_test -- --ignored

# Fuzzing (nightly) & micro-benchmarks (no model needed)
cargo +nightly fuzz run audio_decode    # also: protocol_parse, pcm16_framing, tokenizer
cargo bench -p gigastt-core --features __internals

Cross-ASR Benchmark

A reproducible benchmark comparing gigastt against whisper.cpp, faster-whisper, and Vosk on Russian speech (Golos crowd dataset):

cd benchmark
pip install -r requirements.txt
python benchmark.py --max-samples 100

Contributing

See CONTRIBUTING.md — development setup, PR guidelines, and release checklist.

License

MIT — see LICENSE

Acknowledgments