Features
- Real-time streaming — partial transcription results appear as you speak via WebSocket
- On-device — no cloud APIs, no API keys, zero cost, full privacy
- Punctuation — automatic punctuation and text normalization (e2e model)
- Russian-first — GigaAM v3 e2e_rnnt, 50% better than Whisper-large-v3 on Russian benchmarks
- Fast — Conformer encoder (240M params) optimized for Apple Silicon via CoreML
- Auto-download — model fetched from HuggingFace on first run
Quick Start
# Listening on ws://127.0.0.1:9876
# Model auto-downloaded (~850MB) on first run
Usage
Start STT server
# Options: --port 9876 --model-dir ~/.gigastt/models
Transcribe a file
# Outputs transcribed Russian text to stdout
# Requires: mono PCM16 WAV at 16kHz
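
Because file transcription expects mono PCM16 WAV at 16kHz, it can help to check a recording before passing it in. The following is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav` is hypothetical and not part of gigastt.

```python
import wave

def check_wav(path):
    """Return a list of format problems; an empty list means the file
    matches the stated input format (mono, 16-bit PCM, 16 kHz)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != 16000:
            problems.append(f"expected 16000 Hz, got {wav.getframerate()} Hz")
    return problems
```

A file that fails any of these checks would need to be resampled or downmixed with an external tool first.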
Download model only
# Downloads to ~/.gigastt/models/
WebSocket Protocol
Connect to ws://127.0.0.1:9876 and send PCM16 mono 16kHz binary frames.
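
To illustrate what "PCM16 mono 16kHz binary frames" means in practice, here is a small sketch that converts float samples to 16-bit PCM bytes and splits a buffer into fixed-duration frames. The 100 ms frame length and little-endian byte order are assumptions, not values stated by gigastt; the function names are hypothetical.

```python
import struct

SAMPLE_RATE = 16000    # from the protocol description
BYTES_PER_SAMPLE = 2   # PCM16

def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM bytes.
    Little-endian byte order is an assumption."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

def pcm16_frames(pcm, frame_ms=100):
    """Yield consecutive chunks holding at most frame_ms of audio.
    The 100 ms default is an illustrative choice only."""
    frame_bytes = SAMPLE_RATE * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

Each yielded chunk would be sent as one binary WebSocket message by whichever client library is in use.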
Messages
| Direction | Type | Fields | Description |
|---|---|---|---|
| Server | ready | model, sample_rate | Server is ready to accept audio |
| Server | partial | text, timestamp | Interim transcription (may change) |
| Server | final | text, timestamp | Complete utterance with punctuation |
| Server | error | message, code | Error occurred |
Example
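
As a rough sketch of how a client might consume these messages: the code below assumes each server message is a JSON object whose `type` key holds one of the types in the table and whose other keys match the listed fields. The JSON encoding and the `type` key name are assumptions, and the `Transcript` class is hypothetical.

```python
import json

class Transcript:
    """Accumulates final utterances and tracks the latest partial result."""

    def __init__(self):
        self.finals = []
        self.partial = ""

    def update(self, raw):
        """Apply one server message (assumed to be JSON text); return its type."""
        msg = json.loads(raw)
        kind = msg.get("type", "unknown")
        if kind == "partial":
            # Partial results may change, so each one replaces the previous.
            self.partial = msg.get("text", "")
        elif kind == "final":
            # A final result closes the utterance and clears the partial.
            self.finals.append(msg.get("text", ""))
            self.partial = ""
        elif kind == "error":
            raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")
        return kind

    def text(self):
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

A client would typically wait for `ready` before sending audio, then call `update` on every text message it receives.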
Client Examples
See examples/ for ready-to-use WebSocket clients:
- Python: `python examples/python_client.py recording.wav`
- JavaScript: `node examples/js_client.mjs recording.wav`
Model
GigaAM v3 e2e_rnnt — Conformer-based ASR model by SberDevices:
| Property | Value |
|---|---|
| Parameters | 240M (Conformer encoder) |
| Architecture | RNN-T (encoder + decoder + joiner) |
| Training data | 700K+ hours of Russian speech |
| Vocabulary | 1025 BPE tokens |
| Input | 16kHz mono PCM16 |
| License | MIT |
Requirements
- macOS 14+ on Apple Silicon (M1/M2/M3/M4)
- ~1.5GB disk space (model + binary)
- ~500MB RAM during inference
- Rust 1.75+ (for building from source)
Architecture
Audio (PCM16 16kHz)
│
▼
┌──────────────────┐
│ Mel Spectrogram │ 64 bins, FFT=320, hop=160
└────────┬─────────┘
▼
┌──────────────────┐
│ Conformer │ 16 layers, d=768
│ Encoder (ONNX) │
└────────┬─────────┘
▼
┌──────────────────┐
│ RNN-T Decoder │ LSTM h/c state persisted
│ + Joiner (ONNX) │ across audio chunks
└────────┬─────────┘
▼
Text (with punctuation)
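
The spectrogram parameters in the diagram translate into time units as follows. This sketch assumes the FFT size of 320 is also the analysis window length and that no padding is applied at the edges; neither detail is stated above, so the frame count is illustrative only.

```python
SAMPLE_RATE = 16000  # Hz, from the input format
N_FFT = 320          # samples per window (assumed equal to the FFT size)
HOP = 160            # samples between successive windows
N_MELS = 64          # mel bins per frame

def samples_to_ms(samples):
    """Duration of a number of samples in milliseconds."""
    return 1000 * samples / SAMPLE_RATE

def num_frames(num_samples):
    """Spectrogram frames for a clip, assuming no edge padding."""
    if num_samples < N_FFT:
        return 0
    return 1 + (num_samples - N_FFT) // HOP

window_ms = samples_to_ms(N_FFT)   # 20.0 ms window
hop_ms = samples_to_ms(HOP)        # 10.0 ms step
frames_per_second = num_frames(SAMPLE_RATE)          # 99 frames for 1 s of audio
feature_shape = (frames_per_second, N_MELS)          # (99, 64)
```

Under these assumptions, one second of audio yields roughly 100 frames of 64 mel values before the encoder sees it.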
License
MIT