gigastt 0.1.1

Local STT server powered by GigaAM v3 e2e_rnnt — on-device Russian speech recognition via ONNX Runtime

Features

  • Real-time streaming — partial transcription results appear as you speak via WebSocket
  • On-device — no cloud APIs, no API keys, zero cost, full privacy
  • Punctuation — automatic punctuation and text normalization (e2e model)
  • Russian-first — GigaAM v3 e2e_rnnt, 50% better than Whisper-large-v3 on Russian benchmarks
  • Fast — Conformer encoder (240M params) optimized for Apple Silicon via CoreML
  • Auto-download — model fetched from HuggingFace on first run

Quick Start

cargo install gigastt
gigastt serve
# Listening on ws://127.0.0.1:9876
# Model auto-downloaded (~850MB) on first run

Usage

Start STT server

gigastt serve
# Options: --port 9876 --model-dir ~/.gigastt/models

Transcribe a file

gigastt transcribe recording.wav
# Outputs transcribed Russian text to stdout
# Requires: mono PCM16 WAV at 16kHz
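
The input requirement above (mono, 16-bit PCM, 16 kHz) can be checked before calling `gigastt transcribe`. The following is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav` is hypothetical and is not part of gigastt.

```python
import wave


def check_wav(path):
    """Return a list of problems that violate the stated input format.

    An empty list means the file is mono, 16-bit PCM, 16 kHz.
    `check_wav` is an illustrative helper, not a gigastt API.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != 16000:
            problems.append(f"expected 16000 Hz, got {wav.getframerate()} Hz")
    return problems
```

If the list is non-empty, the file would need to be converted (for example with an audio tool of your choice) before transcription.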

Download model only

gigastt download
# Downloads to ~/.gigastt/models/

WebSocket Protocol

Connect to ws://127.0.0.1:9876 and send audio as binary frames of 16-bit PCM (mono, 16 kHz). The server replies with JSON messages of the types listed below.
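
As a rough illustration of the binary framing, the sketch below splits raw PCM16 audio into fixed-duration chunks that a WebSocket client could send one at a time. The 100 ms frame length and the function name `pcm16_frames` are assumptions made for this example; the protocol description does not specify a frame size.

```python
def pcm16_frames(pcm, frame_ms=100, sample_rate=16000):
    """Yield consecutive chunks of raw mono PCM16 audio.

    Each sample is 2 bytes, so a frame of `frame_ms` milliseconds holds
    sample_rate * frame_ms / 1000 samples. The default of 100 ms is an
    assumption for illustration, not a value taken from gigastt.
    """
    bytes_per_frame = (sample_rate * frame_ms // 1000) * 2
    for start in range(0, len(pcm), bytes_per_frame):
        # The last chunk may be shorter than bytes_per_frame.
        yield pcm[start:start + bytes_per_frame]
```

Each yielded chunk would be sent as one binary WebSocket message by whichever client library you use.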

Messages

Direction  Type     Fields              Description
Server     ready    model, sample_rate  Server is ready to accept audio
Server     partial  text, timestamp     Interim transcription (may change)
Server     final    text, timestamp     Complete utterance with punctuation
Server     error    message, code       Error occurred

Example

{"type": "ready", "model": "gigaam-v3-e2e-rnnt", "sample_rate": 16000}
{"type": "partial", "text": "что такое"}
{"type": "final", "text": "Что такое Node.js?"}
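
A client might dispatch on the `type` field of each message roughly as follows. This is a sketch based only on the message types and fields listed above; the function name `handle_message` and the choice to raise `RuntimeError` on `error` messages are assumptions, and no field beyond those shown is relied on.

```python
import json
from typing import List, Optional


def handle_message(raw: str, transcript: List[str]) -> Optional[str]:
    """Process one JSON message from the server.

    Appends finalized utterances to `transcript` and returns the current
    interim text for `partial` messages (None otherwise).
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "partial":
        # Interim text may still change, so it is returned, not stored.
        return msg["text"]
    if kind == "final":
        transcript.append(msg["text"])
        return None
    if kind == "error":
        raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")
    # "ready" and any unrecognized types need no action in this sketch.
    return None
```

Keeping partial results separate from the transcript avoids storing text that the server may later revise.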

Client Examples

See examples/ for ready-to-use WebSocket clients:

  • Python: python examples/python_client.py recording.wav
  • JavaScript: node examples/js_client.mjs recording.wav

Model

GigaAM v3 e2e_rnnt — Conformer-based ASR model by SberDevices:

Property       Value
Parameters     240M (Conformer encoder)
Architecture   RNN-T (encoder + decoder + joiner)
Training data  700K+ hours of Russian speech
Vocabulary     1025 BPE tokens
Input          16kHz mono PCM16
License        MIT

Requirements

  • macOS 14+ on Apple Silicon (M1/M2/M3/M4)
  • ~1.5GB disk space (model + binary)
  • ~500MB RAM during inference
  • Rust 1.75+ (for building from source)

Architecture

Audio (PCM16 16kHz)
  │
  ▼
┌──────────────────┐
│  Mel Spectrogram  │  64 bins, FFT=320, hop=160
└────────┬─────────┘
         ▼
┌──────────────────┐
│  Conformer       │  16 layers, d=768
│  Encoder (ONNX)  │
└────────┬─────────┘
         ▼
┌──────────────────┐
│  RNN-T Decoder   │  LSTM h/c state persisted
│  + Joiner (ONNX) │  across audio chunks
└────────┬─────────┘
         ▼
  Text (with punctuation)
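
To make the spectrogram parameters concrete: at 16 kHz, FFT=320 corresponds to a 20 ms analysis window and hop=160 to a 10 ms step. The sketch below works through that arithmetic. The frame-count formula assumes no padding at the signal edges, which is an assumption for illustration and may differ from what gigastt actually does.

```python
SAMPLE_RATE = 16000  # Hz, from the input format
N_FFT = 320          # samples per analysis window, from the diagram
HOP = 160            # samples between successive windows, from the diagram

window_ms = 1000 * N_FFT / SAMPLE_RATE  # 20.0 ms
hop_ms = 1000 * HOP / SAMPLE_RATE       # 10.0 ms


def num_mel_frames(num_samples):
    """Number of full windows that fit, assuming no edge padding."""
    if num_samples < N_FFT:
        return 0
    return 1 + (num_samples - N_FFT) // HOP
```

Under that assumption, one second of audio (16000 samples) yields 99 frames, or roughly 100 frames per second.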

License

MIT

Acknowledgments

  • GigaAM by SberDevices — the speech recognition model
  • onnx-asr by @istupakov — ONNX model export and reference implementation
  • ort — Rust bindings for ONNX Runtime