gigastt 0.2.0

Features

Real-time streaming — partial transcription results via WebSocket as you speak
On-device inference — no cloud APIs, no API keys, zero cost, full privacy
5.3% WER on Russian — GigaAM v3 e2e_rnnt, 3-4× better accuracy than Whisper-large-v3 on Russian benchmarks
CoreML & Neural Engine — Conformer encoder optimized for Apple Silicon via CoreML acceleration
Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC support for file transcription
INT8 quantization — reduced memory footprint and faster inference
Automatic punctuation — end-to-end model includes text normalization
Docker ready — containerized deployment with configurable host/port binding
Auto-download — model fetched from HuggingFace on first run (~850MB)

Quick Start

Cargo

cargo install gigastt
gigastt serve
# Listening on ws://127.0.0.1:9876

Homebrew

brew install ekhodzitsky/gigastt/gigastt
gigastt serve

Docker

docker run -p 9876:9876 ghcr.io/ekhodzitsky/gigastt:latest serve --host 0.0.0.0
# Model auto-downloaded on first run (~850MB)

CLI Usage

Start STT Server

gigastt serve
# Options:
#   --port 9876              (default: 9876)
#   --host 127.0.0.1         (default: 127.0.0.1, use 0.0.0.0 for Docker)
#   --model-dir ~/.gigastt/models

By default the server binds only to the loopback address (127.0.0.1). Pass --host 0.0.0.0 (for example in Docker) to accept external connections.
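Because the model is fetched on first run, a script that launches `gigastt serve` may find that the port is not accepting connections right away. A minimal Python sketch of a readiness check, using only the standard library. `wait_for_port` and its default timeouts are hypothetical, not part of gigastt, and a successful TCP connect shows only that something is listening, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 9876,
                  timeout_s: float = 60.0, interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A caller could start the server in the background and then check `wait_for_port()` before opening the WebSocket.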

Transcribe Audio File (Offline)

gigastt transcribe recording.wav
# Outputs transcribed Russian text to stdout
# Supported: WAV, M4A, MP3, OGG, FLAC (mono or auto-mixed to mono)
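
To call the offline transcriber from a script, one option is to shell out to the CLI and read stdout, as in the sketch below. The helper names are hypothetical, the extension check simply mirrors the format list above (gigastt itself may detect formats differently), and `check=True` assumes the CLI exits with a non-zero status on failure, which this README does not state.

```python
import subprocess
from pathlib import Path

# Extensions taken from the list of supported formats above.
SUPPORTED = {".wav", ".m4a", ".mp3", ".ogg", ".flac"}


def build_transcribe_command(audio_path: str, binary: str = "gigastt") -> list[str]:
    """Build the argv for `gigastt transcribe <file>`, rejecting unsupported extensions."""
    if Path(audio_path).suffix.lower() not in SUPPORTED:
        raise ValueError(f"unsupported audio format: {audio_path}")
    return [binary, "transcribe", audio_path]


def transcribe(audio_path: str) -> str:
    """Run the CLI and return whatever it printed to stdout, stripped."""
    result = subprocess.run(build_transcribe_command(audio_path),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

With gigastt on the PATH, `transcribe("recording.wav")` would return the recognized text as a string.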

Download Model Only

gigastt download
# Downloads to ~/.gigastt/models/ (~850MB)

WebSocket API

Connection & Message Flow

Connect to ws://127.0.0.1:9876 and send PCM16 mono audio at 48kHz as binary frames. The server resamples to 16kHz internally.

Client                          Server
  │                               │
  ├──────── connect ────────────→ │
  │                               │
  │ ←────── Ready message ─────── │
  │ {type:"ready", version:"1.0"} │
  │                               │
  ├────── binary frames ────────→ │
  │ (PCM16, 48kHz)                │
  │                               │
  │ ←────── Partial results ───── │
  │ {type:"partial", text:"что"}  │
  │                               │
  │ ←─────── Final result ─────── │
  │ {type:"final", text:"Что?"}   │
  │                               │
  └───────── close ─────────────→ │
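
The flow in the diagram might look as follows in Python. This is a sketch under several assumptions, not the project's reference client (see examples/ for that). It uses the third-party `websockets` package, treats server messages as JSON text frames, and sends 100 ms chunks paced in real time. Since the `stop` message is only planned for v0.3, it simply stops listening once all audio is sent and the server has been quiet for `IDLE_TIMEOUT_S` seconds. The chunk size and timeout are illustrative values.

```python
import asyncio
import json

SAMPLE_RATE = 48_000   # input rate named in the text above
CHUNK_MS = 100         # assumed chunk length; the README does not prescribe one
IDLE_TIMEOUT_S = 2.0   # assumed: give up after this much silence from the server


def iter_pcm16_chunks(pcm: bytes, sample_rate: int = SAMPLE_RATE,
                      chunk_ms: int = CHUNK_MS):
    """Yield consecutive slices of raw PCM16 mono audio, chunk_ms long each."""
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]


async def stream(pcm: bytes, url: str = "ws://127.0.0.1:9876") -> None:
    import websockets  # third-party (pip install websockets), imported lazily

    async with websockets.connect(url) as ws:
        print("ready:", json.loads(await ws.recv()))

        async def send_audio() -> None:
            for chunk in iter_pcm16_chunks(pcm):
                await ws.send(chunk)                  # bytes go out as a binary frame
                await asyncio.sleep(CHUNK_MS / 1000)  # pace roughly in real time

        sender = asyncio.create_task(send_audio())
        try:
            while True:
                try:
                    raw = await asyncio.wait_for(ws.recv(), IDLE_TIMEOUT_S)
                except asyncio.TimeoutError:
                    if sender.done():
                        break      # all audio sent and the server has gone quiet
                    continue
                message = json.loads(raw)
                print(message["type"], message.get("text", ""))
        except websockets.ConnectionClosed:
            pass                   # server closed the connection
        finally:
            sender.cancel()
```

Given the raw frames of a 48kHz, 16-bit mono WAV in `pcm` (for example from the standard `wave` module's `readframes`), `asyncio.run(stream(pcm))` would print each message type and its text as it arrives.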

Message Types

The full protocol is documented in docs/asyncapi.yaml.

Sender	Type	Fields	Notes
Server	`ready`	`model`, `sample_rate`, `version`	Sent on connection. Includes protocol v1.0.
Server	`partial`	`text`, `timestamp`	Interim transcription (may change with more audio)
Server	`final`	`text`, `timestamp`	Complete utterance with punctuation
Server	`error`	`message`, `code`	Error occurred; connection may close
Client	`stop`	—	Request finalization (planned for v0.3)

Example Session

{"type": "ready", "model": "gigaam-v3-e2e-rnnt", "sample_rate": 16000, "version": "1.0"}
{"type": "partial", "text": "что такое", "timestamp": 0.5}
{"type": "partial", "text": "что такое Node", "timestamp": 1.2}
{"type": "final", "text": "Что такое Node.js?", "timestamp": 2.1}
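
A client typically keeps only the latest `partial` and commits text when a `final` arrives. A small sketch of that bookkeeping, run on the example session above. `fold_session` is a hypothetical helper; it assumes each `partial` carries the full interim text rather than a delta (as the example suggests), and raising on `error` is just one possible policy.

```python
import json


def fold_session(lines):
    """Return (finals, pending): completed utterances and the latest unconfirmed partial."""
    finals, pending = [], ""
    for line in lines:
        msg = json.loads(line)
        kind = msg.get("type")
        if kind == "partial":
            pending = msg["text"]          # each partial supersedes the previous one
        elif kind == "final":
            finals.append(msg["text"])     # final text replaces the partials before it
            pending = ""
        elif kind == "error":
            raise RuntimeError(f'{msg.get("code")}: {msg.get("message")}')
        # "ready" and unknown types carry no transcript text and are skipped
    return finals, pending


session = [
    '{"type": "ready", "model": "gigaam-v3-e2e-rnnt", "sample_rate": 16000, "version": "1.0"}',
    '{"type": "partial", "text": "что такое", "timestamp": 0.5}',
    '{"type": "partial", "text": "что такое Node", "timestamp": 1.2}',
    '{"type": "final", "text": "Что такое Node.js?", "timestamp": 2.1}',
]
print(fold_session(session))  # (['Что такое Node.js?'], '')
```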

Client Examples

See examples/ for ready-to-use WebSocket clients:

Python: python examples/python_client.py recording.wav
JavaScript: node examples/js_client.mjs recording.wav

Performance

Benchmarks

Metric	v0.2
WER (Russian)	5.3%
vs Whisper-large-v3	3-4× better
Latency (16s audio)	~800ms (M1)
Memory	~500MB
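
WER (word error rate) is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below implements that textbook definition. It is not the evaluation script behind the numbers above, and real benchmarks usually normalize case and punctuation first, which is omitted here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a four-word reference gives `wer(...) == 0.25`; on that scale, 5.3% corresponds to roughly one error every 19 reference words.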

Acceleration

CoreML — Conformer encoder optimized via ONNX Runtime's CoreML execution provider
Neural Engine — INT8 quantization leverages Apple Neural Engine for 2-3× speedup
Streaming — stateful decoder persists across chunks; no full-audio re-inference needed

Architecture

┌─────────────────────────────────────┐
│ Audio Input (PCM16, 48/16kHz)       │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ Mel Spectrogram (64 bins)           │
│ FFT=320, hop=160, HTK               │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ Conformer Encoder (ONNX)            │
│ 16 layers, d=768, 240M params       │
│ ┌─ CoreML execution (M1/M2/M3/M4)   │
│ └─ INT8 quantized                   │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ RNN-T Decoder + Joiner (ONNX)       │
│ ┌─ Stateful: h/c persisted          │
│ └─ Per-chunk processing             │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ BPE Tokenizer (1025 tokens)         │
│ + Automatic Punctuation             │
└──────────────┬──────────────────────┘
               │
               ▼
      Final Russian Text
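
The front-end numbers in the diagram can be unpacked with a little arithmetic. At 16kHz, 320 samples span 20 ms and a hop of 160 samples is a 10 ms step. "HTK" presumably refers to the HTK mel-scale formula, mel = 2595 * log10(1 + f / 700). The sketch below assumes no padding at the signal edges and a 0 to 8000 Hz band; neither is stated in this README, and the function names are illustrative.

```python
import math

SAMPLE_RATE = 16_000
N_FFT, HOP, N_MELS = 320, 160, 64   # values from the diagram above


def hz_to_mel(hz: float) -> float:
    """HTK mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def num_frames(num_samples: int) -> int:
    """Frame count assuming no edge padding (an assumption, not from the README)."""
    if num_samples < N_FFT:
        return 0
    return 1 + (num_samples - N_FFT) // HOP


def mel_centers(f_min: float = 0.0, f_max: float = SAMPLE_RATE / 2) -> list[float]:
    """Center frequencies (Hz) of N_MELS filters equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (N_MELS + 1)
    return [mel_to_hz(lo + step * k) for k in range(1, N_MELS + 1)]


print(num_frames(SAMPLE_RATE))  # frames in one second of audio: 99
```

Under these assumptions the encoder sees about 100 feature frames per second of audio, each a 64-value vector.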

Model

GigaAM v3 e2e_rnnt — Conformer-based RNN-T ASR by SberDevices:

Property	Value
Architecture	RNN-T (encoder + decoder + joiner)
Encoder	16-layer Conformer, 768-dim, 240M params
Training Data	700K+ hours of Russian speech
Vocabulary	1025 BPE tokens
Input	16kHz mono PCM16
Quantization	INT8 (v0.2+)
License	MIT
Download Size	~850MB (encoder 844MB, decoder 4.4MB, joiner 2.6MB)

Requirements

OS: macOS 14+ (Sonoma or newer)
CPU: Apple Silicon (M1, M2, M3, M4)
Disk: ~1.5GB (model + binary)
RAM: ~500MB during inference
Rust: 1.75+ (for building from source)

Installation

From crates.io

cargo install gigastt

From source

git clone https://github.com/ekhodzitsky/gigastt
cd gigastt
cargo install --path .

Docker

# See Dockerfile in repo for production image
docker build -t gigastt .
docker run -p 9876:9876 gigastt serve --host 0.0.0.0

Build & Development

cargo build              # Debug build
cargo build --release   # Release (LTO, stripped)
cargo test              # Run tests
cargo clippy            # Lint

# Download model (required for integration tests, ~850MB)
cargo run -- download

License

MIT — see LICENSE

Acknowledgments

GigaAM by SberDevices — the speech recognition model
onnx-asr by @istupakov — ONNX model export and reference implementation
ONNX Runtime — inference engine with CoreML & Neural Engine support
ort — Rust bindings for ONNX Runtime