gigastt 0.4.3

Local STT server powered by GigaAM v3 e2e_rnnt — on-device Russian speech recognition via ONNX Runtime

Features

  • Real-time streaming — partial transcription results via WebSocket as you speak
  • On-device inference — no cloud APIs, no API keys, zero cost, full privacy
  • 5.3% WER on Russian — GigaAM v3 e2e_rnnt, 3-4× better accuracy than Whisper-large-v3 on Russian benchmarks
  • CoreML & Neural Engine — Conformer encoder optimized for Apple Silicon via CoreML acceleration
  • CUDA acceleration — Linux x86_64 with NVIDIA GPU support via CUDA 12+
  • Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC support for file transcription
  • INT8 quantization — reduced memory footprint and faster inference
  • Automatic punctuation — end-to-end model includes text normalization
  • Docker ready — containerized deployment with configurable host/port binding
  • Auto-download — model fetched from HuggingFace on first run (~850MB)

Quick Start

Cargo

cargo install gigastt
gigastt serve
# Listening on ws://127.0.0.1:9876/ws

Docker

# CPU image (any platform)
docker build -t gigastt .
docker run -p 9876:9876 gigastt serve --host 0.0.0.0

# CUDA image (Linux x86_64, requires NVIDIA GPU + CUDA 12+ drivers on host)
docker build -f Dockerfile.cuda -t gigastt-cuda .
docker run --gpus all -p 9876:9876 gigastt-cuda serve --host 0.0.0.0

# Model auto-downloaded on first run (~850MB)

CLI Usage

Start STT Server

gigastt serve
# Options:
#   --port 9876              (default: 9876)
#   --host 127.0.0.1         (default: 127.0.0.1, use 0.0.0.0 for Docker)
#   --model-dir ~/.gigastt/models

By default the server binds only to the loopback address (127.0.0.1). Pass --host 0.0.0.0 when running in Docker so the server accepts connections from outside the container.
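For scripted launches (for example from a test harness), the documented options can be assembled programmatically. This is only a minimal sketch: `serve_args` and `ws_url` are hypothetical helper names, and it assumes the `gigastt` binary is on PATH and that the WebSocket path is `/ws`, as printed in the Quick Start.

```python
def serve_args(host="127.0.0.1", port=9876, model_dir=None):
    """Build the argv list for `gigastt serve` from the documented options."""
    args = ["gigastt", "serve", "--host", host, "--port", str(port)]
    if model_dir is not None:
        # --model-dir is only passed when overriding the default location.
        args += ["--model-dir", model_dir]
    return args


def ws_url(host="127.0.0.1", port=9876):
    """WebSocket endpoint a client would connect to for this host and port."""
    return f"ws://{host}:{port}/ws"
```

Passing the list to `subprocess.Popen(serve_args(host="0.0.0.0"))` would start the server with the same binding the Docker examples use.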

Transcribe Audio File (Offline)

gigastt transcribe recording.wav
# Outputs transcribed Russian text to stdout
# Supported: WAV, M4A, MP3, OGG, FLAC (mono or auto-mixed to mono)

Download Model Only

gigastt download
# Downloads to ~/.gigastt/models/ (~850MB)

WebSocket API

Connection & Message Flow

Connect to ws://127.0.0.1:9876/ws and send mono PCM16 audio as binary frames. The default input sample rate is 48kHz; to use a different rate, send a configure message before the first audio frame. The server resamples to 16kHz internally.

Client                          Server
  │                               │
  ├──────── connect ────────────→ │
  │                               │
  │ ←────── Ready message ─────── │
  │ {type:"ready", version:"1.0"} │
  │                               │
  ├────── binary frames ────────→ │
  │ (PCM16, 48kHz)                │
  │                               │
  │ ←────── Partial results ───── │
  │ {type:"partial", text:"что"}  │
  │                               │
  │ ←─────── Final result ─────── │
  │ {type:"final", text:"Что?"}   │
  │                               │
  └───────── close ─────────────→ │
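The binary frames in the diagram carry raw PCM16 samples. As a rough sketch of producing them from float samples: `pcm16_frames` is a hypothetical helper, and both the 100 ms frame length and the little-endian byte order are assumptions, since the flow above specifies neither.

```python
import struct


def pcm16_frames(samples, sample_rate=48000, frame_ms=100):
    """Yield PCM16 mono byte frames of roughly frame_ms each.

    samples: iterable of floats in [-1.0, 1.0].
    frame_ms and little-endian packing are assumptions, not documented values.
    """
    samples = list(samples)
    per_frame = sample_rate * frame_ms // 1000
    for start in range(0, len(samples), per_frame):
        chunk = samples[start:start + per_frame]
        # Clamp to [-1, 1], scale to the signed 16-bit range, pack as int16.
        ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in chunk]
        yield struct.pack("<%dh" % len(ints), *ints)
```

At the default 48kHz, one second of audio becomes ten frames of 9600 bytes each (4800 samples at 2 bytes per sample); each frame would be sent as one binary WebSocket message.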

Message Types

The full protocol is documented in docs/asyncapi.yaml.

Direction  Type       Fields                                        Notes
Server     ready      model, sample_rate, version, supported_rates  Sent on connection. Includes protocol version 1.0.
Server     partial    text, timestamp, words                        Interim transcription (may change with more audio)
Server     final      text, timestamp, words                        Complete utterance with punctuation
Server     error      message, code                                 Error occurred; connection may close
Client     stop       (none)                                        Request finalization of buffered audio
Client     configure  sample_rate, diarization                      Set input sample rate (8000/16000/24000/44100/48000) and optionally enable speaker diarization. Send before first audio frame.
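The two client messages can be built as small JSON text frames. The following is a sketch under assumptions: `build_configure` and `build_stop` are hypothetical helpers, `diarization` is assumed to be a boolean, and the stop message is assumed to carry only a `type` field like the other messages; docs/asyncapi.yaml remains the authority on the exact schema.

```python
import json

# Input sample rates listed for the configure message.
SUPPORTED_RATES = (8000, 16000, 24000, 44100, 48000)


def build_configure(sample_rate, diarization=None):
    """Return the JSON text for a configure message, validating the rate first."""
    if sample_rate not in SUPPORTED_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    msg = {"type": "configure", "sample_rate": sample_rate}
    if diarization is not None:
        # Assumption: diarization is a simple on/off flag.
        msg["diarization"] = bool(diarization)
    return json.dumps(msg)


def build_stop():
    """Return the JSON text for a stop message (assumed shape)."""
    return json.dumps({"type": "stop"})
```

Validating the rate on the client side avoids a round trip that would otherwise end in an error message from the server.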

Example Session

{"type": "ready", "model": "gigaam-v3-e2e-rnnt", "sample_rate": 48000, "version": "1.0", "supported_rates": [8000, 16000, 24000, 44100, 48000]}
{"type": "configure", "sample_rate": 8000}
// ... send PCM16 audio at 8kHz ...
{"type": "partial", "text": "что такое", "timestamp": 0.5}
{"type": "partial", "text": "что такое Node", "timestamp": 1.2}
{"type": "final", "text": "Что такое Node.js?", "timestamp": 2.1}
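A client consuming a session like this one typically treats each partial as a replacement for the previous one and each final as committed text. A minimal sketch of that handling, where `collect_transcript` is a hypothetical helper and the input is assumed to be the server's JSON text frames in arrival order:

```python
import json


def collect_transcript(raw_messages):
    """Return (finals, pending_partial) from an iterable of server JSON frames."""
    finals = []
    pending = ""
    for raw in raw_messages:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "partial":
            # A partial supersedes the previous one; it may still change.
            pending = msg["text"]
        elif kind == "final":
            # A final closes the utterance, so the pending partial is dropped.
            finals.append(msg["text"])
            pending = ""
        elif kind == "error":
            raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")
        # "ready" and any unrecognized types are ignored in this sketch.
    return finals, pending
```

A UI would usually render `pending` in a provisional style and append entries of `finals` to the permanent transcript.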

REST API

The server exposes HTTP endpoints on the same port as the WebSocket endpoint.

GET /health

Returns server status.

curl http://127.0.0.1:9876/health
# {"status":"ok"}
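The same check can be made from Python with only the standard library, for instance to wait until the server is ready. This is a sketch: `parse_health` and `check_health` are hypothetical helpers, and the 2 second timeout is an arbitrary choice.

```python
import json
import urllib.request


def parse_health(body):
    """Return True if a /health response body reports status "ok"."""
    return json.loads(body).get("status") == "ok"


def check_health(base_url="http://127.0.0.1:9876", timeout=2.0):
    """Return True if GET /health answers 200 with status "ok", else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200 and parse_health(resp.read())
    except OSError:
        # Connection refused, timeouts, and HTTP errors all count as "not ready".
        return False
```

Calling `check_health()` in a short retry loop is one way to delay client startup until the model has finished loading.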

POST /v1/transcribe

Transcribe an audio file (WAV, M4A, MP3, OGG, FLAC). Returns the full transcript when complete.

curl -X POST http://127.0.0.1:9876/v1/transcribe \
  -H "Content-Type: application/octet-stream" \
  --data-binary @recording.wav
# {"text":"Что такое Node.js?","words":[],"duration":3.5}
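An equivalent of the curl call using Python's standard library might look like the sketch below. `build_transcribe_request` and `transcribe_file` are hypothetical helpers, and the 120 second timeout is an assumption chosen to leave room for long recordings.

```python
import json
import urllib.request


def build_transcribe_request(audio_bytes, base_url="http://127.0.0.1:9876"):
    """Build the POST /v1/transcribe request with the raw file as the body."""
    return urllib.request.Request(
        f"{base_url}/v1/transcribe",
        data=audio_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def transcribe_file(path, base_url="http://127.0.0.1:9876", timeout=120.0):
    """Upload an audio file and return the decoded JSON response as a dict."""
    with open(path, "rb") as f:
        req = build_transcribe_request(f.read(), base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```

With a server running, `transcribe_file("recording.wav")["text"]` would hold the transcript string from the response.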

POST /v1/transcribe/stream

Transcribe an audio file with streaming Server-Sent Events (SSE). Returns partial results as they arrive.

curl -X POST http://127.0.0.1:9876/v1/transcribe/stream \
  -H "Content-Type: application/octet-stream" \
  --data-binary @recording.wav
# data: {"type":"partial","text":"что такое"}
# data: {"type":"partial","text":"что такое Node"}
# data: {"type":"final","text":"Что такое Node.js?"}
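On the client side, the stream has to be split into events before the JSON can be decoded. Below is a small sketch of an SSE parser that handles only `data:` fields, which is all the output above shows; `iter_sse_events` is a hypothetical helper, and it assumes each event's data is a single JSON object.

```python
import json


def iter_sse_events(lines):
    """Yield one decoded JSON object per SSE event from an iterable of text lines."""
    data = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # Per the SSE format, one leading space after the colon is dropped.
            if value.startswith(" "):
                value = value[1:]
            data.append(value)
        elif line == "" and data:
            # A blank line terminates the current event.
            yield json.loads("\n".join(data))
            data = []
    if data:
        # Flush a trailing event that was not followed by a blank line.
        yield json.loads("\n".join(data))
```

Feeding it the decoded lines of the HTTP response body yields the partial and final messages in order.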

Client Examples

See examples/ for ready-to-use WebSocket clients:

  • Python: python examples/python_client.py recording.wav
  • JavaScript: node examples/js_client.mjs recording.wav

Performance

Benchmarks

Metric               v0.2
WER (Russian)        5.3%
vs Whisper-large-v3  3-4× better
Latency (16s audio)  ~800ms (M1)
Memory               ~500MB
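The latency figure can be restated as a real-time factor (processing time divided by audio duration). A short worked example using the numbers from the table; `real_time_factor` is a hypothetical helper name:

```python
def real_time_factor(processing_s, audio_s):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_s / audio_s


# ~800 ms to process 16 s of audio on an M1, per the benchmark table.
rtf = real_time_factor(0.8, 16.0)
speedup = 1.0 / rtf
```

Here `rtf` comes out at about 0.05, meaning the audio is processed roughly 20 times faster than real time.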

Acceleration

  • CoreML — Conformer encoder optimized via ONNX Runtime's CoreML execution provider (macOS ARM64)
  • Neural Engine — INT8 quantization leverages Apple Neural Engine for 2-3× speedup (macOS ARM64)
  • CUDA — ONNX Runtime CUDA execution provider for NVIDIA GPUs on Linux x86_64; falls back to CPU at runtime if no GPU is available
  • Streaming — stateful decoder persists across chunks; no full-audio re-inference needed

Relative throughput: CPU < CUDA < CoreML (Apple Silicon).

Architecture

┌─────────────────────────────────────┐
│ Audio Input (PCM16, 48/16kHz)       │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ Mel Spectrogram (64 bins)           │
│ FFT=320, hop=160, HTK               │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ Conformer Encoder (ONNX)            │
│ 16 layers, d=768, 240M params       │
│ ┌─ CoreML execution (M1/M2/M3/M4)   │
│ ├─ CUDA execution (Linux x86_64)    │
│ └─ INT8 quantized                   │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ RNN-T Decoder + Joiner (ONNX)       │
│ ┌─ Stateful: h/c persisted          │
│ └─ Per-chunk processing             │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ BPE Tokenizer (1025 tokens)         │
│ + Automatic Punctuation             │
└──────────────┬──────────────────────┘
               │
               ▼
      Final Russian Text
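The mel spectrogram parameters in the diagram translate into frame timing as follows. This is a sketch under stated assumptions: the FFT size is taken to equal the analysis window length, the input is the 16kHz internal rate, and no padding is applied at the edges (the actual front end may pad, which would change the count slightly). `mel_frame_count` is a hypothetical helper.

```python
SAMPLE_RATE = 16000  # internal rate after resampling
N_FFT = 320          # FFT size from the diagram, assumed equal to the window length
HOP = 160            # hop size from the diagram

WINDOW_MS = 1000 * N_FFT / SAMPLE_RATE  # window duration in milliseconds
HOP_MS = 1000 * HOP / SAMPLE_RATE       # step between frames in milliseconds


def mel_frame_count(n_samples, n_fft=N_FFT, hop=HOP):
    """Number of full analysis windows that fit, assuming no edge padding."""
    if n_samples < n_fft:
        return 0
    return 1 + (n_samples - n_fft) // hop
```

Under these assumptions each frame covers 20 ms and frames advance by 10 ms, so one second of audio yields 99 frames of 64 mel bins each.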

Model

GigaAM v3 e2e_rnnt — Conformer-based RNN-T ASR by SberDevices:

Property       Value
Architecture   RNN-T (encoder + decoder + joiner)
Encoder        16-layer Conformer, 768-dim, 240M params
Training Data  700K+ hours of Russian speech
Vocabulary     1025 BPE tokens
Input          16kHz mono PCM16
Quantization   INT8 (v0.2+)
License        MIT
Download Size  ~850MB (encoder 844MB, decoder 4.4MB, joiner 2.6MB)

Requirements

      macOS ARM64              Linux x86_64
OS    macOS 14+ (Sonoma)       Any modern Linux distro
CPU   Apple Silicon (M1–M4)    x86_64
GPU   -                        NVIDIA GPU with CUDA 12+ (optional)
Disk  ~1.5GB (model + binary)  ~1.5GB (model + binary)
RAM   ~500MB during inference  ~500MB during inference
Rust  1.85+ (edition 2024)     1.85+ (edition 2024)

Installation

From crates.io

cargo install gigastt

From source

git clone https://github.com/ekhodzitsky/gigastt
cd gigastt
cargo install --path .

Build & Development

cargo build                        # CPU-only (any platform)
cargo build --features coreml     # macOS ARM64: CoreML + Neural Engine
cargo build --features cuda       # Linux x86_64: NVIDIA CUDA 12+
cargo build --release             # Release build (LTO, stripped)
cargo test                        # Run tests
cargo clippy                      # Lint

# Features are mutually exclusive — do not combine coreml and cuda.

# Download model (required for integration tests, ~850MB)
cargo run -- download

License

MIT — see LICENSE

Acknowledgments