Status: 0.1.0 — pre-alpha, work in progress. The server scaffolding (HTTP, WebSocket, SSE, rate limiting, metrics, graceful shutdown) is forked from the production-grade gigastt stack. The Vietnamese inference path (model fetch, mel features, BPE tokenizer, RNN-T decode for Zipformer-vi) is the active work in progress, tracked toward the first functional release.
phostt turns any machine into a Vietnamese speech recognition server that runs entirely on-device. Zipformer-vi RNN-T weights ship pre-quantized from sherpa-onnx; install is one command, the model is ~75 MB, and everything runs locally.
```bash
# Once 0.1.0 ships:
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API:  http://127.0.0.1:9876/v1/transcribe
```
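As a sketch of the intended flow (the `cargo install` path is mentioned in the table below; the exact CLI and the multipart field name here are assumptions, not a final API):

```bash
# Hypothetical quick start; names and flags may change before 0.1.0.
cargo install phostt        # single-command install
phostt                      # start the local server on 127.0.0.1:9876

# From another shell, POST an audio file to the REST endpoint:
curl -s -X POST http://127.0.0.1:9876/v1/transcribe \
  -F "file=@hello.wav"      # the 'file' field name is an assumption
```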
## Why phostt?
| | phostt | PhoWhisper-large | Cloud APIs |
|---|---|---|---|
| Architecture | Zipformer + RNN-T | Whisper enc-dec | varies |
| Model size (INT8) | ~75 MB | ~1.5 GB | server-side |
| WER (GigaSpeech2-vi) | ~7.7% | n/a | varies |
| Privacy | 100% local | 100% local | data leaves device |
| Cost | free forever | free | $0.006/min+ |
| Setup | `cargo install` | Python + deps | API key + billing |
| Streaming | real-time WebSocket | batch only | varies |
The Zipformer-vi-30M weights ship via sherpa-onnx releases (Apache 2.0, trained on ~6000 hours of Vietnamese speech, ICASSP-published WER on the VLSP and GigaSpeech2 benchmarks).
## Features
- Real-time streaming — partial transcription via WebSocket as you speak ✅ (see the sketch after this list)
- REST API + SSE — file transcription with instant or streaming response
- Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
- Pre-quantized INT8 — encoder ships at ~75 MB INT8 from upstream
- Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
- Auto-download — model fetched from sherpa-onnx GitHub releases on first run
- Docker ready — CPU and CUDA images with multi-stage builds
- Hardened — connection limits, frame caps, idle timeouts, sanitized errors
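A minimal way to poke the streaming endpoint once it ships might look like the following. It assumes the server accepts raw binary PCM frames over the WebSocket, which is an assumption about a wire format that is not yet final; `websocat` and `arecord` are stock tools, not part of phostt:

```bash
# Hypothetical streaming smoke test; the framing/protocol is not final.
# Capture 16 kHz mono PCM from the default mic and pipe it over the WebSocket.
arecord -f S16_LE -r 16000 -c 1 -t raw \
  | websocat -b ws://127.0.0.1:9876/v1/ws
```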
## Quick Start (preview)
```bash
# From source (Rust toolchain required; steps are provisional pre-0.1.0)
cargo build --release
```
The Zipformer-vi ONNX bundle (~75 MB) downloads automatically on first run into `~/.phostt/models/`.
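To verify the download, the cache directory can be inspected directly (the path comes from this README; the file names inside are illustrative):

```bash
# List the cached model files after the first run.
ls -lh ~/.phostt/models/
```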
### Smoke test
With the server running (or using the transcribe command directly; the CLI surface is provisional):
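```bash
# Hypothetical invocation: the transcribe command is named in this README,
# but the exact arguments are not final and the file name is a placeholder.
phostt transcribe hello.wav
```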
Expected output (approximate — exact text depends on the test utterance):

```
xin chào
```
Latency note: a debug build processes 3.7 s of audio in ~50 ms total on an M1. This is approximate and will vary by hardware and build profile; release builds with LTO are significantly faster.
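For numbers closer to what users will see, time a release build instead (a sketch only; it reuses the hypothetical `phostt transcribe` invocation from the smoke test above):

```bash
# Build with the release profile (where LTO is typically enabled) and time it.
cargo build --release
time ./target/release/phostt transcribe hello.wav
```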
## Docker
```bash
# CPU (any platform); image names below are placeholders until images are published
docker run --rm -p 9876:9876 phostt:cpu

# CUDA (Linux + NVIDIA Container Toolkit)
docker run --rm --gpus all -p 9876:9876 phostt:cuda
```
## Acknowledgements
phostt is a Vietnamese fork of gigastt,
which provides the production-grade server scaffolding (HTTP/WS/SSE,
rate-limit, metrics, graceful shutdown). The Vietnamese inference uses the
Zipformer-Transducer weights packaged by the
sherpa-onnx project.
## License
MIT — see LICENSE.