phostt turns any machine into a Vietnamese speech recognition server that runs entirely on-device. Zipformer-vi RNN-T weights ship pre-quantized from sherpa-onnx; the binary is one command, the model is ~75 MB, everything runs locally.
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API: http://127.0.0.1:9876/v1/transcribe
Or build from source:
Why phostt?
| | phostt | PhoWhisper-large | Cloud APIs |
|---|---|---|---|
| Architecture | Zipformer + RNN-T | Whisper enc-dec | varies |
| Model size (INT8) | ~75 MB | ~1.5 GB | server-side |
| WER (GigaSpeech2-vi) | ~7.7% | n/a | varies |
| Latency (3.7 s audio) | ~61 ms | ~300 ms | network + queue |
| Throughput | 61× RTF | ~3× RTF | varies |
| Privacy | 100% local | 100% local | data leaves device |
| Cost | free forever | free | $0.006/min+ |
| Setup | `cargo install` | Python + deps | API key + billing |
| Streaming | real-time WebSocket | batch only | varies |
The Zipformer-vi-30M weights ship via sherpa-onnx releases (Apache 2.0, trained on ~70,000 hours of Vietnamese speech, ICASSP-published WER on the VLSP and GigaSpeech2 benchmarks).
Features
- Real-time streaming — partial transcription via WebSocket as you speak
- REST API + SSE — file transcription with instant or streaming response
- Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
- Pre-quantized INT8 — encoder ships at ~75 MB INT8 from upstream
- Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
- Auto-download — model fetched from sherpa-onnx GitHub releases on first run
- Speaker diarization — optional `diarization` feature for multi-speaker sessions
- Docker ready — CPU and CUDA images with multi-stage builds
- Android FFI — C-ABI + Kotlin bridge for mobile integration
- Hardened — connection limits, frame caps, idle timeouts, sanitized errors, rate limiting
Platform Support
| Platform | Target | Backend | Notes |
|---|---|---|---|
| macOS (Apple Silicon) | aarch64-apple-darwin | CoreML / CPU | Neural Engine + CPU fallback |
| macOS (Intel) | x86_64-apple-darwin | CPU | |
| Linux (x86_64) | x86_64-unknown-linux-gnu | CUDA 12+ / CPU | CUDA via `--features cuda` |
| Linux (ARM64) | aarch64-unknown-linux-gnu | CPU | Buildable, not CI-tested yet |
| Android | aarch64-linux-android, armv7-linux-androideabi | NNAPI / CPU | Via `cargo-ndk` + `ffi` feature |
| Windows | x86_64-pc-windows-msvc | CPU | Community-maintained |
iOS is theoretically supported via CoreML (`--features coreml,ffi`), but not yet verified in CI.
Quick Start
Install
The first run downloads the ~75 MB Zipformer-vi ONNX bundle automatically into ~/.phostt/models/.
Serve
# Listening on ws://127.0.0.1:9876/v1/ws
# REST API at http://127.0.0.1:9876/v1/transcribe
Smoke test
Expected output (from the bundled Vietnamese test fixture):
RỒI CŨNG HỖ TRỢ CHO LÂU LÂU CŨNG CHO GẠO CHO NÀY KIA
Docker
# CPU (any platform)
# CUDA (Linux + NVIDIA Container Toolkit)
Benchmarks
Measured on Apple Silicon M2 Pro, release build, 3.74 s Vietnamese test audio:
| Backend | Mean Latency | Median | P95 | RTF | Peak RSS |
|---|---|---|---|---|---|
| CPU | 60 ms | 60 ms | 61 ms | 62× | 1.4 GB |
| CoreML (Neural Engine) | 93 ms | 90 ms | 124 ms | 40× | 1.2 GB |
RTF = real-time factor (audio seconds processed per wall-clock second). For this 30M-param INT8 model, CPU is faster than CoreML on Apple Silicon.
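As a quick check on the table, the RTF column can be reproduced from the audio duration and the mean latency. The helper below is a minimal sketch using the definition given above; the function name `real_time_factor` is illustrative and is not part of phostt.

```rust
/// Real-time factor as defined above: audio seconds processed per
/// wall-clock second. Returns None when the wall time is not positive
/// or the audio duration is negative.
fn real_time_factor(audio_secs: f64, wall_secs: f64) -> Option<f64> {
    if wall_secs > 0.0 && audio_secs >= 0.0 {
        Some(audio_secs / wall_secs)
    } else {
        None
    }
}
```

With the figures from the table, `real_time_factor(3.74, 0.060)` gives roughly 62.3 for CPU and `real_time_factor(3.74, 0.093)` gives roughly 40.2 for CoreML, which matches the rounded 62× and 40× entries.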
Auto-updated benchmark history: BENCHMARKS.md.
Quality / WER
| Dataset | Samples | WER | Notes |
|---|---|---|---|
| GigaSpeech2-vi (clean) | — | ~7.7% | Published upstream benchmark on clean Vietnamese speech |
| FLEURS Vietnamese | 857 | 103.08% | Foreign names, numbers, and English terms transcribed phonetically into Vietnamese |
The FLEURS figure is expected to be high: the dataset contains many proper names, digits, and English loanwords that the model transcribes phonetically into Vietnamese. Because inserted words count as errors, WER can exceed 100%. The FLEURS number is used for regression tracking only (threshold: 1.05), not as an absolute quality metric. For production-quality assessment, refer to the GigaSpeech2-vi benchmark above.
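To see how a WER above 100% can arise, it helps to look at the metric itself. The sketch below uses the standard definition (word-level edit distance divided by the number of reference words); it is not taken from the phostt evaluation scripts, and the function name is illustrative.

```rust
/// Word error rate: word-level Levenshtein distance divided by the
/// number of reference words. Returns None for an empty reference.
fn word_error_rate(reference: &str, hypothesis: &str) -> Option<f64> {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    if r.is_empty() {
        return None;
    }
    // prev[j] holds the distance between the reference prefix processed
    // so far and the first j hypothesis words.
    let mut prev: Vec<usize> = (0..=h.len()).collect();
    for i in 1..=r.len() {
        // cur[0] = i: deleting all i reference words.
        let mut cur = vec![i; h.len() + 1];
        for j in 1..=h.len() {
            let substitute = prev[j - 1] + usize::from(r[i - 1] != h[j - 1]);
            let delete = prev[j] + 1;
            let insert = cur[j - 1] + 1;
            cur[j] = substitute.min(delete).min(insert);
        }
        prev = cur;
    }
    Some(prev[h.len()] as f64 / r.len() as f64)
}
```

For a made-up pair such as the reference `Barack Obama` and the phonetic hypothesis `ba rắc ô ba ma`, the distance is 5 edits against 2 reference words, so the WER is 2.5 (250%). A foreign name spelled out syllable by syllable is enough to push a single utterance well past 100%.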
Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐
│ Client │────▶│ axum HTTP │────▶│ SessionPool │
│ (WS/REST) │ │ router │ │ (async-channel) │
└─────────────┘ └─────────────┘ └─────────────────────┘
│
┌───────────────────────────┘
▼
┌────────────────┐
│ SessionTriplet │──▶ Zipformer Encoder (ONNX)
│ (enc/dec/join) │──▶ RNN-T Decoder (greedy)
└────────────────┘──▶ Joiner
│
▼
┌────────────────┐
│ StreamingState │──▶ overlap-buffer / VAD
│ (per-connection)│ → partial + final segments
└────────────────┘
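The diagram shows a `SessionPool` backed by a channel that hands out encoder/decoder/joiner sessions to incoming connections. The sketch below illustrates the general check-out/check-in idea only. It swaps the async channel for `std::sync::mpsc` to stay dependency-free, and the `Session` struct with its `id` field is a hypothetical placeholder, not phostt's actual `SessionTriplet`.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical placeholder for a loaded model session.
struct Session {
    id: usize,
}

/// A fixed-size pool: idle sessions wait in the channel, a request
/// takes one out and sends it back when finished.
struct SessionPool {
    tx: Sender<Session>,
    rx: Mutex<Receiver<Session>>,
}

impl SessionPool {
    fn new(size: usize) -> Self {
        let (tx, rx) = channel();
        for id in 0..size {
            tx.send(Session { id }).expect("receiver is alive");
        }
        SessionPool { tx, rx: Mutex::new(rx) }
    }

    /// Blocks until a session is idle.
    fn checkout(&self) -> Session {
        let rx = self.rx.lock().expect("pool lock poisoned");
        rx.recv().expect("pool owns a sender, so the channel stays open")
    }

    /// Returns None immediately when every session is in use.
    fn try_checkout(&self) -> Option<Session> {
        let rx = self.rx.lock().expect("pool lock poisoned");
        rx.try_recv().ok()
    }

    /// Hands a session back so another request can use it.
    fn checkin(&self, session: Session) {
        self.tx.send(session).expect("receiver is alive");
    }
}
```

The pool size caps how many inferences run at once: with `SessionPool::new(2)`, a third `try_checkout()` returns `None` until one of the first two sessions is checked back in.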
Mobile / FFI
phostt exposes a C-ABI for Android integration:
PhosttEngine* engine = ;
PhosttStream* stream = ;
char* json = ;
// ... free with phostt_string_free(json) ...
See ANDROID.md for NDK setup, Kotlin bridge (ffi/android/PhosttBridge.kt), and model bundling strategies.
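The comment about `phostt_string_free` reflects a common C-ABI ownership rule: a string allocated on the Rust side has to be released by the Rust side. The sketch below shows that pattern in general terms. The function names `demo_result_json` and `demo_string_free` and the JSON payload are hypothetical and do not describe phostt's real exports.

```rust
use std::ffi::{c_char, CString};

/// Hypothetical stand-in for a call that returns a JSON result.
/// Ownership of the buffer passes to the caller. A real export would
/// also carry a no_mangle attribute, omitted to keep the sketch short.
pub extern "C" fn demo_result_json() -> *mut c_char {
    let json = CString::new(r#"{"text":"xin chào"}"#).expect("no interior NUL");
    json.into_raw()
}

/// Reclaims a string previously returned by `demo_result_json`.
/// Passing a null pointer is a no-op.
///
/// # Safety
/// `ptr` must be null or a pointer obtained from `demo_result_json`
/// that has not been freed yet.
pub unsafe extern "C" fn demo_string_free(ptr: *mut c_char) {
    if !ptr.is_null() {
        // Rebuild the CString so Rust's allocator frees the buffer.
        unsafe { drop(CString::from_raw(ptr)) };
    }
}
```

Calling C's `free()` on such a pointer would be undefined behaviour, because the buffer came from Rust's allocator. That is presumably why the API provides a dedicated free function instead of asking callers to release the memory themselves.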
Roadmap
- v0.3.0 — Silero VAD streaming, configurable overlap-buffer, auto-benchmark CI
- v0.4.0 — Polyvoice diarization, security/resource hardening, benchmark RSS
- iOS build verification (CoreML + `ffi` feature)
- Quantized embedding extractor for faster diarization
- Offline batch re-clustering pass for improved speaker accuracy
See TODO.md for the full tracker.
Contributing
See CONTRIBUTING.md. Quick start for developers:
Acknowledgements
phostt is a Vietnamese fork of gigastt,
which provides the production-grade server scaffolding (HTTP/WS/SSE,
rate-limit, metrics, graceful shutdown). The Vietnamese inference uses the
Zipformer-Transducer weights packaged by the
sherpa-onnx project.
License
MIT — see LICENSE.