phostt turns any machine into a Vietnamese speech recognition server that runs entirely on-device. Zipformer-vi RNN-T weights ship pre-quantized from sherpa-onnx; the binary is one command, the model is ~75 MB, everything runs locally.
The server exposes a WebSocket endpoint at `ws://127.0.0.1:9876/v1/ws` and a REST endpoint at `http://127.0.0.1:9876/v1/transcribe`.
Or build from source:
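A sketch of the usual Cargo flow (the repository URL below is a placeholder, not the real one):

```shell
# Clone and build a release binary (repository URL is a placeholder)
git clone https://github.com/example/phostt.git
cd phostt
cargo build --release

# The optimized binary lands in target/release/
./target/release/phostt --help
```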
Table of Contents
- Why phostt?
- Features
- Platform Support
- Quick Start
- Benchmarks
- Quality / WER
- Architecture
- Mobile / FFI
- Roadmap
- Known Limitations
- Troubleshooting
- Contributing
- Security
- Acknowledgements
- License
Why phostt?
| | phostt | PhoWhisper-large | Cloud APIs |
|---|---|---|---|
| Architecture | Zipformer + RNN-T | Whisper enc-dec | varies |
| Model size (INT8) | ~75 MB | ~1.5 GB | server-side |
| WER (GigaSpeech2-vi) | ~7.7% | n/a | varies |
| Latency (3.7 s audio) | ~61 ms | ~300 ms | network + queue |
| Throughput | 61× RTF | ~3× RTF | varies |
| Privacy | 100% local | 100% local | data leaves device |
| Cost | free forever | free | $0.006/min+ |
| Setup | `cargo install` | Python + deps | API key + billing |
| Streaming | real-time WebSocket | batch only | varies |
The Zipformer-vi-30M weights ship via sherpa-onnx releases (Apache 2.0, trained on ~70,000 hours of Vietnamese speech, ICASSP-published WER on the VLSP and GigaSpeech2 benchmarks).
Features
- Real-time streaming — partial transcription via WebSocket as you speak
- REST API + SSE — file transcription with instant or streaming response
- Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
- Pre-quantized INT8 — encoder ships at ~75 MB INT8 from upstream
- Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
- Auto-download — model fetched from sherpa-onnx GitHub releases on first run
- Speaker diarization — optional `diarization` feature for multi-speaker sessions
- Docker ready — CPU and CUDA images with multi-stage builds
- Android FFI — C-ABI + Kotlin bridge for mobile integration
- Hardened — connection limits, frame caps, idle timeouts, sanitized errors, rate limiting
Platform Support
| Platform | Target | Backend | Notes |
|---|---|---|---|
| macOS (Apple Silicon) | `aarch64-apple-darwin` | CoreML / CPU | Neural Engine + CPU fallback |
| macOS (Intel) | `x86_64-apple-darwin` | CPU | |
| Linux (x86_64) | `x86_64-unknown-linux-gnu` | CUDA 12+ / CPU | CUDA via `--features cuda` |
| Linux (ARM64) | `aarch64-unknown-linux-gnu` | CPU | Buildable, not CI-tested yet |
| Android | `aarch64-linux-android`, `armv7-linux-androideabi` | NNAPI / CPU | Via `cargo-ndk` + `ffi` feature |
| Windows | `x86_64-pc-windows-msvc` | CPU | Community-maintained |
iOS is theoretically supported via CoreML (`--features coreml,ffi`), but not yet verified in CI.
Quick Start
Install
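If the crate is published under the project name (an assumption), installation is a single Cargo command:

```shell
# Install the release binary from crates.io (crate name assumed to match the project)
cargo install phostt
```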
The first run downloads the ~75 MB Zipformer-vi ONNX bundle automatically into ~/.phostt/models/.
Serve
Once started, the server listens on `ws://127.0.0.1:9876/v1/ws` (WebSocket) and `http://127.0.0.1:9876/v1/transcribe` (REST).
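Starting the server is presumably a single subcommand; a sketch (the subcommand name is an assumption):

```shell
# Start the server on the default port (subcommand name is an assumption)
phostt serve
```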
Smoke test
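Assuming the binary exposes a one-shot transcription mode and ships a test fixture (both assumptions), a smoke test might look like:

```shell
# Transcribe the bundled Vietnamese fixture in one shot
# (subcommand and fixture path are assumptions)
phostt transcribe tests/fixtures/vi-sample.wav
```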
Expected output (from the bundled Vietnamese test fixture):
```
RỒI CŨNG HỖ TRỢ CHO LÂU LÂU CŨNG CHO GẠO CHO NÀY KIA
```
Usage Examples
REST API (single file):
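A plain curl upload against the documented endpoint; the content-type handling and response shape are not specified here, so treat this as a sketch:

```shell
# Upload a WAV file and print the server's response
curl -s -X POST http://127.0.0.1:9876/v1/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav
```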
REST API (streaming SSE):
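The endpoint is the documented one; requesting the streaming variant via an `Accept` header is an assumption:

```shell
# Stream transcription events as they are produced (-N disables curl's output buffering)
curl -s -N -X POST http://127.0.0.1:9876/v1/transcribe \
  -H "Content-Type: audio/wav" \
  -H "Accept: text/event-stream" \
  --data-binary @audio.wav
```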
WebSocket (real-time):
Connect and stream PCM16 chunks as you speak; partial transcripts come back over the same socket.
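One way to exercise the socket from a shell, using the third-party `sox` and `websocat` tools; the 16 kHz mono PCM16 capture format expected by the server is an assumption:

```shell
# Capture 16 kHz mono PCM16 from the default mic and stream it as binary frames
# (expected sample format is an assumption; sox and websocat are third-party tools)
sox -d -t raw -r 16000 -e signed -b 16 -c 1 - \
  | websocat --binary ws://127.0.0.1:9876/v1/ws
```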
With hardware acceleration: CoreML drives the Neural Engine on macOS Apple Silicon, and CUDA 12 accelerates Linux machines with NVIDIA GPUs.
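Feature names below come from the Platform Support table; enabling them at install time like this is an assumption:

```shell
# macOS Apple Silicon — CoreML / Neural Engine build
cargo install phostt --features coreml

# Linux + NVIDIA — CUDA 12 build
cargo install phostt --features cuda
```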
Docker
Images are published for CPU (any platform) and for CUDA (Linux with the NVIDIA Container Toolkit).
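A sketch of running the two images; the image names are placeholders:

```shell
# CPU image (any platform) — image names are placeholders
docker run --rm -p 9876:9876 phostt:cpu

# CUDA image (requires the NVIDIA Container Toolkit)
docker run --rm --gpus all -p 9876:9876 phostt:cuda
```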
Or use Docker Compose:
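Assuming the repository ships a `docker-compose.yml` defining the service:

```shell
# Bring the service up in the background
docker compose up -d
```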
Benchmarks
Measured on Apple Silicon M2 Pro, release build, 3.74 s Vietnamese test audio:
| Backend | Mean Latency | Median | P95 | RTF | Peak RSS |
|---|---|---|---|---|---|
| CPU | 60 ms | 60 ms | 61 ms | 62× | 1.4 GB |
| CoreML (Neural Engine) | 93 ms | 90 ms | 124 ms | 40× | 1.2 GB |
RTF = real-time factor (audio seconds processed per wall-clock second). For this 30M-param INT8 model, CPU is faster than CoreML on Apple Silicon.
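As a sanity check on the table above, RTF is simply audio duration divided by processing time; for the CPU row:

```shell
# 3.74 s of audio processed in ~0.060 s of wall-clock time → RTF ≈ 62×
awk 'BEGIN { printf "%.0f\n", 3.74 / 0.060 }'
# → 62
```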
Auto-updated benchmark history: BENCHMARKS.md.
Quality / WER
GigaSpeech2-vi (clean): ~7.7% — published upstream benchmark on clean Vietnamese speech.
For detailed benchmark history (latency, throughput, memory) and regression tracking datasets, see BENCHMARKS.md.
Architecture
```
┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐
│   Client    │────▶│  axum HTTP  │────▶│     SessionPool     │
│  (WS/REST)  │     │   router    │     │   (async-channel)   │
└─────────────┘     └─────────────┘     └─────────────────────┘
                                                   │
                         ┌─────────────────────────┘
                         ▼
                ┌────────────────┐
                │ SessionTriplet │──▶ Zipformer Encoder (ONNX)
                │ (enc/dec/join) │──▶ RNN-T Decoder (greedy)
                └────────────────┘──▶ Joiner
                         │
                         ▼
                ┌──────────────────┐
                │ StreamingState   │──▶ overlap-buffer / VAD
                │ (per-connection) │ →  partial + final segments
                └──────────────────┘
```
Mobile / FFI
phostt exposes a C-ABI for Android integration:
```c
// Function names below are illustrative — see the ffi/ headers for the exact symbols.
PhosttEngine* engine = phostt_engine_new("/path/to/model/dir");
PhosttStream* stream = phostt_stream_new(engine);
/* ... feed PCM16 audio chunks into the stream ... */
char* json = phostt_stream_result_json(stream);
// ... free with phostt_string_free(json) ...
```
See ANDROID.md for NDK setup, Kotlin bridge (ffi/android/PhosttBridge.kt), and model bundling strategies.
Roadmap
- v0.3.0 — Silero VAD streaming, configurable overlap-buffer, auto-benchmark CI
- v0.4.0 — Polyvoice diarization, security/resource hardening, benchmark RSS
- v0.4.1 — Dependency updates (rubato 2.0, sha2 0.11), docs polish, CI improvements
- iOS build verification (CoreML + `ffi` feature) — theoretically supported, not yet CI-tested
- Quantized embedding extractor for faster diarization
- Offline batch re-clustering pass for improved speaker accuracy
Known Limitations
- Out-of-domain audio (English loanwords, numbers, proper names) may produce phonetic Vietnamese transcriptions rather than verbatim text. This is expected for a mono-lingual model trained on ~70,000 hours of Vietnamese speech.
- Memory footprint (~1.4 GB peak RSS) may be too heavy for <2 GB devices. Consider the CPU backend and a smaller batch size for embedded use.
- iOS is theoretically supported via CoreML + `ffi`, but has not been verified in CI.
- Windows builds are community-maintained and not CI-tested.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Model not found on first run | Auto-download failed or proxy blocks GitHub | Set `PHOSTT_MODEL_DIR` to a local path with extracted weights |
| High latency (>200 ms) on CPU | Debug build or missing release profile | Always run `cargo run --release` or `cargo install` |
| CoreML slower than CPU | Neural Engine overhead on short audio | CPU is actually faster for this 30M-param INT8 model; CoreML wins on larger models |
| SIGKILL during model load | OOM on low-RAM system | Close other apps, use CPU backend, or run on a machine with ≥4 GB RAM |
| WebSocket closes immediately | Rate limit hit or origin mismatch | Check logs; disable rate limiting with `--rate-limit 0` for local testing |
| Diarization missing speakers | `diarization` feature not enabled | Rebuild with `--features diarization` |
See TODO.md for the full tracker.
Contributing
See CONTRIBUTING.md. Quick start for developers:
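A standard Cargo development loop (project-specific tasks, if any, live in CONTRIBUTING.md):

```shell
# Build, test, lint, and check formatting with stock Cargo tooling
cargo build
cargo test
cargo clippy --all-targets -- -D warnings
cargo fmt --check
```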
Security
Please report security vulnerabilities privately — see SECURITY.md for contact details and supported versions.
Acknowledgements
phostt is a Vietnamese fork of gigastt,
which provides the production-grade server scaffolding (HTTP/WS/SSE,
rate-limit, metrics, graceful shutdown). The Vietnamese inference uses the
Zipformer-Transducer weights packaged by the
sherpa-onnx project.
License
MIT — see LICENSE.