Status: 0.1.0 — pre-alpha, work in progress. The server scaffolding (HTTP, WebSocket, SSE, rate limiting, metrics, graceful shutdown) is forked from the production-grade gigastt stack. The Vietnamese inference path (model fetch, mel features, BPE tokenizer, RNN-T decode for Zipformer-vi) is the active work in progress, tracked toward the first functional release.
phostt turns any machine into a Vietnamese speech recognition server that runs entirely on-device. Zipformer-vi RNN-T weights ship pre-quantized from sherpa-onnx; install is one command, the model is ~75 MB, and everything runs locally.
```bash
# Once 0.1.0 ships:
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API:  http://127.0.0.1:9876/v1/transcribe
```
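As a sketch of the intended flow (the `cargo install` path is mentioned in the table below; the exact CLI and the multipart field name here are assumptions, not a final API):

```bash
# Hypothetical quick start; names and flags may change before 0.1.0.
cargo install phostt        # single-command install
phostt                      # start the local server on 127.0.0.1:9876

# From another shell, POST an audio file to the REST endpoint:
curl -s -X POST http://127.0.0.1:9876/v1/transcribe \
  -F "file=@hello.wav"      # the 'file' field name is an assumption
```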
## Why phostt?
| | phostt | PhoWhisper-large | Cloud APIs |
|---|---|---|---|
| Architecture | Zipformer + RNN-T | Whisper enc-dec | varies |
| Model size (INT8) | ~75 MB | ~1.5 GB | server-side |
| WER (GigaSpeech2-vi) | ~7.7% | n/a | varies |
| Privacy | 100% local | 100% local | data leaves device |
| Cost | free forever | free | $0.006/min+ |
| Setup | `cargo install` | Python + deps | API key + billing |
| Streaming | real-time WebSocket | batch only | varies |
The Zipformer-vi-30M weights ship via sherpa-onnx releases (Apache 2.0, trained on ~6000 hours of Vietnamese speech, ICASSP-published WER on the VLSP and GigaSpeech2 benchmarks).
## Features
- Real-time streaming — partial transcription via WebSocket as you speak ✅ (see the sketch after this list)
- REST API + SSE — file transcription with instant or streaming response
- Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
- Pre-quantized INT8 — encoder ships at ~75 MB INT8 from upstream
- Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
- Auto-download — model fetched from sherpa-onnx GitHub releases on first run
- Docker ready — CPU and CUDA images with multi-stage builds
- Hardened — connection limits, frame caps, idle timeouts, sanitized errors
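A minimal way to poke the streaming endpoint once it ships might look like the following. It assumes the server accepts raw binary PCM frames over the WebSocket, which is an assumption about a wire format that is not yet final; `websocat` and `arecord` are stock tools, not part of phostt:

```bash
# Hypothetical streaming smoke test; the framing/protocol is not final.
# Capture 16 kHz mono PCM from the default mic and pipe it over the WebSocket.
arecord -f S16_LE -r 16000 -c 1 -t raw \
  | websocat -b ws://127.0.0.1:9876/v1/ws
```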
## Quick Start (preview)
```bash
# From source (Rust toolchain required; steps are provisional pre-0.1.0)
cargo build --release
```
The Zipformer-vi ONNX bundle (~75 MB) downloads automatically on first run into `~/.phostt/models/`.
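To verify the download, the cache directory can be inspected directly (the path comes from this README; the file names inside are illustrative):

```bash
# List the cached model files after the first run.
ls -lh ~/.phostt/models/
```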
### Smoke test
With the server running (or using the transcribe command directly; the CLI surface is provisional):
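```bash
# Hypothetical invocation: the transcribe command is named in this README,
# but the exact arguments are not final and the file name is a placeholder.
phostt transcribe hello.wav
```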
Expected output (approximate — exact text depends on the test utterance):

```
xin chào
```
Latency note: a debug build processes 3.7 s of audio in ~50 ms total on an M1. This is approximate and will vary by hardware and build profile; release builds with LTO are significantly faster.
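For numbers closer to what users will see, time a release build instead (a sketch only; it reuses the hypothetical `phostt transcribe` invocation from the smoke test above):

```bash
# Build with the release profile (where LTO is typically enabled) and time it.
cargo build --release
time ./target/release/phostt transcribe hello.wav
```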
## Docker
```bash
# CPU (any platform); image names below are placeholders until images are published
docker run --rm -p 9876:9876 phostt:cpu

# CUDA (Linux + NVIDIA Container Toolkit)
docker run --rm --gpus all -p 9876:9876 phostt:cuda
```
## Acknowledgements
phostt is a Vietnamese fork of gigastt,
which provides the production-grade server scaffolding (HTTP/WS/SSE,
rate-limit, metrics, graceful shutdown). The Vietnamese inference uses the
Zipformer-Transducer weights packaged by the
sherpa-onnx project.
## License
MIT — see LICENSE.