phostt 0.4.1

Local STT server powered by Zipformer-vi RNN-T — on-device Vietnamese speech recognition via ONNX Runtime
Documentation

phostt turns any machine into a Vietnamese speech recognition server that runs entirely on-device. The Zipformer-vi RNN-T weights ship pre-quantized from sherpa-onnx; installation is one command, the model is ~75 MB, and everything runs locally.

cargo install phostt && phostt serve
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API:  http://127.0.0.1:9876/v1/transcribe

Or build from source:

git clone https://github.com/ekhodzitsky/phostt
cd phostt
cargo run --release -- serve

Why phostt?

                       phostt               PhoWhisper-large  Cloud APIs
Architecture           Zipformer + RNN-T    Whisper enc-dec   varies
Model size (INT8)      ~75 MB               ~1.5 GB           server-side
WER (GigaSpeech2-vi)   ~7.7%                n/a               varies
Latency (3.7 s audio)  ~61 ms               ~300 ms           network + queue
Throughput             61× RTF              ~3× RTF           varies
Privacy                100% local           100% local        data leaves device
Cost                   free forever         free              $0.006/min+
Setup                  cargo install        Python + deps     API key + billing
Streaming              real-time WebSocket  batch only        varies

The Zipformer-vi-30M weights ship via sherpa-onnx releases (Apache 2.0, trained on ~70,000 hours of Vietnamese speech, ICASSP-published WER on the VLSP and GigaSpeech2 benchmarks).

Features

  • Real-time streaming — partial transcription via WebSocket as you speak
  • REST API + SSE — file transcription with instant or streaming response
  • Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
  • Pre-quantized INT8 — encoder ships at ~75 MB INT8 from upstream
  • Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
  • Auto-download — model fetched from sherpa-onnx GitHub releases on first run
  • Speaker diarization — optional diarization feature for multi-speaker sessions
  • Docker ready — CPU and CUDA images with multi-stage builds
  • Android FFI — C-ABI + Kotlin bridge for mobile integration
  • Hardened — connection limits, frame caps, idle timeouts, sanitized errors, rate limiting

Platform Support

Platform               Target                                          Backend         Notes
macOS (Apple Silicon)  aarch64-apple-darwin                            CoreML / CPU    Neural Engine + CPU fallback
macOS (Intel)          x86_64-apple-darwin                             CPU
Linux (x86_64)         x86_64-unknown-linux-gnu                        CUDA 12+ / CPU  CUDA via --features cuda
Linux (ARM64)          aarch64-unknown-linux-gnu                       CPU             Buildable, not CI-tested yet
Android                aarch64-linux-android, armv7-linux-androideabi  NNAPI / CPU     Via cargo-ndk + ffi feature
Windows                x86_64-pc-windows-msvc                          CPU             Community-maintained

iOS is theoretically supported via CoreML (--features coreml,ffi), but not yet verified in CI.

Quick Start

Install

cargo install phostt

The first run downloads the ~75 MB Zipformer-vi ONNX bundle automatically into ~/.phostt/models/.

Serve

phostt serve
# Listening on ws://127.0.0.1:9876/v1/ws
# REST API at http://127.0.0.1:9876/v1/transcribe

Smoke test

phostt transcribe ~/.phostt/models/test_wavs/0.wav

Expected output (from the bundled Vietnamese test fixture):

RỒI CŨNG HỖ TRỢ CHO LÂU LÂU CŨNG CHO GẠO CHO NÀY KIA

Usage Examples

REST API (single file):

curl -X POST http://localhost:9876/v1/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
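The same request can be made from Python with only the standard library. A minimal sketch, assuming the endpoint and Content-Type shown in the curl example above; the shape of the response body is not documented here, so it is returned as raw text:

```python
import urllib.request

def build_request(wav_bytes, url="http://localhost:9876/v1/transcribe"):
    # Build the POST request separately so it can be inspected without a server.
    return urllib.request.Request(
        url,
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

def transcribe(path, url="http://localhost:9876/v1/transcribe"):
    # Requires `phostt serve` running locally.
    with open(path, "rb") as f:
        req = build_request(f.read(), url)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```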

REST API (streaming SSE):

curl -X POST http://localhost:9876/v1/stream \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
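On the client side, the SSE response can be split into events with a few lines of Python. A generic sketch that follows the SSE wire format (data: lines, blank-line event separators); the payload fields inside each event are defined by the server and not assumed here:

```python
def parse_sse(stream_text):
    # Split an SSE stream into events: each event's data lines are joined
    # with newlines, and events are separated by blank lines.
    events, data = [], []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            data.append(line[5:].lstrip())
        elif line == "" and data:
            events.append("\n".join(data))
            data = []
    if data:  # final event without a trailing blank line
        events.append("\n".join(data))
    return events
```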

WebSocket (real-time):

# Connect and stream PCM16 chunks as you speak
websocat ws://localhost:9876/v1/ws
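Before streaming over the socket, a client has to slice its audio into binary PCM16 messages. A standard-library sketch; the 16 kHz mono format matches the sample rate passed in the FFI example below, and the 100 ms chunk size is an illustrative choice, not a protocol requirement:

```python
import io
import wave

def pcm16_chunks(wav_bytes, chunk_ms=100):
    # Slice a mono 16-bit PCM WAV into fixed-duration raw frames.
    # Each yielded chunk is ready to send as one binary WebSocket message.
    with wave.open(io.BytesIO(wav_bytes)) as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        frames_per_chunk = w.getframerate() * chunk_ms // 1000
        while True:
            chunk = w.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk
```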

With hardware acceleration (backends are Cargo features, selected at build time rather than on the phostt command line):

# macOS Apple Silicon — CoreML Neural Engine
cargo install phostt --features coreml
phostt serve

# Linux + NVIDIA — CUDA 12
cargo install phostt --features cuda
phostt serve

Docker

# CPU (any platform)
docker build -t phostt .
docker run -p 9876:9876 phostt

# CUDA (Linux + NVIDIA Container Toolkit)
docker build -f Dockerfile.cuda -t phostt-cuda .
docker run --gpus all -p 9876:9876 phostt-cuda

Benchmarks

Measured on Apple Silicon M2 Pro, release build, 3.74 s Vietnamese test audio:

Backend                 Mean Latency  Median  P95     RTF  Peak RSS
CPU                     60 ms         60 ms   61 ms   62×  1.4 GB
CoreML (Neural Engine)  93 ms         90 ms   124 ms  40×  1.2 GB

RTF = real-time factor (audio seconds processed per wall-clock second). For this 30M-param INT8 model, CPU is faster than CoreML on Apple Silicon.
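The RTF column follows directly from clip length and mean latency:

```python
def rtf(audio_seconds, wall_seconds):
    # Real-time factor: audio seconds processed per wall-clock second.
    return audio_seconds / wall_seconds

# 3.74 s clip decoded in 60 ms on CPU -> ~62x real time
print(round(rtf(3.74, 0.060)))  # 62
```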

Auto-updated benchmark history: BENCHMARKS.md.

Quality / WER

GigaSpeech2-vi (clean): ~7.7% — published upstream benchmark on clean Vietnamese speech.

For detailed benchmark history (latency, throughput, memory) and regression tracking datasets, see BENCHMARKS.md.

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐
│   Client    │────▶│  axum HTTP  │────▶│   SessionPool       │
│  (WS/REST)  │     │   router    │     │  (async-channel)    │
└─────────────┘     └─────────────┘     └─────────────────────┘
                                                │
                    ┌───────────────────────────┘
                    ▼
           ┌────────────────┐
           │ SessionTriplet │──▶ Zipformer Encoder (ONNX)
           │ (enc/dec/join) │──▶ RNN-T Decoder (greedy)
           └────────────────┘──▶ Joiner
                    │
                    ▼
           ┌─────────────────┐
           │ StreamingState  │──▶ overlap-buffer / VAD
           │ (per-connection)│    → partial + final segments
           └─────────────────┘
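The overlap-buffer idea in the diagram is to carry the tail of each chunk into the next window, so that words falling on a chunk boundary are decoded twice and not lost. A generic sketch of the technique (illustrative only; phostt's actual window and overlap sizes live in the source):

```python
def overlapped_windows(samples, window=48000, overlap=8000):
    # Yield successive windows that share `overlap` samples with their
    # predecessor; sizes here are illustrative, not phostt's real values.
    step = window - overlap
    for start in range(0, max(len(samples) - overlap, 1), step):
        yield samples[start:start + window]
```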

Mobile / FFI

phostt exposes a C-ABI for Android integration:

PhosttEngine* engine = phostt_engine_new("/path/to/models");
PhosttStream* stream = phostt_stream_new(engine);
char* json = phostt_stream_process_chunk(engine, stream, pcm16, len, 16000);
// ... free with phostt_string_free(json) ...

See ANDROID.md for NDK setup, Kotlin bridge (ffi/android/PhosttBridge.kt), and model bundling strategies.

Roadmap

  • v0.3.0 — Silero VAD streaming, configurable overlap-buffer, auto-benchmark CI
  • v0.4.0 — Polyvoice diarization, security/resource hardening, benchmark RSS
  • iOS build verification (CoreML + ffi feature) — theoretically supported, not yet CI-tested
  • Quantized embedding extractor for faster diarization
  • Offline batch re-clustering pass for improved speaker accuracy

Known Limitations

  • Out-of-domain audio (English loanwords, numbers, proper names) may produce phonetic Vietnamese transcriptions rather than verbatim text. This is expected for a monolingual model trained on ~70,000 hours of Vietnamese speech.
  • Memory footprint (~1.4 GB peak RSS) may be too heavy for <2 GB devices. Consider the CPU backend and a smaller batch size for embedded use.
  • iOS is theoretically supported via CoreML + ffi, but has not been verified in CI.
  • Windows builds are community-maintained and not CI-tested.

Troubleshooting

Symptom                         Cause                                        Fix
Model not found on first run    Auto-download failed or proxy blocks GitHub  Set PHOSTT_MODEL_DIR to a local path with extracted weights
High latency (>200 ms) on CPU   Debug build or missing release profile       Always run cargo run --release or cargo install
CoreML slower than CPU          Neural Engine overhead on short audio        CPU is actually faster for this 30M-param INT8 model; CoreML wins on larger models
SIGKILL during model load       OOM on low-RAM system                        Close other apps, use the CPU backend, or run on a machine with ≥4 GB RAM
WebSocket closes immediately    Rate limit hit or origin mismatch            Check logs; disable rate limiting with --rate-limit 0 for local testing
Diarization missing speakers    diarization feature not enabled              Rebuild with --features diarization

See TODO.md for the full tracker.

Contributing

See CONTRIBUTING.md. Quick start for developers:

cargo build --release --features coreml   # or cuda
cargo test                                # 146 fast unit tests, no model needed
cargo clippy --all-targets -- -D warnings
cargo deny check

Acknowledgements

phostt is a Vietnamese fork of gigastt, which provides the production-grade server scaffolding (HTTP/WS/SSE, rate-limit, metrics, graceful shutdown). The Vietnamese inference uses the Zipformer-Transducer weights packaged by the sherpa-onnx project.

License

MIT — see LICENSE.