phostt 0.4.1

Local STT server powered by Zipformer-vi RNN-T — on-device Vietnamese speech recognition via ONNX Runtime
Documentation

phostt turns any machine into a Vietnamese speech recognition server that runs entirely on-device. The Zipformer-vi RNN-T weights ship pre-quantized from sherpa-onnx; installation is one command, the model is ~75 MB, and everything runs locally.

cargo install phostt && phostt serve
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API:  http://127.0.0.1:9876/v1/transcribe

Or build from source:

git clone https://github.com/ekhodzitsky/phostt
cd phostt
cargo run --release -- serve

Why phostt?

                       phostt               PhoWhisper-large  Cloud APIs
Architecture           Zipformer + RNN-T    Whisper enc-dec   varies
Model size (INT8)      ~75 MB               ~1.5 GB           server-side
WER (GigaSpeech2-vi)   ~7.7%                n/a               varies
Latency (3.7 s audio)  ~61 ms               ~300 ms           network + queue
Throughput             61× RTF              ~3× RTF           varies
Privacy                100% local           100% local        data leaves device
Cost                   free forever         free              $0.006/min+
Setup                  cargo install        Python + deps     API key + billing
Streaming              real-time WebSocket  batch only        varies

The Zipformer-vi-30M weights ship via sherpa-onnx releases (Apache 2.0, trained on ~70,000 hours of Vietnamese speech, ICASSP-published WER on the VLSP and GigaSpeech2 benchmarks).

Features

  • Real-time streaming — partial transcription via WebSocket as you speak
  • REST API + SSE — file transcription with instant or streaming response
  • Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
  • Pre-quantized INT8 — encoder ships at ~75 MB INT8 from upstream
  • Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
  • Auto-download — model fetched from sherpa-onnx GitHub releases on first run
  • Speaker diarization — optional diarization feature for multi-speaker sessions
  • Docker ready — CPU and CUDA images with multi-stage builds
  • Android FFI — C-ABI + Kotlin bridge for mobile integration
  • Hardened — connection limits, frame caps, idle timeouts, sanitized errors, rate limiting

Platform Support

Platform               Target                                          Backend         Notes
macOS (Apple Silicon)  aarch64-apple-darwin                            CoreML / CPU    Neural Engine + CPU fallback
macOS (Intel)          x86_64-apple-darwin                             CPU
Linux (x86_64)         x86_64-unknown-linux-gnu                        CUDA 12+ / CPU  CUDA via --features cuda
Linux (ARM64)          aarch64-unknown-linux-gnu                       CPU             Buildable, not CI-tested yet
Android                aarch64-linux-android, armv7-linux-androideabi  NNAPI / CPU     Via cargo-ndk + ffi feature
Windows                x86_64-pc-windows-msvc                          CPU             Community-maintained

iOS is theoretically supported via CoreML (--features coreml,ffi), but not yet verified in CI.

Quick Start

Install

cargo install phostt

The first run downloads the ~75 MB Zipformer-vi ONNX bundle automatically into ~/.phostt/models/.

Serve

phostt serve
# Listening on ws://127.0.0.1:9876/v1/ws
# REST API at http://127.0.0.1:9876/v1/transcribe

Smoke test

phostt transcribe ~/.phostt/models/test_wavs/0.wav

Expected output (from the bundled Vietnamese test fixture):

RỒI CŨNG HỖ TRỢ CHO LÂU LÂU CŨNG CHO GẠO CHO NÀY KIA

Usage Examples

REST API (single file):

curl -X POST http://localhost:9876/v1/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
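The same request can be made from Python with only the standard library. A minimal sketch, assuming the endpoint and Content-Type shown in the curl example above; the shape of the response body is not documented here, so it is returned as raw text:

```python
import urllib.request

def build_request(wav_bytes, url="http://localhost:9876/v1/transcribe"):
    # Build the POST request separately so it can be inspected without a server.
    return urllib.request.Request(
        url,
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

def transcribe(path, url="http://localhost:9876/v1/transcribe"):
    # Requires `phostt serve` running locally.
    with open(path, "rb") as f:
        req = build_request(f.read(), url)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```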

REST API (streaming SSE):

curl -X POST http://localhost:9876/v1/stream \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
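On the client side, the SSE response can be split into events with a few lines of Python. A generic sketch that follows the SSE wire format (data: lines, blank-line event separators); the payload fields inside each event are defined by the server and not assumed here:

```python
def parse_sse(stream_text):
    # Split an SSE stream into events: each event's data lines are joined
    # with newlines, and events are separated by blank lines.
    events, data = [], []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            data.append(line[5:].lstrip())
        elif line == "" and data:
            events.append("\n".join(data))
            data = []
    if data:  # final event without a trailing blank line
        events.append("\n".join(data))
    return events
```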

WebSocket (real-time):

# Connect and stream PCM16 chunks as you speak
websocat ws://localhost:9876/v1/ws
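Before streaming over the socket, a client has to slice its audio into binary PCM16 messages. A standard-library sketch; the 16 kHz mono format matches the sample rate passed in the FFI example below, and the 100 ms chunk size is an illustrative choice, not a protocol requirement:

```python
import io
import wave

def pcm16_chunks(wav_bytes, chunk_ms=100):
    # Slice a mono 16-bit PCM WAV into fixed-duration raw frames.
    # Each yielded chunk is ready to send as one binary WebSocket message.
    with wave.open(io.BytesIO(wav_bytes)) as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        frames_per_chunk = w.getframerate() * chunk_ms // 1000
        while True:
            chunk = w.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk
```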

With hardware acceleration (backends are Cargo features, selected at build time rather than on the phostt command line):

# macOS Apple Silicon — CoreML Neural Engine
cargo install phostt --features coreml
phostt serve

# Linux + NVIDIA — CUDA 12
cargo install phostt --features cuda
phostt serve

Docker

# CPU (any platform)
docker build -t phostt .
docker run -p 9876:9876 phostt

# CUDA (Linux + NVIDIA Container Toolkit)
docker build -f Dockerfile.cuda -t phostt-cuda .
docker run --gpus all -p 9876:9876 phostt-cuda

Benchmarks

Measured on Apple Silicon M2 Pro, release build, 3.74 s Vietnamese test audio:

Backend                 Mean Latency  Median  P95     RTF  Peak RSS
CPU                     60 ms         60 ms   61 ms   62×  1.4 GB
CoreML (Neural Engine)  93 ms         90 ms   124 ms  40×  1.2 GB

RTF = real-time factor (audio seconds processed per wall-clock second). For this 30M-param INT8 model, CPU is faster than CoreML on Apple Silicon.
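The RTF column follows directly from clip length and mean latency:

```python
def rtf(audio_seconds, wall_seconds):
    # Real-time factor: audio seconds processed per wall-clock second.
    return audio_seconds / wall_seconds

# 3.74 s clip decoded in 60 ms on CPU -> ~62x real time
print(round(rtf(3.74, 0.060)))  # 62
```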

Auto-updated benchmark history: BENCHMARKS.md.

Quality / WER

GigaSpeech2-vi (clean): ~7.7% — published upstream benchmark on clean Vietnamese speech.

For detailed benchmark history (latency, throughput, memory) and regression tracking datasets, see BENCHMARKS.md.

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐
│   Client    │────▶│  axum HTTP  │────▶│   SessionPool       │
│  (WS/REST)  │     │   router    │     │  (async-channel)    │
└─────────────┘     └─────────────┘     └─────────────────────┘
                                                │
                    ┌───────────────────────────┘
                    ▼
           ┌────────────────┐
           │ SessionTriplet │──▶ Zipformer Encoder (ONNX)
           │ (enc/dec/join) │──▶ RNN-T Decoder (greedy)
           └────────────────┘──▶ Joiner
                    │
                    ▼
           ┌─────────────────┐
           │ StreamingState  │──▶ overlap-buffer / VAD
           │ (per-connection)│    → partial + final segments
           └─────────────────┘
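The overlap-buffer idea in the diagram is to carry the tail of each chunk into the next window, so that words falling on a chunk boundary are decoded twice and not lost. A generic sketch of the technique (illustrative only; phostt's actual window and overlap sizes live in the source):

```python
def overlapped_windows(samples, window=48000, overlap=8000):
    # Yield successive windows that share `overlap` samples with their
    # predecessor; sizes here are illustrative, not phostt's real values.
    step = window - overlap
    for start in range(0, max(len(samples) - overlap, 1), step):
        yield samples[start:start + window]
```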

Mobile / FFI

phostt exposes a C-ABI for Android integration:

PhosttEngine* engine = phostt_engine_new("/path/to/models");
PhosttStream* stream = phostt_stream_new(engine);
char* json = phostt_stream_process_chunk(engine, stream, pcm16, len, 16000);
// ... free with phostt_string_free(json) ...

See ANDROID.md for NDK setup, Kotlin bridge (ffi/android/PhosttBridge.kt), and model bundling strategies.

Roadmap

  • v0.3.0 — Silero VAD streaming, configurable overlap-buffer, auto-benchmark CI
  • v0.4.0 — Polyvoice diarization, security/resource hardening, benchmark RSS
  • iOS build verification (CoreML + ffi feature) — theoretically supported, not yet CI-tested
  • Quantized embedding extractor for faster diarization
  • Offline batch re-clustering pass for improved speaker accuracy

Known Limitations

  • Out-of-domain audio (English loanwords, numbers, proper names) may produce phonetic Vietnamese transcriptions rather than verbatim text. This is expected for a monolingual model trained on ~70,000 hours of Vietnamese speech.
  • Memory footprint (~1.4 GB peak RSS) may be too heavy for <2 GB devices. Consider the CPU backend and a smaller batch size for embedded use.
  • iOS is theoretically supported via CoreML + ffi, but has not been verified in CI.
  • Windows builds are community-maintained and not CI-tested.

Troubleshooting

Symptom                         Cause                                        Fix
Model not found on first run    Auto-download failed or proxy blocks GitHub  Set PHOSTT_MODEL_DIR to a local path with extracted weights
High latency (>200 ms) on CPU   Debug build or missing release profile       Always run cargo run --release or cargo install
CoreML slower than CPU          Neural Engine overhead on short audio        CPU is actually faster for this 30M-param INT8 model; CoreML wins on larger models
SIGKILL during model load       OOM on low-RAM system                        Close other apps, use the CPU backend, or run on a machine with ≥4 GB RAM
WebSocket closes immediately    Rate limit hit or origin mismatch            Check logs; disable rate limiting with --rate-limit 0 for local testing
Diarization missing speakers    diarization feature not enabled              Rebuild with --features diarization

See TODO.md for the full tracker.

Contributing

See CONTRIBUTING.md. Quick start for developers:

cargo build --release --features coreml   # or cuda
cargo test                                # 146 fast unit tests, no model needed
cargo clippy --all-targets -- -D warnings
cargo deny check

Acknowledgements

phostt is a Vietnamese fork of gigastt, which provides the production-grade server scaffolding (HTTP/WS/SSE, rate-limit, metrics, graceful shutdown). The Vietnamese inference uses the Zipformer-Transducer weights packaged by the sherpa-onnx project.

License

MIT — see LICENSE.