# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project
**phostt** — local speech-to-text server for Vietnamese, powered by Zipformer-vi RNN-T. On-device inference via ONNX Runtime. No cloud, no API keys, full privacy.
- **Repository**: https://github.com/ekhodzitsky/phostt
- **License**: MIT
- **Crates.io**: published — https://crates.io/crates/phostt
- **Status**: 0.4.3. Forked from [`gigastt`](https://github.com/ekhodzitsky/gigastt) v0.9.4 (Russian STT). The HTTP/WS/SSE/metrics/shutdown stack is production-grade and unchanged. The inference path (model fetch, 80-bin mel features, SentencePiece BPE tokenizer, stateless RNN-T decode, overlap-buffer streaming) is fully wired and tested against Vietnamese audio fixtures.
## Build & Test
```sh
cargo build # CPU-only debug build (default, any platform)
cargo build --features coreml # macOS ARM64 (CoreML / Neural Engine)
cargo build --features cuda # Linux x86_64 (CUDA 12+)
cargo build --release # Release build (LTO, stripped)
cargo test # Unit tests (no model required)
cargo clippy # Lint (should produce no warnings)
```
`--features coreml` is confirmed working on macOS ARM64 (Apple Silicon).
## Model
**Zipformer-vi-int8-2025-04-20** packaged by `sherpa-onnx` (Apache 2.0):
- Source bundle: `sherpa-onnx-zipformer-vi-int8-2025-04-20.tar.bz2`
(~77 MB INT8) from [k2-fsa/sherpa-onnx releases](https://github.com/k2-fsa/sherpa-onnx/releases)
- Files inside: `encoder.int8.onnx` (70.9 MB), `decoder.onnx` (5.2 MB),
`joiner.int8.onnx` (1.0 MB), `bpe.model` (271 KB), `tokens.txt`
- Sample rate: 16 kHz · Mel bins: 80 · Vocab: SentencePiece BPE
- Decode: RNN-T greedy (stateless decoder — no LSTM h/c)
- Training: ~70,000 hours of Vietnamese speech (`zzasdf/viet_iter3_pseudo_label`)
- WER: ~8–10% on VLSP2020-T1 (estimated, vs ~12% for the older 6k-hour model)
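
The greedy RNN-T decode with a stateless decoder listed above can be sketched roughly as follows. This is an illustration, not the code in `decode.rs`: `greedy_decode`, the `logits_fn` closure (standing in for the decoder and joiner ONNX sessions), the context size, and the per-frame symbol cap are all assumptions made for the example.

```rust
/// Greedy RNN-T decoding with a stateless decoder: the prediction network
/// only sees the last `context_size` emitted tokens, so no recurrent state
/// (LSTM h/c) is carried between steps.
///
/// `logits_fn(t, context)` is a hypothetical stand-in for decoder + joiner.
pub fn greedy_decode<F>(
    num_frames: usize,
    blank_id: usize,
    context_size: usize,
    max_symbols_per_frame: usize,
    mut logits_fn: F,
) -> Vec<usize>
where
    F: FnMut(usize, &[usize]) -> Vec<f32>,
{
    // The context starts out filled with blanks.
    let mut context = vec![blank_id; context_size];
    let mut tokens = Vec::new();
    for t in 0..num_frames {
        // Emit symbols for this frame until blank wins or the cap is hit.
        for _ in 0..max_symbols_per_frame {
            let best = argmax(&logits_fn(t, &context));
            if best == blank_id {
                break;
            }
            tokens.push(best);
            if !context.is_empty() {
                // Slide the fixed-size token context forward by one.
                context.remove(0);
                context.push(best);
            }
        }
    }
    tokens
}

/// Index of the largest value (0 for an empty slice).
fn argmax(xs: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in xs.iter().enumerate() {
        if x > xs[best] {
            best = i;
        }
    }
    best
}
```

The per-frame cap is a common guard against a joiner that never predicts blank; whether phostt uses one is not stated here.
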
The bundle is downloaded on first run into `~/.phostt/models/`. Each file is
verified against its SHA-256 checksum, moved into place with an atomic rename,
and given a normalized filename
(`encoder-epoch-12-avg-8.int8.onnx` → `encoder.int8.onnx`).
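
As a rough, std-only sketch of the filename normalization and atomic rename steps: the helper names (`normalize_name`, `install_atomically`), the `.partial` suffix, and the rule of cutting from `-epoch-` to the next `.` are assumptions for illustration and may not match `model/mod.rs`. SHA-256 verification is left out because it needs a hashing crate.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Drop a training-checkpoint suffix such as `-epoch-12-avg-8` from a bundle
/// file name. The rule (cut from `-epoch-` up to the next `.`) is an assumption.
pub fn normalize_name(name: &str) -> String {
    match name.find("-epoch-") {
        Some(start) => match name[start..].find('.') {
            Some(dot) => format!("{}{}", &name[..start], &name[start + dot..]),
            None => name[..start].to_string(),
        },
        None => name.to_string(),
    }
}

/// Write `bytes` to a temporary sibling file, then rename it into place so
/// readers never observe a half-written model file.
pub fn install_atomically(dir: &Path, name: &str, bytes: &[u8]) -> io::Result<()> {
    let normalized = normalize_name(name);
    let final_path = dir.join(&normalized);
    let tmp_path = dir.join(format!("{normalized}.partial"));
    let mut tmp = fs::File::create(&tmp_path)?;
    tmp.write_all(bytes)?;
    tmp.sync_all()?;
    // Checksum verification would go here, before the rename.
    fs::rename(&tmp_path, &final_path)
}
```

Keeping the temporary file in the same directory matters: a rename is only atomic when source and destination live on the same filesystem.
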
## Architecture
```
src/
lib.rs # Public module exports
main.rs # CLI (clap): serve, download, transcribe, inspect
model/mod.rs # Model bundle download (tar.bz2 → .onnx + bpe.model + tokens.txt)
inference/
mod.rs # Engine: ONNX session pool, StreamingState, DecoderState
features.rs # Mel spectrogram (80 bins, FFT=400, hop=160) via kaldi_native_fbank::OnlineFeature
tokenizer.rs # SentencePiece BPE (bpe.model)
decode.rs # RNN-T greedy decode (stateless decoder)
audio.rs # Audio loading, resampling, channel mixing
error.rs # Typed errors (PhosttError)
inspect.rs # ONNX session I/O metadata printer
server/
mod.rs # axum router: HTTP + WebSocket on single port
http.rs # REST handlers: /health, /v1/models, /v1/transcribe, /v1/transcribe/stream
rate_limit.rs # In-tree per-IP token-bucket rate limiter
metrics.rs # In-tree Prometheus text encoder
protocol/mod.rs # JSON message types for the WS protocol
```
### Streaming
- Zipformer-vi-30M is published as an offline transducer; phostt wraps it in
an overlap-buffer streaming layer (StreamingState) to expose the same
WebSocket protocol as the upstream gigastt server.
- Server accepts configurable sample rates (8/16/24/44.1/48 kHz) via the
`Configure` message; default 48 kHz is resampled to 16 kHz with rubato.
## Streaming model
Zipformer-vi-30M is an offline transducer. phostt wraps it in a sliding
overlap-buffer (`StreamingState`) to expose real-time WebSocket streaming.
Audio is chunked into configurable windows (default 4 s), each chunk is
featurized and encoded independently, and partial results are merged across
boundaries with optional fuzzy word matching to handle boundary instability.
`kaldi_native_fbank::OnlineFeature` in `inference/features.rs` drives the
feature extraction, providing the same povey-window + preemphasis + Slaney
mel filterbank pipeline that sherpa-onnx uses upstream. The `OnlineFeature`
wrapper handles streaming increments, while the overlap-and-merge logic in
`StreamingState` works around the offline encoder constraint.
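
To make the chunking arithmetic concrete, here is a minimal sketch of how overlapping windows could be laid out over a buffer of 16 kHz samples. `window_ranges` and `frames_per_window` are hypothetical helpers, not functions from `StreamingState`, and the real implementation works incrementally on a live stream rather than on a complete buffer.

```rust
/// Sample ranges `(start, end)` of overlapping windows over `total` samples.
/// Each window starts `window - overlap` samples after the previous one.
pub fn window_ranges(
    total: usize,
    window_ms: usize,
    overlap_ms: usize,
    sample_rate: usize,
) -> Vec<(usize, usize)> {
    let window = window_ms * sample_rate / 1000;
    let overlap = overlap_ms * sample_rate / 1000;
    assert!(window > overlap, "window must be longer than the overlap");
    let stride = window - overlap;
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < total {
        let end = (start + window).min(total);
        ranges.push((start, end));
        if end == total {
            break; // the last window reached the end of the audio
        }
        start += stride;
    }
    ranges
}

/// Approximate number of mel frames in one window, ignoring edge effects.
pub fn frames_per_window(window_ms: usize, hop_samples: usize, sample_rate: usize) -> usize {
    window_ms * sample_rate / 1000 / hop_samples
}
```

With the defaults (4000 ms window, 1000 ms overlap, 16 kHz), each window covers 64,000 samples and consecutive windows start 48,000 samples apart.
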
### Streaming modes
Two mutually exclusive streaming strategies are available:
1. **Overlap-buffer (default)** — fixed 4-second windows with 1-second overlap.
Emits `Partial` (interim) results as speech progresses and `Final` on
endpointing (~600 ms silence or decoder blank streak).
- `--streaming-window-ms` / `--streaming-overlap-ms` tune latency vs accuracy.
- `--streaming-fuzzy-threshold` enables fuzzy boundary merge.
2. **VAD-based simulated streaming (`--vad`)** — Silero VAD segments speech
into natural utterances; each utterance is transcribed offline with the
full encoder context. Eliminates boundary artefacts entirely. While speech
is active, partial (interim) results are still emitted via the overlap-buffer
so clients see live transcription progress. Suitable for high-accuracy
use cases where latency tolerance is higher.
### Tunable overlap-buffer parameters
- `--streaming-window-ms` (default 4000) — length of each encoder window in milliseconds (about 400 mel frames at the 10 ms hop).
- `--streaming-overlap-ms` (default 1000) — overlap between consecutive windows.
- `--streaming-fuzzy-threshold` (default 1.0) — normalized Levenshtein similarity
for boundary word deduplication. Lower values reduce duplicate words on
boundaries at the cost of potentially missing legitimate repetitions.
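
The effect of the fuzzy threshold can be illustrated with a small sketch of normalized Levenshtein similarity and a word-level boundary merge. The merge strategy shown (find the longest run of trailing words in the previous window that matches the leading words of the next one) and the names `levenshtein`, `similarity`, and `merge_boundary` are assumptions for the example; the actual logic in `StreamingState` may differ.

```rust
/// Edit distance between two strings, counted in Unicode scalar values.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // substitution, deletion, insertion
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Similarity in [0, 1]; 1.0 means the two words are identical.
pub fn similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}

/// Append `next` to `prev`, dropping leading words of `next` that repeat the
/// trailing words of `prev` with similarity at or above `threshold`.
pub fn merge_boundary(prev: &[&str], next: &[&str], threshold: f64) -> Vec<String> {
    let mut overlap = 0;
    for k in (1..=prev.len().min(next.len())).rev() {
        let tail = &prev[prev.len() - k..];
        if tail.iter().zip(&next[..k]).all(|(a, b)| similarity(a, b) >= threshold) {
            overlap = k;
            break;
        }
    }
    prev.iter().chain(&next[overlap..]).map(|w| w.to_string()).collect()
}
```

In this sketch a threshold of 1.0 only merges exact repeats, while lower values also merge near-matches such as a word decoded slightly differently on either side of the boundary.
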
### Graceful shutdown
- `CancellationToken` + `TaskTracker` cascade through every WS / SSE handler.
- On SIGTERM each session flushes, emits a final frame, and closes with
`Close(1001 Going Away)`.
- `--shutdown-drain-secs` (default 10) bounds the wait after `axum::serve`
returns. `--max-session-secs` (default 3600) caps any single WS session.
## Development guidelines
### TDD workflow
1. Write failing test first
2. Implement minimal code to pass
3. Refactor, verify tests still pass
4. `cargo test && cargo clippy` before every commit
### API versioning & backward compatibility
- WebSocket protocol version: `PROTOCOL_VERSION = "1.0"` (in `protocol/mod.rs`)
- `ServerMessage::Ready`, sent on connection, includes a `version` field
- Canonical WS path: `/v1/ws`. `/ws` remains as a deprecated alias.
- New fields are additive only; never remove or rename existing fields.
### Code style
- Rust 2024 edition
- `anyhow` for error handling, `tracing` for logging
- No `unwrap()` in production paths (use `?`, `context()`, or `unwrap_or_else`)
- Shared inference constants live in `inference/mod.rs`
### Audio format support
- File transcription: WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC (via symphonia)
- WebSocket: raw PCM16 binary frames at configurable sample rate; resampled
to 16 kHz server-side via rubato
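
As an illustration of the first step in handling a WebSocket binary frame, the sketch below converts raw PCM16 bytes into normalized `f32` samples. It assumes little-endian, mono samples and a hypothetical helper name `pcm16_le_to_f32`; resampling to 16 kHz (done with rubato in phostt) is not shown.

```rust
/// Convert a binary frame of little-endian PCM16 samples to f32 in [-1.0, 1.0).
/// A trailing odd byte, if any, is ignored in this sketch.
pub fn pcm16_le_to_f32(frame: &[u8]) -> Vec<f32> {
    frame
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]) as f32 / 32768.0)
        .collect()
}
```

A production path would likely buffer the dangling byte of an odd-length frame rather than drop it, so that sample alignment survives across frames.
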
### Security
- **Loopback bind by default.** `127.0.0.1` only; `--bind-all` /
`PHOSTT_ALLOW_BIND_ANY=1` required for non-loopback.
- **Origin allowlist.** Cross-origin denied by default; `--allow-origin`
to extend, `--cors-allow-any` for wildcard.
- **Runtime limits**: `--idle-timeout-secs` (300), `--ws-frame-max-bytes`
(512 KiB), `--body-limit-bytes` (50 MiB), `--pool-size` (4),
`--max-session-secs` (3600), `--shutdown-drain-secs` (10).
- **Per-IP rate limiting** (opt-in): `--rate-limit-per-minute N` +
`--rate-limit-burst`.
- **SHA-256 verification + atomic rename** on every model file.
- **Internal errors sanitized** — no path or model leakage to clients.
- **Prometheus `/metrics`** (opt-in via `--metrics`).
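
The per-IP token-bucket idea behind `--rate-limit-per-minute` and `--rate-limit-burst` can be sketched with the standard library alone. The types and method names below (`RateLimiter`, `allow`) are hypothetical and not taken from `rate_limit.rs`; passing `now` explicitly is a choice made here to keep the example deterministic.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::Instant;

struct Bucket {
    tokens: f64,
    last: Instant,
}

/// One token bucket per client IP: refills at `per_minute / 60` tokens per
/// second and holds at most `burst` tokens.
pub struct RateLimiter {
    per_minute: f64,
    burst: f64,
    buckets: HashMap<IpAddr, Bucket>,
}

impl RateLimiter {
    pub fn new(per_minute: u32, burst: u32) -> Self {
        Self {
            per_minute: f64::from(per_minute),
            burst: f64::from(burst),
            buckets: HashMap::new(),
        }
    }

    /// Returns true if a request from `ip` at time `now` may proceed.
    pub fn allow(&mut self, ip: IpAddr, now: Instant) -> bool {
        let burst = self.burst;
        let rate_per_sec = self.per_minute / 60.0;
        // A new client starts with a full bucket.
        let bucket = self
            .buckets
            .entry(ip)
            .or_insert(Bucket { tokens: burst, last: now });
        // Refill for the time elapsed since the last request, capped at `burst`.
        let elapsed = now.saturating_duration_since(bucket.last).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * rate_per_sec).min(burst);
        bucket.last = now;
        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

A long-running server would also need to evict idle buckets so the map does not grow without bound; that is omitted here.
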