# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project

**phostt** — local speech-to-text server for Vietnamese, powered by Zipformer-vi RNN-T. On-device inference via ONNX Runtime. No cloud, no API keys, full privacy.

- **Repository**: https://github.com/ekhodzitsky/phostt
- **License**: MIT
- **Crates.io**: `phostt` name verified available (not yet published)
- **Status**: 0.1.0 release candidate. Forked from [`gigastt`](https://github.com/ekhodzitsky/gigastt) v0.9.4 (Russian STT). The HTTP/WS/SSE/metrics/shutdown stack is production-grade and unchanged. The inference path (model fetch, 80-bin mel features, SentencePiece BPE tokenizer, stateless RNN-T decode, overlap-buffer streaming) is fully wired and tested against Vietnamese audio fixtures.

## Build & Test

```sh
cargo build                          # CPU-only debug build (default, any platform)
cargo build --features coreml        # macOS ARM64 (CoreML / Neural Engine)
cargo build --features cuda          # Linux x86_64 (CUDA 12+)
cargo build --release                # Release build (LTO, stripped)
cargo test                           # Unit tests (no model required)
cargo clippy                         # Lint (no expected warnings)
```

`--features coreml` is confirmed working on macOS ARM64 (Apple Silicon).

## Model

**Zipformer-30M-RNNT-6000h** packaged by `sherpa-onnx` (Apache 2.0):

- Source bundle: `sherpa-onnx-zipformer-vi-30M-int8-2026-02-09.tar.bz2`
  (~26 MB) from [k2-fsa/sherpa-onnx releases](https://github.com/k2-fsa/sherpa-onnx/releases)
- Files inside: `encoder.int8.onnx` (26 MB), `decoder.onnx` (4.9 MB),
  `joiner.int8.onnx` (1.0 MB), `bpe.model` (262 KB), `tokens.txt`
- Sample rate: 16 kHz · Mel bins: 80 · Vocab: SentencePiece BPE
- Decode: RNN-T greedy (stateless decoder — no LSTM h/c)
- Training: ~6000 hours of Vietnamese speech (`hynt/Zipformer-30M-RNNT-6000h`)
- WER: ~12.3% on VLSP2020-T1, ~7.97% on VLSP2025-Pub, ~7.56% on GigaSpeech2

The bundle is downloaded on first run into `~/.phostt/models/`, with SHA-256
verification and an atomic rename (the same hardening as the upstream gigastt
fetcher).

## Architecture

```
src/
  lib.rs                  # Public module exports
  main.rs                 # CLI (clap): serve, download, transcribe, inspect
  model/mod.rs            # Model bundle download (tar.bz2 → .onnx + bpe.model + tokens.txt)
  inference/
    mod.rs                # Engine: ONNX session pool, StreamingState, DecoderState
    features.rs           # Mel spectrogram (80 bins, FFT=400, hop=160) via kaldi_native_fbank::OnlineFeature
    tokenizer.rs          # SentencePiece BPE (bpe.model)
    decode.rs             # RNN-T greedy decode (stateless decoder)
    audio.rs              # Audio loading, resampling, channel mixing
  error.rs                # Typed error types (PhosttError)
  inspect.rs              # ONNX session I/O metadata printer
  server/
    mod.rs                # axum router: HTTP + WebSocket on single port
    http.rs               # REST handlers: /health, /v1/models, /v1/transcribe, /v1/transcribe/stream
    rate_limit.rs         # In-tree per-IP token-bucket rate limiter
    metrics.rs            # In-tree Prometheus text encoder
  protocol/mod.rs         # JSON message types for the WS protocol
```

### Streaming
- Zipformer-vi-30M is published as an offline transducer; phostt wraps it in
  an overlap-buffer streaming layer (`StreamingState`) to expose the same
  WebSocket protocol as the upstream gigastt server.
- Server accepts configurable sample rates (8/16/24/44.1/48 kHz) via the
  `Configure` message; default 48 kHz is resampled to 16 kHz with rubato.

## Streaming model

Zipformer-vi-30M is an offline transducer. phostt wraps it in a sliding
overlap-buffer (`StreamingState`) to expose real-time WebSocket streaming.
The implementation uses a 4-second window with 1-second overlap: audio is
chunked, each chunk is featurized and encoded independently, and partial
results are merged across boundaries.

`kaldi_native_fbank::OnlineFeature` in `inference/features.rs` drives the
feature extraction, providing the same povey-window + preemphasis + Slaney
mel filterbank pipeline that sherpa-onnx uses upstream. The `OnlineFeature`
wrapper handles streaming increments, while the overlap-and-merge logic in
`StreamingState` works around the offline encoder constraint.

### Graceful shutdown
- `CancellationToken` + `TaskTracker` cascades through every WS / SSE handler.
- On SIGTERM each session flushes, emits a final frame, and closes with
  `Close(1001 Going Away)`.
- `--shutdown-drain-secs` (default 10) bounds the wait after `axum::serve`
  returns. `--max-session-secs` (default 3600) caps any single WS session.

## Development guidelines

### TDD workflow
1. Write failing test first
2. Implement minimal code to pass
3. Refactor, verify tests still pass
4. `cargo test && cargo clippy` before every commit

### API versioning & backward compatibility
- WebSocket protocol version: `PROTOCOL_VERSION = "1.0"` (in `protocol/mod.rs`)
- `ServerMessage::Ready` includes `version` field sent on connection
- Canonical WS path: `/v1/ws`. `/ws` remains as a deprecated alias.
- New fields are additive only; never remove or rename existing fields.

### Code style
- Rust 2024 edition
- `anyhow` for error handling, `tracing` for logging
- No `unwrap()` in production paths (use `?`, `context()`, or `unwrap_or_else`)
- Shared inference constants live in `inference/mod.rs`

### Audio format support
- File transcription: WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC (via symphonia)
- WebSocket: raw PCM16 binary frames at configurable sample rate; resampled
  to 16 kHz server-side via rubato

### Security
- **Loopback bind by default.** `127.0.0.1` only; `--bind-all` /
  `PHOSTT_ALLOW_BIND_ANY=1` required for non-loopback.
- **Origin allowlist.** Cross-origin denied by default; `--allow-origin`
  to extend, `--cors-allow-any` for wildcard.
- **Runtime limits**: `--idle-timeout-secs` (300), `--ws-frame-max-bytes`
  (512 KiB), `--body-limit-bytes` (50 MiB), `--pool-size` (4),
  `--max-session-secs` (3600), `--shutdown-drain-secs` (10).
- **Per-IP rate limiting** (opt-in): `--rate-limit-per-minute N` +
  `--rate-limit-burst`.
- **SHA-256 verification + atomic rename** on every model file.
- **Internal errors sanitized** — no path or model leakage to clients.
- **Prometheus `/metrics`** (opt-in via `--metrics`).