# wavekat-turn

Unified turn detection for voice pipelines, wrapping multiple open-source models behind common Rust traits. Same pattern as `wavekat-vad`.
> [!WARNING]
> Early development. API may change between minor versions.
## Backends
| Backend | Feature flag | Input | Model size | Inference | License |
|---|---|---|---|---|---|
| Pipecat Smart Turn v3 | `pipecat` | Audio (16 kHz PCM) | ~8 MB (int8 ONNX) | ~12 ms CPU | BSD 2-Clause |
| WaveKat Smart Turn fine-tunes (HF) | `wavekat-smart-turn` | Audio (16 kHz PCM) | ~8 MB (int8 ONNX) | ~12 ms CPU | BSD 2-Clause |
| LiveKit Turn Detector | `livekit` | Text (ASR transcript) | ~400 MB (ONNX) | ~25 ms CPU | LiveKit Model License |
The WaveKat fine-tunes share the upstream Pipecat ONNX contract (same input
shape, same tensor names) — they're language-specialized weights for the
same architecture. Use them when you want better behavior on a specific
language; today Mandarin (zh) is the only one shipped, but more will land
in the same HF repo over time.
## Quick Start

Use `TurnController` to wrap any detector with automatic state tracking:
```rust
use wavekat_turn::TurnController;
use wavekat_turn::PipecatSmartTurn;

let detector = PipecatSmartTurn::new()?;
let mut ctrl = TurnController::new(detector);

// Feed audio continuously (16 kHz mono samples)
ctrl.push_audio(&samples);

// VAD speech start — soft reset (keeps buffer if turn was unfinished)
ctrl.reset_if_finished();

// VAD speech end — predict
let prediction = ctrl.predict()?;
match prediction.state {
    // ... handle finished vs. unfinished turn
    _ => {}
}

// After assistant finishes responding — hard reset
ctrl.reset();
```
Or call the text-based detector directly:

```rust
use wavekat_turn::TextTurnDetector;
use wavekat_turn::LiveKitEou;

let mut detector = LiveKitEou::new()?;
// `transcript` is the ASR text for the user's current utterance.
let prediction = detector.predict_text(transcript)?;
// Inspect prediction.state to decide whether the turn is complete.
```
See `examples/controller.rs` for a full walkthrough with real audio.
## Architecture
Two trait families cover the two input modalities:
- `AudioTurnDetector` — operates on raw audio frames (no ASR needed)
- `TextTurnDetector` — operates on ASR transcript text with optional conversation context
`TurnController` wraps any `AudioTurnDetector` and adds orchestration helpers
like soft-reset (preserves the buffer when the user pauses mid-sentence).
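To make the pattern concrete, here is a minimal standalone sketch. Only the names `AudioTurnDetector`, `TurnController`, `push_audio`, `predict`, and `reset` come from this README; everything else (the `Prediction` fields, the `TurnState` enum, the toy `EnergyStub` detector) is invented for illustration and is not the crate's real API:

```rust
// Hypothetical sketch only — shapes beyond the names in this README are assumptions.
#[derive(Debug, PartialEq)]
pub enum TurnState {
    Finished,
    Unfinished,
}

pub struct Prediction {
    pub state: TurnState,
    pub probability: f32,
}

pub trait AudioTurnDetector {
    /// Run inference over buffered 16 kHz mono samples.
    fn predict(&mut self, samples: &[f32]) -> Prediction;
}

/// Minimal controller: buffers audio, delegates prediction, supports reset.
pub struct TurnController<D: AudioTurnDetector> {
    detector: D,
    buffer: Vec<f32>,
}

impl<D: AudioTurnDetector> TurnController<D> {
    pub fn new(detector: D) -> Self {
        Self { detector, buffer: Vec::new() }
    }
    pub fn push_audio(&mut self, samples: &[f32]) {
        self.buffer.extend_from_slice(samples);
    }
    pub fn predict(&mut self) -> Prediction {
        self.detector.predict(&self.buffer)
    }
    pub fn reset(&mut self) {
        self.buffer.clear();
    }
}

/// Toy stand-in detector: a quiet buffer tail means the turn is finished.
pub struct EnergyStub;

impl AudioTurnDetector for EnergyStub {
    fn predict(&mut self, samples: &[f32]) -> Prediction {
        let tail_energy: f32 = samples.iter().rev().take(160).map(|s| s * s).sum();
        if tail_energy < 0.01 {
            Prediction { state: TurnState::Finished, probability: 0.9 }
        } else {
            Prediction { state: TurnState::Unfinished, probability: 0.2 }
        }
    }
}

fn main() {
    let mut ctrl = TurnController::new(EnergyStub);
    ctrl.push_audio(&[0.0; 320]); // 20 ms of silence at 16 kHz
    assert_eq!(ctrl.predict().state, TurnState::Finished);
}
```

The point of the sketch is the division of labor: the trait owns inference, while the controller owns buffering and reset policy, so any backend can be swapped in.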
```
wavekat-vad  --> "is someone speaking?"
wavekat-turn --> "are they done speaking?"
      |                |
      v                v
wavekat-voice --> orchestrates VAD + turn + ASR + LLM + TTS
```
## Feature Flags
| Flag | Default | Description |
|---|---|---|
| `pipecat` | off | Pipecat Smart Turn v3 audio backend (requires `ort`, `ndarray`) |
| `wavekat-smart-turn` | off | WaveKat language-specialized fine-tunes; implies `pipecat`, adds `hf-hub` runtime download |
| `livekit` | off | LiveKit text-based backend (requires `ort`, `ndarray`) |
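Enabling a backend is then a matter of turning on its feature in `Cargo.toml`. A sketch, assuming the crate is published as `wavekat-turn` at version 0.1 (check the real coordinates before copying):

```toml
[dependencies]
# Crate name and version are assumptions for illustration.
wavekat-turn = { version = "0.1", features = ["pipecat"] }
```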
### Selecting a Smart Turn variant
```rust
use wavekat_turn::PipecatSmartTurn;
use wavekat_turn::SmartTurnLang;

// Embedded upstream weights — works offline, no setup.
let detector = PipecatSmartTurn::new()?;

// WaveKat Mandarin fine-tune — downloaded from HuggingFace on first call,
// then cached under $HF_HOME/hub/.
let detector = PipecatSmartTurn::with_variant(SmartTurnLang::Zh)?;
```
The first call for a WaveKat variant downloads the ONNX model from the
`wavekat/smart-turn-ONNX` HF repo and caches it under `$HF_HOME/hub/`
(default `~/.cache/huggingface/hub/`).
For offline builds, set `WAVEKAT_TURN_MODEL_DIR` to a directory containing
`<lang>/smart-turn-cpu.onnx` to skip the download.
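For example, an offline deployment might stage pre-downloaded weights like this. Paths here are illustrative; the only fixed part is the `<lang>/smart-turn-cpu.onnx` layout:

```shell
# Stage pre-downloaded Mandarin weights under a local model directory.
mkdir -p models/zh
# cp smart-turn-cpu.onnx models/zh/    # place the ONNX file here
# Use an absolute path in production; relative is shown for brevity.
export WAVEKAT_TURN_MODEL_DIR="models"
echo "$WAVEKAT_TURN_MODEL_DIR/zh/smart-turn-cpu.onnx"
```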
## Important Notes
- 8 kHz telephony audio must be upsampled to 16 kHz before passing to audio-based detectors. Smart Turn v3 silently produces incorrect results at 8 kHz.
- Text-based detectors depend on ASR transcript quality. Pair with a streaming ASR provider for best results.
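A naive way to meet the 16 kHz requirement is a linear-interpolation doubler. This standalone sketch (not part of the crate) is illustrative only; a proper resampler such as the `rubato` crate gives better fidelity:

```rust
/// Double the sample rate of an 8 kHz mono f32 buffer to 16 kHz by
/// linear interpolation. Illustrative only — prefer a windowed-sinc
/// resampler (e.g. the `rubato` crate) in production.
fn upsample_8k_to_16k(input: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(input.len() * 2);
    for (i, &s) in input.iter().enumerate() {
        out.push(s);
        // Midpoint between this sample and the next (repeat the last one).
        let next = input.get(i + 1).copied().unwrap_or(s);
        out.push((s + next) * 0.5);
    }
    out
}

fn main() {
    let eight_khz = [0.0_f32, 1.0, 0.0, -1.0];
    let sixteen_khz = upsample_8k_to_16k(&eight_khz);
    assert_eq!(sixteen_khz.len(), 8);
    assert_eq!(sixteen_khz[1], 0.5); // interpolated midpoint
}
```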
## Accuracy
Cross-validated against the original Python (Pipecat) pipeline on three fixture clips. Tolerance: ±0.02 probability.
Run locally with `make accuracy`. See `scripts/README.md` for how to regenerate the Python reference.
## License
Licensed under Apache 2.0.
Copyright 2026 WaveKat.
## Acknowledgements
- Pipecat Smart Turn by Daily (BSD 2-Clause)
- LiveKit Turn Detector by LiveKit (LiveKit Model License)