# wavekat-turn

Unified turn detection for voice pipelines, wrapping multiple open-source models behind common Rust traits. Follows the same pattern as wavekat-vad.
> [!WARNING]
> Early development. The trait API is defined; backend implementations are stubs pending ONNX model integration.
## Backends
| Backend | Feature flag | Input | Model size | Inference | License |
|---|---|---|---|---|---|
| Pipecat Smart Turn v3 | `pipecat` | Audio (16 kHz PCM) | ~8 MB (int8 ONNX) | ~12 ms CPU | BSD 2-Clause |
| LiveKit Turn Detector | `livekit` | Text (ASR transcript) | ~400 MB (ONNX) | ~25 ms CPU | LiveKit Model License |
## Quick Start
Use the audio-based detector (the import path and prediction fields below are illustrative, assuming the crate is imported as `wavekat_turn`):

```rust
use wavekat_turn::PipecatSmartTurn;

let mut detector = PipecatSmartTurn::new()?;

// Feed 16 kHz f32 PCM frames after VAD detects silence
let prediction = detector.predict_audio(&frames)?;
match prediction.state {
    // e.g. TurnState::Complete  => hand the turn to the LLM
    //      TurnState::Incomplete => keep listening
    _ => {}
}
```
Or the text-based detector:
```rust
use wavekat_turn::LiveKitEou;

let mut detector = LiveKitEou::new()?;

// Transcript and expected state are illustrative.
let prediction = detector.predict_text("so what do you think about")?;
assert_eq!(prediction.state, TurnState::Incomplete);
```
## Architecture
Two trait families cover the two input modalities:
- `AudioTurnDetector`: operates on raw audio frames (no ASR needed)
- `TextTurnDetector`: operates on ASR transcript text with optional conversation context
```
wavekat-vad   --> "is someone speaking?"
wavekat-turn  --> "are they done speaking?"
        |                 |
        v                 v
wavekat-voice --> orchestrates VAD + turn + ASR + LLM + TTS
```
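The trait names above come from this README; everything else in the sketch below — the method signatures, the `TurnPrediction` type, and the mock backend — is an assumption, since the real API is still being stabilized:

```rust
use std::error::Error;

/// Illustrative only: names follow the README, signatures are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TurnState {
    Complete,
    Incomplete,
}

pub struct TurnPrediction {
    pub state: TurnState,
    pub probability: f32,
}

/// Audio modality: consumes 16 kHz f32 PCM directly, no ASR required.
pub trait AudioTurnDetector {
    fn predict_audio(&mut self, pcm_16khz: &[f32]) -> Result<TurnPrediction, Box<dyn Error>>;
}

/// Text modality: consumes an ASR transcript of the current turn.
pub trait TextTurnDetector {
    fn predict_text(&mut self, transcript: &str) -> Result<TurnPrediction, Box<dyn Error>>;
}

/// Hypothetical mock showing how a backend would plug into the trait.
pub struct MockTextDetector;

impl TextTurnDetector for MockTextDetector {
    fn predict_text(&mut self, transcript: &str) -> Result<TurnPrediction, Box<dyn Error>> {
        // Toy heuristic: sentence-final punctuation means the turn is over.
        let done = transcript.trim_end().ends_with(&['.', '?', '!'][..]);
        Ok(TurnPrediction {
            state: if done { TurnState::Complete } else { TurnState::Incomplete },
            probability: 0.9,
        })
    }
}
```

Keeping the two modalities behind separate traits lets an orchestrator pick audio-first detection when no ASR is running, and fall back to text-based detection once a transcript is available.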
## Feature Flags
| Flag | Default | Description |
|---|---|---|
| `pipecat` | off | Pipecat Smart Turn v3 audio backend (requires `ort`, `ndarray`) |
| `livekit` | off | LiveKit text-based backend (requires `ort`, `ndarray`) |
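Both flags are off by default, so a backend must be enabled explicitly in `Cargo.toml`; the crate name and version below are assumptions:

```toml
[dependencies]
wavekat-turn = { version = "0.1", features = ["pipecat"] }
```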
## Important Notes
- 8 kHz telephony audio must be upsampled to 16 kHz before passing to audio-based detectors. Smart Turn v3 silently produces incorrect results at 8 kHz.
- Text-based detectors depend on ASR transcript quality. Pair with a streaming ASR provider for best results.
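For the 8 kHz caveat above, a minimal 2x upsampler using linear interpolation is sketched below; it is adequate for experimentation, but a proper windowed-sinc or polyphase resampler (e.g. the `rubato` crate) preserves audio quality better:

```rust
/// Naive 8 kHz -> 16 kHz upsampler via linear interpolation (sketch only).
/// Each output pair is the original sample followed by the midpoint
/// between it and the next sample (the last sample is repeated).
fn upsample_2x(pcm_8khz: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(pcm_8khz.len() * 2);
    for (i, &s) in pcm_8khz.iter().enumerate() {
        out.push(s);
        let next = pcm_8khz.get(i + 1).copied().unwrap_or(s);
        out.push((s + next) / 2.0);
    }
    out
}
```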
## License
Licensed under Apache 2.0.
Copyright 2026 WaveKat.
## Acknowledgements
- Pipecat Smart Turn by Daily (BSD 2-Clause)
- LiveKit Turn Detector by LiveKit (LiveKit Model License)