wavekat-turn 0.0.6

Unified turn detection with multiple backends

Unified turn detection for voice pipelines, wrapping multiple open-source models behind common Rust traits. Same pattern as wavekat-vad.

> [!WARNING]
> Early development. The API may change between minor versions.

Backends

| Backend | Feature flag | Input | Model size | Inference | License |
|---|---|---|---|---|---|
| Pipecat Smart Turn v3 | `pipecat` | Audio (16 kHz PCM) | ~8 MB (int8 ONNX) | ~12 ms CPU | BSD 2-Clause |
| LiveKit Turn Detector | `livekit` | Text (ASR transcript) | ~400 MB (ONNX) | ~25 ms CPU | LiveKit Model License |

Quick Start

```sh
cargo add wavekat-turn --features pipecat
```

Use TurnController to wrap any detector with automatic state tracking:

```rust
use wavekat_turn::{TurnController, TurnState};
use wavekat_turn::audio::PipecatSmartTurn;

let detector = PipecatSmartTurn::new()?;
let mut ctrl = TurnController::new(detector);

// Feed audio continuously
ctrl.push_audio(&audio_frame);

// VAD speech start — soft reset (keeps buffer if turn was unfinished)
ctrl.reset_if_finished();

// VAD speech end — predict
let prediction = ctrl.predict()?;
match prediction.state {
    TurnState::Finished   => { /* user is done, send to LLM */ }
    TurnState::Unfinished => { /* keep listening */ }
    TurnState::Wait       => { /* user asked AI to hold */ }
}

// After assistant finishes responding — hard reset
ctrl.reset();
```

Or use a text-based detector directly:

```rust
use wavekat_turn::{TextTurnDetector, TurnState};
use wavekat_turn::text::LiveKitEou;

let mut detector = LiveKitEou::new()?;

// `context` holds the prior conversation turns (empty for a fresh session).
let prediction = detector.predict_text("I was wondering if", &context)?;
assert_eq!(prediction.state, TurnState::Unfinished);
```

See examples/controller.rs for a full walkthrough with real audio.

Architecture

Two trait families cover the two input modalities:

  • AudioTurnDetector -- operates on raw audio frames (no ASR needed)
  • TextTurnDetector -- operates on ASR transcript text with optional conversation context

TurnController wraps any AudioTurnDetector and adds orchestration helpers like soft-reset (preserves buffer when the user pauses mid-sentence).
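
The trait split might be sketched as follows. This is a rough, hypothetical reconstruction based only on the calls shown above (`push_audio`, `predict`, `predict_text`); the crate's actual signatures, error types, and frame formats may differ:

```rust
// Hypothetical sketch of the two trait families; real definitions in
// wavekat-turn may differ (error handling, frame types, context types).
#[derive(Debug, PartialEq)]
pub enum TurnState { Finished, Unfinished, Wait }

pub struct Prediction {
    pub state: TurnState,
    pub probability: f32, // model confidence that the turn is finished
}

/// Operates on raw audio frames; no ASR required.
pub trait AudioTurnDetector {
    fn push_audio(&mut self, frame: &[f32]);
    fn predict(&mut self) -> Prediction;
}

/// Operates on ASR transcript text plus optional conversation context.
pub trait TextTurnDetector {
    fn predict_text(&mut self, text: &str, context: &[String]) -> Prediction;
}
```

The split lets `TurnController` stay generic over any `AudioTurnDetector`, while text-based backends plug into the ASR side of a pipeline instead.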

```text
wavekat-vad   -->  "is someone speaking?"
wavekat-turn  -->  "are they done speaking?"
     |                   |
     v                   v
wavekat-voice -->  orchestrates VAD + turn + ASR + LLM + TTS
```

Feature Flags

| Flag | Default | Description |
|---|---|---|
| `pipecat` | off | Pipecat Smart Turn v3 audio backend (requires `ort`, `ndarray`) |
| `livekit` | off | LiveKit text-based backend (requires `ort`, `ndarray`) |

Important Notes

  • 8 kHz telephony audio must be upsampled to 16 kHz before being passed to audio-based detectors; Smart Turn v3 silently produces incorrect results on 8 kHz input.
  • Text-based detectors depend on ASR transcript quality. Pair with a streaming ASR provider for best results.
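
A minimal 2x upsampling sketch for the 8 kHz case, using linear interpolation. This is illustrative only; a production pipeline should prefer a band-limited resampler (e.g. the `rubato` crate) for better audio quality:

```rust
/// Naive 8 kHz -> 16 kHz upsampler: inserts the midpoint between each
/// pair of adjacent samples, doubling the sample count.
fn upsample_2x(input: &[i16]) -> Vec<i16> {
    let mut out = Vec::with_capacity(input.len() * 2);
    for (i, &s) in input.iter().enumerate() {
        out.push(s);
        // Midpoint toward the next sample; repeat the last sample at the end.
        let next = input.get(i + 1).copied().unwrap_or(s);
        out.push(((s as i32 + next as i32) / 2) as i16);
    }
    out
}
```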

Accuracy

Cross-validated against the original Python (Pipecat) pipeline on three fixture clips. Tolerance: ±0.02 probability.

Run locally with make accuracy. See scripts/README.md for how to regenerate the Python reference.

License

Licensed under Apache 2.0.

Copyright 2026 WaveKat.

Acknowledgements