§Kalosm Sound
Kalosm Sound is a collection of audio models and utilities for the Kalosm framework. It supports several voice activity detection models, and provides utilities for transcribing audio into text.
§Sound Streams
Models in Kalosm Sound work with any AsyncSource. You can use MicInput::stream to stream audio from the microphone, or any synchronous audio source that implements rodio::Source, like an MP3 or WAV file.
You can transform the audio streams with:
- VoiceActivityDetectorExt::voice_activity_stream: Detect voice activity in the audio data
- DenoisedExt::denoise_and_detect_voice_activity: Denoise the audio data and detect voice activity
- AsyncSourceTranscribeExt::transcribe: Chunk an audio stream based on voice activity and then transcribe the chunked audio data
- VoiceActivityStreamExt::rechunk_voice_activity: Chunk an audio stream based on voice activity
- VoiceActivityStreamExt::filter_voice_activity: Filter chunks of audio data based on voice activity
- TranscribeChunkedAudioStreamExt::transcribe: Transcribe a chunked audio stream
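To make the idea of filtering by voice activity concrete, here is a minimal, standard-library-only sketch. It is not Kalosm's implementation: `TaggedChunk`, the free function `filter_voice_activity`, and the `threshold` parameter are hypothetical names chosen for illustration, and the real extension method works on asynchronous streams rather than on a `Vec`.

```rust
/// Hypothetical stand-in for a chunk of audio tagged with a speech probability.
/// (Illustrative only; this is not a Kalosm type.)
#[derive(Debug, Clone, PartialEq)]
struct TaggedChunk {
    /// Mono audio samples in this chunk.
    samples: Vec<f32>,
    /// Estimated probability (0.0 to 1.0) that the chunk contains speech.
    probability: f32,
}

/// Keep only the chunks whose speech probability is at least `threshold`.
fn filter_voice_activity(chunks: Vec<TaggedChunk>, threshold: f32) -> Vec<TaggedChunk> {
    chunks
        .into_iter()
        .filter(|chunk| chunk.probability >= threshold)
        .collect()
}
```

Calling `filter_voice_activity(chunks, 0.5)` on chunks with probabilities 0.1, 0.9, and 0.6 would keep the last two. A fixed threshold like this is simple but can clip quiet word endings, which is one reason a smoothed, rechunking approach is sometimes preferred.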
§Voice Activity Detection
Voice activity detection (VAD) models detect when someone is speaking in an audio stream. The simplest way to use a VAD model is to create an audio stream and call VoiceActivityDetectorExt::voice_activity_stream, which yields audio chunks tagged with the probability that they contain speech:
use kalosm::sound::*;

#[tokio::main]
async fn main() {
    // Get the default microphone input
    let mic = MicInput::default();
    // Stream the audio from the microphone
    let stream = mic.stream();
    // Detect voice activity in the audio stream
    let mut vad = stream.voice_activity_stream();
    while let Some(input) = vad.next().await {
        println!("Probability: {}", input.probability);
    }
}
Kalosm also provides VoiceActivityStreamExt::rechunk_voice_activity to collect consecutive audio samples with a high VAD probability into chunks. This can be useful for applications like speech recognition, where context between consecutive audio samples is important.
use kalosm::sound::*;
use rodio::Source;

#[tokio::main]
async fn main() {
    // Get the default microphone input
    let mic = MicInput::default();
    // Stream the audio from the microphone
    let stream = mic.stream();
    // Chunk the audio into chunks of speech
    let vad = stream.voice_activity_stream();
    let mut audio_chunks = vad.rechunk_voice_activity();
    // Print the chunks as they are streamed in
    while let Some(input) = audio_chunks.next().await {
        println!("New voice activity chunk with duration {:?}", input.total_duration());
    }
}
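The rechunking idea can be sketched without any audio dependencies. The following is an illustrative approximation only, not Kalosm's actual algorithm: the function name `rechunk_by_rolling_average`, the `(samples, probability)` frame representation, and the `window` and `threshold` parameters are all assumptions made for this example. It merges consecutive frames into one segment while the rolling average of their speech probabilities stays at or above the threshold.

```rust
use std::collections::VecDeque;

/// Merge consecutive frames into segments while the rolling average of the
/// last `window` speech probabilities is at least `threshold`.
/// Each frame is a hypothetical `(samples, probability)` pair.
fn rechunk_by_rolling_average(
    frames: &[(Vec<f32>, f32)],
    window: usize,
    threshold: f32,
) -> Vec<Vec<f32>> {
    assert!(window > 0, "window must be at least 1");

    let mut recent: VecDeque<f32> = VecDeque::with_capacity(window);
    let mut current: Vec<f32> = Vec::new();
    let mut segments: Vec<Vec<f32>> = Vec::new();

    for (samples, probability) in frames {
        // Slide the window: drop the oldest probability once it is full.
        if recent.len() == window {
            recent.pop_front();
        }
        recent.push_back(*probability);
        let average = recent.iter().sum::<f32>() / recent.len() as f32;

        if average >= threshold {
            // Still inside speech: extend the segment being built.
            current.extend_from_slice(samples);
        } else if !current.is_empty() {
            // The average dropped below the threshold: close the segment.
            segments.push(std::mem::take(&mut current));
        }
    }

    // Flush a segment that was still open when the input ended.
    if !current.is_empty() {
        segments.push(current);
    }
    segments
}
```

Because the decision uses an average rather than a single frame, a brief dip in probability does not immediately split a segment, which helps keep whole phrases together for a downstream transcription model.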
§Transcription
You can use the Whisper model to transcribe audio into text. Kalosm can transcribe any AsyncSource into a transcription stream with the AsyncSourceTranscribeExt::transcribe method:
use kalosm::sound::*;

#[tokio::main]
async fn main() {
    // Get the default microphone input
    let mic = MicInput::default();
    // Stream the audio from the microphone
    let stream = mic.stream();
    // Transcribe the audio into text with the default Whisper model
    let mut transcribe = stream.transcribe(Whisper::new().await.unwrap());
    // Print the text as it is streamed in
    transcribe.to_std_out().await.unwrap();
}
§Structs
- ChunkedTranscriptionTask: A chunked audio transcription task which can be streamed from a Whisper model.
- DenoisedStream: A stream of SamplesBuffers with voice activity detection information.
- MicInput: A microphone input.
- MicStream: A stream of audio data from the microphone.
- ParseWhisperLanguageError: Error that reports the unsupported value.
- ParseWhisperSourceError: Error that reports the unsupported value.
- ResampledAsyncSource: A resampled async audio source.
- Segment: A transcribed segment of audio.
- TokenChunkRef: A reference to a utf8 token chunk in a segment.
- TranscriptionTask: A transcription task which can be streamed from a Whisper model.
- VoiceActivityDetectorOutput: The output of a crate::VoiceActivityDetectorStream.
- VoiceActivityDetectorStream: A stream of SamplesBuffers with voice activity detection information.
- VoiceActivityFilterStream: A stream of audio chunks that have a voice activity probability above a given threshold.
- VoiceActivityRechunkerStream: A stream of audio chunks with a voice activity probability rolling average above a given threshold.
- Whisper: A quantized whisper audio transcription model.
- WhisperBuilder: A builder with configuration for a Whisper model.
§Enums
- FileSource: A source for a file, either from Hugging Face or a local path.
- ModelLoadingProgress: The progress of starting a model.
- WhisperLanguage: A language whisper can use.
- WhisperSource: The source whisper model to use.
§Traits
- AsyncSource: A streaming audio source for single channel audio. This trait is implemented automatically for all types that implement rodio::Source.
- AsyncSourceTranscribeExt: An extension trait for AsyncSource that integrates with crate::Whisper.
- DenoisedExt: An extension trait for audio streams for denoising. Based on the nnnoiseless crate.
- TranscribeChunkedAudioStreamExt: An extension trait to transcribe pre-chunked audio streams.
- VoiceActivityDetectorExt: An extension trait for audio streams that adds voice activity detection information. Based on the voice_activity_detector crate.
- VoiceActivityStreamExt: An extension trait for audio streams with voice activity detection information.