# scriptrs

Rust transcription with native CoreML Parakeet v2 inference.

> **Work in progress.** scriptrs is early and intentionally narrow right now:
>
> - macOS only
> - Apple CoreML only
> - Parakeet TDT v2 only
> - no CUDA
> - no non-macOS backend yet

The base crate exposes a single-chunk `TranscriptionPipeline`. Fast long-audio chunking lives behind the `long-form` feature via `LongFormTranscriptionPipeline`. VAD-backed speech region planning is an additional `long-form-vad` feature.
## Current scope
- Base pipeline for short audio
- Optional fast long-form pipeline with overlap chunking
- Optional VAD-backed long-form region planning
- Native CoreML inference on macOS
- Hugging Face download support with optional local model loading
## What it does not do yet
- Linux or Windows support
- CUDA support
- Other ASR models
- Streaming transcription
- Stable public guarantees around model layout or long-form behavior
## Install

```toml
[dependencies]
scriptrs = "0.1.0"
```
For fast long-form transcription:

```toml
[dependencies]
scriptrs = { version = "0.1.0", features = ["long-form"] }
```
For VAD-backed long-form transcription:

```toml
[dependencies]
scriptrs = { version = "0.1.0", features = ["long-form-vad"] }
```
## Model downloads

With the default `online` feature, scriptrs can resolve models automatically:

- it downloads the runtime bundle from `avencera/scriptrs-models` on Hugging Face
You can override either side of that:

- `SCRIPTRS_MODELS_DIR=/path/to/models` forces a local bundle
- `SCRIPTRS_MODELS_REPO=owner/repo` forces a specific Hugging Face model repo layout
## Local model layout

If you want to use `from_dir(...)` or `SCRIPTRS_MODELS_DIR`, the local bundle should look like this:

```
models/
  parakeet-v2/
    encoder.mlmodelc/
    decoder.mlmodelc/
    joint-decision.mlmodelc/
    vocab.txt
```
With `long-form-vad`, add:

```
models/
  vad/
    silero-vad.mlmodelc/
```
## Usage
### Short audio
Use the base pipeline when your audio already fits in a single Parakeet chunk.
With the default `online` feature, `from_pretrained()` is the intended path:

```rust
use scriptrs::TranscriptionPipeline;
```
If the input is too long for the base pipeline, it returns `AudioTooLong`.
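A minimal end-to-end sketch of that path. `from_pretrained()` and `AudioTooLong` come from the text above; the `transcribe` method name and its `&[f32]` sample input are assumptions here, so check the crate docs for the real signature:

```rust
use scriptrs::TranscriptionPipeline;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Resolves and caches the model bundle via Hugging Face
    // (default `online` feature).
    let pipeline = TranscriptionPipeline::from_pretrained()?;

    // Mono 16 kHz f32 samples; one second of silence as a stand-in input.
    // `transcribe` is an assumed method name.
    let samples: Vec<f32> = vec![0.0; 16_000];
    let text = pipeline.transcribe(&samples)?;
    println!("{text}");
    Ok(())
}
```

If the audio exceeds a single Parakeet chunk, the `?` surfaces the `AudioTooLong` error instead of a transcript.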
If you want to use a local bundle instead:

```rust
use scriptrs::TranscriptionPipeline;
```
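The local-bundle path, sketched under the same assumptions; whether `from_dir(...)` takes the bundle root or a model subdirectory is not pinned down here, so the path below is illustrative:

```rust
use scriptrs::TranscriptionPipeline;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Loads the bundle from disk instead of Hugging Face; the directory
    // should match the layout in "Local model layout".
    let pipeline = TranscriptionPipeline::from_dir("models")?;

    let samples: Vec<f32> = vec![0.0; 16_000];
    let text = pipeline.transcribe(&samples)?; // assumed method name
    println!("{text}");
    Ok(())
}
```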
### Long audio
Enable `long-form` if you want scriptrs to own long-audio chunking internally and you care most about speed on clean, dense speech:

```rust
use scriptrs::LongFormTranscriptionPipeline;
```
`LongFormConfig` defaults to the fast overlap-chunking path with 4 workers. You can tune the worker count when you want less or more parallelism:

```rust
use scriptrs::{LongFormConfig, LongFormTranscriptionPipeline};
```
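A sketch of tuning the worker count. Only `LongFormConfig` and its 4-worker default come from the text above; the `workers` field name and the `from_pretrained_with_config` constructor are hypothetical stand-ins for however the crate actually wires config into the pipeline:

```rust
use scriptrs::{LongFormConfig, LongFormTranscriptionPipeline};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // `workers` is a hypothetical field name; the crate defaults to 4.
    let config = LongFormConfig {
        workers: 2,
        ..LongFormConfig::default()
    };
    // Hypothetical constructor pairing a config with the pipeline.
    let pipeline = LongFormTranscriptionPipeline::from_pretrained_with_config(config)?;

    let samples: Vec<f32> = vec![0.0; 16_000 * 600]; // ten minutes of stand-in audio
    let text = pipeline.transcribe(&samples)?; // assumed method name
    println!("{text}");
    Ok(())
}
```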
Enable `long-form-vad` when you want VAD-backed speech region planning for sparse speech, long silences, or recordings with a lot of non-speech audio:

```rust
use scriptrs::{LongFormConfig, LongFormTranscriptionPipeline};
```
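How VAD-backed planning might be selected, again as a hypothetical sketch: the text only establishes that the `long-form-vad` feature adds VAD-backed region planning, so the `use_vad` toggle and the constructor below are illustrative names, not the crate's real API:

```rust
use scriptrs::{LongFormConfig, LongFormTranscriptionPipeline};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // `use_vad` is an illustrative toggle for the `long-form-vad` path.
    let config = LongFormConfig {
        use_vad: true,
        ..LongFormConfig::default()
    };
    // Hypothetical constructor pairing a config with the pipeline.
    let pipeline = LongFormTranscriptionPipeline::from_pretrained_with_config(config)?;

    let samples: Vec<f32> = vec![0.0; 16_000 * 600];
    println!("{}", pipeline.transcribe(&samples)?); // assumed method name
    Ok(())
}
```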
## Example

A small WAV example is included; it expects mono 16 kHz WAV input.
## Notes

- The public API is still moving
- scriptrs currently targets the exact file layout and model I/O shipped in `avencera/scriptrs-models`; if you swap in a different CoreML Parakeet export, you may need runtime code changes
- Use `long-form` for the fastest path on clean, dense speech
- Add `long-form-vad` when you need better robustness on sparse-speech or non-speech-heavy recordings