soundevents
Production-oriented Rust inference for CED AudioSet sound-event classifiers — load an ONNX model, feed it 16 kHz mono audio, get back ranked `RatedSoundEvent` predictions with names, ids, and confidences. Long clips are handled via configurable chunking.
§Highlights
- Drop-in CED inference — load any CED AudioSet ONNX model (or use the bundled `tiny` variant) and run it directly on `&[f32]` PCM samples. No Python, no preprocessing pipeline.
- Typed labels, not bare integers — every prediction comes back as an `EventPrediction` carrying a `&'static RatedSoundEvent` from `soundevents-dataset`, so you get the canonical AudioSet name, the `/m/...` id, the model class index, and the confidence in one struct.
- Compile-time class-count guarantee — the `NUM_CLASSES = 527` constant comes from the rated dataset at codegen time. If a model returns the wrong number of classes you get a typed `ClassifierError::UnexpectedClassCount` instead of a silent mismatch.
- Long-clip chunking built in — `classify_chunked`/`classify_all_chunked` window the input at a configurable hop, run inference on each chunk, and aggregate the per-chunk confidences with either `Mean` or `Max`. Defaults match CED’s 10 s training window (160 000 samples at 16 kHz), and fixed-size chunk batches can now be packed into one model call.
- Top-k via a tiny min-heap — `classify(samples, k)` does not allocate a full 527-element scores vector to find the top results.
- Batch-ready low-level API — `predict_raw_scores_batch`, `predict_raw_scores_batch_flat`, `predict_raw_scores_batch_into`, `classify_all_batch`, and `classify_batch` accept equal-length clip batches for service-layer batching.
- Bring-your-own model or bundle one — load from a path, from in-memory bytes, or enable the `bundled-tiny` feature to embed `models/tiny.onnx` directly into your binary.
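The bounded min-heap idea behind top-k selection can be sketched in plain Rust. This is an illustrative sketch, not the crate's internal code, and `top_k_indices` is a hypothetical helper: keep a heap of at most k entries so picking the best k of 527 scores never builds a fully sorted vector.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Indices of the `k` highest scores, best first, in O(n log k).
///
/// Confidences are non-negative, and for non-negative IEEE-754 floats the
/// bit pattern orders the same way as the value, so `to_bits()` works as
/// an `Ord` sort key without a float-ordering wrapper type.
fn top_k_indices(scores: &[f32], k: usize) -> Vec<usize> {
    let mut heap: BinaryHeap<Reverse<(u32, usize)>> = BinaryHeap::with_capacity(k + 1);
    for (i, &s) in scores.iter().enumerate() {
        heap.push(Reverse((s.to_bits(), i)));
        if heap.len() > k {
            heap.pop(); // evict the current minimum; only the best k remain
        }
    }
    // `into_sorted_vec` sorts `Reverse` values ascending, i.e. best score first.
    heap.into_sorted_vec()
        .into_iter()
        .map(|Reverse((_, i))| i)
        .collect()
}

fn main() {
    let scores = [0.1_f32, 0.9, 0.5, 0.7];
    println!("{:?}", top_k_indices(&scores, 2)); // indices of the two best scores
}
```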
§Quick start

```toml
[dependencies]
soundevents = "0.2"
```

```rust
use soundevents::{Classifier, Options};

fn load_mono_16k_audio(_: &str) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    Ok(vec![0.0; 16_000])
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut classifier = Classifier::from_file("soundevents/models/tiny.onnx")?;

    // Bring your own decoder/resampler — soundevents expects mono f32
    // samples at 16 kHz, in [-1.0, 1.0].
    let samples: Vec<f32> = load_mono_16k_audio("clip.wav")?;

    // Top-5 predictions for a clip up to ~10 s long.
    for prediction in classifier.classify(&samples, 5)? {
        println!(
            "{:>5.1}% {:>3} {} ({})",
            prediction.confidence() * 100.0,
            prediction.index(),
            prediction.name(),
            prediction.id(),
        );
    }
    Ok(())
}
```

§Long clips: chunked inference
`Classifier::classify_chunked` slides a window over the input and aggregates each chunk’s per-class confidences. The defaults (10 s window, 10 s hop, mean aggregation) match CED’s training setup; tune them for overlap or peak-pooling.
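Independent of the crate API, the windowing and aggregation arithmetic described here can be sketched in a few lines. `chunk_starts` and `aggregate` are illustrative names, not crate functions, and the crate's internals may differ in detail.

```rust
/// Start offsets of windows of `window` samples taken every `hop` samples,
/// with a final short tail window if trailing samples remain.
fn chunk_starts(len: usize, window: usize, hop: usize) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut start = 0;
    while start < len {
        starts.push(start);
        if start + window >= len {
            break; // this window (possibly a short tail) reaches the end
        }
        start += hop;
    }
    starts
}

/// Reduce one class's per-chunk confidences: max keeps the loudest
/// detection in any single window, mean averages over the whole clip.
fn aggregate(per_chunk: &[f32], use_max: bool) -> f32 {
    if use_max {
        per_chunk.iter().copied().fold(f32::MIN, f32::max)
    } else {
        per_chunk.iter().sum::<f32>() / per_chunk.len() as f32
    }
}

fn main() {
    // 20 s clip at 16 kHz, 10 s window, 5 s hop => windows at 0 s, 5 s, 10 s.
    println!("{:?}", chunk_starts(320_000, 160_000, 80_000));
    println!("{}", aggregate(&[0.2, 0.6, 0.4], true));
}
```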
```rust
use soundevents::{ChunkAggregation, ChunkingOptions, Classifier};

fn load_long_clip() -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    Ok(vec![0.0; 320_000])
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut classifier = Classifier::from_file("soundevents/models/tiny.onnx")?;
    let samples: Vec<f32> = load_long_clip()?;

    let opts = ChunkingOptions::default()
        // 5 s overlap (50%) between adjacent windows
        .with_hop_samples(80_000)
        // Batch up to 4 equal-length windows per session.run()
        .with_batch_size(4)
        // Keep the loudest detection in any window instead of averaging
        .with_aggregation(ChunkAggregation::Max);

    let top3 = classifier.classify_chunked(&samples, 3, opts)?;
    for prediction in top3 {
        println!("{}: {:.2}", prediction.name(), prediction.confidence());
    }
    Ok(())
}
```

§Models
The four CED variants are sourced from the mispeech Hugging Face organisation, exported to ONNX, and checked into this repo under `soundevents/models/`. You should not normally need to download anything — `git clone` gives you a working classifier out of the box.
| Variant | File | Size | Hugging Face source |
|---|---|---|---|
| tiny | `soundevents/models/tiny.onnx` | 6.4 MB | mispeech/ced-tiny |
| mini | `soundevents/models/mini.onnx` | 10 MB | mispeech/ced-mini |
| small | `soundevents/models/small.onnx` | 22 MB | mispeech/ced-small |
| base | `soundevents/models/base.onnx` | 97 MB | mispeech/ced-base |
All four expose the same input/output contract: mono f32 PCM at 16 kHz in, 527-class scores out (`SAMPLE_RATE_HZ` / `NUM_CLASSES`). They differ only in parameter count and accuracy/latency trade-off, so you can swap variants without touching application code.
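Meeting that contract means decoding (and, if needed, resampling) before calling the classifier. As a minimal sketch of the conversion step only, assuming the source is already at 16 kHz (resampling needs a dedicated resampler), interleaved stereo i16 PCM can be downmixed to mono f32 in [-1.0, 1.0]; `stereo_i16_to_mono_f32` is a hypothetical helper, not part of the crate:

```rust
/// Downmix interleaved stereo i16 PCM to mono f32 in [-1.0, 1.0].
/// Assumes the stream is already sampled at 16 kHz.
fn stereo_i16_to_mono_f32(interleaved: &[i16]) -> Vec<f32> {
    interleaved
        .chunks_exact(2)
        .map(|frame| {
            // Average both channels in i32 to avoid i16 overflow,
            // then scale by 1/32768 to land inside [-1.0, 1.0].
            let mixed = (i32::from(frame[0]) + i32::from(frame[1])) / 2;
            mixed as f32 / 32_768.0
        })
        .collect()
}

fn main() {
    // One silent frame (channels cancel) and one near-full-scale frame.
    let samples = stereo_i16_to_mono_f32(&[16_384, -16_384, 32_767, 32_767]);
    println!("{:?}", samples);
}
```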
Note — the four ONNX files together are ~135 MB. If you fork this repo and want to keep the working tree slim, consider tracking `soundevents/models/*.onnx` with Git LFS.
§Refreshing models from upstream
If upstream releases new weights, or you cloned without the model files, refetch them with:
```sh
# Requires huggingface_hub: pip install --user huggingface_hub
./scripts/download_models.sh

# Or just one variant
./scripts/download_models.sh tiny
```

The script downloads the `*.onnx` artifact from each `mispeech/ced-*` Hugging Face repo and writes it as `soundevents/models/<variant>.onnx`.
See THIRD_PARTY_NOTICES.md for upstream model sources and attribution details.
§Bundled tiny model
Enable the `bundled-tiny` feature to embed `models/tiny.onnx` into your binary — useful for CLI tools and self-contained services where you don’t want to ship a separate model file.
```toml
soundevents = { version = "0.2", features = ["bundled-tiny"] }
```

```rust
use soundevents::{Classifier, Options};

let mut classifier = Classifier::tiny(Options::default())?;
```

§Features
| Feature | Default | What you get |
|---|---|---|
| `bundled-tiny` | No | Embeds `models/tiny.onnx` into the crate so `Classifier::tiny()` works without an external file. |
The full input/output contract:
| Constant | Value | Meaning |
|---|---|---|
| `SAMPLE_RATE_HZ` | `16_000` | Required input sample rate (mono f32). |
| `DEFAULT_CHUNK_SAMPLES` | `160_000` | Default 10 s window/hop for chunked inference. |
| `NUM_CLASSES` | `527` | Number of CED output classes — derived at compile time from `RatedSoundEvent::events().len()`. |
For low-level batching, every clip passed to the `predict_raw_scores_batch*` / `classify_*_batch` methods must be non-empty and have the same sample count. `predict_raw_scores_batch_flat` returns one row-major `Vec<f32>`, and `predict_raw_scores_batch_into` lets callers reuse their own output buffer to avoid per-call result allocations. `classify_chunked` applies the same equal-length restriction internally when `ChunkingOptions::batch_size() > 1`; fixed-size windows satisfy it naturally, and the final short tail chunk automatically falls back to a smaller batch.
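The row-major layout means clip `i`'s score for class `c` sits at offset `i * NUM_CLASSES + c` in the flat buffer. A small sketch of that addressing, using a stand-in class count and illustrative helper names (not crate API):

```rust
/// One clip's contiguous score row inside a row-major [batch, classes] buffer.
fn clip_scores(flat: &[f32], num_classes: usize, clip: usize) -> &[f32] {
    &flat[clip * num_classes..(clip + 1) * num_classes]
}

/// Score of `class` for `clip` in the same buffer.
fn score_at(flat: &[f32], num_classes: usize, clip: usize, class: usize) -> f32 {
    flat[clip * num_classes + class]
}

fn main() {
    // Two clips, three classes (stand-in for NUM_CLASSES = 527).
    let flat = [0.0_f32, 0.1, 0.2, 1.0, 1.1, 1.2];
    println!("{:?}", clip_scores(&flat, 3, 1)); // second clip's row
    println!("{}", score_at(&flat, 3, 1, 2)); // its third class score
}
```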
§Development
Regenerate the dataset from upstream sources:
```sh
cargo xtask codegen
```

Run the test suite:

```sh
cargo test
```

§License
soundevents is licensed under the terms of both the MIT license and the Apache License (Version 2.0).

See LICENSE-APACHE and LICENSE-MIT for details. Bundled third-party model attributions and source licenses are documented in THIRD_PARTY_NOTICES.md.
Copyright (c) 2026 FinDIT studio authors.
Structs§
- `ChunkingOptions` - Options for chunked inference over long clips.
- `Classifier` - CED sound event classifier.
- `EventPrediction` - A single classification result with both model-space and ontology-space metadata.
- `Options` - Options for constructing a `Classifier` from an ONNX model on disk.

Enums§
- `ChunkAggregation` - Controls how chunked inference aggregates chunk confidences.
- `ClassifierError` - Errors from `Classifier` operations.

Constants§
- `DEFAULT_CHUNK_SAMPLES` - The default window size used by the chunked inference helpers: 10 seconds at 16 kHz.
- `NUM_CLASSES` - Number of model output classes.
- `SAMPLE_RATE_HZ` - The expected input sample rate for CED models.