# oximedia-caption-gen
Advanced caption and subtitle generation — speech alignment, line breaking, WCAG compliance, and speaker diarization for OxiMedia
[](https://crates.io/crates/oximedia-caption-gen)
[](https://docs.rs/oximedia-caption-gen)
[](LICENSE)
Part of the [OxiMedia](https://github.com/cool-japan/oximedia) sovereign media framework.
Version: 0.1.6 — 2026-04-26 — 470 tests
## Features
- Frame-accurate speech-to-caption alignment with word-level and segment-level timestamps
- Automatic segment merging (short segments) and splitting (long segments at sentence/word boundaries)
- Caption block construction with configurable max lines and characters per line
- Greedy and optimal (Knuth-Plass DP) line-breaking algorithms minimizing raggedness
- Reading speed (CPS) computation, validation, and duration adjustment
- WCAG 2.1 compliance checking: caption coverage (1.2.2), live latency (1.2.4), reading speed, minimum duration, gap detection
- Speaker diarization: turn merging, per-speaker statistics, dominant speaker detection, crosstalk detection
- Speaker-to-caption block assignment based on temporal overlap
- Voice activity ratio computation with interval union
- Pure Rust with zero C/Fortran dependencies
## Quick Start
```toml
[dependencies]
oximedia-caption-gen = "0.1"
```
### Speech Alignment and Caption Blocks
```rust
use oximedia_caption_gen::alignment::{
TranscriptSegment, WordTimestamp, align_to_frames, build_caption_blocks,
};
let mut segment = TranscriptSegment {
text: "Hello world".to_string(),
start_ms: 0,
end_ms: 2000,
speaker_id: None,
words: vec![
WordTimestamp { word: "Hello".to_string(), start_ms: 0, end_ms: 1000, confidence: 0.95 },
WordTimestamp { word: "world".to_string(), start_ms: 1000, end_ms: 2000, confidence: 0.92 },
],
};
// Map words to frame numbers at 25 fps
let frames = align_to_frames(&segment, 25.0).unwrap();
assert_eq!(frames[0].0, 0); // "Hello" at frame 0
assert_eq!(frames[1].0, 25); // "world" at frame 25
// Build caption blocks with 2 lines, 42 chars per line
let blocks = build_caption_blocks(&[segment], 2, 42);
assert_eq!(blocks[0].id, 1);
```
### Line Breaking
```rust
use oximedia_caption_gen::line_breaking::{greedy_break, optimal_break, LineBalance};
let text = "This is a sample caption text for demonstration";
// Greedy: break at last space before max width
let greedy = greedy_break(text, 20);
// Optimal (Knuth-Plass DP): minimize squared slack for balanced lines
let optimal = optimal_break(text, 20);
// Optimal produces better-balanced lines
let opt_balance = LineBalance::balance_factor(&optimal);
let greed_balance = LineBalance::balance_factor(&greedy);
assert!(opt_balance <= greed_balance + 0.01);
```
### WCAG 2.1 Compliance
```rust
use oximedia_caption_gen::wcag::{run_all_checks, compliance_score, WcagLevel};
use oximedia_caption_gen::alignment::{CaptionBlock, CaptionPosition};
let blocks = vec![
CaptionBlock {
id: 1, start_ms: 0, end_ms: 2000,
lines: vec!["Hello world".to_string()],
speaker_id: None, position: CaptionPosition::Bottom,
},
CaptionBlock {
id: 2, start_ms: 2000, end_ms: 4000,
lines: vec!["How are you".to_string()],
speaker_id: None, position: CaptionPosition::Bottom,
},
];
let violations = run_all_checks(&blocks, 4000, WcagLevel::AA);
let score = compliance_score(&violations); // 100.0 if no violations
```
### Speaker Diarization
```rust
use oximedia_caption_gen::diarization::{
DiarizationResult, Speaker, SpeakerTurn,
speaker_stats, dominant_speaker, assign_speakers_to_blocks,
CrosstalkDetector, voice_activity_ratio,
};
let mut result = DiarizationResult::new();
result.speakers.insert(1, Speaker {
id: 1, name: Some("Alice".to_string()), gender: None, language: None,
});
result.turns = vec![
SpeakerTurn { speaker_id: 1, start_ms: 0, end_ms: 5000 },
SpeakerTurn { speaker_id: 2, start_ms: 5000, end_ms: 10000 },
];
let stats = speaker_stats(&result);
let dominant = dominant_speaker(&result); // Some(1) or Some(2)
let var = voice_activity_ratio(&result, 12000); // fraction of content with speech
let overlaps = CrosstalkDetector::find_overlapping_turns(&result);
```
## Modules
### `alignment`
Core types `WordTimestamp` (word text, start/end ms, ASR confidence) and `TranscriptSegment` (text, timing, optional speaker, word list). `align_to_frames` maps segments to `(frame_number, subtitle_line)` pairs at a given FPS, supporting both word-level and segment-level alignment. `merge_short_segments` absorbs segments shorter than a threshold into adjacent segments. `split_long_segments` breaks oversized segments at sentence then word boundaries with proportional timestamp redistribution. `build_caption_blocks` wraps segments into `CaptionBlock` values with greedy line wrapping and configurable max lines. `CaptionPosition` supports Bottom, Top, and Custom(x%, y%) placement.
### `line_breaking`
`greedy_break` wraps text at the last space before `max_width`. `optimal_break` uses a Knuth-Plass-inspired dynamic programming algorithm minimizing `sum((max_width - line_width)^2)` for more balanced output. `LineBreakConfig` holds broadcast-standard defaults (42 chars/line, 17 CPS, 2 lines, 80ms gap). `compute_cps` and `reading_speed_ok` validate reading speed in characters per second. `adjust_duration_for_reading` computes the minimum display time for a given CPS limit. `LineBalance::balance_factor` scores line balance from 0.0 (perfect) to 1.0 (maximally unbalanced). `rebalance_lines` attempts to improve balance by re-running the optimal algorithm.
### `wcag`
WCAG 2.1 accessibility compliance checks organized by success criteria:
- **`check_caption_coverage`** (1.2.2, Level A) -- detects gaps exceeding 2 seconds between caption blocks
- **`check_live_latency`** (1.2.4, Level AA) -- validates live caption latency is under 3 seconds
- **`check_sign_language`** (1.2.6, Level AAA) -- placeholder (not machine-checkable)
- **`check_cps`** -- validates reading speed against BBC/Netflix 17 CPS guideline
- **`check_min_duration`** -- enforces minimum 1-second display time per block
- **`check_gap_duration`** -- finds all inter-block gaps exceeding a threshold
`run_all_checks` executes all checks appropriate for a target `WcagLevel`. `compliance_score` computes a 0-100 score with configurable penalties per severity level. `WcagViolation` provides rule ID, message, severity level, and optional timestamp.
### `diarization`
`Speaker` metadata (id, name, gender, language) and `SpeakerTurn` (speaker id, start/end ms) with overlap detection. `DiarizationResult` aggregates speakers and turns. `merge_consecutive_turns` joins same-speaker turns separated by less than 500ms. `speaker_stats` computes per-speaker total time, turn count, and average turn duration. `dominant_speaker` identifies the speaker with the most airtime. `assign_speakers_to_blocks` maps speakers to caption blocks by maximum temporal overlap. `CrosstalkDetector::find_overlapping_turns` detects simultaneous speech. `voice_activity_ratio` computes the fraction of content with active speech using interval union. `format_speaker_label` generates display names.
## Architecture
The caption generation pipeline flows from raw transcript segments through alignment, merging/splitting, line breaking, and finally caption block construction. WCAG checks can be run as a post-processing validation step on the generated blocks. Diarization is an orthogonal pipeline that can be integrated at the block level via `assign_speakers_to_blocks`. All functions are stateless and operate on slices, making them composable in streaming or batch workflows. The only external dependency is `thiserror`.
## License
Licensed under the terms specified in the workspace root.
Copyright (c) COOLJAPAN OU (Team Kitasan)