shinkai-translator

CLI-first tool for translating video subtitles with LLMs through OpenAI-compatible APIs, including hosted providers, self-hosted gateways, and local runners that expose /v1/chat/completions.

The primary workflow is: pass a video file, let the CLI probe the available subtitle streams, extract the best subtitle track, translate it, and mux the translated subtitle back into the video with translated track metadata. Direct subtitle-file input is still supported as a secondary path.
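For example, once defaults are configured, a single command covers that whole path:

shinkai-translator translate episode.mkv --output episode.pt-br.mkv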

When a video only contains PGS subtitle tracks, the CLI can extract the .sup stream, run OCR to produce a text subtitle, translate it, and mux the translated result back into the video. --subtitle-input remains available when you want to override the extracted subtitle with an external text subtitle.

The Rust library remains secondary and exists to embed the subtitle translation engine and media helpers in Rust applications, especially media-server style workflows.

Current scope

  • CLI tool for video subtitle translation.
  • Automatic subtitle stream selection from video containers.
  • Subtitle extraction and remux through ffprobe and ffmpeg.
  • Native PaddleOCR (PP-OCRv5) for PGS subtitle streams extracted from video containers.
  • OCR debug command for inspecting raw recognition output at specific timestamps.
  • Subtitle formats: SRT, VTT, ASS/SSA.
  • External text subtitle override with --subtitle-input.
  • OpenAI-compatible provider support.
  • Config-driven local and remote endpoint support.
  • Config-driven batch translation concurrency.
  • ASS-aware classification for dialogue, karaoke, and song-like cues.
  • Optional classification report output for ASS files.
  • Default config file plus a ratatui editor for provider, OCR, and translation defaults.
  • Direct subtitle-file input as a secondary workflow.
  • Reusable Rust crate for embedding the same engine when needed.

Out of scope right now

  • Speech-to-text or audio transcription.
  • OCR from arbitrary video frames outside subtitle streams.
  • Translating unsupported image-based subtitle streams such as VobSub.
  • Native non-OpenAI provider adapters.

If the input video only has PGS subtitle streams, the CLI extracts the selected .sup stream, runs OCR, and continues from the generated text subtitle.

If you already have a better OCR result, or the image-based subtitle format is not supported, pass a text subtitle with --subtitle-input and the CLI will still mux the translated result back into the video.

Why this project exists

Most subtitle translation tools treat ASS as plain text and damage karaoke timing, lyric styling, or event structure. This project is stricter:

  • ASS parsing is lossless-oriented and preserves original Dialogue: structure.
  • Classification of ASS cues is separate from parsing.
  • Non-dialogue material can be preserved or marked for review instead of blindly translated.
  • Model output uses a compact numbered-line contract instead of verbose JSON prompts.

Install and build

Runtime requirements:

  • ffprobe
  • ffmpeg

Build the project:

cargo build

Install the CLI locally:

cargo install --path .

Run tests:

cargo test

Once the config file is set up, run the translator directly without installing:

cargo run -- translate episode.mkv

The CLI now loads defaults from:

~/.config/shinkai-translator/config.toml

Open the built-in config editor:

shinkai-translator config

The TUI creates the config file on save if it does not exist yet.

If you do not want to save config yet, one-off overrides such as --target-language, --model, and --base-url still work, but they are intentionally hidden from the default help output.

If ffmpeg or ffprobe are not in PATH, override them with:

export SHINKAI_TRANSLATOR_FFMPEG_BIN=/custom/path/ffmpeg
export SHINKAI_TRANSLATOR_FFPROBE_BIN=/custom/path/ffprobe

CLI overview

The binary name is shinkai-translator.

The default help output is intentionally short. Once ~/.config/shinkai-translator/config.toml is configured, the common path is:

shinkai-translator validate episode.mkv
shinkai-translator translate episode.mkv --output episode.pt-br.mkv

Advanced one-off overrides still exist, but they are hidden from the default help so the CLI stays focused on the config-driven workflow.

Configure defaults once

shinkai-translator config

The TUI edits the default config file at ~/.config/shinkai-translator/config.toml.

Inside the TUI:

  • Enter opens a modal editor or searchable selector for the current field.
  • s or Ctrl+S saves immediately.
  • c checks the current provider by calling GET /models.
  • m fetches the provider model list into a searchable picker.
  • Base URL, OCR language, thinking mode, language defaults, and classification policies use selections instead of manual typing.

Minimal example:

[provider]
base_url = "https://api.openai.com/v1"
model = "gpt-4o-mini"
thinking_mode = "off"

[translation]
target_language = "Portuguese (Brazil)"
max_batch_items = 8
max_batch_characters = 4000
max_parallel_batches = 5

[ocr]
language = "english"

[tools]
ffmpeg_bin = "ffmpeg"
ffprobe_bin = "ffprobe"

After that, the common workflow only needs the input path and, for translation, an optional output path.

Validate the selected subtitle stream from a video

shinkai-translator validate episode.mkv

Example output:

valid Ass subtitle extracted from video stream #3 (ass), lang=eng, title=English Full with 473 cue(s), 320 translatable, 110 preserved, 43 review

Translate a video and mux the translated subtitle back

shinkai-translator translate episode.mkv --output episode.pt-br.mkv

When --output is omitted, the CLI derives the output video name from the configured target language.

If you also want to keep the translated subtitle as a sidecar file, add the hidden one-off override --subtitle-output episode.pt-br.ass.
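For example, to produce both the muxed video and the sidecar subtitle in one run:

shinkai-translator translate episode.mkv \
  --output episode.pt-br.mkv \
  --subtitle-output episode.pt-br.ass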

Translate a PGS-only video with built-in OCR

shinkai-translator translate episode.mkv --output episode.pt-br.mkv

If the best selected subtitle track is PGS, the tool extracts the .sup, runs OCR, and continues automatically.

Set the default OCR language in the config file or TUI. If you need to force a specific stream or keep a sidecar SRT, hidden one-off overrides such as --subtitle-stream-index 4 and --subtitle-output episode.pt-br.srt still work.
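For example, forcing stream 4 and keeping the OCR-based SRT as a sidecar:

shinkai-translator translate episode.mkv \
  --subtitle-stream-index 4 \
  --output episode.pt-br.mkv \
  --subtitle-output episode.pt-br.srt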

Override OCR with an external text subtitle

Hidden one-off override:

shinkai-translator translate episode.mkv \
  --subtitle-input episode.ocr.srt \
  --output episode.pt-br.mkv \
  --subtitle-output episode.pt-br.srt

Use this when the built-in OCR result is not good enough or when the video contains an unsupported image-based subtitle format.

Translate a subtitle file directly

shinkai-translator translate episode.ass --output episode.pt-br.ass

Translate all files in a folder

Pass a directory path to translate every video and subtitle file in it. Output paths are derived automatically for each file:

shinkai-translator translate /path/to/season-1/

Files that fail do not stop the rest of the batch; each error is printed and the command exits with a non-zero status if any file failed.

Dry-run without writing a file

shinkai-translator translate episode.mkv --dry-run

Inspect OCR output at a specific timestamp

Useful when you want to see exactly what the OCR engine sees and produces for a given moment:

shinkai-translator debug-ocr episode.mkv \
  --at 00:03:10 \
  --output ./ocr-debug

The --at flag accepts HH:MM:SS or HH:MM:SS.mmm. The --window flag (default 30) controls how many seconds around the target timestamp are scanned.
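For example, widening the scan window to 60 seconds around a millisecond-precise timestamp:

shinkai-translator debug-ocr episode.mkv \
  --at 00:03:10.500 \
  --window 60 \
  --output ./ocr-debug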

To inspect an explicit time range instead:

shinkai-translator debug-ocr episode.mkv \
  --from 00:03:00 --to 00:04:00 \
  --output ./ocr-debug

You can also pass a raw .sup file directly:

shinkai-translator debug-ocr episode.sup --at 00:03:10

For each display set in range the command writes:

  • NNNN_<timestamp>_raw.png — full RGBA frame rendered from the PGS bitmap
  • NNNN_<timestamp>_prepared.png — grayscale RGB image fed to PaddleOCR (alpha-composited, scaled if small)
  • NNNN_<timestamp>_annotated.png — same image with red bounding boxes around each detected text region
  • NNNN_<timestamp>_ocr.txt — per-block bbox coordinates, recognized text, grouping into lines, and final output
  • index.txt — summary table of all processed frames with timestamps, state, and first OCR line

Use a specific subtitle stream when needed

Hidden one-off override:

shinkai-translator translate episode.mkv --subtitle-stream-index 4

--subtitle-stream-index cannot be combined with --subtitle-input.

Write a validation report

Hidden one-off override:

shinkai-translator validate episode.mkv --report report.json

Reports are available only when the selected subtitle stream resolves to ASS/SSA.

Configure provider credentials and endpoint

The recommended path is to set them in the config TUI or in ~/.config/shinkai-translator/config.toml.

The CLI still accepts one-off provider overrides like --base-url, --model, --api-key, and --target-language, but they are hidden from the default help because they are no longer the primary workflow.
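For a quick one-off run against a different endpoint, those hidden overrides can still be combined on the command line, for example:

shinkai-translator translate episode.mkv \
  --base-url http://localhost:11434/v1 \
  --model llama3.1 \
  --target-language "Portuguese (Brazil)"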

You can still keep the API key in the environment:

export SHINKAI_TRANSLATOR_API_KEY="your-token"

Example TOML for a local endpoint:

[provider]
base_url = "http://localhost:11434/v1"
model = "llama3.1"

ASS classification policy

ASS handling is split into two phases:

  1. Parsing and rendering preserve the file structure.
  2. Classification decides whether a cue should be translated, preserved, or reviewed.

Three dispositions are supported:

  • translate
  • preserve
  • review

Defaults

  • Karaoke cues: preserve
  • Explicit song markers from metadata: preserve
  • Inferred song runs: review

Recommended config keys

[classification]
karaoke_policy = "preserve"
explicit_song_policy = "preserve"
inferred_song_policy = "review"

Advanced classification marker overrides are still supported, but they are intentionally no longer part of the primary CLI surface.

Example: emit a classification report during validation

shinkai-translator validate episode.mkv --report report.json

The generated report contains:

  • cue kind
  • cue disposition
  • confidence
  • reason
  • start/end timestamps
  • text preview
  • aggregate summary counts

Provider behavior

The thinking_mode setting controls how reasoning-capable backends are driven:

  • off: request final answers only, using known provider-specific controls to disable reasoning when they exist.
  • on: allow provider reasoning mode if supported, but still require final visible output in compact numbered form.
  • auto: do not force either behavior.

Current implementation details:

  • The provider still accepts fallback content from reasoning and reasoning_content when a backend returns final output there instead of message.content.
  • For known NVIDIA NIM + StepFun combinations, thinking-mode is translated into provider-specific request fields.

Set this in the config file or TUI under provider.thinking_mode.

Example: StepFun on NVIDIA NIM

[provider]
base_url = "https://integrate.api.nvidia.com/v1"
model = "stepfun-ai/step-3.5-flash"
thinking_mode = "off"

Prompt and output contract

The model is instructed to return one line per cue in the form:

1: translated text
2: second translated text

This is intentional:

  • less token overhead than JSON
  • easier validation
  • stable mapping back to cue order
  • no regex-based parsing required

For multiline cues, internal line breaks are represented as the literal sequence \n and restored after parsing.
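For example, a cue that renders as two lines on screen travels as a single numbered line:

1: first translated line\nsecond translated line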

Parallel batching

Translation is chunked and sent in batches. The main knobs live in the config file:

[translation]
max_batch_items = 24
max_batch_characters = 16000
max_parallel_batches = 5

The pipeline currently runs up to max_parallel_batches API requests concurrently.
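For example, with max_batch_items = 24 and max_parallel_batches = 5, a 240-cue track is split into at least 10 batches and at most 5 requests are in flight at any time; max_batch_characters can split batches further before they are sent.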

Library embedding (secondary)

The library still exposes subtitle-document translation primitives first. The end-to-end video workflow is currently centered on the CLI, with reusable media helpers available under shinkai_translator::media.

Async API

use std::sync::Arc;

use shinkai_translator::{
    OpenAiCompatibleProvider, ProviderConfig, TranslationOptions, Translator,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let provider = OpenAiCompatibleProvider::new(ProviderConfig {
        base_url: "https://api.openai.com/v1".to_owned(),
        model: "gpt-4o-mini".to_owned(),
        api_key: std::env::var("SHINKAI_TRANSLATOR_API_KEY").ok(),
        ..ProviderConfig::default()
    })?;

    let translator = Translator::new(Arc::new(provider));
    let result = translator
        .translate_file_path("episode.ass", &TranslationOptions::default())
        .await?;

    println!("{}", result.rendered());
    Ok(())
}

Blocking API

use std::sync::Arc;

use shinkai_translator::{
    BlockingTranslator, OpenAiCompatibleProvider, ProviderConfig, TranslationOptions, Translator,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let provider = OpenAiCompatibleProvider::new(ProviderConfig::default())?;
    let translator = Translator::new(Arc::new(provider));
    let blocking = BlockingTranslator::new(translator)?;

    let result = blocking.translate_file_path("episode.srt", &TranslationOptions::default())?;
    println!("{}", result.rendered());
    Ok(())
}

Configure ASS classification from code

use shinkai_translator::{AssClassificationPolicy, CueDisposition, TranslationOptions};

let options = TranslationOptions {
    ass_classification_policy: AssClassificationPolicy {
        karaoke_policy: CueDisposition::Preserve,
        explicit_song_policy: CueDisposition::Preserve,
        inferred_song_policy: CueDisposition::Translate,
        ..AssClassificationPolicy::default()
    },
    ..TranslationOptions::default()
};

Development notes

  • The ASS parser and classifier avoid regex-based parsing.
  • Classification is policy-driven and separated from parsing.
  • Tests cover parser round-trip, ASS classification, compact prompt parsing, parallel batching, CLI integration, video extraction/mux orchestration, classification-report output, and thinking-mode wiring.

Known limitations

  • Only OpenAI-compatible providers are implemented today.
  • Classification reports are currently emitted only when the selected subtitle stream is ASS/SSA.
  • Inferred song detection is heuristic by design; use policy overrides when your subtitle style is known.
  • Built-in OCR currently targets PGS subtitle streams using PP-OCRv5 MNN models; other image-based subtitle formats still require an external text subtitle.
  • The first OCR run downloads PP-OCRv5 detection and recognition models into ~/.cache/shinkai-translator/paddleocr/; subsequent runs use the local cache.
  • OCR speed and accuracy depend on subtitle styling and CPU resources.
  • The library API is still subtitle-centric even though reusable media helpers are available.

Suggested workflow for media-server integration

  1. Run validate on the video first and inspect the ASS report when available.
  2. Tune classification markers or policies for your subtitle style.
  3. Run translate on the video with the tuned policy.
  4. Keep the standalone translated subtitle with --subtitle-output when you want a sidecar file.
  5. Use the muxed output video as the final artifact.
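
Putting it together, a minimal pass over one episode using the commands shown above might look like:

shinkai-translator validate episode.mkv --report report.json
shinkai-translator translate episode.mkv \
  --output episode.pt-br.mkv \
  --subtitle-output episode.pt-br.ass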