gemini-tts-cli 0.1.1

gemini-tts-cli

Agent-friendly Gemini text-to-speech CLI for expressive scripts, voices, tags, languages, and audio files.

It is built around practical use by AI agents: the binary explains itself with agent-info, emits JSON envelopes when piped, keeps audio out of stdout, diagnoses its own setup, and gives agents voice/tag/script guidance instead of exposing only a raw API call.

Install

cargo install gemini-tts-cli

For compressed output formats (MP3, M4A, FLAC), install ffmpeg. With Homebrew:

brew install ffmpeg

Quick Start

gemini-tts-cli auth import-env
gemini-tts-cli doctor --live

gemini-tts-cli speak "Say warmly: Gemini TTS is ready." --voice Kore -o ready.wav
gemini-tts-cli speak "[whispers] This part is quiet." --voice Achernar -o whisper.mp3 --format mp3

WAV and raw PCM are written directly. MP3, M4A, and FLAC use ffmpeg.
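
A wrapper script can check up front whether a requested output needs ffmpeg. The sketch below assumes the format can be inferred from a lowercase output file extension; needs_ffmpeg and check_encoder are hypothetical helpers, not commands provided by the CLI.

```shell
# Hypothetical helper, not part of gemini-tts-cli: decide from the output
# file extension whether the requested format goes through ffmpeg.
needs_ffmpeg() {
  case "${1##*.}" in
    mp3|m4a|flac) echo yes ;;   # compressed formats are encoded with ffmpeg
    *)            echo no ;;    # wav and raw pcm are written directly
  esac
}

# Warn early instead of failing halfway through a job.
check_encoder() {
  if [ "$(needs_ffmpeg "$1")" = yes ] && ! command -v ffmpeg >/dev/null 2>&1; then
    echo "ffmpeg not found; install it or write WAV instead" >&2
    return 1
  fi
}
```

Calling check_encoder whisper.mp3 before a batch job surfaces a missing encoder before any API request is made.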

Gemini TTS does not expose separate per-language voice IDs. Google documents 30 prebuilt voice names as voice timbres, and the model auto-detects the transcript language. For Italian, use an Italian transcript plus --language Italian or --language it, and add accent direction such as --accent "heavy Italian accent" when the accent matters.

Agent Workflows

Discover the command contract:

gemini-tts-cli agent-info

Choose a voice:

gemini-tts-cli voices list
gemini-tts-cli voices recommend "warm expert narrator for medical guidance"

Find tags and prompt recipes:

gemini-tts-cli tags list
gemini-tts-cli tags search whisper
gemini-tts-cli tags recipes

Build a structured prompt before generation:

gemini-tts-cli script "Welcome back. The audio pipeline is ready." \
  --style "warm expert narrator with a slight smile" \
  --accent "British English from London" \
  --tag "[warmly]" \
  --out prompt.txt

Generate from a script:

gemini-tts-cli speak prompt.txt --text-file --voice Sulafat -o narration.wav

Multi-speaker dialogue:

gemini-tts-cli speak dialogue.txt --text-file \
  --speaker Host=Kore \
  --speaker Guest=Puck \
  -o dialogue.mp3

For multi-speaker output, transcript lines should start with the exact speaker names passed to --speaker:

Host: Welcome back.
Guest: [excitedly] This is the good part.
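
Because the names must match exactly, a quick pre-flight check can catch typos before generation. The function below is a hypothetical sketch, not part of the CLI: it treats everything before the first colon on a line as a speaker label, so a narration line that happens to contain a colon would be reported as well.

```shell
# Hypothetical pre-flight check, not part of gemini-tts-cli: print every
# speaker label used in a transcript ("Name: line") that was not declared.
# Usage: undeclared_speakers dialogue.txt Host Guest
undeclared_speakers() {
  transcript=$1; shift
  # Take the text before the first colon on each labelled line, de-duplicated.
  sed -n 's/^\([A-Za-z][A-Za-z0-9 _-]*\):.*/\1/p' "$transcript" | sort -u |
  while IFS= read -r name; do
    found=no
    for declared in "$@"; do
      [ "$name" = "$declared" ] && found=yes
    done
    [ "$found" = yes ] || echo "$name"
  done
}
```

Empty output means every label in the file matches one of the declared names. The comparison is case-sensitive, so "host" is reported when only "Host" was declared.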

Prompt Quality

Gemini 3.1 Flash TTS responds well to a clear structure:

Synthesize speech for the performance defined below. The audio profile, scene,
director notes, cast, and context are direction only. Do not speak them. Speak
only the lines under #### TRANSCRIPT.

# AUDIO PROFILE: Clear narrator

## THE SCENE
A clean studio recording for direct listener comprehension.

### DIRECTOR'S NOTES
Style: warm, precise, expressive without overacting.
Pacing: medium pace with deliberate pauses.
Accent: British English from London.
Language: English.

#### TRANSCRIPT
[warmly] Welcome back. [short pause] The audio pipeline is ready.

Use director notes for global tone. Use square-bracket tags for local changes:

[warmly] [whispers] [shouting] [short pause] [very slow] [sighs] [laughs]
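
When a prompt is assembled by hand rather than with the script command, the same layout can be templated in shell. This is a minimal sketch: build_prompt is a hypothetical helper, and whether the script command emits exactly this layout is an assumption.

```shell
# Hypothetical helper, not part of gemini-tts-cli: assemble a prompt that
# follows the layout above from a style, an accent, and a transcript.
# Usage: build_prompt "warm, precise" "British English from London" "[warmly] Hi."
build_prompt() {
  style=$1; accent=$2; transcript=$3
  cat <<EOF
Synthesize speech for the performance defined below. The audio profile, scene,
director notes, cast, and context are direction only. Do not speak them. Speak
only the lines under #### TRANSCRIPT.

# AUDIO PROFILE: Clear narrator

## THE SCENE
A clean studio recording for direct listener comprehension.

### DIRECTOR'S NOTES
Style: $style
Accent: $accent

#### TRANSCRIPT
$transcript
EOF
}
```

Redirect the output to a file such as prompt.txt and pass it to speak with --text-file. Global direction stays in the notes, and bracket tags stay inside the transcript argument.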

Run lint before important jobs:

gemini-tts-cli lint prompt.txt --text-file

The linter checks for long takes, tag inflation, app-specific [[tts]] wrappers, and multi-speaker name mismatches. These checks follow current Gemini TTS docs and recurring patterns in public issues: preserve Gemini tags, keep tags in English, avoid over-specifying every sentence, and split long takes when quality matters.
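
One way to split a long take is by paragraph. The sketch below assumes a plain transcript with blank lines between paragraphs; split_takes is a hypothetical helper, not a CLI command, and it does not repeat any director notes or preamble in each piece.

```shell
# Hypothetical helper, not part of gemini-tts-cli: split a transcript into one
# file per blank-line-separated paragraph so each take can be generated alone.
# Usage: split_takes long.txt take   -> take-01.txt, take-02.txt, ...
split_takes() {
  # An empty RS puts awk in paragraph mode: each record is one paragraph.
  awk -v prefix="$2" 'BEGIN { RS = "" }
    { file = sprintf("%s-%02d.txt", prefix, NR); print > file; close(file) }' "$1"
}
```

Each resulting file can then be linted and passed to speak with --text-file on its own.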

doctor --live checks both Google endpoints used by the CLI: it reads the model metadata endpoint and makes a tiny generateContent request, then verifies that Gemini returned non-empty PCM audio.

Configuration

Config lives at:

~/.config/gemini-tts-cli/config.toml

Commands:

gemini-tts-cli config init
gemini-tts-cli config show
gemini-tts-cli config set defaults.voice Sulafat
gemini-tts-cli config set defaults.audio_format mp3
gemini-tts-cli config get keys.api_key
gemini-tts-cli update --check

API key sources:

export GEMINI_API_KEY=...
gemini-tts-cli auth import-env
gemini-tts-cli auth status

Secrets are masked in command output. The config file is written with 0600 permissions on Unix.
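
A script that provisions machines may want to confirm the file mode itself. The following is a sketch under the assumption of a Unix system with a standard ls; is_private_file is a hypothetical helper, not something the CLI ships.

```shell
# Hypothetical check, not part of gemini-tts-cli: succeed only when the path
# is a regular file readable and writable by its owner and nobody else (0600).
is_private_file() {
  [ "$(ls -ld "$1" | cut -c1-10)" = "-rw-------" ]
}
```

For example, is_private_file ~/.config/gemini-tts-cli/config.toml || echo "config is too open" >&2 flags a file whose permissions were loosened after it was written.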

JSON Contract

In a terminal, commands render human-readable output. When piped or with --json, commands emit a JSON envelope:

{
  "version": "1",
  "status": "success",
  "data": {
    "audio": {
      "path": "ready.wav",
      "format": "wav",
      "sample_rate": 24000,
      "channels": 1
    }
  }
}

Audio is always written to --out. Stdout stays metadata-only so agents can pipe it safely.
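
A calling script can read the status and audio path from the envelope. The sketch below uses sed so that it has no extra dependencies; envelope_field is a hypothetical helper that only handles simple string values, and a real JSON parser is the safer choice for anything beyond that.

```shell
# Hypothetical helper, not part of gemini-tts-cli: pull one string field out
# of an envelope read from stdin. Only string values are handled.
# Usage: envelope_field path < envelope.json
envelope_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Example with the success envelope shown above, stored in a variable.
envelope='{
  "version": "1",
  "status": "success",
  "data": { "audio": { "path": "ready.wav", "format": "wav" } }
}'
status=$(printf '%s\n' "$envelope" | envelope_field status)
audio_path=$(printf '%s\n' "$envelope" | envelope_field path)
echo "$status $audio_path"
```

In real use the envelope variable would hold the captured stdout of a speak command. Numeric fields such as sample_rate are not matched by this pattern.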

Exit Codes

Code  Meaning
0     Success
1     Transient, IO, network, or audio encoder error
2     Config or credential error
3     Bad input
4     Rate limited
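
These codes let a caller decide whether retrying makes sense. The mapping below is one possible policy, not behavior defined by the CLI; next_action and its labels are hypothetical.

```shell
# Hypothetical policy, not part of gemini-tts-cli: map an exit code from the
# table above to what a calling script might do next.
next_action() {
  case "$1" in
    0) echo done ;;
    1) echo retry ;;          # transient, IO, network, or encoder error
    4) echo retry-later ;;    # rate limited: wait before trying again
    2) echo fix-config ;;     # config or credential error: retrying will not help
    3) echo fix-input ;;      # bad input: retrying will not help
    *) echo unknown ;;
  esac
}
```

A caller would run the command, capture $?, and pass it to next_action. Code 1 also covers encoder failures, so a capped number of retries is safer than an unbounded loop.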

Development

cargo test
cargo run -- agent-info
cargo run -- doctor --live

License

MIT