gemini-tts-cli
Agent-friendly Gemini text-to-speech CLI for expressive scripts, voices, tags, languages, and audio files.
It is built around practical use by AI agents: the binary explains itself with agent-info, emits JSON envelopes when piped, keeps audio out of stdout, diagnoses its own setup, and gives agents voice/tag/script guidance instead of exposing only a raw API call.
Install
For compressed output formats, install ffmpeg:
Quick Start
WAV and raw PCM are written directly. MP3, M4A, and FLAC use ffmpeg.
Gemini TTS does not expose separate per-language voice IDs. Google documents 30
prebuilt voice names as voice timbres, and the model auto-detects the transcript
language. For Italian, use an Italian transcript plus --language Italian or
--language it, and add accent direction such as --accent "heavy Italian accent" when the accent matters.
Agent Workflows
Discover the command contract:
Choose a voice:
Find tags and prompt recipes:
Build a structured prompt before generation:
Generate from a script:
Multi-speaker dialogue:
For multi-speaker output, transcript lines should use the exact speaker names:
Host: Welcome back.
Guest: [excitedly] This is the good part.
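Because speaker names must match exactly, it can help to check a transcript before sending it. The following is a minimal Python sketch of such a check; `unknown_speakers` is a hypothetical helper written for illustration, not part of gemini-tts-cli, and it assumes every speaker line starts with `Name: `.

```python
import re

# Hypothetical helper, not part of gemini-tts-cli.
# Assumption: a speaker line looks like "Name: text", with no brackets in the name.
SPEAKER_LINE = re.compile(r"^([^:\[\]]+):\s")


def unknown_speakers(transcript: str, speakers: list[str]) -> list[str]:
    """Return speaker labels in the transcript that are not declared exactly."""
    declared = set(speakers)
    unknown: list[str] = []
    for line in transcript.splitlines():
        match = SPEAKER_LINE.match(line)
        if match is None:
            continue  # blank line or a line without a speaker label
        name = match.group(1)
        if name not in declared and name not in unknown:
            unknown.append(name)
    return unknown


transcript = (
    "Host: Welcome back.\n"
    "Guest: [excitedly] This is the good part.\n"
    "host: Thanks for joining."
)
# The lowercase "host" label does not match the declared name "Host".
print(unknown_speakers(transcript, ["Host", "Guest"]))
```

The comparison is deliberately case-sensitive, since the requirement is an exact match.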
Prompt Quality
Gemini 3.1 Flash TTS responds well to a clear structure:
Synthesize speech for the performance defined below. The audio profile, scene,
director notes, cast, and context are direction only. Do not speak them. Speak
only the lines under #### TRANSCRIPT.
# AUDIO PROFILE: Clear narrator
## THE SCENE
A clean studio recording for direct listener comprehension.
### DIRECTOR'S NOTES
Style: warm, precise, expressive without overacting.
Pacing: medium pace with deliberate pauses.
Accent: British English from London.
Language: English.
#### TRANSCRIPT
[warmly] Welcome back. [short pause] The audio pipeline is ready.
Use director notes for global tone. Use square-bracket tags for local changes:
[warmly] [whispers] [shouting] [short pause] [very slow] [sighs] [laughs]
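The prompt layout above can also be assembled programmatically. This is an illustrative Python sketch only: `build_prompt` and its parameters are hypothetical names, and the CLI's own prompt builder may produce different output.

```python
# Hypothetical sketch of assembling the structured prompt shown above.
# The section headings are taken from the example; the function itself is
# an assumption and not the CLI's implementation.
PREAMBLE = (
    "Synthesize speech for the performance defined below. The audio profile, "
    "scene, director notes, cast, and context are direction only. Do not speak "
    "them. Speak only the lines under #### TRANSCRIPT."
)


def build_prompt(profile: str, scene: str, notes: dict[str, str], transcript: str) -> str:
    """Join the direction sections and the transcript into one prompt string."""
    note_lines = [f"{key}: {value}" for key, value in notes.items()]
    sections = [
        PREAMBLE,
        f"# AUDIO PROFILE: {profile}",
        "## THE SCENE",
        scene,
        "### DIRECTOR'S NOTES",
        *note_lines,
        "#### TRANSCRIPT",
        transcript,
    ]
    return "\n".join(sections)


prompt = build_prompt(
    profile="Clear narrator",
    scene="A clean studio recording for direct listener comprehension.",
    notes={
        "Style": "warm, precise, expressive without overacting.",
        "Pacing": "medium pace with deliberate pauses.",
        "Accent": "British English from London.",
        "Language": "English.",
    },
    transcript="[warmly] Welcome back. [short pause] The audio pipeline is ready.",
)
print(prompt)
```

Global direction stays in the notes dictionary, while local changes stay as square-bracket tags inside the transcript string.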
Run lint before important jobs:
The linter checks for long takes, tag inflation, app-specific [[tts]] wrappers, and multi-speaker name mismatches. These checks follow current Gemini TTS docs and public issue patterns: preserve Gemini tags, keep tags in English, avoid over-specifying every sentence, and split long takes when quality matters.
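To make the kinds of checks concrete, here is a rough Python sketch of three of them. It is not the CLI's linter: the function name, the regular expressions, and both thresholds (`max_words`, `max_tags_per_sentence`) are assumptions chosen for illustration.

```python
import re

# Illustrative sketch only; the real linter's rules and thresholds may differ.
WRAPPER = re.compile(r"\[\[/?tts[^\]]*\]\]")  # app-specific [[tts]] wrappers
TAG = re.compile(r"\[[^\[\]]+\]")             # single square-bracket tags


def lint_transcript(
    text: str, max_words: int = 300, max_tags_per_sentence: float = 1.0
) -> list[str]:
    """Return human-readable warnings for a transcript (assumed thresholds)."""
    warnings: list[str] = []
    if WRAPPER.search(text):
        warnings.append("app-specific [[tts]] wrapper found")
    without_wrappers = WRAPPER.sub("", text)
    tags = TAG.findall(without_wrappers)
    spoken = TAG.sub("", without_wrappers)
    words = spoken.split()
    sentences = [s for s in re.split(r"[.!?]+", spoken) if s.strip()]
    if len(words) > max_words:
        warnings.append(f"long take: {len(words)} words, consider splitting")
    if sentences and len(tags) / len(sentences) > max_tags_per_sentence:
        warnings.append(f"tag inflation: {len(tags)} tags in {len(sentences)} sentences")
    return warnings


clean = "[warmly] Welcome back. [short pause] The audio pipeline is ready."
noisy = "[[tts]] [warmly] [slowly] Hi."
print(lint_transcript(clean))
print(lint_transcript(noisy))
```

Counting tags per sentence is one simple proxy for over-specification; other measures would work equally well.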
doctor --live checks both Google endpoints used by the CLI: it reads the model
metadata endpoint and makes a tiny generateContent request, then verifies that
Gemini returned non-empty PCM audio.
Configuration
Config lives at:
~/.config/gemini-tts-cli/config.toml
Commands:
API key sources:
Secrets are masked in command output. The config file is written with 0600 permissions on Unix.
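For readers unfamiliar with the 0600 convention, the Python sketch below shows one common way to create a file that only its owner can read and write on Unix. It illustrates the technique in general and is not the CLI's own code; `write_private` and the file contents are hypothetical.

```python
import os
import stat
import tempfile


def write_private(path: str, text: str) -> None:
    """Write text to path with owner-only read/write permissions (Unix)."""
    # Create the file with mode 0600 so it is never briefly world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        # Enforce the mode even if the file already existed with looser bits.
        os.fchmod(handle.fileno(), 0o600)
        handle.write(text)


# Demonstration in a temporary directory; the key name is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.toml")
    write_private(demo_path, 'example = "value"\n')
    demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
    print(oct(demo_mode))
```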
JSON Contract
In a terminal, commands render human-readable output. When piped or with --json, commands emit a JSON envelope:
Audio is always written to --out. Stdout stays metadata-only so agents can pipe it safely.
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Transient, IO, network, or audio encoder error |
| 2 | Config or credential error |
| 3 | Bad input |
| 4 | Rate limited |
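An agent wrapping the CLI might branch on these codes. The Python sketch below shows one possible policy; the meanings come from the table, while the choice of which codes to retry is an assumption made here, not behavior documented by the CLI.

```python
# Meanings copied from the exit code table.
EXIT_MEANINGS = {
    0: "success",
    1: "transient, IO, network, or audio encoder error",
    2: "config or credential error",
    3: "bad input",
    4: "rate limited",
}

# Assumption: transient failures and rate limits are worth retrying,
# while config and input errors need a human or agent to fix something first.
RETRYABLE = {1, 4}


def describe(code: int) -> str:
    """Return the documented meaning of an exit code, or a fallback."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def should_retry(code: int) -> bool:
    """Decide whether a failed run is worth retrying under this sketch's policy."""
    return code in RETRYABLE


for code in (0, 2, 4):
    print(code, describe(code), should_retry(code))
```

In practice the value passed in would be the `returncode` of a finished `subprocess.run` call.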
Development
License
MIT