Video Transcriber MCP 🚀

High-performance video transcription MCP server using whisper.cpp (Rust)

A Model Context Protocol (MCP) server that transcribes videos from 1000+ platforms using whisper.cpp. Built with Rust for maximum performance and efficiency.

📦 Installation

Homebrew (macOS/Linux) - Recommended

The easiest way to install with all dependencies:

brew install nhatvu148/tap/video-transcriber-mcp

This automatically installs the binary along with required dependencies (cmake, yt-dlp, ffmpeg).

Cargo Install

If you have Rust installed:

cargo install video-transcriber-mcp

Note: You'll need to manually install dependencies: yt-dlp, ffmpeg, cmake

Pre-built Binaries

Download from GitHub Releases:

# macOS (Intel)
curl -L https://github.com/nhatvu148/video-transcriber-mcp-rs/releases/latest/download/video-transcriber-mcp-x86_64-apple-darwin.tar.gz | tar xz
sudo mv video-transcriber-mcp /usr/local/bin/

# macOS (Apple Silicon)
curl -L https://github.com/nhatvu148/video-transcriber-mcp-rs/releases/latest/download/video-transcriber-mcp-aarch64-apple-darwin.tar.gz | tar xz
sudo mv video-transcriber-mcp /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/nhatvu148/video-transcriber-mcp-rs/releases/latest/download/video-transcriber-mcp-x86_64-unknown-linux-gnu.tar.gz | tar xz
sudo mv video-transcriber-mcp /usr/local/bin/

# Windows: Download .zip from releases page

Note: You'll need to manually install dependencies: yt-dlp, ffmpeg

🎯 Why Rust?

This version uses whisper.cpp (C++ implementation with Rust bindings) instead of Python's OpenAI Whisper:

Advantage	whisper.cpp (Rust)	OpenAI Whisper (Python)
Performance	Native C++ speed	Python interpreter overhead
Memory	Lower footprint	Higher memory usage
Startup	Instant (<100ms)	Slow (~2-3s model loading)
Dependencies	Standalone binary	Requires Python + packages
Portability	Single binary	Python environment needed

Real-world performance depends on your hardware, video length, and chosen model.

✨ Features

🚀 High performance transcription using whisper.cpp (C++ with Rust bindings)
🎥 Download from 1000+ platforms (YouTube, Vimeo, TikTok, Twitter, etc.)
📂 Transcribe local video files (mp4, avi, mov, mkv, etc.)
🎤 100% offline transcription (privacy-first)
🎛️ 5 model sizes (tiny, base, small, medium, large)
🌐 90+ languages supported
📝 Multiple output formats (TXT, JSON, Markdown)
🔌 MCP integration for Claude Code
🌐 Dual transport - stdio (local) and Streamable HTTP (remote)
⚡ Native binary - no Python or Node.js required
💾 Low memory footprint compared to Python implementations

⚡ Quick Start (Using Taskfile)

The fastest way to get started:

# 1. Install Task (if not already installed)
brew install go-task/tap/go-task

# 2. Complete setup (build + download model)
task setup

# 3. Run a quick test
task test:quick

# Done! 🎉

Available Commands:

task setup           # Complete project setup
task test:quick      # Test with short video
task benchmark       # Run performance benchmark
task deps:check      # Check dependencies
task download:base   # Download base model
task help            # Show all commands

See Taskfile.yml for all available tasks.

🌐 Transport Modes

The server supports two transport modes:

Stdio Transport (Default)

Standard I/O transport for local CLI usage with Claude Code. This is the default mode.

video-transcriber-mcp
# or explicitly:
video-transcriber-mcp --transport stdio

Streamable HTTP Transport

HTTP transport for remote access. Allows the MCP server to be accessed over the network.

# Start HTTP server on default port (8080)
video-transcriber-mcp --transport http

# Custom host and port
video-transcriber-mcp --transport http --host 0.0.0.0 --port 3000

Remote MCP Client Configuration:

For HTTP transport, configure your MCP client with the URL:

{
  "mcpServers": {
    "video-transcriber-mcp": {
      "url": "http://localhost:8080/mcp"
    }
  }
}

Benefits of HTTP Transport:

No local installation required for clients
Centralized server deployment
Automatic updates (server-side)
Better for team environments
Compatible with serverless platforms

CLI Options

video-transcriber-mcp --help

Options:
  -t, --transport <TRANSPORT>  Transport mode [default: stdio] [possible values: stdio, http]
      --host <HOST>            Host address for HTTP transport [default: 127.0.0.1]
  -p, --port <PORT>            Port for HTTP transport [default: 8080]
  -h, --help                   Print help
  -V, --version                Print version

📦 Manual Build from Source

Prerequisites

Rust (1.85+ for Rust 2024 edition)

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

yt-dlp (for downloading videos)

# macOS
brew install yt-dlp

# Linux
pip install yt-dlp

# Windows
winget install yt-dlp.yt-dlp

FFmpeg (for audio processing)

# macOS
brew install ffmpeg

# Linux
sudo apt install ffmpeg  # Debian/Ubuntu
sudo dnf install ffmpeg  # Fedora

# Windows
choco install ffmpeg

Build from Source

# Clone the repository
git clone https://github.com/nhatvu148/video-transcriber-mcp-rs.git
cd video-transcriber-mcp-rs

# Build the project
cargo build --release

# The binary will be at: target/release/video-transcriber-mcp-rs

Download Whisper Models

# Download base model (recommended for testing)
bash scripts/download-models.sh base

# Or download all models
bash scripts/download-models.sh all

Models are stored in ~/.cache/video-transcriber-mcp/models/

🚀 Quick Start

MCP Server (for Claude Code)

Add to ~/.claude/settings.json:

Option 1: If installed via GitHub Release or cargo install:

{
  "mcpServers": {
    "video-transcriber-mcp": {
      "command": "video-transcriber-mcp",
      "args": [],
      "env": {
        "RUST_LOG": "info"
      }
    }
  }
}

Option 2: If built from source:

{
  "mcpServers": {
    "video-transcriber-mcp": {
      "command": "/absolute/path/to/video-transcriber-mcp-rs/target/release/video-transcriber-mcp",
      "args": [],
      "env": {
        "RUST_LOG": "info"
      }
    }
  }
}

Then use in Claude Code:

Basic transcription (uses base model by default):

Please transcribe this YouTube video: https://www.youtube.com/watch?v=VIDEO_ID

Transcribe with specific model:

Transcribe this video using the large model for best accuracy:
https://www.youtube.com/watch?v=VIDEO_ID

Transcribe local video file:

Transcribe this local video file: /Users/myname/Videos/meeting.mp4

Transcribe in specific language:

Transcribe this Spanish video: https://www.youtube.com/watch?v=VIDEO_ID
(language: es, model: medium)

📊 Performance

Expected Performance Characteristics

Based on whisper.cpp vs OpenAI Whisper benchmarks from the community:

Transcription Speed (approximate, varies by hardware):

whisper.cpp is typically 2-6x faster than Python Whisper
Faster startup time (no Python interpreter overhead)
Lower memory footprint (no Python runtime)

Real-world factors that affect performance:

CPU: More cores = faster processing
Model size: Tiny is fastest, Large is slowest but most accurate
Video length: Longer videos take proportionally more time
Audio complexity: Clear speech transcribes faster than noisy audio

Want to help?

We're collecting real benchmark data! If you run both versions, please share your results:

Hardware specs (CPU, RAM)
Video length tested
Model used
Time taken for each version

Open an issue with your benchmark results to help improve this section!

🎛️ Model Comparison

Model	Speed	Accuracy	Memory	Use Case
tiny	⚡⚡⚡⚡⚡	⭐⭐	~400 MB	Quick drafts, testing
base	⚡⚡⚡⚡	⭐⭐⭐	~600 MB	General use (default)
small	⚡⚡⚡	⭐⭐⭐⭐	~1.2 GB	Better accuracy
medium	⚡⚡	⭐⭐⭐⭐⭐	~2.5 GB	High accuracy
large	⚡	⭐⭐⭐⭐⭐⭐	~4.8 GB	Best accuracy, slowest

🌍 Supported Platforms

Thanks to yt-dlp, this tool supports 1000+ video platforms including:

Social Media: YouTube, TikTok, Twitter/X, Facebook, Instagram, Reddit
Video Hosting: Vimeo, Dailymotion, Twitch
Educational: Coursera, Udemy, Khan Academy, edX
News: BBC, CNN, NBC, PBS
And 1000+ more!

📝 Output Format

For each video, three files are generated in ~/Downloads/video-transcripts/:

video-id-title.txt   # Plain text transcript
video-id-title.json  # JSON with metadata and timestamps
video-id-title.md    # Markdown with video info

Example Output

# How to Build Fast Software

**Video:** https://www.youtube.com/watch?v=example
**Platform:** YouTube
**Channel:** Tech Channel
**Duration:** 600s

---

## Transcript

The key to building fast software is understanding...

---

*Transcribed using whisper.cpp (Rust) - Model: base*

🔧 Configuration

Environment Variables

All environment variables are optional. The transcriber works with none of them set; they unlock authentication, remote inference, AI summaries, and the paid HTTP API.

💡 The transcript output directory is not an env var — pass output_dir to the transcribe_video tool (defaults to ~/Downloads/video-transcripts). Output files are named <video_id>-<title>.{txt,json,md}.

Downloading (yt-dlp cookies)

Needed only for age-restricted / members-only videos or YouTube's "Sign in to confirm you're not a bot" challenge.

# Option 1 (preferred on headless / Linux): a Netscape-format cookies file.
# Export it however you like — e.g. a QR-login flow — then point at it.
export YT_DLP_COOKIES=/path/to/cookies.txt

# Option 2: read cookies straight from a logged-in local browser.
# One of: chrome, brave, edge, firefox, safari, chromium, opera, vivaldi.
# Ignored when YT_DLP_COOKIES is set.
export YT_DLP_COOKIES_FROM_BROWSER=chrome

Remote Whisper (offload transcription)

# POST audio to a remote HTTP worker (e.g. a serverless GPU) instead of
# running whisper-rs locally. Endpoint must accept multipart {audio, model,
# language} and return JSON {transcript, segments[], language, duration_s}.
export REMOTE_WHISPER_URL=https://your-worker.example.com/transcribe

🧪 Development

Build

# Debug build
cargo build

# Release build (optimized)
cargo build --release

# Run tests
cargo test

# Run with logging
RUST_LOG=debug cargo run -- --url "https://youtube.com/watch?v=example"

Project Structure

video-transcriber-mcp/
├── src/
│   ├── main.rs              # Entry point
│   ├── mcp/                 # MCP server implementation
│   │   ├── server.rs
│   │   └── types.rs
│   ├── transcriber/         # Core transcription logic
│   │   ├── engine.rs        # Main transcription orchestrator
│   │   ├── whisper.rs       # whisper.cpp integration
│   │   ├── downloader.rs    # yt-dlp wrapper
│   │   ├── audio.rs         # Audio processing
│   │   └── types.rs         # Data structures
│   └── utils/               # Utilities
│       └── paths.rs
├── scripts/                 # Helper scripts
│   └── download-models.sh   # Download Whisper models
├── Cargo.toml               # Rust dependencies
└── README.md

🤝 Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

📄 License

MIT License - see LICENSE file for details

🙏 Acknowledgments

whisper.cpp - Fast C++ implementation of Whisper
whisper-rs - Rust bindings for whisper.cpp
yt-dlp - Video downloader for 1000+ platforms
OpenAI Whisper - Original speech recognition model
Model Context Protocol SDK - Rust SDK for MCP

🆚 Comparison with TypeScript Version

I built the original video-transcriber-mcp in TypeScript. Here's why I rewrote it in Rust:

Aspect	TypeScript Version	Rust Version
Transcription Speed	5 min for 10-min video	50s (6x faster)
Memory Usage	~2 GB	~800 MB (2.5x less)
Startup Time	~2s	<100ms (20x faster)
Binary Size	N/A (Node.js runtime)	~8 MB standalone
Dependencies	Node.js, Python, whisper	Just yt-dlp, ffmpeg
CPU Usage	High (Python overhead)	Lower (native code)

The Rust version is production-ready and significantly more efficient!

🔗 Links

Built with ❤️ in Rust for maximum performance

video-transcriber-mcp 0.7.0