gitbook2text
A CLI tool and a Rust library for crawling GitBook sites, downloading their pages, and converting them to Markdown and plain text.
✨ What's New in v0.3.0
- 🕷️ Automatic Crawling: Automatically discovers all pages of a GitBook
- ✅ GitBook Verification: Detects if a site is indeed a GitBook before crawling
- 🚀 All-in-One Mode: Crawl and download in a single command
- 📋 Improved CLI Interface: Clear subcommands with clap
🚀 Installation
As a CLI Tool
As a Dependency
Add this to your Cargo.toml:
[dependencies]
gitbook2text = "0.3"
📖 Usage
CLI
Full Mode (Recommended)
Crawls and downloads all pages in a single command:
Crawl Only Mode
Generates the links.txt file with all found links:
# With a custom output file
Download Only Mode
Downloads pages from an existing links file:
# With a custom file
Legacy Mode (Backward Compatible)
Without a subcommand, downloads from links.txt:
Structure of Generated Files
Files are saved in:
- data/md/: Original Markdown files
- data/txt/: Cleaned text files
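To make the two-directory layout concrete, here is a minimal sketch of how a page URL could map to one file under data/md/ and one under data/txt/. Only the two directory names come from this README; `page_slug`, `output_paths`, and the slug scheme (path segments joined with `-`, `index` for the site root) are hypothetical, and the file names the tool actually writes may differ.

```rust
use std::path::PathBuf;

/// Derives a file stem from a page URL.
/// Hypothetical naming scheme, for illustration only.
fn page_slug(url: &str) -> String {
    // Drop the scheme ("https://"), if present.
    let without_scheme = url.split_once("://").map_or(url, |(_, rest)| rest);
    // Drop the host: keep what follows the first '/'.
    let path = without_scheme.split_once('/').map_or("", |(_, p)| p);
    // Drop any query string or fragment.
    let path = path
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Join the remaining path segments with '-'.
    let slug = path.trim_matches('/').replace('/', "-");
    if slug.is_empty() {
        "index".to_string()
    } else {
        slug
    }
}

/// Returns the assumed (markdown, text) output paths for a page URL.
fn output_paths(url: &str) -> (PathBuf, PathBuf) {
    let slug = page_slug(url);
    (
        PathBuf::from("data/md").join(format!("{slug}.md")),
        PathBuf::from("data/txt").join(format!("{slug}.txt")),
    )
}
```

Under these assumptions, `output_paths("https://docs.example.com/guides/getting-started/")` yields `data/md/guides-getting-started.md` and `data/txt/guides-getting-started.txt`.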
Library
Crawling a GitBook
use ;
async
Download and Convert
use ;
async
🔧 Features
- ✅ Smart crawling: Automatically discovers all pages of a documentation site
- ✅ GitBook verification: Detects GitBook sites via their specific markers
- ✅ Concurrent downloading: Processes multiple pages simultaneously
- ✅ Markdown to text conversion: Clean content extraction
- ✅ Advanced cleaning: Removes special GitBook tags
- ✅ Code block support: Preserves titles and content
- ✅ Normalization: Uniform spaces and characters
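The cleaning and normalization features can be pictured with a small standard-library sketch. It assumes that "special GitBook tags" means template tags of the form `{% ... %}` (such as `{% hint style="info" %}` and `{% endhint %}`) and that normalization means collapsing whitespace. The function names below are hypothetical and are not the crate's API; the real implementation may work differently, and this sketch does not preserve code block titles.

```rust
/// Removes template tags of the form `{% ... %}` from `input`.
/// Hypothetical helper, not part of the gitbook2text API.
fn strip_gitbook_tags(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find("{%") {
        // Copy everything before the opening delimiter.
        out.push_str(&rest[..start]);
        match rest[start..].find("%}") {
            // Skip past the closing delimiter and keep scanning.
            Some(end) => rest = &rest[start + end + 2..],
            // Unterminated tag: keep the remaining text untouched.
            None => {
                out.push_str(&rest[start..]);
                return out;
            }
        }
    }
    out.push_str(rest);
    out
}

/// Collapses whitespace runs within each line, trims line ends,
/// and keeps at most one blank line between paragraphs.
fn normalize_whitespace(input: &str) -> String {
    let mut lines: Vec<String> = Vec::new();
    for line in input.lines() {
        // split_whitespace trims the line and collapses inner runs.
        let collapsed = line.split_whitespace().collect::<Vec<_>>().join(" ");
        // Skip leading blank lines and repeated blank lines.
        if collapsed.is_empty() && lines.last().map_or(true, |l| l.is_empty()) {
            continue;
        }
        lines.push(collapsed);
    }
    // Drop a trailing blank line, if any.
    if lines.last().map_or(false, |l| l.is_empty()) {
        lines.pop();
    }
    lines.join("\n")
}

/// Applies both steps: tag removal first, then whitespace normalization.
fn clean_markdown(input: &str) -> String {
    normalize_whitespace(&strip_gitbook_tags(input))
}
```

Stripping tags before normalizing means the blank lines left behind by removed tags are collapsed in the same pass. Because each line is trimmed, this approach suits plain-text output rather than Markdown that must keep indentation.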
🎯 Use Cases
- 📚 Archive a complete documentation site
- 🔍 Index content for a search engine
- 🤖 Prepare data for model training
- 📊 Analyze the structure of documentation
- 💾 Create documentation backups
📋 Practical Examples
Archiving Complete Documentation
# All in one
# Or step by step
Use with an Automated Workflow
#!/bin/bash
# backup-docs.sh
GITBOOK_URL="https://docs.example.com"
BACKUP_DIR="backups/"
📚 API Documentation
For the full API documentation, visit docs.rs/gitbook2text.
🤝 Contribute
Contributions are welcome! Feel free to open an issue or a pull request.
📝 Changelog
See CHANGELOG.md for the version history.
📄 License
This project is dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)