omnivore-cli
🕷️ Omnivore - The Universal Web Scraper & Code Extractor
A powerful command-line tool for web scraping, code analysis, and knowledge extraction. Omnivore can crawl websites, extract code from Git repositories, and build comprehensive knowledge graphs from the data it collects.
📚 Full Documentation: ov.pranavkarra.me/docs
🌐 Website: ov.pranavkarra.me
Features
- 🌐 Advanced Web Crawling: Multi-threaded crawling with configurable depth, politeness delays, and smart content extraction
- 📊 Knowledge Graph Generation: Automatically build knowledge graphs from crawled content
- 🔍 Code Repository Analysis: Extract and analyze code from any Git repository or local directory
- 🤖 AI-Powered Extraction: Intelligent content extraction using OpenAI GPT models
- 🎯 Smart Filtering: Automatically detect project types and apply intelligent file filtering
- 📁 Multiple Output Formats: JSON, Markdown, HTML, CSV, and plain text
- 🔧 Highly Configurable: Extensive configuration options via CLI flags or config files
Installation
From crates.io
From source
Quick Start
Web Crawling
# Basic crawl with 5 workers and depth of 3
# Crawl with AI extraction
# JavaScript-rendered content with browser mode
Code Repository Analysis
# Analyze a GitHub repository
# Analyze local directory (even non-git directories)
# Filter by file types
# Exclude patterns
Configuration
# Interactive setup wizard
# Set OpenAI API key for AI features
# View current configuration
Key Commands
omnivore crawl
Crawl websites with advanced options:
- Multi-threaded crawling with configurable workers
- Respect robots.txt and politeness delays
- Extract structured data, metadata, and content
- Support for JavaScript-rendered pages (browser mode)
- AI-powered intelligent extraction
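To make "respect robots.txt and politeness delays" concrete, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. It is an illustration of the general technique, not Omnivore's actual implementation: the robots.txt content, the `plan_fetch` helper, and the 1-second fallback delay are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(parser: RobotFileParser, user_agent: str, url: str,
               default_delay: float = 1.0):
    """Return (allowed, delay_seconds) for a single URL.

    The delay is the site's Crawl-delay when one is declared,
    otherwise the assumed fallback `default_delay`.
    """
    allowed = parser.can_fetch(user_agent, url)
    declared = parser.crawl_delay(user_agent)
    delay = float(declared) if declared is not None else default_delay
    return allowed, delay


parser = build_parser(ROBOTS_TXT)
docs_plan = plan_fetch(parser, "Omnivore/1.0", "https://example.com/docs/")
private_plan = plan_fetch(parser, "Omnivore/1.0", "https://example.com/private/x")
```

A polite crawler would skip any URL where `allowed` is false and sleep for `delay` seconds between requests to the same host.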
omnivore git
Extract and analyze code repositories:
- Smart project type detection (Next.js, Python, Rust, etc.)
- Intelligent file filtering (skip node_modules, build artifacts, etc.)
- Organized output with project structure and metadata
- Support for both Git repositories and regular directories
- Multiple output formats (text, JSON)
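The filtering idea above can be sketched in a few lines of Python. This is only an illustration of skipping dependency and build directories: the `SKIPPED_DIRS` set and the `keep_file` and `filter_files` helpers are hypothetical, and the rules Omnivore applies for each detected project type may differ.

```python
from pathlib import PurePosixPath

# Assumed skip list for illustration only; not taken from Omnivore's source.
SKIPPED_DIRS = {"node_modules", ".git", "target", "dist", "build", "__pycache__"}


def keep_file(path: str, include_exts=None) -> bool:
    """Decide whether a repository-relative path should be extracted."""
    p = PurePosixPath(path)
    # Reject the file if any parent directory is on the skip list.
    if any(part in SKIPPED_DIRS for part in p.parts[:-1]):
        return False
    # Optionally restrict to a set of extensions such as {".rs", ".toml"}.
    if include_exts is not None and p.suffix not in include_exts:
        return False
    return True


def filter_files(paths, include_exts=None):
    """Return the paths that survive filtering, preserving input order."""
    return [p for p in paths if keep_file(p, include_exts)]


sample = ["src/main.rs", "node_modules/x/index.js", "target/debug/app", "README.md"]
kept = filter_files(sample)
rust_only = filter_files(sample, include_exts={".rs"})
```

Checking directory names rather than full path prefixes keeps the rule independent of where in the tree a `node_modules` or `target` directory appears.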
omnivore config
Manage configuration:
- Set API keys for AI features
- Configure default settings
- Export/import configurations
Advanced Usage
AI-Powered Extraction
# Use GPT-4 for intelligent content extraction
Browser Mode for JavaScript Sites
# Crawl JavaScript-heavy sites
Building Knowledge Graphs
# Generate a knowledge graph from crawled data
Code Repository Analysis with Smart Defaults
# Omnivore automatically detects project type and applies smart filters
# Override smart defaults
Configuration File
Create a .omnivore.toml file in your project or home directory:
[]
= 10
= 3
= true
= "Omnivore/1.0"
[]
= true
= 10485760 # 10MB
= true
[]
= "gpt-4o-mini"
= 2000
Examples
Crawl documentation site and extract API references
Analyze a TypeScript project
Create a knowledge graph from a website
# Step 1: Crawl the website
# Step 2: Generate knowledge graph
Environment Variables
- OMNIVORE_CONFIG_DIR: Custom configuration directory
- OPENAI_API_KEY: OpenAI API key for AI features
- OMNIVORE_USER_AGENT: Custom user agent for crawling
- OMNIVORE_LOG_LEVEL: Logging level (debug, info, warn, error)
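If you launch Omnivore from a wrapper script, a small pre-flight check on these variables can catch typos early. The Python sketch below is one possible approach: the `read_settings` helper is hypothetical, and the `info` default for the log level is an assumption, since the default is not stated here.

```python
import os

# The four levels listed for OMNIVORE_LOG_LEVEL.
VALID_LOG_LEVELS = ("debug", "info", "warn", "error")


def read_settings(env=None):
    """Collect Omnivore-related environment variables into a dict.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    # Assumption: fall back to "info" when the variable is unset.
    level = env.get("OMNIVORE_LOG_LEVEL", "info").lower()
    if level not in VALID_LOG_LEVELS:
        raise ValueError(f"unsupported OMNIVORE_LOG_LEVEL: {level!r}")
    return {
        "config_dir": env.get("OMNIVORE_CONFIG_DIR"),
        # Report only whether the key is present; never echo the secret.
        "openai_api_key_set": bool(env.get("OPENAI_API_KEY")),
        "user_agent": env.get("OMNIVORE_USER_AGENT"),
        "log_level": level,
    }


example = read_settings({"OMNIVORE_LOG_LEVEL": "DEBUG", "OPENAI_API_KEY": "sk-test"})
```

Reporting only whether `OPENAI_API_KEY` is set, rather than its value, keeps the key out of logs.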
Documentation
For comprehensive documentation, examples, and guides, visit: ov.pranavkarra.me/docs
Contributing
Contributions are welcome! Please check out our contributing guidelines.
License
This project is dual-licensed under either:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.