# TurboScraper
A high-performance, concurrent web scraping framework for Rust, powered by Tokio. TurboScraper provides a robust foundation for building scalable web scrapers with built-in support for retries, storage backends, and concurrent request handling.
## Features
- 🚀 High Performance: Built on Tokio for async I/O and concurrent request handling
- 🔄 Smart Retries: Configurable retry mechanisms for both HTTP requests and parsing failures
- 💾 Multiple Storage Backends: Support for MongoDB and filesystem storage
- 🎯 Type-safe: Leverages Rust's type system for reliable data extraction
- 🔧 Configurable: Extensive configuration options for crawling behavior
- 🛡️ Error Handling: Comprehensive error handling and reporting
- 📊 Statistics: Built-in request statistics and performance monitoring
## Quick Start
Add TurboScraper to your `Cargo.toml`:

```toml
[dependencies]
turboscraper = { version = "0.1.0" }
```
### Basic Spider Example

Here's a simple spider that scrapes book information. The trait, type, and selector-helper names below are illustrative; check the crate documentation for the exact signatures:

```rust
use turboscraper::*;

// The item type this spider produces; deriving Serialize lets
// storage backends persist it.
#[derive(Debug, Serialize)]
struct Book {
    title: String,
    price: String,
}

struct BookSpider;

impl Spider for BookSpider {
    type Item = Book;

    fn start_urls(&self) -> Vec<String> {
        vec!["https://books.toscrape.com/".into()]
    }

    fn parse(&self, response: &Response) -> ParseResult<Book> {
        // Emit one item per book card on the listing page.
        Ok(response
            .css("article.product_pod")
            .map(|card| Book {
                title: card.css_text("h3 a"),
                price: card.css_text(".price_color"),
            })
            .collect())
    }
}
```
### Running the Spider

Launch the spider from an async entry point (again a sketch; the builder method names are illustrative):

```rust
use turboscraper::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let scraper = TurboScraper::builder()
        .concurrency(16)
        .storage(FsStorage::new("./output"))
        .build();

    scraper.run(BookSpider).await?;
    Ok(())
}
```
## Advanced Features
### Retry Configuration

TurboScraper supports sophisticated retry mechanisms, with policies configurable per failure category. For example (type and field names are illustrative):

```rust
let mut retry_config = RetryConfig::default();
retry_config.categories.insert(
    ErrorCategory::HttpStatus(429),
    RetryPolicy { max_retries: 5, backoff: Backoff::Exponential },
);
```
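Whatever the exact configuration surface, exponential backoff is the usual policy behind retries of this kind. A minimal, library-agnostic sketch of the delay calculation (the constants are arbitrary examples, not TurboScraper defaults):

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based):
/// base * 2^attempt, capped at `max`.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt)).min(max)
}
```

Capping the delay keeps a long run of failures from stalling a worker indefinitely, while the exponential growth quickly backs off from a struggling server.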
### Storage Backends
TurboScraper supports multiple storage backends:
- MongoDB: For scalable document storage
- Filesystem: For local file storage
- Custom: Implement the `StorageBackend` trait for custom storage solutions
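As a rough illustration of what a custom backend might look like, the sketch below uses a simplified, synchronous stand-in for the `StorageBackend` trait (the real trait is most likely async and will have a different signature) and persists items as JSON lines on the local filesystem:

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::PathBuf;

// Simplified, synchronous stand-in for the real trait;
// the actual method names and types will differ.
trait StorageBackend {
    fn store(&self, item: &str) -> io::Result<()>;
}

/// Appends each scraped item as one JSON line in a local file.
struct JsonLinesBackend {
    path: PathBuf,
}

impl StorageBackend for JsonLinesBackend {
    fn store(&self, item: &str) -> io::Result<()> {
        if let Some(dir) = self.path.parent() {
            fs::create_dir_all(dir)?;
        }
        let mut file = OpenOptions::new()
            .create(true)
            .append(true)
            .open(&self.path)?;
        writeln!(file, "{item}")
    }
}
```

Opening the file in append mode keeps each `store` call independent, which matters when many concurrent tasks flush items to the same backend.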
### Error Handling

Comprehensive error handling with custom error types. A typical pattern looks like this (the error variants shown are illustrative):

```rust
match result {
    Ok(item) => println!("scraped: {item:?}"),
    Err(ScraperError::Http { status, url }) => eprintln!("HTTP {status} for {url}"),
    Err(ScraperError::Parse(msg)) => eprintln!("parse failure: {msg}"),
    Err(e) => eprintln!("other error: {e}"),
}
```
## Best Practices
- Respect robots.txt: Always check and respect website crawling policies
- Rate Limiting: Use appropriate delays between requests
- Error Handling: Implement proper error handling and retries
- Data Validation: Validate scraped data before storage
- Resource Management: Monitor memory and connection usage
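For the rate-limiting point above, here is a minimal, framework-agnostic sketch of a fixed-interval limiter using only the standard library (the framework itself may expose its own delay settings, which should be preferred):

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Enforces a minimum interval between successive requests.
struct RateLimiter {
    min_interval: Duration,
    last: Option<Instant>,
}

impl RateLimiter {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last: None }
    }

    /// Blocks until at least `min_interval` has passed since the last call.
    fn wait(&mut self) {
        if let Some(last) = self.last {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                thread::sleep(self.min_interval - elapsed);
            }
        }
        self.last = Some(Instant::now());
    }
}
```

Call `wait()` before each request; in an async context the same idea applies with `tokio::time::sleep` in place of `thread::sleep`.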
## Contributing
Contributions are welcome! Please feel free to submit pull requests.
## License
This project is licensed under the MIT License - see the LICENSE file for details.