turboscraper 0.1.0

A high-performance, concurrent web scraping framework for Rust with built-in support for retries, storage backends, and concurrent request handling
Documentation
# TurboScraper

A high-performance, concurrent web scraping framework for Rust, powered by Tokio. TurboScraper provides a robust foundation for building scalable web scrapers with built-in support for retries, storage backends, and concurrent request handling.

## Features

- 🚀 **High Performance**: Built on Tokio for async I/O and concurrent request handling
- 🔄 **Smart Retries**: Configurable retry mechanisms for both HTTP requests and parsing failures
- 💾 **Multiple Storage Backends**: Support for MongoDB and filesystem storage
- 🎯 **Type-safe**: Leverages Rust's type system for reliable data extraction
- 🔧 **Configurable**: Extensive configuration options for crawling behavior
- 🛡️ **Error Handling**: Comprehensive error handling and reporting
- 📊 **Statistics**: Built-in request statistics and performance monitoring

## Quick Start

Add TurboScraper to your `Cargo.toml`:
```toml
[dependencies]
turboscraper = { version = "0.1.0" }
```

### Basic Spider Example

Here's a simple spider that scrapes book information:
```rust
use turboscraper::prelude::*;

pub struct BookSpider {
    config: SpiderConfig,
    storage: Box<dyn StorageBackend>,
    storage_config: Box<dyn StorageConfig>,
}

#[async_trait]
impl Spider for BookSpider {
    fn name(&self) -> String {
        "book_spider".to_string()
    }

    fn start_urls(&self) -> Vec<Url> {
        vec![Url::parse("https://books.toscrape.com/").unwrap()]
    }

    async fn parse(
        &self,
        response: SpiderResponse,
        url: Url,
        depth: usize,
    ) -> ScraperResult<ParseResult> {
        match response.callback {
            SpiderCallback::Bootstrap => {
                // Parse book list and return new requests
                let new_requests = parse_book_list(&response.body)?;
                Ok(ParseResult::Continue(new_requests))
            }
            SpiderCallback::ParseItem => {
                // Parse and store book details
                self.parse_book_details(response).await?;
                Ok(ParseResult::Skip)
            }
            _ => Ok(ParseResult::Skip),
        }
    }
}
```

### Running the Spider

```rust
use turboscraper::storage::factory::{create_storage, StorageType};

#[tokio::main]
async fn main() -> ScraperResult<()> {
    // Initialize storage
    let storage = create_storage(StorageType::Disk {
        path: "data/books".to_string(),
    }).await?;

    // Create and configure spider
    let spider = BookSpider::new(storage).await?;
    let config = SpiderConfig::default()
        .with_depth(2)
        .with_concurrency(10);
    let spider = spider.with_config(config);

    // Create crawler and run spider
    let scraper = Box::new(HttpScraper::new());
    let crawler = Crawler::new(scraper);
    crawler.run(spider).await?;
    Ok(())
}
```

## Advanced Features

### Retry Configuration

TurboScraper supports sophisticated retry mechanisms:
```rust
let mut retry_config = RetryConfig::default();
retry_config.categories.insert(
    RetryCategory::HttpError,
    CategoryConfig {
        max_retries: 3,
        initial_delay: Duration::from_secs(1),
        max_delay: Duration::from_secs(60),
        conditions: vec![
            RetryCondition::Request(RequestRetryCondition::StatusCode(429)),
        ],
        backoff_policy: BackoffPolicy::Exponential { factor: 2.0 },
    },
);
```

### Storage Backends

TurboScraper supports multiple storage backends:

- **MongoDB**: For scalable document storage
- **Filesystem**: For local file storage
- **Custom**: Implement the `StorageBackend` trait for custom storage solutions

### Error Handling

Comprehensive error handling with custom error types:
```rust
match result {
    Ok(ParseResult::Continue(requests)) => // Handle new requests
    Ok(ParseResult::RetryWithSameContent(response)) => // Retry parsing
    Err(ScraperError::StorageError(e)) => // Handle storage errors
    Err(ScraperError::HttpError(e)) => // Handle HTTP errors
}
```

## Best Practices

1. **Respect Robots.txt**: Always check and respect website crawling policies
2. **Rate Limiting**: Use appropriate delays between requests
3. **Error Handling**: Implement proper error handling and retries
4. **Data Validation**: Validate scraped data before storage
5. **Resource Management**: Monitor memory and connection usage

## Contributing

Contributions are welcome! Please feel free to submit pull requests.

## License

This project is licensed under the MIT License - see the [LICENSE file](LICENSE) for details.