# stream-scraper

`stream-scraper` is a Rust crate that provides an asynchronous web-crawling utility. It processes URLs, extracts page content and child URLs, and retries failed requests. It is built on the tokio runtime for asynchronous operations and the reqwest library for HTTP requests.
## Features
- Asynchronous crawling using `tokio`
- Extracts URLs from `<a>` tags in HTML
- Retries failed requests up to a specified number of attempts
- Limits the number of concurrent requests using a semaphore (sketched below)
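
The concurrency cap follows the usual `tokio::sync::Semaphore` pattern. The sketch below illustrates the idea only; it is not the crate's internal code, and the URL and limit of 10 are placeholders:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Allow at most 10 requests in flight at any one time.
    let semaphore = Arc::new(Semaphore::new(10));
    let urls = vec!["https://example.com".to_string()];

    let handles: Vec<_> = urls
        .into_iter()
        .map(|url| {
            let semaphore = Arc::clone(&semaphore);
            tokio::spawn(async move {
                // Hold a permit for the duration of the request; dropping it frees a slot.
                let _permit = semaphore.acquire().await.expect("semaphore closed");
                match reqwest::get(url.as_str()).await {
                    Ok(resp) => println!("{}: {}", url, resp.status()),
                    Err(err) => eprintln!("{}: {}", url, err),
                }
            })
        })
        .collect();

    for handle in handles {
        let _ = handle.await;
    }
}
```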
## Installation

Add this to your `Cargo.toml`:
```toml
[dependencies]
stream-scraper = "0.1.0"
tokio = { version = "1", features = ["full"] }
reqwest = { version = "0.11", features = ["json"] }
futures = "0.3"
```
## Usage
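
A minimal sketch, assuming the crate root exports `scrape`, that the returned stream is consumed via `futures::StreamExt`, and that `ProcessedUrl` exposes a `url` field (see Functionality below):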
```rust
use stream_scraper::scrape;
use futures::StreamExt;

#[tokio::main]
async fn main() {
    // 3 retry attempts per failed request, at most 10 requests in flight at once.
    let mut results = scrape(vec!["https://example.com".to_string()], 3, 10);

    while let Some(page) = results.next().await {
        println!("fetched {}", page.url);
    }
}
```
## Functionality
The `scrape` function:

- Takes a vector of URLs, a retry attempt limit, and a maximum number of concurrent requests.
- Returns a stream of `ProcessedUrl` structures.
The `ProcessedUrl` structure:

- Contains the original URL, the parent URL (if any), the HTML content of the page, and a list of child URLs extracted from `<a>` tags. A plausible shape is sketched below.
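
This README does not show the actual definitions, so the following is a hypothetical sketch consistent with the bullets above; the field and parameter names are assumptions, and the function body is a placeholder:

```rust
use futures::stream::{self, Stream};

/// Hypothetical shape of the result type, based on the description above.
pub struct ProcessedUrl {
    pub url: String,            // the URL that was fetched
    pub parent: Option<String>, // the URL on which this one was discovered, if any
    pub content: String,        // raw HTML of the page
    pub children: Vec<String>,  // URLs extracted from <a> tags
}

/// Hypothetical signature consistent with the description above.
pub fn scrape(
    urls: Vec<String>,
    retry_attempts: usize,
    max_concurrency: usize,
) -> impl Stream<Item = ProcessedUrl> {
    let _ = (retry_attempts, max_concurrency);
    // Placeholder body so the sketch compiles; the real crate performs the crawl.
    stream::iter(urls.into_iter().map(|url| ProcessedUrl {
        url,
        parent: None,
        content: String::new(),
        children: Vec::new(),
    }))
}
```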
## Example
This example demonstrates how the `scrape` function might be used to process a list of URLs and consume results as they complete; the `url` and `children` fields follow the hypothetical `ProcessedUrl` shape sketched above.
```rust
use stream_scraper::scrape;
use futures::StreamExt;

#[tokio::main]
async fn main() {
    let urls = vec![
        "https://example.com".to_string(),
        "https://www.rust-lang.org".to_string(),
    ];

    // Retry each failed request up to 3 times; fetch at most 10 URLs concurrently.
    let mut stream = scrape(urls, 3, 10);

    while let Some(page) = stream.next().await {
        println!("{} -> {} child URLs", page.url, page.children.len());
    }
}
```
## Documentation
Refer to the inline documentation for detailed usage and examples, in particular for the `ProcessedUrl` structure.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request.
## License
This project is licensed under the MIT License.