# spider-util

Provides utility types, traits, and implementations for the spider-lib framework.
## Overview
The spider-util crate contains fundamental data structures, error types, and utility functions that are shared across all components of the spider framework. This crate serves as the common foundation for all other spider crates, providing the basic building blocks for web scraping operations.
## Key Components
- `Request`: Represents an HTTP request with URL, method, headers, and body
- `Response`: Represents an HTTP response with status, headers, and body
- `ScrapedItem`: Trait and derive macro for defining data structures to hold scraped data
- Error Handling: Comprehensive error types for all operations
- Bloom Filter: Efficient probabilistic data structure for duplicate detection
- Utilities: Helper functions and extensions for common operations
## Architecture
This crate is designed to be lightweight and reusable, containing only the essential types and utilities needed by other spider components. It has minimal external dependencies to ensure stability and compatibility.
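To illustrate that role, a downstream spider crate would pull in spider-util as an ordinary dependency. The version numbers below are illustrative, not taken from this crate's manifest; the `url` crate is assumed because the usage examples parse URLs with `Url`:

```toml
[dependencies]
# shared foundation for all spider crates (version illustrative)
spider-util = "0.1"
# used by the examples below for URL parsing
url = "2"
```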
## Usage
```rust
use spider_util::Request;
use url::Url;

// Create a request (constructor signature assumed: takes a parsed URL)
let url = Url::parse("https://example.com").unwrap();
let request = Request::new(url);

// Define a scraped item: see the ScrapedItem section below
```
## Components
### Request
Represents an HTTP request with URL, method, headers, and body. Provides methods for constructing and manipulating requests.
Usage:

```rust
use spider_util::Request;
use url::Url;

let url = Url::parse("https://example.com").unwrap();
let mut request = Request::new(url);

// Add headers (header name and value shown are illustrative)
request.headers.insert("User-Agent".to_string(), "spider/1.0".to_string());

// Add metadata (key and value shown are illustrative)
request.meta.insert("depth".to_string(), "0".to_string());
```
### Response
Represents an HTTP response with status, headers, and body. Contains methods for extracting content and metadata from responses.
Usage:

```rust
use spider_util::Response;

// Access response properties (field names assumed)
println!("Status: {}", response.status);
println!("Headers: {:?}", response.headers);
println!("Body length: {}", response.body.len());

// Extract content with the `scraper` crate
use scraper::{Html, Selector};

let document = Html::parse_document(&response.text());
let selector = Selector::parse("h1").unwrap();
if let Some(element) = document.select(&selector).next() {
    println!("Heading: {}", element.inner_html());
}
```
### ScrapedItem
Defines the trait and associated functionality for data structures that hold scraped data. Used by spiders to define the structure of the data they extract.
Usage:

```rust
use spider_util::scraped_item;

// The macro automatically implements the necessary traits
// (attribute form and field set shown are illustrative)
#[scraped_item]
struct Product {
    name: String,
    price: f64,
}

let product = Product { name: "Widget".into(), price: 9.99 };

// Convert to JSON
let json = product.to_json_value();
```
### Error Handling
Comprehensive error types for all operations within the spider framework, providing detailed information about failures.
Error Types:

- `SpiderError`: General error type for spider operations
- `PipelineError`: Errors specific to pipeline operations
- `SerializationError`: Errors during serialization/deserialization
- `IoError`: Input/output errors
- `ConfigurationError`: Errors in configuration
Usage:

```rust
use spider_util::SpiderError;

// `some_operation` is a placeholder for any call returning Result<_, SpiderError>
match some_operation() {
    Ok(value) => println!("succeeded: {:?}", value),
    Err(err) => eprintln!("spider error: {}", err),
}
```
### Bloom Filter
Efficient probabilistic data structure for duplicate detection, useful for identifying URLs that have already been crawled.
Usage:

```rust
use spider_util::BloomFilter;

// Constructor signature assumed: expected item count, then false positive rate
let mut filter = BloomFilter::new(1000, 0.1); // 1000 expected items, 10% false positive rate

let url = "https://example.com/page";
if !filter.contains(url) {
    filter.insert(url);
    // Not seen before: schedule the URL for crawling
} else {
    // Probably already crawled: skip it
}
```
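For readers unfamiliar with the technique, here is a minimal, self-contained sketch of how a bloom filter works, using only the standard library. `TinyBloom` is a teaching aid, not spider-util's implementation; real filters size the bit array and hash count from the expected item count and false positive rate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal bloom filter: k hash functions over a fixed bit array.
struct TinyBloom {
    bits: Vec<bool>,
    k: u64,
}

impl TinyBloom {
    fn new(num_bits: usize, k: u64) -> Self {
        TinyBloom { bits: vec![false; num_bits], k }
    }

    // Derive the i-th bit index by hashing the pair (i, item).
    fn index<T: Hash>(&self, item: &T, i: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        i.hash(&mut hasher);
        item.hash(&mut hasher);
        (hasher.finish() as usize) % self.bits.len()
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        for i in 0..self.k {
            let idx = self.index(item, i);
            self.bits[idx] = true;
        }
    }

    // May report a false positive, but never a false negative:
    // an inserted item is always found.
    fn contains<T: Hash>(&self, item: &T) -> bool {
        (0..self.k).all(|i| self.bits[self.index(item, i)])
    }
}

fn main() {
    let mut filter = TinyBloom::new(1024, 3);
    filter.insert(&"https://example.com/page");
    assert!(filter.contains(&"https://example.com/page"));
}
```

The memory cost is fixed up front regardless of how many URLs are seen, which is why the structure suits duplicate detection in large crawls.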
### Utilities
Helper functions and extensions for common operations such as URL manipulation, content extraction, and data processing.
Utility Functions:

- `validate_output_dir`: Validates that an output directory exists and is writable
- `normalize_origin`: Normalizes URL origins for consistent comparison
- `extract_links`: Extracts links from HTML content
- `calculate_fingerprint`: Calculates request fingerprints for caching
Usage:

```rust
use spider_util::utils;

// Validate output directory (path argument is illustrative)
utils::validate_output_dir("./output")?;

// Normalize URL origin (argument assumed to be a parsed `Url`)
let normalized = utils::normalize_origin(&url);

// Calculate request fingerprint
let fingerprint = request.fingerprint();
```
## License
This project is licensed under the MIT License - see the LICENSE file for details.