Utility modules supporting research operations.
This module provides utility functions and types used throughout the library:
- deduplicate_papers: Remove duplicate papers from results using DOI matching and title similarity
- find_duplicates: Find duplicates without modifying the original list
- DuplicateStrategy: Strategy for handling duplicates (KeepFirst, KeepLast, Mark)
- HttpClient: HTTP client with built-in rate limiting
- RateLimitedRequestBuilder: Builder for rate-limited HTTP requests
- extract_text: Extract text content from PDF files
- is_available: Check if PDF extraction is available (requires poppler)
- PdfExtractError: Errors that can occur during PDF extraction
- RetryConfig: Configuration for retry logic with exponential backoff
- with_retry: Execute an operation with automatic retry on transient errors
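The DOI-and-title matching behind deduplicate_papers can be sketched in a few lines. This is an illustrative stand-in, not the crate's implementation: the Paper type, its fields, and the keep-first policy here are assumptions, and the crate's title similarity scoring is reduced to a simple normalized-title comparison.

```rust
use std::collections::HashSet;

/// Minimal paper record for illustration (hypothetical fields).
#[derive(Debug, Clone)]
struct Paper {
    doi: Option<String>,
    title: String,
}

/// Normalize a title for comparison: lowercase, alphanumeric only.
fn normalized_title(title: &str) -> String {
    title
        .chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Keep the first occurrence of each paper, matching by DOI when
/// present and by normalized title otherwise (a crude stand-in for
/// the crate's similarity scoring).
fn dedup_keep_first(papers: Vec<Paper>) -> Vec<Paper> {
    let mut seen = HashSet::new();
    papers
        .into_iter()
        .filter(|p| {
            let key = p.doi.clone().unwrap_or_else(|| normalized_title(&p.title));
            seen.insert(key)
        })
        .collect()
}

fn main() {
    let papers = vec![
        Paper { doi: Some("10.1000/xyz".into()), title: "A Study".into() },
        Paper { doi: Some("10.1000/xyz".into()), title: "A Study (preprint)".into() },
        Paper { doi: None, title: "Another  Paper".into() },
        Paper { doi: None, title: "another paper".into() },
    ];
    let unique = dedup_keep_first(papers);
    assert_eq!(unique.len(), 2);
}
```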
Deduplication
```rust
use research_master::utils::DuplicateStrategy;

// Example: deduplicate_papers takes papers and a strategy
let strategy = DuplicateStrategy::First;
assert_eq!(strategy, DuplicateStrategy::First);
```

HTTP Client with Rate Limiting
The HTTP client provides built-in rate limiting using the governor crate.
Each source can be configured with different rate limits via environment
variables (e.g., SEMANTIC_SCHOLAR_RPM for requests per minute).
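The effect of an RPM budget can be illustrated with a simplified, std-only limiter that enforces a minimum interval between requests. This is a sketch of the idea only: the real HttpClient delegates to the governor crate, and the SimpleRateLimiter type below, its from_rpm constructor, and its blocking wait method are all hypothetical.

```rust
use std::time::{Duration, Instant};

/// Simplified rate limiter: enforce a minimum interval between
/// requests, derived from a requests-per-minute budget.
struct SimpleRateLimiter {
    interval: Duration,
    last: Option<Instant>,
}

impl SimpleRateLimiter {
    /// The rpm value might come from an env var such as
    /// SEMANTIC_SCHOLAR_RPM.
    fn from_rpm(rpm: u64) -> Self {
        SimpleRateLimiter {
            interval: Duration::from_millis(60_000 / rpm.max(1)),
            last: None,
        }
    }

    /// Sleep just long enough to respect the budget, then record
    /// the call time.
    fn wait(&mut self) {
        if let Some(last) = self.last {
            let elapsed = last.elapsed();
            if elapsed < self.interval {
                std::thread::sleep(self.interval - elapsed);
            }
        }
        self.last = Some(Instant::now());
    }
}

fn main() {
    // 120 RPM -> at least 500 ms between requests.
    let mut limiter = SimpleRateLimiter::from_rpm(120);
    let start = Instant::now();
    for _ in 0..3 {
        limiter.wait();
        // issue the HTTP request here
    }
    // First call is immediate; the next two each wait ~500 ms.
    assert!(start.elapsed() >= Duration::from_millis(1000));
}
```

A production limiter (as in governor) also handles bursts and concurrent callers, which this sketch ignores.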
Retry with Backoff
```rust
use research_master::utils::RetryConfig;

let config = RetryConfig::default();
assert_eq!(config.max_attempts, 3);
```

Structs
- CacheService - Cache service for storing and retrieving cached data
- CacheStats - Statistics about the cache
- CircuitBreaker - Thread-safe circuit breaker implementation
- CircuitBreakerManager - Manager for multiple circuit breakers (one per source)
- ConcurrentPaperStream - A channel-based concurrent stream for parallel source searches
- ExtractionInfo - Get the best available extraction method with metadata
- FilterByYearStream - Stream filter for year range
- HttpClient - Shared HTTP client with sensible defaults and rate limiting
- ProgressReporter - Progress reporter with optional terminal output
- RateLimitedRequestBuilder - Rate-limited request builder (API compatible with reqwest::RequestBuilder)
- ReleaseAsset - A single release asset
- ReleaseInfo - GitHub release information
- RetryConfig - Configuration for retry behavior
- SharedProgress - Thread-safe progress tracker that can be shared across threads
- SkipStream - Stream that skips items
- TakeStream - Stream that limits the number of items
Enums
- CacheResult - Result of a cache lookup
- CircuitResult - Result of a circuit breaker operation
- CircuitState - Circuit breaker states
- DuplicateStrategy - Strategy for handling duplicates
- ExtractionMethod - Method used for PDF text extraction
- InstallationMethod - Information about the current installation method
- PdfExtractError - Errors that can occur during PDF extraction
- RetryResult - Result of a retry operation
- TransientError - Transient errors that should trigger a retry
- ValidationError - Validation error types
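The circuit breaker state machine behind CircuitBreaker and CircuitState can be sketched as a failure counter over three states. This is illustrative only: the crate's implementation is thread-safe and time-aware, while the Breaker type, its threshold field, and the allow_probe method below are assumptions for the sketch.

```rust
/// The three classic circuit breaker states.
#[derive(Debug, PartialEq)]
enum State {
    Closed,   // requests flow normally
    Open,     // requests are rejected immediately
    HalfOpen, // a single probe request is allowed
}

/// Minimal failure-counting breaker (hypothetical, single-threaded).
struct Breaker {
    state: State,
    failures: u32,
    threshold: u32,
}

impl Breaker {
    fn new(threshold: u32) -> Self {
        Breaker { state: State::Closed, failures: 0, threshold }
    }

    /// Any success resets the counter and closes the circuit.
    fn on_success(&mut self) {
        self.failures = 0;
        self.state = State::Closed;
    }

    /// Enough consecutive failures trip the circuit open.
    fn on_failure(&mut self) {
        self.failures += 1;
        if self.failures >= self.threshold {
            self.state = State::Open;
        }
    }

    /// After a cool-down, a real breaker moves Open -> HalfOpen
    /// and lets one probe request through.
    fn allow_probe(&mut self) {
        if self.state == State::Open {
            self.state = State::HalfOpen;
        }
    }
}

fn main() {
    let mut b = Breaker::new(3);
    b.on_failure();
    b.on_failure();
    b.on_failure();
    assert_eq!(b.state, State::Open);
    b.allow_probe();
    b.on_success();
    assert_eq!(b.state, State::Closed);
}
```

CircuitBreakerManager presumably keeps one such breaker per source, so a flaky API can trip without affecting the others.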
Functions
- api_retry_config - Create a default retry configuration optimized for external APIs
- cleanup_temp_files - Clean up temporary files
- collect_papers - Collect all papers from a stream into a Vec
- compute_sha256 - Compute the SHA256 hash of a file
- deduplicate_papers - Remove duplicate papers from a list
- detect_installation - Detect how the tool was installed
- download_and_extract_asset - Download and extract a release asset
- extract_text - Extract text from a PDF file using the best available method
- extract_text_simple - Extract text from a PDF file (legacy interface, discards method info)
- fast_deduplicate_papers - Fast hash-based deduplication for papers
- fetch_and_verify_sha256 - Fetch and verify the SHA256 checksum for a file
- fetch_latest_release - Fetch the latest release information from GitHub
- fetch_sha256_signature - Fetch the GPG signature for SHA256SUMS.txt
- filter_by_year - Create a stream that filters papers by year range
- find_asset_for_platform - Find the appropriate release asset for the current platform
- find_duplicates - Find duplicate papers based on DOI, title similarity, and author+year
- get_current_target - Get the target triple for the current platform
- get_extraction_info - Get information about available PDF extraction methods
- get_update_instructions - Get installation-specific update instructions
- has_poppler - Check if poppler utilities are available
- has_tesseract - Check if tesseract OCR is available
- paper_stream - Create a stream that yields papers one at a time from paginated search results
- replace_binary - Replace the current binary with a new one
- sanitize_filename - Sanitize a filename to prevent path traversal and other attacks
- sanitize_paper_id - Validate a paper ID to prevent injection attacks
- strict_rate_limit_retry_config - Create a retry configuration for sources with strict rate limits
- validate_doi - Validate and sanitize a DOI
- validate_url - Validate a URL to prevent injection and SSRF attacks
- verify_gpg_signature - Verify the GPG signature of SHA256SUMS.txt. This requires the project maintainer's public key to be in the system keyring; for CI/CD, set GPG_FINGERPRINT to the expected signer's fingerprint.
- verify_sha256 - Verify a downloaded file against its expected SHA256 hash
- with_retry - Execute an async operation with retry logic
- with_retry_detailed - Execute an async operation with retry logic that returns RetryResult
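The shape of the retry helpers above can be illustrated with a synchronous sketch. Note the differences from the real API: with_retry is async and only retries errors classified as transient (see TransientError), whereas retry_sync below is a blocking stand-in, and backoff_delay with its base/cap parameters is an assumed simplification of RetryConfig's exponential backoff.

```rust
use std::time::Duration;

/// Delay before retry `attempt` (1-based): base * 2^(attempt - 1),
/// capped at `max_ms`.
fn backoff_delay(attempt: u32, base_ms: u64, max_ms: u64) -> Duration {
    let exp = base_ms.saturating_mul(2u64.saturating_pow(attempt.saturating_sub(1)));
    Duration::from_millis(exp.min(max_ms))
}

/// Run `op` up to `max_attempts` times, sleeping with exponential
/// backoff between failed attempts.
fn retry_sync<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => std::thread::sleep(backoff_delay(attempt, 100, 30_000)),
        }
    }
}

fn main() {
    // An operation that fails twice, then succeeds on the third try.
    let mut calls = 0;
    let result: Result<u32, &str> = retry_sync(3, || {
        calls += 1;
        if calls < 3 { Err("transient") } else { Ok(42) }
    });
    assert_eq!(result, Ok(42));
    assert_eq!(calls, 3);
}
```

with_retry_detailed differs only in returning a RetryResult that also reports how many attempts were consumed.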