Module utils

Utility modules supporting research operations.

This module provides utility functions and types used throughout the library:

  • deduplicate_papers: Remove duplicate papers from results using DOI matching and title similarity
  • find_duplicates: Find duplicates without modifying the original list
  • DuplicateStrategy: Strategy for handling duplicates (keep the first occurrence, keep the last, or mark duplicates)
  • HttpClient: HTTP client with built-in rate limiting
  • RateLimitedRequestBuilder: Builder for rate-limited HTTP requests
  • extract_text: Extract text content from PDF files
  • is_available: Check if PDF extraction is available (requires poppler)
  • PdfExtractError: Errors that can occur during PDF extraction
  • RetryConfig: Configuration for retry logic with exponential backoff
  • with_retry: Execute an operation with automatic retry on transient errors

§Deduplication

use research_master::utils::DuplicateStrategy;

// Example: deduplicate_papers takes papers and a strategy
let strategy = DuplicateStrategy::First;
assert_eq!(strategy, DuplicateStrategy::First);
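
The idea behind keep-first deduplication can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `SketchPaper`, `normalize_title`, and `dedup_keep_first` are hypothetical names, and an exact match on a normalized title stands in for the title-similarity comparison mentioned above.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified paper record (not the crate's real paper type).
#[derive(Debug, Clone, PartialEq)]
pub struct SketchPaper {
    pub title: String,
    pub doi: Option<String>,
}

/// Lowercase the title and keep only alphanumeric characters, so that
/// differences in case, spacing, and punctuation do not prevent a match.
pub fn normalize_title(title: &str) -> String {
    title
        .chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Keep the first occurrence of each paper. A paper counts as a duplicate
/// when its DOI (case-insensitive) or its normalized title was seen before.
pub fn dedup_keep_first(papers: Vec<SketchPaper>) -> Vec<SketchPaper> {
    let mut seen_dois: HashSet<String> = HashSet::new();
    let mut seen_titles: HashSet<String> = HashSet::new();
    let mut unique = Vec::new();

    for paper in papers {
        let doi_key = paper.doi.as_ref().map(|d| d.trim().to_lowercase());
        let title_key = normalize_title(&paper.title);

        let doi_seen = doi_key.as_ref().map_or(false, |d| seen_dois.contains(d));
        if doi_seen || seen_titles.contains(&title_key) {
            continue; // duplicate: drop it and keep the earlier entry
        }

        if let Some(d) = doi_key {
            seen_dois.insert(d);
        }
        seen_titles.insert(title_key);
        unique.push(paper);
    }
    unique
}
```

Under these assumptions, a list containing "Deep Learning" with a DOI, a second entry with the same DOI in different case, and a DOI-less "deep  learning!" collapses to the first entry only. A keep-last variant could iterate the input in reverse and reverse the result.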

§HTTP Client with Rate Limiting

The HTTP client provides built-in rate limiting using the governor crate. Each source can be configured with different rate limits via environment variables (e.g., SEMANTIC_SCHOLAR_RPM for requests per minute).
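
As a rough illustration of how a per-source requests-per-minute setting might be turned into a pacing interval, the sketch below reads a variable such as SEMANTIC_SCHOLAR_RPM and derives the minimum spacing between requests. It uses only the standard library rather than governor, and the helpers `parse_rpm`, `rpm_from_env`, and `min_interval`, as well as the fallback behavior, are assumptions made for this example rather than the crate's API.

```rust
use std::env;
use std::time::Duration;

/// Parse a requests-per-minute value, falling back to `default_rpm` when
/// the value is missing, not a number, or zero. (Hypothetical helper.)
pub fn parse_rpm(raw: Option<&str>, default_rpm: u32) -> u32 {
    raw.and_then(|v| v.trim().parse::<u32>().ok())
        .filter(|rpm| *rpm > 0)
        .unwrap_or(default_rpm)
}

/// Read the limit from an environment variable such as
/// `SEMANTIC_SCHOLAR_RPM`. The fallback value is the caller's choice.
pub fn rpm_from_env(var: &str, default_rpm: u32) -> u32 {
    parse_rpm(env::var(var).ok().as_deref(), default_rpm)
}

/// Minimum spacing between requests that keeps the average rate at or
/// below `rpm` requests per minute. A zero `rpm` is treated as 1.
pub fn min_interval(rpm: u32) -> Duration {
    Duration::from_secs(60) / rpm.max(1)
}
```

For example, a limit of 30 requests per minute corresponds to one request every two seconds. A token-bucket limiter such as governor additionally allows short bursts, which this fixed-interval sketch does not model.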

§Retry with Backoff

use research_master::utils::RetryConfig;

let config = RetryConfig::default();
assert_eq!(config.max_attempts, 3);
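
To make the backoff behavior concrete, here is a minimal blocking sketch of retry with exponential backoff. It is not the crate's with_retry, which is async: `SketchRetry`, `backoff_delay`, and `retry_blocking` are hypothetical, only `max_attempts` mirrors a field shown above, and the sketch retries on every error instead of distinguishing transient ones.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical retry settings for this sketch. Only `max_attempts`
/// corresponds to a field shown in the example above.
pub struct SketchRetry {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

/// Delay before the retry that follows failed attempt `attempt` (0-based):
/// `base_delay * 2^attempt`, capped at `max_delay`.
pub fn backoff_delay(cfg: &SketchRetry, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    cfg.base_delay.saturating_mul(factor).min(cfg.max_delay)
}

/// Call `op` until it succeeds or `max_attempts` calls have failed,
/// sleeping with exponential backoff between failures. `op` receives the
/// 0-based attempt number. The last error is returned on exhaustion.
pub fn retry_blocking<T, E>(
    cfg: &SketchRetry,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error.
            Err(err) if attempt + 1 >= cfg.max_attempts => return Err(err),
            Err(_) => {
                thread::sleep(backoff_delay(cfg, attempt));
                attempt += 1;
            }
        }
    }
}
```

With a base delay of 1 ms and a cap of 3 ms, the successive delays are 1 ms, 2 ms, then 3 ms for every later attempt. Production retry loops often add random jitter to the delay so that many clients do not retry in lockstep; that is omitted here for brevity.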

Structs§

CacheService
Cache service for storing and retrieving cached data
CacheStats
Statistics about the cache
CircuitBreaker
Thread-safe circuit breaker implementation
CircuitBreakerManager
Manager for multiple circuit breakers (one per source)
ConcurrentPaperStream
A channel-based concurrent stream for parallel source searches.
ExtractionInfo
Get the best available extraction method with metadata
FilterByYearStream
Stream filter for year range
HttpClient
Shared HTTP client with sensible defaults and rate limiting
ProgressReporter
Progress reporter with optional terminal output
RateLimitedRequestBuilder
Rate-limited request builder - compatible API with reqwest::RequestBuilder
ReleaseAsset
A single release asset
ReleaseInfo
GitHub release information
RetryConfig
Configuration for retry behavior
SharedProgress
Thread-safe progress tracker that can be shared across threads
SkipStream
Stream that skips items
TakeStream
Stream that limits the number of items

Enums§

CacheResult
Result of a cache lookup
CircuitResult
Result of a circuit breaker operation
CircuitState
Circuit breaker states
DuplicateStrategy
Strategy for handling duplicates
ExtractionMethod
Method used for PDF text extraction
InstallationMethod
Information about the current installation method
PdfExtractError
Errors that can occur during PDF extraction
RetryResult
Result of a retry operation
TransientError
Transient errors that should trigger a retry
ValidationError
Validation error types

Functions§

api_retry_config
Create a default retry configuration optimized for external APIs
cleanup_temp_files
Clean up temporary files
collect_papers
Collect all papers from a stream into a Vec.
compute_sha256
Compute SHA256 hash of a file
deduplicate_papers
Remove duplicate papers from a list
detect_installation
Detect how the tool was installed
download_and_extract_asset
Download and extract a release asset
extract_text
Extract text from a PDF file using the best available method.
extract_text_simple
Extract text from a PDF file (legacy interface, discards method info)
fast_deduplicate_papers
Fast hash-based deduplication for papers
fetch_and_verify_sha256
Fetch and verify SHA256 checksum for a file
fetch_latest_release
Fetch the latest release information from GitHub
fetch_sha256_signature
Fetch the GPG signature for SHA256SUMS.txt
filter_by_year
Create a stream that filters papers by year range.
find_asset_for_platform
Find the appropriate release asset for the current platform
find_duplicates
Find duplicate papers based on DOI, title similarity, and author+year
get_current_target
Get the target triple for the current platform
get_extraction_info
Get information about available PDF extraction methods
get_update_instructions
Get installation-specific update instructions
has_poppler
Check if poppler utilities are available
has_tesseract
Check if tesseract OCR is available
paper_stream
Create a stream that yields papers one at a time from paginated search results.
replace_binary
Replace the current binary with a new one
sanitize_filename
Sanitize a filename to prevent path traversal and other attacks
sanitize_paper_id
Validate a paper ID to prevent injection attacks
strict_rate_limit_retry_config
Create a retry configuration for sources with strict rate limits
validate_doi
Validate and sanitize a DOI
validate_url
Validate a URL to prevent injection and SSRF attacks
verify_gpg_signature
Verify the GPG signature of SHA256SUMS.txt. This requires the project maintainer’s public key to be in the system keyring. For CI/CD, set GPG_FINGERPRINT to the expected signer’s fingerprint.
verify_sha256
Verify downloaded file against expected SHA256 hash
with_retry
Execute an async operation with retry logic
with_retry_detailed
Execute an async operation with retry logic that returns RetryResult