Module fetch

doiget fetch <ref> subcommand.

Phase 1 scope:

  • arXiv refs: full end-to-end. PDF bytes are fetched via doiget_core::sources::arxiv::ArxivSource, the [doiget] extension table is populated with the resolved license, source, size, and fetched_at, and the result is written to the on-disk store as both the metadata TOML and the PDF.
  • DOI refs: Crossref metadata, Unpaywall license enrichment, and an OA PDF fetch when Unpaywall’s best_oa_location.url_for_pdf (or best_oa_location.url) resolves to a host on the synthetic "oa-publisher" allowlist (docs/REDIRECT_ALLOWLIST.md §3). The OA URL host check is an informed best effort: if the host is not on the allowlist, or the body fails the magic-byte check, the orchestrator logs a Fetch err row under source = "oa-publisher" and falls back to metadata-only success, because the metadata is still useful on its own.
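The DOI fallback rule above can be sketched as a small decision function. This is an illustration only, not the crate's implementation: the names OaOutcome, host_of, classify_oa_fetch, and EXAMPLE_ALLOWLIST are hypothetical, the allowlist entries are placeholders rather than the hosts in docs/REDIRECT_ALLOWLIST.md §3, and the magic-byte check is assumed to be the standard "%PDF-" file header.

```rust
/// Placeholder hosts for illustration; the real allowlist lives in
/// docs/REDIRECT_ALLOWLIST.md §3 and is not reproduced here.
const EXAMPLE_ALLOWLIST: &[&str] = &["oa.example.org", "pdfs.example.com"];

/// Hypothetical outcome of the OA PDF attempt for a DOI ref.
#[derive(Debug, PartialEq)]
enum OaOutcome {
    /// Host allowed and body looks like a PDF: store metadata plus PDF.
    PdfAccepted,
    /// Host rejected or body failed the check: a Fetch err row would be
    /// logged and the metadata-only result kept. Carries the reason.
    MetadataOnly(&'static str),
}

/// Extract the host from an absolute http(s) URL using only std.
/// Deliberately simplistic (no IPv6 literals, no percent-decoding).
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first path, query, or fragment delimiter.
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    // Drop any userinfo ("user@") and port (":443").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() { None } else { Some(host) }
}

/// Apply the two checks in order: allowlisted host, then PDF magic bytes.
/// In a real flow the host check would run before any request is sent.
fn classify_oa_fetch(url: &str, body: &[u8], allowlist: &[&str]) -> OaOutcome {
    let host_allowed = host_of(url)
        .map(|h| allowlist.iter().any(|a| a.eq_ignore_ascii_case(h)))
        .unwrap_or(false);
    if !host_allowed {
        return OaOutcome::MetadataOnly("host not on allowlist");
    }
    // Assumption: the magic-byte check is the standard PDF header.
    if !body.starts_with(b"%PDF-") {
        return OaOutcome::MetadataOnly("body failed magic-byte check");
    }
    OaOutcome::PdfAccepted
}
```

Both rejection paths return MetadataOnly rather than an error, which mirrors the described behaviour: a failed OA attempt downgrades the result instead of failing the whole fetch.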

§Provenance contract

Per docs/PROVENANCE_LOG.md §3, every invocation emits at least one SessionStart, one or more Fetch rows (one per source consulted), one StoreWrite row on success, and one SessionEnd. Each Fetch row is appended by the underlying Source impl; the orchestrator owns the session-bookend rows and the StoreWrite row.
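One way to read the contract is as a checkable property of the row sequence for a single invocation. The sketch below is an assumption-laden illustration: the Row enum and check_session are hypothetical and model only the row kind, not the real log schema, and the requirement that SessionStart comes first and SessionEnd last is inferred from the "session-bookend" wording rather than stated in docs/PROVENANCE_LOG.md §3 as quoted here.

```rust
/// Hypothetical model of the row kinds named in the contract.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Row {
    SessionStart,
    Fetch,
    StoreWrite,
    SessionEnd,
}

/// Check one invocation's rows against the contract as summarised above.
/// `succeeded` says whether the fetch ended in a successful store write.
fn check_session(rows: &[Row], succeeded: bool) -> Result<(), String> {
    // Assumption: the bookend rows open and close the sequence.
    if rows.first() != Some(&Row::SessionStart) {
        return Err("session must open with SessionStart".into());
    }
    if rows.last() != Some(&Row::SessionEnd) {
        return Err("session must close with SessionEnd".into());
    }
    let count = |kind: Row| rows.iter().filter(|r| **r == kind).count();
    // One Fetch row per source consulted, so at least one overall.
    if count(Row::Fetch) == 0 {
        return Err("expected at least one Fetch row".into());
    }
    // A successful invocation records exactly one StoreWrite.
    if succeeded && count(Row::StoreWrite) != 1 {
        return Err("expected exactly one StoreWrite row on success".into());
    }
    Ok(())
}
```

A check like this could run in an integration test over the parsed access log, which is one reason to keep process exit out of the command function.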

§Configuration surface

Hard-coded paths with env-var overrides; full config.toml plumbing arrives in a follow-up. See docs/CONFIG.md for the eventual surface.

| Env var | Default | Purpose |
|---|---|---|
| DOIGET_STORE_ROOT | $HOME/papers (or %USERPROFILE%\papers on Windows) | Filesystem store root |
| DOIGET_LOG_PATH | <config>/doiget/access.jsonl | Provenance log file |
| DOIGET_CONTACT_EMAIL | doiget@localhost | Polite-pool contact email (User-Agent and Crossref) |
| DOIGET_UNPAYWALL_EMAIL | (= contact email) | Unpaywall query-string email |
| DOIGET_ARXIV_BASE | https://arxiv.org | arXiv source base (test override) |
| DOIGET_CROSSREF_BASE | https://api.crossref.org | Crossref source base (test override) |
| DOIGET_UNPAYWALL_BASE | https://api.unpaywall.org/v2 | Unpaywall source base (test override) |
| DOIGET_OA_PUBLISHER_BASE | (production allowlist) | OA publisher host allowlist override (test override) |
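The override-with-default pattern in the table above, including the way DOIGET_UNPAYWALL_EMAIL falls back to the contact email, might look roughly like this. The helper names (resolve, resolve_emails, from_process_env) and the Emails struct are hypothetical; only the variable names and defaults come from the table. Taking the lookup as a parameter is a design choice of this sketch so that it can be exercised without touching the real process environment.

```rust
/// Resolve one setting: the override if the lookup finds one,
/// otherwise the hard-coded default.
fn resolve(lookup: &dyn Fn(&str) -> Option<String>, key: &str, default: &str) -> String {
    lookup(key).unwrap_or_else(|| default.to_string())
}

/// Hypothetical holder for the two email settings.
#[derive(Debug, PartialEq)]
struct Emails {
    contact: String,
    unpaywall: String,
}

/// The Unpaywall email defaults to whatever the contact email resolved to,
/// so it is resolved second.
fn resolve_emails(lookup: &dyn Fn(&str) -> Option<String>) -> Emails {
    let contact = resolve(lookup, "DOIGET_CONTACT_EMAIL", "doiget@localhost");
    let unpaywall = resolve(lookup, "DOIGET_UNPAYWALL_EMAIL", &contact);
    Emails { contact, unpaywall }
}

/// Lookup backed by the real process environment.
/// Non-Unicode values are treated as unset in this sketch.
fn from_process_env(key: &str) -> Option<String> {
    std::env::var(key).ok()
}
```

Calling resolve_emails(&from_process_env) would read the real environment, while a test can pass a closure that returns fixed values.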

Structs§

CliExit
Carries a docs/ERRORS.md §4 process exit code out of a CLI command to main, which owns the actual std::process::exit (calling it inside run_with_options would kill in-process integration tests). The human-readable error[CODE]: … line has ALREADY been written to stderr by render_fetch_error before this is constructed, so main must NOT print it again. Issue #119.
FetchPlan
Structured dry-run preview returned by --dry-run and the dry_run: true MCP variants. Wire shape matches ADR-0022 §1 and docs/MCP_TOOLS.md §10.
PdfSourcePlan
Per-PDF-source row inside FetchPlan::pdf_sources.
RateLimitBudget
Per-process rate-limit context surfaced alongside FetchPlan so an agent can predict the politeness ceiling without a separate doiget_capability_profile round-trip.
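The CliExit description above implies a particular division of labour between a command function and main. A minimal sketch of that pattern follows, under stated assumptions: the single code field, the run_command and exit_code_for helpers, the error text, and the exit code 2 are all hypothetical, since neither the real CliExit fields nor the docs/ERRORS.md §4 codes appear on this page.

```rust
use std::io::Write;

/// Hypothetical shape: the real CliExit fields are not shown on this page.
#[derive(Debug, PartialEq)]
struct CliExit {
    code: i32,
}

/// Stand-in for a CLI command such as run_with_options. It never calls
/// std::process::exit, so an in-process test can call it and keep running.
fn run_command(should_fail: bool, stderr: &mut dyn Write) -> Result<(), CliExit> {
    if should_fail {
        // Stand-in for render_fetch_error: the human-readable line is
        // written exactly once, before CliExit is constructed.
        let _ = writeln!(stderr, "error[EXAMPLE]: fetch failed");
        return Err(CliExit { code: 2 });
    }
    Ok(())
}

/// What main would do with the result: map it to an exit code and print
/// nothing, because the error line is already on stderr.
fn exit_code_for(result: Result<(), CliExit>) -> i32 {
    match result {
        Ok(()) => 0,
        Err(exit) => exit.code,
    }
}
```

In a binary, main would end with something like std::process::exit(exit_code_for(run_command(false, &mut std::io::stderr()))), keeping the only process-terminating call in one place.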

Functions§

build_dry_run_envelope
Build the dry-run envelope as a serde_json::Value, without writing anywhere. Used by both the CLI (which prints it to stdout) and the MCP tool wrapper (which routes the bytes via JSON-RPC). Wire shape:
build_fetch_plan
Build the dry-run preview (FetchPlan) for the given ref and store root, without contacting the network or filesystem.
emit_dry_run_plan_to_stdout
Serialize the dry-run envelope and write it to stdout. Used by the --dry-run flag on doiget fetch and doiget batch. The envelope shape matches ADR-0022 §1 / docs/MCP_TOOLS.md §10.
run_with_options
Run the doiget fetch <ref> subcommand.