archive-it-client
Rust client for Archive-It's partner API and WASAPI.
Inspiration and examples have been drawn from:
- https://github.com/sul-dlss/wasapi_client
- https://github.com/unt-libraries/py-wasapi-client
- https://github.com/WASAPI-Community/data-transfer-apis/tree/master/ait-specification
Overview
There are three clients, PublicClient, PartnerClient, and WasapiClient, each scoped to the endpoints available at its authentication level:
use ;
async
Timeouts and retries are configured via Config. The defaults are a 30s timeout and 3 attempts with exponential backoff starting at 250ms; requests are retried on 5xx responses, 429, timeouts, and connection errors:
use std::time::Duration;
use ;
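The default retry schedule described above can be sketched with the standard library alone. This is an illustration, not the crate's implementation: `RetryPolicy`, `delay_after`, and `is_retryable_status` are hypothetical names, and the sketch assumes the delay doubles after each failed attempt.

```rust
use std::time::Duration;

/// Hypothetical policy mirroring the documented defaults (3 attempts, 250ms base).
#[derive(Debug, Clone, Copy)]
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub base_delay: Duration,
}

impl Default for RetryPolicy {
    fn default() -> Self {
        Self {
            max_attempts: 3,
            base_delay: Duration::from_millis(250),
        }
    }
}

impl RetryPolicy {
    /// Delay to wait after failed attempt `attempt` (1-based), or `None`
    /// when no attempts remain. Assumption: the delay doubles each time.
    pub fn delay_after(&self, attempt: u32) -> Option<Duration> {
        if attempt == 0 || attempt >= self.max_attempts {
            return None;
        }
        // 250ms, 500ms, 1s, ...; saturate instead of overflowing.
        let factor = 2u32.checked_pow(attempt - 1).unwrap_or(u32::MAX);
        Some(self.base_delay.saturating_mul(factor))
    }
}

/// HTTP statuses listed as retryable: 429 and any 5xx.
pub fn is_retryable_status(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}
```

Under these assumptions the default policy waits 250ms after the first failure and 500ms after the second, then gives up after the third attempt.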
Pagination
There are two options: streaming, for transparent pagination, and per-page
methods, for manual control. Streaming hides the offset/cursor bookkeeping of
both API styles (partner API and WASAPI) behind a uniform Stream<Item = Result<T, Error>>.
Streaming
Each list endpoint has a streaming variant. Pages are fetched lazily as items are pulled; dropping the stream stops the traversal, so no further pages are requested:
use ;
use TryStreamExt; // for try_collect / try_next / try_filter / ...
async
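To make the lazy page fetching concrete, here is a minimal synchronous sketch using a plain Iterator as a stand-in for the async Stream. It is not the crate's code: `Paged` and the `fetch_page(offset, limit)` closure are hypothetical, and the sketch assumes offset-based pagination in which a short page marks the end.

```rust
use std::collections::VecDeque;

/// Lazily walks an offset-paginated source, one page per underlying call.
/// `fetch_page(offset, limit)` is a hypothetical stand-in for an HTTP request.
pub struct Paged<T, F>
where
    F: FnMut(usize, usize) -> Result<Vec<T>, String>,
{
    fetch_page: F,
    limit: usize,
    offset: usize,
    buffer: VecDeque<T>,
    done: bool,
}

impl<T, F> Paged<T, F>
where
    F: FnMut(usize, usize) -> Result<Vec<T>, String>,
{
    pub fn new(limit: usize, fetch_page: F) -> Self {
        Self {
            fetch_page,
            limit,
            offset: 0,
            buffer: VecDeque::new(),
            done: false,
        }
    }
}

impl<T, F> Iterator for Paged<T, F>
where
    F: FnMut(usize, usize) -> Result<Vec<T>, String>,
{
    type Item = Result<T, String>;

    fn next(&mut self) -> Option<Self::Item> {
        // Only hit the source when the local buffer has run dry.
        if self.buffer.is_empty() && !self.done {
            match (self.fetch_page)(self.offset, self.limit) {
                Ok(page) => {
                    // An empty or short page means nothing follows it.
                    self.done = page.is_empty() || page.len() < self.limit;
                    self.offset += page.len();
                    self.buffer.extend(page);
                }
                Err(e) => {
                    // Surface the error once, then end the iteration.
                    self.done = true;
                    return Some(Err(e));
                }
            }
        }
        self.buffer.pop_front().map(Ok)
    }
}
```

Taking only the first few items from such an iterator triggers a single page request; the remaining pages are requested only if the caller keeps pulling.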
The streaming methods are:
| Client | Method |
|---|---|
| PublicClient | accounts(), collections(account_id: Option<u64>) |
| PartnerClient | collections() |
| WasapiClient | webdata(query: WebdataQuery) |
Internally, PublicClient and PartnerClient streams fetch 100 items per
request. WasapiClient defaults to page_size=50 unless you override it in
WebdataQuery.
The streams expose the standard futures_core::Stream trait. To use the
extension methods shown above (try_collect, try_next, try_filter, take,
…) add a stream-utilities crate to your Cargo.toml:
[dependencies]
= "0.3" # or tokio-stream = "0.1"
Per-page
When you want to control page size or read pagination metadata
(WASAPI's count, next), use the lower-level methods:
use ;
async
Downloads
Two destinations: local filesystem and S3. Both skip the fetch when the
destination already matches — by sha1 when WASAPI supplied one, otherwise
by file size. Every download method returns a Stream of DownloadOutcome
events — Progress / Downloaded / Skipped / Failed per file — so
callers can render progress and react to per-file failures uniformly,
whether they're downloading one file or a whole collection.
use std::pin::pin;
use ;
use TryStreamExt;
async
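The skip rule (compare by sha1 when WASAPI supplied one, otherwise by file size) can be sketched as a pure function. The `Remote` and `Existing` types and `should_skip` are hypothetical and exist only for illustration. The sketch also assumes that a file with a known remote sha1 is re-fetched when the destination has no recorded hash; the crate may handle that case differently.

```rust
/// What is known about the remote file (hypothetical shape).
pub struct Remote<'a> {
    pub size: u64,
    pub sha1: Option<&'a str>,
}

/// What was found at the destination (hypothetical shape).
pub struct Existing<'a> {
    pub size: u64,
    pub sha1: Option<&'a str>,
}

/// Returns true when the destination already matches and the fetch can be skipped.
pub fn should_skip(remote: &Remote<'_>, existing: Option<&Existing<'_>>) -> bool {
    let existing = match existing {
        Some(e) => e,
        // Nothing at the destination yet: always download.
        None => return false,
    };
    match remote.sha1 {
        // A remote hash is available: only a matching hash counts.
        // Assumption: a destination without a recorded hash is re-fetched.
        Some(expected) => existing
            .sha1
            .map_or(false, |got| got.eq_ignore_ascii_case(expected)),
        // No remote hash: fall back to comparing sizes.
        None => existing.size == remote.size,
    }
}
```

Hex digests are compared case-insensitively here so that "ABCDEF" and "abcdef" count as the same hash.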
Local downloads use a <filename>.part sidecar so an interrupted run resumes
on the next invocation.
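One way a .part sidecar can support resumption is sketched below with std::fs only. The helper names (`part_path`, `resume_offset`, `append_chunk`, `finalize`) are hypothetical, and the sketch assumes the caller requests the remaining bytes starting at the returned offset; how the crate itself resumes is not shown here.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Sidecar path for `dest`: "<filename>.part" next to the final file.
pub fn part_path(dest: &Path) -> PathBuf {
    let mut name = dest.as_os_str().to_owned();
    name.push(".part");
    PathBuf::from(name)
}

/// Bytes already on disk from an earlier run; 0 when no sidecar exists.
pub fn resume_offset(dest: &Path) -> io::Result<u64> {
    match fs::metadata(part_path(dest)) {
        Ok(meta) => Ok(meta.len()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(0),
        Err(e) => Err(e),
    }
}

/// Append a downloaded chunk to the sidecar, creating it if needed.
pub fn append_chunk(dest: &Path, chunk: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(part_path(dest))?;
    file.write_all(chunk)
}

/// Promote the finished sidecar to its final name.
pub fn finalize(dest: &Path) -> io::Result<()> {
    fs::rename(part_path(dest), dest)
}
```

Because the final name only appears after the rename, a file at the destination path is never a partial download in this scheme.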
S3
WasapiClient::download_to_s3 and download_collection_to_s3 accept a
pre-built aws_sdk_s3::Client, so credentials, region, and HTTP wiring stay
under your control. Multipart upload is driven internally with server-side
crc64nvme as the at-rest integrity contract; sha1 (when supplied by WASAPI)
is recorded as user metadata so subsequent runs can skip on match.
The S3 principal needs s3:GetObject, s3:ListBucket, s3:PutObject, and
s3:AbortMultipartUpload on the target.
Examples
Runnable examples live under examples/:
# no auth — public partner registry
# partner API — needs ARCHIVE_IT_USERNAME/ARCHIVE_IT_PASSWORD set
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass
# wasapi — needs ARCHIVE_IT_USERNAME/ARCHIVE_IT_PASSWORD set
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass
# inventory every WARC exposed by WASAPI into ./warcs.csv
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass
# tally total WARC bytes across every collection on the account
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass
# download a collection to ./warcs (resumes via .part sidecars)
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass
# upload one WARC to S3 (uses standard AWS provider chain for creds)
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass S3_BUCKET=my-bucket \
The authenticated examples fail fast if ARCHIVE_IT_USERNAME or
ARCHIVE_IT_PASSWORD is unset.
Fixtures
JSON fixtures under fixtures/ are generated by fixtures.sh. It requires
ARCHIVE_IT_USERNAME and ARCHIVE_IT_PASSWORD (Archive-It partner
credentials) to be set:
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass