# archive-it-client
Rust client for Archive-It's partner API and WASAPI.
Inspiration and examples have been drawn from:
- <https://github.com/sul-dlss/wasapi_client>
- <https://github.com/unt-libraries/py-wasapi-client>
- <https://github.com/WASAPI-Community/data-transfer-apis/tree/master/ait-specification>
## Overview
There are three clients, each scoped to the endpoints available at its authentication level:
```rust,no_run
use archive_it_client::{PageOpts, PartnerClient, PublicClient, WasapiClient, WebdataQuery};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let user = "user";
    let pass = "pass";

    // public: no auth, partner registry + public collections
    let public = PublicClient::new()?;
    let accounts = public.list_accounts(PageOpts::default()).await?;
    let collection = public.get_collection(2135).await?;

    // partner: auth scopes every call to your own account
    let partner = PartnerClient::new(user, pass)?;
    let me = partner.my_account().await?;
    let mine = partner.list_collections(PageOpts::default()).await?;

    // wasapi: WARC manifests for a collection
    let wasapi = WasapiClient::new(user, pass)?;
    let query = WebdataQuery {
        collection: Some(4472),
        ..Default::default()
    };
    let page = wasapi.list_webdata(&query).await?;

    Ok(())
}
```
Timeouts and retries are configured via `Config`. By default, requests time out after 30s and are attempted up to 3 times with 250ms exponential backoff; 5xx, 429, timeout, and connection errors are retried:
```rust,no_run
use std::time::Duration;

use archive_it_client::{Config, PartnerClient};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let user = "user";
    let pass = "pass";

    let mut cfg = Config::api();
    cfg.timeout = Duration::from_secs(10);
    cfg.max_attempts = 5;
    let client = PartnerClient::with_config(user, pass, cfg)?;

    Ok(())
}
```
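To make the retry defaults concrete, here is a minimal sketch of the delay schedule that "250ms exponential backoff" would produce. It assumes the delay doubles after each failed attempt and that no jitter is applied; this README does not state the multiplier or jitter, and `backoff_delays` is a hypothetical helper, not part of the crate.

```rust
use std::time::Duration;

/// Hypothetical helper: the delays slept between attempts, assuming the
/// base delay doubles after every failed attempt (the crate's actual
/// multiplier and jitter are not documented in this README).
fn backoff_delays(base: Duration, max_attempts: u32) -> Vec<Duration> {
    // N attempts leave N - 1 gaps in which to sleep.
    (0..max_attempts.saturating_sub(1))
        .map(|i| base * 2u32.pow(i))
        .collect()
}
```

Under these assumptions, `backoff_delays(Duration::from_millis(250), 3)` yields two sleeps of 250ms and 500ms; raising `max_attempts` to 5 would extend the schedule with 1s and 2s.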
## Pagination
There are two options: streaming for transparent pagination, and per-page methods for manual
control. Streaming hides the bookkeeping for both API styles (offset-based `/api` and
cursor-based WASAPI) behind a uniform `Stream<Item = Result<T, Error>>`.
### Streaming
Each list endpoint has a streaming variant. Pages are fetched lazily as items
are pulled; dropping the stream stops mid-traversal:
```rust,no_run
use archive_it_client::{PartnerClient, PublicClient, WasapiClient, WebdataQuery};
use futures::TryStreamExt; // for try_collect / try_next / try_filter / ...

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let user = "user";
    let pass = "pass";

    let public = PublicClient::new()?;
    let partner = PartnerClient::new(user, pass)?;
    let wasapi = WasapiClient::new(user, pass)?;

    let all: Vec<_> = public.accounts().try_collect().await?;
    let mine: Vec<_> = partner.collections().try_collect().await?;

    let query = WebdataQuery {
        collection: Some(4472),
        ..Default::default()
    };
    let mut files = Box::pin(wasapi.webdata(query));
    while let Some(file) = files.try_next().await? {
        // process one file at a time
    }

    Ok(())
}
```
The streaming methods are:

| Client | Streaming methods |
| --- | --- |
| `PublicClient` | `accounts()`, `collections(account_id: Option<u64>)` |
| `PartnerClient` | `collections()` |
| `WasapiClient` | `webdata(query: WebdataQuery)` |
Internally, `PublicClient` and `PartnerClient` streams fetch 100 items per
request. `WasapiClient` defaults to `page_size=50` unless you override it in
`WebdataQuery`.
The streams expose the standard `futures_core::Stream` trait. To use the
extension methods shown above (`try_collect`, `try_next`, `try_filter`, `take`,
…) add a stream-utilities crate to your `Cargo.toml`:
```toml
[dependencies]
futures = "0.3" # or tokio-stream = "0.1"
```
### Per-page
When you want to control page size or read pagination metadata
(WASAPI's `count`, `next`), use the lower-level methods:
```rust,no_run
use archive_it_client::{PageOpts, PublicClient, WasapiClient, WebdataQuery};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let public = PublicClient::new()?;
    let wasapi = WasapiClient::new("user", "pass")?;

    // /api: caller passes limit/offset, gets a Vec
    let batch = public
        .list_accounts(PageOpts { limit: Some(50), offset: Some(0) })
        .await?;

    // wasapi: server-driven cursor; follow `next` until exhausted
    let query = WebdataQuery {
        collection: Some(4472),
        ..Default::default()
    };
    let mut page = wasapi.list_webdata(&query).await?;
    println!("{} files total", page.count);
    loop {
        for file in &page.files { /* ... */ }
        match wasapi.list_webdata_next(&page).await? {
            Some(next) => page = next,
            None => break,
        }
    }

    Ok(())
}
```
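Because the `/api` per-page methods return a bare `Vec`, the caller decides when to stop. The sketch below shows one way to drive a limit/offset loop, assuming the server returns fewer than `limit` items on the final page (this README does not confirm that). `fetch_all` and its `fetch_page` closure are hypothetical stand-ins, kept synchronous so the example depends only on the standard library; in real code the closure body would be an awaited call such as `list_accounts(PageOpts { limit, offset })`.

```rust
/// Hypothetical helper: walk a limit/offset API until a short page appears.
/// `fetch_page(limit, offset)` stands in for one per-page request.
fn fetch_all<T, E>(
    limit: usize,
    mut fetch_page: impl FnMut(usize, usize) -> Result<Vec<T>, E>,
) -> Result<Vec<T>, E> {
    let mut items = Vec::new();
    let mut offset = 0;
    loop {
        let page = fetch_page(limit, offset)?;
        let received = page.len();
        items.extend(page);
        // Assumption: an empty or short page means nothing is left.
        if received == 0 || received < limit {
            return Ok(items);
        }
        offset += received;
    }
}
```

The streaming variants already do this bookkeeping, so a loop like this is only worth writing when you need a non-default page size or want to stop at a specific offset.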
## Downloads
Two destinations: local filesystem and S3. Both skip the fetch when the
destination already matches — by sha1 when WASAPI supplied one, otherwise
by file size. Every download method returns a `Stream` of `DownloadOutcome`
events — `Progress` / `Downloaded` / `Skipped` / `Failed` per file, plus
`StreamFailed` for errors that occur before a file is available — so callers
can render progress and react to failures uniformly, whether they're
downloading one file or a whole collection. The `error` carried by `Failed`
and `StreamFailed` is an `http_ferry::Error` (re-exported as
`archive_it_client::http_ferry::Error`), not `archive_it_client::Error`.
```rust,no_run
use std::pin::pin;

use archive_it_client::{WasapiClient, WebdataQuery};
use futures::{StreamExt, TryStreamExt};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let wasapi = WasapiClient::new("user", "pass")?;

    // single file → ./out.warc.gz, with progress events
    let file = pin!(wasapi.webdata(WebdataQuery {
        collection: Some(4472),
        page_size: Some(1),
        ..Default::default()
    }))
    .try_next().await?.ok_or("empty")?;
    let mut single = pin!(wasapi.download(file, "./out.warc.gz"));
    while let Some(outcome) = single.next().await {
        println!("{outcome}");
    }

    // whole collection → ./warcs, also a stream of outcomes per file
    let query = WebdataQuery { collection: Some(4472), ..Default::default() };
    let mut stream = pin!(wasapi.download_collection(query, "./warcs"));
    while let Some(outcome) = stream.next().await {
        println!("{outcome}");
    }

    Ok(())
}
```
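The skip rule described at the top of this section (sha1 when WASAPI supplied one, otherwise file size) can be sketched as a small decision function. `FileInfo` and `should_skip` are hypothetical names for illustration, not the crate's API, and how the destination's digest is obtained is left outside the sketch. Comparing hex digests case-insensitively is also an assumption.

```rust
/// Hypothetical record of what is known about a file, either remotely
/// (from WASAPI) or at the destination.
struct FileInfo<'a> {
    size: u64,
    sha1: Option<&'a str>,
}

/// Hypothetical helper mirroring the documented rule: compare sha1 when
/// WASAPI supplied one, otherwise fall back to comparing sizes.
fn should_skip(remote: &FileInfo<'_>, existing: Option<&FileInfo<'_>>) -> bool {
    match (existing, remote.sha1) {
        // Nothing at the destination yet: always fetch.
        (None, _) => false,
        // Assumption: hex digests are compared case-insensitively.
        (Some(local), Some(expected)) => local
            .sha1
            .map_or(false, |actual| actual.eq_ignore_ascii_case(expected)),
        // No remote digest: size is the only signal available.
        (Some(local), None) => local.size == remote.size,
    }
}
```

Note that in this sketch a matching size is not enough once a remote sha1 exists: a destination file with no known digest is fetched again rather than trusted.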
Local downloads use a `<filename>.part` sidecar so an interrupted run resumes
on the next invocation.
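As an illustration of how a `.part` sidecar can enable resumption, the sketch below derives the sidecar path, reads how many bytes are already on disk, and builds the value of an HTTP `Range` header for the remainder. Resuming via `Range` requests is an assumption about the mechanism, and `part_path`, `resume_offset`, and `range_header` are hypothetical helpers rather than functions exported by this crate.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: `<filename>.part` next to the final destination.
fn part_path(dest: &Path) -> PathBuf {
    let mut name = dest.as_os_str().to_os_string();
    name.push(".part");
    PathBuf::from(name)
}

/// Bytes already written to the sidecar, or 0 when there is none.
fn resume_offset(dest: &Path) -> u64 {
    std::fs::metadata(part_path(dest))
        .map(|meta| meta.len())
        .unwrap_or(0)
}

/// Assumption: the remainder is requested with an HTTP `Range` header.
/// Returns `None` when the download starts from the beginning.
fn range_header(offset: u64) -> Option<String> {
    (offset > 0).then(|| format!("bytes={offset}-"))
}
```

In a scheme like this, the sidecar would be renamed to the final filename only after the last byte arrives, so a file at the destination path is always complete.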
### S3
`WasapiClient::download_to_s3` and `download_collection_to_s3` accept a
pre-built `aws_sdk_s3::Client`, so credentials, region, and HTTP wiring stay
under your control. Multipart upload is driven internally with server-side
crc64nvme as the at-rest integrity contract; sha1 (when supplied by WASAPI)
is recorded as user metadata so subsequent runs can skip on match.
The S3 principal needs `s3:GetObject`, `s3:ListBucket`, `s3:PutObject`, and
`s3:AbortMultipartUpload` on the target.
## Examples
Runnable examples live under `examples/`:
```bash
# no auth — public partner registry
cargo run --example public
# authenticated examples
export ARCHIVE_IT_USERNAME=user
export ARCHIVE_IT_PASSWORD=pass
# partner API
cargo run --example partner
# wasapi
cargo run --example wasapi
# inventory every WARC exposed by WASAPI into ./warcs.csv
cargo run --example warcs_inventory
# tally total WARC bytes across every collection on the account
cargo run --example count_bytes
# download a collection to ./warcs (resumes via .part sidecars)
cargo run --example download_collection
# upload one WARC to S3 (uses standard AWS provider chain for creds)
S3_BUCKET=my-bucket cargo run --example download_s3
```
The authenticated examples fail fast if `ARCHIVE_IT_USERNAME` or
`ARCHIVE_IT_PASSWORD` is unset.
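A fail-fast credential check of this kind can be sketched as follows. `credentials` is a hypothetical helper, not necessarily how the bundled examples are written; the lookup is injected so the logic can be exercised without touching the process environment.

```rust
/// Hypothetical helper: resolve both credentials up front and name the
/// first missing variable. Pass `|name| std::env::var(name).ok()` as
/// `lookup` to read the real environment.
fn credentials(
    lookup: impl Fn(&str) -> Option<String>,
) -> Result<(String, String), String> {
    let require = |name: &str| lookup(name).ok_or_else(|| format!("{name} is not set"));
    Ok((
        require("ARCHIVE_IT_USERNAME")?,
        require("ARCHIVE_IT_PASSWORD")?,
    ))
}
```

Calling `credentials(|name| std::env::var(name).ok())?` at the top of `main` surfaces a missing variable before any request is made.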
## Fixtures
JSON fixtures under `fixtures/` are generated by `fixtures.sh`. It requires
`ARCHIVE_IT_USERNAME` and `ARCHIVE_IT_PASSWORD` (Archive-It partner
credentials) to be set:
```bash
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass ./fixtures.sh
```