http-ferry
A resumable, checksum-verified, streaming byte-transfer engine: pull bytes from
an HTTP source and push them into a pluggable sink, hashing as you go. One sink
ships in the box (local file); another (S3 multipart upload) lives behind the
s3 feature. The caller's own item type rides through untouched, the way
reqwest hands your response back to you.
The crate knows nothing about any specific service — you bring the URLs, the
auth, and (optionally) your own sink. It was extracted from archive-it-client,
which uses it to download WASAPI WARCs to disk or S3.
The name: the HTTP side is the source and Sink is the destination, so the
crate ferries bytes from one to the other.
What it does
- Resumable downloads over HTTP range requests, including the awkward case
  where a server ignores `Range` and replies `200` instead of `206` (the sink
  is restarted and the byte counter resets).
- Integrity verification with a pluggable `Checksum` (sha1 / md5). The engine
  hashes the stream with the matching algorithm and fails on mismatch.
- Skip-on-exists: a sink can report the destination already holds the file
  (by checksum, or by size when no checksum is supplied) and the engine yields
  `Skipped` without fetching a byte.
- Progress + per-item error isolation: a `Stream` of `Outcome` events; one bad
  file in a batch yields `Failed` and the stream continues.
- Retry with exponential backoff, both at request setup and mid-stream.
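The resume behavior in the first bullet can be pictured as a small decision on the response status. The sketch below is illustrative only: `ResumeAction`, `range_header`, and `resume_action` are hypothetical names, not part of this crate's API, and the real engine may handle more status codes than shown here.

```rust
/// What to do with the response to a (possibly resumed) request.
/// Hypothetical type for illustration; not the crate's actual API.
#[derive(Debug, PartialEq, Eq)]
enum ResumeAction {
    /// Keep the bytes already written and append starting at `from`.
    Append { from: u64 },
    /// The server ignored `Range`: restart the sink and reset the byte counter.
    Restart,
    /// Any other status is treated as an error in this sketch.
    Fail(u16),
}

/// `Range` header value for resuming at `offset`, or `None` for a fresh download.
fn range_header(offset: u64) -> Option<String> {
    if offset == 0 {
        None
    } else {
        Some(format!("bytes={}-", offset))
    }
}

/// Decide how to proceed given how many bytes we already hold and the HTTP status.
fn resume_action(offset: u64, status: u16) -> ResumeAction {
    match status {
        // 206 Partial Content: the server honored the range request.
        206 if offset > 0 => ResumeAction::Append { from: offset },
        // 200 OK although we asked for a range: the body starts at byte zero.
        200 if offset > 0 => ResumeAction::Restart,
        // 200 OK on a fresh download: write from the beginning.
        200 => ResumeAction::Append { from: 0 },
        // Everything else, including an unrequested 206, is a failure here.
        other => ResumeAction::Fail(other),
    }
}
```

With 1024 bytes already on disk, a `206` continues at byte 1024, while a `200` means the partial data must be discarded before the full body is written.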
Core concepts
| Type | Role |
|---|---|
| `Downloader` | Owns the HTTP client, retry policy, and optional request customization (where you inject auth). |
| `Source` | Trait an item implements to describe itself to the engine: name, size, checksum. The item rides through untouched. |
| `Download` | A ready-made `Source`: source URL, destination name, size, and optional checksum. Use it when you don't have a domain type. |
| `Target<'a>` | The borrowed view a sink sees: name, size, checksum. No URL, no caller item; sinks are domain-agnostic. |
| `Sink` / `SinkFactory` | Where bytes go. Implement these to add a destination (disk, S3, GCS, a database BLOB…). |
| `Outcome<M, L>` | Per-item result stream: `Downloaded` / `Skipped` / `Progress` / `Failed` / `StreamFailed`. |
| `drive_downloads(..)` | Simple driver for `Download` items whose source URLs are already known. |
| `drive(..)` | General driver. Pulls `Source` items, resolves each source URL, builds a sink, runs the download. |
The engine reads size, checksum, and name from your item through the
Source trait; the item value itself is cloned into Progress events and
handed back in terminal Outcomes. name() also serves as the item's label in
Outcome's Display. Download is a built-in Source for the common case
where each item already carries its own URL.
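To make the contract above concrete, here is a minimal sketch of a domain type describing itself through a `Source`-like trait. The trait and enum below are stand-ins written for this example: the method signatures, the `Checksum` variant shapes, and the `WarcFile` type are all assumptions, and the crate's real definitions may differ.

```rust
/// Checksum kinds the README names; the variant shapes are an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Checksum {
    Sha1(String),
    Md5(String),
}

/// Hypothetical stand-in for the crate's `Source` trait.
trait Source {
    fn name(&self) -> &str;
    fn size(&self) -> u64;
    fn checksum(&self) -> Option<&Checksum>;
}

/// A caller-defined domain type (hypothetical).
#[derive(Debug, Clone)]
struct WarcFile {
    filename: String,
    bytes: u64,
    sha1: Option<Checksum>,
    /// Domain data the engine never inspects; it rides through untouched.
    collection_id: u32,
}

impl Source for WarcFile {
    fn name(&self) -> &str {
        &self.filename
    }
    fn size(&self) -> u64 {
        self.bytes
    }
    fn checksum(&self) -> Option<&Checksum> {
        self.sha1.as_ref()
    }
}

/// What an engine could render for an item using only the trait methods.
fn describe<S: Source>(item: &S) -> String {
    format!("{} ({} bytes)", item.name(), item.size())
}
```

Because the engine would only ever call the trait methods, fields such as `collection_id` stay available to the caller when the item comes back in an outcome.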
Cargo features
s3 (off by default): the S3 multipart-upload sink in the s3 module. It pulls in the whole aws-sdk-s3 dependency tree, so it is opt-in and consumers who only download to disk don't pay for it. Enable with features = ["s3"].
Usage
Wire a Downloader, hand drive_downloads a stream of Downloads and a
SinkFactory:
use Duration;
use StreamExt;
use ;
// 1. An HTTP layer. Request customization is optional; add it when you need
// bearer tokens, basic auth, signed headers, etc.
let downloader = builder
.max_attempts
.backoff
.build;
// 2. A stream of work items. `Download` is enough when each URL is known.
let items = iter;
// 3. Drive it: write into ./out via the local sink. `create_all` makes the
// destination dir up front (it must already exist).
let mut out = pin!;
while let Some = out.next.await
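The `max_attempts` and `backoff` settings in the builder above suggest a retry schedule. As a rough illustration of exponential backoff in general, the sketch below doubles a base delay per attempt and caps it. The function names, the cap, and the absence of jitter are assumptions of this example; the README does not say how the crate computes its delays.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (1-based): `base` doubled for each
/// earlier attempt, capped at `max`. Hypothetical helper, not the crate's API.
fn backoff_delay(base: Duration, attempt: u32, max: Duration) -> Duration {
    // Exponent is attempt - 1, clamped so the shift below cannot overflow u32.
    let exp = attempt.saturating_sub(1).min(31);
    let factor = 1u32 << exp;
    // If the multiplication overflows Duration, fall back to the cap.
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Delays between attempts for a total of `max_attempts` tries
/// (one fewer delay than attempts).
fn schedule(base: Duration, max_attempts: u32, max: Duration) -> Vec<Duration> {
    (1..max_attempts)
        .map(|n| backoff_delay(base, n, max))
        .collect()
}
```

With a 500 ms base, four attempts, and a 5 s cap, the waits between attempts are 500 ms, 1 s, and 2 s.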
If requests need auth or other per-request setup, add a customizer:
let token = var?;
let downloader = builder
.customize_request
.build;
For a complete one-file local-sink example with progress validation and required checksum verification:
Adding a destination
Implement Sink (per-file state machine) and SinkFactory (builds one sink
per item). The engine calls prepare once, then write_chunk repeatedly, then
finalize — or restart if the server forced a fresh download mid-stream.
use ;
Location types implement DownloadLocation so the engine can render where a
file went. The item type M already implements Source, whose name()
supplies the filename used in log lines, so Outcome's Display works for
free.
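The call order described in this section (prepare, then write_chunk, then finalize, with restart when the server forces a fresh download) can be sketched with a toy in-memory destination. Everything here is a simplified assumption: the trait below is synchronous and self-contained, whereas the crate's real `Sink` may well be async and take different arguments, and `MemorySink` and `run` exist only for this example.

```rust
use std::io;

/// Hypothetical, synchronous stand-in for the crate's `Sink` trait.
trait Sink {
    /// Where the finished file ended up.
    type Location;
    /// Called once before any bytes arrive.
    fn prepare(&mut self) -> io::Result<()>;
    /// Called repeatedly, once per chunk of the response body.
    fn write_chunk(&mut self, chunk: &[u8]) -> io::Result<()>;
    /// Called when the download starts over: discard partial data.
    fn restart(&mut self) -> io::Result<()>;
    /// Called once at the end; consumes the sink and reports the location.
    fn finalize(self) -> io::Result<Self::Location>;
}

/// Toy destination that keeps the bytes in memory.
struct MemorySink {
    name: String,
    buf: Vec<u8>,
    prepared: bool,
}

impl MemorySink {
    fn new(name: &str) -> Self {
        MemorySink { name: name.to_string(), buf: Vec::new(), prepared: false }
    }
}

impl Sink for MemorySink {
    type Location = (String, Vec<u8>);

    fn prepare(&mut self) -> io::Result<()> {
        self.prepared = true;
        Ok(())
    }

    fn write_chunk(&mut self, chunk: &[u8]) -> io::Result<()> {
        if !self.prepared {
            return Err(io::Error::new(io::ErrorKind::Other, "write_chunk before prepare"));
        }
        self.buf.extend_from_slice(chunk);
        Ok(())
    }

    fn restart(&mut self) -> io::Result<()> {
        self.buf.clear();
        Ok(())
    }

    fn finalize(self) -> io::Result<Self::Location> {
        Ok((self.name, self.buf))
    }
}

/// Mimics the engine's call order. `restart_after` simulates a server forcing
/// a fresh download after that many chunks have already been written.
fn run<S: Sink>(mut sink: S, chunks: &[&[u8]], restart_after: Option<usize>) -> io::Result<S::Location> {
    sink.prepare()?;
    if let Some(n) = restart_after {
        for chunk in &chunks[..n.min(chunks.len())] {
            sink.write_chunk(chunk)?;
        }
        sink.restart()?;
    }
    for chunk in chunks {
        sink.write_chunk(chunk)?;
    }
    sink.finalize()
}
```

The point of the restart hook is that the final contents are the same whether or not a partial attempt preceded them, so a sink must be able to throw away whatever it has buffered or written so far.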
Design notes
- Auth is a closure, not a credential type. `Downloader` never names a
  service's credential model. The builder works without customization, and
  advanced consumers can supply a `Fn(RequestBuilder) -> RequestBuilder` for
  bearer tokens, basic auth, signed headers, or anything else.
- Known URLs use `Download` + `drive_downloads`. This is the common path when
  each item already has a source URL.
- Domain types implement `Source` and use `drive`. Your item describes itself
  (name/size/checksum) and `drive` resolves its URL through a closure.
  Resolution can fail per item (yielding a non-fatal `Failed`) without tearing
  down the stream; a failure pulling the next item from the source yields a
  fatal `StreamFailed`.
- Caller errors flow in through `Error::Source`. The resolver and the input
  item stream produce the caller's error type. The engine type-erases them
  through `Source(Box<dyn Error + Send + Sync>)`, so it never needs to know a
  consumer's domain errors; callers recover the original by `downcast`.
- No auto-abort of interrupted uploads. Rust has no `AsyncDrop`, so a sink
  that leaves server-side state (e.g. an S3 multipart upload) documents how to
  garbage-collect it rather than attempting brittle cleanup on drop. The S3
  sink also defers `CreateMultipartUpload` to the first byte, so a source
  error before any data arrives leaves nothing behind.
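The type-erasure note above relies on a standard-library pattern: box the caller's error as `Box<dyn Error + Send + Sync>`, then recover the concrete type with `downcast_ref`. The sketch below shows that pattern in isolation. The `Error` enum here is a hypothetical stand-in with an invented `ChecksumMismatch` variant, and `UrlLookupFailed`, `from_caller`, and `recover` are names made up for this example.

```rust
use std::error::Error as StdError;
use std::fmt;

/// A caller's domain error (hypothetical).
#[derive(Debug, PartialEq, Eq)]
struct UrlLookupFailed {
    item: String,
}

impl fmt::Display for UrlLookupFailed {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "no URL for {}", self.item)
    }
}

impl StdError for UrlLookupFailed {}

/// Stand-in for the engine's error enum; only `Source` mirrors the README.
#[derive(Debug)]
enum Error {
    /// Type-erased caller error.
    Source(Box<dyn StdError + Send + Sync>),
    /// Invented variant so the enum has a non-caller case.
    ChecksumMismatch,
}

/// Erase any caller error into the engine's error type.
fn from_caller<E>(e: E) -> Error
where
    E: StdError + Send + Sync + 'static,
{
    Error::Source(Box::new(e))
}

/// Recover the caller's concrete error, if that is what was boxed.
fn recover(err: &Error) -> Option<&UrlLookupFailed> {
    match err {
        Error::Source(inner) => inner.downcast_ref::<UrlLookupFailed>(),
        _ => None,
    }
}
```

A downcast to the wrong type simply returns `None`, so callers can probe for each domain error they care about without the engine knowing any of them.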