Expand description
§http-ferry
A resumable, checksum-verified, streaming byte-transfer engine: pull bytes from
an HTTP source and push them into a pluggable sink, hashing as you go. One sink
ships in the box (local file); another (S3 multipart upload) lives behind the
s3 feature. The caller’s own item type rides through untouched, the way
reqwest hands your response back to you.
The crate knows nothing about any specific service — you bring the URLs, the
auth, and (optionally) your own sink. It was extracted from archive-it-client,
which uses it to download WASAPI WARCs to disk or S3.
The name: the HTTP side is the source and Sink is the destination, so the
crate ferries bytes from one to the other.
§What it does
- Resumable downloads over HTTP range requests, including the awkward case
where a server ignores
Rangeand replies200instead of206(the sink is restarted and the byte counter resets). - Integrity verification with a pluggable
Checksum(sha1 / md5). The engine hashes the stream with the matching algorithm and fails on mismatch. - Skip-on-exists: a sink can report the destination already holds the file
(by checksum, or by size when no checksum is supplied) and the engine yields
Skippedwithout fetching a byte. - Progress + per-item error isolation: a
StreamofOutcomeevents; one bad file in a batch yieldsFailedand the stream continues. - Retry with exponential backoff, both at request setup and mid-stream.
§Core concepts
| Type | Role |
|---|---|
Downloader | Owns the HTTP client, retry policy, and a request-customization hook (where you inject auth). |
Transfer<M> | One unit of work: size, optional checksum, destination name, and your opaque meta. |
Target<'a> | The borrowed view a sink sees: name, size, checksum. No URL, no meta — sinks are domain-agnostic. |
Sink / SinkFactory | Where bytes go. Implement these to add a destination (disk, S3, GCS, a database BLOB…). |
Outcome<M, L> | Per-item result stream: Downloaded / Skipped / Progress / Failed / StreamFailed. |
drive(..) | The one driver. Pulls Transfers, resolves each source URL, builds a sink, runs the download. |
The engine reads only three things off each item — size, checksum, name —
so your rich type (M) is never inspected; it is cloned into Progress events
and handed back in the terminal Outcome.
§Cargo features
s3(off by default) — the S3 multipart-upload sink in thes3module. It pulls in the wholeaws-sdk-s3dependency tree, so consumers who only download to disk don’t pay for it. Enable withfeatures = ["s3"].
§Usage
Wire a Downloader, hand drive a stream of Transfers, a closure that
resolves each item’s source URL, and a SinkFactory:
use std::time::Duration;
use futures_util::StreamExt;
use http_ferry::{Checksum, Downloader, Outcome, Transfer, local::LocalDir};
// 1. An HTTP layer. The `customize` closure is for your auth — inject a bearer
// token, basic auth, signed headers, or nothing.
let token = std::env::var("TOKEN")?;
let downloader = Downloader::new(
reqwest::Client::builder().build()?,
/* max_attempts */ 3,
/* backoff */ Duration::from_millis(250),
move |req| req.bearer_auth(&token),
);
// 2. A stream of work items. `meta` is whatever you want back in the outcome.
let items = futures_util::stream::iter(vec![Ok(Transfer {
size: 1_048_576,
checksum: Some(Checksum::Sha1("da39a3ee…".into())),
name: "report.bin".into(),
meta: (),
})]);
// 3. Drive it: resolve each item's URL, write into ./out via the local sink.
// `create_all` makes the destination dir up front (it must already exist).
let mut out = std::pin::pin!(http_ferry::drive(
&downloader,
items,
|t: &Transfer<()>| Ok(format!("https://example.com/files/{}", t.name).parse()?),
LocalDir::create_all("./out")?,
));
while let Some(outcome) = out.next().await {
match outcome {
Outcome::Downloaded { location, verified, .. } => {
println!("ok {} (verified={verified})", location.display());
}
Outcome::Progress { received, total, .. } => { /* update a bar */ }
Outcome::Skipped { .. } => {}
Outcome::Failed { error, .. } => eprintln!("file failed: {error}"),
Outcome::StreamFailed { error } => eprintln!("fatal: {error}"),
}
}§Adding a destination
Implement Sink (per-file state machine) and SinkFactory (builds one sink
per item). The engine calls prepare once, then write_chunk repeatedly, then
finalize — or restart if the server forced a fresh download mid-stream.
use http_ferry::{Error, Hasher, Prepared, Sink, Target};
struct MemSink { name: String, buf: Vec<u8> }
impl Sink for MemSink {
type Location = String; // identifies where the bytes landed
async fn prepare(&mut self, target: Target<'_>) -> Result<Prepared<String>, Error> {
// Inspect target.checksum / target.size to decide skip-vs-fetch.
// Return a `Hasher` matching the expected checksum so resumed
// downloads keep hashing from where they left off.
Ok(Prepared::Resume { received: 0, partial: Hasher::for_checksum(target.checksum) })
}
async fn write_chunk(&mut self, chunk: &[u8]) -> Result<(), Error> {
self.buf.extend_from_slice(chunk);
Ok(())
}
async fn restart(&mut self) -> Result<(), Error> { self.buf.clear(); Ok(()) }
async fn finalize(self) -> Result<String, Error> { Ok(self.name) }
}Location types implement DownloadLocation so the engine can render where a
file went. To get Display on the outcomes, implement Label on your meta
type M (it supplies the filename used in log lines).
§Design notes
- Auth is a closure, not a credential type.
Downloadernever names “basic auth” or “bearer token” — the consumer supplies aFn(RequestBuilder) -> RequestBuilder. This keeps the engine free of any service’s auth model. - URL resolution is a per-item closure passed to
drive. Resolution can fail per item (yielding a non-fatalFailed) without tearing down the stream; a failure pulling the next item from the source yields a fatalStreamFailed. - Caller errors flow in through
Error::Source. The resolver and the input item stream produce the caller’s error type. The engine type-erases them throughSource(Box<dyn Error + Send + Sync>), so it never needs to know a consumer’s domain errors; callers recover the original bydowncast. - No auto-abort of interrupted uploads. Rust has no
AsyncDrop, so a sink that leaves server-side state (e.g. an S3 multipart upload) documents how to garbage-collect it rather than attempting brittle cleanup on drop. The S3 sink also defersCreateMultipartUploadto the first byte, so a source error before any data arrives leaves nothing behind.
Modules§
Structs§
- Downloader
- HTTP client for resumable range fetches. The
customizeclosure is the auth seam: it is applied to every request, so callers inject basic auth, a bearer token, signed headers, or nothing — the engine stays auth-agnostic. - Target
- Borrowed view of a
Transferhanded to sinks at prepare time. The source URL is resolved by the engine andmetais the caller’s concern, so neither appears here — sinks are domain-agnostic. - Transfer
- One unit of work for the engine.
metais the caller’s own item type, carried opaquely and handed straight back viaOutcome; the engine reads onlysize,checksum, andname.
Enums§
- Checksum
- Expected integrity hash supplied by the caller for a transfer. The engine hashes the byte stream with the matching algorithm and verifies the result; sinks use it for skip-on-match decisions.
- Error
- Errors produced by the transfer engine.
- Hasher
- Streaming hasher selected from the caller’s expected
Checksum.Nonemeans no checksum was supplied: the engine still counts bytes but reportsverified: false. - Outcome
- Per-item outcome of a transfer stream, generic over the caller’s item type
M.Failedcarries per-item errors so a single bad item in a batch doesn’t tear down the whole stream.StreamFailedcarries errors that happen before an item is available, such as a failed listing request or destination preflight. - Prepared
Traits§
- Download
Location - Renders a transfer destination for log lines.
- Label
- Short label for a transfer item (e.g. a filename), used by
Outcome’sDisplayimpl to render log lines. Implement it on yourmetatypeMto getDisplayfor the outcomes carrying it. - Sink
- Sink
Factory - Builds a per-item
Sink. One factory drives a whole stream of items (singular call sites pass a one-element stream).
Functions§
- drive
- One driver for every download path. Pulls items from the input stream,
resolves each item’s source URL, asks the factory for a per-item sink, and
runs
run_download. Per-item errors (url resolution, sink build, transport failure) yieldFailedand the loop continues to the next item — a one-element input stream therefore yields exactly one terminal outcome. - run_
download - Streams one item’s download. Only emits the happy-path
Outcomevariants (Progress,Skipped,Downloaded); per-item faults bubble out asErranddriveturns them intoFailed.StreamFailedis never produced here — it’s reserved for pre-item errors at thedrivelayer.