Crate http_ferry


§http-ferry

A resumable, checksum-verified, streaming byte-transfer engine: pull bytes from an HTTP source and push them into a pluggable sink, hashing as you go. One sink ships in the box (local file); another (S3 multipart upload) lives behind the s3 feature. The caller’s own item type rides through untouched, the way reqwest hands your response back to you.

The crate knows nothing about any specific service — you bring the URLs, the auth, and (optionally) your own sink. It was extracted from archive-it-client, which uses it to download WASAPI WARCs to disk or S3.

The name: the HTTP side is the source and Sink is the destination, so the crate ferries bytes from one to the other.

§What it does

  • Resumable downloads over HTTP range requests, including the awkward case where a server ignores Range and replies 200 instead of 206 (the sink is restarted and the byte counter resets).
  • Integrity verification with a pluggable Checksum (sha1 / md5). The engine hashes the stream with the matching algorithm and fails on mismatch.
  • Skip-on-exists: a sink can report the destination already holds the file (by checksum, or by size when no checksum is supplied) and the engine yields Skipped without fetching a byte.
  • Progress + per-item error isolation: a Stream of Outcome events; one bad file in a batch yields Failed and the stream continues.
  • Retry with exponential backoff, both at request setup and mid-stream.
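The Range fallback in the first bullet can be pictured as a small decision on the response status. The sketch below is a std-only illustration, not the crate's code: `ResumeAction`, `plan_resume`, and `counter_after` are hypothetical names, and treating every other status as a failure is an assumption of the sketch.

```rust
/// What a transfer loop might do with the response to a ranged request.
/// Hypothetical type for illustration; not part of http_ferry's API.
#[derive(Debug, PartialEq)]
enum ResumeAction {
    /// 206: the server honored the range, keep the bytes already received.
    Append { from: u64 },
    /// 200: the server ignored the range and is sending the whole body again.
    Restart,
    /// Any other status is treated as an error by this sketch.
    Fail(u16),
}

/// Decide how to continue after asking for `bytes={received}-`.
fn plan_resume(status: u16, received: u64) -> ResumeAction {
    match status {
        206 => ResumeAction::Append { from: received },
        200 => ResumeAction::Restart,
        other => ResumeAction::Fail(other),
    }
}

/// Byte counter after applying the action and then writing `chunk_len` bytes.
fn counter_after(action: &ResumeAction, chunk_len: u64) -> Option<u64> {
    match action {
        ResumeAction::Append { from } => Some(from + chunk_len),
        // The counter resets to zero before the first chunk is counted.
        ResumeAction::Restart => Some(chunk_len),
        ResumeAction::Fail(_) => None,
    }
}
```

With 512 bytes already on disk, a 206 continues from 512, while a 200 discards them and counts from zero again.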

§Core concepts

Type | Role
Downloader | Owns the HTTP client, retry policy, and optional request customization (where you inject auth).
Source | Trait an item implements to describe itself to the engine: name, size, checksum. The item rides through untouched.
Download | A ready-made Source: source url, destination name, size, and optional checksum. Use it when you don’t have a domain type.
Target<'a> | The borrowed view a sink sees: name, size, checksum. No URL, no caller item, so sinks are domain-agnostic.
Sink / SinkFactory | Where bytes go. Implement these to add a destination (disk, S3, GCS, a database BLOB…).
Outcome<M, L> | Per-item result stream: Downloaded / Skipped / Progress / Failed / StreamFailed.
drive_downloads(..) | Simple driver for Download items whose source URLs are already known.
drive(..) | General driver. Pulls Source items, resolves each source URL, builds a sink, runs the download.

The engine reads size, checksum, and name from your item through the Source trait; the item value itself is cloned into Progress events and handed back in terminal Outcomes. name() also serves as the item’s label in Outcome’s Display. Download is a built-in Source for the common case where each item already carries its own URL.
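To make "the item rides through untouched" concrete, here is a minimal sketch of a domain type describing itself through a Source-like trait. Everything in it is a local stand-in: the `Checksum` enum and `Source` trait only mirror the names above (the real method signatures are an assumption), and `WarcFile`, `collection_id`, and `describe` are hypothetical.

```rust
/// Stand-in for the crate's `Checksum`; only the two algorithms named above.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Checksum {
    Sha1(String),
    Md5(String),
}

/// Stand-in for the crate's `Source` trait: name, size, checksum.
/// The real signatures may differ.
trait Source {
    fn name(&self) -> &str;
    fn size(&self) -> u64;
    fn checksum(&self) -> Option<&Checksum>;
}

/// A hypothetical domain type carrying data the engine never reads.
#[derive(Debug, Clone, PartialEq)]
struct WarcFile {
    filename: String,
    bytes: u64,
    sha1: Option<Checksum>,
    collection_id: u32,
}

impl Source for WarcFile {
    fn name(&self) -> &str {
        &self.filename
    }
    fn size(&self) -> u64 {
        self.bytes
    }
    fn checksum(&self) -> Option<&Checksum> {
        self.sha1.as_ref()
    }
}

/// Builds a log label from the trait alone; `collection_id` is never touched.
fn describe<S: Source>(item: &S) -> String {
    let kind = if item.checksum().is_some() { "checksummed" } else { "unverified" };
    format!("{} ({} bytes, {})", item.name(), item.size(), kind)
}
```

Because the engine-side code sees only the trait, extra fields such as `collection_id` survive the round trip and are still there when the item comes back in an outcome.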

§Cargo features

  • s3 (off by default) — the S3 multipart-upload sink in the s3 module. It pulls in the whole aws-sdk-s3 dependency tree, so consumers who only download to disk don’t pay for it. Enable with features = ["s3"].

§Usage

Wire a Downloader, hand drive_downloads a stream of Downloads and a SinkFactory:

use std::time::Duration;
use futures_util::StreamExt;
use http_ferry::{Checksum, Download, Downloader, Outcome, local::LocalDir};

// 1. An HTTP layer. Request customization is optional; add it when you need
//    bearer tokens, basic auth, signed headers, etc.
let downloader = Downloader::builder(reqwest::Client::builder().build()?)
    .max_attempts(3)
    .backoff(Duration::from_millis(250))
    .build();

// 2. A stream of work items. `Download` is enough when each URL is known.
let items = futures_util::stream::iter(vec![Ok(Download {
    url: "https://example.com/files/report.bin".parse()?,
    size: 1_048_576,
    checksum: Some(Checksum::Sha1("da39a3ee…".into())),
    name: "report.bin".into(),
})]);

// 3. Drive it: write into ./out via the local sink. `create_all` creates the
//    destination dir up front, so it need not already exist.
let mut out = std::pin::pin!(http_ferry::drive_downloads(
    &downloader,
    items,
    LocalDir::create_all("./out")?,
));

while let Some(outcome) = out.next().await {
    match outcome {
        Outcome::Downloaded { location, verified, .. } => {
            println!("ok {} (verified={verified})", location.display());
        }
        Outcome::Progress { received, total, .. } => { /* update a bar */ }
        Outcome::Skipped { .. } => {}
        Outcome::Failed { error, .. } => eprintln!("file failed: {error}"),
        Outcome::StreamFailed { error } => eprintln!("fatal: {error}"),
    }
}
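The `max_attempts(3)` and `backoff(Duration::from_millis(250))` settings in step 1 imply a wait schedule between attempts. The sketch below shows one plausible schedule, assuming the base delay doubles after each failed attempt; the crate's actual multiplier and any jitter are not stated here, and `backoff_delays` is a hypothetical helper.

```rust
use std::time::Duration;

/// Delays slept between attempts, assuming the base delay doubles each time.
/// With `max_attempts` tries there are at most `max_attempts - 1` waits.
fn backoff_delays(base: Duration, max_attempts: u32) -> Vec<Duration> {
    (0..max_attempts.saturating_sub(1))
        // Saturating arithmetic keeps very large attempt counts from overflowing.
        .map(|i| base.saturating_mul(2u32.saturating_pow(i)))
        .collect()
}
```

Under that assumption, three attempts with a 250 ms base wait 250 ms and then 500 ms before giving up.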

If requests need auth or other per-request setup, add a customizer:

let token = std::env::var("TOKEN")?;
let downloader = Downloader::builder(reqwest::Client::builder().build()?)
    .customize_request(move |req| req.bearer_auth(&token))
    .build();
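The customizer is a plain closure from request builder to request builder, applied to every request. The following std-only sketch shows why a reusable `Fn` closure fits that role; `RequestBuilder` here is a local stand-in for reqwest's type, and `MiniDownloader` and `with_token` are hypothetical names, not the crate's internals.

```rust
/// Stand-in for an HTTP request builder; it only records headers.
#[derive(Debug, Default, Clone, PartialEq)]
struct RequestBuilder {
    headers: Vec<(String, String)>,
}

impl RequestBuilder {
    fn bearer_auth(mut self, token: &str) -> Self {
        self.headers
            .push(("authorization".to_string(), format!("Bearer {token}")));
        self
    }
}

/// Boxed customizer type: `Fn`, so it can run once per request.
type Customizer = Box<dyn Fn(RequestBuilder) -> RequestBuilder + Send + Sync>;

/// Hypothetical holder for an optional customizer.
struct MiniDownloader {
    customize: Option<Customizer>,
}

impl MiniDownloader {
    /// Apply the customizer if one was supplied; otherwise pass the request through.
    fn prepare(&self, req: RequestBuilder) -> RequestBuilder {
        match &self.customize {
            Some(f) => f(req),
            None => req,
        }
    }
}

/// The closure owns the token (via `move`) and borrows it on every call.
fn with_token(token: String) -> MiniDownloader {
    let f: Customizer = Box::new(move |req: RequestBuilder| req.bearer_auth(&token));
    MiniDownloader { customize: Some(f) }
}
```

Since the closure only borrows its captured token, the initial request and every retry or resume request get the same header.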

For a complete one-file local-sink example with progress validation and required checksum verification, run:

cargo run -p http-ferry --example local_download -- \
  https://example.com/large.bin ./out sha1:<40-hex-digest>

§Adding a destination

Implement Sink (per-file state machine) and SinkFactory (builds one sink per item). The engine calls prepare once, then write_chunk repeatedly, then finalize. If the server forces a fresh download mid-stream, the engine calls restart before writing again.

use http_ferry::{Error, Hasher, Prepared, Sink, Target};

struct MemSink { name: String, buf: Vec<u8> }

impl Sink for MemSink {
    type Location = String; // identifies where the bytes landed

    async fn prepare(&mut self, target: Target<'_>) -> Result<Prepared<String>, Error> {
        // Inspect target.checksum / target.size to decide skip-vs-fetch.
        // Return a `Hasher` matching the expected checksum so resumed
        // downloads keep hashing from where they left off.
        Ok(Prepared::Resume { received: 0, partial: Hasher::for_checksum(target.checksum) })
    }

    async fn write_chunk(&mut self, chunk: &[u8]) -> Result<(), Error> {
        self.buf.extend_from_slice(chunk);
        Ok(())
    }

    async fn restart(&mut self) -> Result<(), Error> { self.buf.clear(); Ok(()) }

    async fn finalize(self) -> Result<String, Error> { Ok(self.name) }
}

Location types implement DownloadLocation so the engine can render where a file went. The item type M already implements Source, whose name() supplies the filename used in log lines, so Outcome’s Display works for free.
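The call order on a sink is easiest to see with a restart in the middle of a transfer. The sketch below replays a scripted sequence of events against a synchronous stand-in trait; `MiniSink`, `BufSink`, `Event`, and `run` are hypothetical, `prepare` is omitted, and the real trait is async with its own error type.

```rust
/// Synchronous stand-in for the async `Sink` trait, to show call order only.
trait MiniSink {
    fn write_chunk(&mut self, chunk: &[u8]);
    fn restart(&mut self);
    fn finalize(self) -> Vec<u8>;
}

/// In-memory sink that keeps whatever it has been given since the last restart.
struct BufSink {
    buf: Vec<u8>,
}

impl MiniSink for BufSink {
    fn write_chunk(&mut self, chunk: &[u8]) {
        self.buf.extend_from_slice(chunk);
    }
    fn restart(&mut self) {
        // Drop partial data: the server is resending from byte zero.
        self.buf.clear();
    }
    fn finalize(self) -> Vec<u8> {
        self.buf
    }
}

/// One step of a simulated transfer.
enum Event<'a> {
    Chunk(&'a [u8]),
    /// The server ignored the range and started over.
    Restarted,
}

/// Replay the events in order, then finalize exactly once.
fn run<S: MiniSink>(mut sink: S, events: &[Event<'_>]) -> Vec<u8> {
    for ev in events {
        match ev {
            Event::Chunk(bytes) => sink.write_chunk(bytes),
            Event::Restarted => sink.restart(),
        }
    }
    sink.finalize()
}
```

A sink whose `restart` forgets to clear its state would end up with the partial bytes prepended to the full body, which the checksum step would then reject.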

§Design notes

  • Auth is a closure, not a credential type. Downloader never names a service’s credential model. The builder works without customization, and advanced consumers can supply a Fn(RequestBuilder) -> RequestBuilder for bearer tokens, basic auth, signed headers, or anything else.
  • Known URLs use Download + drive_downloads. This is the common path when each item already has a source URL.
  • Domain types implement Source and use drive. Your item describes itself (name/size/checksum) and drive resolves its URL through a closure. Resolution can fail per item (yielding a non-fatal Failed) without tearing down the stream; a failure pulling the next item from the source yields a fatal StreamFailed.
  • Caller errors flow in through Error::Source. The resolver and the input item stream produce the caller’s error type. The engine type-erases them through Source(Box<dyn Error + Send + Sync>), so it never needs to know a consumer’s domain errors; callers recover the original by downcast.
  • No auto-abort of interrupted uploads. Rust has no AsyncDrop, so a sink that leaves server-side state (e.g. an S3 multipart upload) documents how to garbage-collect it rather than attempting brittle cleanup on drop. The S3 sink also defers CreateMultipartUpload to the first byte, so a source error before any data arrives leaves nothing behind.
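The type-erasure note above relies on the standard library's boxed-error downcasting. A minimal sketch of the round trip follows; `EngineError` is a local stand-in that models only a `Source` variant, and `ListingExpired`, `erase`, and `recover` are hypothetical.

```rust
use std::error::Error as StdError;
use std::fmt;

/// A caller-side error type the engine knows nothing about (hypothetical).
#[derive(Debug, PartialEq)]
struct ListingExpired {
    page: u32,
}

impl fmt::Display for ListingExpired {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "listing page {} expired", self.page)
    }
}

impl StdError for ListingExpired {}

/// Stand-in for the engine's error enum; only the `Source` variant is modeled.
#[derive(Debug)]
enum EngineError {
    Source(Box<dyn StdError + Send + Sync>),
}

/// Engine side: box any caller error without naming its type.
fn erase<E: StdError + Send + Sync + 'static>(e: E) -> EngineError {
    EngineError::Source(Box::new(e))
}

/// Caller side: try to get the concrete type back.
fn recover(err: &EngineError) -> Option<&ListingExpired> {
    match err {
        EngineError::Source(inner) => inner.downcast_ref::<ListingExpired>(),
    }
}
```

`downcast_ref` returns `None` when the boxed error is some other type, so a caller can fall back to the `Display` output for errors it does not recognize.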

Modules§

local

Structs§

Download
A simple URL-bearing work item for callers that already know where each file lives. Use with drive_downloads when no per-item resolver is needed.
Downloader
HTTP client for resumable range fetches. The customize closure is the auth seam: it is applied to every request, so callers inject basic auth, a bearer token, signed headers, or nothing — the engine stays auth-agnostic.
DownloaderBuilder
Builder for Downloader. Settings are applied to the wrapped Downloader, which build returns.
Target
Borrowed view of a Source item handed to sinks at prepare time. The source URL is resolved by the engine and the caller’s item is its own concern, so neither appears here — sinks are domain-agnostic.

Enums§

Checksum
Expected integrity hash supplied by the caller for a transfer. The engine hashes the byte stream with the matching algorithm and verifies the result; sinks use it for skip-on-match decisions.
Error
Errors produced by the transfer engine.
Hasher
Streaming hasher selected from the caller’s expected Checksum. None means no checksum was supplied: the engine still counts bytes but reports verified: false.
Outcome
Per-item outcome of a transfer stream, generic over the caller’s item type M. Failed carries per-item errors so a single bad item in a batch doesn’t tear down the whole stream. StreamFailed carries errors that happen before an item is available, such as a failed listing request or destination preflight.
Prepared

Traits§

DownloadLocation
Renders a transfer destination for log lines.
Sink
SinkFactory
Builds a per-item Sink. One factory drives a whole stream of items (singular call sites pass a one-element stream).
Source
Describes a caller’s work item to the engine: its destination name, byte size, and optional checksum. The item value itself is cloned into Progress events and handed back in terminal Outcomes. name is also the item’s label in Outcome’s Display.

Functions§

drive
One driver for every download path. Pulls items from the input stream, resolves each item’s source URL, asks the factory for a per-item sink, and runs run_download. Per-item errors (url resolution, sink build, transport failure) yield Failed and the loop continues to the next item — a one-element input stream therefore yields exactly one terminal outcome.
drive_downloads
Simpler driver for transfers whose source URL is already known. For domain-specific items that need fallible URL resolution, use drive.
run_download
Streams one item’s download. Only emits the happy-path Outcome variants (Progress, Skipped, Downloaded); per-item faults bubble out as Err and drive turns them into Failed. StreamFailed is never produced here — it’s reserved for pre-item errors at the drive layer.
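The split between run_download (returns Err on a per-item fault) and drive (turns that Err into Failed and keeps going) can be sketched with plain iterators. All names below are stand-ins: `MiniOutcome` simplifies the outcome payloads to strings, `run_one` fakes a failure for names ending in `.bad`, and ending the loop after StreamFailed is an assumption of the sketch.

```rust
/// Stand-in for the per-item outcome enum; payloads simplified to strings.
#[derive(Debug, PartialEq)]
enum MiniOutcome {
    Downloaded(String),
    Failed { item: String, error: String },
    StreamFailed(String),
}

/// Stand-in for `run_download`: a happy-path outcome, or Err for a per-item fault.
fn run_one(item: &str) -> Result<MiniOutcome, String> {
    if item.ends_with(".bad") {
        Err(format!("checksum mismatch for {item}"))
    } else {
        Ok(MiniOutcome::Downloaded(item.to_string()))
    }
}

/// Stand-in for `drive`: per-item errors become `Failed` and the loop continues;
/// an error pulling the next item becomes `StreamFailed`.
fn drive_all<I>(items: I) -> Vec<MiniOutcome>
where
    I: IntoIterator<Item = Result<String, String>>,
{
    let mut out = Vec::new();
    for next in items {
        match next {
            Err(e) => {
                out.push(MiniOutcome::StreamFailed(e));
                break; // assumption: nothing follows a fatal stream error
            }
            Ok(item) => match run_one(&item) {
                Ok(done) => out.push(done),
                Err(error) => out.push(MiniOutcome::Failed { item, error }),
            },
        }
    }
    out
}
```

In this model a bad file in the middle of a batch produces one `Failed` entry and the files after it still download, whereas a failure of the input stream itself stops the run.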