- Proposal Name: checksum
- Start Date: 2025-11-24
- RFC PR: [apache/opendal#6817](https://github.com/apache/opendal/pull/6817)
- Tracking Issue: [apache/opendal#5549](https://github.com/apache/opendal/issues/5549)

# Summary

Add a single full-file checksum abstraction (`Checksum { algo, value }`), capability booleans for supported algorithms, write options for user-provided checksums, metadata return of the final checksum, and a `ChecksumLayer` that can auto-compute and enforce end-to-end verification using a preferred algorithm order.

# Motivation

- Give users a storage-agnostic way to attach and receive full-file checksums.
- Detect corruption or mismatched uploads early by comparing expected vs actual values.
- Provide an opt-in layer to fill gaps where backends cannot verify or return checksums.
- Keep changes minimal and consistent with existing `Capability` boolean style.

# Guide-level explanation

## New concepts
- `ChecksumAlgo`: algorithms we support (`Crc64Nvme`, `Crc32c`, `Md5`, `Sha256`, extensible).
- `Checksum`: holds exactly one algorithm and the full-file checksum bytes.
- `ChecksumLayer`: optional layer that computes/checks checksums with a preferred algorithm list and an `enforce` flag.

## Examples

### Write with a user-computed checksum (no layer)
```rust,no_run
use opendal::services;
use opendal::{Checksum, ChecksumAlgo, Operator, Result};

fn crc64_nvme_of(_data: &[u8]) -> Vec<u8> {
    // Placeholder for a user-side CRC64-NVMe computation; a real
    // implementation must return the actual 8-byte checksum of the input.
    vec![0; 8]
}

#[tokio::main]
async fn main() -> Result<()> {
    let builder = services::Memory::default();
    let op = Operator::new(builder)?.finish();

    let data = b"hello checksum".to_vec();
    let expected = Checksum::new(ChecksumAlgo::Crc64Nvme, crc64_nvme_of(&data));

    // Assumes the backend supports CRC64-NVMe.
    // A mismatch returns ErrorKind::ChecksumMismatch.
    op.write_with("foo.txt", data)
        .checksum(expected)
        .await?;
    Ok(())
}
```
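
The `crc64_nvme_of` placeholder in the example above can be filled in with a real computation. The following is a minimal, dependency-free sketch of a bitwise CRC-64/NVME (reflected polynomial `0x9a6c9329ac4bc9b5`, all-ones initial value and final XOR). The big-endian byte order of the returned value is an assumption, since this RFC does not specify how `value` is encoded, and a production implementation would more likely use a table-driven or hardware-accelerated crate.

```rust
/// Bitwise CRC-64/NVME over `data`. Slow, but needs no dependencies.
fn crc64_nvme_of(data: &[u8]) -> Vec<u8> {
    // Bit-reversed form of the polynomial 0xad93d23594c93659.
    const POLY_REFLECTED: u64 = 0x9a6c_9329_ac4b_c9b5;

    let mut crc = u64::MAX; // initial value: all ones
    for &byte in data {
        crc ^= u64::from(byte);
        for _ in 0..8 {
            // Shift out one bit; fold in the polynomial if that bit was set.
            crc = if crc & 1 == 1 {
                (crc >> 1) ^ POLY_REFLECTED
            } else {
                crc >> 1
            };
        }
    }

    // Final XOR with all ones. Assumption: the 8 bytes are stored big-endian.
    (!crc).to_be_bytes().to_vec()
}
```

The result is always 8 bytes long, which matches the length of the placeholder value used in the example.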

### Read and inspect checksum from metadata
```rust,no_run
use opendal::services;
use opendal::{Operator, Result};

#[tokio::main]
async fn main() -> Result<()> {
    let builder = services::Memory::default();
    let op = Operator::new(builder)?.finish();

    let meta = op.stat("foo.txt").await?;
    if let Some(cs) = meta.checksum() {
        println!("algo={:?}, value={:x?}", cs.algo, cs.value);
    }
    Ok(())
}
```

### Enable end-to-end verification via ChecksumLayer (auto-compute)
```rust,no_run
use opendal::layers::ChecksumLayer;
use opendal::services;
use opendal::{ChecksumAlgo, Operator, Result};

#[tokio::main]
async fn main() -> Result<()> {
    let builder = services::Memory::default();

    // Prefer CRC64-NVMe, fall back to SHA-256.
    // enforce(true): if the backend lacks support, compute locally;
    // any mismatch returns an error.
    let op = Operator::new(builder)?
        .layer(
            ChecksumLayer::new()
                .preferred(vec![ChecksumAlgo::Crc64Nvme, ChecksumAlgo::Sha256])
                .enforce(true),
        )
        .finish();

    // User does not provide checksum; layer will compute and attach automatically.
    op.write("bar.bin", b"data".to_vec()).await?;

    // If metadata lacks the preferred checksum, the layer will stream-read and compute.
    let _ = op.read("bar.bin").await?;
    Ok(())
}
```

### Error on mismatch
```rust,no_run
use opendal::services;
use opendal::{Checksum, ChecksumAlgo, Operator, Result, ErrorKind};

#[tokio::main]
async fn main() -> Result<()> {
    let builder = services::Memory::default();
    let op = Operator::new(builder)?.finish();

    // A deliberately wrong SHA-256 value (32 zero bytes).
    let wrong = Checksum::new(ChecksumAlgo::Sha256, vec![0; 32]);
    let res = op
        .write_with("bad.bin", b"payload".to_vec())
        .checksum(wrong)
        .await;

    assert!(matches!(res, Err(err) if err.kind() == ErrorKind::ChecksumMismatch));
    Ok(())
}
```

# Reference-level explanation

## Data types
- `ChecksumAlgo`: enum of supported algorithms. Extending this enum is allowed.
- `Checksum`: `{ algo: ChecksumAlgo, value: Vec<u8> }`; represents the full file only.
- `Metadata`: add `checksum: Option<Checksum>` plus helpers (`checksum()`, `crc64_nvme()`, etc.).
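
The data types listed above could take roughly the following shape. This is a standalone sketch rather than the final API: the `#[non_exhaustive]` attribute, the `with_checksum` setter, and the return type of `crc64_nvme()` are assumptions made for illustration, and `Metadata` is reduced to the one new field.

```rust
/// Supported algorithms. `#[non_exhaustive]` is an assumption that would
/// let new variants be added without breaking downstream matches.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[non_exhaustive]
pub enum ChecksumAlgo {
    Crc64Nvme,
    Crc32c,
    Md5,
    Sha256,
}

/// Exactly one algorithm plus the full-file checksum bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Checksum {
    pub algo: ChecksumAlgo,
    pub value: Vec<u8>,
}

impl Checksum {
    pub fn new(algo: ChecksumAlgo, value: Vec<u8>) -> Self {
        Self { algo, value }
    }
}

/// Stand-in for `Metadata`, reduced to the new field.
#[derive(Debug, Clone, Default)]
pub struct Metadata {
    checksum: Option<Checksum>,
}

impl Metadata {
    /// Hypothetical builder-style setter, used here only to build values.
    pub fn with_checksum(mut self, checksum: Checksum) -> Self {
        self.checksum = Some(checksum);
        self
    }

    pub fn checksum(&self) -> Option<&Checksum> {
        self.checksum.as_ref()
    }

    /// Returns the raw bytes only when the stored checksum is CRC64-NVMe.
    pub fn crc64_nvme(&self) -> Option<&[u8]> {
        self.checksum
            .as_ref()
            .filter(|c| c.algo == ChecksumAlgo::Crc64Nvme)
            .map(|c| c.value.as_slice())
    }
}
```

Because `Metadata` holds a single `Option<Checksum>`, a per-algorithm helper such as `crc64_nvme()` returns `None` whenever a different algorithm is stored.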

## Capability
- Add boolean fields to `Capability`: `checksum_crc64_nvme`, `checksum_crc32c`, `checksum_md5`, `checksum_sha256`.
- Semantics: `true` means the backend can accept and return that algorithm for full-file checksum.
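
As a rough illustration of the boolean style, the sketch below models only the four proposed flags. The `supports_checksum` helper is hypothetical and not part of this proposal; it shows how an algorithm could be mapped to its flag while the struct remains `Copy`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChecksumAlgo {
    Crc64Nvme,
    Crc32c,
    Md5,
    Sha256,
}

/// Stand-in for `Capability`, reduced to the four proposed flags.
/// Plain booleans keep the struct `Copy`.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Capability {
    pub checksum_crc64_nvme: bool,
    pub checksum_crc32c: bool,
    pub checksum_md5: bool,
    pub checksum_sha256: bool,
}

impl Capability {
    /// Hypothetical helper: look up the flag that belongs to `algo`.
    pub fn supports_checksum(&self, algo: ChecksumAlgo) -> bool {
        match algo {
            ChecksumAlgo::Crc64Nvme => self.checksum_crc64_nvme,
            ChecksumAlgo::Crc32c => self.checksum_crc32c,
            ChecksumAlgo::Md5 => self.checksum_md5,
            ChecksumAlgo::Sha256 => self.checksum_sha256,
        }
    }
}
```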

## Write path
- `WriteOptions` / `OpWrite` gains `checksum: Option<Checksum>`.
- Flow:
  1. If `checksum` is provided and its algo flag is `false` in capability, return `Unsupported`.
  2. If supported, pass to backend; mismatch returns `ChecksumMismatch`.
  3. Response metadata includes the final checksum (from backend or layer).
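
The three steps above can be sketched as a single function. Everything here is a simplified model rather than the real write path: `check_write` is a hypothetical name, `supported` stands in for the capability flags, and `backend_digest` stands in for whatever the backend computes over the stored bytes. Returning `None` when no checksum is provided is also an assumption of this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChecksumAlgo {
    Crc64Nvme,
    Crc32c,
    Md5,
    Sha256,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Checksum {
    pub algo: ChecksumAlgo,
    pub value: Vec<u8>,
}

/// Only the two error kinds this flow can produce.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    Unsupported,
    ChecksumMismatch,
}

pub fn check_write(
    supported: &[ChecksumAlgo],
    provided: Option<Checksum>,
    backend_digest: impl Fn(ChecksumAlgo) -> Vec<u8>,
) -> Result<Option<Checksum>, ErrorKind> {
    // No checksum provided: nothing to verify in this sketch.
    let Some(expected) = provided else {
        return Ok(None);
    };

    // Step 1: reject algorithms whose capability flag is false.
    if !supported.contains(&expected.algo) {
        return Err(ErrorKind::Unsupported);
    }

    // Step 2: compare the expected value with the backend's value.
    let actual = backend_digest(expected.algo);
    if actual != expected.value {
        return Err(ErrorKind::ChecksumMismatch);
    }

    // Step 3: report the final checksum for the response metadata.
    Ok(Some(Checksum { algo: expected.algo, value: actual }))
}
```

Checking the capability before touching any data means an unsupported algorithm fails fast, without uploading the payload first.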

## Read/stat path
- If backend provides checksum, fill `Metadata::checksum`.
- Otherwise leave `None`; `ChecksumLayer` may compute and inject.

## ChecksumLayer
- Config: `preferred(Vec<ChecksumAlgo>)`, `enforce(bool)`.
- Selection: pick the first preferred algorithm whose capability flag is `true`. If none is supported, the layer skips checksum handling when `enforce=false` and computes a checksum locally when `enforce=true`.
- Write: if the backend cannot verify, stream-compute the chosen algorithm, compare the result against the provided `checksum` (if any), return `ChecksumMismatch` on a mismatch, and inject the result into the returned metadata.
- Read: if metadata lacks the chosen algorithm, stream-compute it; a mismatch returns `ChecksumMismatch`. If `enforce=true` and no checksum can be obtained, surface `Unsupported` or a mismatch error.
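
The selection rule can be sketched as a small pure function. The `Plan` enum and `select_plan` are hypothetical names, and `supported` stands in for the capability flags. The text does not say which algorithm is computed locally when the backend supports none of the preferred ones; this sketch assumes the most preferred one.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChecksumAlgo {
    Crc64Nvme,
    Crc32c,
    Md5,
    Sha256,
}

/// What the layer decides to do for one operation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Plan {
    /// The backend accepts and returns this algorithm.
    Backend(ChecksumAlgo),
    /// The layer computes this algorithm itself.
    Local(ChecksumAlgo),
    /// No checksum handling.
    Skip,
}

pub fn select_plan(
    preferred: &[ChecksumAlgo],
    supported: &[ChecksumAlgo],
    enforce: bool,
) -> Plan {
    // First preferred algorithm that the backend supports wins.
    if let Some(algo) = preferred.iter().copied().find(|a| supported.contains(a)) {
        return Plan::Backend(algo);
    }

    match (enforce, preferred.first()) {
        // Assumption: local computation uses the most preferred algorithm.
        (true, Some(&algo)) => Plan::Local(algo),
        // Not enforcing, or nothing was preferred: do nothing.
        _ => Plan::Skip,
    }
}
```

With `preferred = [Crc64Nvme, Sha256]` and a backend that supports only `Sha256`, the function returns `Plan::Backend(Sha256)`; with a backend that supports neither, it returns `Plan::Skip` or `Plan::Local(Crc64Nvme)` depending on `enforce`.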

## Errors
- New `ErrorKind::ChecksumMismatch` for value differences.
- Unsupported algorithm keeps using existing `Unsupported` error kind.

## Backward compatibility
- `content_md5` stays; when backends return MD5, it can populate both `content_md5` and `checksum(algo=Md5)`.
- No behavior change for users who ignore checksum features.
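
A backend that returns MD5 might populate both fields as in the sketch below. It assumes `content_md5` keeps the base64 text form used by the HTTP `Content-MD5` header, while `checksum` holds the raw 16 bytes. `set_md5` and the local `base64_encode` helper are hypothetical; a real implementation would more likely rely on an existing base64 crate.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChecksumAlgo {
    Md5,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Checksum {
    pub algo: ChecksumAlgo,
    pub value: Vec<u8>,
}

/// Stand-in for `Metadata` with the existing and the proposed field.
#[derive(Debug, Default)]
pub struct Metadata {
    content_md5: Option<String>,
    checksum: Option<Checksum>,
}

impl Metadata {
    /// Hypothetical setter: fill both fields from one raw MD5 digest.
    pub fn set_md5(&mut self, digest: [u8; 16]) {
        self.content_md5 = Some(base64_encode(&digest));
        self.checksum = Some(Checksum {
            algo: ChecksumAlgo::Md5,
            value: digest.to_vec(),
        });
    }

    pub fn content_md5(&self) -> Option<&str> {
        self.content_md5.as_deref()
    }

    pub fn checksum(&self) -> Option<&Checksum> {
        self.checksum.as_ref()
    }
}

/// Standard base64 with `=` padding, written out to stay dependency-free.
fn base64_encode(bytes: &[u8]) -> String {
    const TABLE: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        // Pack up to three bytes into 24 bits, padding with zeros.
        let b0 = u32::from(chunk[0]);
        let b1 = u32::from(*chunk.get(1).unwrap_or(&0));
        let b2 = u32::from(*chunk.get(2).unwrap_or(&0));
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(TABLE[(n >> 18) as usize & 63] as char);
        out.push(TABLE[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { TABLE[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { TABLE[n as usize & 63] as char } else { '=' });
    }
    out
}
```

Existing callers of `content_md5()` keep seeing the same string, while new callers can read the raw bytes through `checksum()`.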

# Drawbacks
- More boolean fields in `Capability`; adding many algorithms enlarges the struct.
- `ChecksumLayer` can add CPU cost for large objects when `enforce` is enabled, because checksums may be computed locally.

# Rationale and alternatives
- Chose capability booleans to match existing style and keep `Capability: Copy`.
- Rejected multi-checksum containers to keep the surface small and semantics single-valued.
- Rejected HashSet/bitmask because booleans are already the established pattern in `Capability`.

# Prior art
- Cloud SDKs commonly expose a single MD5/CRC32C field (e.g., GCS, OSS); we generalize to multiple algorithms via booleans.
- Middleware-style checksum verification mirrors S3 client behavior, but is storage-agnostic here.

# Unresolved questions
- Default preferred order for `ChecksumLayer` (proposed: `Crc64Nvme`, `Sha256`, `Crc32c`, `Md5`).
- Per-backend capability matrix (which algorithms to mark true by default).

# Future possibilities
- Add more algorithms (e.g., `Sha1`) with new booleans.
- Optional `reverify_on_read` flag in `ChecksumLayer` to recompute even when a checksum exists.
- Expose checksum info in presign responses when services support checksum headers.