Module checkpoint

This module implements the API for customized Delta checkpoint writes, where the caller drives the write themselves. The entry point is Snapshot::create_checkpoint_writer.

If you want an all-in-one API that handles writing the checkpoint, use Snapshot::checkpoint instead.

§Checkpoint Types and Selection Logic

This API supports two checkpoint types, selected based on table features:

| Table Feature | Resulting Checkpoint Type | Description |
|---|---|---|
| No v2Checkpoints | Single-file Classic-named V1 | Follows V1 specification without CheckpointMetadata action |
| v2Checkpoints | Classic-named V2 (with or without sidecars) | Follows V2 specification with CheckpointMetadata action while maintaining backward compatibility via classic naming |

For more information on the V1/V2 specifications, see the following protocol section: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoint-specs
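
The selection rule in the table can be read as a single branch on whether the v2Checkpoints table feature is enabled. The sketch below is only an illustration of that rule: `CheckpointKind`, `select_checkpoint_kind`, and `includes_checkpoint_metadata` are hypothetical names invented here and are not part of the delta_kernel API.

```rust
/// Hypothetical stand-in for the two checkpoint types in the table above.
/// This is not the kernel's `CheckpointSpec` enum.
#[derive(Debug, PartialEq, Eq)]
enum CheckpointKind {
    /// Single-file, classic-named V1 checkpoint.
    ClassicNamedV1,
    /// Classic-named V2 checkpoint (with or without sidecars).
    ClassicNamedV2,
}

/// Mirrors the table: the v2Checkpoints table feature decides the type.
fn select_checkpoint_kind(v2_checkpoints_enabled: bool) -> CheckpointKind {
    if v2_checkpoints_enabled {
        CheckpointKind::ClassicNamedV2
    } else {
        CheckpointKind::ClassicNamedV1
    }
}

/// Per the table, only the V2 form carries a CheckpointMetadata action.
fn includes_checkpoint_metadata(kind: &CheckpointKind) -> bool {
    match kind {
        CheckpointKind::ClassicNamedV1 => false,
        CheckpointKind::ClassicNamedV2 => true,
    }
}
```

Under these assumptions, `select_checkpoint_kind(false)` yields the V1 form, and `select_checkpoint_kind(true)` yields the V2 form, which is the only one for which `includes_checkpoint_metadata` returns `true`.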

§Architecture

§Usage

The following steps outline the process of creating a checkpoint:

  1. Create a CheckpointWriter using Snapshot::create_checkpoint_writer
  2. Get the checkpoint path from CheckpointWriter::checkpoint_path
  3. Get the checkpoint data from CheckpointWriter::checkpoint_data
  4. Write the data to the path in object storage (engine-specific)
  5. Collect metadata (FileMeta) from the write operation
  6. Build a LastCheckpointHintStats from the exhausted iterator state
  7. Pass the LastCheckpointHintStats to CheckpointWriter::finalize

fn write_checkpoint_file(path: Url, data: ActionReconciliationIterator) -> DeltaResult<FileMeta> {
    todo!() /* engine-specific logic to write data to object storage */
}

let engine: &dyn Engine = todo!(); /* create engine instance */

// Create a snapshot for the table at the version you want to checkpoint
let url = delta_kernel::try_parse_uri("./tests/data/app-txn-no-checkpoint")?;
let snapshot = Snapshot::builder_for(url).build(engine)?;

// Create a checkpoint writer from the snapshot
let writer = snapshot.create_checkpoint_writer(engine)?;

// Get the checkpoint path and data
let checkpoint_path = writer.checkpoint_path()?;
let checkpoint_data = writer.checkpoint_data(engine)?;

// Get the iterator state before consuming the data
let state = checkpoint_data.state();

// Write the checkpoint data to the object store and collect metadata
// The write function consumes the iterator, dropping its Arc reference to the state.
let metadata: FileMeta = write_checkpoint_file(checkpoint_path, checkpoint_data)?;
/* IMPORTANT: All data must be written before finalizing the checkpoint */

// Build the LastCheckpointHintStats from the exhausted iterator state.
// Arc::into_inner returns None if any other reference to the state is still alive.
let state = std::sync::Arc::into_inner(state)
    .ok_or_else(|| Error::internal_error("checkpoint state Arc still has other references"))?;
let last_checkpoint_stats =
    delta_kernel::checkpoint::LastCheckpointHintStats::from_reconciliation_state(
        state,
        metadata.size,
        0, /* num_sidecars */
    )?;

// Finalize the checkpoint by passing the stats
writer.finalize(engine, &last_checkpoint_stats)?;
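
The example above depends on the write function consuming the iterator, so that the caller's `Arc` is the last reference by the time `Arc::into_inner` runs. The following is a minimal standard-library sketch of that ownership rule. `State`, `DataIter`, and `take_state_after_write` are hypothetical stand-ins for illustration and do not reflect the kernel's actual types.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the kernel's reconciliation state.
#[derive(Debug, PartialEq)]
struct State {
    actions_seen: u64,
}

/// Hypothetical stand-in for the checkpoint data iterator, which shares
/// ownership of the state with the caller.
struct DataIter {
    state: Arc<State>,
}

impl DataIter {
    /// Hands out a second reference to the shared state.
    fn state(&self) -> Arc<State> {
        Arc::clone(&self.state)
    }
}

/// Tries to take ownership of the state, either after the iterator has been
/// dropped (as a consuming write would do) or while it is still alive.
fn take_state_after_write(drop_iterator_first: bool) -> Option<State> {
    let iter = DataIter { state: Arc::new(State { actions_seen: 3 }) };
    let state = iter.state();

    // Either drop the iterator now, or keep it alive past `into_inner`.
    let still_alive = if drop_iterator_first {
        drop(iter);
        None
    } else {
        Some(iter)
    };

    // Succeeds only when `state` is the last remaining reference.
    let result = Arc::into_inner(state);
    drop(still_alive);
    result
}
```

In this sketch, `take_state_after_write(true)` returns the state, while `take_state_after_write(false)` returns `None` because the iterator still holds a reference. The second case corresponds to the internal error raised in the example above.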

§Warning

Multi-part (V1) checkpoints are DEPRECATED and UNSAFE.

§Note

We currently do not plan to support UUID-named V2 checkpoints, since S3’s put-if-absent semantics remove the need for UUIDs to ensure uniqueness. Supporting only classic-named checkpoints avoids added complexity, such as coordinating naming decisions between kernel and engine, and handling coexistence with legacy V1 checkpoints. If a compelling use case arises in the future, we can revisit this decision.

Structs§

CheckpointWriter
Orchestrates the process of creating a checkpoint for a table.
LastCheckpointHintStats
Information about a freshly-written checkpoint. Pass it to CheckpointWriter::finalize to produce the _last_checkpoint hint file.

Enums§

CheckpointSpec
Specifies the checkpoint format and behavior.
V2CheckpointConfig
Configuration for V2 checkpoints.

Constants§

DEFAULT_FILE_ACTIONS_PER_SIDECAR_HINT
Default value for V2CheckpointConfig::WithSidecar::file_actions_per_sidecar_hint. It’s the suggested upper bound of file actions (add and remove) per sidecar file when the caller does not provide an explicit hint.
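
To make the "upper bound per sidecar" wording concrete: if a writer respects the hint exactly, the number of sidecar files it needs is at least the file-action count divided by the hint, rounded up. The helper below is a hypothetical illustration of that arithmetic only. It is not a delta_kernel function, it takes the hint as a parameter rather than assuming the constant's value, and the kernel is free to split sidecars differently since the value is only a hint.

```rust
/// Hypothetical helper, not part of delta_kernel: the smallest number of
/// sidecar files that keeps every sidecar at or below `per_sidecar_hint`
/// file actions. Returns `None` for a zero hint, which has no valid split.
fn min_sidecars_for(file_actions: u64, per_sidecar_hint: u64) -> Option<u64> {
    if per_sidecar_hint == 0 {
        return None;
    }
    // Ceiling division written to avoid overflow near u64::MAX.
    let full = file_actions / per_sidecar_hint;
    let partial = u64::from(file_actions % per_sidecar_hint != 0);
    Some(full + partial)
}
```

For example, with a hint of 100, 250 file actions would need at least 3 sidecars under this reading, and 200 file actions would need at least 2.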