Module cleanup

A task to clean up a lance dataset, removing files that are no longer needed.

Currently we try to be rather conservative about what we delete.

The following types of files may be deleted by the cleanup function:

  • Old manifest files - If a manifest file is older than the threshold and is not the latest manifest then it will be deleted.
  • Unreferenced data files - If a data file is not referenced by any fragment in a valid manifest file then it will be deleted.
  • Unreferenced delete files - If a delete file is not referenced by any fragment in a valid manifest file then it will be deleted.
  • Unreferenced index files - If an index file is not referenced by any valid manifest file then it will be deleted.
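The rules above can be sketched as two set computations. This is a minimal illustration, not the crate's implementation: the `Manifest` struct, `removable_versions`, and `unreferenced_files` are hypothetical names, and "valid manifest" is assumed here to mean a manifest that survives the cleanup.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Hypothetical, simplified view of a manifest (not the Lance type).
struct Manifest {
    version: u64,
    age: Duration,
    referenced_files: Vec<String>,
}

/// Versions whose manifest is older than the threshold and is not the latest.
fn removable_versions(manifests: &[Manifest], older_than: Duration) -> HashSet<u64> {
    let latest = manifests.iter().map(|m| m.version).max();
    manifests
        .iter()
        .filter(|m| m.age > older_than && Some(m.version) != latest)
        .map(|m| m.version)
        .collect()
}

/// Files that no retained manifest references, in their original order.
/// Assumption: a "valid" manifest is one that is not being removed.
fn unreferenced_files<'a>(
    all_files: &'a [String],
    manifests: &[Manifest],
    removable: &HashSet<u64>,
) -> Vec<&'a str> {
    let referenced: HashSet<&str> = manifests
        .iter()
        .filter(|m| !removable.contains(&m.version))
        .flat_map(|m| m.referenced_files.iter().map(String::as_str))
        .collect();
    all_files
        .iter()
        .map(String::as_str)
        .filter(|f| !referenced.contains(f))
        .collect()
}
```

Note that the latest manifest is never removable in this sketch, no matter how old it is, which matches the first rule in the list.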

It is also difficult to distinguish between a data/tx/idx file which was leftover from an abandoned transaction and a data file which is part of an ongoing operation (both will look like unreferenced data files).

If the file is referenced by at least one manifest (even if that manifest is old and is itself being deleted) then we assume it is not part of an ongoing operation, so it can be safely deleted once no valid manifest references it.

If the file is not referenced by any manifest then we look at the age of the file. If the file is at least 7 days old then we assume it is probably not part of any ongoing operation and we will delete it.

Otherwise we will leave the file unless delete_unverified is set to true (which should only be done if the caller can guarantee there are no updates happening at the same time).
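The decision described above for a single file can be written as a small function. This is a sketch under stated assumptions, not the crate's code: `FileInfo`, `Decision`, `decide`, and `UNVERIFIED_THRESHOLD` are hypothetical names, and only the 7-day figure and the delete_unverified flag come from the text.

```rust
use std::time::Duration;

/// Hypothetical constant mirroring the 7-day grace period described above.
const UNVERIFIED_THRESHOLD: Duration = Duration::from_secs(7 * 24 * 60 * 60);

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Keep,
    Delete,
}

/// Hypothetical summary of what cleanup knows about one file.
struct FileInfo {
    /// Referenced by a manifest that will remain after cleanup.
    referenced_by_retained_manifest: bool,
    /// Referenced by any manifest at all, including ones being deleted.
    referenced_by_any_manifest: bool,
    /// Time since the file was last modified.
    age: Duration,
}

fn decide(file: &FileInfo, delete_unverified: bool) -> Decision {
    if file.referenced_by_retained_manifest {
        // Still needed by a version we are keeping.
        return Decision::Keep;
    }
    if file.referenced_by_any_manifest {
        // Verified: it belonged to a committed version, so it is not in-progress work.
        return Decision::Delete;
    }
    // Unverified: may be an abandoned transaction or an ongoing operation.
    if file.age >= UNVERIFIED_THRESHOLD || delete_unverified {
        Decision::Delete
    } else {
        Decision::Keep
    }
}
```

The ordering of the checks matters: a file is only treated as "unverified" after both reference checks fail, so the age threshold never causes a referenced file to be removed.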

Structs§

CleanupPolicy
CleanupPolicyBuilder
RemovalStats

Functions§

auto_cleanup_hook
If the dataset config has lance.auto_cleanup parameters set, this function automatically calls dataset.cleanup_old_versions every lance.auto_cleanup.interval versions. This function calls dataset.cleanup_old_versions with lance.auto_cleanup.older_than for older_than and Some(false) for both delete_unverified and error_if_tagged_old_versions.
build_cleanup_policy
cleanup_cascade_branch
This is triggered when a parent branch is being cleaned and clean_referenced_branches is set to true. For cascade branches, some cleanup parameters need to be overridden.
cleanup_old_versions
Deletes old versions of a dataset, removing files that are no longer needed.
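As an illustration of the interval behaviour described for auto_cleanup_hook, the sketch below models "every lance.auto_cleanup.interval versions" as a modulo check on the version number. That modelling is an assumption, as are the names `AutoCleanupArgs` and `plan_auto_cleanup`; the actual trigger condition in the crate may differ. Only the argument values (older_than from the config, Some(false) for the two flags) come from the description above.

```rust
use std::time::Duration;

/// Hypothetical bundle of the arguments the hook is described as passing
/// to cleanup_old_versions.
#[derive(Debug, PartialEq)]
struct AutoCleanupArgs {
    older_than: Duration,
    delete_unverified: Option<bool>,
    error_if_tagged_old_versions: Option<bool>,
}

/// Returns the arguments for a cleanup call if one is due at `version`.
/// `interval` and `older_than` stand in for the lance.auto_cleanup.* config
/// values; `None` means the parameter is not set.
fn plan_auto_cleanup(
    version: u64,
    interval: Option<u64>,
    older_than: Option<Duration>,
) -> Option<AutoCleanupArgs> {
    let interval = interval.filter(|n| *n > 0)?;
    let older_than = older_than?;
    // Assumption: "every N versions" is modelled as version % N == 0.
    if version % interval != 0 {
        return None;
    }
    Some(AutoCleanupArgs {
        older_than,
        delete_unverified: Some(false),
        error_if_tagged_old_versions: Some(false),
    })
}
```

Because delete_unverified is always Some(false) here, an automatic cleanup in this sketch never removes young unreferenced files, which keeps it safe to run alongside concurrent writers.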