A task to clean up a lance dataset, removing files that are no longer needed.
Currently we try to be rather conservative about what we delete.
The following types of files may be deleted by the cleanup function:
- Old manifest files - If a manifest file is older than the threshold and is not the latest manifest then it will be deleted.
- Unreferenced data files - If a data file is not referenced by any fragment in a valid manifest file then it will be deleted.
- Unreferenced delete files - If a delete file is not referenced by any fragment in a valid manifest file then it will be deleted.
- Unreferenced index files - If an index file is not referenced by any valid manifest file then it will be deleted.
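The selection rules above can be sketched as a pair of small functions. This is only an illustration, not the lance implementation: `ManifestInfo`, `old_manifest_versions`, and `unreferenced_files` are hypothetical names, manifests are reduced to a version, a timestamp, and a set of referenced paths, and a "valid manifest" is assumed to mean one that survives the cleanup.

```rust
use std::collections::HashSet;
use std::time::{Duration, SystemTime};

/// Hypothetical, simplified view of a manifest (not the lance type).
#[derive(Debug, Clone)]
pub struct ManifestInfo {
    pub version: u64,
    pub written_at: SystemTime,
    /// Paths of the data, delete, and index files this manifest references.
    pub referenced: HashSet<String>,
}

/// Versions of manifests that are older than `older_than` and are not the
/// latest manifest. The latest manifest is never returned, however old it is.
pub fn old_manifest_versions(
    manifests: &[ManifestInfo],
    now: SystemTime,
    older_than: Duration,
) -> Vec<u64> {
    let latest = manifests.iter().map(|m| m.version).max();
    manifests
        .iter()
        .filter(|m| Some(m.version) != latest)
        .filter(|m| {
            now.duration_since(m.written_at)
                .map(|age| age > older_than)
                .unwrap_or(false) // timestamps in the future are treated as "not old"
        })
        .map(|m| m.version)
        .collect()
}

/// Files that no retained manifest references, sorted for stable output.
/// Assumption: "valid" manifests are those whose version is not in `removed`.
pub fn unreferenced_files(
    all_files: &HashSet<String>,
    manifests: &[ManifestInfo],
    removed: &[u64],
) -> Vec<String> {
    let kept: HashSet<&str> = manifests
        .iter()
        .filter(|m| !removed.contains(&m.version))
        .flat_map(|m| m.referenced.iter().map(String::as_str))
        .collect();
    let mut out: Vec<String> = all_files
        .iter()
        .filter(|f| !kept.contains(f.as_str()))
        .cloned()
        .collect();
    out.sort();
    out
}
```

The result of `unreferenced_files` is only a list of candidates. Whether a candidate is actually removed still depends on the verification and age rules that this page describes.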
It is also difficult to distinguish between a data/tx/idx file that was left over from an abandoned transaction and one that is part of an ongoing operation (both will look like unreferenced files).
If the file is referenced by at least one manifest (even if that manifest is old and being deleted) then we assume it is not part of an ongoing operation and can be safely deleted.
If the file is not referenced by any manifest then we look at the age of the file. If the file is at least 7 days old then we assume it is probably not part of any ongoing operation and we will delete it.
Otherwise we will leave the file unless delete_unverified is set to true, which should only be done if the caller can guarantee that no updates are happening at the same time.
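The rules for verified and unverified files can be written as one small decision function. This is a minimal sketch under stated assumptions, not the lance code: `FileFacts`, `Decision`, `decide`, and `UNVERIFIED_THRESHOLD` are hypothetical names, and the caller is assumed to have already worked out which manifests reference the file.

```rust
use std::time::Duration;

/// The 7-day grace period described in the text.
const UNVERIFIED_THRESHOLD: Duration = Duration::from_secs(7 * 24 * 60 * 60);

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Keep,
    Delete,
}

/// Hypothetical summary of what is known about one file.
pub struct FileFacts {
    /// Referenced by a manifest that will survive the cleanup.
    pub referenced_by_retained_manifest: bool,
    /// Referenced by any manifest at all, including ones being deleted.
    pub referenced_by_any_manifest: bool,
    /// Time since the file was written.
    pub age: Duration,
}

pub fn decide(facts: &FileFacts, delete_unverified: bool) -> Decision {
    // Still in use by a retained version: never delete.
    if facts.referenced_by_retained_manifest {
        return Decision::Keep;
    }
    // Only old manifests reference it: a transaction committed it at some
    // point, so it cannot belong to an ongoing operation.
    if facts.referenced_by_any_manifest {
        return Decision::Delete;
    }
    // Unverified file: delete only if it is old enough or the caller opted in.
    if facts.age >= UNVERIFIED_THRESHOLD || delete_unverified {
        Decision::Delete
    } else {
        Decision::Keep
    }
}
```

Under this sketch, a one-hour-old file that no manifest references is kept by default and deleted only when `delete_unverified` is true.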
Structs
Functions
- auto_cleanup_hook - If the dataset config has lance.auto_cleanup parameters set, this function automatically calls dataset.cleanup_old_versions every lance.auto_cleanup.interval versions. This function calls dataset.cleanup_old_versions with lance.auto_cleanup.older_than for older_than and Some(false) for both delete_unverified and error_if_tagged_old_versions.
- build_cleanup_policy
- cleanup_cascade_branch - This is triggered when a parent branch is being cleaned and clean_referenced_branches is set to true. For cascade branches, some cleanup parameters need to be overridden.
- cleanup_old_versions - Deletes old versions of a dataset, removing files that are no longer needed.
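The trigger described for auto_cleanup_hook can be sketched without the lance crate. Everything here is an assumption made for illustration: the config is modelled as a plain string map, "every interval versions" is read as "the version number is a multiple of the interval", missing or unparseable values simply disable the hook, and `CleanupCall` and `plan_auto_cleanup` are hypothetical names. The value of older_than is passed through as an opaque string because its format is not stated on this page.

```rust
use std::collections::HashMap;

/// Hypothetical record of the arguments the hook would pass on.
#[derive(Debug, PartialEq, Eq)]
pub struct CleanupCall {
    pub older_than: String,
    pub delete_unverified: Option<bool>,
    pub error_if_tagged_old_versions: Option<bool>,
}

/// Returns the cleanup arguments if a cleanup should run for `version`.
pub fn plan_auto_cleanup(config: &HashMap<String, String>, version: u64) -> Option<CleanupCall> {
    let interval: u64 = config.get("lance.auto_cleanup.interval")?.parse().ok()?;
    let older_than = config.get("lance.auto_cleanup.older_than")?.clone();
    // Assumption: run only on versions that are a multiple of the interval.
    if interval == 0 || version % interval != 0 {
        return None;
    }
    Some(CleanupCall {
        older_than,
        // Both flags are fixed to Some(false), as the description states.
        delete_unverified: Some(false),
        error_if_tagged_old_versions: Some(false),
    })
}
```

With an interval of 20, this sketch plans a cleanup for version 40 and skips version 41. The real hook may handle invalid configuration differently, for example by returning an error.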