pub struct Scan { /* private fields */ }
The result of building a scan over a table. This can be used to get the actual data from scanning the table.
Implementations§
impl Scan
pub fn table_root(&self) -> &Url
The table’s root URL. Any relative paths returned from scan_data (or in a callback from
ScanMetadata::visit_scan_files) must be resolved against this root to get the actual path to
the file.
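The resolution rule above can be sketched with plain strings. This is a minimal, std-only illustration, not the kernel's implementation: `resolve_path` is a hypothetical helper, and treating any path that contains `://` as already absolute is an assumption made for the example. Real code would more likely call `Url::join` on the `&Url` returned by `table_root`.

```rust
/// Hypothetical helper (not part of the kernel API): resolve a file path taken
/// from scan metadata against the table root.
///
/// Assumption: a path that already contains a scheme separator ("://") is
/// absolute and is returned unchanged; anything else is relative to the root.
fn resolve_path(table_root: &str, path: &str) -> String {
    if path.contains("://") {
        return path.to_string();
    }
    // Normalize the slashes so exactly one separates root and relative path.
    format!(
        "{}/{}",
        table_root.trim_end_matches('/'),
        path.trim_start_matches('/')
    )
}
```

When using `Url::join` instead, check that the root ends with a trailing slash; otherwise the last path segment of the root is replaced rather than extended.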
pub fn snapshot(&self) -> &SnapshotRef
Get a shared reference to the Snapshot of this scan.
pub fn logical_schema(&self) -> &SchemaRef
Get a shared reference to the logical Schema of the scan (i.e. the output schema of the
scan). Note that the logical schema can differ from the physical schema due to e.g.
partition columns which are present in the logical schema but not in the physical schema.
pub fn physical_schema(&self) -> &SchemaRef
Get a shared reference to the physical Schema of the scan. This represents the schema
of the underlying data files which must be read from storage.
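As a rough illustration of how the logical and physical schemas relate, the sketch below derives the physical column list by dropping partition columns from the logical one. It works on column names only and assumes column mapping is disabled, so logical and physical names coincide. `physical_columns` is a hypothetical helper, and the kernel may apply further differences that are not modeled here.

```rust
/// Hypothetical helper: given the logical column names and the table's
/// partition columns, return the columns expected in the data files.
///
/// Assumption: no column mapping, so the only difference between the two
/// schemas is that partition columns are absent from the data files.
fn physical_columns<'a>(logical: &[&'a str], partition_columns: &[&str]) -> Vec<&'a str> {
    logical
        .iter()
        .copied()
        // Keep a column only if it is not a partition column.
        .filter(|c| !partition_columns.iter().any(|p| *p == *c))
        .collect()
}
```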
pub fn physical_predicate(&self) -> Option<PredicateRef>
Get the physical predicate of the scan as a PredicateRef, if the scan has one.
pub fn logical_stats_schema(&self) -> Option<&SchemaRef>
Available on crate feature internal-api only.
Get the logical schema for file statistics.
When stats_columns is requested in a scan, the stats_parsed column in scan metadata
contains file statistics read using physical column names (to handle column mapping).
This method returns the corresponding logical schema that maps those physical column
names back to the table’s logical column names, enabling engines to interpret the stats
correctly.
Returns None if stats were not requested (i.e., stats_columns was not set in the scan).
pub fn scan_metadata(
    &self,
    engine: &dyn Engine,
) -> DeltaResult<impl Iterator<Item = DeltaResult<ScanMetadata>>>
Get an iterator of ScanMetadatas that should be used to facilitate a scan. This handles
log-replay, reconciling Add and Remove actions, and applying data skipping (if possible).
Each item in the returned iterator is a struct of:
- Box<dyn EngineData>: Data in engine format, where each row represents a file to be scanned. The schema for each row can be obtained by calling scan_row_schema.
- Vec<bool>: A selection vector. If a row is at index i and this vector is false at index i, then that row should not be processed (i.e. it is filtered out). If the vector is true at index i, the row should be processed. If the selection vector is shorter than the number of rows returned, missing elements are considered true, i.e. included in the query. NB: If you are using the default engine and plan to call arrow's filter_record_batch, you need to extend this vector to the full length of the batch or arrow will drop the extra rows.
- Vec<Option<Expression>>: Transformation expressions that need to be applied. For each row at index i in the above data, if an expression exists at index i in the Vec, the associated expression must be applied to the data read from the file specified by the row. The resultant schema for this expression is guaranteed to be Self::logical_schema(). If the item at index i in this Vec is None, or if the Vec contains fewer than i elements, no expression need be applied and the data read from disk is already in the correct logical state.
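The selection-vector and transform rules above can be sketched with plain vectors. The helpers below are hypothetical and std-only; they operate on `Vec<bool>` and a generic `Option<T>` standing in for `Option<Expression>`, not on the kernel's own types.

```rust
/// Pad a selection vector so it covers every row. Missing trailing entries
/// are treated as `true` (selected), matching the rule described above.
fn pad_selection_vector(mut selection: Vec<bool>, num_rows: usize) -> Vec<bool> {
    if selection.len() < num_rows {
        selection.resize(num_rows, true);
    }
    selection
}

/// Indices of the rows that should be processed, without padding first.
fn selected_rows(selection: &[bool], num_rows: usize) -> Vec<usize> {
    (0..num_rows)
        .filter(|&i| selection.get(i).copied().unwrap_or(true))
        .collect()
}

/// Transform to apply to the file described by row `i`, if any.
/// `T` is a stand-in for the kernel's `Expression` type (an assumption made
/// so the sketch compiles without the crate).
fn transform_for<T>(transforms: &[Option<T>], i: usize) -> Option<&T> {
    // Out-of-range indices and explicit `None` entries both mean "no transform".
    transforms.get(i).and_then(|t| t.as_ref())
}
```

Padding before handing the vector to a filter routine avoids silently dropping the trailing rows that the shorter vector implicitly selects.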
pub fn scan_metadata_from(
    &self,
    engine: &dyn Engine,
    existing_version: Version,
    existing_data: impl IntoIterator<Item = Box<dyn EngineData>> + 'static,
    _existing_predicate: Option<PredicateRef>,
) -> DeltaResult<Box<dyn Iterator<Item = DeltaResult<ScanMetadata>>>>
Available on crate feature internal-api only.
Get an updated iterator of ScanMetadatas based on an existing iterator of EngineDatas.
The existing iterator is assumed to contain data from a previous call to scan_metadata.
Engines may decide to cache the results of scan_metadata to avoid additional IO operations
required to replay the log.
As such the new scan’s predicate must “contain” the previous scan’s predicate. That is, the new scan’s predicate MUST skip all files the previous scan’s predicate skipped. The new scan’s predicate is also allowed to skip files the previous predicate kept. For example, if the previous scan predicate was
WHERE a < 42 AND b = 10
then it is legal for the new scan to use predicates such as the following:
WHERE a = 30 AND b = 10
WHERE a < 10 AND b = 10
WHERE a < 42 AND b = 10 AND c = 20
but it is NOT legal for the new scan to use predicates like these:
WHERE a < 42
WHERE a = 50 AND b = 10
WHERE a < 42 AND b <= 10
WHERE a < 42 OR b = 10
The current implementation does not yet validate the existing predicate against the current predicate. Until this is implemented, the caller must ensure that the existing predicate is compatible with the current predicate.
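Because the caller is responsible for this check, a small validation routine may help. The sketch below is not part of the kernel: it models only predicates that are conjunctions of per-column integer ranges (so the OR example above cannot be expressed), and `ColumnRange`, `Conjunction`, and `is_narrowing` are hypothetical names introduced for illustration.

```rust
use std::collections::HashMap;

/// Inclusive integer range for one column; `None` means unbounded on that side.
#[derive(Clone, Copy, Debug)]
struct ColumnRange {
    lo: Option<i64>,
    hi: Option<i64>,
}

impl ColumnRange {
    /// `col < v` (assumes `v > i64::MIN`).
    fn less_than(v: i64) -> Self {
        ColumnRange { lo: None, hi: Some(v - 1) }
    }
    /// `col <= v`
    fn at_most(v: i64) -> Self {
        ColumnRange { lo: None, hi: Some(v) }
    }
    /// `col = v`
    fn equals(v: i64) -> Self {
        ColumnRange { lo: Some(v), hi: Some(v) }
    }
    /// True if every value allowed by `self` is also allowed by `outer`.
    fn within(&self, outer: &ColumnRange) -> bool {
        let lo_ok = match (outer.lo, self.lo) {
            (None, _) => true,
            (Some(_), None) => false,
            (Some(o), Some(s)) => s >= o,
        };
        let hi_ok = match (outer.hi, self.hi) {
            (None, _) => true,
            (Some(_), None) => false,
            (Some(o), Some(s)) => s <= o,
        };
        lo_ok && hi_ok
    }
}

/// A predicate modeled as an AND of per-column ranges.
type Conjunction = HashMap<&'static str, ColumnRange>;

/// The new predicate is acceptable if, for every column the previous
/// predicate constrained, the new one constrains it at least as tightly.
/// Extra constraints on other columns are allowed.
fn is_narrowing(previous: &Conjunction, new: &Conjunction) -> bool {
    previous
        .iter()
        .all(|(col, prev)| new.get(col).map_or(false, |r| r.within(prev)))
}
```

With the previous predicate `a < 42 AND b = 10`, this check accepts `a = 30 AND b = 10` and rejects `a < 42` alone, `a = 50 AND b = 10`, and `a < 42 AND b <= 10`, in line with the examples above.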
§Parameters
- existing_version - Table version the provided data was read from.
- existing_data - Existing processed scan metadata with all selection vectors applied.
- existing_predicate - The predicate used by the previous scan.
pub fn parallel_scan_metadata(
    &self,
    engine: Arc<dyn Engine>,
) -> DeltaResult<Phase1ScanMetadata>
Available on crate feature internal-api only.
Start parallel scan metadata processing for the table.
This method returns a Phase1ScanMetadata iterator that processes commits and
checkpoint manifests sequentially. After exhausting this iterator, call finish()
to determine if a distributed phase is needed.
§Example
let engine = Arc::new(DefaultEngineBuilder::new(Arc::new(LocalFileSystem::new())).build());
let table_root = Url::parse("file:///path/to/table")?;
// Build a snapshot
let snapshot = Snapshot::builder_for(table_root.clone())
    .at_version(5) // Optional: specify a time-travel version (default is latest version)
    .build(engine.as_ref())?;
let scan = snapshot.scan_builder().build()?;
let mut phase1 = scan.parallel_scan_metadata(engine.clone())?;
// Process sequential phase
for result in phase1.by_ref() {
    let scan_metadata = result?;
    // Process scan metadata...
}
// Check if distributed phase is needed
match phase1.finish()? {
    AfterPhase1ScanMetadata::Done(_) => {
        // All processing complete
    }
    AfterPhase1ScanMetadata::Parallel { processor, files } => {
        // Wrap processor in Arc for sharing across threads
        let processor = Arc::new(processor);
        // Distribute files for parallel processing (e.g., one file per worker)
        for file in files {
            let phase2 = Phase2ScanMetadata::try_new(
                engine.clone(),
                processor.clone(),
                vec![file],
            )?;
            for result in phase2 {
                let scan_metadata = result?;
                // Process scan metadata...
            }
        }
    }
}
pub fn execute(
    &self,
    engine: Arc<dyn Engine>,
) -> DeltaResult<impl Iterator<Item = DeltaResult<Box<dyn EngineData>>>>
Perform an “all in one” scan. This will use the provided engine to read and process all
the data for the query. Each EngineData in the resultant iterator is a portion of the
final table data. Generally connectors/engines will want to use Scan::scan_metadata so
they can have more control over the execution of the scan.
Trait Implementations§
Auto Trait Implementations§
impl Freeze for Scan
impl !RefUnwindSafe for Scan
impl Send for Scan
impl Sync for Scan
impl Unpin for Scan
impl UnsafeUnpin for Scan
impl !UnwindSafe for Scan
Blanket Implementations§
impl<T> AsAny for T
fn any_ref(&self) -> &(dyn Any + Sync + Send + 'static)
Obtains a dyn Any reference to the object.
fn as_any(self: Arc<T>) -> Arc<dyn Any + Sync + Send>
Obtains an Arc<dyn Any> reference to the object.
fn into_any(self: Box<T>) -> Box<dyn Any + Sync + Send>
Converts the object to Box<dyn Any>.
fn type_name(&self) -> &'static str
Convenient wrapper for std::any::type_name, since Any does not provide it and
Any::type_id is useless as a debugging aid (its Debug is just a mess of hex digits).
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
impl<T> PolicyExt for T
where
    T: ?Sized,
impl<KernelType, ArrowType> TryIntoArrow<ArrowType> for KernelType
where
    ArrowType: TryFromKernel<KernelType>,
fn try_into_arrow(self) -> Result<ArrowType, ArrowError>
Available on (crate features default-engine-native-tls or default-engine-rustls or arrow-conversion) and crate feature arrow-conversion only.
impl<KernelType, ArrowType> TryIntoKernel<KernelType> for ArrowType
where
    KernelType: TryFromArrow<ArrowType>,
fn try_into_kernel(self) -> Result<KernelType, ArrowError>
Available on (crate features default-engine-native-tls or default-engine-rustls or arrow-conversion) and crate feature arrow-conversion only.