Skip to main content

Scan

Struct Scan 

Source
pub struct Scan { /* private fields */ }
Expand description

The result of building a scan over a table. This can be used to get the actual data from scanning the table.

Implementations§

Source§

impl Scan

Source

pub fn table_root(&self) -> &Url

The table’s root URL. Any relative paths returned from scan_data (or in a callback from ScanMetadata::visit_scan_files) must be resolved against this root to get the actual path to the file.

Source

pub fn snapshot(&self) -> &SnapshotRef

Get a shared reference to the Snapshot of this scan.

Source

pub fn logical_schema(&self) -> &SchemaRef

Get a shared reference to the logical Schema of the scan (i.e. the output schema of the scan). Note that the logical schema can differ from the physical schema due to e.g. partition columns which are present in the logical schema but not in the physical schema.

Source

pub fn physical_schema(&self) -> &SchemaRef

Get a shared reference to the physical Schema of the scan. This represents the schema of the underlying data files which must be read from storage.

Source

pub fn physical_predicate(&self) -> Option<PredicateRef>

Get the predicate PredicateRef of the scan.

Source

pub fn logical_stats_schema(&self) -> Option<&SchemaRef>

Available on crate feature internal-api only.

Get the logical schema for file statistics.

When stats_columns is requested in a scan, the stats_parsed column in scan metadata contains file statistics read using physical column names (to handle column mapping). This method returns the corresponding logical schema that maps those physical column names back to the table’s logical column names, enabling engines to interpret the stats correctly.

Returns None if stats were not requested (i.e., stats_columns was not set in the scan).

Source

pub fn scan_metadata( &self, engine: &dyn Engine, ) -> DeltaResult<impl Iterator<Item = DeltaResult<ScanMetadata>>>

Get an iterator of ScanMetadatas that should be used to facilitate a scan. This handles log-replay, reconciling Add and Remove actions, and applying data skipping (if possible). Each item in the returned iterator is a struct of:

  • Box<dyn EngineData>: Data in engine format, where each row represents a file to be scanned. The schema for each row can be obtained by calling scan_row_schema.
  • Vec<bool>: A selection vector. If a row is at index i and this vector is false at index i, then that row should not be processed (i.e. it is filtered out). If the vector is true at index i the row should be processed. If the selection vector is shorter than the number of rows returned, missing elements are considered true, i.e. included in the query. NB: If you are using the default engine and plan to call arrow’s filter_record_batch, you need to extend this vector to the full length of the batch or arrow will drop the extra rows.
  • Vec<Option<Expression>>: Transformation expressions that need to be applied. For each row at index i in the above data, if an expression exists at index i in the Vec, the associated expression must be applied to the data read from the file specified by the row. The resultant schema for this expression is guaranteed to be Self::logical_schema(). If the item at index i in this Vec is None, or if the Vec contains fewer than i elements, no expression need be applied and the data read from disk is already in the correct logical state.
Source

pub fn scan_metadata_from( &self, engine: &dyn Engine, existing_version: Version, existing_data: impl IntoIterator<Item = Box<dyn EngineData>> + 'static, _existing_predicate: Option<PredicateRef>, ) -> DeltaResult<Box<dyn Iterator<Item = DeltaResult<ScanMetadata>>>>

Available on crate feature internal-api only.

Get an updated iterator of ScanMetadatas based on an existing iterator of EngineDatas.

The existing iterator is assumed to contain data from a previous call to scan_metadata. Engines may decide to cache the results of scan_metadata to avoid additional IO operations required to replay the log.

As such the new scan’s predicate must “contain” the previous scan’s predicate. That is, the new scan’s predicate MUST skip all files the previous scan’s predicate skipped. The new scan’s predicate is also allowed to skip files the previous predicate kept. For example, if the previous scan predicate was

WHERE a < 42 AND b = 10

then it is legal for the new scan to use predicates such as the following:

WHERE a = 30 AND b = 10
WHERE a < 10 AND b = 10
WHERE a < 42 AND b = 10 AND c = 20

but it is NOT legal for the new scan to use predicates like these:

WHERE a < 42
WHERE a = 50 AND b = 10
WHERE a < 42 AND b <= 10
WHERE a < 42 OR b = 10

The current implementation does not yet validate the existing predicate against the current predicate. Until this is implemented, the caller must ensure that the existing predicate is compatible with the current predicate.

§Parameters
  • existing_version - Table version the provided data was read from.
  • existing_data - Existing processed scan metadata with all selection vectors applied.
  • existing_predicate - The predicate used by the previous scan.
Source

pub fn parallel_scan_metadata( &self, engine: Arc<dyn Engine>, ) -> DeltaResult<Phase1ScanMetadata>

Available on crate feature internal-api only.

Start a parallel scan metadata processing for the table.

This method returns a Phase1ScanMetadata iterator that processes commits and checkpoint manifests sequentially. After exhausting this iterator, call finish() to determine if a distributed phase is needed.

§Example
let engine = Arc::new(DefaultEngineBuilder::new(Arc::new(LocalFileSystem::new())).build());
let table_root = Url::parse("file:///path/to/table")?;

// Build a snapshot
let snapshot = Snapshot::builder_for(table_root.clone())
    .at_version(5) // Optional: specify a time-travel version (default is latest version)
    .build(engine.as_ref())?;
let scan = snapshot.scan_builder().build()?;
let mut phase1 = scan.parallel_scan_metadata(engine.clone())?;

// Process sequential phase
for result in phase1.by_ref() {
    let scan_metadata = result?;
    // Process scan metadata...
}

// Check if distributed phase is needed
match phase1.finish()? {
    AfterPhase1ScanMetadata::Done(_) => {
        // All processing complete
    }
    AfterPhase1ScanMetadata::Parallel { processor, files } => {
        // Wrap processor in Arc for sharing across threads
        let processor = Arc::new(processor);
        // Distribute files for parallel processing (e.g., one file per worker)
        for file in files {
            let phase2 = Phase2ScanMetadata::try_new(
                engine.clone(),
                processor.clone(),
                vec![file],
            )?;
            for result in phase2 {
                let scan_metadata = result?;
                // Process scan metadata...
            }
        }
    }
}
Source

pub fn execute( &self, engine: Arc<dyn Engine>, ) -> DeltaResult<impl Iterator<Item = DeltaResult<Box<dyn EngineData>>>>

Perform an “all in one” scan. This will use the provided engine to read and process all the data for the query. Each EngineData in the resultant iterator is a portion of the final table data. Generally connectors/engines will want to use Scan::scan_metadata so they can have more control over the execution of the scan.

Trait Implementations§

Source§

impl Debug for Scan

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

Formats the value using the given formatter. Read more

Auto Trait Implementations§

§

impl Freeze for Scan

§

impl !RefUnwindSafe for Scan

§

impl Send for Scan

§

impl Sync for Scan

§

impl Unpin for Scan

§

impl UnsafeUnpin for Scan

§

impl !UnwindSafe for Scan

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> AsAny for T
where T: Any + Send + Sync,

Source§

fn any_ref(&self) -> &(dyn Any + Sync + Send + 'static)

Obtains a dyn Any reference to the object: Read more
Source§

fn as_any(self: Arc<T>) -> Arc<dyn Any + Sync + Send>

Obtains an Arc<dyn Any> reference to the object: Read more
Source§

fn into_any(self: Box<T>) -> Box<dyn Any + Sync + Send>

Converts the object to Box<dyn Any>: Read more
Source§

fn type_name(&self) -> &'static str

Convenient wrapper for std::any::type_name, since Any does not provide it and Any::type_id is useless as a debugging aid (its Debug is just a mess of hex digits).
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T> Instrument for T

Source§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper. Read more
Source§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T> PolicyExt for T
where T: ?Sized,

Source§

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow. Read more
Source§

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow. Read more
Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<KernelType, ArrowType> TryIntoArrow<ArrowType> for KernelType
where ArrowType: TryFromKernel<KernelType>,

Source§

fn try_into_arrow(self) -> Result<ArrowType, ArrowError>

Available on (crate features default-engine-native-tls or default-engine-rustls or arrow-conversion) and crate feature arrow-conversion only.
Source§

impl<KernelType, ArrowType> TryIntoKernel<KernelType> for ArrowType
where KernelType: TryFromArrow<ArrowType>,

Source§

fn try_into_kernel(self) -> Result<KernelType, ArrowError>

Available on (crate features default-engine-native-tls or default-engine-rustls or arrow-conversion) and crate feature arrow-conversion only.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T> WithSubscriber for T

Source§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper. Read more