Struct FileScanConfig

pub struct FileScanConfig {
    pub object_store_url: ObjectStoreUrl,
    pub file_groups: Vec<FileGroup>,
    pub constraints: Constraints,
    pub limit: Option<usize>,
    pub output_ordering: Vec<LexOrdering>,
    pub file_compression_type: FileCompressionType,
    pub file_source: Arc<dyn FileSource>,
    pub batch_size: Option<usize>,
    pub expr_adapter_factory: Option<Arc<dyn PhysicalExprAdapterFactory>>,
    pub partitioned_by_file_group: bool,
    /* private fields */
}

The base configuration for a DataSourceExec, the physical plan for any given file format.

Use DataSourceExec::from_data_source to create a DataSourceExec from a FileScanConfig.

Example

// create a FileScanConfig for reading parquet files from file://
// (`file_schema`, the Arrow schema of the files, is assumed to be defined above)
let object_store_url = ObjectStoreUrl::local_filesystem();
let file_source = Arc::new(ParquetSource::new(file_schema.clone()));
let config = FileScanConfigBuilder::new(object_store_url, file_source)
  .with_limit(Some(1000))                    // read only the first 1000 records
  .with_projection_indices(Some(vec![2, 3])) // project columns 2 and 3
  .expect("Failed to push down projection")
  // Read /tmp/file1.parquet with known size of 1234 bytes in a single group
  .with_file(PartitionedFile::new("file1.parquet", 1234))
  // Read /tmp/file2.parquet (56 bytes) and /tmp/file3.parquet (78 bytes)
  // in a single file group
  .with_file_group(FileGroup::new(vec![
    PartitionedFile::new("file2.parquet", 56),
    PartitionedFile::new("file3.parquet", 78),
  ]))
  .build();
// create an execution plan from the config
let plan: Arc<dyn ExecutionPlan> = DataSourceExec::from_data_source(config);

Fields

object_store_url: ObjectStoreUrl

Object store URL, used to get an ObjectStore instance from RuntimeEnv::object_store

This ObjectStoreUrl should be the prefix of the absolute URL for the files, such as file:// or s3://my_bucket. It should not include the path to the file itself. The relevant URL prefix must be registered via RuntimeEnv::register_object_store.
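
As a hedged sketch (the SessionContext route and the import paths are assumptions, not taken from this page), registering the local filesystem store for the file:// prefix; SessionContext::register_object_store forwards to RuntimeEnv::register_object_store:

use std::sync::Arc;
use datafusion::prelude::SessionContext;
use object_store::local::LocalFileSystem;
use url::Url;

let ctx = SessionContext::new();
// register a store for every URL beginning with file://
let url = Url::parse("file://").unwrap();
ctx.register_object_store(&url, Arc::new(LocalFileSystem::new()));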

file_groups: Vec<FileGroup>

List of files to be processed, grouped into partitions

Each file must have a schema of file_schema or a subset. If a particular file has a subset, the missing columns are padded with NULLs.

DataFusion may attempt to read each partition of files concurrently; however, files within a partition are read sequentially, one after another.
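
For example (a sketch built only from the types shown on this page), two file groups yield two scan partitions; the two files in the first group are read back to back, while the second group may be read concurrently:

let file_groups = vec![
    // partition 0: a.parquet then b.parquet, read sequentially
    FileGroup::new(vec![
        PartitionedFile::new("data/a.parquet", 1024),
        PartitionedFile::new("data/b.parquet", 2048),
    ]),
    // partition 1: may be read concurrently with partition 0
    FileGroup::new(vec![PartitionedFile::new("data/c.parquet", 512)]),
];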

constraints: Constraints

Table constraints

limit: Option<usize>

The maximum number of records to read from this plan. If None, all records after filtering are returned.

output_ordering: Vec<LexOrdering>

All equivalent lexicographical orderings that describe the schema.

file_compression_type: FileCompressionType

File compression type

file_source: Arc<dyn FileSource>

File source such as ParquetSource, CsvSource, JsonSource, etc.

batch_size: Option<usize>

Batch size to use when creating new record batches. Defaults to datafusion_common::config::ExecutionOptions::batch_size.

expr_adapter_factory: Option<Arc<dyn PhysicalExprAdapterFactory>>

Expression adapter used to adapt filters and projections that are pushed down into the scan from the logical schema to the physical schema of the file.

partitioned_by_file_group: bool

When true, file_groups are organized by partition column values and output_partitioning will return Hash partitioning on partition columns. This allows the optimizer to skip hash repartitioning for aggregates and joins on partition columns.

If the number of file partitions exceeds target_partitions, the file partitions are grouped in a round-robin fashion so that the number of file partitions equals target_partitions.
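
To illustrate the round-robin step (a standalone sketch, not DataFusion's implementation), five file partitions reduced to three target partitions become the groups {0, 3}, {1, 4}, {2}:

// Hypothetical helper: round-robin `parts` into `target` groups.
fn round_robin<T>(parts: Vec<T>, target: usize) -> Vec<Vec<T>> {
    let mut groups: Vec<Vec<T>> = (0..target).map(|_| Vec::new()).collect();
    for (i, p) in parts.into_iter().enumerate() {
        groups[i % target].push(p); // partition i lands in group i mod target
    }
    groups
}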

Implementations

impl FileScanConfig

pub fn file_schema(&self) -> &SchemaRef

Get the file schema (schema of the files without partition columns)

pub fn table_partition_cols(&self) -> &Vec<FieldRef>

Get the table partition columns

pub fn statistics(&self) -> Statistics

Returns the unprojected table statistics, marking them as inexact if filters are present.

When filters are pushed down (including pruning predicates and bloom filters), we can’t guarantee the statistics are exact because we don’t know how many rows will be filtered out.

pub fn projected_schema(&self) -> Result<Arc<Schema>>

pub fn newlines_in_values(&self) -> bool

👎Deprecated since 52.0.0: newlines_in_values has moved to CsvSource. Access it via CsvSource::csv_options().newlines_in_values instead. It will be removed in 58.0.0 or 6 months after 52.0.0 is released, whichever comes first.

Returns whether newlines in values are supported.

This method always returns false. The actual newlines_in_values setting has been moved to CsvSource and should be accessed via CsvSource::csv_options() instead.

pub fn projected_constraints(&self) -> Constraints

👎Deprecated since 52.0.0: This method is no longer used, use eq_properties instead. It will be removed in 58.0.0 or 6 months after 52.0.0 is released, whichever comes first.

pub fn file_column_projection_indices(&self) -> Option<Vec<usize>>

👎Deprecated since 52.0.0: This method is no longer used, use eq_properties instead. It will be removed in 58.0.0 or 6 months after 52.0.0 is released, whichever comes first.

pub fn split_groups_by_statistics_with_target_partitions(
    table_schema: &SchemaRef,
    file_groups: &[FileGroup],
    sort_order: &LexOrdering,
    target_partitions: usize,
) -> Result<Vec<FileGroup>>

Splits file groups into new groups based on statistics to enable efficient parallel processing.

The method distributes files across a target number of partitions while ensuring files within each partition maintain sort order based on their min/max statistics.

The algorithm works as follows (a sketch of this greedy placement appears after the Returns section below):

  1. Take the files sorted by their minimum values
  2. For each file:
  • Find the eligible groups (those that are empty, or where the file’s min > the group’s last max)
  • Select the smallest eligible group
  • Create a new group if none is eligible
Parameters
  • table_schema: Schema containing information about the columns
  • file_groups: The original file groups to split
  • sort_order: The lexicographical ordering to maintain within each group
  • target_partitions: The desired number of output partitions
Returns

A new set of file groups, where files within each group are non-overlapping with respect to their min/max statistics and maintain the specified sort order.
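
The greedy placement can be sketched over files reduced to their (min, max) statistics on the sort key; this only illustrates the algorithm described above and is not the actual DataFusion implementation:

// Each file is represented by its (min, max) range on the sort key.
fn split(mut files: Vec<(i64, i64)>, target: usize) -> Vec<Vec<(i64, i64)>> {
    // 1. Process files in order of their minimum value
    files.sort_by_key(|f| f.0);
    let mut groups: Vec<Vec<(i64, i64)>> = vec![Vec::new(); target];
    for file in files {
        // 2. A group is eligible if it is empty or its last file
        //    ends before this file begins (no min/max overlap)
        let slot = (0..groups.len())
            .filter(|&i| groups[i].last().map_or(true, |&(_, max)| file.0 > max))
            .min_by_key(|&i| groups[i].len()); // smallest eligible group
        match slot {
            Some(i) => groups[i].push(file),
            // 3. Overlaps everywhere: fall back to a fresh group
            None => groups.push(vec![file]),
        }
    }
    groups.retain(|g| !g.is_empty());
    groups
}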

pub fn split_groups_by_statistics(
    table_schema: &SchemaRef,
    file_groups: &[FileGroup],
    sort_order: &LexOrdering,
) -> Result<Vec<FileGroup>>

Attempts to do a bin-packing on files into file groups, such that any two files in a file group are ordered and non-overlapping with respect to their statistics. It will produce the smallest number of file groups possible.
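
A hedged call-site sketch, reusing the names from the signature above (table_schema, sort_order, and a config with populated file_groups are assumed to be in scope):

let regrouped: Vec<FileGroup> = FileScanConfig::split_groups_by_statistics(
    &table_schema,
    &config.file_groups,
    &sort_order,
)?;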

pub fn file_source(&self) -> &Arc<dyn FileSource>

Returns the file_source

Trait Implementations

impl Clone for FileScanConfig

fn clone(&self) -> FileScanConfig

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl DataSource for FileScanConfig

fn repartitioned(
    &self,
    target_partitions: usize,
    repartition_file_min_size: usize,
    output_ordering: Option<LexOrdering>,
) -> Result<Option<Arc<dyn DataSource>>>

If supported by the underlying FileSource, redistribute files across partitions according to their size.

fn output_partitioning(&self) -> Partitioning

Returns the output partitioning for this file scan.

When partitioned_by_file_group is true, this returns Partitioning::Hash on the Hive partition columns, allowing the optimizer to skip hash repartitioning for aggregates and joins on those columns.

Tradeoffs

  • Benefit: Eliminates RepartitionExec and SortExec for queries with GROUP BY or ORDER BY on partition columns.
  • Cost: Files are grouped by partition values rather than split by byte ranges, which may reduce I/O parallelism when partition sizes are uneven. For simple aggregations without ORDER BY, this cost may outweigh the benefit.

Follow-up Work

  • Idea: Could allow byte-range splitting within partition-aware groups, preserving I/O parallelism while maintaining partition semantics.

fn open(&self, partition: usize, context: Arc<TaskContext>) -> Result<SendableRecordBatchStream>

fn as_any(&self) -> &dyn Any

fn fmt_as(&self, t: DisplayFormatType, f: &mut Formatter<'_>) -> FmtResult

Format this source for display in explain plans

fn eq_properties(&self) -> EquivalenceProperties

fn scheduling_type(&self) -> SchedulingType

fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics>

Returns statistics for a specific partition, or aggregate statistics across all partitions if partition is None.

fn with_fetch(&self, limit: Option<usize>) -> Option<Arc<dyn DataSource>>

Return a copy of this DataSource with a new fetch limit

fn fetch(&self) -> Option<usize>

fn metrics(&self) -> ExecutionPlanMetricsSet

fn try_swapping_with_projection(&self, projection: &ProjectionExprs) -> Result<Option<Arc<dyn DataSource>>>

fn try_pushdown_filters(
    &self,
    filters: Vec<Arc<dyn PhysicalExpr>>,
    config: &ConfigOptions,
) -> Result<FilterPushdownPropagation<Arc<dyn DataSource>>>

Try to push down filters into this DataSource. See ExecutionPlan::handle_child_pushdown_result for more details.

fn try_pushdown_sort(&self, order: &[PhysicalSortExpr]) -> Result<SortOrderPushdownResult<Arc<dyn DataSource>>>

Try to create a new DataSource that produces data in the specified sort order.

fn statistics(&self) -> Result<Statistics>

👎Deprecated since 51.0.0: Use partition_statistics instead
Returns aggregate statistics across all partitions.

impl Debug for FileScanConfig

fn fmt(&self, f: &mut Formatter<'_>) -> FmtResult

Formats the value using the given formatter.

impl DisplayAs for FileScanConfig

fn fmt_as(&self, t: DisplayFormatType, f: &mut Formatter<'_>) -> FmtResult

Format according to DisplayFormatType, used when the verbose representation differs from the default one.

impl From<FileScanConfig> for FileScanConfigBuilder

fn from(config: FileScanConfig) -> Self

Converts to this type from the input type.
