pub struct DatasetConfig {
    pub name: String,
    pub source: SourceConfig,
    pub s3: Option<S3Config>,
    pub index: IndexConfig,
    pub columns: Vec<String>,
    pub dict_encode: bool,
    pub lazy: bool,
}
Fields
name: String
source: SourceConfig
s3: Option<S3Config>
index: IndexConfig
columns: Vec<String>
Optional column projection applied at load time. When non-empty, only the listed columns are read from the parquet/delta source; every other column is skipped entirely (no decode, no allocation, no resident memory). Empty (default) = read all columns. Names are matched case-insensitively against the source schema.
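The projection rule for columns can be illustrated with a small stand-alone sketch. The function name projected_indices is hypothetical and not part of this crate. The sketch also assumes ASCII case folding, schema order for the result, and that requested names missing from the schema are silently ignored; the documentation above does not specify any of these.

```rust
/// Hypothetical helper: returns the indices of the schema columns to read.
/// An empty `columns` list means "read everything".
fn projected_indices(schema: &[&str], columns: &[&str]) -> Vec<usize> {
    if columns.is_empty() {
        return (0..schema.len()).collect();
    }
    schema
        .iter()
        .enumerate()
        // Keep a schema column when any requested name matches it,
        // ignoring ASCII case (assumption: ASCII folding is sufficient).
        .filter(|(_, name)| columns.iter().any(|c| c.eq_ignore_ascii_case(name)))
        .map(|(i, _)| i)
        .collect()
}
```

With a schema of id, Region, amount and columns set to region and AMOUNT, this sketch selects the second and third columns.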
dict_encode: bool
When true (default), Utf8 columns that are dictionary-encoded in
the source parquet are read as Arrow Dictionary(Int32, Utf8)
instead of being expanded to plain Utf8. This uses far less RAM
for low-cardinality columns. Set to false to bypass the override,
which is useful as a workaround if you observe null-handling
oddities on a particular parquet file.
lazy: bool
When true, the backend should keep the dataset on disk and stream
it at query time instead of materialising it into RAM at startup.
Trades the in-memory hot paths (raw Arrow slice, equality index)
for bounded memory use on large / multi-file sources. Honoured by
the DataFusion backend (local + S3 parquet) and by the DuckDB
backend, which registers the dataset as a view over the source scan
(local + S3 parquet, and delta) rather than materialising a table.
Implementations
impl DatasetConfig
pub fn resolve_local_parquet_files(&self) -> Result<Vec<PathBuf>, AppError>
Expand source.location to a concrete list of local .parquet
files. Only valid for kind = parquet on local paths — S3 and
Delta sources are resolved by the backend itself.
Accepts three location shapes:
- a single *.parquet file
- a directory (lists every *.parquet directly inside, non-recursive)
- a glob pattern containing *, ? or […] (e.g. data/year=2024/*.parquet, data/**/*.parquet)
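The three location shapes can be told apart with a check along these lines. This is a minimal sketch, not the crate's implementation: LocationShape and classify_location are hypothetical names, and the order of the checks (glob characters first, then a directory probe) is an assumption.

```rust
use std::path::Path;

/// Hypothetical classification of `source.location`.
#[derive(Debug, PartialEq)]
enum LocationShape {
    SingleFile,
    Directory,
    Glob,
}

fn classify_location(location: &str) -> LocationShape {
    // Any of `*`, `?` or `[` marks the location as a glob pattern.
    if location.contains(|c: char| matches!(c, '*' | '?' | '[')) {
        LocationShape::Glob
    } else if Path::new(location).is_dir() {
        // An existing directory: its direct *.parquet children would be listed.
        LocationShape::Directory
    } else {
        // Everything else is treated as a single file path.
        LocationShape::SingleFile
    }
}
```

A real resolver would still need to expand the glob or list the directory; this sketch only shows the dispatch step.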
pub fn estimate_local_bytes(&self) -> Option<u64>
Estimate the on-disk byte size of this dataset’s local backing
files. Returns None for S3 sources (sizing would require a
network round-trip) or when nothing can be measured.
- parquet sums the resolved .parquet files (single file, directory, or glob).
- delta sums every *.parquet data file under the table root. This slightly over-counts when stale files haven’t been vacuumed, which is fine for a coarse force-lazy threshold.
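The summing step described above might look like the following sketch. The helper name sum_file_bytes is hypothetical. It assumes that unreadable files are skipped and that "nothing can be measured" means no file yielded metadata.

```rust
use std::fs;
use std::path::PathBuf;

/// Hypothetical helper: total on-disk size of the given files,
/// or `None` when no file could be measured.
fn sum_file_bytes(files: &[PathBuf]) -> Option<u64> {
    let mut total: u64 = 0;
    let mut measured = false;
    for path in files {
        // Assumption: files whose metadata cannot be read are skipped.
        if let Ok(meta) = fs::metadata(path) {
            total = total.saturating_add(meta.len());
            measured = true;
        }
    }
    if measured {
        Some(total)
    } else {
        None
    }
}
```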
pub fn force_lazy_bytes(&self, server: &ServerConfig) -> Option<u64>
Decide whether this dataset should be forced into lazy mode given
the server’s force_lazy_above_mb threshold. Returns Some(bytes)
(the measured size) when it should be forced, so the caller can log
it. Returns None when the dataset is already lazy, the threshold
is disabled, the source is S3, or the measured size is unknown or at
or below the threshold.
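The decision table above can be written as a pure function. This is a sketch under stated assumptions: the free function and its parameters are hypothetical stand-ins for the dataset and server state, a disabled threshold is modelled as None, and one megabyte is taken to be 1024 * 1024 bytes. None of these details are confirmed by the documentation.

```rust
/// Hypothetical stand-alone version of the force-lazy decision.
fn force_lazy_bytes(
    already_lazy: bool,
    is_s3: bool,
    threshold_mb: Option<u64>,   // assumption: `None` means the threshold is disabled
    measured_bytes: Option<u64>, // `None` when the size is unknown
) -> Option<u64> {
    if already_lazy || is_s3 {
        return None;
    }
    // Assumption: 1 MB = 1024 * 1024 bytes.
    let threshold_bytes = threshold_mb?.saturating_mul(1024 * 1024);
    // Force only when the measured size is strictly above the threshold.
    measured_bytes.filter(|&bytes| bytes > threshold_bytes)
}
```

Returning the measured size rather than a bool lets the caller log how large the dataset was when it was forced into lazy mode.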
pub fn env_prefix(&self) -> String
Env-var prefix derived from the dataset name: uppercase with
non-alphanumeric chars replaced by _. E.g. sales.eu-1 →
SALES_EU_1.
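The naming rule can be reproduced with a few lines of standard-library code. This is a sketch written as a free function over the name rather than the actual method, and it assumes "alphanumeric" means ASCII letters and digits.

```rust
/// Hypothetical stand-alone version of the rule: uppercase the name and
/// replace every non-alphanumeric character with `_`.
fn env_prefix(name: &str) -> String {
    name.chars()
        .map(|c| {
            // Assumption: only ASCII letters and digits are kept.
            if c.is_ascii_alphanumeric() {
                c.to_ascii_uppercase()
            } else {
                '_'
            }
        })
        .collect()
}
```

Under these assumptions, env_prefix("sales.eu-1") yields SALES_EU_1, matching the documented example.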
pub fn resolved_creds(&self) -> ResolvedCreds
Resolve S3 credentials following the precedence chain documented at the top of this module. Returns an empty struct when nothing was found — the caller should then leave credential resolution to the engine’s default provider chain.
pub fn resolved_region(&self) -> String
Resolved S3 region: per-dataset env (${PREFIX}_AWS_REGION)
→ inline → AWS_REGION → AWS_DEFAULT_REGION → us-east-1.
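The precedence chain reads naturally as a sequence of fallbacks. In the sketch below, resolve_region and its parameters are hypothetical: the environment is passed in as a lookup closure so the chain can be exercised without touching the real process environment, and inline stands for a region set directly in the dataset's S3 config. Whether an empty variable counts as unset is not specified above; this sketch treats any present value as set.

```rust
/// Hypothetical stand-alone version of the region precedence chain.
fn resolve_region(
    prefix: &str,                         // e.g. "SALES_EU_1"
    inline: Option<&str>,                 // region from the config, if any
    env: impl Fn(&str) -> Option<String>, // environment lookup
) -> String {
    env(&format!("{}_AWS_REGION", prefix))
        .or_else(|| inline.map(str::to_owned))
        .or_else(|| env("AWS_REGION"))
        .or_else(|| env("AWS_DEFAULT_REGION"))
        .unwrap_or_else(|| "us-east-1".to_owned())
}
```

In production code the closure could simply be `|key| std::env::var(key).ok()`.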
Trait Implementations
impl Clone for DatasetConfig
fn clone(&self) -> DatasetConfig
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.