etl-unit 0.1.0

Semantic data model for ETL units — qualities and measurements over subjects and time. Built on Polars.

etl-unit separates the logical schema (what your data means) from the physical data (what's in the parquet/CSV columns). You declare the schema once, point it at any source whose physical columns can be mapped to your canonical names, and the same downstream code (subset requests, derivations, signal policies, time-window resampling) works unchanged.

At a glance

EtlSchema  +  BoundSource(s)  ──►  Universe  ──►  SubsetRequest  ──►  DataFrame
  (logical)     (physical)        (resolved)       (filter/query)    (results)

What's in the box

  • EtlSchema — declarative schema: subjects, qualities, measurements, derivations. Build programmatically via EtlSchema::new(...) or load from a TOML file via from_toml_file().
  • MeasurementUnit — a time-varying observation. Carries a SignalPolicy (how raw samples become a stable signal: instant vs. sliding-window) and a sample_rate (native cadence).
  • QualityUnit — static metadata about a subject (1:1 with subject).
  • Derivation — shape-preserving derived measurements. Three axes:
    • Pointwise — combine multiple units on the same (subject, time) row (e.g. any_on, count_non_zero, sum, ratio).
    • OverTime — time-axis transforms within a subject (Derivative, RollingMean, Lag, Lead).
    • OverSubjects — cross-subject transforms at a single time point (Rank, Quantile/Deciles, ZScore).
  • BoundSource — physical→canonical column binding. Supports direct mapping, computed columns, and unpivoting wide data into long form.
  • Universe — the resolved data layer: all subjects, all measurements, per-source binding rules baked in. Cheap to query repeatedly.
  • EtlUnitSubsetRequest — declarative query: filter subjects, filter qualities, pick measurements, time range, optional interval-bucketed reporting.
  • SubsetUniverse — the materialized result: a Polars DataFrame plus typed metadata (per-measurement stats, stage trace, interval stats, …).
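The derivation axes are easiest to see with the pointwise case. Below is a minimal, stdlib-only sketch of what an any_on derivation computes on rows already aligned on (subject, time) — illustrative semantics only, not the crate's actual internals (in etl-unit the work happens in Polars expressions):

```rust
/// Illustrative only: an `any_on` pointwise derivation over null-able
/// 0/1-encoded categorical columns that share one (subject, time) axis.
fn any_on(inputs: &[Vec<Option<i64>>]) -> Vec<Option<i64>> {
    let n_rows = inputs.first().map_or(0, |c| c.len());
    (0..n_rows)
        .map(|row| {
            let mut saw_value = false;
            for col in inputs {
                match col[row] {
                    Some(v) if v != 0 => return Some(1), // any "on" wins
                    Some(_) => saw_value = true,
                    None => {}
                }
            }
            if saw_value { Some(0) } else { None } // all-null row stays null
        })
        .collect()
}

fn main() {
    let engine_1 = vec![Some(0), Some(1), None];
    let engine_2 = vec![Some(0), None, None];
    println!("{:?}", any_on(&[engine_1, engine_2])); // [Some(0), Some(1), None]
}
```

The same shape-preserving idea carries over to the other axes: OverTime walks the time axis within one subject, OverSubjects walks subjects at one time point.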

Quick start

# Cargo.toml
[dependencies]
etl-unit = "0.1"
polars   = { version = "0.51", features = ["lazy", "dtype-datetime", "temporal"] }

Build a schema programmatically

use etl_unit::{EtlSchema, MeasurementKind};

let schema = EtlSchema::new("pump_station")
    .subject("station_id")
    .time("observation_time")
    .measurement_with_defaults("sump",     MeasurementKind::Measure)
    .measurement_with_defaults("fuel",     MeasurementKind::Measure)
    .measurement_with_defaults("engine_1", MeasurementKind::Categorical)
    .build()?;
# Ok::<(), Box<dyn std::error::Error>>(())

measurement_with_defaults sets a 60-second instant signal policy and a 60-second sample rate. For production, set them explicitly via .with_policy(...) and .with_sample_rate(...) so you don't hide configuration mistakes behind defaults.

Load a schema from TOML

use etl_unit::EtlSchema;

let schema = EtlSchema::from_toml_file("pump_station.toml")?;
# Ok::<(), Box<dyn std::error::Error>>(())

# pump_station.toml
name    = "pump_station"
subject = "station_id"
time    = "observation_time"

[measurements.sump]
kind         = "measure"
sample_rate  = "60s"
signal_policy = { max_staleness = "60s", windowing = { type = "instant" } }

[measurements.fuel]
kind         = "measure"
sample_rate  = "60s"
signal_policy = { max_staleness = "60s", windowing = { type = "instant" } }

[derivations.any_engine_running]
kind = "categorical"
computation.pointwise = { type = "any_on", inputs = ["engine_1", "engine_2"] }

Measurements, qualities, and derivations are keyed by canonical name ([measurements.sump], not [[measurements]]).

TOML is a convenience layer

from_toml_file is a thin loader that uses the same EtlSchema types underneath. For richer config (sources, chart hints, partitioning, upstream fetchers, intent examples, …) you'll typically want a config-layer crate above etl-unit that owns the TOML shape and translates it into EtlSchema via the builder. A planned companion crate, etl-unit-pipeline, covers exactly that path.

Signal policies

Every measurement carries a SignalPolicy that tells the runtime how to turn raw, irregularly-timed samples into a stable signal:

  • Instant — keep the most recent value within max_staleness; emit null if no sample arrives in that window.
  • Sliding — apply an aggregation (mean / max / min / sum / first / last) over a rolling window of duration D requiring at least min_samples.
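A sliding policy in TOML would mirror the instant examples above. The exact field names below (duration, aggregation, min_samples) are an assumption extrapolated from the prose, not confirmed against the crate:

```toml
[measurements.fuel]
kind        = "measure"
sample_rate = "60s"
# Assumed field names: a 5-minute rolling mean requiring at least 3 samples.
signal_policy = { max_staleness = "60s", windowing = { type = "sliding", duration = "300s", aggregation = "mean", min_samples = 3 } }
```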

Pair this with sample_rate (the measurement's native cadence) and upsample / downsample strategies on the source, and the runtime will align everything onto a regular time grid.
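How an instant policy resolves onto a regular grid can be sketched in plain stdlib Rust (timestamps as seconds). This mirrors the semantics described above, not the crate's implementation:

```rust
/// Illustrative only: resolve an instant signal policy onto a regular grid.
/// For each grid tick, emit the most recent raw sample whose age is within
/// `max_staleness` seconds; otherwise emit None (a null in the DataFrame).
fn instant_on_grid(
    samples: &[(i64, f64)], // (timestamp_secs, value), sorted by timestamp
    grid_start: i64,
    grid_step: i64,
    grid_len: usize,
    max_staleness: i64,
) -> Vec<Option<f64>> {
    (0..grid_len)
        .map(|i| {
            let tick = grid_start + i as i64 * grid_step;
            samples
                .iter()
                .rev() // newest first
                .find(|(t, _)| *t <= tick && tick - *t <= max_staleness)
                .map(|(_, v)| *v)
        })
        .collect()
}

fn main() {
    // Irregular raw samples aligned onto a 60 s grid with 60 s max_staleness.
    let samples = [(0, 1.0), (70, 2.0), (250, 3.0)];
    let signal = instant_on_grid(&samples, 0, 60, 5, 60);
    println!("{:?}", signal); // [Some(1.0), Some(1.0), Some(2.0), None, None]
}
```

Note the nulls at ticks 180 and 240: no sample arrived within the staleness window, so the signal goes dark rather than carrying a stale value forward.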

Composition

Multiple BoundSources under one EtlSchema build a single Universe. Sources with different sampling rates, different physical column names, even different parquet partition layouts compose into one logical view keyed on (subject, time). The composition is declarative — bring your own sources; the universe is resolved when you build it.
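The keying can be pictured as an outer alignment on (subject, time). A stdlib sketch of the idea — not the crate's mechanics, which run through Polars joins:

```rust
use std::collections::BTreeMap;

/// Illustrative only: align two sources' measurements onto one logical view
/// keyed on (subject, time). Missing cells become None, as in an outer join.
fn compose(
    a: &BTreeMap<(String, i64), f64>, // e.g. "sump" readings
    b: &BTreeMap<(String, i64), f64>, // e.g. "fuel" readings
) -> BTreeMap<(String, i64), (Option<f64>, Option<f64>)> {
    let mut out = BTreeMap::new();
    for key in a.keys().chain(b.keys()) {
        out.entry(key.clone())
            .or_insert((a.get(key).copied(), b.get(key).copied()));
    }
    out
}

fn main() {
    let mut sump = BTreeMap::new();
    sump.insert(("s1".to_string(), 0), 1.2);
    let mut fuel = BTreeMap::new();
    fuel.insert(("s1".to_string(), 0), 0.5);
    fuel.insert(("s1".to_string(), 60), 0.4);
    let view = compose(&sump, &fuel);
    // ("s1", 60) has fuel but no sump reading -> (None, Some(0.4))
    println!("{:?}", view[&("s1".to_string(), 60)]);
}
```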

What this is not

  • Not a query planner / SQL replacement — it's a thin semantic layer on Polars. The heavy lifting (LazyFrame, parallel execution, columnar arithmetic) is all Polars.
  • Not a storage layer — bring your own parquet/CSV/in-memory data.
  • Not async — the core is sync. (Source acquisition layers above etl-unit typically are async.)
  • Not a chart renderer — ChartHints is metadata for downstream renderers; the renderer itself is out of scope.

License

Dual-licensed under either of:

  • Apache License, Version 2.0
  • MIT License

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual-licensed as above, without any additional terms or conditions.