etl-unit 0.1.0 - Docs.rs

# etl-unit

[![crates.io](https://img.shields.io/crates/v/etl-unit.svg)](https://crates.io/crates/etl-unit)
[![docs.rs](https://docs.rs/etl-unit/badge.svg)](https://docs.rs/etl-unit)
[![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-blue.svg)](#license)

A semantic data model for ETL — `qualities` and `measurements` over
subjects and time, built on [Polars](https://crates.io/crates/polars).

`etl-unit` separates the **logical schema** (what your data *means*) from
the **physical data** (what's in the parquet/CSV columns). You declare
the schema once, point it at any source whose physical columns can be
mapped to your canonical names, and the same downstream code (subset
requests, derivations, signal policies, time-window resampling) works
unchanged.

## At a glance

```text
EtlSchema  +  BoundSource(s)  ──►  Universe  ──►  SubsetRequest  ──►  DataFrame
  (logical)     (physical)        (resolved)       (filter/query)    (results)
```

## What's in the box

- **`EtlSchema`** — declarative schema: subjects, qualities, measurements,
  derivations. Build programmatically via `EtlSchema::new(...)` or load
  from a TOML file via `from_toml_file()`.
- **`MeasurementUnit`** — a time-varying observation. Carries a
  **`SignalPolicy`** (how raw samples become a stable signal: instant
  vs. sliding-window) and a **`sample_rate`** (native cadence).
- **`QualityUnit`** — static metadata about a subject (1:1 with subject).
- **`Derivation`** — shape-preserving derived measurements. Three axes:
  - `Pointwise` — combine multiple units on the same `(subject, time)` row
    (e.g. `any_on`, `count_non_zero`, sum, ratio).
  - `OverTime` — time-axis transforms within a subject (`Derivative`,
    `RollingMean`, `Lag`, `Lead`).
  - `OverSubjects` — cross-subject transforms at a single time point
    (`Rank`, `Quantile`/Deciles, `ZScore`).
- **`BoundSource`** — physical→canonical column binding. Supports direct
  mapping, computed columns, and unpivoting wide data into long form.
- **`Universe`** — the resolved data layer: all subjects, all measurements,
  per-source binding rules baked in. Cheap to query repeatedly.
- **`EtlUnitSubsetRequest`** — declarative query: filter subjects,
  filter qualities, pick measurements, time range, optional
  interval-bucketed reporting.
- **`SubsetUniverse`** — the materialized result: a Polars `DataFrame`
  plus typed metadata (per-measurement stats, stage trace, interval
  stats, …).

## Quick start

```toml
# Cargo.toml
[dependencies]
etl-unit = "0.1"
polars   = { version = "0.51", features = ["lazy", "dtype-datetime", "temporal"] }
```

### Build a schema programmatically

```rust
use etl_unit::{EtlSchema, MeasurementKind};

let schema = EtlSchema::new("pump_station")
    .subject("station_id")
    .time("observation_time")
    .measurement_with_defaults("sump",     MeasurementKind::Measure)
    .measurement_with_defaults("fuel",     MeasurementKind::Measure)
    .measurement_with_defaults("engine_1", MeasurementKind::Categorical)
    .build()?;
# Ok::<(), Box<dyn std::error::Error>>(())
```

`measurement_with_defaults` sets a 60-second instant signal policy and a
60-second sample rate. For production, set them explicitly via
`.with_policy(...)` and `.with_sample_rate(...)` so you don't hide
configuration mistakes behind defaults.

### Load a schema from TOML

```rust,no_run
use etl_unit::EtlSchema;

let schema = EtlSchema::from_toml_file("pump_station.toml")?;
# Ok::<(), Box<dyn std::error::Error>>(())
```

```toml
# pump_station.toml
name    = "pump_station"
subject = "station_id"
time    = "observation_time"

[measurements.sump]
kind         = "measure"
sample_rate  = "60s"
signal_policy = { max_staleness = "60s", windowing = { type = "instant" } }

[measurements.fuel]
kind         = "measure"
sample_rate  = "60s"
signal_policy = { max_staleness = "60s", windowing = { type = "instant" } }

[derivations.any_engine_running]
kind = "categorical"
computation.pointwise = { type = "any_on", inputs = ["engine_1", "engine_2"] }
```

Measurements, qualities, and derivations are keyed by canonical name
(`[measurements.sump]`, not `[[measurements]]`).

### TOML is a convenience layer

`from_toml_file` is a thin loader that uses the same `EtlSchema` types
underneath. For richer config (sources, chart hints, partitioning,
upstream fetchers, intent examples, …) you'll typically want a
config-layer crate above `etl-unit` that owns the TOML shape and
translates it into `EtlSchema` via the builder. The published
`etl-unit-pipeline` crate (planned) covers exactly that path.

## Signal policies

Every measurement carries a `SignalPolicy` that tells the runtime how to
turn raw, irregularly-timed samples into a stable signal:

- **Instant** — keep the most recent value within `max_staleness`; emit
  `null` if no sample arrives in that window.
- **Sliding** — apply an aggregation (mean / max / min / sum / first /
  last) over a rolling window of duration `D` requiring at least
  `min_samples`.

Pair this with `sample_rate` (the measurement's native cadence) and
`upsample` / `downsample` strategies on the source, and the runtime will
align everything onto a regular time grid.

## Composition

Multiple `BoundSource`s under one `EtlSchema` build a single `Universe`.
Sources with different sampling rates, different physical column names,
even different parquet partition layouts compose into one logical view
keyed on `(subject, time)`. The composition is declarative — bring your
own sources, the universe is resolved when you build it.

## What this is *not*

- Not a query planner / SQL replacement — it's a thin semantic layer on
  Polars. The heavy lifting (LazyFrame, parallel execution, columnar
  arithmetic) is all Polars.
- Not a storage layer — bring your own parquet/CSV/in-memory data.
- Not async — the core is sync. (Source acquisition layers above
  `etl-unit` typically *are* async.)
- Not a chart renderer — `ChartHints` is metadata for downstream
  renderers; the renderer itself is out of scope.

## License

Dual-licensed under either of:

- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE))
- MIT license ([LICENSE-MIT](LICENSE-MIT))

at your option.

### Contribution

Unless you explicitly state otherwise, any contribution intentionally
submitted for inclusion in the work by you, as defined in the Apache-2.0
license, shall be dual-licensed as above, without any additional terms
or conditions.