pondrs 0.3.0 - Docs.rs

# Partitioned Dataset

`PartitionedDataset` represents a directory of files where each file is treated as a separate partition.

*Requires the `std` feature.*

## Definition

```rust,ignore
pub struct PartitionedDataset<D: FileDataset> {
    pub path: String,
    pub ext: String,
    pub dataset: D,
}
```

- **`path`** — the directory to read from / write to
- **`ext`** — file extension to filter by (e.g. `"csv"`, `"parquet"`, `"txt"`)
- **`dataset`** — a template dataset that is cloned and pointed at each file

## Loading

Returns `HashMap<String, D::LoadItem>` where keys are filename stems:

```text
data/partitions/
  january.csv
  february.csv
  march.csv
```

```rust,ignore
// loads as HashMap { "february" => ..., "january" => ..., "march" => ... }
```

The template dataset is cloned for each file, its path is set to the full file path, and `load()` is called on the clone.

## Saving

Accepts `HashMap<String, D::SaveItem>` and writes each entry as `{path}/{name}.{ext}`. Parent directories are created automatically.

When the inner dataset's `prefer_parallel()` returns `true` and the pipeline is running inside a rayon thread pool (e.g. via `ParallelRunner`), partition saves are distributed across threads. This is the default behavior for [`LazyDataset`](./lazy.md) wrappers.

```rust,ignore
Node {
    name: "split_by_month",
    func: |df: DataFrame| -> (HashMap<String, DataFrame>,) {
        // split DataFrame into partitions...
    },
    input: (&cat.all_data,),
    output: (&cat.monthly,),  // PartitionedDataset<PolarsCsvDataset>
}
```

## `FileDataset` requirement

The inner dataset type must implement `FileDataset`:

```rust,ignore
pub trait FileDataset: Dataset + Clone {
    fn path(&self) -> &str;
    fn set_path(&mut self, path: &str);
    fn prefer_parallel(&self) -> bool { false }
    fn ensure_parent_dir(&self) -> Result<(), std::io::Error> { ... }
    fn list_entries(&self, path: &str, ext: &str) -> Result<Vec<String>, PondError> { ... }
}
```

`list_entries` scans the directory for files matching `ext` and returns their stems, sorted. You can override it for non-filesystem storage.

Built-in types that implement `FileDataset`: `PolarsCsvDataset`, `PolarsParquetDataset`, `TextDataset`, `JsonDataset`, `ImageDataset`, `LazyDataset<D>` (for any `D: FileDataset`).

## YAML configuration

```yaml
monthly:
  path: data/partitions
  ext: csv
  dataset:
    separator: ","
    has_header: true
```

The `dataset` field configures the template dataset that is cloned for each partition file.

## Lazy partitions

For deferred, parallel partition processing, wrap the inner dataset in `LazyDataset`. See [Lazy Dataset](./lazy.md) for details on `LazyPartitionedDataset` and `PartitionedNode`.