
# Floe

Technical ingestion on a single node, driven by YAML contracts.
Floe is a Rust + Polars tool for technical ingestion on a single node. It ingests raw files into typed datasets using YAML contracts, applying schema enforcement and simple data quality rules with clear, auditable outputs.
## What Floe solves

- Schema enforcement and type casting (`strict` vs `coerce`)
- Nullability checks (`not_null`)
- Uniqueness checks (`unique`)
- Policy behavior: `warn`/`reject`/`abort`
- Accepted vs rejected outputs for clean separation
- JSON run reports for observability and audit
## Why Polars + Rust
- Polars provides fast, columnar execution on a single node without JVM overhead.
- Rust gives predictable performance and low-level control while keeping memory usage tight.
- The combo fits contract-driven ingestion: schema checks, deterministic outputs, and stable reports.
## Minimal config example
```yaml
version: "0.1"

report:
  path: "./reports"

entities:
  - name: "customer"
    source:
      format: "csv"
      path: "./example/in/customer"
    sink:
      accepted:
        format: "parquet"
        path: "./example/out/accepted/customer"
      rejected:
        format: "csv"
        path: "./example/out/rejected/customer"
    policy:
      severity: "reject"
    schema:
      columns:
        - name: "customer_id"
          type: "string"
          nullable: false
          unique: true
        - name: "created_at"
          type: "datetime"
          nullable: true
```
Full example: example/config.yml
Config reference: docs/config.md
## Quickstart (Homebrew)

Install the CLI, validate your config, then run the ingestion, as sketched below.
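The exact invocation lives in docs/cli.md; this sketch assumes the Homebrew formula is named `floe` and that `validate` and `run` subcommands take a config path:

```bash
# Install the prebuilt binary (formula name assumed)
brew install floe

# Check the YAML contract before running (subcommand and flag assumed)
floe validate --config example/config.yml

# Ingest according to the contract
floe run --config example/config.yml
```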
### Troubleshooting

If Homebrew is unavailable:

- GitHub Releases: download the prebuilt binary from the latest release
- Cargo: `cargo install floe-cli`
More CLI details: docs/cli.md
## Sample console output

```text
run id: run-123
report base: ./reports
==> entity customer (severity=reject, format=csv)
REJECTED customers.csv rows=10 accepted=8 rejected=2 elapsed_ms=12 accepted_out=customers.parquet rejected_out=customers_rejected.csv
Totals: files=1 rows=10 accepted=8 rejected=2
Overall: rejected (exit_code=0)
Run summary: ./reports/run_run-123/run.summary.json
```
## Outputs explained

- Accepted output: `entities[].sink.accepted.path`
- Rejected output: `entities[].sink.rejected.path`
- Reports: `<report.path>/run_<run_id>/<entity.name>/run.json`

Reports include per-entity JSON, a run summary, and key counters (rows, accepted/rejected, errors).
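As an illustration of those counters, a per-entity `run.json` for the sample run above might look like this; the field names here are assumptions, not the tool's exact schema (see docs/report.md):

```json
{
  "run_id": "run-123",
  "entity": "customer",
  "severity": "reject",
  "rows": 10,
  "accepted": 8,
  "rejected": 2,
  "errors": [
    { "column": "customer_id", "check": "not_null", "rejected_rows": 2 }
  ]
}
```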
Report details: docs/report.md
## Severity policy

- `warn`: keep all rows and report violations
- `reject`: reject only rows with violations; keep valid rows
- `abort`: reject the entire file on first violation
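Applied to the sample run above (10 rows, 2 violating): `warn` accepts all 10 and reports the 2 violations, `reject` accepts 8 and rejects 2, and `abort` rejects the whole file. Severity is the per-entity `policy.severity` field from the minimal config; for example, to keep every row while still reporting violations:

```yaml
policy:
  severity: "warn"
```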
Checks and policy details: docs/checks.md
## Supported formats
Inputs:
- CSV (local and S3)
- Parquet (local and S3)
- NDJSON (local and S3)
Outputs:
- Accepted: Parquet, Delta
- Rejected: CSV
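For instance, switching the accepted sink from Parquet to Delta should only require changing the sink's format; a sketch based on the minimal config above, assuming the format identifier is `delta` (Delta-specific options live in docs/sinks/delta.md):

```yaml
sink:
  accepted:
    format: "delta"
    path: "./example/out/accepted/customer"
  rejected:
    format: "csv"
    path: "./example/out/rejected/customer"
```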
Sink details:
- Options: docs/sinks/options.md
- Delta: docs/sinks/delta.md
- Iceberg: docs/sinks/iceberg.md
## Cloud integration and storages

Floe resolves all paths through a storage registry in the config. By default, paths use `local://`. To use cloud storage, define a storage (with credentials or bucket info) and reference it on the source/sink. Currently only S3 is implemented; Google Cloud Storage, Azure Data Lake Storage, and `dbfs://` (Databricks) are on the roadmap.
Example (S3 storage):

```yaml
storages:
  default: local
  definitions:
    - name: local
      type: local
    - name: s3_raw
      type: s3
      bucket: my-bucket
      region: eu-west-1
      # credentials via standard AWS env vars or profile

entities:
  - name: customer
    source:
      storage: s3_raw
      path: raw/customer/
```
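Sinks take the same `storage` reference as sources; a sketch of an accepted sink under the same entity, where the `curated/customer/` prefix is hypothetical:

```yaml
sink:
  accepted:
    storage: s3_raw
    format: "parquet"
    path: curated/customer/
```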
Storage guide: docs/storages/s3.md
## Roadmap (near term)
- Cloud integration for storage and compute
- Python library release
- Orchestrator integrations (Airflow, Dagster)
- More input/output formats, including database sources and sinks
- Data platform integrations (Databricks, Microsoft Fabric, Snowflake)
Feature tracking: docs/features.md
## License
MIT