
# Floe

Technical ingestion on a single node, driven by YAML contracts.
Floe is a Rust + Polars tool for technical ingestion on a single node. It ingests raw files into typed datasets using YAML contracts, applies schema enforcement and data quality rules, and produces clear, auditable outputs.
Start here: docs/summary.md
## What Floe solves
- Schema enforcement and type casting (`strict` vs `coerce`)
- Nullability checks (`not_null`)
- Uniqueness checks (`unique`)
- Policy behavior: `warn`/`reject`/`abort`
- Accepted vs rejected outputs for clean separation
- JSON run reports for observability and audit
## Why Polars + Rust
- Polars provides fast, columnar execution on a single node without JVM overhead.
- Rust gives predictable performance and low-level control while keeping memory usage tight.
- The combo fits contract-driven ingestion: schema checks, deterministic outputs, and stable reports.
## Minimal config example
```yaml
version: "0.1"

report:
  path: "./reports"

entities:
  - name: "customer"
    source:
      format: "csv"
      path: "./example/in/customer"
    sink:
      accepted:
        format: "parquet"
        path: "./example/out/accepted/customer"
      rejected:
        format: "csv"
        path: "./example/out/rejected/customer"
    policy:
      severity: "reject"
    schema:
      columns:
        - name: "customer_id"
          type: "string"
          nullable: false
          unique: true
        - name: "created_at"
          type: "datetime"
          nullable: true
```
- Full example: example/config.yml
- Config reference: docs/config.md
- Support matrix: docs/support-matrix.md
## Quickstart (Homebrew)
### Install
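A sketch; the formula name is an assumption (the CLI crate is `floe-cli`), so check docs/installation.md for the exact tap and formula:

```sh
# Hypothetical formula name -- see docs/installation.md for the real tap/formula
brew install floe
```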
### Validate
Validate a config before ingesting. For automation and orchestrators, Floe can emit a single JSON object on stdout.
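A sketch assuming hypothetical subcommand and flag names (`validate`, `--config`, `--json`); docs/cli.md has the authoritative CLI:

```sh
# Check the contract without ingesting (names are assumptions)
floe validate --config ./example/config.yml

# Orchestrator mode: one JSON object on stdout (flag name is an assumption)
floe validate --config ./example/config.yml --json
```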
### Run
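Again a sketch with assumed command names; see docs/cli.md for the exact invocation:

```sh
# Ingest every entity defined in the contract (subcommand name is an assumption)
floe run --config ./example/config.yml
```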
### Troubleshooting
If Homebrew is unavailable:

- GitHub Releases: download the prebuilt binary from the latest release
- Cargo: `cargo install floe-cli`
- More CLI details: docs/cli.md
- Full installation guide: docs/installation.md
## Run with Docker
### Pull
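The registry and image name below are assumptions; docs/installation.md lists the published image:

```sh
# Hypothetical image coordinates -- replace with the published ones
docker pull ghcr.io/<org>/floe:latest
```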
### Run (mount local folder)
Run Floe against the repo example config by mounting the current directory to /work:
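(A sketch; the image name and `run` subcommand are assumptions, as above.)

```sh
# Mount the repo at /work so relative paths in the config resolve inside the container
docker run --rm -v "$(pwd):/work" -w /work \
  ghcr.io/<org>/floe:latest run --config ./example/config.yml
```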
Notes:
- All CLI arguments are identical to local usage.
- Cloud credentials are passed via environment variables (or runtime identity), not baked into the image.
## Sample console output
```text
run id: run-123
report base: ./reports
==> entity customer (severity=reject, format=csv)
REJECTED customers.csv rows=10 accepted=8 rejected=2 elapsed_ms=12 accepted_out=customer rejected_out=customers_rejected.csv
Totals: files=1 rows=10 accepted=8 rejected=2
Overall: rejected (exit_code=0)
Run summary: ./reports/run_run-123/run.summary.json
```
## Outputs explained
- Accepted output: `entities[].sink.accepted.path`
- Rejected output: `entities[].sink.rejected.path`
- Reports: `<report.path>/run_<run_id>/<entity.name>/run.json`
Reports include per-entity JSON, a run summary, and key counters (rows, accepted/rejected, errors).
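For illustration only, a sketch of what a per-entity report might contain; the field names are assumptions drawn from the counters above, and docs/report.md defines the real schema:

```jsonc
// Illustrative shape only -- field names are assumptions, not the actual schema
{
  "run_id": "run-123",
  "entity": "customer",
  "rows": 10,
  "accepted": 8,
  "rejected": 2,
  "errors": []
}
```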
Report details: docs/report.md
## Severity policy
- `warn`: keep all rows and report violations
- `reject`: reject only rows with violations; keep valid rows
- `abort`: reject the entire file on the first violation
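Severity is set per entity in the contract, as in the minimal example above; a fragment showing only the policy block:

```yaml
entities:
  - name: "customer"
    policy:
      severity: "warn"   # one of: warn | reject | abort
```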
Checks and policy details: docs/checks.md
## Supported formats
Inputs (each supported from local paths and from S3/ADLS/GCS via temp download):

- CSV
- Fixed-width
- TSV (tab-delimited)
- JSON (array/ndjson)
- ORC
- Parquet
- XLSX
- Avro
- XML
Outputs:
- Accepted: Parquet (local + cloud via temp upload), Delta (local + cloud via object_store)
- Rejected: CSV (local + cloud via temp upload)
- Reports: JSON (local + cloud via temp upload)
Sink details:
- Options: docs/sinks/options.md
- Delta: docs/sinks/delta.md
- Iceberg: docs/sinks/iceberg.md
- Support matrix: docs/support-matrix.md
## Cloud integration and storages
Floe resolves all paths through a storage registry in the config. By default, paths use `local://`. To use cloud storage, define a storage (with credentials or bucket info) and reference it on source/sink. S3, ADLS, and GCS are implemented; `dbfs://` (Databricks) is on the roadmap.
Example (S3 storage):
```yaml
storages:
  default: local
  definitions:
    - name: local
      type: local
    - name: s3_raw
      type: s3
      bucket: my-bucket
      region: eu-west-1
      # credentials via standard AWS env vars or profile

entities:
  - name: customer
    source:
      storage: s3_raw
      path: raw/customer/
```
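Storages can also be referenced on sinks; a fragment sketching that, assuming sinks take the same `storage` key as sources (an assumption, not confirmed here):

```yaml
    sink:
      accepted:
        storage: s3_raw   # assumption: sinks accept the same storage key as sources
        format: "parquet"
        path: curated/customer/
```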
## More docs
- How it works: docs/how-it-works.md
- Checks: docs/checks.md
- Reports: docs/report.md
## License
MIT