
# Floe

Technical ingestion on a single node, driven by YAML contracts.
Floe is a Rust + Polars tool for technical ingestion on a single node. It ingests raw files into typed datasets using YAML contracts, applies schema enforcement and data quality rules, and produces clear, auditable outputs.
Start here: docs/summary.md
## What Floe solves
- Schema enforcement and type casting (`strict` vs `coerce`)
- Nullability checks (`not_null`)
- Uniqueness checks (`unique`)
- Policy behavior: `warn`/`reject`/`abort`
- Accepted vs rejected outputs for clean separation
- JSON run reports for observability and audit
## Why Polars + Rust
- Polars provides fast, columnar execution on a single node without JVM overhead.
- Rust gives predictable performance and low-level control while keeping memory usage tight.
- The combo fits contract-driven ingestion: schema checks, deterministic outputs, and stable reports.
## Minimal config example
```yaml
version: "0.1"

report:
  path: "./reports"

entities:
  - name: "customer"
    source:
      format: "csv"
      path: "./example/in/customer"
    sink:
      accepted:
        format: "parquet"
        path: "./example/out/accepted/customer"
      rejected:
        format: "csv"
        path: "./example/out/rejected/customer"
    policy:
      severity: "reject"
    schema:
      columns:
        - name: "customer_id"
          type: "string"
          nullable: false
          unique: true
        - name: "created_at"
          type: "datetime"
          nullable: true
```
- Full example: example/config.yml
- Config reference: docs/config.md
- Support matrix: docs/support-matrix.md
## Quickstart (Homebrew)
### Install
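A sketch; the formula name is an assumption (the CLI crate is `floe-cli`), so check docs/installation.md for the exact tap and formula:

```sh
# Hypothetical formula name -- see docs/installation.md for the real tap/formula
brew install floe
```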
### Validate
Validate a config before ingesting. For automation and orchestrators, Floe can emit a single JSON object on stdout.
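A sketch assuming hypothetical subcommand and flag names (`validate`, `--config`, `--json`); docs/cli.md has the authoritative CLI:

```sh
# Check the contract without ingesting (names are assumptions)
floe validate --config ./example/config.yml

# Orchestrator mode: one JSON object on stdout (flag name is an assumption)
floe validate --config ./example/config.yml --json
```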
### Run
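Again a sketch with assumed command names; see docs/cli.md for the exact invocation:

```sh
# Ingest every entity defined in the contract (subcommand name is an assumption)
floe run --config ./example/config.yml
```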
### Troubleshooting
If Homebrew is unavailable:

- GitHub Releases: download the prebuilt binary from the latest release
- Cargo: `cargo install floe-cli`
- More CLI details: docs/cli.md
- Full installation guide: docs/installation.md
## Run with Docker
### Pull
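The registry and image name below are assumptions; docs/installation.md lists the published image:

```sh
# Hypothetical image coordinates -- replace with the published ones
docker pull ghcr.io/<org>/floe:latest
```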
### Run (mount local folder)
Run Floe against the repo example config by mounting the current directory to /work:
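(A sketch; the image name and `run` subcommand are assumptions, as above.)

```sh
# Mount the repo at /work so relative paths in the config resolve inside the container
docker run --rm -v "$(pwd):/work" -w /work \
  ghcr.io/<org>/floe:latest run --config ./example/config.yml
```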
Notes:
- All CLI arguments are identical to local usage.
- Cloud credentials are passed via environment variables (or runtime identity), not baked into the image.
## Sample console output
```text
run id: run-123
report base: ./reports
==> entity customer (severity=reject, format=csv)
REJECTED customers.csv rows=10 accepted=8 rejected=2 elapsed_ms=12 accepted_out=customer rejected_out=customers_rejected.csv
Totals: files=1 rows=10 accepted=8 rejected=2
Overall: rejected (exit_code=0)
Run summary: ./reports/run_run-123/run.summary.json
```
## Outputs explained
- Accepted output: `entities[].sink.accepted.path`
- Rejected output: `entities[].sink.rejected.path`
- Reports: `<report.path>/run_<run_id>/<entity.name>/run.json`
Reports include per-entity JSON, a run summary, and key counters (rows, accepted/rejected, errors).
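For illustration only, a sketch of what a per-entity report might contain; the field names are assumptions drawn from the counters above, and docs/report.md defines the real schema:

```jsonc
// Illustrative shape only -- field names are assumptions, not the actual schema
{
  "run_id": "run-123",
  "entity": "customer",
  "rows": 10,
  "accepted": 8,
  "rejected": 2,
  "errors": []
}
```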
Report details: docs/report.md
## Severity policy
- `warn`: keep all rows and report violations
- `reject`: reject only rows with violations; keep valid rows
- `abort`: reject the entire file on the first violation
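Severity is set per entity in the contract, as in the minimal example above; a fragment showing only the policy block:

```yaml
entities:
  - name: "customer"
    policy:
      severity: "warn"   # one of: warn | reject | abort
```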
Checks and policy details: docs/checks.md
## Supported formats
Inputs (each supported from local paths and from S3/ADLS/GCS via temp download):

- CSV
- Fixed-width
- TSV (tab-delimited)
- JSON (array/ndjson)
- ORC
- Parquet
- XLSX
- Avro
- XML
Outputs:
- Accepted: Parquet (local + cloud via temp upload), Delta (local + cloud via object_store)
- Rejected: CSV (local + cloud via temp upload)
- Reports: JSON (local + cloud via temp upload)
Sink details:
- Options: docs/sinks/options.md
- Delta: docs/sinks/delta.md
- Iceberg: docs/sinks/iceberg.md
- Support matrix: docs/support-matrix.md
## Cloud integration and storages
Floe resolves all paths through a storage registry in the config. By default, paths use `local://`. To use cloud storage, define a storage (with credentials or bucket info) and reference it on source/sink. S3, ADLS, and GCS are implemented; `dbfs://` (Databricks) is on the roadmap.
Example (S3 storage):
```yaml
storages:
  default: local
  definitions:
    - name: local
      type: local
    - name: s3_raw
      type: s3
      bucket: my-bucket
      region: eu-west-1
      # credentials via standard AWS env vars or profile

entities:
  - name: customer
    source:
      storage: s3_raw
      path: raw/customer/
```
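Storages can also be referenced on sinks; a fragment sketching that, assuming sinks take the same `storage` key as sources (an assumption, not confirmed here):

```yaml
    sink:
      accepted:
        storage: s3_raw   # assumption: sinks accept the same storage key as sources
        format: "parquet"
        path: curated/customer/
```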
## More docs
- How it works: docs/how-it-works.md
- Checks: docs/checks.md
- Reports: docs/report.md
## License
MIT