floe-cli 0.6.2

CLI for Floe, a YAML-driven technical ingestion tool.


Floe

Floe is a Polars-powered data contract runtime for reliable file ingestion. It validates raw files or extracted datasets before they enter your trusted layer, routing accepted rows to lakehouse sinks and rejected rows to quarantine with audit reports.

Use Floe when you already have a platform such as Databricks, Fabric-style lakehouses, Snowflake/Open Catalog, MotherDuck, Airflow, or Dagster, but you need a lightweight entry gate for file contracts, quality checks, rejected rows, and run evidence.

Floe complements extract/load tools such as dlt, ingestr, and Airbyte: they get data out of source systems; Floe decides what is allowed into trusted storage.

What Floe does

  • Defines source, schema, checks, accepted output, rejected output, write mode, metadata, and reporting in one human-readable YAML contract.
  • Reads common file exports and extracted datasets from local or cloud storage.
  • Applies schema, casting, nullability, and uniqueness checks before write.
  • Writes accepted rows to Parquet, Delta Lake, Apache Iceberg, or DuckDB.
  • Writes invalid rows separately and emits deterministic JSON run reports.
  • Runs as a CLI binary, Docker image, Python library, or orchestrated job.
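One way to read "deterministic" in the report bullet above is that two runs with the same outcome produce byte-identical JSON. The sketch below illustrates that idea in plain Python. It is not Floe's report writer, and the field names (entity, accepted_rows, rejected_rows, rejected_by_reason) are hypothetical, chosen only for illustration; the actual report format is described in docs/report.md.

```python
import json


def build_report(entity, accepted_rows, rejected_reasons):
    """Summarise one run. All field names here are hypothetical."""
    counts = {}
    for reason in rejected_reasons:
        counts[reason] = counts.get(reason, 0) + 1
    return {
        "entity": entity,
        "accepted_rows": accepted_rows,
        "rejected_rows": len(rejected_reasons),
        "rejected_by_reason": counts,
    }


def render_report(report):
    # Sorted keys and fixed separators make the output independent of
    # dictionary insertion order, so equal content gives equal bytes.
    return json.dumps(report, sort_keys=True, separators=(",", ":"))


# Same rejections seen in a different order yield the same report text.
first = render_report(build_report("orders", 3, ["cast", "unique", "cast"]))
second = render_report(build_report("orders", 3, ["unique", "cast", "cast"]))
```

Byte-identical reports can be diffed or hashed across runs, which is useful when they serve as audit evidence.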

How it works

Floe architecture

Each floe run executes a deterministic gate per entity:

Stage                   What happens
----------------------  ------------------------------------------------------------
1. Resolve inputs       Discover and download source files from local or cloud storage
2. File-level checks    Validate schema structure, file format, and headers
3. Row-level checks     Apply type casting and not_null checks row by row
4. Entity-level checks  Apply unique / primary-key checks across all input rows plus existing accepted data
5. Write outputs        Route valid rows to accepted sinks, invalid rows to rejected sinks, and write reports
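The row-level, entity-level, and routing stages can be pictured with a small standard-library sketch. This is a conceptual illustration only, not Floe's Rust/Polars implementation: the function names, the rejection reason strings, and the existing_keys parameter (standing in for keys already present in accepted data) are all assumptions made for the example.

```python
def cast_row(row, schema):
    """Cast each column with its caster; empty or missing values become None."""
    typed = {}
    for column, caster in schema.items():
        raw = row.get(column)
        typed[column] = None if raw in (None, "") else caster(raw)
    return typed


def run_gate(rows, schema, not_null, unique_key, existing_keys=()):
    """Return (accepted, rejected); rejected holds (row, reason) pairs."""
    accepted, rejected, candidates = [], [], []

    # Row-level checks: type casting and not_null, one row at a time.
    for row in rows:
        try:
            typed = cast_row(row, schema)
        except (ValueError, TypeError):
            rejected.append((row, "cast_error"))
            continue
        missing = [c for c in not_null if typed[c] is None]
        if missing:
            rejected.append((row, "not_null:" + ",".join(missing)))
            continue
        candidates.append((row, typed))

    # Entity-level check: uniqueness across this input plus existing data.
    seen = set(existing_keys)
    for row, typed in candidates:
        key = typed[unique_key]
        if key in seen:
            rejected.append((row, "unique:" + unique_key))
            continue
        seen.add(key)
        accepted.append(typed)

    return accepted, rejected


rows = [
    {"id": "1", "amount": "9.50"},
    {"id": "2", "amount": "oops"},   # fails the float cast
    {"id": "", "amount": "3.00"},    # null id
    {"id": "1", "amount": "4.00"},   # duplicate within the input
    {"id": "7", "amount": "1.25"},   # duplicate of existing accepted data
]
accepted, rejected = run_gate(
    rows,
    schema={"id": int, "amount": float},
    not_null=["id"],
    unique_key="id",
    existing_keys={7},
)
reasons = [reason for _, reason in rejected]
```

In this sample only the first row is accepted; the other four are routed to the rejected list together with the reason they failed, which mirrors the accepted/rejected split in the table above.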

Floe uses Rust, Polars, and Arrow for single-node columnar execution. At the sink boundary, Arrow RecordBatches are handed to table-format writers without an extra serialization hop.

  • Inputs: CSV · TSV · JSON · Parquet · ORC · Avro · XLSX · XML · Fixed-width
  • Accepted outputs: Parquet · Delta Lake · Apache Iceberg · DuckDB / MotherDuck
  • Storage: local · S3 · ADLS · GCS
  • Catalogs: AWS Glue · Iceberg REST (Polaris, Nessie, Snowflake) · Databricks Unity Catalog

Feature index

Capability                                              Start here
------------------------------------------------------  ------------------------------------------------------------
Contracts and full YAML reference                       docs/config.md
Pipeline phases and execution details                   docs/how-it-works.md
Checks: schema mismatch, cast, not_null, unique         docs/checks.md
Supported inputs, outputs, storage, and catalogs        docs/support-matrix.md
Write modes: overwrite, append, merge_scd1, merge_scd2  docs/write_modes.md
Parquet, Delta, Iceberg, and DuckDB sinks               docs/sinks/parquet.md, docs/sinks/delta.md, docs/sinks/iceberg.md, docs/sinks/duckdb.md
S3, ADLS, and GCS storage                               docs/storages/s3.md, docs/storages/adls.md, docs/storages/gcs.md
Incremental file state                                  docs/incremental.md
Profiles and variables                                  docs/profiles.md, docs/variables.md
PII masking                                             docs/pii.md
Reports, logs, and OpenLineage                          docs/report.md, docs/logging.md, docs/lineage.md
Airflow and Dagster manifests                           docs/manifest.md, orchestrators/airflow-floe/README.md, orchestrators/dagster-floe/README.md
Python and notebooks                                    docs/python-bindings.md
Installation and CLI usage                              docs/installation.md, docs/cli.md

Install

macOS / Linux — Homebrew

brew tap malon64/floe
brew install floe

Windows — Scoop

scoop bucket add floe https://github.com/malon64/scoop-floe
scoop install floe

Docker

docker pull ghcr.io/malon64/floe:latest
docker run --rm -v "$PWD:/work" ghcr.io/malon64/floe:latest run -c /work/config.yml

Alternatively, download a prebuilt binary from GitHub Releases, or build from source with cargo install floe-cli.
For all options, see the full installation guide in docs/installation.md.

The DuckDB sink ships as a separate companion so that the default artifacts stay lean. To use it, pull the ghcr.io/malon64/floe-duckdb image, put a floe-duckdb binary on your PATH, or install the off-PyPI floe-duckdb wheel. The lean floe binary automatically delegates DuckDB-sink runs to the companion. For details, see the DuckDB support documentation.

Quick start

floe validate -c config.yml   # validate config and schema
floe run      -c config.yml   # run the pipeline
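When the two commands above are driven from a script or scheduler, a thin wrapper can chain them so that the run is skipped if validation fails. The sketch below uses only Python's subprocess module and the two CLI invocations shown above. It assumes floe is on your PATH and that floe validate exits with a non-zero status on an invalid config, which this README does not state. The helper names are hypothetical.

```python
import subprocess


def floe_commands(config_path):
    """Build the two quick-start invocations for a given config file."""
    return [
        ["floe", "validate", "-c", config_path],
        ["floe", "run", "-c", config_path],
    ]


def validate_then_run(config_path, runner=subprocess.run):
    """Run validate, then run; stop at the first failing command.

    With check=True, subprocess.run raises CalledProcessError on a
    non-zero exit status (assumed here to signal a failed validation).
    """
    for command in floe_commands(config_path):
        runner(command, check=True)


# Usage (requires the floe binary to be installed):
# validate_then_run("config.yml")
```

The runner parameter exists only so the sequencing can be exercised without the binary installed; in normal use the default subprocess.run is what executes the commands.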

Config reference · Example config

For the full documentation entry point, see docs/summary.md.

License

MIT