
# Floe
Floe is a single-node, YAML-driven data ingestion framework written in Rust.
You describe your data contract in a config file; Floe reads raw files, enforces your schema and quality rules, and writes clean accepted rows to your sink — routing invalid rows to a separate rejected output.
## Architecture

![Floe architecture diagram](docs/assets/architecture.png)
Each floe run executes a four-stage pipeline per entity:
| Stage | What happens |
|---|---|
| 1. Resolve inputs | Discover and download source files from local or cloud storage |
| 2. File-level checks | Validate schema structure, file format, and headers |
| 3. Row-level checks | Apply type casting and not_null checks row by row |
| 4. Entity-level checks | Apply unique / primary-key checks across all input rows plus existing accepted data (SCD-aware) |
Rows that pass all checks go to the accepted sink. Rows that fail go to the rejected sink. The severity policy (warn / reject / abort) controls how failures are handled. A JSON run report is written after every run.
## What Floe solves
- Schema enforcement — strict or coerce cast modes, column type checking
- Data quality — `not_null`, `unique`, primary-key checks
- Clean separation — accepted and rejected outputs in the same run
- Incremental ingestion — per-entity file-state tracking to skip unchanged files
- Auditability — per-entity and summary JSON reports on every run
- Cloud-native paths — S3, ADLS, GCS sources and sinks via a storage registry
## Minimal config example
```yaml
version: "0.3"

report:
  path: "./reports"

entities:
  - name: customer
    source:
      format: csv
      path: ./in/customer
    sink:
      accepted:
        format: parquet
        path: ./out/accepted/customer
      rejected:
        format: csv
        path: ./out/rejected/customer
    policy:
      severity: reject
    schema:
      columns:
        - name: customer_id
          type: string
          nullable: false
          unique: true
        - name: created_at
          type: datetime
          nullable: true
```
Full example: example/config.yml
Config reference: docs/config.md
Support matrix: docs/support-matrix.md
## Quickstart
### Install (Homebrew)
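A sketch of the expected install command; the tap and formula names are assumptions, so check the installation guide for the published names:

```sh
# Formula/tap names are assumptions -- see docs/installation.md for the real ones
brew install floe
```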
Alternatives: download a prebuilt binary from GitHub Releases or `cargo install floe-cli`.
Full installation guide: docs/installation.md
### Validate a config
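Assuming a `validate` subcommand that takes the config as a positional argument (see docs/cli.md for the exact syntax):

```sh
# Subcommand and argument shape are assumptions -- see docs/cli.md
floe validate config.yml
```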
### Run
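With the same assumed argument shape:

```sh
# Argument shape is an assumption -- see docs/cli.md
floe run config.yml
```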
### Run with an environment profile
Use a profile to inject environment-specific values (bucket names, paths, etc.) into {{VAR}} placeholders in your config without editing the config itself:
```yaml
# profiles/prod.yaml
apiVersion: floe/v1
kind: EnvironmentProfile
metadata:
  name: prod
variables:
  BUCKET: my-prod-bucket
  BASE_PATH: /data/prod
  # Cross-variable references are supported:
  OUT_PATH: ${BASE_PATH}/accepted
```
Variable priority (highest wins): `env.vars` in config → `env.file` → profile variables.
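As an illustration, a config fragment could consume the profile variables through `{{VAR}}` placeholders; the `--profile` flag name below is an assumption (see docs/cli.md):

```yaml
# Illustrative config fragment with placeholders filled from the active profile
sink:
  accepted:
    format: parquet
    path: "s3://{{BUCKET}}{{OUT_PATH}}"
```

```sh
# --profile flag name is an assumption -- see docs/cli.md
floe run config.yml --profile profiles/prod.yaml
```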
### Run with Docker
Cloud credentials are passed via environment variables, not baked into the image.
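A sketch of what a containerized run could look like; the image name is illustrative, and credentials are forwarded from the host environment as the note above describes:

```sh
# Image name is an assumption -- check the project's registry for the published image
docker run --rm \
  -v "$(pwd)/config.yml:/app/config.yml" \
  -v "$(pwd)/data:/app/data" \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_REGION \
  ghcr.io/example/floe run /app/config.yml
```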
More CLI details: docs/cli.md
## Sample output
```text
run id: run-20240501-abc123
report base: ./reports

==> entity customer (severity=reject, format=csv)
  REJECTED customers.csv rows=10 accepted=8 rejected=2 elapsed_ms=12

Totals: files=1 rows=10 accepted=8 rejected=2
Overall: rejected (exit_code=0)
Run summary: ./reports/run_run-20240501-abc123/run.summary.json
```
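The summary file carries the same numbers in machine-readable form. A hypothetical sketch of `run.summary.json`, where the field names are assumptions and the values mirror the run above:

```json
{
  "run_id": "run-20240501-abc123",
  "overall": "rejected",
  "exit_code": 0,
  "totals": { "files": 1, "rows": 10, "accepted": 8, "rejected": 2 }
}
```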
## Severity policy
| Policy | Behaviour |
|---|---|
| `warn` | Keep all rows, surface violations in the report |
| `reject` | Route violating rows to the rejected sink, keep valid rows |
| `abort` | Fail the entire entity on the first violation |
Checks and policy details: docs/checks.md
## Incremental ingestion
Set `incremental_mode: file` on an entity to enable file-level state tracking. Floe records processed file metadata and skips unchanged files on subsequent runs.
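For example, on the customer entity from the minimal config above:

```yaml
entities:
  - name: customer
    incremental_mode: file   # record file metadata; skip unchanged files on the next run
    source:
      format: csv
      path: ./in/customer
```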
## Supported formats
Inputs (local + S3 / ADLS / GCS):
CSV · TSV · JSON (array/ndjson) · Parquet · ORC · Avro · XLSX · XML · Fixed-width
Accepted outputs:
Parquet · Delta Lake (append, overwrite, merge SCD1/SCD2) · Apache Iceberg
Rejected outputs: CSV
Reports: JSON
Sink details: docs/sinks/options.md · Delta · Iceberg
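As a sketch, an accepted sink targeting Delta Lake with an SCD2 merge might look like the following; every key except `format` and `path` is an assumption, so check docs/sinks/options.md for the real option names:

```yaml
sink:
  accepted:
    format: delta
    path: ./out/accepted/customer
    # Option names below are assumptions -- see docs/sinks/options.md
    mode: merge   # append | overwrite | merge
    scd: 2        # SCD1 or SCD2 merge behaviour
```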
## Cloud storage
Define a storage in your config and reference it from any source or sink:
```yaml
storages:
  definitions:
    - name: s3_raw
      type: s3
      bucket: my-bucket
      region: eu-west-1

entities:
  - name: customer
    source:
      storage: s3_raw
      path: raw/customer/
```
Storage guides: S3 · ADLS · GCS
## Orchestration
`floe manifest generate` produces a JSON manifest that orchestrators can read to schedule entities as individual tasks:
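A sketch, assuming the config is passed positionally and the manifest is written to stdout (only the `floe manifest generate` command itself comes from this README):

```sh
# Argument shape and stdout behaviour are assumptions -- see docs/cli.md
floe manifest generate config.yml > manifest.json
```

Each entity entry in the manifest would then map to one orchestrator task.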
Connectors: dagster-floe · airflow-floe
## More docs
## License
MIT