# ferro-airflow-dag-parser
Static, AST-based extractor for Apache Airflow™ Python DAG files.
Recovers `dag_id`, `task_id`s, task dependencies, schedule, and a
catalogue of "this can't be resolved statically" markers — without
running the source.
⚠️ **Alpha** (v0.0.1). The API will change between 0.0.x releases. The implementation is pulled from production use in the Ferro ecosystem; it is the static fast-path that orchestrators use to skip CPython evaluation when a DAG file's structure can be determined by looking at the Python AST alone.
Part of the Ferro ecosystem. Extracted from production use in FerroAir (an Airflow-3-compatible orchestrator written in Rust).
## What this crate does
Apache Airflow's reference scheduler imports every `dags/*.py` file
through CPython on every poll cycle so it can read the resulting
`DAG` objects. That works, but it pays the full cost of evaluating
every import-time expression — including the ones that have no side
effects relevant to the scheduler.
This crate parses the Python source with `ruff_python_parser`
(vendored as `littrs-ruff-python-parser`) and walks the AST to
recover the same information statically:
- **DAG ID** — from `with DAG(dag_id="…")` or `@dag def fn():`.
- **Task IDs** — from every `task_id="…"` operator kwarg and every `@task`-decorated function name, deduplicated, source-order preserved.
- **Schedule** — `schedule="@daily"` / `schedule_interval="…"` / `timetable=…`. Best-effort stringified for non-string literals.
- **Dependency edges** — `>>` / `<<` / `set_upstream` / `set_downstream`.
- **`default_args={…}`** presence flag.
- **Source span** (1-indexed inclusive lines) for error messages and jump-to-DAG features.
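Taken together, the recovered fields can be pictured as a plain struct. The sketch below is hypothetical: the field names and types are assumptions for illustration, not the crate's actual `ExtractedDag` definition.

```rust
// Hypothetical shape of the statically recovered data. The real
// `ExtractedDag` in the crate may use different names and types.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractedDag {
    pub dag_id: String,
    pub task_ids: Vec<String>,        // deduplicated, source order preserved
    pub schedule: Option<String>,     // best-effort stringified
    pub edges: Vec<(String, String)>, // (upstream, downstream) pairs
    pub has_default_args: bool,       // `default_args={…}` presence flag
    pub source_span: (u32, u32),      // 1-indexed inclusive line range
}
```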
It also detects seven dynamic-fallback markers that say "static
analysis is incomplete; if you need full fidelity, route this DAG
through CPython":
- `dag_id=Path(__file__).stem` (or any non-literal expression)
- `chain(*list)` / `cross_downstream(*list)` splat
- `task_id=f"task_{i}"` (f-string task IDs in a loop)
- `schedule=Asset("…")` / `schedule=Timetable()` (non-literal schedule)
- `@task(expand=…)` / `@task(partial=…)` / dynamic taskflow decorators
- `if X: with DAG(...)` (import-time conditional DAG)
- `for x in …: PythonOperator(...)` (operator construction in a loop)
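Because the set of markers is closed, a consumer can treat them as a small enum and match exhaustively. The enum below is a hypothetical sketch; the crate's actual marker type and variant names may differ.

```rust
// Hypothetical mirror of the seven dynamic-fallback markers listed above;
// the crate's real type may be named and shaped differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DynamicMarker {
    NonLiteralDagId,      // dag_id=Path(__file__).stem, etc.
    SplatChain,           // chain(*list) / cross_downstream(*list)
    FStringTaskId,        // task_id=f"task_{i}"
    NonLiteralSchedule,   // schedule=Asset("…") / schedule=Timetable()
    DynamicTaskDecorator, // @task(expand=…) / @task(partial=…)
    ConditionalDag,       // if X: with DAG(...)
    OperatorInLoop,       // for x in …: PythonOperator(...)
}
```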
## What this crate does not do
- **Run the file.** This is a static analyzer, not an executor. If a DAG's structure depends on runtime state, this crate will surface a dynamic-fallback marker rather than try to evaluate the expression.
- **Validate semantics.** Recovered identifiers are validated against Airflow's identifier rule (1–250 chars, `[a-zA-Z0-9_\-\.]`). The crate does not check whether `task_id`s match operator contracts, whether DAG runs would succeed, or whether the schedule is sane.
- **Mirror the full upstream `DagBag` API.** Where `airflow.models.DagBag` reports import errors, plugin lookups, dag-pickle round-trips, and more, this crate reports `Result<ExtractedDag, ParseError>`. The call site decides what to do with parse failures.
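The identifier rule mentioned above is simple enough to sketch in a few lines of plain Rust. The function name here is hypothetical, not part of the crate's public API:

```rust
/// Airflow's identifier rule: 1–250 characters, each drawn from
/// [a-zA-Z0-9_\-\.]. Hypothetical helper for illustration only.
fn is_valid_airflow_id(id: &str) -> bool {
    let len = id.chars().count();
    (1..=250).contains(&len)
        && id
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | '-' | '.'))
}
```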
## Quick start
A minimal end-to-end example. The accessor names on the returned value are illustrative (the exact `ExtractedDag` field names are an assumption; check the crate docs):

```rust
use ferro_airflow_dag_parser::extract_static_dag;

let src = r#"
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="hello", schedule="@daily"):
    a = BashOperator(task_id="a", bash_command="echo a")
    b = BashOperator(task_id="b", bash_command="echo b")
    a >> b
"#;

// Field names below are illustrative; consult the crate docs for the
// exact `ExtractedDag` shape.
let dag = extract_static_dag(src).unwrap();
assert_eq!(dag.dag_id, "hello");
assert_eq!(dag.task_ids, vec!["a", "b"]);
assert_eq!(dag.schedule.as_deref(), Some("@daily"));
assert!(dag.edges.contains(&("a".to_string(), "b".to_string())));
assert!(dag.dynamic_markers.is_empty());
```
## Cache (for filesystem-watching consumers)
If you are a DAG-folder watcher (the typical use case), construct
a process-local [`ParseCache`] and call `get_or_parse(path)` instead
of re-parsing on every poll:
```rust
use std::path::Path;
use ferro_airflow_dag_parser::ParseCache;

// The path below is illustrative; point it at a file in your DAG folder.
let cache = ParseCache::new();
let outcome = cache.get_or_parse(Path::new("dags/hello.py")).unwrap();
println!("{outcome:?}");
```
The cache uses a stat-only fast path (mtime + size fingerprint) before falling back to re-hashing the file contents, matching the behaviour of Airflow's reference DAG processor.
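The two-tier check can be sketched with nothing but the standard library. The helper names and fingerprint shapes here are assumptions for illustration, not the crate's actual `ParseCache` internals:

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Stat-only fast path: a cheap (mtime, size) pair read from metadata,
/// with no file I/O beyond the stat call. Hypothetical helper.
fn stat_fingerprint(path: &Path) -> io::Result<(SystemTime, u64)> {
    let meta = fs::metadata(path)?;
    Ok((meta.modified()?, meta.len()))
}

/// Slow path: hash the file contents when the stat pair is inconclusive
/// (e.g. an editor rewrote the file without changing its size and the
/// mtime granularity is too coarse). Hypothetical helper.
fn content_fingerprint(path: &Path) -> io::Result<u64> {
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    Ok(hasher.finish())
}
```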
## Backend
The crate uses `ruff_python_parser` (vendored as the
`littrs-ruff-python-parser` crates.io mirror, pinned
for reproducibility). It is the only backend, gated behind the
`parser-ruff` feature, which is on by default. Set
`default-features = false` if you only need the codec-free types
in [common] and [line_index].
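For example, a consumer that only needs those types might depend on the crate like this (version per the alpha release noted above):

```toml
[dependencies]
ferro-airflow-dag-parser = { version = "0.0.1", default-features = false }
```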
A second rustpython-parser backend was used as a parity-checking
companion during the originating Ferro PoC and removed before
publication: its transitive dependency closure pulls
LGPL-3.0-only crates (the `malachite-*` family) and unmaintained
Unicode crates, neither of which is appropriate for this
workspace's Apache-2.0-clean license profile.
## Status
| Aspect | Status |
|---|---|
| API stability | alpha (v0.0.x — breaking changes allowed at any release) |
| Use in production | Yes, in FerroAir |
| MSRV | rustc 1.88 |
| Coverage target | 80%+ line; current measured in CI |
| Async runtime | None (synchronous; the `ParseCache` uses `dashmap` for thread safety) |
| Test fixtures | Inline source strings only — no vendored Apache Airflow™ DAGs |
## Used in production by
- **FerroAir** — Apache Airflow-3-compatible orchestrator written in Rust. The static fast-path uses this crate (private at time of writing; will switch to `ferro-airflow-dag-parser` once published).
## Compatibility note
Apache Airflow™ is a registered trademark of the Apache Software Foundation. This crate implements a static analyzer compatible with the Airflow DAG Python API; it is not endorsed by, or affiliated with, the ASF.
The identifier-validation rule (1–250 characters drawn from
`[a-zA-Z0-9_\-\.]`) is taken from Airflow's reference implementation
(`airflow.models.dag.DAG_ID_RE_VALID_CHARS`).
## Triage policy
See the workspace CONTRIBUTING.md. In short: security reports are
answered within 48 hours; bugs with a reproducer within 14 days,
best-effort; feature requests are collected for the next minor release.
## License
Apache-2.0. See LICENSE.