inbq
A library for parsing BigQuery queries and extracting schema-aware, column-level lineage.
Features
- Parse BigQuery queries into well-structured ASTs with easy-to-navigate nodes.
- Extract schema-aware, column-level lineage.
- Trace data flow through nested structs and arrays.
- Capture referenced columns and the specific query components (e.g., select, where, join) they appear in.
- Process both single and multi-statement queries with procedural language constructs.
- Built for speed and efficiency, with lightweight Python bindings that add minimal overhead.
Python
Install
pip install inbq
Example (Pipeline API)
=
=
=
=
=
# This loop will iterate just once as we have only one query
assert ,
assert ,
# Prints:
# ast=Ast(...)
# Lineage:
# Inputs:
# project.dataset.out->id <- ['project.dataset.t2->id', 'project.dataset.t1->id']
# project.dataset.out->val <- ['project.dataset.t2->s.x', 'project.dataset.t1->x', 'project.dataset.out->val']
#
# Side inputs:
# project.dataset.out->id <- ['project.dataset.t2->s.source @ where', 'project.dataset.t2->id @ join', 'project.dataset.t1->id @ join']
# project.dataset.out->val <- ['project.dataset.t2->s.source @ where', 'project.dataset.t2->id @ join', 'project.dataset.t1->id @ join']
#
# Referenced columns:
# project.dataset.out->val referenced in ['default_var', 'select']
# project.dataset.t1->id referenced in ['join', 'select']
# project.dataset.t1->x referenced in ['select']
# project.dataset.t2->id referenced in ['join', 'select']
# project.dataset.t2->s.x referenced in ['select']
# project.dataset.t2->s.source referenced in ['where']
Note: What happens if you remove the insert and just keep the select in the query? inbq is designed to handle this gracefully. It will return the lineage for the last SELECT statement, but since the destination is no longer explicit, the output object (an anonymous query) will be assigned an anonymous identifier (e.g., !anon_4). Try it yourself and see how the output changes!
To learn more about the output elements (Lineage, Side Inputs, and Referenced Columns), please see the Concepts section.
Example (Individual Functions)
If you don't like the Pipeline API, you can use these functions instead:
parse_sql and parse_sql_to_dict
Parse a single SQL query:
=
# You can also get a dictionary representation of the AST
=
parse_sqls
Parse multiple SQL queries in parallel:
=
=
parse_sqls_and_extract_lineage
Parse SQLs and extract lineage in one go:
, =
AST Navigation
=
=
# Example: find updated tables and columns
:
:
:
| :
# Example: find `like` filters
:
:
Variants and Variant Types in Python
The AST nodes in Python are auto-generated dataclasses from their Rust definitions.
For instance, a Rust enum Expr might be defined as:
In Python, this translates to corresponding classes like Expr_Binary(vty=BinaryExpr), Expr_Identifier(vty=Identifier), etc.
The vty attribute stands for "variant type" (unit variants do not have a vty attribute).
You can search for any type of object using .find_all(), whether it's the variant (e.g., Expr_Identifier) or the concrete variant type (e.g., Identifier).
Rust
Install
cargo add inbq
Example
use ;
Command Line Interface
Install binary
Extract Lineage
-
Prepare your data catalog: create a JSON file (e.g., catalog.json) that defines the schema for all tables and views referenced in your SQL queries.
-
Run inbq: pass the catalog file and your SQL file or directory of multiple SQL files to the inbq lineage command.
The output is written to stdout.
Concepts
Lineage
Column-level lineage tracks how data flows from a destination column back to its original source columns. A destination column's value is derived from its direct input columns, and this process is applied recursively to trace the lineage back to the foundational source columns. For example, in with tmp as (select a+b as tmp_c from t) select tmp_c as c from t, the lineage for column c traces back to a and b as its source columns (the source table is t).
Lineage - Side Inputs
Side inputs are columns that indirectly contribute to the final set of output values. As the name implies, they aren't part of the direct SELECT list, but are found in the surrounding clauses that shape the result, such as WHERE, JOIN, WINDOW, etc. Side inputs influence is traced recursively. For example, in the query:
with cte as (select id, c1 from table1 where f1>10)
select c2 as z
from table2 inner join cte using (id)
table1.f1 is a side input to z with sides join and where (cte.id, later used in the join condition, is filtered by table1.f1). The other two side inputs are table1.id with side join and table2.id with side join.
Referenced Columns
Referenced columns provide a detailed map of where each input column is mentioned within a query. This is the entry point for a column into the query's logic. From this initial reference, the column can then influence other parts of the query indirectly through subsequent operations.
Limitations
While this library can parse and extract lineage for most BigQuery syntax, there are some current limitations. For example, the pipe (|) syntax and the recently introduced MATCH_RECOGNIZE clause are not yet supported. Requests and contributions for unsupported features are welcome.
Contributing
Here's a brief overview of the project's key modules:
crates/inbq/src/parser.rs: contains the hand-written top-down parser.crates/inbq/src/ast.rs: defines the Abstract Syntax Tree (AST) nodes.- Note: If you add or modify AST nodes here, you must regenerate the corresponding Python nodes. You can do this by running
cargo run --bin inbq_genpy, which will updatecrates/py_inbq/python/inbq/ast_nodes.py.
- Note: If you add or modify AST nodes here, you must regenerate the corresponding Python nodes. You can do this by running
crates/inbq/src/lineage.rs: contains the core logic for extracting column-level lineage from the AST.crates/py_inbq/: this crate exposes the Rust backend as a Python module via PyO3.crates/inbq/tests/: this directory contains the tests. You can add new test cases for parsing and lineage extraction by editing the.tomlfiles:parsing_tests.tomllineage_tests.toml