// SPDX-License-Identifier: Apache-2.0

//! Crate for validating [Substrait](https://substrait.io/).
//!
//! # Usage
//!
//! The usage pattern is roughly as follows.
//!
//!  1) Build a [`Config`] structure to configure the validator. You can also
//!     just use [`std::default::Default`] if you don't need to configure
//!     anything, but you might want to at least call
//!     `Config::add_curl_uri_resolver()` (if you're using the `curl` feature).
//!  2) Parse the incoming `substrait.Plan` message using [`parse()`] or
//!     [`validate()`]. This creates a [ParseResult], containing a
//!     [tree](output::tree) structure corresponding to the query plan that also
//!     contains diagnostics and other annotations added by the validator.
//!  3) You can traverse the tree yourself using [ParseResult::root], or you can
//!     use one of the methods associated with [ParseResult] to obtain the
//!     validation results you need.
//!
//! Note that only the binary protobuf serialization format is supported as
//! input; the JSON format is *not* supported. This is a limitation of `prost`,
//! the crate used for protobuf deserialization. If you're looking for a
//! library (or CLI) that supports more human-friendly input, check out the
//! Python bindings.
#![cfg_attr(
    feature = "private_docs",
    allow(rustdoc::private_intra_doc_links),
    doc = "
# Internal workings

*Very* roughly speaking, the validation process boils down to a conversion from
[one type of tree structure](input::proto::substrait::Plan) (including
expansions of any referenced YAML files) to [another](output::tree::Node),
using the facilities provided by the [parse module](mod@parse). This process is
documented in much more detail [here](mod@parse). Once constructed, the
resulting tree can then be [further converted](export) to a few export formats,
or the validation [diagnostics](output::diagnostic) can simply be
[extracted](ParseResult::iter_diagnostics()).
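
As a rough illustration of that tree-to-tree idea, here is a minimal sketch.
It is not the crate's implementation: `Input`, `Level`, `Node`, `convert` and
`collect_diagnostics` are hypothetical stand-ins invented for this example, and
the validation rule (negative values are errors) is made up.

```rust
// Hypothetical stand-ins for illustration only; the real input and output
// tree types of this crate are far richer.

// Input tree: roughly what deserialization hands to the validator.
enum Input {
    Value(i64),
    List(Vec<Input>),
}

// Severity of a diagnostic attached to an output node.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Level {
    Info,
    Error,
}

// Output tree: mirrors the input tree, but carries annotations.
struct Node {
    diagnostics: Vec<Level>,
    children: Vec<Node>,
}

// Converts the input tree to the output tree, attaching diagnostics along
// the way. The rule used here is an assumption made for the sketch.
fn convert(input: &Input) -> Node {
    match input {
        Input::Value(x) => Node {
            diagnostics: if *x < 0 { vec![Level::Error] } else { vec![Level::Info] },
            children: vec![],
        },
        Input::List(items) => Node {
            diagnostics: vec![],
            children: items.iter().map(convert).collect(),
        },
    }
}

// Collects the diagnostics of a node and its descendants, depth-first.
fn collect_diagnostics(node: &Node, out: &mut Vec<Level>) {
    out.extend(node.diagnostics.iter().copied());
    for child in &node.children {
        collect_diagnostics(child, out);
    }
}

fn main() {
    let plan = Input::List(vec![Input::Value(1), Input::List(vec![Input::Value(-1)])]);
    let tree = convert(&plan);
    let mut diagnostics = Vec::new();
    collect_diagnostics(&tree, &mut diagnostics);
    assert_eq!(diagnostics, vec![Level::Info, Level::Error]);
}
```

The point of the sketch is only that diagnostics live on the output nodes, so
extracting them is a plain traversal of the converted tree.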

This crate only supports the binary protobuf serialization format as input;
that conversion is ultimately done [here](parse::traversal::parse_proto())
using a combination of [prost] and some unfortunate magic in
[substrait_validator_derive]. That is to say: it does *NOT* support JSON format
or variations thereof. This is because support for protobuf JSON is flaky
beyond the official bindings, likely in no small part due to all the case
conversion magic and special cases crammed into that format. Since there are no
official protobuf bindings for Rust, there is no way to do this from within the
crate without reinventing the wheel as a square.

Instead, the Python bindings, generated using
[maturin](https://github.com/PyO3/maturin), include the user-facing logic for
this. This is also the primary reason why the CLI is written in Python, rather
than in Rust. When a format other than binary protobuf is passed to the Python
package, it uses the official protobuf bindings for Python to (re)serialize to
the binary format, before handing control to the Rust crate. For the return
trip, the protobuf export format (using the message tree defined in the
[substrait.validator](https://github.com/substrait-io/substrait-validator/blob\
/main/proto/substrait/validator/validator.proto) protobuf namespace) is used to
pass the parse result to Python.

C bindings also exist. These are of the not-very-user-friendly sort, however;
they exist primarily to allow the validator to be used from within the testing
frameworks of whatever language you want, provided they support calling into
C-like libraries.

## Testing strategy

Currently, this crate has (almost) no test cases of its own. This is primarily
because validating only part of a plan would require complex context setup and
because, ideally, the (bits of) plan for the test cases are written in either
JSON or a yet-more user-friendly variant thereof. For the reasons given above,
this can't really be done from within Rust.

Instead, tests are run using the [test-runner crate](https://github.com/\
    substrait-io/substrait-validator/tree/main/tests) and its associated Python
frontend. The Python frontend pre-parses YAML test description files into a
JSON file that's easy to read from within Rust via serde-json, after which the
Rust crate takes over to run the test cases. The pre-parsing involves
converting the JSON-as-YAML protobuf tree into the binary serialization format,
but also allows diagnostic presence checks to be inserted in the plan where
they are expected (rather than having to link up the tree paths manually) and
allows YAML extensions to be specified inline (they'll be extracted and
replaced with a special URI that the test runner understands).

The APIs for the bindings on top of the Rust crate are tested using
[pytest](https://docs.pytest.org/) (Python) and
[googletest](https://google.github.io/googletest/) (C).

## Resolving extension URIs

URI resolution deserves an honorable mention here, because it unfortunately
can't easily be hidden away in some private module: anything that uses HTTPS
must either link into the operating system's certificate store or ship its own
root certificates. The latter is sure to be a security issue, so let's restrict
ourselves to the former solution.

The problem with this is that it pollutes the Rust crate with runtime linking
shenanigans that are not at all compatible from one system to another. In
particular, we can't build universal Python packages around crates that do
this. Since we rely on Python for the CLI, this is a bit of an issue.

For this reason, URI resolution is guarded behind the `curl` feature. When the
feature is enabled, `libcurl` will be used to resolve URIs, using the system's
certificate store for HTTPS. When disabled, the crate will fall back to
resolving only `file://` URIs, unless a more capable resolver is
[installed](Config::add_uri_resolver()). The Python bindings will do just that:
they install a resolver based on Python's own
[urllib](https://docs.python.org/3/library/urllib.html).
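
The fallback behavior described above can be pictured with the following
minimal sketch. It is not the crate's actual resolver code: `Scheme`, `Source`,
`Resolver`, `built_in` and `resolve` are hypothetical names, a URI is reduced
to its scheme, and the order in which resolvers are consulted is an assumption
made for the example.

```rust
// Hypothetical stand-ins; none of these names belong to the crate's API.
// A URI is reduced to its scheme to keep the sketch short.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Scheme {
    File,
    Https,
}

// Which resolver produced the data; the payload itself is omitted here.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Source {
    BuiltIn,
    Installed,
}

// A resolver returns None when it cannot handle the URI.
type Resolver = Box<dyn Fn(Scheme) -> Option<Source>>;

// Built-in fallback: only file:// URIs can be resolved.
fn built_in(scheme: Scheme) -> Option<Source> {
    match scheme {
        Scheme::File => Some(Source::BuiltIn),
        Scheme::Https => None,
    }
}

// Tries the installed resolver first, if any, then the built-in fallback.
fn resolve(installed: Option<&Resolver>, scheme: Scheme) -> Option<Source> {
    installed
        .and_then(|resolver| resolver(scheme))
        .or_else(|| built_in(scheme))
}

fn main() {
    // Without an installed resolver, HTTPS URIs cannot be resolved.
    assert_eq!(resolve(None, Scheme::Https), None);
    assert_eq!(resolve(None, Scheme::File), Some(Source::BuiltIn));

    // With a more capable resolver installed, they can.
    let capable: Resolver = Box::new(|_scheme: Scheme| Some(Source::Installed));
    assert_eq!(resolve(Some(&capable), Scheme::Https), Some(Source::Installed));
}
```

Keeping the resolver a plain callable is what lets a binding layer supply its
own implementation without the crate linking against any HTTPS library.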

## Build process

The build process for the crates and Python module also involves some
not-so-obvious magic, to do with shipping the Substrait protobuf and YAML
schema as appropriate. The problem is that Cargo and Python's packaging logic
require that all files shipped with the package be located within the package
source tree, which is not the case here due to the common submodule and proto
directories.

### Rust

If the [`in-git-repo` file](https://github.com/substrait-io/\
substrait-validator/blob/main/rs/in-git-repo) exists, the
[build.rs file for this crate](https://github.com/substrait-io/\
substrait-validator/blob/main/rs/build.rs) will copy the proto and schema files
from their respective source locations into `src/resources`, thus keeping them
in sync. The `in-git-repo` file is not included in the crate manifest, so this
step is skipped when the crate is compiled after being downloaded from
crates.io. Note, however, that the crate must always be built before it is
released: otherwise, the first time build.rs runs is on the user's machine, so
`cargo package` would ship resource files that have not been synchronized.

### Python

The process for Python is much the same, but handled by a
[wrapper around maturin](https://github.com/substrait-io/substrait-validator/\
blob/main/py/substrait_validator_build/__init__.py), as maturin does not expose
pre-build hooks of its own. The `in-git-repo` file isn't necessary here; we can
use the `local_dependencies` file that will be generated by the packaging tools
as part of a source distribution as a marker.

Here, too, it's important that the synchronization logic is run manually prior
to various release-like operations. This can be done by running
[prepare_build.py](https://github.com/substrait-io/substrait-validator/blob/\
main/py/prepare_build.py).

### Protobuf

Protobuf code generation is done via `prost`, which requires access to a
`protoc` executable. This will need to be installed on your system while
developing (e.g. via a package manager). In CI, it is installed as part of the
GitHub Actions workflow.
    "
)]

#[macro_use]
pub mod output;

#[macro_use]
mod parse;

pub mod export;
pub mod input;

mod util;

use std::str::FromStr;

use input::proto::substrait::Plan;
use strum::IntoEnumIterator;

// Aliases for common types used on the crate interface.
pub use input::config::glob::Pattern;
pub use input::config::Config;
pub use output::comment::Comment;
pub use output::diagnostic::Classification;
pub use output::diagnostic::Diagnostic;
pub use output::diagnostic::Level;
pub use output::parse_result::ParseResult;
pub use output::parse_result::Validity;

/// Parses and validates the given binary-serialized Substrait [Plan] message
/// and returns the parse tree and diagnostic results.
pub fn parse<B: prost::bytes::Buf + Clone>(buffer: B, config: &Config) -> ParseResult {
    parse::parse(buffer, config)
}

/// Validates the given Substrait [Plan] message and returns the parse tree and
/// diagnostic results.
pub fn validate(plan: &Plan, config: &Config) -> ParseResult {
    parse::validate(plan, config)
}

/// Returns an iterator that yields all known diagnostic classes.
pub fn iter_diagnostics() -> impl Iterator<Item = Classification> {
    Classification::iter()
}

/// Returns the version of the validator.
pub fn version() -> semver::Version {
    semver::Version::from_str(env!("CARGO_PKG_VERSION")).expect("invalid embedded crate version")
}

/// Returns the version of Substrait that this version of the validator was
/// built against.
pub fn substrait_version() -> semver::Version {
    semver::Version::from_str(include_str!("resources/substrait-version"))
        .expect("invalid embedded Substrait version")
}

/// Returns the Substrait version requirement that a plan must satisfy to be
/// known to be supported.
pub fn substrait_version_req() -> semver::VersionReq {
    let version = substrait_version();
    if version.major == 0 {
        semver::VersionReq::parse(&format!("={}.{}", version.major, version.minor)).unwrap()
    } else {
        semver::VersionReq::parse(&format!("={}", version.major)).unwrap()
    }
}

/// Returns the Substrait version requirement that a plan must satisfy to
/// possibly be supported.
pub fn substrait_version_req_loose() -> semver::VersionReq {
    let version = substrait_version();
    semver::VersionReq::parse(&format!("={}", version.major)).unwrap()
}
226    let version = substrait_version();
227    semver::VersionReq::parse(&format!("={}", version.major)).unwrap()
228}