// rdf_fusion/lib.rs
#![doc(test(attr(deny(warnings))))]
#![doc(
    html_favicon_url = "https://raw.githubusercontent.com/tobixdev/rdf-fusion/main/misc/logo/logo.png"
)]
#![doc(
    html_logo_url = "https://raw.githubusercontent.com/tobixdev/rdf-fusion/main/misc/logo/logo.png"
)]

//! RDF Fusion is an experimental columnar [SPARQL](https://www.w3.org/TR/sparql11-overview/) engine.
//! It is built on [Apache DataFusion](https://datafusion.apache.org/), an extensible query engine that
//! uses [Apache Arrow](https://arrow.apache.org/) as its in-memory data format.
//!
//! # Using RDF Fusion
//!
//! RDF Fusion can currently be used in two ways: via the convenient `Store` API or as a library for DataFusion.
//!
//! ## Store API
//!
//! The `Store` API provides high-level methods for interacting with the database, such as inserting data and running
//! queries.
//! Users who primarily want to *use* RDF Fusion are encouraged to use this API.
//!
//! While the `Store` API is based on [Oxigraph](https://github.com/oxigraph/oxigraph)'s `Store`, full compatibility is
//! not a goal.
//! Some aspects of RDF Fusion differ fundamentally, for example, its use of `async` methods.
//!
//! The `Store` API also supports extending SPARQL for domain-specific purposes.
//! For instance, users can register custom SPARQL functions, similar to those found in other SPARQL engines.
//! Additional extension points are planned for future releases.
//!
//! See the [examples](../../examples) directory for demonstrations of the `Store` API in action.
//!
//! ## Library Use
//!
//! RDF Fusion can also be used as a library for DataFusion.
//! In this mode, users interact directly with DataFusion's query engine and leverage the operators and rewriting
//! rules that RDF Fusion uses to implement SPARQL.
//!
//! This approach allows combining operators from DataFusion, RDF Fusion, and even other systems built on DataFusion within
//! a single query.
//! Users who want to *build new systems* using RDF Fusion's SPARQL implementation are encouraged to use this API.
//!
//! See the [examples](../../examples) directory for more details.
//!
//! # Background
//!
//! This documentation is intended to be somewhat self-sufficient for both users coming from the
//! Semantic Web and those coming from DataFusion.
//! To that end, the following sections introduce the basic concepts.
//! If you are already familiar with them, feel free to skip ahead.
//!
//! ## A Brief Introduction to DataFusion
//!
//! For readers who want an in-depth understanding of DataFusion, we recommend the official
//! [architecture documentation](https://docs.rs/datafusion/latest/datafusion/index.html#architecture).
//! For those who prefer to learn by doing, the following brief overview should be enough to get started.
//!
//! DataFusion is an extensible relational query engine built on top of Apache Arrow.
//! Let’s break this down:
//!
//! **Extensible** means that DataFusion can be customized in many ways, including adding new
//! operators, optimizations, and data sources. Much of this introduction will focus on these
//! aspects, as without them RDF Fusion could not be built on top of DataFusion.
//!
//! **Relational** means that DataFusion is based on a variant of the
//! [relational model](https://en.wikipedia.org/wiki/Relational_model).
//! We won’t dive into the theory here: if you know the ideas of relations (tables), attributes
//! (columns), tuples (rows), and attribute domains (column data types), you’re in good shape.
//! Otherwise, you may want to skim through the linked resource first.
//!
//! [**Apache Arrow**](https://arrow.apache.org/) is an in-memory data format for high-performance analytics.
//! DataFusion stores intermediate results in Arrow arrays. An array is a contiguous sequence of
//! values with a certain type and known length. Arrow defines how arrays of various
//! (including complex) data types are represented in memory. This standardization:
//! 1. enables sharing data between different libraries and systems,
//! 2. provides a memory layout well-suited for efficient vectorized operations, and
//! 3. includes efficient compute kernels that extensions (such as RDF Fusion) can directly use.
//!
//! The high-level process of executing a query in DataFusion is shown below:
//!
//! ```text
//!                 rewrite             rewrite
//!                 ┌───◄──┐           ┌───◄──┐
//!                 ▼      │           ▼      │
//! ┌───────┐   ┌──────────────┐   ┌────────────────┐   ┌─────────┐
//! │ Query ├──►│ Logical Plan ├──►│ Execution Plan ├──►│ Streams │ <--- Computes the results
//! └───────┘   └──────────────┘   └────────────────┘   └─────────┘
//! ```
//!
//! 1. A query is parsed by a front-end (most often SQL) and transformed into a **logical plan**.
//! 2. The logical plan is transformed using rewriting rules (e.g., optimizations).
//! 3. The optimized logical plan is translated into an **execution plan**.
//! 4. The execution plan is further transformed by rewriting rules (e.g., optimizations).
//! 5. Finally, the execution plan is executed, producing streams of results that can be lazily
//!    iterated.
//!
//! Usually, users do not interact with these APIs directly.
//! Instead, queries are typically executed via the DataFrame API or the SQL interface.
//! Internally, however, these APIs rely on the same mechanisms.
//! For example, when you use the `collect` method of the DataFrame API, it gathers all results
//! from the top-level stream to populate a DataFrame.
//!
//! ### Logical Plans vs. Execution Plans
//!
//! If you have never worked with query engines before, you might wonder why there are two different
//! types of plans.
//! The distinction is:
//! - The **logical plan** specifies *what* operation to perform (e.g., `Join`).
//! - The **execution plan** specifies *how* the operation should be performed
//!   (e.g., `HashJoin`, number of partitions to use, etc.).
//!
//! ### Rewriting Rules
//!
//! Rewriting rules serve multiple purposes. Their role is to take a plan (e.g., a logical plan)
//! and transform it into another plan.
//!
//! Often, the goal is to produce an equivalent plan that is more efficient (i.e., an optimization).
//! However, rewriting rules can also be used for tasks such as mapping custom operators onto
//! built-in operators.
//!
//! ### Extending DataFusion
//!
//! DataFusion provides extension points at all stages of query processing.
//! You can create custom front-ends, define new logical and execution plan nodes, and implement new streams.
//!
//! To implement a SPARQL operator entirely on our own, the process generally involves:
//! 1. Transforming it into a **custom logical plan**, possibly with custom optimization rules.
//! 2. Transforming that logical plan into a **custom execution plan**, again potentially with optimizations.
//! 3. Implementing a **custom stream** capable of executing the SPARQL operator.
//!
//! While this may sound complex, most SPARQL features can already be mapped to built-in DataFusion
//! operators, allowing us to replace steps 2 and 3 with an often simple rewriting rule.
//!
//! ## A Brief Introduction to RDF and SPARQL
//!
//! If you're familiar with relational databases, you might wonder how SPARQL queries can be implemented on a relational
//! query engine.
//! At first glance, these query languages appear quite different.
//!
//! However, despite surface-level differences, SPARQL engines share many similarities with relational query engines.
//! This common ground allows RDF Fusion to provide SPARQL support on top of DataFusion without re-implementing large
//! portions of the query engine.
//!
//! For readers familiar with relational databases who want to start using RDF Fusion without diving deep into the SPARQL
//! standard, this section provides a brief introduction to RDF and SPARQL.
//! If you are already familiar with these technologies, you can safely skip this section.
//!
//! Some details are simplified to make the introduction more accessible.
//! For example, we will completely ignore [blank nodes](https://www.w3.org/TR/rdf11-concepts/#dfn-blank-node).
//! For a detailed specification, please refer to the official RDF and SPARQL standards.
//!
//! ### The Resource Description Framework
//!
//! The [Resource Description Framework (RDF)](https://www.w3.org/TR/rdf11-concepts/) is the data model that underpins
//! SPARQL.
//! Data in RDF are represented as **triples**, where each triple consists of a **subject**, a **predicate**, and an
//! **object**.
//!
//! - The **subject** and **predicate** are
//!   typically [IRIs](https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier).
//! - The **object** can be either an IRI or a **literal**.
//!
//! Think of IRIs as global identifiers that look similar to web links, while literals are standard values like strings,
//! numbers, or dates.
//! Lastly, an **RDF term** is either an IRI or a literal.
//!
//! For example, the following triple states that Spiderman (an IRI) has the name "Spiderman" (a literal):
//!
//! ```text
//! (<http://example.org/spiderman>, <http://xmlns.com/foaf/0.1/name>, "Spiderman")
//! ```
//!
//! An **RDF graph** is simply a set of triples.
//! The following example, taken from the [Turtle Specification](https://www.w3.org/TR/turtle/), shows a small graph
//! containing information about Spiderman, the Green Goblin, and their relationship.
//!
//! ```turtle
//! # Base address to resolve relative IRIs
//! BASE <http://example.org/>
//!
//! # Some prefixes to make it easier to spell out other IRIs we are using
//! PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
//! PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
//! PREFIX foaf: <http://xmlns.com/foaf/0.1/>
//! PREFIX rel: <http://www.perceive.net/schemas/relationship/>
//!
//! <#spiderman>
//!     rel:enemyOf <#green-goblin> ;              # The Green Goblin is an enemy of Spiderman.
//!     a foaf:Person ;                            # Spiderman is a Person.
//!     foaf:name "Spiderman", "Человек-паук"@ru . # You can even add language tags to your literals
//! ```
//!
//! At first, it may not be obvious how a set of triples represents a graph.
//! In an RDF graph, subjects and objects correspond to **nodes**, while predicates label the **edges** connecting them.
//! It is important that the same IRI always corresponds to the same node, even across multiple triples.
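//!
//! The following self-contained sketch in plain Rust (illustrative only, not RDF Fusion's actual data
//! structures) shows this: a graph is a set of triples, and reusing an IRI across triples yields a single node.
//!
//! ```rust
//! use std::collections::HashSet;
//!
//! // A deliberately simplified stand-in for an RDF term.
//! type Term = &'static str;
//! type Triple = (Term, Term, Term);
//!
//! fn main() {
//!     // An RDF graph is a *set* of triples.
//!     let graph: HashSet<Triple> = [
//!         ("<#spiderman>", "rel:enemyOf", "<#green-goblin>"),
//!         ("<#spiderman>", "a", "foaf:Person"),
//!         ("<#spiderman>", "foaf:name", "\"Spiderman\""),
//!     ]
//!     .into_iter()
//!     .collect();
//!
//!     // The nodes are all subjects and objects; <#spiderman> appears in
//!     // three triples but contributes only one node.
//!     let nodes: HashSet<Term> = graph.iter().flat_map(|&(s, _, o)| [s, o]).collect();
//!     assert_eq!(nodes.len(), 4);
//! }
//! ```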
//!
//! ### Graph Patterns
//!
//! Given an RDF graph, we can ask questions about the data.
//! For example: *"Who are the enemies of Spiderman?"*
//!
//! These questions can be expressed as **graph patterns**, a core concept in the SPARQL standard.
//! A graph pattern is essentially a triple in which one or more components may be **variables**.
//!
//! For example, the following graph pattern expresses the question above (assuming the prefixes defined previously):
//!
//! ```text
//! <#spiderman> rel:enemyOf ?enemy
//! ```
//!
//! Evaluating graph patterns against an RDF graph is often referred to as **graph pattern matching**.
//! In this process, we look for triples in the graph that match the components of the graph pattern.
//! For example, the graph pattern above matches the following triple from the RDF graph introduced earlier:
//!
//! ```text
//! <#spiderman> rel:enemyOf <#green-goblin>
//! ```
//!
//! However, the result of graph pattern matching is not the triples themselves, but a **solution**.
//! A solution is a set of bindings for the variables in the graph pattern.
//! Here, we will depict solutions as a table, reflecting how SPARQL query execution can be mapped onto a relational query
//! engine (more on this later).
//! The result of the above graph pattern matching is the following table:
//!
//! | **?enemy**      |
//! |-----------------|
//! | <#green-goblin> |
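//!
//! To make graph pattern matching concrete, here is a self-contained sketch in plain Rust.
//! It is illustrative only: RDF Fusion evaluates patterns in a columnar fashion over Arrow arrays,
//! not triple by triple as shown here.
//!
//! ```rust
//! /// A component of a triple pattern: a concrete term or a variable.
//! #[derive(Clone, Copy)]
//! enum PatternTerm {
//!     Term(&'static str),
//!     Var(&'static str),
//! }
//!
//! type Triple = (&'static str, &'static str, &'static str);
//! /// A solution: bindings of variable names to terms.
//! type Solution = Vec<(&'static str, &'static str)>;
//!
//! /// Returns one solution per triple that matches the pattern.
//! fn match_pattern(graph: &[Triple], pattern: [PatternTerm; 3]) -> Vec<Solution> {
//!     graph
//!         .iter()
//!         .filter_map(|&(s, p, o)| {
//!             let mut solution = Solution::new();
//!             for (component, term) in pattern.iter().zip([s, p, o]) {
//!                 match component {
//!                     PatternTerm::Term(t) if *t == term => {} // concrete match
//!                     PatternTerm::Var(name) => solution.push((*name, term)),
//!                     _ => return None, // concrete component differs
//!                 }
//!             }
//!             Some(solution)
//!         })
//!         .collect()
//! }
//!
//! fn main() {
//!     let graph = [
//!         ("<#spiderman>", "rel:enemyOf", "<#green-goblin>"),
//!         ("<#spiderman>", "a", "foaf:Person"),
//!     ];
//!     // <#spiderman> rel:enemyOf ?enemy
//!     let solutions = match_pattern(
//!         &graph,
//!         [
//!             PatternTerm::Term("<#spiderman>"),
//!             PatternTerm::Term("rel:enemyOf"),
//!             PatternTerm::Var("enemy"),
//!         ],
//!     );
//!     assert_eq!(solutions, vec![vec![("enemy", "<#green-goblin>")]]);
//! }
//! ```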
//!
//! ### SPARQL
//!
//! Some relevant fundamental concepts of SPARQL were already covered in the previous section.
//! Here, we will dive a bit deeper to try to understand the connection between SPARQL and relational query engines.
//!
//! First, let's look at a simple SPARQL query.
//! The query below searches for all persons who have the Green Goblin as an enemy.
//! Note that the variable `?superhero` is used multiple times in the query.
//!
//! ```text
//! SELECT ?superhero
//! {
//!     ?superhero a foaf:Person .
//!     ?superhero rel:enemyOf <#green-goblin> .
//! }
//! ```
//!
//! The SPARQL standard specifies that the query above must find all solutions (i.e., bindings of `?superhero`) that
//! satisfy **both** graph patterns.
//!
//! In our approach, multiple graph patterns can be combined by **joining** them on the variables they share.
//! In the example above, the two graph patterns can be joined on the variable `?superhero`.
//! This produces the following (simplified) query plan:
//!
//! ```text
//! Inner Join: lhs.superhero = rhs.superhero
//!   SubqueryAlias: lhs
//!     TriplePattern: ?superhero a foaf:Person
//!   SubqueryAlias: rhs
//!     TriplePattern: ?superhero rel:enemyOf <#green-goblin>
//! ```
//!
//! Users familiar with DataFusion should feel right at home.
//! We have effectively transformed the core operator of SPARQL into relational query operators!
//! Fortunately, DataFusion allows us to extend its set of operators with custom ones, such as `TriplePattern`.
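//!
//! The join itself can be sketched as an ordinary hash join on the shared variable, the same strategy a
//! relational engine might pick for this plan. This is a plain-Rust illustration, not RDF Fusion code,
//! and `<#batman>` is a made-up extra person that only exists to make the join interesting.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! fn main() {
//!     // Solutions of the two triple patterns, as bindings of ?superhero.
//!     let lhs = ["<#spiderman>", "<#batman>"]; // ?superhero a foaf:Person
//!     let rhs = ["<#spiderman>"]; // ?superhero rel:enemyOf <#green-goblin>
//!
//!     // Hash join: build a table over one side, probe it with the other.
//!     let mut table: HashMap<&str, ()> = HashMap::new();
//!     for superhero in lhs {
//!         table.insert(superhero, ());
//!     }
//!     let joined: Vec<&str> = rhs
//!         .into_iter()
//!         .filter(|superhero| table.contains_key(*superhero))
//!         .collect();
//!
//!     assert_eq!(joined, vec!["<#spiderman>"]);
//! }
//! ```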
//!
//! Next, let's consider an important challenge within this approach:
//! What is the result of the following query that retrieves all information about Spiderman?
//!
//! ```text
//! SELECT ?object
//! {
//!     <#spiderman> ?predicate ?object
//! }
//! ```
//!
//! Here is the result of the query:
//!
//! | **?object**       |
//! |-------------------|
//! | <#green-goblin>   |
//! | foaf:Person       |
//! | "Spiderman"       |
//! | "Человек-паук"@ru |
//!
//! Those familiar with relational databases might naturally wonder about the data type of this column.
//! After all, a single column can contain IRIs, plain strings, and language-tagged strings.
//! It may also include other literal types such as booleans, numbers, and dates.
//!
//! Such variability is not typically possible in standard relational models.
//! While the mathematical domain of the column is simple (the set of RDF terms), representing it efficiently in a
//! relational query engine is not trivial.
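//!
//! In a host language with sum types, the domain of such a column is easy to write down. The following
//! simplified plain-Rust sketch (datatypes omitted) only states the problem: Arrow has no such native
//! RDF term type, so the same domain must be encoded in Arrow's own data types.
//!
//! ```rust
//! /// The domain of a SPARQL solution column, written down as a Rust sum type.
//! #[derive(Debug)]
//! enum RdfTerm {
//!     Iri(String),
//!     Literal {
//!         lexical: String,
//!         language: Option<String>,
//!     },
//! }
//!
//! fn main() {
//!     // The single ?object column holds values of very different shapes.
//!     let object_column = vec![
//!         RdfTerm::Iri("<#green-goblin>".to_string()),
//!         RdfTerm::Iri("foaf:Person".to_string()),
//!         RdfTerm::Literal { lexical: "Spiderman".to_string(), language: None },
//!         RdfTerm::Literal { lexical: "Человек-паук".to_string(), language: Some("ru".to_string()) },
//!     ];
//!     assert_eq!(object_column.len(), 4);
//! }
//! ```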
//!
//! This challenge motivates **RDF Fusion's RDF term encodings**, which bridge the gap between the dynamic nature of SPARQL
//! solutions and the expressive type system of Apache Arrow.
//! We will cover this topic in more detail in the next section.
//!
//! # SPARQL on top of DataFusion
//!
//! As RDF Fusion is built on top of DataFusion, it shares the query engine architecture that we briefly
//! covered in the previous section.
//! Nevertheless, there are interesting aspects of how we extend DataFusion to support SPARQL.
//! Here, we will briefly discuss various aspects of RDF Fusion and then link to the more detailed documentation.
//!
//! ## Encoding RDF Terms in Arrow
//!
//! Recall that within a solution set, one solution may map a variable to a string value, while another may map the same
//! variable to an integer value.
//! If this does not make sense to you, the previous section should help.
//!
//! If we were theoretical mathematicians, we could simply state that the domain of a column within a SPARQL solution is the
//! set of RDF terms, and we would be done.
//! Easy enough, right?
//! Unfortunately, in practice, this is not so easy, as Arrow would need a native data type for RDF terms for this to work.
//! As this is not the case, we must somehow encode the domain of RDF terms in the data types supported by Arrow.
//!
//! Furthermore, this encoding must support efficiently evaluating multiple types of operations.
//! For example, one operation is joining solution sets, while another is evaluating arithmetic expressions.
//! As it turns out, doing this according to the SPARQL standard is not trivial.
//!
//! One of the major challenges lies in the "two worlds" that are associated with RDF literals.
//! To recap, on the one hand, RDF literals have a lexical value and an optional datatype (ignoring language tags for now).
//! On the other hand, the very same literal has a typed value that is part of a different domain.
//! The domain is determined by the datatype.
//! For example, the RDF term `"1"^^xsd:integer` has a typed value of `1` in the set of integers.
//! In addition, another RDF term `"01"^^xsd:integer` also has a typed value of `1` in the same domain.
//! Note that the necessary mapping functions (RDF term → typed value and vice versa) are not bijective.
//! In other words, there is no one-to-one mapping between RDF terms and typed values.
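//!
//! The two worlds can be sketched with a hypothetical `IntegerLiteral` type in plain Rust (RDF Fusion
//! instead keeps separate Arrow encodings for the two views, as described below):
//!
//! ```rust
//! /// A hypothetical integer literal, viewed through its two "worlds".
//! struct IntegerLiteral {
//!     /// The lexical value, e.g. "1" or "01".
//!     lexical: String,
//! }
//!
//! impl IntegerLiteral {
//!     /// The typed value, obtained by parsing the lexical value.
//!     fn typed_value(&self) -> i64 {
//!         self.lexical.parse().expect("valid xsd:integer lexical form")
//!     }
//! }
//!
//! fn main() {
//!     let a = IntegerLiteral { lexical: "1".to_string() };
//!     let b = IntegerLiteral { lexical: "01".to_string() };
//!
//!     // Arithmetic must treat them alike: both denote the integer 1 ...
//!     assert_eq!(a.typed_value(), b.typed_value());
//!
//!     // ... but joins on RDF term equality must keep them apart.
//!     assert_ne!(a.lexical, b.lexical);
//! }
//! ```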
//!
//! Let us stay a bit longer on this example, because it illustrates a very important aspect of RDF Fusion.
//! Some SPARQL operations need the lexical value of a literal, while others need the typed value.
//! For example, the SPARQL join operation needs the lexical value, as joins are defined on RDF term
//! equality, which in turn requires comparing the lexical values of the literals.
//! As a result, RDF Fusion cannot simply encode the typed value of a literal, because it would lose information about the
//! lexical value.
//!
//! On the other hand, the SPARQL arithmetic operations need the typed value of a literal, as arithmetic
//! operations are defined on typed values.
//! For example, the SPARQL `+` operation does not care whether the lexical value of an integer literal is `"1"` or `"01"`.
//! It cares about the typed value of the literal, which is `1` in both cases.
//! Furthermore, while it is possible to extract the typed value from a literal (i.e., parsing), doing so is
//! additional overhead that must be accounted for when evaluating each sub-expression, as DataFusion uses
//! Arrow arrays to pass data between operators.
//! Evaluating a complex expression would thus be scattered with parsing and stringification operations if only the lexical
//! value of the literal were materialized.
//! As a result, encoding just the lexical value of a literal would also create problems.
//!
//! To address these challenges, RDF Fusion uses multiple encodings for the same domain of RDF terms.
//! One of the encodings retains the lexical value of the literal, while another retains the typed value.
//! There are also additional encodings that we use to improve the performance of certain operations.
//! For further details, please refer to the [rdf-fusion-encoding](../encoding) crate.
//!
//! ## Using DataFusion's Extension Points
//!
//! As mentioned earlier, RDF Fusion leverages many of DataFusion’s extension points to implement SPARQL.
//!
//! First, we define **custom logical operators** for various graph patterns (e.g., pattern matching, filters).
//! These custom logical operators and their rewriting rules are detailed in the [rdf-fusion-logical](../../logical) crate.
//!
//! Next, we provide **custom execution operators** for operations that cannot be mapped to built-in operators
//! or for those where we want to preserve SPARQL semantics during the planning step.
//! The custom execution operators and their rewriting rules are detailed in the [rdf-fusion-physical](../../physical) crate,
//! which also contains the implementations of the streams.
//!
//! One of the most important operators that integrates these components is the `QuadPattern`.
//! Its execution plan is defined in the [rdf-fusion-storage](../../storage) crate.
//! This operator is special because the storage layer implementation is responsible for planning it.
//!
//! Additionally, RDF Fusion uses the `ScalarFunction` and `AggregateFunction` traits to implement SPARQL functions.
//! Implementations of these functions are detailed in the [rdf-fusion-functions](../../functions) crate.
//!
//! # Crates
//!
//! To conclude, here is a list of the crates that constitute RDF Fusion, with a quick description of each one.
//! You can find more details in their respective documentation.
//!
//! - [rdf-fusion-encoding](../encoding): The RDF term encodings used by RDF Fusion.
//! - [rdf-fusion-extensions](../extensions): Contains a set of traits and core data types used to extend RDF Fusion (e.g., a
//!   custom storage layer).
//! - [rdf-fusion-functions](../functions): Scalar and aggregate functions for RDF Fusion.
//! - [rdf-fusion-logical](../logical): The logical plan operators and rewriting rules used by RDF Fusion.
//! - [rdf-fusion-model](../model): Provides a model for RDF and SPARQL. This is a separate crate as it does not have a
//!   dependency on DataFusion.
//! - [rdf-fusion-physical](../physical): The physical plan operators and rewriting rules used by RDF Fusion.
//! - [rdf-fusion](../rdf-fusion): This crate. The primary entry point for RDF Fusion.
//! - [rdf-fusion-storage](../storage): The storage layer implementations for RDF Fusion.
//! - [rdf-fusion-web](../web): The web server for RDF Fusion.

pub mod error;
pub mod store;

pub mod api {
    pub use rdf_fusion_extensions::*;
}

pub mod encoding {
    pub use rdf_fusion_encoding::*;
}

pub mod functions {
    pub use rdf_fusion_functions::*;
}

pub mod io {
    pub use oxrdfio::*;
}

pub mod model {
    pub use rdf_fusion_model::*;
}

pub mod logical {
    pub use rdf_fusion_logical::*;
}

pub mod execution {
    pub use rdf_fusion_execution::*;
}

pub mod storage {
    pub use rdf_fusion_storage::*;
}
423}