tree_sitter_graph/reference/mod.rs
1// -*- coding: utf-8 -*-
2// ------------------------------------------------------------------------------------------------
3// Copyright © 2021, tree-sitter authors.
4// Licensed under either of Apache License, Version 2.0, or MIT license, at your option.
5// Please see the LICENSE-APACHE or LICENSE-MIT files in this distribution for license details.
6// ------------------------------------------------------------------------------------------------
7
8//! This section defines the graph DSL implemented by this library.
9//!
10//! [tree-sitter]: https://docs.rs/tree-sitter/
11//! [tree-sitter-python]: https://docs.rs/tree-sitter-python/
12//!
13//! # Overview
14//!
15//! You can use [tree-sitter][] to parse the content of source code into a _concrete syntax tree_.
16//! For instance, using the [tree-sitter-python][] grammar, you can parse Python source code:
17//!
18//! ``` python
19//! # test.py
20//! from one.two import d, e.c
21//! import three
22//! print(d, e.c)
23//! print three.f
24//! ```
25//!
26//! ``` text
27//! $ tree-sitter parse test.py
28//! (module [0, 0] - [4, 0]
29//! (import_from_statement [0, 0] - [0, 26]
30//! module_name: (dotted_name [0, 5] - [0, 12]
31//! (identifier [0, 5] - [0, 8])
32//! (identifier [0, 9] - [0, 12]))
33//! name: (dotted_name [0, 20] - [0, 21]
34//! (identifier [0, 20] - [0, 21]))
35//! name: (dotted_name [0, 23] - [0, 26]
36//! (identifier [0, 23] - [0, 24])
37//! (identifier [0, 25] - [0, 26])))
38//! (import_statement [1, 0] - [1, 12]
39//! name: (dotted_name [1, 7] - [1, 12]
40//! (identifier [1, 7] - [1, 12])))
41//! (expression_statement [2, 0] - [2, 13]
42//! (call [2, 0] - [2, 13]
43//! function: (identifier [2, 0] - [2, 5])
44//! arguments: (argument_list [2, 5] - [2, 13]
45//! (identifier [2, 6] - [2, 7])
46//! (attribute [2, 9] - [2, 12]
47//! object: (identifier [2, 9] - [2, 10])
48//! attribute: (identifier [2, 11] - [2, 12])))))
49//! (print_statement [3, 0] - [3, 13]
50//! argument: (attribute [3, 6] - [3, 13]
51//! object: (identifier [3, 6] - [3, 11])
52//! attribute: (identifier [3, 12] - [3, 13]))))
53//! ```
54//!
55//! There are many interesting use cases where you want to use this parsed syntax tree to create
56//! some _other_ graph-like structure. This library lets you do that, using a declarative DSL to
57//! identify patterns in the parsed syntax tree, along with rules for which graph nodes and edges
58//! to create for the syntax nodes that match those patterns.
59//!
60//! # Terminology
61//!
62//! One confusing aspect of the graph DSL is that we have to talk about two different kinds of
63//! "node": the nodes of the concrete syntax tree that is our input when executing a graph DSL
64//! file, and the nodes of the graph that we produce as an output. We will be careful to always
65//! use "syntax node" and "graph node" in this documentation to make it clear which kind of node we
66//! mean.
67//!
68//! # High-level structure
69//!
70//! A graph DSL file consists of one or more **_stanzas_**. Each stanza starts with a tree-sitter
71//! [**_query pattern_**][query], which identifies a portion of the concrete syntax tree. The
72//! query pattern should use [**_captures_**][captures] to identify particular syntax nodes in the
73//! matched portion of the tree. Following the query is a **_block_**, which is a sequence of
74//! statements that construct **_graph nodes_**, link them with **_edges_**, and annotate both with
75//! **_attributes_**.
76//!
77//! [query]: https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries
78//! [captures]: https://tree-sitter.github.io/tree-sitter/using-parsers#capturing-nodes
79//!
80//! Query patterns can be suffixed by quantification operators. If a pattern with a quantification
81//! suffix is captured, the suffix determines the capture value. In the case of `?`, the capture value
82//! is a null value, or a syntax node. In the case of `+` and `*`, the value is a list of syntax nodes.
83//!
84//! [quantification]: https://tree-sitter.github.io/tree-sitter/using-parsers#quantification-operators
85//!
86//! Comments start with a semicolon, and extend to the end of the line.
87//!
88//! Identifiers start with either an ASCII letter or underscore, and all remaining characters are
89//! ASCII letters, numbers, underscores, or hyphens. (More precisely, they satisfy the regular
90//! expression `/[a-zA-Z_][a-zA-Z0-9_-]*/`.) Identifiers are used as the names of
91//! [attributes](#attributes), [functions](#functions), and [variables](#variables).
92//!
93//! To execute a graph DSL file against a concrete syntax tree, we execute each stanza in the graph
94//! DSL file exhaustively. For each stanza, we identify each place where the concrete syntax tree
95//! matches the query pattern. For each of these places, we end up with a different set of syntax
96//! nodes assigned to the query pattern's captures. We execute the block of statements for each of
97//! these capture assignments, creating any graph nodes, edges, or attributes mentioned in the
98//! block.
99//!
100//! Regular execution will apply the stanzas _in order_, and it is important to make sure that scoped variables
101//! have been assigned before they are used. This is not a requirement when using the lazy evaluation strategy,
102//! which handles this implicitly. The lazy evaluation strategy is also more efficient when there are many stanzas,
103//! because it can reduce tree traversals. Therefore, using the lazy evaluation strategy is recommended, and will
104//! likely become the only supported strategy in future releases.
105//!
106//! For instance, the following stanza would match all of the identifiers in our example syntax
107//! tree:
108//!
109//! ``` tsg
110//! (identifier) @id
111//! {
112//! ; Some statements that will be executed for each identifier in the
113//! ; source file. You can use @id to refer to the matched identifier
114//! ; syntax node.
115//! }
116//! ```
117//!
118//! # Expressions
119//!
120//! The value of an expression in the graph DSL can be any of the following:
121//!
122//! - null
123//! - a boolean
124//! - a string
125//! - an integer (unsigned, 32 bits)
126//! - a reference to a syntax node
127//! - a reference to a graph node
128//! - an ordered list of values
129//! - a list comprehension
130//! - an unordered set of values
131//! - a set comprehension
132//!
133//! The null value is spelled `#null`.
134//!
135//! The boolean literals are spelled `#true` and `#false`.
136//!
137//! String constants are enclosed in double quotes. They can contain backslash escapes `\\`, `\0`,
138//! `\n`, `\r`, and `\t`:
139//!
140//! - `"a string"`
141//! - `"a string with\na newline"`
142//! - `"a string with\\a backslash"`
143//!
144//! Integer constants are encoded in ASCII decimal:
145//!
146//! - `0`
147//! - `10`
148//! - `42`
149//!
150//! Lists consist of zero or more expressions, separated by commas, enclosed in square brackets.
151//! The elements of a list do not have to have the same type:
152//!
153//! ``` tsg
154//! [0, "string", 0, #true, @id]
155//! ```
156//!
157//! Sets have the same format, but are enclosed in curly braces instead of square brackets:
158//!
159//! ``` tsg
160//! {0, "string", #true, @id}
161//! ```
162//!
163//! Both lists and sets both allow trailing commas after the final element, if you prefer that
164//! style to play more nicely with source control diffs:
165//!
166//! ``` tsg
167//! [
168//! 0,
169//! "string",
170//! #true,
171//! @id,
172//! ]
173//! ```
174//!
175//! List comprehensions allow mapping over a list and producing a new list with elements based on the
176//! given element expression:
177//!
178//! ``` tsg
179//! [ (some-function x) for x in @xs ]
180//! [ @x.scoped_var for x in @xs ]
181//! ```
182//!
183//! Set comprehensions have similar syntax, but the resulting value will be a set instead of an ordered list:
184//!
185//! ``` tsg
186//! { (some-function x) for x in @xs }
187//! { @x.scoped_var for x in @xs }
188//! ```
189//!
190//! List and set comprehensions are subject to the same restrictions as for loops, which means the list
191//! value that is iterated over must be local. It is therefore not possible to iterator over the value
192//! of a scoped variable. Using scoped variables in the element expression however is no problem.
193//!
194//! # Syntax nodes
195//!
196//! Syntax nodes are identified by tree-sitter query captures (`@name`). For instance, in our
197//! example stanza, whose query is `(identifier) @id`, `@id` would refer to the `identifier` syntax
198//! node that the stanza matched against.
199//!
200//! Unused query captures are considered errors, unless they start with an underscode. For example,
201//! a capture `@id` must be used within the stanza, but `@_id` does not.
202//!
203//! # Variables
204//!
205//! You can use variables to pass information between different stanzas and statements in a graph
206//! DSL file. There are three kinds of variables:
207//!
208//! - **_Global_** variables are provided externally by whatever process is executing the graph
209//! DSL file. You can use this, for instance, to pass in the path of the source file being
210//! analyzed, to use as part of the graph structure that you create.
211//!
212//! - **_Local_** variables are only visible within the current execution of the current stanza.
213//! Once all of the statements in the stanza have been executed, all local variables disappear.
214//!
215//! - **_Scoped_** variables are "attached to" syntax nodes. Their values carry over from stanza
216//! to stanza. Scoped variables are referenced by using a syntax node expression (typically a
217//! query capture) and a variable name, separated by a period: `@node.variable`.
218//!
219//! Global variables are declared using a `global` declaration. The external process that executes the
220//! graph DSL file must provide values for all declared global variables. (It is not possible to define
221//! a global variable and give it a value from within the graph DSL.) The name of the global variable can
222//! be suffixed by a quantifier: '*' and '+' for lists, and '?' for optional values, which allows them to
223//! be used in iteration and conditional statements, respectively.
224//!
225//! Local and scoped variables are created using `var` or `let` statements. A `let` statement
226//! creates an **_immutable variable_**, whose value cannot be changed. A `var` statement creates
227//! a **_mutable variable_**. You use a `set` statement to change the value of a mutable variable.
228//! Local variables are not allowed to have the same name as a declared global variable.
229//!
230//! Local variables are block scoped. For example, a local variable defined in a `scan` arm is not
231//! visible in other scan arms, or after the `scan` statement. If you need to persist a value for use
232//! after a block, introduce a mutable variable before the block and assign to it inside the block.
233//!
234//! ``` tsg
235//! global global_variable
236//!
237//! (identifier) @id
238//! {
239//! let local_variable = "a string"
240//! ; The following would be an error, since `let` variables are immutable:
241//! ; set local_variable = "a new value"
242//!
243//! ; The following is also an error, since you can't mutate a variable that
244//! ; doesn't exist:
245//! ; set missing_variable = 42
246//!
247//! ; The following is an error, since you cannot hide a global variable with
248//! ; a local one:
249//! ; let global_variable = "a new value"
250//!
251//! var mutable_variable = "first value"
252//! set mutable_variable = global_variable ; we can refer to the global variable
253//!
254//! var @id.kind = "id"
255//! }
256//! ```
257//!
258//! Variables can be referenced anywhere that you can provide an expression. It's an error if you
259//! try to reference a variable that hasn't been defined.
260//!
261//! # Functions
262//!
263//! The process executing a graph DSL file can provide **_functions_** that can be called from
264//! within graph DSL stanzas.
265//!
266//! Function calls use a Lisp-like syntax, where the name of the function being called is _inside_
267//! of the parentheses. The parameters to a function call are arbitrary expressions. For
268//! instance, if the executing process provides a function named `+`, you could call it as:
269//!
270//! ``` tsg
271//! (identifier) @id
272//! {
273//! let x = 4
274//! let @id.nine = (+ x 5)
275//! }
276//! ```
277//!
278//! Note that it's the process executing the graph DSL file that decides which functions are
279//! available. We do define a [standard library][], and most of the time those are the functions
280//! that are available, but you should double-check the documentation of whatever graph DSL tool
281//! you're using to make sure.
282//!
283//! [standard library]: functions/index.html
284//!
285//! # Graph nodes
286//!
287//! You can use this graph DSL to create any graph structure that you want. There are no
288//! limitations on which graph nodes you create, nor on how you use edges to connect the graph
289//! nodes. You are not limited to creating a tree, and in particular, you are not limited to
290//! creating a tree that "lines" up with the parsed syntax tree.
291//!
292//! There are two ways to create a new graph node. The first, and most common, is to use a `node`
293//! statement:
294//!
295//! ``` tsg
296//! (identifier) @id
297//! {
298//! node @id.node
299//! }
300//! ```
301//!
302//! This creates a graph node, and assigns a reference to the new node to a new immutable variable
303//! named `new_node`.
304//!
305//! The second way to create a graph node is to call the [`node`][] function:
306//!
307//! [`node`]: functions/index.html#node
308//!
309//! ``` tsg
310//! (identifier) @id
311//! {
312//! let @id.node = (node)
313//! }
314//! ```
315//!
316//! Note that these two examples do exactly the same thing! In fact, the `node` statement in the
317//! first example is just syntactic sugar for the `let` statement in the second example. Since the
318//! most common pattern is to create a graph node and immediately assign it to an immutable scoped
319//! variable, we provide the `node` statement as a convenient shorthand. If you need to do
320//! anything more complex, such as assigning the graph node reference to a _mutable_ variable, you
321//! can call the [`node`][] function directly.
322//!
323//! By attaching a graph node to a syntax node using a [scoped variable](#variables), you can refer
324//! to them from multiple stanzas:
325//!
326//! ``` tsg
327//! (identifier) @id
328//! {
329//! node @id.node
330//! }
331//!
332//! (dotted_name (identifier) @dotted_element)
333//! {
334//! ; We will learn more about the attr statement below
335//! attr (@dotted_element.node) kind = "dotted"
336//! }
337//! ```
338//!
339//! In this example, we will create a graph node for _every_ `identifier` syntax node. Each of
340//! those syntax nodes will have a `node` scoped variable, containing a reference to its graph
341//! node. But only the graph nodes of those identifiers that appear inside of a `dotted_name` will
342//! have a `kind` attribute. And even though we used different capture names, inside of different
343//! queries, to find those `identifier` nodes, the graph node references in both stanzas refer to
344//! the same graph nodes.
345//!
346//! # Edges
347//!
348//! Edges are created via an `edge` statement, which specifies the two graph nodes that should be
349//! connected. Edges are directed, and the `->` arrow in the `edge` statement indicates the
350//! direction of the edge.
351//!
352//! ``` tsg
353//! (import_statement name: (_) @name)
354//! {
355//! node @name.source
356//! node @name.sink
357//! edge @name.source -> @name.sink
358//! }
359//! ```
360//!
361//! There can be at most one edge connecting any particular source and sink graph node in the
362//! graph. If multiple stanzas create edges between the same graph nodes, those are "collapsed"
363//! into a single edge.
364//!
365//! # Attributes
366//!
367//! Graph nodes and edges have an associated set of **_attributes_**. Each attribute has a name
368//! (which is an identifier), and a value.
369//!
370//! You add attributes to a graph node or edge using an `attr` statement:
371//!
372//! ``` tsg
373//! (import_statement name: (_) @name)
374//! {
375//! node @name.source
376//! node @name.sink
377//! attr (@name.sink) kind = "module"
378//! attr (@name.source -> @name.sink) precedence = 10
379//! }
380//! ```
381//!
382//! Note that you have to have already created the graph node or edge, and the graph node or edge
383//! must not already have an attribute with the same name.
384//!
385//! (Attributes might seem similar to scoped variables, but they are quite different. Attributes
386//! are attached to graph nodes and edges, while scoped variables are attached to syntax nodes.
387//! More importantly, scoped variables only exist while executing the graph DSL file. Once the
388//! execution has completed, the variables disappear. Attributes, on the other hand, are part of
389//! the output produced by the graph DSL file, and live on after execution has finished.)
390//!
391//! ## Attribute shorthands
392//!
393//! Commonly used combinations of attributes can be captured in **_shorthands_**. Each shorthand defines
394//! the attribute name, a variable which captures the attribute value, and a list of attributes to which
395//! it expands.
396//!
397//! Attribute shorthands are defined at the same level as stanzas. For example, the following shorthand
398//! takes a syntax node as argument, and expands to attributes for its source text and child index:
399//!
400//! ``` tsg
401//! attribute node_props = node => node_text = (source-text node), node_index = (named-child-index node)
402//!
403//! (argument (_)@expr) {
404//! attr (@expr) node_props = @expr
405//! }
406//! ```
407//!
408//! # Regular expressions
409//!
410//! You can use a `scan` statement to match the content of a string value against a set of regular
411//! expressions.
412//!
413//! Starting at the beginning of the string, we determine the first character where one of the
414//! regular expressions matches. If more than one regular expression matches at this earliest
415//! character, we use the regular expression that appears first in the `scan` statement. We then
416//! execute the statements in the "winning" regular expression's block.
417//!
418//! After executing the matching block, we try to match all of the regular expressions again,
419//! starting _after_ the text that was just matched. We continue this process, applying the
420//! earliest matching regular expression in each iteration, until we have exhausted the entire
421//! string, or none of the regular expressions match.
422//!
423//! Within each regular expression's block, you can use `$0`, `$1`, etc., to refer to any capture
424//! groups in the regular expression.
425//!
426//! The value being scanned must be local, which means it cannot be derived from scoped variables.
427//!
428//! For example, if `filepath` is a global variable containing the path of a Python source file,
429//! you could use the following `scan` statement to construct graph nodes for the name of the
430//! module defined in the file:
431//!
432//! ``` tsg
433//! global filepath
434//!
435//! (module) @mod
436//! {
437//! var new_node = #null
438//! var current_node = (node)
439//!
440//! scan filepath {
441//! "([^/]+)/"
442//! {
443//! ; This arm will match any directory component of the file path. In
444//! ; Python, this gives you the sequence of packages that the module
445//! ; lives in.
446//!
447//! ; Note that we keep appending additional tag names to the `current`
448//! ; graph node to create an arbitrary number of graph nodes linked to
449//! ; the @mod syntax node.
450//! set new_node = (node)
451//! attr (new_node) name = $1
452//! edge current_node -> new_node
453//! set current_node = new_node
454//! }
455//!
456//! "__init__\\.py$"
457//! {
458//! ; This arm will match a trailing __init__.py, indicating that the
459//! ; module's name comes from the last directory component.
460//!
461//! ; Expose the graph node that we created for that component as a
462//! ; scoped variable that later stanzas can see.
463//! let @mod.root = current_node
464//! }
465//!
466//! "([^/]+)\\.py$"
467//! {
468//! ; This arm will match any other trailing module name. Note that
469//! ; __init__.py also matches this regular expression, but since it
470//! ; appears later, the __init__.py clause will take precedence.
471//!
472//! set new_node = (node)
473//! attr (new_node) name = $1
474//! edge current_node -> new_node
475//! let @mod.root = new_node
476//! }
477//! }
478//! }
479//! ```
480//!
481//! # Conditionals
482//!
483//! You can use `if` statements to make blocks of statements conditional on optional values.
484//! Conditions are comma-separated lists of clauses. The clause `some EXPRESSION` indicates
485//! that the optional value must be present. The clause `none EXPRESSION` indicates that the
486//! optional value is absent. A bare expression is evaluated as to boolean. All values in
487//! conditions must be local, which means they cannot be derived from scoped variables.
488//!
489//! ``` tsg
490//! (lexical_declaration type:(_)? @type value:(_)? @value)
491//! {
492//! if some @type, none @value {
493//! ; ...
494//! } elif some @value {
495//! ; ...
496//! } else {
497//! ; ...
498//! }
499//! }
500//! ```
501//!
502//! # List iteration
503//!
504//! You can use a `for` statement to execute blocks of statements for every element in list
505//! values. The list value must be local, which means it cannot be derived from scoped variables.
506//!
507//! ```tsg
508//! (module (_)* @stmts)
509//! {
510//! for stmt in @stmts {
511//! print stmt
512//! }
513//! }
514//! ```
515//!
516//! # Debugging
517//!
518//! To support members of the Ancient and Harmonious Order of Printf Debuggers, you can use `print`
519//! statements to print out (to `stderr`) the content of any expressions during the execution of a graph DSL
520//! file:
521//!
522//! ``` tsg
523//! (identifier) @id
524//! {
525//! let x = 4
526//! print "Hi! x = ", x
527//! }
528//! ```
529
530pub mod functions;