//! [![github]](https://github.com/dtolnay/clang-ast) [![crates-io]](https://crates.io/crates/clang-ast) [![docs-rs]](https://docs.rs/clang-ast)
//!
//! [github]: https://img.shields.io/badge/github-8da0cb?style=for-the-badge&labelColor=555555&logo=github
//! [crates-io]: https://img.shields.io/badge/crates.io-fc8d62?style=for-the-badge&labelColor=555555&logo=rust
//! [docs-rs]: https://img.shields.io/badge/docs.rs-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs
//!
//! <br>
//!
//! This library provides deserialization logic for efficiently processing
//! Clang's `-ast-dump=json` format.
//!
//! <br>
//!
//! # Format overview
//!
//! An AST dump is generated by a compiler command like:
//!
//! <pre>
//! <code>$  <b>clang++ -Xclang -ast-dump=json -fsyntax-only path/to/source.cc</b></code>
//! </pre>
//!
//! The high-level structure is a tree of nodes, each of which has an `"id"` and
//! a `"kind"`, zero or more further fields depending on what the node kind is,
//! and finally an optional `"inner"` array of child nodes.
//!
//! As an example, for an input file containing just the declaration `class S;`,
//! the AST would be as follows:
//!
//! ```
//! # stringify! {
//! {
//!   "id": "0x1fcea38",                 //<-- root node
//!   "kind": "TranslationUnitDecl",
//!   "inner": [
//!     {
//!       "id": "0xadf3a8",              //<-- first child node
//!       "kind": "CXXRecordDecl",
//!       "loc": {
//!         "offset": 6,
//!         "file": "source.cc",
//!         "line": 1,
//!         "col": 7,
//!         "tokLen": 1
//!       },
//!       "range": {
//!         "begin": {
//!           "offset": 0,
//!           "col": 1,
//!           "tokLen": 5
//!         },
//!         "end": {
//!           "offset": 6,
//!           "col": 7,
//!           "tokLen": 1
//!         }
//!       },
//!       "name": "S",
//!       "tagUsed": "class"
//!     }
//!   ]
//! }
//! # };
//! ```
//!
//! <br><br>
//!
//! # Library design
//!
//! By design, the clang-ast crate *does not* provide a single great big data
//! structure that exhaustively covers every possible field of every possible
//! Clang node type. There are three major reasons:
//!
//! - **Performance** &mdash; these ASTs get quite large. For a reasonable
//!   mid-sized translation unit that includes several platform headers, you can
//!   easily get an AST that is tens to hundreds of megabytes of JSON. To
//!   maintain performance of downstream tooling built on the AST, it's critical
//!   that you deserialize only the few fields which are directly required by
//!   your use case, and allow Serde's deserializer to efficiently ignore all
//!   the rest.
//!
//! - **Stability** &mdash; as Clang is developed, the specific fields
//!   associated with each node kind are expected to change over time in
//!   non-additive ways. This is not a problem in practice because the churn on
//!   the scale of individual nodes is minimal (maybe one change every several
//!   years). However, if there were a data structure that promised to be able
//!   to deserialize every possible piece of information in every node,
//!   practically every change to Clang would be a breaking change to some node
//!   *somewhere*, even though your tooling does not care about that node kind
//!   at all. By deserializing only those fields which are directly relevant to
//!   your use case, you become insulated from the vast majority of syntax tree
//!   changes.
//!
//! - **Compile time** &mdash; a typical use case involves inspecting only a
//!   tiny fraction of the possible nodes or fields, on the order of 1%.
//!   Consequently your code will compile 100&times; faster than if you tried to
//!   include everything in the data structure.
//!
//! <br>
//!
//! # Data structures
//!
//! The core data structure of the clang-ast crate is `Node<T>`.
//!
//! ```
//! # use clang_ast::Id;
//! #
//! pub struct Node<T> {
//!     pub id: Id,
//!     pub kind: T,
//!     pub inner: Vec<Node<T>>,
//! }
//! ```
//!
//! The caller must provide their own kind type `T`, which is an enum or struct
//! as described below. `T` determines exactly what information the clang-ast
//! crate will deserialize out of the AST dump.
//!
//! By convention you should name your `T` type `Clang`.
//!
//! <br>
//!
//! # T = enum
//!
//! Most often, you'll want `Clang` to be an enum. In this case your enum must
//! have one variant per node kind that you care about. The name of each variant
//! matches the `"kind"` entry seen in the AST.
//!
//! Additionally there must be a fallback variant, which must be named either
//! `Unknown` or `Other`, into which clang-ast will put all tree nodes not
//! matching one of the expected kinds.
//!
//! ```no_run
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//!
//! pub type Node = clang_ast::Node<Clang>;
//!
//! #[derive(Deserialize)]
//! pub enum Clang {
//!     NamespaceDecl { name: Option<String> },
//!     EnumDecl { name: Option<String> },
//!     EnumConstantDecl { name: String },
//!     Other,
//! }
//!
//! fn main() {
//!     let json = std::fs::read_to_string("ast.json").unwrap();
//!     let node: Node = serde_json::from_str(&json).unwrap();
//!
//! }
//! ```
//!
//! The above is a simple example with variants for processing `"kind":
//! "NamespaceDecl"`,&ensp;`"kind": "EnumDecl"`,&ensp;and `"kind":
//! "EnumConstantDecl"` nodes. This is sufficient to extract the set of variants
//! of every enum in the translation unit, and the enums' namespace (possibly
//! anonymous) and enum name (possibly anonymous).
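//!
//! As a rough illustration of that extraction, the sketch below walks a tree
//! and groups enum constants under their enum's qualified name. To stay free
//! of dependencies it uses local stand-in `Node` and `Clang` types that mirror
//! the shapes above (without ids) instead of deserializing real JSON. The
//! helper `collect_enums` and the `(anonymous)` placeholder are hypothetical
//! choices made for this example and are not part of the clang-ast API.
//!
//! ```rust
//! use std::collections::BTreeMap;
//!
//! // Local stand-ins for illustration only; not the crate's types.
//! pub enum Clang {
//!     NamespaceDecl { name: Option<String> },
//!     EnumDecl { name: Option<String> },
//!     EnumConstantDecl { name: String },
//!     Other,
//! }
//!
//! pub struct Node {
//!     pub kind: Clang,
//!     pub inner: Vec<Node>,
//! }
//!
//! /// Depth-first walk that records each enum's constants under its
//! /// `::`-joined path of enclosing namespaces.
//! pub fn collect_enums(
//!     node: &Node,
//!     path: &mut Vec<String>,
//!     out: &mut BTreeMap<String, Vec<String>>,
//! ) {
//!     match &node.kind {
//!         Clang::NamespaceDecl { name } | Clang::EnumDecl { name } => {
//!             // Anonymous namespaces and enums get a placeholder segment.
//!             path.push(name.clone().unwrap_or_else(|| "(anonymous)".to_owned()));
//!             if let Clang::EnumDecl { .. } = &node.kind {
//!                 // Register the enum even if it declares no constants.
//!                 out.entry(path.join("::")).or_default();
//!             }
//!             for child in &node.inner {
//!                 collect_enums(child, path, out);
//!             }
//!             path.pop();
//!         }
//!         Clang::EnumConstantDecl { name } => {
//!             out.entry(path.join("::")).or_default().push(name.clone());
//!         }
//!         Clang::Other => {
//!             for child in &node.inner {
//!                 collect_enums(child, path, out);
//!             }
//!         }
//!     }
//! }
//! ```
//!
//! Calling `collect_enums(&root, &mut Vec::new(), &mut map)` on the root node
//! fills `map` with one entry per enum encountered.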
//!
//! Newtype variants are fine too, particularly if you'll be deserializing more
//! than one field for some nodes.
//!
//! ```
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//!
//! pub type Node = clang_ast::Node<Clang>;
//!
//! #[derive(Deserialize)]
//! pub enum Clang {
//!     NamespaceDecl(NamespaceDecl),
//!     EnumDecl(EnumDecl),
//!     EnumConstantDecl(EnumConstantDecl),
//!     Other,
//! }
//!
//! #[derive(Deserialize, Debug)]
//! pub struct NamespaceDecl {
//!     pub name: Option<String>,
//! }
//!
//! #[derive(Deserialize, Debug)]
//! pub struct EnumDecl {
//!     pub name: Option<String>,
//! }
//!
//! #[derive(Deserialize, Debug)]
//! pub struct EnumConstantDecl {
//!     pub name: String,
//! }
//! ```
//!
//! <br><br>
//!
//! # T = struct
//!
//! Rarely, it can make sense to instantiate Node with `Clang` being a struct
//! type, instead of an enum. This allows for deserializing a uniform group of
//! data out of *every* node in the syntax tree.
//!
//! The following example struct collects the `"loc"` and `"range"` of every
//! node if present; these fields provide the file name / line / column position
//! of nodes. Not every node kind contains this information, so we use `Option`
//! to collect it for just the nodes that have it.
//!
//! ```
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//!
//! pub type Node = clang_ast::Node<Clang>;
//!
//! #[derive(Deserialize)]
//! pub struct Clang {
//!     pub kind: String,  // or clang_ast::Kind
//!     pub loc: Option<clang_ast::SourceLocation>,
//!     pub range: Option<clang_ast::SourceRange>,
//! }
//! ```
//!
//! If you really need, it's also possible to store *every other piece of
//! key/value information about every node* via a weakly typed `Map<String,
//! Value>` and the Serde `flatten` attribute.
//!
//! ```
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//! use serde_json::{Map, Value};
//!
//! #[derive(Deserialize)]
//! pub struct Clang {
//!     pub kind: String,  // or clang_ast::Kind
//!     #[serde(flatten)]
//!     pub data: Map<String, Value>,
//! }
//! ```
//!
//! <br><br>
//!
//! # Hybrid approach
//!
//! To deserialize kind-specific information about a fixed set of node kinds you
//! care about, as well as some uniform information about every other kind of
//! node, you can use a hybrid of the two approaches by giving your `Other` /
//! `Unknown` fallback variant some fields.
//!
//! ```
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//!
//! pub type Node = clang_ast::Node<Clang>;
//!
//! #[derive(Deserialize)]
//! pub enum Clang {
//!     NamespaceDecl(NamespaceDecl),
//!     EnumDecl(EnumDecl),
//!     Other {
//!         kind: clang_ast::Kind,
//!     },
//! }
//! #
//! # #[derive(Deserialize)]
//! # struct NamespaceDecl;
//! #
//! # #[derive(Deserialize)]
//! # struct EnumDecl;
//! ```
//!
//! <br><br>
//!
//! # Source locations
//!
//! Many node kinds expose the source location of the corresponding source code
//! tokens, which includes:
//!
//! - the filepath at which they're located;
//! - the chain of `#include`s by which that file was brought into the
//!   translation unit;
//! - line/column positions within the source file;
//! - macro expansion trace for tokens constructed by expansion of a C
//!   preprocessor macro.
//!
//! You'll find this information in fields called `"loc"` and/or `"range"` in
//! the JSON representation.
//!
//! ```
//! # stringify! {
//! {
//!   "id": "0x1251428",
//!   "kind": "NamespaceDecl",
//!   "loc": {                           //<--
//!     "offset": 7004,
//!     "file": "/usr/include/x86_64-linux-gnu/c++/10/bits/c++config.h",
//!     "line": 258,
//!     "col": 11,
//!     "tokLen": 3,
//!     "includedFrom": {
//!       "file": "/usr/include/c++/10/utility"
//!     }
//!   },
//!   "range": {                         //<--
//!     "begin": {
//!       "offset": 6994,
//!       "col": 1,
//!       "tokLen": 9
//!     },
//!     "end": {
//!       "offset": 7155,
//!       "line": 266,
//!       "col": 1,
//!       "tokLen": 1
//!     }
//!   },
//!   ...
//! }
//! # };
//! ```
//!
//! The naive deserialization of these structures is challenging to work with
//! because Clang uses field omission to mean "same as previous". So if a
//! `"loc"` is printed without a `"file"` inside, it means the loc is in the
//! same file as the immediately previous loc in serialization order.
//!
//! The clang-ast crate provides types for deserializing this source location
//! information painlessly, producing `Arc<str>` as the type of filepaths which
//! may be shared across multiple source locations.
//!
//! ```
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//!
//! pub type Node = clang_ast::Node<Clang>;
//!
//! #[derive(Deserialize)]
//! pub enum Clang {
//!     NamespaceDecl(NamespaceDecl),
//!     Other,
//! }
//!
//! #[derive(Deserialize, Debug)]
//! pub struct NamespaceDecl {
//!     pub name: Option<String>,
//!     pub loc: clang_ast::SourceLocation,    //<--
//!     pub range: clang_ast::SourceRange,     //<--
//! }
//! ```
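//!
//! To make the "same as previous" rule concrete, here is a minimal sketch with
//! no dependencies showing how omitted fields could be filled in from the
//! preceding location. `RawLoc`, `Loc` and `resolve` are hypothetical names
//! for this illustration, and only `file` and `line` are carried forward. It
//! is not the crate's implementation, which handles more than this.
//!
//! ```rust
//! use std::sync::Arc;
//!
//! /// A location as printed in the dump: omitted fields are `None`.
//! pub struct RawLoc {
//!     pub file: Option<&'static str>,
//!     pub line: Option<u32>,
//!     pub col: u32,
//! }
//!
//! /// A fully resolved location whose filepath can be shared cheaply.
//! #[derive(Clone, Debug, PartialEq)]
//! pub struct Loc {
//!     pub file: Arc<str>,
//!     pub line: u32,
//!     pub col: u32,
//! }
//!
//! /// Resolves locations in serialization order, reusing the previous
//! /// file and line whenever the current location omits them.
//! pub fn resolve(raw: &[RawLoc]) -> Vec<Loc> {
//!     let mut file: Arc<str> = Arc::from("");
//!     let mut line = 0;
//!     raw.iter()
//!         .map(|loc| {
//!             if let Some(f) = loc.file {
//!                 file = Arc::from(f);
//!             }
//!             if let Some(l) = loc.line {
//!                 line = l;
//!             }
//!             Loc { file: Arc::clone(&file), line, col: loc.col }
//!         })
//!         .collect()
//! }
//! ```
//!
//! Because the filepath is only reallocated when a new `file` is seen,
//! consecutive locations in the same file share a single allocation.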
//!
//! <br><br>
//!
//! # Node identifiers
//!
//! Every syntax tree node has an `"id"`. In JSON it's the memory address of
//! Clang's internal memory allocation for that node, serialized to a hex
//! string.
//!
//! The AST dump uses ids as backreferences in nodes of directed acyclic graph
//! nature. For example the following MemberExpr node is part of the invocation
//! of an `operator bool` conversion, and thus its syntax tree refers to the
//! resolved `operator bool` conversion function declaration:
//!
//! ```
//! # stringify! {
//! {
//!   "id": "0x9918b88",
//!   "kind": "MemberExpr",
//!   "valueCategory": "rvalue",
//!   "referencedMemberDecl": "0x12d8330",     //<--
//!   ...
//! }
//! # };
//! ```
//!
//! The node it references, with memory address 0x12d8330, is found somewhere
//! earlier in the syntax tree:
//!
//! ```
//! # stringify! {
//! {
//!   "id": "0x12d8330",                       //<--
//!   "kind": "CXXConversionDecl",
//!   "name": "operator bool",
//!   "mangledName": "_ZNKSt17integral_constantIbLb1EEcvbEv",
//!   "type": {
//!     "qualType": "std::integral_constant<bool, true>::value_type () const noexcept"
//!   },
//!   "constexpr": true,
//!   ...
//! }
//! # };
//! ```
//!
//! Due to the ubiquitous use of ids for backreferencing, it is valuable to
//! deserialize them not as strings but as a 64-bit integer. The clang-ast crate
//! provides an `Id` type for this purpose, which is cheaply copyable, hashable,
//! and comparable more cheaply than a string. You may find yourself with lots
//! of hashtables keyed on `Id`.
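//!
//! As a small sketch of the idea (not the crate's actual `Id` implementation),
//! a hex id string can be parsed into a `u64` and then used as a hashtable
//! key. The helper names `parse_id` and `index_kinds` are hypothetical.
//!
//! ```rust
//! use std::collections::HashMap;
//! use std::num::ParseIntError;
//!
//! /// Parses an id such as `"0x12d8330"` into its integer value.
//! pub fn parse_id(text: &str) -> Result<u64, ParseIntError> {
//!     let digits = text.strip_prefix("0x").unwrap_or(text);
//!     u64::from_str_radix(digits, 16)
//! }
//!
//! /// Builds a lookup table from integer id to node kind.
//! pub fn index_kinds(
//!     nodes: &[(&str, &'static str)],
//! ) -> Result<HashMap<u64, &'static str>, ParseIntError> {
//!     nodes
//!         .iter()
//!         .map(|&(id, kind)| parse_id(id).map(|n| (n, kind)))
//!         .collect()
//! }
//! ```
//!
//! Resolving a backreference such as `"referencedMemberDecl"` then becomes one
//! integer-keyed lookup in the table.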

#![doc(html_root_url = "https://docs.rs/clang-ast/0.1.31")]
#![allow(
    clippy::blocks_in_conditions,
    clippy::derivable_impls,
    clippy::doc_markdown,
    clippy::elidable_lifetime_names,
    clippy::let_underscore_untyped,
    clippy::match_like_matches_macro,
    clippy::must_use_candidate,
    clippy::needless_lifetimes,
    clippy::ptr_arg,
    clippy::uninlined_format_args,
    clippy::unnecessary_map_or
)]

mod dedup;
mod deserializer;
mod id;
mod intern;
mod kind;
mod loc;
mod serializer;

extern crate serde;

use crate::deserializer::NodeDeserializer;
use crate::kind::AnyKind;
use crate::serializer::NodeSerializer;
use serde::de::{Deserialize, Deserializer, MapAccess, Visitor};
use serde::ser::{Serialize, SerializeMap, Serializer};
use std::fmt;
use std::marker::PhantomData;

pub use crate::id::Id;
pub use crate::kind::Kind;
pub use crate::loc::{BareSourceLocation, IncludedFrom, SourceLocation, SourceRange};

/// <font style="font-variant:small-caps">syntax tree root</font>
#[derive(Clone, Eq, PartialEq, Hash, Debug)]
pub struct Node<T> {
    pub id: Id,
    pub kind: T,
    pub inner: Vec<Node<T>>,
}

struct NodeVisitor<T> {
    marker: PhantomData<fn() -> T>,
}

impl<'de, T> Visitor<'de> for NodeVisitor<T>
where
    T: Deserialize<'de>,
{
    type Value = Node<T>;

    fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        formatter.write_str("clang syntax tree node")
    }

    fn visit_map<M>(self, mut map: M) -> Result<Self::Value, M::Error>
    where
        M: MapAccess<'de>,
    {
        // The only keys accepted before "kind" has been seen.
        enum FirstField {
            Id,
            Kind,
            Inner,
        }

        struct FirstFieldVisitor;

        impl<'de> Visitor<'de> for FirstFieldVisitor {
            type Value = FirstField;

            fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
                formatter.write_str("field identifier")
            }

            fn visit_str<E>(self, field: &str) -> Result<Self::Value, E>
            where
                E: serde::de::Error,
            {
                static FIELDS: &[&str] = &["id", "kind", "inner"];
                match field {
                    "id" => Ok(FirstField::Id),
                    "kind" => Ok(FirstField::Kind),
                    "inner" => Ok(FirstField::Inner),
                    _ => Err(E::unknown_field(field, FIELDS)),
                }
            }
        }

        impl<'de> Deserialize<'de> for FirstField {
            fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
            where
                D: Deserializer<'de>,
            {
                deserializer.deserialize_identifier(FirstFieldVisitor)
            }
        }

        let mut id = None;
        let mut inner = Vec::new();
        // Read keys until "kind" is found, then hand the rest of the map to
        // `NodeDeserializer` together with `inner`.
        let kind = loop {
            match map.next_key()? {
                None => {
                    // The map ended without a "kind": deserialize `T` from a
                    // null kind.
                    let kind = AnyKind::Kind(Kind::null);
                    let deserializer = NodeDeserializer::new(&kind, &mut inner, map);
                    break T::deserialize(deserializer)?;
                }
                Some(FirstField::Id) => {
                    if id.is_some() {
                        return Err(serde::de::Error::duplicate_field("id"));
                    }
                    id = Some(map.next_value()?);
                }
                Some(FirstField::Kind) => {
                    let kind: AnyKind = map.next_value()?;
                    let deserializer = NodeDeserializer::new(&kind, &mut inner, map);
                    break T::deserialize(deserializer)?;
                }
                Some(FirstField::Inner) => {
                    // "inner" must not appear before "kind".
                    return Err(serde::de::Error::missing_field("kind"));
                }
            }
        };

        // A node without an "id" gets the default `Id`.
        let id = id.unwrap_or_default();

        Ok(Node { id, kind, inner })
    }
}

impl<'de, T> Deserialize<'de> for Node<T>
where
    T: Deserialize<'de>,
{
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        let _intern = intern::activate();
        let marker = PhantomData;
        let visitor = NodeVisitor { marker };
        deserializer.deserialize_map(visitor)
    }
}

impl<T> Serialize for Node<T>
where
    T: Serialize,
{
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        let _dedup = dedup::activate();
        let mut map = serializer.serialize_map(None)?;
        map.serialize_entry("id", &self.id)?;
        T::serialize(&self.kind, NodeSerializer::new(&mut map))?;
        // Omit "inner" entirely for leaf nodes.
        if !self.inner.is_empty() {
            map.serialize_entry("inner", &self.inner)?;
        }
        map.end()
    }
}
558        map.end()
559    }
560}