clang_ast/lib.rs
//! [![github]](https://github.com/dtolnay/clang-ast) [![crates-io]](https://crates.io/crates/clang-ast) [![docs-rs]](https://docs.rs/clang-ast)
//!
//! [github]: https://img.shields.io/badge/github-8da0cb?style=for-the-badge&labelColor=555555&logo=github
//! [crates-io]: https://img.shields.io/badge/crates.io-fc8d62?style=for-the-badge&labelColor=555555&logo=rust
//! [docs-rs]: https://img.shields.io/badge/docs.rs-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs
//!
//! <br>
//!
//! This library provides deserialization logic for efficiently processing
//! Clang's `-ast-dump=json` format.
//!
//! <br>
//!
//! # Format overview
//!
//! An AST dump is generated by a compiler command like:
//!
//! <pre>
//! <code>$ <b>clang++ -Xclang -ast-dump=json -fsyntax-only path/to/source.cc</b></code>
//! </pre>
//!
//! The high-level structure is a tree of nodes, each of which has an `"id"` and
//! a `"kind"`, zero or more further fields depending on what the node kind is,
//! and finally an optional `"inner"` array of child nodes.
//!
//! As an example, for an input file containing just the declaration `class S;`,
//! the AST would be as follows:
//!
//! ```
//! # stringify! {
//! {
//!   "id": "0x1fcea38",  //<-- root node
//!   "kind": "TranslationUnitDecl",
//!   "inner": [
//!     {
//!       "id": "0xadf3a8",  //<-- first child node
//!       "kind": "CXXRecordDecl",
//!       "loc": {
//!         "offset": 6,
//!         "file": "source.cc",
//!         "line": 1,
//!         "col": 7,
//!         "tokLen": 1
//!       },
//!       "range": {
//!         "begin": {
//!           "offset": 0,
//!           "col": 1,
//!           "tokLen": 5
//!         },
//!         "end": {
//!           "offset": 6,
//!           "col": 7,
//!           "tokLen": 1
//!         }
//!       },
//!       "name": "S",
//!       "tagUsed": "class"
//!     }
//!   ]
//! }
//! # };
//! ```
//!
//! <br><br>
//!
//! # Library design
//!
//! By design, the clang-ast crate *does not* provide a single great big data
//! structure that exhaustively covers every possible field of every possible
//! Clang node type. There are three major reasons:
//!
//! - **Performance**: these ASTs get quite large. For a reasonable
//!   mid-sized translation unit that includes several platform headers, you can
//!   easily get an AST that is tens to hundreds of megabytes of JSON. To
//!   maintain performance of downstream tooling built on the AST, it's critical
//!   that you deserialize only the few fields which are directly required by
//!   your use case, and allow Serde's deserializer to efficiently ignore all
//!   the rest.
//!
//! - **Stability**: as Clang is developed, the specific fields
//!   associated with each node kind are expected to change over time in
//!   non-additive ways. This is not a problem in practice because the churn on
//!   the scale of individual nodes is minimal (maybe one change every several
//!   years). However, if there were a data structure that promised to be able
//!   to deserialize every possible piece of information in every node,
//!   practically every change to Clang would be a breaking change to some node
//!   *somewhere*, even though your tooling does not care about that node kind
//!   at all. By deserializing only those fields which are directly relevant to
//!   your use case, you become insulated from the vast majority of syntax tree
//!   changes.
//!
//! - **Compile time**: a typical use case involves inspecting only a
//!   tiny fraction of the possible nodes or fields, on the order of 1%.
//!   Consequently your code can compile on the order of 100× faster than if
//!   you tried to include everything in the data structure.
//!
//! <br>
//!
//! # Data structures
//!
//! The core data structure of the clang-ast crate is `Node<T>`.
//!
//! ```
//! # use clang_ast::Id;
//! #
//! pub struct Node<T> {
//!     pub id: Id,
//!     pub kind: T,
//!     pub inner: Vec<Node<T>>,
//! }
//! ```
//!
//! The caller must provide their own kind type `T`, which is an enum or struct
//! as described below. `T` determines exactly what information the clang-ast
//! crate will deserialize out of the AST dump.
//!
//! By convention you should name your `T` type `Clang`.
//!
//! <br>
//!
//! # T = enum
//!
//! Most often, you'll want `Clang` to be an enum. In this case your enum must
//! have one variant per node kind that you care about. The name of each variant
//! matches the `"kind"` entry seen in the AST.
//!
//! Additionally there must be a fallback variant, which must be named either
//! `Unknown` or `Other`, into which clang-ast will put all tree nodes not
//! matching one of the expected kinds.
//!
//! ```no_run
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//!
//! pub type Node = clang_ast::Node<Clang>;
//!
//! #[derive(Deserialize)]
//! pub enum Clang {
//!     NamespaceDecl { name: Option<String> },
//!     EnumDecl { name: Option<String> },
//!     EnumConstantDecl { name: String },
//!     Other,
//! }
//!
//! fn main() {
//!     let json = std::fs::read_to_string("ast.json").unwrap();
//!     let node: Node = serde_json::from_str(&json).unwrap();
//!     // Process the tree from here, starting with `node.kind` and `node.inner`.
//! }
//! ```
//!
//! The above is a simple example with variants for processing `"kind":
//! "NamespaceDecl"`, `"kind": "EnumDecl"`, and `"kind":
//! "EnumConstantDecl"` nodes. This is sufficient to extract the set of variants
//! of every enum in the translation unit, and the enums' namespace (possibly
//! anonymous) and enum name (possibly anonymous).
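//!
//! As a rough sketch of what that extraction can look like, the following
//! walks a tree depth-first and records each enum constant together with the
//! name of its enclosing enum. So that it runs without an AST dump on disk, it
//! builds a small tree by hand from a local stand-in `Node` that has only the
//! `kind` and `inner` fields; the same traversal applies to a real
//! `clang_ast::Node<Clang>`. The helpers `collect` and `sample_tree` are
//! hypothetical names for illustration and are not part of this crate.
//!
//! ```rust
//! pub enum Clang {
//!     NamespaceDecl { name: Option<String> },
//!     EnumDecl { name: Option<String> },
//!     EnumConstantDecl { name: String },
//!     Other,
//! }
//!
//! // Stand-in for clang_ast::Node<Clang>, reduced to the fields used here.
//! pub struct Node {
//!     pub kind: Clang,
//!     pub inner: Vec<Node>,
//! }
//!
//! // Depth-first walk. `current` is the name of the innermost enclosing enum,
//! // if any; each constant is pushed as an (enum name, constant name) pair.
//! fn collect<'a>(node: &'a Node, current: Option<&'a str>, out: &mut Vec<(String, String)>) {
//!     let current = match &node.kind {
//!         Clang::EnumDecl { name } => Some(name.as_deref().unwrap_or("(anonymous)")),
//!         Clang::EnumConstantDecl { name } => {
//!             if let Some(parent) = current {
//!                 out.push((parent.to_owned(), name.clone()));
//!             }
//!             current
//!         }
//!         Clang::NamespaceDecl { .. } | Clang::Other => current,
//!     };
//!     for child in &node.inner {
//!         collect(child, current, out);
//!     }
//! }
//!
//! // Hand-built tree roughly shaped like `namespace gfx { enum Color { Red, Green }; }`.
//! fn sample_tree() -> Node {
//!     let constant = |name: &str| Node {
//!         kind: Clang::EnumConstantDecl { name: name.to_owned() },
//!         inner: Vec::new(),
//!     };
//!     Node {
//!         kind: Clang::Other, // stands in for the TranslationUnitDecl root
//!         inner: vec![Node {
//!             kind: Clang::NamespaceDecl { name: Some("gfx".to_owned()) },
//!             inner: vec![Node {
//!                 kind: Clang::EnumDecl { name: Some("Color".to_owned()) },
//!                 inner: vec![constant("Red"), constant("Green")],
//!             }],
//!         }],
//!     }
//! }
//!
//! fn main() {
//!     let mut out = Vec::new();
//!     collect(&sample_tree(), None, &mut out);
//!     for (enum_name, constant) in &out {
//!         println!("{}::{}", enum_name, constant);
//!     }
//! }
//! ```
//!
//! Tracking the enclosing namespace would follow the same pattern, with one
//! more piece of state threaded through the recursion.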
//!
//! Newtype variants are fine too, particularly if you'll be deserializing more
//! than one field for some nodes.
//!
//! ```
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//!
//! pub type Node = clang_ast::Node<Clang>;
//!
//! #[derive(Deserialize)]
//! pub enum Clang {
//!     NamespaceDecl(NamespaceDecl),
//!     EnumDecl(EnumDecl),
//!     EnumConstantDecl(EnumConstantDecl),
//!     Other,
//! }
//!
//! #[derive(Deserialize, Debug)]
//! pub struct NamespaceDecl {
//!     pub name: Option<String>,
//! }
//!
//! #[derive(Deserialize, Debug)]
//! pub struct EnumDecl {
//!     pub name: Option<String>,
//! }
//!
//! #[derive(Deserialize, Debug)]
//! pub struct EnumConstantDecl {
//!     pub name: String,
//! }
//! ```
//!
//! <br><br>
//!
//! # T = struct
//!
//! Rarely, it can make sense to instantiate `Node` with `Clang` being a struct
//! type, instead of an enum. This allows for deserializing a uniform group of
//! data out of *every* node in the syntax tree.
//!
//! The following example struct collects the `"loc"` and `"range"` of every
//! node if present; these fields provide the file name / line / column position
//! of nodes. Not every node kind contains this information, so we use `Option`
//! to collect it for just the nodes that have it.
//!
//! ```
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//!
//! pub type Node = clang_ast::Node<Clang>;
//!
//! #[derive(Deserialize)]
//! pub struct Clang {
//!     pub kind: String, // or clang_ast::Kind
//!     pub loc: Option<clang_ast::SourceLocation>,
//!     pub range: Option<clang_ast::SourceRange>,
//! }
//! ```
//!
//! If you really need to, it's also possible to store *every other piece of
//! key/value information about every node* via a weakly typed `Map<String,
//! Value>` and the Serde `flatten` attribute.
//!
//! ```
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//! use serde_json::{Map, Value};
//!
//! #[derive(Deserialize)]
//! pub struct Clang {
//!     pub kind: String, // or clang_ast::Kind
//!     #[serde(flatten)]
//!     pub data: Map<String, Value>,
//! }
//! ```
//!
//! <br><br>
//!
//! # Hybrid approach
//!
//! To deserialize kind-specific information about a fixed set of node kinds you
//! care about, as well as some uniform information about every other kind of
//! node, you can use a hybrid of the two approaches by giving your `Other` /
//! `Unknown` fallback variant some fields.
//!
//! ```
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//!
//! pub type Node = clang_ast::Node<Clang>;
//!
//! #[derive(Deserialize)]
//! pub enum Clang {
//!     NamespaceDecl(NamespaceDecl),
//!     EnumDecl(EnumDecl),
//!     Other {
//!         kind: clang_ast::Kind,
//!     },
//! }
//! #
//! # #[derive(Deserialize)]
//! # struct NamespaceDecl;
//! #
//! # #[derive(Deserialize)]
//! # struct EnumDecl;
//! ```
//!
//! <br><br>
//!
//! # Source locations
//!
//! Many node kinds expose the source location of the corresponding source code
//! tokens, which includes:
//!
//! - the filepath at which they're located;
//! - the chain of `#include`s by which that file was brought into the
//!   translation unit;
//! - line/column positions within the source file;
//! - macro expansion trace for tokens constructed by expansion of a C
//!   preprocessor macro.
//!
//! You'll find this information in fields called `"loc"` and/or `"range"` in
//! the JSON representation.
//!
//! ```
//! # stringify! {
//! {
//!   "id": "0x1251428",
//!   "kind": "NamespaceDecl",
//!   "loc": {           //<--
//!     "offset": 7004,
//!     "file": "/usr/include/x86_64-linux-gnu/c++/10/bits/c++config.h",
//!     "line": 258,
//!     "col": 11,
//!     "tokLen": 3,
//!     "includedFrom": {
//!       "file": "/usr/include/c++/10/utility"
//!     }
//!   },
//!   "range": {         //<--
//!     "begin": {
//!       "offset": 6994,
//!       "col": 1,
//!       "tokLen": 9
//!     },
//!     "end": {
//!       "offset": 7155,
//!       "line": 266,
//!       "col": 1,
//!       "tokLen": 1
//!     }
//!   },
//!   ...
//! }
//! # };
//! ```
//!
//! The naive deserialization of these structures is challenging to work with
//! because Clang uses field omission to mean "same as previous". So if a
//! `"loc"` is printed without a `"file"` inside, it means the loc is in the
//! same file as the immediately previous loc in serialization order.
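//!
//! To illustrate the bookkeeping this implies, here is a minimal sketch that
//! fills in omitted fields by carrying the most recently seen value forward in
//! serialization order. `RawLoc`, `ResolvedLoc` and `resolve` are hypothetical
//! names used only for illustration; they are not types from this crate and
//! this is not how the crate is implemented. Carrying `line` forward in the
//! same way as `file` is an assumption based on the example above, where
//! `"begin"` omits `"line"`.
//!
//! ```rust
//! // Raw fields of one loc as printed; `None` means Clang omitted the field.
//! #[derive(Debug)]
//! struct RawLoc {
//!     file: Option<String>,
//!     line: Option<u32>,
//!     col: u32,
//! }
//!
//! #[derive(Debug, PartialEq)]
//! struct ResolvedLoc {
//!     file: String,
//!     line: u32,
//!     col: u32,
//! }
//!
//! // Resolves locs in serialization order, reusing the previous file and
//! // line whenever the current loc does not state its own.
//! fn resolve(locs: &[RawLoc]) -> Vec<ResolvedLoc> {
//!     let mut last_file = String::new();
//!     let mut last_line = 0;
//!     let mut out = Vec::with_capacity(locs.len());
//!     for loc in locs {
//!         if let Some(file) = &loc.file {
//!             last_file = file.clone();
//!         }
//!         if let Some(line) = loc.line {
//!             last_line = line;
//!         }
//!         out.push(ResolvedLoc {
//!             file: last_file.clone(),
//!             line: last_line,
//!             col: loc.col,
//!         });
//!     }
//!     out
//! }
//!
//! fn main() {
//!     let locs = [
//!         RawLoc { file: Some("c++config.h".to_owned()), line: Some(258), col: 11 },
//!         RawLoc { file: None, line: None, col: 1 },
//!         RawLoc { file: None, line: Some(266), col: 1 },
//!     ];
//!     for loc in resolve(&locs) {
//!         println!("{}:{}:{}", loc.file, loc.line, loc.col);
//!     }
//! }
//! ```
//!
//! Cloning a `String` per location, as this sketch does, is wasteful on large
//! dumps, which is one motivation for sharing filepaths instead.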
//!
//! The clang-ast crate provides types for deserializing this source location
//! information painlessly, producing `Arc<str>` as the type of filepaths which
//! may be shared across multiple source locations.
//!
//! ```
//! use serde::Deserialize;
//! # use serde_derive::Deserialize;
//!
//! pub type Node = clang_ast::Node<Clang>;
//!
//! #[derive(Deserialize)]
//! pub enum Clang {
//!     NamespaceDecl(NamespaceDecl),
//!     Other,
//! }
//!
//! #[derive(Deserialize, Debug)]
//! pub struct NamespaceDecl {
//!     pub name: Option<String>,
//!     pub loc: clang_ast::SourceLocation,  //<--
//!     pub range: clang_ast::SourceRange,  //<--
//! }
//! ```
//!
//! <br><br>
//!
//! # Node identifiers
//!
//! Every syntax tree node has an `"id"`. In JSON it's the memory address of
//! Clang's internal memory allocation for that node, serialized to a hex
//! string.
//!
//! The AST dump uses ids as backreferences wherever the syntax tree is really a
//! directed acyclic graph. For example the following MemberExpr node is part of
//! the invocation of an `operator bool` conversion, and thus its syntax tree
//! refers to the resolved `operator bool` conversion function declaration:
//!
//! ```
//! # stringify! {
//! {
//!   "id": "0x9918b88",
//!   "kind": "MemberExpr",
//!   "valueCategory": "rvalue",
//!   "referencedMemberDecl": "0x12d8330",  //<--
//!   ...
//! }
//! # };
//! ```
//!
//! The node it references, with memory address 0x12d8330, is found somewhere
//! earlier in the syntax tree:
//!
//! ```
//! # stringify! {
//! {
//!   "id": "0x12d8330",  //<--
//!   "kind": "CXXConversionDecl",
//!   "name": "operator bool",
//!   "mangledName": "_ZNKSt17integral_constantIbLb1EEcvbEv",
//!   "type": {
//!     "qualType": "std::integral_constant<bool, true>::value_type () const noexcept"
//!   },
//!   "constexpr": true,
//!   ...
//! }
//! # };
//! ```
//!
//! Due to the ubiquitous use of ids for backreferencing, it is valuable to
//! deserialize them not as strings but as a 64-bit integer. The clang-ast crate
//! provides an `Id` type for this purpose, which is cheaply copyable, hashable,
//! and comparable more cheaply than a string. You may find yourself with lots
//! of hashtables keyed on `Id`.
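//!
//! The idea can be sketched with the standard library alone. The example
//! below parses the hex strings from the JSON above into `u64` keys and uses
//! them to resolve a backreference through a `HashMap`. The `parse_id` helper
//! is hypothetical and only stands in for what `Id` gives you on
//! deserialization; it is not this crate's API.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! // Hypothetical helper: turns a string like "0x12d8330" into a 64-bit key.
//! // Returns None if the "0x" prefix is missing or the digits are not hex.
//! fn parse_id(hex: &str) -> Option<u64> {
//!     let digits = hex.strip_prefix("0x")?;
//!     u64::from_str_radix(digits, 16).ok()
//! }
//!
//! fn main() {
//!     // Index declarations by id as they are encountered in the tree.
//!     let mut decls: HashMap<u64, &str> = HashMap::new();
//!     decls.insert(parse_id("0x12d8330").unwrap(), "operator bool");
//!
//!     // Later, resolve the "referencedMemberDecl" of the MemberExpr node.
//!     let referenced = parse_id("0x12d8330").unwrap();
//!     assert_eq!(decls.get(&referenced), Some(&"operator bool"));
//! }
//! ```
//!
//! Hashing and comparing an integer key avoids touching string contents on
//! every lookup, which matters when nearly every node carries references.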

#![doc(html_root_url = "https://docs.rs/clang-ast/0.1.31")]
#![allow(
    clippy::blocks_in_conditions,
    clippy::derivable_impls,
    clippy::doc_markdown,
    clippy::elidable_lifetime_names,
    clippy::let_underscore_untyped,
    clippy::match_like_matches_macro,
    clippy::must_use_candidate,
    clippy::needless_lifetimes,
    clippy::ptr_arg,
    clippy::uninlined_format_args,
    clippy::unnecessary_map_or
)]

mod dedup;
mod deserializer;
mod id;
mod intern;
mod kind;
mod loc;
mod serializer;

extern crate serde;

use crate::deserializer::NodeDeserializer;
use crate::kind::AnyKind;
use crate::serializer::NodeSerializer;
use serde::de::{Deserialize, Deserializer, MapAccess, Visitor};
use serde::ser::{Serialize, SerializeMap, Serializer};
use std::fmt;
use std::marker::PhantomData;

pub use crate::id::Id;
pub use crate::kind::Kind;
pub use crate::loc::{BareSourceLocation, IncludedFrom, SourceLocation, SourceRange};

/// <font style="font-variant:small-caps">syntax tree root</font>
#[derive(Clone, Eq, PartialEq, Hash, Debug)]
pub struct Node<T> {
    pub id: Id,
    pub kind: T,
    pub inner: Vec<Node<T>>,
}

struct NodeVisitor<T> {
    marker: PhantomData<fn() -> T>,
}

impl<'de, T> Visitor<'de> for NodeVisitor<T>
where
    T: Deserialize<'de>,
{
    type Value = Node<T>;

    fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        formatter.write_str("clang syntax tree node")
    }

    fn visit_map<M>(self, mut map: M) -> Result<Self::Value, M::Error>
    where
        M: MapAccess<'de>,
    {
        // The only keys accepted before the rest of the map is handed over to
        // `NodeDeserializer`.
        enum FirstField {
            Id,
            Kind,
            Inner,
        }

        struct FirstFieldVisitor;

        impl<'de> Visitor<'de> for FirstFieldVisitor {
            type Value = FirstField;

            fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
                formatter.write_str("field identifier")
            }

            fn visit_str<E>(self, field: &str) -> Result<Self::Value, E>
            where
                E: serde::de::Error,
            {
                static FIELDS: &[&str] = &["id", "kind", "inner"];
                match field {
                    "id" => Ok(FirstField::Id),
                    "kind" => Ok(FirstField::Kind),
                    "inner" => Ok(FirstField::Inner),
                    _ => Err(E::unknown_field(field, FIELDS)),
                }
            }
        }

        impl<'de> Deserialize<'de> for FirstField {
            fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
            where
                D: Deserializer<'de>,
            {
                deserializer.deserialize_identifier(FirstFieldVisitor)
            }
        }

        let mut id = None;
        let mut inner = Vec::new();
        let kind = loop {
            match map.next_key()? {
                // The map ended without a "kind" entry: deserialize `T` as if
                // the kind were `Kind::null`.
                None => {
                    let kind = AnyKind::Kind(Kind::null);
                    let deserializer = NodeDeserializer::new(&kind, &mut inner, map);
                    break T::deserialize(deserializer)?;
                }
                Some(FirstField::Id) => {
                    if id.is_some() {
                        return Err(serde::de::Error::duplicate_field("id"));
                    }
                    id = Some(map.next_value()?);
                }
                // Once "kind" is known, the remaining entries of the map are
                // consumed by `NodeDeserializer` on behalf of `T`.
                Some(FirstField::Kind) => {
                    let kind: AnyKind = map.next_value()?;
                    let deserializer = NodeDeserializer::new(&kind, &mut inner, map);
                    break T::deserialize(deserializer)?;
                }
                // "inner" appearing before "kind" is reported as a missing
                // "kind" field.
                Some(FirstField::Inner) => {
                    return Err(serde::de::Error::missing_field("kind"));
                }
            }
        };

        let id = id.unwrap_or_default();

        Ok(Node { id, kind, inner })
    }
}

impl<'de, T> Deserialize<'de> for Node<T>
where
    T: Deserialize<'de>,
{
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        let _intern = intern::activate();
        let marker = PhantomData;
        let visitor = NodeVisitor { marker };
        deserializer.deserialize_map(visitor)
    }
}

impl<T> Serialize for Node<T>
where
    T: Serialize,
{
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        let _dedup = dedup::activate();
        let mut map = serializer.serialize_map(None)?;
        map.serialize_entry("id", &self.id)?;
        T::serialize(&self.kind, NodeSerializer::new(&mut map))?;
        if !self.inner.is_empty() {
            map.serialize_entry("inner", &self.inner)?;
        }
        map.end()
    }
}