Expand description
Types and Traits for efficient String storage and deduplication.
Because cstree
is aimed at concrete syntax trees that faithfully represent all of the original program input,
cstree
aks for the text of each token when building a syntax tree. You’ll notice this when looking at
GreenNodeBuilder::token
, which takes the kind of token and a refernce to the text of the token in the source.
Of course, there are tokens whose text will always be the same, such as punctuation (like a semicolon), keywords
(like fn
), or operators (like <=
). Use Syntax::static_text
when implementing Syntax
to make cstree
aware of such tokens.
There is, however, another category of tokens whose text will appear repeatedly, but for which we cannot know the text upfront. Any variable, type, or method that is user-defined will likely be named more than once, but there is no way to know beforehand what names a user will choose.
In order to avoid storing the source text for these tokens many times over, cstree
interns the text of its
tokens (if that text is not static). What this means is that each unique string is only stored once. When a new
token is added - say, a variable -, we check if we already know its contents (the variable name). If the text is
new, we save it and give it a unique Id. If we have seen the text before, we look up its unique Id and don’t need to
keep the new data around. As an additional benefit, interning also makes it much cheaper to copy source text around
and also to compare it with other source text, since what is actually being copied or compared is just an integer.
I just want to build a syntax tree
If you don’t want to worry about this for now, you (mostly) can! All required functionality is implemented in
cstree
and you can just use GreenNodeBuilder::new
to obtain a tree builder with everything set up (see the
crate documentation for more on how to get started). This will create an interner, which the builder returns
together with the syntax tree on finish
as part of its node cache (call NodeCache::into_interner
on the
result to get the interner out).
Here begins the part where you do have to think about interning: cstree
needs the interner you get when you want
to look at the source text for some part of the syntax tree, so you’ll have to keep it around somehow until the
point where you need it.
How best to do this depends on what you need the text for. If the code that accesses the text is close-by, it might
be enough to pass the return value to the functions that need it (within cstree
or in your code). Other options
could be to store the interner together with the syntax tree. If you use SyntaxNode::new_root_with_resolver
, you
get a syntax tree that can handle text without any need to manage and pass an interner (the reason the method is
called _with_resolver
and not _with_interner
is that it doesn’t actually needs a full Interner
– once the
tree is created, no more text will be added, so it just needs to be able to look up text. This part is called a
Resolver
). Or you could put the interner somewhere “global”, where you can easily access it from anywhere.
Using other interners
By default, cstree
uses its own, simple interner implementation. You can obtain an interner by calling
new_interner
, or bring your own by implementing the Resolver
and Interner
traits defined in this module.
Most methods in cstree
require that you support interning TokenKey
s. TokenKey
implements InternKey
, so
your implementation can use that to convert to whatever types it uses for its internal representation. Note that
there is no way to change the size of the internal representation.
lasso
Using features, you can enable support for some third-party interners. The primary one is lasso
, a crate focused
on efficient interning of text strings. This is enabled via the lasso_compat
feature and adds the necessary trait
implementation to make lasso
’s interners work with cstree
(as well as a re-export of the matching version of
lasso
here). If enabled, cstree
’s built-in interning functionality is replaced with lasso
’s more efficient one
transparently, so you’ll now be returned a lasso
interner from new_interner
.
Multi-threaded interners
If you want to use your interner on more than one thread, the interner needs to support interning new text through
shared access. With the multi_threaded_interning
feature, you can get such an interner by calling
new_threaded_interner
. The feature also enables support for ThreadedRodeo
, the multi-threaded interner from
lasso
.
You can pass a reference to that interner to anything that expects an Interner
!
While the interning methods on Interner
require a &mut self
to also work for single-threaded interners, both
Resolver
and Interner
will be implemented for &interner
if interner
is multi-threaded:
let interner = new_threaded_interner();
let mut builder: GreenNodeBuilder<MySyntax, &MultiThreadedTokenInterner> =
GreenNodeBuilder::from_interner(&interner);
parse(&mut builder, "42");
let (tree, cache) = builder.finish();
// Note that we get a cache and interner back, because we passed an "owned" reference to `from_interner`
let used_interner = cache.unwrap().into_interner().unwrap();
assert_eq!(used_interner as *const _, &interner as *const _);
let int = tree.children().next().unwrap();
assert_eq!(int.as_token().unwrap().text(&interner), Some("42"));
Here, we use from_interner
, but pass it only a shared reference to “own”. Take care to denote the type signature
of the GreenNodeBuilder
appropriately.
Re-exports
pub use lasso;
Structs
- MultiThreadedTokenInterner
multi_threaded_interning
A threadsafeInterner
for deduplicatingGreenToken
strings. - The default
Interner
used to deduplicate green token strings. - The intern key type for the source text of
GreenToken
s. Each unique key uniquely identifies a deduplicated, interned source string.
Traits
- Common interface for all intern keys via conversion to and from
u32
. - A full interner, which can intern new strings returning intern keys and also resolve intern keys to the interned value.
- The read-only part of an interner. Allows to perform lookups of intern keys to resolve them to their interned text.
Functions
- Constructs a new, single-threaded
Interner
. - new_threaded_interner
multi_threaded_interning
Constructs a newInterner
that can be used across multiple threads.