Internment
A very easy-to-use library for interning strings or other data in Rust. Interned data is very efficient to either hash or compare for equality (just a pointer comparison). Data is also automatically de-duplicated.
You have two options with the internment crate:
- Intern, which will never free your data. This means that an Intern is Copy, so you can make as many copies of the pointer as you like at no cost.
- ArcIntern, which reference-counts your data and frees it when there are no more references. ArcIntern will keep memory use down, but requires locking whenever a clone of your pointer is made, as well as when dropping the pointer.
In both cases, accessing your data is a single pointer dereference, and the size of either Intern or ArcIntern is a single pointer. In both cases, you have a guarantee that a single data value (as defined by Eq and Hash) will correspond to a single pointer value. This means that we can use pointer comparison (and a pointer hash) in place of value comparisons, which is very fast.
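The "one value, one pointer" guarantee can be illustrated with a small standard-library-only sketch. This is not the crate's actual implementation; the names `pool`, `intern_leaky`, and `same_interned` are hypothetical, and the sketch handles only strings. It assumes a global set behind a Mutex and, like Intern, never frees what it stores.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

// Global set of every string interned so far.
// Hypothetical helper for illustration, not the crate's internals.
fn pool() -> &'static Mutex<HashSet<&'static str>> {
    static POOL: OnceLock<Mutex<HashSet<&'static str>>> = OnceLock::new();
    POOL.get_or_init(|| Mutex::new(HashSet::new()))
}

/// Returns the unique `&'static str` for `s`.
/// The first time a value is seen it is leaked, so it is never freed.
fn intern_leaky(s: &str) -> &'static str {
    let mut set = pool().lock().unwrap();
    if let Some(&existing) = set.get(s) {
        // Already interned: hand back the existing pointer.
        return existing;
    }
    let leaked: &'static str = Box::leak(s.to_owned().into_boxed_str());
    set.insert(leaked);
    leaked
}

/// Compares two interned strings by address instead of by contents.
fn same_interned(a: &'static str, b: &'static str) -> bool {
    std::ptr::eq(a, b)
}
```

Because equal values always map to the same leaked allocation, `same_interned` can stand in for a full string comparison, and the returned reference is Copy for free.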
Example
use internment::Intern;
let x = Intern::new("hello");
let y = Intern::new("world");
assert_ne!(x, y);
Comparison with other interning crates
There are already several interning crates available on crates.io. What makes internment different? Many of the interning crates are specific to strings. The general purpose interning crates are:
Each of these crates implements arena allocation, with tokens of various sizes to reference an interned object. This approach makes them far more challenging to use than internment. Their approach also enables freeing of all interned objects at once when they go out of scope (which is an advantage).
The primary disadvantages of arena allocation relative to internment's approach are:
- Lookup of a token could fail, either because an invalid token could be generated by hand, or a token from one pool could be used by another. This adds an element of unsafety to code that uses interned objects: either they assume that they are bug-free and panic on errors, or they have error handling any place that uses tokens.
- Lookup of a token could give the wrong object, if multiple pools are used. This is easy to avoid if you avoid ever using more than one pool, but then you may gain little benefit from the arena allocation.
- Lookup of a token is slow. They all advertise being fast, but any lookup is going to be slower than pointer dereferencing. To be fair, increased memory locality could in principle make token lookup faster for some access patterns, but I doubt it.
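The first two failure modes can be reproduced with a toy token pool. The `Pool` and `Token` types below are hypothetical and do not model any particular crate; they only assume the general design described above, where a token is an index into a pool's storage.

```rust
/// A hypothetical arena-style pool: interned values live in a Vec and
/// callers hold an index ("token") into it.
struct Pool {
    items: Vec<String>,
}

/// A token is just an index, with no record of which pool issued it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Token(usize);

impl Pool {
    fn new() -> Self {
        Pool { items: Vec::new() }
    }

    /// Returns the token for `s`, storing `s` if it is not yet present.
    /// A linear scan keeps the sketch short; a real pool would use a map.
    fn intern(&mut self, s: &str) -> Token {
        if let Some(i) = self.items.iter().position(|x| x.as_str() == s) {
            return Token(i);
        }
        self.items.push(s.to_owned());
        Token(self.items.len() - 1)
    }

    /// Lookup has to be fallible, because any index can be presented.
    fn lookup(&self, t: Token) -> Option<&str> {
        self.items.get(t.0).map(String::as_str)
    }
}
```

With two pools, say `names` holding "alice" and "bob" and `colors` holding only "red", the token for "alice" looks up "red" in `colors` (the wrong object), and the token for "bob" looks up nothing at all (a failed lookup).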
To balance this, because internment has tokens that are globally valid, it uses a Mutex to protect its internal data. The Mutex is taken when interning new data and when changing reference counts, which is probably slower than the other interning crates (unless you want to use their tokens across threads, in which case you would have to put the pool in a Mutex and pay the same penalty).
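To make the locking cost concrete, here is a rough sketch of a reference-counted interner whose global table is guarded by a Mutex. It is an illustration under stated assumptions, not the crate's implementation: `CountedIntern`, `table`, and `live_values` are hypothetical names, only strings are handled, and an `Arc<str>` is used for storage to keep the code safe and short.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Global table mapping each live value to its number of handles.
// Hypothetical layout for illustration only.
fn table() -> &'static Mutex<HashMap<Arc<str>, usize>> {
    static TABLE: OnceLock<Mutex<HashMap<Arc<str>, usize>>> = OnceLock::new();
    TABLE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// A handle to an interned string; the value is removed from the table
/// when the last handle is dropped.
struct CountedIntern(Arc<str>);

impl CountedIntern {
    /// Interning takes the lock to find or insert the value.
    fn new(s: &str) -> Self {
        let mut t = table().lock().unwrap();
        let existing = t.get_key_value(s).map(|(k, _)| Arc::clone(k));
        let key = existing.unwrap_or_else(|| Arc::from(s));
        *t.entry(Arc::clone(&key)).or_insert(0) += 1;
        CountedIntern(key)
    }

    /// True if both handles point at the same stored value.
    fn ptr_eq(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}

impl Clone for CountedIntern {
    /// Cloning takes the lock to bump the count.
    fn clone(&self) -> Self {
        let mut t = table().lock().unwrap();
        *t.get_mut(&*self.0).expect("live handle is in the table") += 1;
        CountedIntern(Arc::clone(&self.0))
    }
}

impl Drop for CountedIntern {
    /// Dropping takes the lock to decrement, and frees the entry at zero.
    fn drop(&mut self) {
        let mut t = table().lock().unwrap();
        let n = t.get_mut(&*self.0).expect("live handle is in the table");
        *n -= 1;
        if *n == 0 {
            t.remove(&*self.0);
        }
    }
}

/// Number of distinct values currently interned.
fn live_values() -> usize {
    table().lock().unwrap().len()
}
```

Every `new`, `clone`, and `drop` in this sketch goes through the same lock, which is the penalty described above; a leak-forever design avoids it on copy and drop at the price of never reclaiming memory.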