A Serde addon that allows interning of strings and byte sequences behind Arcs during deserialization.
Unlike the stock Rc/Arc deserialization available in the main Serde crate, these custom deserializer functions find duplicate values and, instead of wrapping each of them into an individual Arc, reuse the existing Arcs.
§Example
use serde::Deserialize;
use serde_intern::{clear_arc_cache, intern_arc_str};
use std::sync::Arc;

#[derive(Deserialize)]
struct Person {
    // add a custom deserializer hook
    #[serde(deserialize_with = "intern_arc_str")]
    name: Arc<str>,
}
// when deserializing:
let json = r#"[
{ "name": "Yenna" },
{ "name": "Yenna" },
{ "name": "Yenna" }
]"#;
let people: Vec<Person> = serde_json::from_str(json)?;
// All structs share the same text sequence "Yenna" through reference
// counting. One extra reference is held by the internal lookup table.
let first = &people[0];
assert_eq!(Arc::strong_count(&first.name), 4);
// This function clears up the lookup table.
clear_arc_cache();
assert_eq!(Arc::strong_count(&first.name), 3);
Currently serde-intern supports string slices and slices of bytes.
More types can be added later.
§Note:
While this library allows sharing a common data storage across multiple deserialized entities, it is NOT zero-copy.
The first time a new sequence is encountered, it is copied to a newly created heap region managed by Arc.
To avoid copying data and instead refer to text sequences in the underlying buffer, you should use Serde's built-in borrow deserializer attribute instead:
use serde::Deserialize;
use std::borrow::Cow;

#[derive(Deserialize)]
struct Person<'storage> {
    #[serde(borrow)]
    name: Cow<'storage, str>,
}
Note that in this case the deserialized struct needs to keep the raw data in memory, as denoted by the 'storage lifetime annotation.
serde-intern lets you drop the underlying buffer at the cost of a single copy.
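The trade-off above can be illustrated with the standard library alone. The following is a minimal sketch, not code from serde-intern: the helper names borrow_name, copy_name and name_outlives_buffer are hypothetical, and trimming whitespace merely stands in for whatever a deserializer would extract from the input.

```rust
use std::borrow::Cow;
use std::sync::Arc;

/// Zero-copy: the returned Cow points into `buffer`, so it cannot outlive it.
fn borrow_name<'storage>(buffer: &'storage str) -> Cow<'storage, str> {
    Cow::Borrowed(buffer.trim())
}

/// One copy: the returned Arc owns its bytes and is independent of `buffer`.
fn copy_name(buffer: &str) -> Arc<str> {
    Arc::from(buffer.trim())
}

/// The Arc stays valid after the buffer it was built from is dropped.
fn name_outlives_buffer() -> Arc<str> {
    let buffer = String::from("  Yenna  ");
    let name = copy_name(&buffer);
    drop(buffer); // fine: `name` holds its own copy of the text
    name
}
```

Writing the same function as name_outlives_buffer with borrow_name would not compile, because the Cow would still borrow from the dropped buffer.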
§Implementation details
To track previously observed string slices and compare them with the slice currently being deserialized, the library maintains a lookup table.
Its memory overhead is fairly small: it's a HashMap<u64, Arc<str>>, so each entry is a u64 hash paired with an Arc<str> handle (a pointer and a length) behind the scenes.
We use string hashes for keys to avoid extra memory overhead and not to
force storing string references for a long time.
In case of a hash collision the library will wrap the string into a separate new Arc.
To speed things up we use a non-standard fast hash function from the rustc-hash crate.
The lookup table is stored as a thread-local to avoid synchronization.
While the overhead is minimal, the library does offer a clear_arc_cache hook to clear the lookup tables.
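The scheme described in this section can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's actual source: the names CACHE, hash_of, intern and clear_cache are hypothetical, and std's DefaultHasher stands in for the rustc-hash function.

```rust
use std::cell::RefCell;
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

thread_local! {
    // Hypothetical per-thread lookup table: string hash -> shared Arc.
    static CACHE: RefCell<HashMap<u64, Arc<str>>> = RefCell::new(HashMap::new());
}

/// Assumption: DefaultHasher replaces the faster rustc-hash function here.
fn hash_of(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

/// Returns a shared Arc for `s`, reusing a cached one when the contents match.
fn intern(s: &str) -> Arc<str> {
    let key = hash_of(s);
    CACHE.with(|cache| {
        let mut table = cache.borrow_mut();
        if let Some(existing) = table.get(&key) {
            if &**existing == s {
                // Same hash and same contents: hand out another reference.
                return Arc::clone(existing);
            }
            // Same hash, different contents: a collision, so build a
            // separate Arc and leave the table untouched.
            return Arc::<str>::from(s);
        }
        // First occurrence: copy the text once and remember it.
        let fresh: Arc<str> = Arc::from(s);
        table.insert(key, Arc::clone(&fresh));
        fresh
    })
}

/// Drops the references held by the current thread's table.
fn clear_cache() {
    CACHE.with(|cache| cache.borrow_mut().clear());
}
```

In this sketch, interning "Yenna" three times yields three handles to one allocation plus the table's own reference, which mirrors the strong counts of 4 and 3 shown in the example at the top of the page.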
Functions§
- clear_arc_cache - This function will clear up lookup tables for the current-thread only!
- intern_arc_str - A Serde deserializer hook that allows multiple structs to share the same string slice between them.
- intern_arc_u8s - A Serde deserializer hook that allows multiple structs to share the same slice of data between them.