Crate serde_intern

A Serde addon that allows interning of strings and byte sequences behind Arcs during deserialization.

Unlike the stock Rc / Arc deserialization available in the main Serde crate, these custom deserializer functions detect duplicate values. Instead of wrapping each duplicate in its own Arc, they reuse the Arc that already holds an equal value.
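The difference can be illustrated without Serde at all. The sketch below uses only the standard library; `wrap_individually` and `wrap_with_reuse` are hypothetical helper names, and the linear scan stands in for whatever lookup the crate really performs.

```rust
use std::sync::Arc;

/// Wraps every value in its own Arc, which is roughly what a plain
/// per-field Arc deserializer amounts to: one allocation per value.
fn wrap_individually(values: &[&str]) -> Vec<Arc<str>> {
    values.iter().map(|v| Arc::from(*v)).collect()
}

/// Reuses the Arc of an equal value seen earlier.
/// The linear scan is for illustration only, not how the crate does it.
fn wrap_with_reuse(values: &[&str]) -> Vec<Arc<str>> {
    let mut out: Vec<Arc<str>> = Vec::with_capacity(values.len());
    for value in values {
        // Look for an earlier Arc holding the same text and clone the handle.
        let reused = out.iter().find(|arc| &arc[..] == *value).cloned();
        // Otherwise copy the text into a fresh Arc-managed allocation.
        out.push(reused.unwrap_or_else(|| Arc::from(*value)));
    }
    out
}
```

With three equal inputs, `wrap_individually` yields three allocations with a strong count of 1 each, while `wrap_with_reuse` yields three handles to a single allocation with a strong count of 3.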

§Example

use std::sync::Arc;

use serde::Deserialize;
use serde_intern::{clear_arc_cache, intern_arc_str};

#[derive(Deserialize)]
struct Person {
    // add a custom deserializer hook
    #[serde(deserialize_with="intern_arc_str")]
    name: Arc<str>,
}

// when deserializing:
let json = r#"[
    { "name": "Yenna" },
    { "name": "Yenna" },
    { "name": "Yenna" }
]"#;
let people: Vec<Person> = serde_json::from_str(json)?;

// All structs share the same text "Yenna" through reference counting.
// One extra reference is held by the internal lookup table.
let first = &people[0];
assert_eq!(Arc::strong_count(&first.name), 4);

// This function clears up the lookup table.
clear_arc_cache();
assert_eq!(Arc::strong_count(&first.name), 3);

Currently serde-intern supports string slices and slices of bytes. More types can be added later.

§Note:

While this library allows sharing common data storage across multiple deserialized entities, it is NOT zero-copy. The first time a new sequence is encountered, it is copied into a newly created heap allocation managed by Arc. To avoid copying data and instead refer to text sequences in the underlying buffer, use Serde’s built-in borrow attribute instead:


use std::borrow::Cow;

use serde::Deserialize;

#[derive(Deserialize)]
struct Person<'storage> {
    #[serde(borrow)]
    name: Cow<'storage, str>,
}

Note that in this case the raw data must stay in memory for as long as the deserialized struct is alive, as denoted by the 'storage lifetime annotation.

serde-intern lets you drop the underlying buffer at the cost of a single copy per distinct value.
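The trade-off can be shown with the standard library alone. This is a minimal sketch, not the crate's code: `borrow_name`, `copy_name`, and `name_after_buffer_is_gone` are hypothetical names chosen for illustration.

```rust
use std::borrow::Cow;
use std::sync::Arc;

/// Zero-copy: the result points into `buffer` and cannot outlive it.
fn borrow_name<'storage>(buffer: &'storage str) -> Cow<'storage, str> {
    Cow::Borrowed(buffer)
}

/// One copy: the text is moved into an Arc-managed allocation,
/// so the result no longer depends on the input buffer.
fn copy_name(buffer: &str) -> Arc<str> {
    Arc::from(buffer)
}

/// The buffer lives only inside this function, yet the Arc can be returned.
/// Returning `borrowed` instead would be rejected by the borrow checker.
fn name_after_buffer_is_gone() -> Arc<str> {
    let buffer = String::from("Yenna");
    let borrowed = borrow_name(&buffer);
    copy_name(&borrowed)
    // `borrowed` and `buffer` are dropped here; the Arc keeps its own copy.
}
```

The borrowed form suits short-lived parsing where the input stays around anyway; the Arc form suits data that outlives the input.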

§Implementation details

To track previously observed string slices and compare them with the slice currently being deserialized, the library maintains a lookup table. Its memory overhead is fairly small: it’s a HashMap<u64, Arc<str>>, so each entry holds only a u64 hash and an Arc handle (a pointer plus a length), not a second copy of the string. We use string hashes as keys to avoid extra memory overhead and to avoid holding string references for a long time. In case of a hash collision the library will wrap the string into a separate new Arc.

To speed things up we use a fast non-standard hash function from the rustc-hash crate. The lookup table is stored in a thread-local to avoid synchronization. While the overhead is minimal, the library does offer the clear_arc_cache hook to clear the lookup table.
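The design described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the crate's actual source: `TABLE`, `hash_of`, `intern`, and `clear_table` are hypothetical names, the standard library's `DefaultHasher` is swapped in for rustc-hash to keep the sketch dependency-free, and leaving the table untouched on a collision is an assumption.

```rust
use std::cell::RefCell;
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

thread_local! {
    // One table per thread, so no locking is required.
    static TABLE: RefCell<HashMap<u64, Arc<str>>> = RefCell::new(HashMap::new());
}

/// Hashes the text to a u64 key (DefaultHasher stands in for rustc-hash).
fn hash_of(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Returns a shared Arc for `value`, allocating only on first sight.
fn intern(value: &str) -> Arc<str> {
    TABLE.with(|table| {
        let mut table = table.borrow_mut();
        let key = hash_of(value);
        // Clone the handle (cheap) so the table is free to be mutated below.
        let cached = table.get(&key).cloned();
        match cached {
            // Same hash and same text: reuse the existing allocation.
            Some(existing) if &*existing == value => existing,
            // Same hash, different text: a collision. Fall back to a
            // separate Arc and leave the table unchanged (assumption).
            Some(_) => Arc::from(value),
            // First occurrence: copy once and remember the handle.
            None => {
                let fresh: Arc<str> = Arc::from(value);
                table.insert(key, Arc::clone(&fresh));
                fresh
            }
        }
    })
}

/// Drops the current thread's table, releasing its extra references.
fn clear_table() {
    TABLE.with(|table| table.borrow_mut().clear());
}
```

Because the table is thread-local, the same text interned on two different threads ends up in two separate allocations, and clearing the table on one thread does not affect the others.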

Functions§

clear_arc_cache
Clears the lookup tables for the current thread only.
intern_arc_str
A Serde deserializer hook that allows multiple structs to share the same string slice between them.
intern_arc_u8s
A Serde deserializer hook that allows multiple structs to share the same slice of data between them.