serde_describe
Make a non-self-describing serde format (like
bincode, bitcode or
postcard) behave as like a self-describing one by
transparently serializing a schema alongside (or separately
from) the data.
use ;
use ;
// Define a type that non-self-describing formats would generally struggle
// with, using skipped fields and untagged unions.
// Serialize a `Vec<TopLevel>` using the `SelfDescribed` wrapper and bitcode, a
// particularly restrictive binary format.
let original = vec!;
let bytes = serialize?;
// Then deserialize using the same wrapper to use the embedded schema.
let roundtripped =
?;
assert_eq!;
# Ok::
Trade-offs
Broadly, the library prioritizes, in order, deserialization speed, size speed and finally serialization speed.
-
Deserialization Speed. Should only introduce a small amount of overhead compared to the underlying format. The only additional operations performed are validation and dispatching based on a relatively small, cache-friendly schema object. In addition to the schema object itself, the only other data the wrapper adds is discriminants to tag dynamically shaped data (structs with skipped fields and enums).
-
Size. Given a compact underlying format, the
SelfDescribedrepresentation should end up more compact than most self-describing formats for large enough objects, as type information (shapes, field names and variant names) is deduplicated across the entire object. See the Advanced Usage section on how to amortize the schema overhead across multiple objects as well. -
Serialization Speed. The library is optimized for data that is written once and read many times. As such, various schema optimization passes are performed to keep the schema size small (and not grow unbounded with the input data).
Note that these goals are, at this point, mostly a statement of intent and reflect only high-level architectural choices. On the nuts and bolts of the implementation, the crate hasn't gone through extensive profiling and optimization yet.
Current limitations
The crate's current implementation has a few limitations. None of these are fundamental architectural constraints and they could all be removed backwards-compatibly if there is sufficient demand / motivation to do so.
- Objects must be < ~4GiB in size. More precisely, the various indices used
internally by the library need to fit in 32-bit unsigned integers //! * At
most 64 "skippable" fields. This can be lifted by using a
BitVecinstead of as single. This includes the overall size of the object, but also the number of items in a collection etc. - ZST collections are not special-cased. A
vec![(); u32::MAX], will take an unnecessary amount of memory and time to serialize. - Structs can have at most 64 skippable fields. These are fields that, in
any given trace, appear as both present and absent. Fields that are always
present or always absent (irrespective of their
#[serde(skip_serializing_if)]attributes) do not count towards this limit.
Advanced usage: external schema
As shown above, the simplest way to use this library is to simply wrap your
data in SelfDescribed. However, if you're storing many objects of the same
type that you need to access independently (so you can't simply serialize a
Vec<Item>), then saving them each as SelfDescribed adds unnecessary
overhead due to repeatedly serializing the schema with each object
use ;
use SelfDescribed;
let bytes1 = to_stdvec?;
let bytes2 = to_stdvec?;
assert_eq!;
assert_eq!;
let SelfDescribed =
?;
let SelfDescribed =
?;
assert_eq!;
# Ok::
To avoid this overhead, serde_describe supports the SchemaBuilder API
which, while more complicated to use, makes schema serialization and
deserialization explicit and separate from the data
use ;
use ;
// A `SchemaBuilder` is used to trace multiple values, before serializing them
// and the inferred schema that describes all of them.
let mut builder = new;
let trace1 = builder.trace?;
let trace2 = builder.trace?;
let schema = builder.build?;
// Save the three objects without the schema, and the schema separately.
let bytes1 = to_stdvec?;
let bytes2 = to_stdvec?;
let bytes_schema = to_stdvec?;
// Each individual serialized object is now very small and the combined size of
// the two objects + schema is still smaller, as the schema is only saved once.
assert_eq!;
assert_eq!;
// To deserialize the data, first load the schema, then use it to deserialize
// `DescribedBy` objects for each of the items. Behind the scenes this uses
// `DeserializeSeed` so the underlying format needs to have a public API that
// supports it.
let schema = ?;
let item1: Item = schema.deserialize_described?;
let item2: Item = schema.deserialize_described?;
assert_eq!;
// You can also get a human-readable view of the inferred schema for debugging
// purposes.
assert_eq!;
# Ok::
How it works
You can completely skip this section unless you're curious about the internal implementation of the library.
Serializing anything with serde_describe is a two-pass process
(SelfDescribed simply does both of these two phases for you)
- An serializer runs on the given value producing a
Traceand aSchema. - The underlying serializer is used to save the pair of
(schema, transform(schema, trace)).
At deserialization time, a Deserializer is built which wraps the underlying
deserializer and walks the schema graph performing coercions and calling the
appropriate serde::de::Visitor methods.
We'll tackle the trace, schema and transform in turn.
Trace
The Trace is a replayable representation of all the serde::Serializer calls
made on the internal serializer (serialize_u32, serialize_struct,
serialize_struct(field_name, ...) etc).
The upside of using this intermediate representation is that it side-steps any
risks of non-deterministic serialization and its structure is chosen such that
matching back to the Schema later is simpler. The downside is that it
potentially doubles the amount of memory required during serialization, and it
is almost certainly slower than serializing the same type again would be.
Schema
The schema contains interned type, field and variant names, as well as a
graph for the finite state machine describing how to parse the stored object.
Importantly, it is a representation of the serialized data, not of the actual abstract type definition from Rust. So if certain kinds of data are never observed, information about them isn't serialized, saving space. The upshot is that this makes a particular schema only useful with the data that was traced to generate it, not generally useful for any value of given type.
To peek behind the curtain and bring the point home let's look first at a human human-readable description of the inferred schema for a non-trivial object.
use ;
use ;
let data = vec!;
// Print the serialized `SelfDescribed` object as RON making it easier to
// inspect. We'll look at this later.
eprintln!;
// The `display_for_value` method on `Schema` does the work of tracing the
// value and converts it to a human-readable format. The output isn't
// stable, but is useful for debugging.
assert_eq!;
Each type gets displayed with a _<id> suffix which maps to a unique node id
in the FSM graph. This way, the mostly tree-like schema can be displayed as
such, even though in reality it has cycles and repeated nodes. Children of
previously expanded nodes nodes are simply shown as ..; see for instance
union_4(..), not repeating the previous union_4(none_1, some_3(str_2)).
Getting back to the schema itself:
- Top-level, we see it's a
seq(sequence) ofParentstructs (braces mean struct). - The
always_skipped_by_serdefield doesn't show up at all. Its presence will not take up any space in serialized object - The
always_skipped_but_data_dependentfield shows up as optional (suffixed with?), but its type is!(the never type, like in rust). Its presence will not take up any space in serialized objects because it's inferred to never be present. - The
never_skippedfield, even though it /could/ be skipped byserdesimply shows up as "required" (the same astag), because it is always present in the traced data. - The
sometimes_skippedfield is sometimes persent and sometimes absent. As such, its presence will require 1 bit per serialized object. More detail on how this is actually represented in the transform section below. - The type of
sometimes_skippedissome_10(u32_9); because the value is neverNone, we don't actually need a discriminant for this type; when present, it's alwaysSome. - Unlike
sometimes_skipped,tagis sometimesNonesometimesSome, so we'll store it as a discriminated union betweenNoneandSome(String).Option-s are not treated in any special way, instead they act like any other unit struct and newtype struct respectively.
Looking at the children in never_skipped, note that the schema if unified
across different values of Parent and Child. Also note that Child::Unused
doesn't appear in union_7, because it's never serialized.
Transform
The transform function converts the object into a subset of serde features
that are supported by non-self-describing formats. In particular:
- Unions (as seen above) are used to turn heterogeneous sequences and maps into homogeneous collections of discriminated enums.
- Union discriminants rely on variant index, with original enum variant names deduplicated and stored in the schema.
- Structs are turned into tuples with field names, types and order deduplicated and stored in the schema.
- Structs (and struct variants) with skippable fields are turned into enums with the discriminant encoding a bitset of present fields. As mentioned above, bits are only used for fields that are sometimes present; fields that always absent or always present do not use up a bit.
- Discriminants with more than 256 values are chunked into nested enums of at
most 256 variants. This is because
serdeinsists on'staticvariant names. So the crate ships with a static array of variants named_00through to_ffwhich can be used any times they're required.
This is what the RON dump of the object from the Schema example, annotated
with comments and cleaned up a bit.
// seq_8
[
// Parent_11::_01 (discriminant means the `sometimes_skipped` field is present)
_01(
// never_skipped:
[
// union_7::_00 => Child::WithData_5
_00(
// mixed: union_4::_00 => None
_00,
// always_some: (no Some("...") or discriminant, as it's never None)
"value1",
),
// union_7::_00 => Child::WithData_5
_00(_00, "value2"),
// union_7::_01 => Child::NoData
_01,
],
// sometimes_skipped: (no discriminant as it's never None when present)
1234,
// tag: union_4::_00 => None
_00,
),
// Parent_11::_00 (`sometimes_skipped` absent)
_00(
// never_skipped:
[
// union_7::_00 => Child::WithData_5
_00(
// mixed: union_4::_01 => Some(String)
_01("value3"),
// always_some:
"value4",
),
],
// tag: union_4::_01 => Some(String)
_01("tag_value"),
),
]