1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308
// Copyright (c) The Libra Core Contributors // SPDX-License-Identifier: Apache-2.0 #![forbid(unsafe_code)] //! # Libra Canonical Serialization (LCS) //! //! LCS defines a deterministic means for translating a message or data structure into bytes //! irrespective of platform, architecture, or programming language. //! //! ## Background //! //! In Libra, participants pass around messages or data structures that often times need to be //! signed by a prover and verified by one or more verifiers. Serialization in this context refers //! to the process of converting a message into a byte array. Many serialization approaches support //! loose standards such that two implementations can produce two different byte streams that would //! represent the same, identical message. While for many applications, non-deterministic //! serialization causes no issues, it does so for applications using serialization for //! cryptographic purposes. For example, given a signature and a message, a verifier may not unable //! to produce the same serialized byte array constructed by the prover when the prover signed the //! message resulting in a non-verifiable message. In other words, to ensure message verifiability //! when using non-deterministic serialization, participants must either retain the original //! serialized bytes or risk losing the ability to verify messages. This creates a burden requiring //! participants to maintain both a copy of the serialized bytes and the deserialized message often //! leading to confusion about safety and correctness. While there exist a handful of existing //! deterministic serialization formats, there is no obvious choice. To address this, we propose //! Libra Canonical Serialization that defines a deterministic means for translating a message into //! bytes and back again. //! //! ## Specification //! //! LCS supports the following data types: //! //! * Booleans //! * Signed 8-bit, 16-bit, 32-bit, 64-bit, and 128-bit integers //! * Unsigned 8-bit, 16-bit, 32-bit, 64-bit, and 128-bit integers //! * Option //! * Unit (an empty value) //! * Fixed and variable length sequences //! * UTF-8 Encoded Strings //! * Tuples //! * Structures (aka "structs") //! * Externally tagged enumerations (aka "enums") //! * Maps //! //! ## General structure //! //! LCS is not a self-describing format and as such, in order to deserialize a message, one must //! know the message type and layout ahead of time. //! //! Unless specified, all numbers are stored in little endian, two's complement format. //! //! ## Recursion and Depth of LCS Data //! //! Recursive data-structures (e.g. trees) are allowed. However, because of the possibility of stack //! overflow during (de)serialization, the *container depth* of any valid LCS data cannot exceed the constant //! `MAX_CONTAINER_DEPTH`. Formally, we define *container depth* as the number of structs and enums traversed //! during (de)serialization. //! //! This definition aims to minimize the number of operations while ensuring that //! (de)serialization of a known LCS format cannot cause arbitrarily large stack allocations. //! //! As an example, if `v1` and `v2` are values of depth `n1` and `n2`, //! * a struct value `Foo { v1, v2 }` has depth `1 + max(n1, n2)`; //! * an enum value `E::Foo { v1, v2 }` has depth `1 + max(n1, n2)`; //! * a pair `(v1, v2)` has depth `max(n1, n2)`; //! * the value `Some(v1)` has depth `n1`. //! //! All string and integer values have depths `0`. //! //! ### Booleans and Integers //! //! |Type |Original data |Hex representation |Serialized format | //! |--- |--- |--- |--- | //! |Boolean |True / False |0x01 / 0x00 |[01] / [00] | //! |8-bit signed integer |-1 |0xFF |[FF] | //! |8-bit unsigned integer |1 |0x01 |[01] | //! |16-bit signed integer |-4660 |0xEDCC |[CCED] | //! |16-bit unsigned integer |4660 |0x1234 |[3412] | //! |32-bit signed integer |-305419896 |0xEDCBA988 |[88A9CBED] | //! |32-bit unsigned integer |305419896 |0x12345678 |[78563412] | //! |64-bit signed integer |-1311768467750121216 |0xEDCBA98754321100 |[0011325487A9CBED] | //! |64-bit unsigned integer |1311768467750121216 |0x12345678ABCDEF00 |[00EFCDAB78563412] | //! //! ### ULEB128-Encoded Integers //! //! The LCS format also uses the [ULEB128 encoding](https://en.wikipedia.org/wiki/LEB128) internally //! to represent unsigned 32-bit integers in two cases where small values are usually expected: //! (1) lengths of variable-length sequences and (2) tags of enum values (see the corresponding //! sections below). //! //! |Type |Original data |Hex representation |Serialized format | //! |--- |--- |--- |--- | //! |ULEB128-encoded u32-integer|2^0 = 1 |0x00000001 |[01] | //! | |2^7 = 128 |0x00000080 |[8001] | //! | |2^14 = 16384 |0x00004000 |[808001] | //! | |2^21 = 2097152 |0x00200000 |[80808001] | //! | |2^28 = 268435456 |0x10000000 |[8080808001] | //! | |9487 |0x0000250f |[8f4a] | //! //! In general, a ULEB128 encoding consists of a little-endian sequence of base-128 (7-bit) //! digits. Each digit is completed into a byte by setting the highest bit to 1, except for the //! last (highest-significance) digit whose highest bit is set to 0. //! //! In LCS, the result of decoding ULEB128 bytes is required to fit into a 32-bit unsigned //! integer and be in canonical form. For instance, the following values are rejected: //! * `[808080808001]` (2^36) is too large. //! * `[8080808010]` (2^33) is too large. //! * `[8000]` is not a minimal encoding of 0. //! //! ### Optional Data //! //! Optional or nullable data either exists in its full representation or does not. LCS represents //! this as a single byte representing the presence `0x01` or absence `0x00` of data. If the data //! is present then the serialized form of that data follows. For example: //! //! ```rust //! # use libra_canonical_serialization::{Result, to_bytes}; //! # fn main() -> Result<()> { //! let some_data: Option<u8> = Some(8); //! assert_eq!(to_bytes(&some_data)?, vec![1, 8]); //! //! let no_data: Option<u8> = None; //! assert_eq!(to_bytes(&no_data)?, vec![0]); //! # Ok(())} //! ``` //! //! ### Fixed and Variable Length Sequences //! //! Sequences can be made of up of any LCS supported types (even complex structures) but all //! elements in the sequence must be of the same type. If the length of a sequence is fixed and //! well known then LCS represents this as just the concatenation of the serialized form of each //! individual element in the sequence. If the length of the sequence can be variable, then the //! serialized sequence is length prefixed with a ULEB128-encoded unsigned integer indicating //! the number of elements in the sequence. All variable length sequences must be //! `MAX_SEQUENCE_LENGTH` elements long or less. //! //! ```rust //! # use libra_canonical_serialization::{Result, to_bytes}; //! # fn main() -> Result<()> { //! let fixed: [u16; 3] = [1, 2, 3]; //! assert_eq!(to_bytes(&fixed)?, vec![1, 0, 2, 0, 3, 0]); //! //! let variable: Vec<u16> = vec![1, 2]; //! assert_eq!(to_bytes(&variable)?, vec![2, 1, 0, 2, 0]); //! //! let large_variable_length: Vec<()> = vec![(); 9_487]; //! assert_eq!(to_bytes(&large_variable_length)?, vec![0x8f, 0x4a]); //! # Ok(())} //! ``` //! //! ### Strings //! //! Only valid UTF-8 Strings are supported. LCS serializes such strings as a variable length byte //! sequence, i.e. length prefixed with a ULEB128-encoded unsigned integer followed by the byte //! representation of the string. //! //! ```rust //! # use libra_canonical_serialization::{Result, to_bytes}; //! # fn main() -> Result<()> { //! // Note that this string has 10 characters but has a byte length of 24 //! let utf8_str = "çå∞≠¢õß∂ƒ∫"; //! let expecting = vec![ //! 24, 0xc3, 0xa7, 0xc3, 0xa5, 0xe2, 0x88, 0x9e, 0xe2, 0x89, 0xa0, 0xc2, //! 0xa2, 0xc3, 0xb5, 0xc3, 0x9f, 0xe2, 0x88, 0x82, 0xc6, 0x92, 0xe2, 0x88, 0xab, //! ]; //! assert_eq!(to_bytes(&utf8_str)?, expecting); //! # Ok(())} //! ``` //! //! ### Tuples //! //! Tuples are typed composition of objects: `(Type0, Type1)` //! //! Tuples are considered a fixed length sequence where each element in the sequence can be a //! different type supported by LCS. Each element of a tuple is serialized in the order it is //! defined within the tuple, i.e. [tuple.0, tuple.2]. //! //! ```rust //! # use libra_canonical_serialization::{Result, to_bytes}; //! # fn main() -> Result<()> { //! let tuple = (-1i8, "libra"); //! let expecting = vec![0xFF, 5, b'l', b'i', b'b', b'r', b'a']; //! assert_eq!(to_bytes(&tuple)?, expecting); //! # Ok(())} //! ``` //! //! //! ### Structures //! //! Structures are fixed length sequences consisting of fields with potentially different types. //! Each field within a struct is serialized in the order specified by the canonical structure //! definition. Structs can exist within other structs and as such, LCS recurses into each struct //! and serializes them in order. There are no labels in the serialized format, the struct ordering //! defines the organization within the serialization stream. //! //! ```rust //! # use libra_canonical_serialization::{Result, to_bytes}; //! # use serde::Serialize; //! # fn main() -> Result<()> { //! # #[derive(Serialize)] //! struct MyStruct { //! boolean: bool, //! bytes: Vec<u8>, //! label: String, //! } //! //! # #[derive(Serialize)] //! struct Wrapper { //! inner: MyStruct, //! name: String, //! } //! //! let s = MyStruct { //! boolean: true, //! bytes: vec![0xC0, 0xDE], //! label: "a".to_owned(), //! }; //! let s_bytes = to_bytes(&s)?; //! let mut expecting = vec![1, 2, 0xC0, 0xDE, 1, b'a']; //! assert_eq!(s_bytes, expecting); //! //! let w = Wrapper { //! inner: s, //! name: "b".to_owned(), //! }; //! let w_bytes = to_bytes(&w)?; //! assert!(w_bytes.starts_with(&s_bytes)); //! //! expecting.append(&mut vec![1, b'b']); //! assert_eq!(w_bytes, expecting); //! # Ok(())} //! ``` //! //! ### Externally Tagged Enumerations //! //! An enumeration is typically represented as a type that can take one of potentially many //! different variants. In LCS, each variant is mapped to a variant index, a ULEB128-encoded 32-bit unsigned //! integer, followed by serialized data if the type has an associated value. An //! associated type can be any LCS supported type. The variant index is determined based on the //! ordering of the variants in the canonical enum definition, where the first variant has an index //! of `0`, the second an index of `1`, etc. //! //! ```rust //! # use libra_canonical_serialization::{Result, to_bytes}; //! # use serde::Serialize; //! # fn main() -> Result<()> { //! # #[derive(Serialize)] //! enum E { //! Variant0(u16), //! Variant1(u8), //! Variant2(String), //! } //! //! let v0 = E::Variant0(8000); //! let v1 = E::Variant1(255); //! let v2 = E::Variant2("e".to_owned()); //! //! assert_eq!(to_bytes(&v0)?, vec![0, 0x40, 0x1F]); //! assert_eq!(to_bytes(&v1)?, vec![1, 0xFF]); //! assert_eq!(to_bytes(&v2)?, vec![2, 1, b'e']); //! # Ok(())} //! ``` //! //! If you need to serialize a C-style enum, you should use a primitive integer type. //! //! ### Maps (Key / Value Stores) //! //! Maps are represented as a variable-length, sorted sequence of (Key, Value) tuples. Keys must be //! unique and the tuples sorted by increasing lexicographical order on the LCS bytes of each key. //! The representation is otherwise similar to that of a variable-length sequence. In particular, //! it is preceded by the number of tuples, encoded in ULEB128. //! //! ```rust //! # use libra_canonical_serialization::{Result, to_bytes}; //! # use std::collections::HashMap; //! # fn main() -> Result<()> { //! let mut map = HashMap::new(); //! map.insert(b'e', b'f'); //! map.insert(b'a', b'b'); //! map.insert(b'c', b'd'); //! //! let expecting = vec![(b'a', b'b'), (b'c', b'd'), (b'e', b'f')]; //! //! assert_eq!(to_bytes(&map)?, to_bytes(&expecting)?); //! # Ok(())} //! ``` //! //! ## Backwards compatibility //! //! Complex types dependent upon the specification in which they are used. LCS does not provide //! direct provisions for versioning or backwards / forwards compatibility. A change in an objects //! structure could prevent historical clients from understanding new clients and vice-versa. mod de; mod error; mod ser; pub mod test_helpers; /// Variable length sequences in LCS are limited to max length of 2^31 - 1. pub const MAX_SEQUENCE_LENGTH: usize = (1 << 31) - 1; /// Maximal allowed depth of LCS data, counting only structs and enums. pub const MAX_CONTAINER_DEPTH: usize = 500; pub use de::{from_bytes, from_bytes_seed}; pub use error::{Error, Result}; pub use ser::{is_human_readable, serialize_into, serialized_size, to_bytes};