pub fn compute_diff(original: &[u8], changed: &[u8]) -> AlignedVecExpand description
Compute diff between original and changed.
Returns:
- AlignedVec (a vector of u8) which is aligned to 16-byte boundary that can be used to construct DiffSet.
Note that DiffSet::try_new() requires the argument &u8 to be aligned to 4-byte boundary, but compute_diff provides much stricter alignment guarantee.
ref: https://docs.rs/rkyv/0.7.45/rkyv/util/struct.AlignedVec.html
Scenarios:
- original.len() == changed.len()
- diff is computed by comparing corresponding indices.
- bytes comparison
original: o1 o2 o3 o4 … oN
changed: c1 c2 c3 c4 … cN - diff consists of the bytes from the “changed” slice.
- original.len() < changed.len()
- that implies the account has been reallocated and expanded
- bytes comparison
original: o1 o2 o3 o4 … oN
changed: c1 c2 c3 c4 … cN cN+1 cN+2 … cN+M
- original.len() > changed.len()
- that implies the account has been reallocated and shrunk
- bytes comparison original: o1 o2 o3 o4 … oN oN+1 oN+2 … oN+M changed: c1 c2 c3 c4 … cN
Diff Format:
Since an account could have modifications at multiple places (i.e imagine modifying multiple fields in a struct), the diff could consist of multiple segments of u8, one segment for each contiguous change. So we compute these segments of change, and concatenate them to form one big segment/slice. However, in doing so, we lose the lengths of each segment and the position of change in the original data. So the headers below captures all these necessary information:
- Length of the given changed data (the second argument to compute_diff).
- Number of slices: first 4 bytes.
- Then follows offset-pairs, for each slice/segment, that captures offset in concatenated diff as well as in the original account.
- Then follows the concatenated diff bytes.
| Length | # Offset Pairs | Offset Pair 0 | Offset Pair 1 | … | Offset Pair N-1 | Concatenated Diff | |= 4 bytes =|==== 4 bytes ====|=== 8 bytes ===|=== 8 bytes ===| … |==== 8 bytes ====|==== M bytes =====|
Offset Pair Format:
| OffsetInDiffBytes | OffsetInAccountData | | ==== 4 bytes ===== | ===== 4 bytes ===== |
where,
- OffsetInDiffBytes is the index in ConcatenatedDiff indicating the beginning of the slice
and the next OffsetInDiffBytes from the next pair indicates the end of the slice, i.e the length of
slice[i] is OffsetInDiffBytes[i+1] - OffsetInDiffBytes[i].
- Note that the OffsetInDiffBytes is relative to the beginning of ConcatenatedDiff, not relative to the beginning of the whole serialized data, that implies the OffsetInDiffBytes in the first pair is always 0.
- M is a variable and is the sum of the length of the diff-slices.
- M = diff_segments.map(|s| s.len()).sum()
- The length of ith slice = OffsetInDiffBytes@(i+1) - OffsetInDiffBytes@i, as noted earlier.
Example,
Suppose we have an account with datalen = 100:
----------------------------
| a 100 bytes account |
----------------------------
| M | A | G | I | ... | C |
----------------------------
| 0 | 1 | 2 | 3 | ... | 99 |
----------------------------Also, suppose we have modifications at two locations (say u32 and u64), so we have 2 slices of size 4 and 8 bytes respectively, and the corresponding indices are the followings (note that these are “indices” in the original account, not the indices in the concatenated diff):
- | 11 | 12 | 13 | 14 | – 4 bytes
- | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | – 8 bytes
In this case, the serialized bytes would be:
------------------------------------------------------
| len | num | offsets | concatenated slices |
------------------------------------------------------
| 100 | 2 | 0 11 | 4 71 | 11 12 13 14 71 72 ... 78 | OffsetInAccountData
------------------------------------------------------
| 0 1 2 3 4 5 ... 11 | OffsetInDiffBytes
----------------------------- 100 : u32 : changed.len()
- 2 : u32 : number of offset pairs
- 0 11 : u32, u32 : first offset pair (offset in diff, offset in account)
- 4 71 : u32, u32 : second offset pair (offset in diff, offset in account)
- 11 … 78 : each u8 : concatenated diff bytes