# senax-encoder Binary Format Specification
**Version:** 1.1
**Date:** 2025
**Status:** Draft
## Table of Contents
1. [Overview](#overview)
2. [Format Basics](#format-basics)
3. [Tag System](#tag-system)
4. [Data Type Specifications](#data-type-specifications)
5. [Struct and Enum Encoding](#struct-and-enum-encoding)
6. [Schema Evolution](#schema-evolution)
7. [Implementation Notes](#implementation-notes)
## 1. Overview
The senax-encoder binary format is designed for efficient, compact serialization with a focus on forward and backward compatibility. Each value is tagged with a type identifier, enabling schema evolution and version compatibility.
### Key Design Principles
- **Compact Representation:** Variable-length encoding for common values
- **Self-describing:** Each value includes type information
- **Version Resilience:** Unknown fields/types can be safely skipped
- **Little Endian:** Consistent byte order across platforms
## 2. Format Basics
### 2.1 Byte Order
All multi-byte integers are encoded in **little-endian** format.
### 2.2 Basic Structure
All encoded values follow this pattern:
```
[TAG:u8] [DATA:variable]
```
Where:
- `TAG` is a single byte identifying the type and encoding method
- `DATA` is the encoded value, format depends on the tag
### 2.3 Variable-Length Integer Encoding
For optimal space efficiency, integers use variable-length encoding:
- Values 0-127: Encoded directly in the tag byte
- Larger values: Use dedicated tag + payload encoding
- Signed integers: Negative values use bit-inverted encoding (not ZigZag)
### 2.4 Optimized Field ID Encoding
Field IDs and variant IDs use an optimized encoding scheme for space efficiency:
**Encoding Rules:**
- **Field IDs 1-250**: Encoded as single `u8` byte
- **Field IDs 251+**: Encoded as `0xFF` marker byte followed by `u64` little-endian
- **Terminator**: Encoded as `0x00` byte to mark end of fields
**Format:**
```
// Small field ID (1-250)
[field_id:u8] [field_value]
// Large field ID (251+)
[0xFF] [field_id:u64_le] [field_value]
// Terminator
[0x00]
```
**Size Benefits:**
- Most field IDs (1-250) use only 1 byte instead of 8 bytes
- Terminator uses 1 byte instead of 8 bytes
- Large field IDs (rare) use 9 bytes (1 marker + 8 data)
**Examples:**
```
field_id=1 -> [0x01] // Direct u8 encoding
field_id=250 -> [0xFA] // Direct u8 encoding (250 = 0xFA)
field_id=251 -> [0xFF, 0xFB, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00] // Marker + u64_le
terminator -> [0x00] // End of fields
```
This optimization significantly reduces binary size for typical structs and enums while maintaining full u64 field ID range support.
## 3. Tag System
### 3.1 Tag Assignment
Tags are assigned in ranges for semantic grouping:
```rust
pub const TAG_ZERO: u8 = 0;
pub const TAG_ONE: u8 = 1;
// 2-127: Direct encoding for values 2-127
pub const TAG_U8_127: u8 = 127; // Value 127
// Extended integer types
pub const TAG_NONE: u8 = 128;
pub const TAG_SOME: u8 = 129;
pub const TAG_U8: u8 = 131;
pub const TAG_U16: u8 = 132;
pub const TAG_U32: u8 = 133;
pub const TAG_U64: u8 = 134;
pub const TAG_U128: u8 = 135;
pub const TAG_NEGATIVE: u8 = 136;
// Floating point
pub const TAG_F32: u8 = 137;
pub const TAG_F64: u8 = 138;
// Strings
pub const TAG_STRING_BASE: u8 = 139; // 139-179: Short strings (0-40 chars)
pub const TAG_STRING_LONG: u8 = 180;
// Collections and containers
pub const TAG_BINARY: u8 = 181;
pub const TAG_STRUCT_UNIT: u8 = 182;
pub const TAG_STRUCT_NAMED: u8 = 183;
pub const TAG_STRUCT_UNNAMED: u8 = 184;
pub const TAG_ENUM: u8 = 185;
pub const TAG_ENUM_NAMED: u8 = 186;
pub const TAG_ENUM_UNNAMED: u8 = 187;
pub const TAG_ARRAY_VEC_SET_BASE: u8 = 188; // 188-193: Short arrays (0-5 elements)
pub const TAG_ARRAY_VEC_SET_LONG: u8 = 194;
pub const TAG_TUPLE: u8 = 195;
pub const TAG_MAP: u8 = 196;
// Extended types (optional features)
pub const TAG_CHRONO_DATETIME: u8 = 197;
pub const TAG_CHRONO_NAIVE_DATE: u8 = 198;
pub const TAG_CHRONO_NAIVE_TIME: u8 = 199;
pub const TAG_DECIMAL: u8 = 200;
pub const TAG_UUID: u8 = 201; // Shared by UUID and ULID
```
## 4. Data Type Specifications
### 4.1 Boolean
**Encoding:**
- `false`: `TAG_ZERO` (0x00)
- `true`: `TAG_ONE` (0x01)
**Example:**
```
true -> [0x01]
false -> [0x00]
```
### 4.2 Unsigned Integers
**Compact Encoding (0-127):**
```
value -> [TAG_ZERO + value]
```
**Extended Encoding:**
```
u8 -> [TAG_U8] [value-128:u8] (range: 128-383)
u16 -> [TAG_U16] [value:u16_le] (range: 256-65535)
u32 -> [TAG_U32] [value:u32_le] (range: 65536-4294967295)
u64 -> [TAG_U64] [value:u64_le] (range: 4294967296-18446744073709551615)
u128 -> [TAG_U128] [value:u128_le] (range: 18446744073709551616+)
```
**Size Selection:**
- 0-127: Direct encoding (1 byte total)
- 128-383: u8 encoding (2 bytes total) - stores value-128
- 384-65535: u16 encoding (3 bytes total)
- etc.
**Examples:**
```
42 -> [0x2A] // TAG_ZERO + 42 = 0 + 42 = 42 = 0x2A
128 -> [0x83, 0x00] // TAG_U8, 128-128=0
255 -> [0x83, 0x7F] // TAG_U8, 255-128=127
383 -> [0x83, 0xFF] // TAG_U8, 383-128=255
384 -> [0x84, 0x80, 0x01] // TAG_U16, 384 in LE
```
### 4.3 Signed Integers
**Special Cases:**
- `0`: `TAG_ZERO` (0x00)
- `1`: `TAG_ONE` (0x01)
**Encoding Rule:**
- 0 and positive values: Encoded as unsigned integers
- Negative values: `TAG_NEGATIVE` (0x88) + bit-inverted encoding
**Format:**
```
// 0, positive values
[value:variable_uint]
// Negative values
[TAG_NEGATIVE] [(!n):variable_uint]
```
**Examples:**
```
0 -> [0x00] // TAG_ZERO
1 -> [0x01] // TAG_ONE
2 -> [0x02] // TAG_ZERO+2
-1 -> [0x88, 0x00] // TAG_NEGATIVE, !(-1)=0 -> TAG_ZERO
-2 -> [0x88, 0x01] // TAG_NEGATIVE, !(-2)=1 -> TAG_ONE
-128 -> [0x88, 0x7F] // TAG_NEGATIVE, !(-128)=127 -> TAG_ZERO+127
```
### 4.4 Floating Point
**Format:**
```
f32 -> [TAG_F32] [value:f32_le]
f64 -> [TAG_F64] [value:f64_le]
```
**Cross-Type Decoding:**
- f64 can be decoded as f32 (with potential precision loss)
- f32 to f64 cross-decoding is not supported due to precision ambiguity
### 4.5 Strings
**Short Strings (0-40 bytes):**
```
[TAG_STRING_BASE + length] [utf8_bytes]
```
**Long Strings:**
```
[TAG_STRING_LONG] [length:variable_uint] [utf8_bytes]
```
**Examples:**
```
"" -> [0x8B] // TAG_STRING_BASE + 0
"hi" -> [0x8D, 0x68, 0x69] // TAG_STRING_BASE + 2, "hi"
"long" -> [0xB4, 0x04, 0x6C, 0x6F, 0x6E, 0x67] // TAG_STRING_LONG, length=4, "long"
```
### 4.6 Option Types
**Format:**
```
None -> [TAG_NONE] // 0x80 (128)
Some(v) -> [TAG_SOME] [encoded_value] // 0x81 (129) + value
```
### 4.7 Collections
#### Arrays, Vectors, Sets
**Short Collections (0-5 elements):**
```
[TAG_ARRAY_VEC_SET_BASE + count] [element1] [element2] ...
```
**Long Collections:**
```
[TAG_ARRAY_VEC_SET_LONG] [count:variable_uint] [element1] [element2] ...
```
#### Maps
**Format:**
```
[TAG_MAP] [count:variable_uint] [key1] [value1] [key2] [value2] ...
```
#### Tuples
**Format:**
```
[TAG_TUPLE] [element_count:variable_uint] [element1] [element2] ...
```
### 4.8 Binary Data
**Vec<u8> and Bytes:**
```
[TAG_BINARY] [length:variable_uint] [raw_bytes]
```
### 4.9 Extended Types (Feature-Dependent)
#### DateTime (chrono feature)
**Format:**
```
[TAG_CHRONO_DATETIME] [seconds:i64] [nanos:u32]
```
All DateTime types (UTC, Local) are normalized to UTC for storage.
#### NaiveDate (chrono feature)
**Format:**
```
[TAG_CHRONO_NAIVE_DATE] [days_from_epoch:i64]
```
Epoch: 1970-01-01
#### NaiveTime (chrono feature)
**Format:**
```
[TAG_CHRONO_NAIVE_TIME] [seconds_from_midnight:u32] [nanoseconds:u32]
```
#### NaiveDateTime (chrono feature)
**Format:**
```
[TAG_CHRONO_NAIVE_DATETIME] [seconds:i64] [nanos:u32]
```
Stores as seconds and nanoseconds since Unix epoch (1970-01-01 00:00:00 UTC).
#### Decimal (rust_decimal feature)
**Format:**
```
[TAG_DECIMAL] [mantissa:i128] [scale:u32]
```
#### UUID/ULID (uuid/ulid features)
**Format:**
```
[TAG_UUID] [value:u128_le]
```
**Note:** UUID and ULID share the same tag and are binary compatible at the encoding level.
### 4.14 serde_json::Value (Feature: serde_json)
Dynamic JSON values are supported when the `serde_json` feature is enabled. Each JSON value variant has its own tag:
- **TAG_JSON_NULL (202)**: JSON null value
- **TAG_JSON_BOOL (203)**: JSON boolean (uses existing boolean encoding)
- **TAG_JSON_NUMBER (204)**: JSON number with type preservation
- **TAG_JSON_STRING (205)**: JSON string (uses existing string encoding)
- **TAG_JSON_ARRAY (206)**: JSON array
- **TAG_JSON_OBJECT (207)**: JSON object
#### JSON Number Encoding
JSON numbers are encoded with type preservation to maintain integer/float distinction:
**Format:** `TAG_JSON_NUMBER` + `type_marker` + `value`
- `type_marker = 0`: Unsigned integer, followed by u64 encoding
- `type_marker = 1`: Signed integer, followed by i64 encoding
- `type_marker = 2`: Float, followed by f64 encoding
**Examples:**
- `42` (integer) → `[204, 0, ...]` (TAG_JSON_NUMBER, unsigned integer marker, i64 encoding)
- `3.14159` (float) → `[204, 2, ...]` (TAG_JSON_NUMBER, float marker, f64 encoding)
#### JSON Array Encoding
**Format:** `TAG_JSON_ARRAY` + `length` + `elements...`
#### JSON Object Encoding
**Format:** `TAG_JSON_OBJECT` + `length` + `(key, value)...`
Keys are encoded as strings, values are recursively encoded as JSON values.
**Examples:**
- `null` → `[202]`
- `true` → `[203, 4]` (TAG_JSON_BOOL, TAG_ONE)
- `"hello"` → `[205, 144]` (TAG_JSON_STRING, string encoding)
- `[]` → `[206, 3]` (TAG_JSON_ARRAY, length 0)
- `{}` → `[207, 3]` (TAG_JSON_OBJECT, length 0)
## 5. Struct and Enum Encoding
### 5.1 Unit Structs
**Format:**
```
[TAG_STRUCT_UNIT]
```
### 5.2 Named Field Structs
**Format:**
```
[TAG_STRUCT_NAMED] [field_id_optimized] [field_value] ... [0x00]
```
**Field Encoding Rules:**
- Each field is encoded as `[field_id_optimized] [field_value]`
- Field IDs are derived from field names (CRC64(ECMA-182) hash) or custom `#[senax(id=n)]` attributes
- Field IDs 1-250 are encoded as single `u8` bytes
- Field IDs 251+ are encoded as `0xFF` marker + `u64` little-endian
- Optional fields with `None` values are omitted entirely
- Terminator: single zero byte (0x00) marks end of fields
### 5.3 Unnamed Field Structs (Tuples)
**Format:**
```
[TAG_STRUCT_UNNAMED] [field_count:variable_uint] [field1] [field2] ...
```
### 5.4 Enums
#### Unit Variants
**Format:**
```
[TAG_ENUM] [variant_id_optimized]
```
#### Named Field Variants
**Format:**
```
[TAG_ENUM_NAMED] [variant_id_optimized] [field_id_optimized] [field_value] ... [0x00]
```
#### Unnamed Field Variants
**Format:**
```
[TAG_ENUM_UNNAMED] [variant_id_optimized] [field_count:variable_uint] [field1] [field2] ...
```
**Variant ID Assignment:**
- Derived from variant name (CRC64 hash) or custom `#[senax(id=n)]` attributes
- Variant IDs 1-250 are encoded as single `u8` bytes
- Variant IDs 251+ are encoded as `0xFF` marker + `u64` little-endian
- Must be stable across versions for compatibility
## 6. Schema Evolution
### 6.1 Forward Compatibility
**Adding Fields:**
- New optional fields: Automatically handled (default to None)
- New required fields: Must have defaults or be made optional
- In addition to having a Rust default value, you **must** explicitly annotate the field with `#[senax(default)]` to ensure forward/backward compatibility.
- Fields with `#[senax(skip_default)]`: Only encoded when value differs from default, automatically use default value when missing during decode
**Adding Enum Variants:**
- Use custom `#[senax(id=n)]` for stable IDs
- Unknown variants cause decode errors
### 6.2 Backward Compatibility
**Removing Fields:**
- Unknown field IDs are automatically skipped during decoding
- No decoder changes required
**Removing Enum Variants:**
- May cause decode errors if old data contains removed variants
- Consider deprecation strategy
### 6.3 Field Reordering
Field order changes are automatically handled due to ID-based encoding.
### 6.4 Type Changes
**Compatible Changes:**
- `u32` ↔ `i64` (if values fit)
- `f32` ↔ `f64`
- `u32` → `Option<u32>`
**Incompatible Changes:**
- `String` → `u32`
- `Vec<T>` → `HashMap<K,V>`
- None → Required
## 7. Implementation Notes
### 7.1 Skip Function
Decoders must implement a `skip_value()` function that can skip unknown tagged values without parsing them. This enables forward compatibility.
### 7.2 Error Handling
**Decode Errors:**
- Invalid UTF-8 in strings
- Unknown enum variants
- Malformed data (unexpected EOF, invalid tags)
- Type conversion failures
### 7.3 Endianness
All multi-byte values use little-endian encoding for consistency across platforms.
---
**This specification defines the complete binary format for senax-encoder. Implementations should follow these rules exactly to ensure cross-version and cross-platform compatibility.**