synta 0.1.11 - Docs.rs

# ASN.1 Primitive Performance

```bash
cargo bench -p synta-bench --bench encoding
cargo bench -p synta-bench --bench derive_performance
cargo bench -p synta-bench --bench constrained_integers
```

## Tag/Length Parsing

| Operation                | Time     |
| ------------------------ | -------- |
| Short length (1-byte)    | 6.09 ns  |
| Long length (multi-byte) | 6.95 ns  |

Tag+length parsing is the inner loop of the DER decoder. The 0.86 ns difference between
short and long lengths reflects the cost of the extra branch and multi-byte length assembly
in the BER/DER long-form path. Both paths are branch-predicted and cache-resident in
production workloads.
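
The two length forms compared above can be sketched as follows. This is a minimal illustration of BER/DER length parsing, not synta's actual decoder; `parse_length` is a hypothetical helper name, and the sketch does not enforce DER's minimal-length rule.

```rust
/// Parses a BER/DER length field at the start of `input`.
/// Returns `(length, bytes_consumed)`.
/// Hypothetical helper for illustration; not part of synta's API.
fn parse_length(input: &[u8]) -> Option<(usize, usize)> {
    let first = *input.first()?;
    if (first & 0x80) == 0 {
        // Short form: the single byte is the length itself (0..=127).
        return Some((first as usize, 1));
    }
    // Long form: the low 7 bits give the number of length bytes that follow.
    let n = (first & 0x7f) as usize;
    // n == 0 is the indefinite form, which DER forbids; also reject
    // lengths that cannot fit in a usize.
    if n == 0 || n > std::mem::size_of::<usize>() {
        return None;
    }
    let bytes = input.get(1..1 + n)?;
    // Assemble the big-endian length one byte at a time.
    let len = bytes.iter().fold(0usize, |acc, &b| (acc << 8) | b as usize);
    Some((len, 1 + n))
}
```

The long-form path adds one branch and a short loop over at most a handful of bytes, which is consistent with the sub-nanosecond gap in the table.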

## Integer Encode/Decode

| Operation                | Time    |
| ------------------------ | ------- |
| Encode small (42)        | 31.3 ns |
| Encode medium (i64::MAX) | 34.4 ns |
| Encode large (i128::MAX) | 31.5 ns |
| Decode small             | 13.2 ns |
| Decode medium            | 13.4 ns |
| Roundtrip integer_42     | 43.9 ns |

Decode cost (~13 ns) is nearly independent of integer size because the decoder reads the
tag+length and then copies the content bytes into a `SmallVec<[u8; 16]>`: a copy of at
most 16 bytes on the stack. Encode cost varies slightly by value because the encoder must
determine the minimum byte representation and emit any leading sign byte (`0x00` or
`0xFF`) that the two's-complement encoding requires.
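
As a rough illustration of the "minimum byte representation" step, the sketch below trims redundant leading bytes from an `i64`. It assumes nothing about synta's internals; `minimal_integer_bytes` is a hypothetical name, and it returns a `Vec<u8>` for simplicity where the library is described as using an inline `SmallVec`.

```rust
/// Returns the minimal big-endian two's-complement content bytes for
/// `value`, as DER requires for INTEGER.
/// Illustrative helper only; not synta's encoder.
fn minimal_integer_bytes(value: i64) -> Vec<u8> {
    let bytes = value.to_be_bytes();
    let mut start = 0;
    // A leading byte is redundant when it only repeats the sign of the
    // next byte: 0x00 before a byte with the top bit clear, or 0xFF
    // before a byte with the top bit set.
    while start < bytes.len() - 1 {
        let next_top_bit_set = (bytes[start + 1] & 0x80) != 0;
        let redundant = (bytes[start] == 0x00 && !next_top_bit_set)
            || (bytes[start] == 0xff && next_top_bit_set);
        if !redundant {
            break;
        }
        start += 1;
    }
    bytes[start..].to_vec()
}
```

For example, 42 needs one content byte (`2A`), while 128 needs two (`00 80`) because a lone `80` would decode as -128.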

## Constrained INTEGER — Native Primitive Types

```bash
cargo bench -p synta-bench --bench constrained_integers
```

When a schema declares `INTEGER (lo..hi)`, `synta-codegen` selects the smallest native
Rust primitive (`u8`, `u16`, `u32`, `u64`, `i8`, `i16`, `i32`, `i64`) that covers the
constraint range, instead of the general-purpose `Integer` wrapper. This benchmark
measures the memory and runtime cost difference between the two representations using
four constrained newtype examples with non-trivial ranges:

- `ConstrainedU8` — `INTEGER (0..200)`, stored as `u8`
- `ConstrainedU16` — `INTEGER (0..10000)`, stored as `u16`
- `ConstrainedI16` — `INTEGER (-1000..1000)`, stored as `i16`
- `ConstrainedI64` — `INTEGER (-1000000000..1000000000)`, stored as `i64`

The primary benefit is memory layout; the trade-off is a slightly slower decode path due
to extra validation steps that enforce the declared constraint at decode time.

### Struct Size

| Field type                 | Size |
| -------------------------- | ---- |
| `Integer` (unconstrained)  | 32 B |
| `u8` (0..=200)             | 1 B  |
| `u16` (0..=10 000)         | 2 B  |
| `i16` (−1 000..=1 000)     | 2 B  |
| `i64` (−1e9..=1e9)         | 8 B  |

A struct with three `Integer` fields occupies 96 B; the equivalent with `u8` + `u16` +
`i64` fields occupies 16 B (11 B of field data, padded to the 8-byte alignment of `i64`),
which is 6× smaller. In schemas with many integer fields (e.g., Kerberos KDC-REQ-BODY,
SNMP PDUs) this significantly reduces cache pressure when processing large volumes of
messages.
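
The 16 B figure can be checked with `std::mem::size_of`. The struct below is a hypothetical stand-in for generated code (the field names are invented), and the result assumes a typical 64-bit target where `i64` has 8-byte alignment.

```rust
use std::mem::{align_of, size_of};

/// Stand-in for a struct generated from three constrained INTEGER fields.
/// Field names are hypothetical.
#[allow(dead_code)]
struct ConstrainedStruct {
    small: u8,   // INTEGER (0..200)
    medium: u16, // INTEGER (0..10000)
    large: i64,  // INTEGER (-1000000000..1000000000)
}

/// Returns `(size, alignment)` of the constrained struct in bytes.
fn constrained_layout() -> (usize, usize) {
    (
        size_of::<ConstrainedStruct>(),
        align_of::<ConstrainedStruct>(),
    )
}
```

The fields hold 1 + 2 + 8 = 11 bytes of data; the compiler rounds the struct size up to a multiple of its alignment, giving 16 bytes.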

### Decode Overhead Per Field (run date: 2026-03-08)

| Type                         | 1-byte wire | 2/4-byte wire |
| ---------------------------- | ----------- | ------------- |
| `Integer` (unconstrained)    | 13.7 ns     | 14.2 ns       |
| `u8_constrained` (0..=200)   | 20.6 ns     | 20.9 ns       |
| `u16_constrained` (0..=10k)  | 21.9 ns     | 22.5 ns       |
| `i16_constrained` (±1 000)   | 21.9 ns     | 23.0 ns       |
| `i64_constrained` (±1e9)     | 20.7 ns     | 22.0 ns       |

Each constrained decode adds roughly 7 to 9 ns over raw `Integer`: one `as_i64()` call
(sign-extending the content bytes to `i64`), one `try_from()` narrowing cast for
sub-`i64` types, and one range check in `new()`. The overhead is nearly uniform across
widths, so the dominant factor is the extra function calls, not the value size or wire
width.
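
The three steps named above can be sketched for the `INTEGER (0..200)` case. This is an assumption-laden illustration rather than `synta-codegen`'s real output: `content_to_i64` stands in for `as_i64()`, and `decode_constrained_u8` is a hypothetical free function operating on already-extracted content bytes.

```rust
use std::convert::TryFrom;

/// Hypothetical newtype for `INTEGER (0..200)`.
#[derive(Debug, PartialEq)]
struct ConstrainedU8(u8);

impl ConstrainedU8 {
    /// Step 3: enforce the declared range.
    fn new(v: u8) -> Option<Self> {
        if v <= 200 {
            Some(Self(v))
        } else {
            None
        }
    }
}

/// Step 1 (stand-in for `as_i64()`): sign-extend big-endian content bytes.
fn content_to_i64(content: &[u8]) -> Option<i64> {
    if content.is_empty() || content.len() > 8 {
        return None;
    }
    let fill: u8 = if (content[0] & 0x80) != 0 { 0xff } else { 0x00 };
    let mut buf = [fill; 8];
    buf[8 - content.len()..].copy_from_slice(content);
    Some(i64::from_be_bytes(buf))
}

/// Runs all three steps on the INTEGER content bytes.
fn decode_constrained_u8(content: &[u8]) -> Option<ConstrainedU8> {
    let wide = content_to_i64(content)?; // step 1: widen to i64
    let narrow = u8::try_from(wide).ok()?; // step 2: narrowing cast
    ConstrainedU8::new(narrow) // step 3: range check
}
```

Note that step 2 alone is not enough: 201 fits in a `u8` but violates the constraint, so only the range check in `new()` rejects it.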

### Encode Overhead Per Field

| Type                   | Time    |
| ---------------------- | ------- |
| `Integer` (baseline)   | 39.5 ns |
| `u8_constrained`       | 41.9 ns |
| `u16_constrained`      | 42.9 ns |
| `i16_constrained`      | 41.7 ns |
| `i64_constrained`      | 42.6 ns |

Encode adds ~3 ns: the constrained path calls `Integer::from_i64(self.0 as i64)` to
create a temporary `Integer` before encoding, whereas raw `Integer` encodes its stored
bytes directly. Both paths avoid heap allocation because `Integer` uses a 16-byte
inline `SmallVec`.

### Three-Field Struct Decode/Encode

| Struct                               | Decode   | Encode    |
| ------------------------------------ | -------- | --------- |
| `IntegerStruct` (3×`Integer`)        | 62.5 ns  | 87.0 ns   |
| `ConstrainedStruct` (u8 + u16 + i64) | 80.3 ns  | 105.2 ns  |

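The decode overhead scales roughly linearly: the struct-level gap is 80.3 − 62.5 = 17.8 ns,
or about 6 ns per field, slightly below the per-field overhead measured in isolation. The
encode gap (105.2 − 87.0 = 18.2 ns) is also about 6 ns per field, which is larger than the
~3 ns per-field encode overhead reported above. For certificate parsing (one struct at a
time, hot cache) the extra ~18 ns is negligible; for bulk message processing where many
structs are simultaneously live, the 6× struct-size reduction materially improves cache
efficiency.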

## OctetString Encode/Decode

| Size       | Encode  | Decode  |
| ---------- | ------- | ------- |
| 16 bytes   | 31.3 ns | 19.1 ns |
| 64 bytes   | 69.3 ns | 19.5 ns |
| 256 bytes  | 73.5 ns | 22.2 ns |
| 1024 bytes | 83.0 ns | 26.9 ns |

**Decode is nearly constant-time** with respect to payload size: `OctetStringRef<'a>`
borrows a slice of the input buffer with no copy. The small growth from 16-byte to
1024-byte decode (19.1 → 26.9 ns) is from cache-line effects on the returned slice
struct, not from reading the content bytes.
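
A zero-copy decode of this kind can be sketched as below. `OctetsRef` is a hypothetical stand-in for `OctetStringRef<'a>`, and the sketch handles only short-form lengths (content under 128 bytes) to stay brief.

```rust
/// Hypothetical borrowed view, standing in for `OctetStringRef<'a>`.
#[derive(Debug, PartialEq)]
struct OctetsRef<'a>(&'a [u8]);

/// Decodes an OCTET STRING (tag 0x04) with a short-form length.
/// The content bytes are never read or copied.
fn decode_octets<'a>(input: &'a [u8]) -> Option<OctetsRef<'a>> {
    let (&tag, rest) = input.split_first()?;
    let (&len, rest) = rest.split_first()?;
    if tag != 0x04 || (len & 0x80) != 0 {
        return None;
    }
    // Borrow the content in place: the result is just a pointer and a
    // length into `input`, so the cost does not depend on `len`.
    rest.get(..len as usize).map(OctetsRef)
}
```

The returned slice points into the original buffer, which is why the borrow checker ties its lifetime to `input`.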

**Encode** grows with payload size because the encoder must copy the bytes into the output
buffer. The encode path uses `OctetStringRef` internally to avoid a redundant allocation
before copying, which accounts for the significant improvement in the 1024-byte case
compared to earlier measurements that used an owned `OctetString` (which added a heap
allocation before the copy).

## Sequence Encode/Decode

| Operation                  | Time     |
| -------------------------- | -------- |
| Encode simple (3 elements) | 87.9 ns  |
| Encode nested (2 levels)   | 140.8 ns |
| Decode simple (3 elements) | 12.8 ns  |
| Roundtrip complex sequence | 149.2 ns |

**Sequence decode is O(1)**: `Sequence` captures raw content bytes as a borrowed slice at
decode time. The 12.8 ns covers only tag+length parsing and content-slice setup — no
elements are decoded. Elements are decoded lazily on first iteration.
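
The lazy approach described above might look like the following sketch. `LazySeq` and `decode_sequence` are hypothetical names, not synta's `Sequence` API, and only short-form lengths are handled.

```rust
/// Hypothetical lazy view over SEQUENCE content bytes.
struct LazySeq<'a> {
    content: &'a [u8],
}

impl<'a> Iterator for LazySeq<'a> {
    /// `(tag, content)` of each child element.
    type Item = (u8, &'a [u8]);

    fn next(&mut self) -> Option<Self::Item> {
        let (&tag, rest) = self.content.split_first()?;
        let (&len, rest) = rest.split_first()?;
        if (len & 0x80) != 0 || rest.len() < len as usize {
            // Long form is not handled in this sketch; stop on
            // anything unsupported or truncated.
            self.content = &[];
            return None;
        }
        let (value, tail) = rest.split_at(len as usize);
        self.content = tail;
        Some((tag, value))
    }
}

/// O(1) "decode": checks the SEQUENCE header and borrows the content.
/// No child element is parsed until the iterator is advanced.
fn decode_sequence<'a>(input: &'a [u8]) -> Option<LazySeq<'a>> {
    let (&tag, rest) = input.split_first()?;
    let (&len, rest) = rest.split_first()?;
    if tag != 0x30 || (len & 0x80) != 0 {
        return None;
    }
    rest.get(..len as usize).map(|content| LazySeq { content })
}
```

A decode benchmark that only calls `decode_sequence` therefore measures header parsing alone; the per-element cost appears when the caller iterates.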

Sequence encode uses a backpatching strategy: the encoder writes a placeholder length, encodes
all child elements, then patches the length field. The nested (2-level) encode (140.8 ns)
is about 1.6× the simple encode (87.9 ns) because both the outer and inner sequences
require length-field backpatching.
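
A minimal sketch of length backpatching follows. It is one plausible way to implement the strategy, not necessarily how synta does it; `encode_sequence` is a hypothetical function, and the long-form branch uses `Vec::splice` to shift the content when the one-byte placeholder turns out to be too small.

```rust
/// Encodes a SEQUENCE into `out` using length backpatching.
/// `children` appends the already-encoded child elements.
/// Hypothetical helper for illustration only.
fn encode_sequence<F: FnOnce(&mut Vec<u8>)>(out: &mut Vec<u8>, children: F) {
    out.push(0x30); // SEQUENCE tag
    let len_pos = out.len();
    out.push(0x00); // placeholder, assuming a short-form length
    children(out);
    let content_len = out.len() - len_pos - 1;
    if content_len < 0x80 {
        // Short form: patch the placeholder in place.
        out[len_pos] = content_len as u8;
    } else {
        // Long form: the placeholder becomes the 0x80|n prefix and the
        // n length bytes are inserted before the content.
        let be = content_len.to_be_bytes();
        let skip = be.iter().take_while(|&&b| b == 0).count();
        let len_bytes = &be[skip..];
        out[len_pos] = 0x80 | len_bytes.len() as u8;
        let at = len_pos + 1;
        drop(out.splice(at..at, len_bytes.iter().copied()));
    }
}
```

Nesting is just a child closure that calls `encode_sequence` again, so each level performs its own patch after its children are written.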

The roundtrip complex sequence (149.2 ns) is dominated by its encode half: the decode half
is O(1) and adds only a few nanoseconds to the total.

## Derive Macro Overhead

| Operation | Manual   | Derived  | Overhead |
| --------- | -------- | -------- | -------- |
| Encode    | 77.3 ns  | 77.4 ns  | ~0%      |
| Decode    | 62.9 ns  | 64.1 ns  | +2%      |
| Roundtrip | 128.4 ns | 134.0 ns | +4%      |

Derive macros generate code that is indistinguishable from hand-written implementations
within Criterion's measurement noise. The +2% decode overhead and +4% roundtrip overhead
are within the confidence interval of the measurement and should not be treated as
meaningful regressions. The compiler fully inlines and specialises the generated trait
implementations.