hedl-c14n
Canonical form generation for HEDL documents: deterministic serialization for minimal token count.
Comparing HEDL documents shouldn't fail on whitespace differences. Git diffs shouldn't show spurious changes from inconsistent formatting. LLM context windows are expensive, and every token matters. Production systems need bit-for-bit identical outputs for cache hits and content-addressable storage. Cryptographic signatures require deterministic serialization.
hedl-c14n implements canonical form generation per SPEC.md Section 13.2. It transforms any valid HEDL document into a normalized form with consistent 1-space indentation, sorted keys, and count hints on matrix lists. The same document always produces identical output. The result is round-trip stable: parse(canonicalize(doc)) preserves semantic equivalence.
What's Implemented
Canonicalization features, with attention to performance and security:
- Deterministic Serialization: Bit-for-bit identical output for equivalent documents
- Count Hints: Automatic [count] annotations on matrix lists for fast parsing
- Value Normalization: Float formatting (no trailing zeros, -0 → 0), null as ~, lowercase booleans
- Key Ordering: Optional alphabetic sorting for consistent field ordering
- Quoting Strategy: Minimal quoting (only when necessary) or always-quote modes
- Schema Options: Inline schemas in matrix headers or separate %S: declarations
- Security Hardening: 1000-level depth limit prevents stack overflow
- Performance Optimizations: Pre-allocated buffers (P1), direct BTreeMap iteration (P0)
- Round-Trip Stability: Semantic equivalence preserved through parse → canonicalize → parse
Installation
[dependencies]
hedl-c14n = "2.0"
Basic Usage
Canonicalize with Defaults
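use hedl_core::parse;
use hedl_c14n::canonicalize;
// Argument forms are assumed: `input` holds the HEDL source text.
let doc = parse(input)?;
let canonical = canonicalize(&doc)?;
println!("{}", canonical);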
Output:
%V:2.0
%S:User:[id, name, email]
---
users: @User[3]
| alice, Alice Smith, alice@example.com
| bob, Bob Jones, bob@example.com
| charlie, Charlie Brown, charlie@example.com
Features Applied:
- Count hint [3] added automatically
- Consistent 1-space indentation
- Minimal quoting (only when required)
- Preserved key order
Custom Configuration
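use hedl_c14n::{canonicalize_with_config, CanonicalConfig, QuotingStrategy};
// Argument values are assumed from the comments on each option.
let config = CanonicalConfig::builder()
    .sort_keys(true)                    // Alphabetically sort fields
    .inline_schemas(true)               // Inline schemas in headers
    .quoting(QuotingStrategy::Minimal)  // Minimal quoting
    .build();
let canonical = canonicalize_with_config(&doc, &config)?;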
Count Hints
Automatically generate [count] annotations:
// Input: no count hint
users: @User
| alice, Alice
| bob, Bob
// Output: count hint added
users: @User[2]
| alice, Alice
| bob, Bob
Benefits:
- Parsers can pre-allocate memory (avoids Vec reallocation)
- Streaming parsers know total row count upfront
- Validation can detect truncated lists
- 20-30% faster parsing for large lists
Algorithm: Recursive traversal counts nodes in each matrix list before serialization.
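A minimal sketch of the idea, not the crate's actual code: the row count is read from the in-memory data before the header is written, so the hint always matches the rows that follow. The function name write_matrix_list, the string-cell row representation, and the one-space row indent are assumptions made for illustration.

```rust
/// Hypothetical helper: serializes a flat matrix list with a `[count]` hint.
/// Rows are plain string cells here; the real crate works on hedl-core types.
fn write_matrix_list(key: &str, type_name: &str, rows: &[Vec<&str>]) -> String {
    // The count comes from the data itself, never from the input text.
    let mut out = format!("{key}: @{type_name}[{}]\n", rows.len());
    for row in rows {
        out.push_str(" | ");
        out.push_str(&row.join(", "));
        out.push('\n');
    }
    out
}
```

Calling it with the two rows from the example above yields a header of users: @User[2] followed by the two rows.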
Value Normalization
All values normalized to canonical form:
Float Normalization
// No trailing zeros
3.1400 → 3.14
5.000 → 5.0
// Whole numbers as floats (preserve type)
42.0 → 42.0
100.0 → 100.0
// Negative zero normalized
-0.0 → 0.0
// Special values
NaN → null // Not preserved (becomes null)
Infinity → null
-Infinity → null
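The rules above can be sketched in a few lines of standard Rust. This is an illustration under stated assumptions, not the crate's implementation: normalize_float is a hypothetical name, and the sketch relies on the standard library's Display formatting for f64, which prints the shortest decimal that round-trips and never uses an exponent.

```rust
/// Hypothetical helper showing the float rules listed above.
fn normalize_float(value: f64) -> String {
    // NaN and both infinities are not representable, so they become null.
    if !value.is_finite() {
        return "null".to_string();
    }
    // -0.0 == 0.0 under IEEE 754 comparison, so this folds negative zero
    // into positive zero.
    let value = if value == 0.0 { 0.0 } else { value };
    // `Display` drops trailing zeros, and prints whole numbers without a
    // fractional part, so ".0" is restored to keep the float type visible.
    let text = format!("{value}");
    if text.contains('.') {
        text
    } else {
        format!("{text}.0")
    }
}
```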
Null Representation
# Canonical form uses tilde
field: ~
Boolean Lowercase
True → true
FALSE → false
Reference Format
# Qualified references
customer: @User:alice
# Local references
prev: @item1
Quoting Strategy
Two quoting modes control string serialization:
Minimal (Default)
Quote only when necessary:
# No quotes needed
name: Alice Smith
status: active
# Quotes required (contains special characters)
path: "C:\\Program Files"
note: "Hello, world" # Contains comma
value: "true" # Looks like boolean
id: "42" # Looks like integer
ref: "@alice" # Starts with @ (looks like reference)
Triggers for Quoting:
- Contains structural characters: : [ ] { } , | @
- Starts with - (looks like list marker)
- Matches boolean literal: true, false
- Matches null literal: null, ~
- Looks like number: 123, -456, 3.14
- Empty string or only whitespace
- Contains quotes or backslashes (requires escaping)
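The trigger list translates fairly directly into a predicate. The sketch below is an assumption-laden illustration: needs_quotes and looks_like_number are hypothetical names, and the number check only covers the plain integer and decimal shapes shown in the list, which may be narrower than what the crate recognizes.

```rust
/// Hypothetical check for "looks like number": optional '-', digits,
/// and an optional fractional part made of digits.
fn looks_like_number(s: &str) -> bool {
    fn all_digits(part: &str) -> bool {
        !part.is_empty() && part.bytes().all(|b| b.is_ascii_digit())
    }
    let body = s.strip_prefix('-').unwrap_or(s);
    match body.split_once('.') {
        Some((int, frac)) => all_digits(int) && all_digits(frac),
        None => all_digits(body),
    }
}

/// Hypothetical predicate mirroring the quoting triggers listed above.
fn needs_quotes(s: &str) -> bool {
    const STRUCTURAL: &[char] = &[':', '[', ']', '{', '}', ',', '|', '@'];
    s.trim().is_empty()                                   // empty or whitespace only
        || s.contains(STRUCTURAL)                         // structural characters
        || s.starts_with('-')                             // looks like list marker
        || s.contains('"') || s.contains('\\')            // needs escaping
        || matches!(s, "true" | "false" | "null" | "~")   // boolean and null literals
        || looks_like_number(s)
}
```

In minimal mode a serializer would call such a predicate once per string value and emit quotes only when it returns true.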
Always
Quote all strings unconditionally:
name: "Alice Smith"
status: "active"
age: 30 # Numbers never quoted
active: true # Booleans never quoted
Use When: Maximum compatibility with naive parsers, explicit type marking
Key Ordering
Control field order with sort_keys:
Preserve Order (sort_keys=false, default)
config:
name: MyApp
version: 1.0
author: Alice
Preserves: Original insertion order from source document
Alphabetic Sort (sort_keys=true)
config:
author: Alice
name: MyApp
version: 1.0
Benefits:
- Consistent field ordering across documents
- Easier visual diffing
- Deterministic regardless of original order
- Better for git diffs
Note: Entity IDs always appear first in matrix rows regardless of sort_keys.
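As a rough sketch of what the option controls (the function name order_fields and the tuple representation of fields are assumptions, not the crate's API), the two modes differ only in whether a stable sort by key is applied before writing:

```rust
/// Hypothetical helper: returns fields in the order they would be written.
fn order_fields<'a>(
    fields: &[(&'a str, &'a str)],
    sort_keys: bool,
) -> Vec<(&'a str, &'a str)> {
    let mut ordered = fields.to_vec();
    if sort_keys {
        // Stable sort by key; values never influence the order.
        ordered.sort_by_key(|&(key, _)| key);
    }
    ordered
}
```

With the config example above, sort_keys=true yields author, name, version, while sort_keys=false returns the fields unchanged.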
Schema Handling
Two modes for schema representation:
Separate %STRUCT (inline_schemas=false, default)
%V:2.0
%S:User:[id, name, email]
---
users: @User[2]
| alice, Alice, alice@example.com
| bob, Bob, bob@example.com
Advantages:
- Schema defined once, reused multiple times
- Type system remains explicit
- Easier schema updates (single location)
Inline Schemas (inline_schemas=true)
%V:2.0
---
users: @User[id, name, email][2]
| alice, Alice, alice@example.com
| bob, Bob, bob@example.com
Advantages:
- Self-contained matrix lists
- No forward references
- Easier to extract individual lists
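The difference between the two modes comes down to where the field list is written. The following sketch reproduces the two header shapes shown above; struct_decl and list_header are hypothetical names used only for illustration.

```rust
/// Hypothetical helper: a separate schema declaration for the header section.
fn struct_decl(type_name: &str, fields: &[&str]) -> String {
    format!("%S:{type_name}:[{}]", fields.join(", "))
}

/// Hypothetical helper: the matrix list header line, with the field list
/// inlined only when `inline_schemas` is true.
fn list_header(
    key: &str,
    type_name: &str,
    fields: &[&str],
    count: usize,
    inline_schemas: bool,
) -> String {
    if inline_schemas {
        format!("{key}: @{type_name}[{}][{count}]", fields.join(", "))
    } else {
        format!("{key}: @{type_name}[{count}]")
    }
}
```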
Configuration Reference
CanonicalConfig Builder
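use hedl_c14n::{CanonicalConfig, QuotingStrategy};
// Argument values shown are the documented defaults.
let config = CanonicalConfig::builder()
    .sort_keys(true)                    // Alphabetic sorting (default: true)
    .inline_schemas(false)              // Inline vs %S: (default: false)
    .quoting(QuotingStrategy::Minimal)  // Quoting mode (default: Minimal)
    .build();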
Configuration Options
sort_keys (default: true)
- Alphabetically sort object fields
- Deterministic ordering regardless of insertion order
- Improves git diff readability
inline_schemas (default: false)
- true: Inline schemas in matrix headers, @Type[field1, field2]
- false: Separate %S: declarations in header
- Trade-off: Self-contained vs reusable schemas
quoting (default: Minimal)
- QuotingStrategy::Minimal - Quote only when necessary
- QuotingStrategy::Always - Quote all strings
- Minimal recommended for token efficiency
Security: Depth Limits
Protection against deeply nested structures:
const MAX_NESTING_DEPTH: usize = 1000;
// Attempting to canonicalize > 1000 levels deep:
// Error: HedlError::Syntax { line: ..., message: "Max depth exceeded: 1001 levels (max: 1000)" }
Prevents:
- Stack overflow from malicious input
- Infinite recursion bugs
- Accidental runaway nesting
Implementation: Depth counter incremented on each recursive call, decremented on return.
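A depth guard of this kind can be sketched as follows. The Node type and the count_nodes and chain functions are hypothetical, and the error is a plain String rather than HedlError. Passing depth + 1 by value to each recursive call has the same effect as incrementing a counter on entry and decrementing it on return.

```rust
const MAX_NESTING_DEPTH: usize = 1000;

/// Hypothetical tree node standing in for a nested HEDL structure.
struct Node {
    children: Vec<Node>,
}

/// Walks the tree, failing as soon as a node sits deeper than the limit.
/// The root is at depth 1.
fn count_nodes(node: &Node, depth: usize) -> Result<usize, String> {
    if depth > MAX_NESTING_DEPTH {
        return Err(format!(
            "Max depth exceeded: {} levels (max: {})",
            depth, MAX_NESTING_DEPTH
        ));
    }
    let mut total = 1;
    for child in &node.children {
        total += count_nodes(child, depth + 1)?;
    }
    Ok(total)
}

/// Builds a single chain of `levels` nested nodes (test helper).
fn chain(levels: usize) -> Node {
    let mut node = Node { children: Vec::new() };
    for _ in 1..levels {
        node = Node { children: vec![node] };
    }
    node
}
```

Because the check happens before descending, the recursion never goes more than one frame past the limit, regardless of how deep the input is.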
Error Handling
Canonicalization uses HedlError from hedl-core:
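use hedl_c14n::canonicalize;
use hedl_core::HedlError;
// The match arms are illustrative; `err` is a HedlError, assumed to implement Display.
match canonicalize(&doc) {
    Ok(canonical) => println!("{}", canonical),
    Err(err) => eprintln!("Canonicalization failed: {}", err),
}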
Error Conditions
- Nesting too deep: Document exceeds 1000-level nesting limit
- Write failures: Internal buffer errors (extremely rare)
Errors include line numbers and context for debugging.
Round-Trip Stability
Canonical form preserves semantic equivalence:
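use hedl_core::parse;
use hedl_c14n::canonicalize;
// Argument forms are assumed: `input` holds the HEDL source text.
let original = parse(input)?;
let canonical_str = canonicalize(&original)?;
let reparsed = parse(&canonical_str)?;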
// Semantic equivalence holds
assert_eq!;
assert_eq!;
assert_eq!;
Guarantees:
- All fields preserved
- All entities preserved
- References maintained
- Type information intact
- Semantic equality (not string equality)
Non-Preserved:
- Whitespace and formatting
- Comment positions (comments stripped)
- Original key order (if sort_keys=true)
- Exact float representation (normalized)
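The distinction between semantic and string equality can be shown with a toy format. The sketch below does not use HEDL or the crate's API: parse_kv and canonicalize_kv are hypothetical functions over simple key: value lines, chosen only to demonstrate the round-trip property and the idempotence of canonicalization.

```rust
use std::collections::BTreeMap;

/// Toy parser: reads `key: value` lines, ignoring surrounding whitespace.
fn parse_kv(text: &str) -> BTreeMap<String, String> {
    text.lines()
        .filter_map(|line| line.split_once(':'))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .collect()
}

/// Toy canonical form: one `key: value` line per entry, keys in sorted order
/// (BTreeMap iterates in key order).
fn canonicalize_kv(doc: &BTreeMap<String, String>) -> String {
    doc.iter().map(|(key, value)| format!("{key}: {value}\n")).collect()
}
```

Two inputs that differ only in spacing or line order parse to equal maps and therefore serialize to the same bytes, while the canonical text generally differs from either input string.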
Use Cases
Version Control Normalization: Canonicalize HEDL files before git commit to eliminate spurious formatting diffs. Enable clean git history focused on semantic changes.
LLM Context Optimization: Minimize token count through compact syntax. Fit more data in context windows without losing information.
Content-Addressable Storage: Generate deterministic hashes for identical documents regardless of source formatting. Enable deduplication and cache hits.
Cryptographic Signatures: Sign canonical form to ensure signatures verify regardless of whitespace/formatting variations. Ideal for document integrity verification.
Database Exports: Normalize exported HEDL for consistent baselines in testing and CI/CD. Detect actual data changes, not formatting noise.
Configuration Management: Standardize config file formatting across teams and tools. Automated formatting on save, consistent style enforcement.
What This Crate Doesn't Do
Validation: Canonicalization assumes input is valid. For validation, use hedl-lint before canonicalization.
Comment Preservation: Comments are not part of canonical form and are stripped. For comment-preserving formatting, use hedl-core's pretty-printer.
Custom Formatting Rules: Configuration is comprehensive but not infinite. Highly custom formatting requirements may need custom serialization.
Schema Inference: Uses existing schemas from document. For schema generation from data, use hedl-core's inference APIs.
Performance Characteristics
Time Complexity: O(n) where n = total nodes + fields. Single linear pass through document tree.
Space Complexity: O(n) output buffer + O(d) recursion stack where d = nesting depth. Pre-allocation optimization (P1) amortizes allocations.
Optimizations Implemented:
- P0: Direct BTreeMap Iteration - Iterate without intermediate Vec allocation (eliminates O(n) allocation)
- P1: Pre-allocated Buffers - Estimate output size, allocate once (reduces allocation count by 90%)
Count Hint Generation: O(n) single pass to count nodes. Cached during serialization (no double-traversal).
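The pre-allocation idea (P1) can be illustrated with the standard library alone. This is a sketch under assumptions, not the crate's estimator: estimate_size and serialize_rows are hypothetical names, and the estimate is a deliberately simple upper bound for flat rows of string cells.

```rust
/// Upper bound on the serialized size of the rows, in bytes.
fn estimate_size(rows: &[Vec<&str>]) -> usize {
    rows.iter()
        // 4 bytes cover the " | " prefix and the newline; each cell gets
        // 2 extra bytes for a possible ", " separator.
        .map(|row| 4 + row.iter().map(|cell| cell.len() + 2).sum::<usize>())
        .sum()
}

/// Serializes rows into a buffer sized once up front, so the pushes below
/// never trigger a reallocation.
fn serialize_rows(rows: &[Vec<&str>]) -> String {
    let mut out = String::with_capacity(estimate_size(rows));
    for row in rows {
        out.push_str(" | ");
        for (index, cell) in row.iter().enumerate() {
            if index > 0 {
                out.push_str(", ");
            }
            out.push_str(cell);
        }
        out.push('\n');
    }
    out
}
```

Over-estimating slightly is the safer trade-off here: a few unused bytes cost less than a reallocation and copy of the whole buffer.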
Dependencies
- hedl-core 2.0 - Core HEDL data structures and parsing
- thiserror 1.0 - Error type definitions
License
Apache-2.0