1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
//! Partition utilities for Delta table writes.
//!
//! A partition value goes through multiple transformations before it reaches the Delta
//! log. Each step is handled by a separate submodule. The differences between steps are
//! subtle and easy to get wrong, so this documentation is thorough with examples observed
//! from real Spark-written Delta tables.
//!
//! # Steps
//!
//! ```text
//! Scalar value (e.g., String("US/East"))
//! |
//! | Step 1: validation --> (type/key checks against table schema)
//! | Step 2: serialization --> "US/East" (add.partitionValues)
//! | Step 3: path encoding --> "US%2FEast" (filesystem directory name)
//! | Step 4: URI encoding --> "US%252FEast" (add.path in the Delta log)
//! v
//! AddFile { partitionValues: {"region": "US/East"}, path: "region=US%252FEast/..." }
//! ```
//!
//! - **Step 1: Validation** (`validation` module): checks partition column key completeness and
//! value types against the table schema. Runs first to reject bad input before doing any
//! serialization work.
//!
//! - **Step 2: Serialization** (`serialization` module): converts validated typed `Scalar` values
//! to protocol-compliant strings for `add.partitionValues`. Null, empty string, and empty binary
//! all serialize as JSON `null`.
//!
//! - **Step 3: Path encoding** (`hive` module): percent-encodes serialized strings for Hive-style
//! filesystem directory names. Only a small set of ASCII characters is encoded; spaces and
//! non-ASCII pass through raw. Hive-style paths are optional; the Delta protocol does not require
//! them. Kernel uses them for partitioned tables without column mapping enabled. Happens lazily
//! in the write path.
//!
//! - **Step 4: URI encoding**: an additional layer of URI encoding per RFC 2396 applied when
//! constructing `add.path` in the Delta log. More aggressive than step 3: characters like space
//! (0x20), `<`, `>`, `{`, `}`, and `|` that survive Hive encoding get percent-encoded. Characters
//! already percent-encoded by step 3 (e.g., `%2F`) get double-encoded (e.g., `%252F`).
//!
//! # Encoding by type
//!
//! Shows how each data type serializes across all three forms. Numeric and date types are
//! straightforward (no special characters to encode), but are included here for
//! completeness since the string representation in `partitionValues` must match exactly.
//!
//! ```text
//! Row | Type | Input Value | Filesystem Dir | add.path | partitionValues.p
//! ----+----------------+------------------------------+-------------------------------------------+--------------------+----------------------------
//! 1 | INT | 0 | p=0 | same | "0"
//! 2 | | -1 | p=-1 | same | "-1"
//! 3 | | 2147483647 (max) | p=2147483647 | same | "2147483647"
//! 4 | | -2147483648 (min) | p=-2147483648 | same | "-2147483648"
//! 5 | | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 6 | BIGINT | 9223372036854775807 (max) | p=9223372036854775807 | same | "9223372036854775807"
//! 7 | | -9223372036854775808 (min) | p=-9223372036854775808 | same | "-9223372036854775808"
//! 8 | | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 9 | TINYINT | 127 (max) | p=127 | same | "127"
//! 10 | | -128 (min) | p=-128 | same | "-128"
//! 11 | | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 12 | SMALLINT | 32767 (max) | p=32767 | same | "32767"
//! 13 | | -32768 (min) | p=-32768 | same | "-32768"
//! 14 | | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 15 | DOUBLE | 0.0 | p=0.0 | same | "0.0"
//! 16 | | -0.0 | p=-0.0* | same | "-0.0"*
//! 17 | | 1.7976931348623157E308 (max) | p=1.7976931348623157E308 | same | "1.7976931348623157E308"
//! 18 | | 5E-324 (min subnormal) | p=5.0E-324* | same | "5.0E-324"*
//! 19 | | NaN | p=NaN | same | "NaN"
//! 20 | | Infinity | p=Infinity | same | "Infinity"
//! 21 | | -Infinity | p=-Infinity | same | "-Infinity"
//! 22 | | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 23 | FLOAT | 0.0 | p=0.0 | same | "0.0"
//! 24 | | NaN | p=NaN | same | "NaN"
//! 25 | | Infinity | p=Infinity | same | "Infinity"
//! 26 | | -Infinity | p=-Infinity | same | "-Infinity"
//! 27 | | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 28 | BOOLEAN | true | p=true | same | "true"
//! 29 | | false | p=false | same | "false"
//! 30 | | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 31 | DECIMAL(38,18) | 0 | p=0.000000000000000000 | same | "0.000000000000000000"
//! 32 | | 1.23 | p=1.230000000000000000 | same | "1.230000000000000000"
//! 33 | | -1.23 | p=-1.230000000000000000 | same | "-1.230000000000000000"
//! 34 | | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 35 | DATE | 2024-01-01 | p=2024-01-01 | same | "2024-01-01"
//! 36 | | 1970-01-01 (epoch) | p=1970-01-01 | same | "1970-01-01"
//! 37 | | 0001-01-01 | p=0001-01-01 | same | "0001-01-01"
//! 38 | | 9999-12-31 | p=9999-12-31 | same | "9999-12-31"
//! 39 | | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 40 | TIMESTAMP | 2024-06-15 12:30:45 | p=2024-06-15 12%3A30%3A45 | p=...%2012%253A... | "2024-06-15T19:30:45.000000Z"
//! 41 | | 1970-01-01 00:00:00 | p=1970-01-01 00%3A00%3A00 | p=...%2000%253A... | "1970-01-01T08:00:00.000000Z"
//! 42 | | 2024-06-15 23:59:59.999999 | p=2024-06-15 23%3A59%3A59.999999 | p=...%2023%253A... | "2024-06-16T06:59:59.999999Z"
//! 43 | | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 44 | TIMESTAMP_NTZ | 2024-06-15 12:30:45 | p=2024-06-15 12%3A30%3A45 | p=...%2012%253A... | "2024-06-15 12:30:45.000000"*
//! 45 | | 1970-01-01 00:00:00 | p=1970-01-01 00%3A00%3A00 | p=...%2000%253A... | "1970-01-01 00:00:00.000000"*
//! 46 | | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! ```
//!
//! Notable behaviors:
//! - DECIMAL always includes the full scale (18 trailing digits for `DECIMAL(38,18)`).
//! - TIMESTAMP `partitionValues` are UTC ISO-8601 with microsecond precision. The filesystem dir
//! and `add.path` use local time without timezone.
//! - TIMESTAMP_NTZ `partitionValues` omit the timezone and use space-separated format.
//!
//! `*` = kernel diverges from Delta-Spark:
//! - Row 16: Delta-Spark normalizes `-0.0` to `"0.0"` at the SQL layer. Kernel preserves the sign,
//! producing `"-0.0"` (matching Java's `Double.toString(-0.0)`).
//! - Row 18: Delta-Spark formats the smallest subnormal as `"4.9E-324"`. Rust's shortest-
//! representation algorithm produces `"5.0E-324"`. Both parse to the same `f64`.
//! - Rows 44-45: Delta-Spark omits `.000000` when sub-seconds are zero. Kernel always includes
//! microsecond precision. Both formats round-trip through `parse_scalar`.
//!
//! # String and binary encoding
//!
//! String values are the interesting case because they can contain characters that
//! require encoding. Binary values are interpreted as raw bytes and treated as strings
//! for path construction.
//!
//! ```text
//! Row | Input Value | Filesystem Dir | add.path | partitionValues.p
//! ----+--------------------------+--------------------------------+------------------------------+------------------
//! 47 | chr(0) (NUL) | WRITE FAILS* | N/A | N/A
//! 48 | before\0after | WRITE FAILS* | N/A | N/A
//! 49 | X'' (empty binary) | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 50 | X'DEADBEEF' | p=<raw high bytes> | same | <invalid UTF-8>**
//! 51 | X'48454C4C4F' ("HELLO") | p=HELLO | same | "HELLO"
//! 52 | X'00FF' (has NUL) | WRITE FAILS* | N/A | N/A
//! 53 | X'2F3D25' (/=%) | p=%2F%3D%25 | p=%252F%253D%2525 | "/=%"
//! 54 | a{b | p=a%7Bb | p=a%257Bb | "a{b"
//! 55 | a}b | p=a}b | p=a%7Db | "a}b"
//! 56 | hello world | p=hello world | p=hello%20world | "hello world"
//! 57 | M\u{00FC}nchen | p=M\u{00FC}nchen | same | "M\u{00FC}nchen"
//! 58 | 日本語 | p=日本語 | same | "日本語"
//! 59 | 🎵🎶 | p=🎵🎶 | same | "🎵🎶"
//! 60 | a<b>c|d | p=a<b>c|d | p=a%3Cb%3Ec%7Cd | "a<b>c|d"
//! 61 | a@b!c(d) | p=a@b!c(d) | same | "a@b!c(d)"
//! 62 | a&b+c$d;e,f | p=a&b+c$d;e,f | same | "a&b+c$d;e,f"
//! 63 | Serbia/srb% | p=Serbia%2Fsrb%25 | p=Serbia%252Fsrb%2525 | "Serbia/srb%"
//! 64 | 100%25 (literal) | p=100%2525 | p=100%252525 | "100%25"
//! 65 | '' (empty string) | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 66 | NULL | p=__HIVE_DEFAULT_PARTITION__ | same | null
//! 67 | ' ' (1 space) | p= (trailing space) | p=%20 | " "
//! 68 | ' ' (2 spaces) | p= (trailing spaces) | p=%20%20 | " "
//! ```
//!
//! *WRITE FAILS: Delta-Spark fails at the filesystem `mkdirs` stage when the
//! Hive-encoded directory name contains NUL bytes. The encoding itself succeeds
//! (NUL encodes to `%00`), but the filesystem rejects the path.
//!
//! Key observations:
//!
//! - `add.path` applies an additional layer of URI encoding on top of the Hive-encoded directory
//! name. Characters that Hive leaves raw (space, `<`, `>`, `}`, `|`) get percent-encoded in
//! `add.path`. Characters that Hive already encoded (`/` -> `%2F`) get double-encoded in
//! `add.path` (`%2F` -> `%252F`).
//! - `partitionValues` stores the raw value with no encoding. Null and empty string are both
//! represented as JSON `null`.
//! - NUL bytes cause write failures at the filesystem level (`mkdirs` error).
//! - Non-empty binary values are interpreted as raw bytes. If the bytes happen to be valid UTF-8,
//! they work (e.g., `X'48454C4C4F'` = "HELLO").
//!
//! **Spark bug: non-UTF-8 binary (e.g., `X'DEADBEEF'`) passes raw bytes through to the
//! filesystem path and `partitionValues` JSON, producing invalid UTF-8 in both. The JSON
//! becomes unparseable. In kernel, `escape_partition_value` takes `&str`, so Rust's type
//! system prevents non-UTF-8 bytes from reaching the encoding layer.
pub
pub