rds2rust 0.1.21

A pure Rust library for reading and writing R's RDS (R Data Serialization) files without requiring an R runtime.
# RDS File Format Documentation

This document describes the RDS (R Data Serialization) file format used by R's `saveRDS()` and `readRDS()` functions.

## File Structure

An RDS file consists of:
1. Header (14 bytes in format version 2; version 3 appends a native-encoding field)
2. Serialized R object

## Header Format

The header contains:

```
Byte 0-1:   Magic bytes (format identifier)
            'X\n' (0x58 0x0a) - XDR format (big-endian)
            'A\n' (0x41 0x0a) - ASCII format
            'B\n' (0x42 0x0a) - Binary format (native byte order, usually little-endian)

Byte 2-5:   Format version (32-bit big-endian integer)
            Version 2 (0x00000002) was the default before R 3.6.0
            Version 3 (R 3.5.0+, default since R 3.6.0) adds ALTREP support
            and records the native encoding after the header

Byte 6-9:   R version that wrote the file (32-bit big-endian integer)
            Encoded as: (major << 16) | (minor << 8) | patch
            Example: R 3.5.0 = 0x00030500

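Byte 10-13: Minimum R version needed to read (32-bit big-endian integer)
            Same encoding as above (2.3.0 for version 2, 3.5.0 for version 3)

Byte 14+:   Version 3 only: native encoding of the writing session
            4-byte length, then that many bytes (e.g. "UTF-8")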
```
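
As a rough illustration of the layout above, the fixed 14-byte part of an XDR header could be decoded as follows. This is a minimal sketch, not the crate's API: the `Header` struct and the `parse_header` and `unpack_version` names are hypothetical, and only the XDR magic is accepted.

```rust
/// Fixed-size part of an XDR header (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub struct Header {
    pub format_version: u32,
    pub writer_version: (u8, u8, u8),
    pub min_reader_version: (u8, u8, u8),
}

/// Read a big-endian u32 starting at `offset`, if enough bytes remain.
fn be_u32(bytes: &[u8], offset: usize) -> Option<u32> {
    let b = bytes.get(offset..offset + 4)?;
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Split a packed R version into (major, minor, patch).
fn unpack_version(v: u32) -> (u8, u8, u8) {
    ((v >> 16) as u8, (v >> 8) as u8, v as u8)
}

/// Parse bytes 0-13 of an uncompressed XDR stream.
/// Returns `None` for short input or any magic other than 'X\n'.
pub fn parse_header(bytes: &[u8]) -> Option<Header> {
    if !bytes.starts_with(b"X\n") {
        return None;
    }
    Some(Header {
        format_version: be_u32(bytes, 2)?,
        writer_version: unpack_version(be_u32(bytes, 6)?),
        min_reader_version: unpack_version(be_u32(bytes, 10)?),
    })
}
```

A reader for format version 3 would continue from byte 14 to pick up the native-encoding field before the first object.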

## SEXP Types

R objects are called SEXPs (S-EXPressions). Each object has a type tag:

```
Type        Value   Description
----        -----   -----------
NILSXP      0       NULL
SYMSXP      1       Symbol
LISTSXP     2       Pairlist
CLOSXP      3       Closure (function)
ENVSXP      4       Environment
PROMSXP     5       Promise
LANGSXP     6       Language construct
SPECIALSXP  7       Special function
BUILTINSXP  8       Built-in function
CHARSXP     9       Internal character string
LGLSXP      10      Logical vector
INTSXP      13      Integer vector
REALSXP     14      Real (double) vector
CPLXSXP     15      Complex vector
STRSXP      16      Character vector
DOTSXP      17      Dot-dot-dot object
ANYSXP      18      Any type (for matching)
VECSXP      19      List (generic vector)
EXPRSXP     20      Expression vector (vector of language objects)
BCODESXP    21      Byte code
EXTPTRSXP   22      External pointer
WEAKREFSXP  23      Weak reference
RAWSXP      24      Raw vector
S4SXP       25      S4 object
```

## Object Format

Each object in the stream has:

```
Byte 0-3:   Flags and type (32-bit big-endian integer)
            Bits 0-7:   SEXP type (see table above)
            Bit 8:      Is-object flag (object has a class)
            Bit 9:      Has-attributes flag
            Bit 10:     Has-tag flag (for pairlists)
            Bit 11:     Unused
            Bits 12-27: Levels (the object's general-purpose "gp" field)
            Bits 28-31: Unused
```

### Flag Bits Detail

```
IS_OBJECT_BIT_MASK  (1 << 8)   = 0x00000100 - Object has S3/S4 class
HAS_ATTR_BIT_MASK   (1 << 9)   = 0x00000200 - Object has attributes
HAS_TAG_BIT_MASK    (1 << 10)  = 0x00000400 - Pairlist node has tag
```
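
The masks above translate directly into a small decoder for the flags word. The sketch below is illustrative only: the `Flags` struct and `unpack_flags` function are hypothetical names, and the position of the levels field (shifted left by 12) is taken from R's `src/main/serialize.c`.

```rust
pub const IS_OBJECT_BIT_MASK: u32 = 1 << 8;
pub const HAS_ATTR_BIT_MASK: u32 = 1 << 9;
pub const HAS_TAG_BIT_MASK: u32 = 1 << 10;

/// Decoded view of a 32-bit flags word (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct Flags {
    pub sexp_type: u8,
    pub is_object: bool,
    pub has_attr: bool,
    pub has_tag: bool,
    /// The gp/levels field, stored from bit 12 upward.
    pub levels: u16,
}

/// Split a flags word that has already been read as a big-endian u32.
pub fn unpack_flags(word: u32) -> Flags {
    Flags {
        sexp_type: (word & 0xFF) as u8,
        is_object: (word & IS_OBJECT_BIT_MASK) != 0,
        has_attr: (word & HAS_ATTR_BIT_MASK) != 0,
        has_tag: (word & HAS_TAG_BIT_MASK) != 0,
        levels: (word >> 12) as u16,
    }
}
```

For example, the word `0x0000030d` decodes to an INTSXP (13) with both the object and attribute bits set, which is what a factor or other classed integer vector would typically carry.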

### Reference Tracking

Format version 2 introduced reference tracking to handle shared/recursive structures:

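- Objects with reference semantics (symbols, environments, external pointers, weak references) are added to a reference table the first time they appear in the stream
- Later occurrences are written as a REFSXP pseudo-type carrying a 1-based index into that table
- REFSXP = 0xFF (255); the index is normally packed into the upper 24 bits of the flags word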
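
One way to model the reference table while parsing is sketched below. The `RefTable` type and `packed_ref_index` function are hypothetical names; the 1-based indexing and the packing of the index into the bits above the type byte follow R's `src/main/serialize.c`, where an index of 0 in the flags word means the real index follows as a separate integer.

```rust
pub const REFSXP: u32 = 0xFF;

/// Table of previously seen objects, addressed by 1-based index (hypothetical type).
pub struct RefTable<T> {
    entries: Vec<T>,
}

impl<T> RefTable<T> {
    pub fn new() -> Self {
        RefTable { entries: Vec::new() }
    }

    /// Register an object and return the index later REFSXPs will use.
    pub fn add(&mut self, obj: T) -> usize {
        self.entries.push(obj);
        self.entries.len()
    }

    /// Look up a 1-based index; 0 and out-of-range indices yield `None`.
    pub fn get(&self, index: usize) -> Option<&T> {
        index.checked_sub(1).and_then(|i| self.entries.get(i))
    }
}

/// Extract the index packed into a REFSXP flags word.
/// Returns `None` if the word is not a REFSXP, or if the packed index is 0
/// (in which case the index is stored as the next 32-bit integer instead).
pub fn packed_ref_index(flags: u32) -> Option<usize> {
    if (flags & 0xFF) != REFSXP {
        return None;
    }
    match flags >> 8 {
        0 => None,
        i => Some(i as usize),
    }
}
```

A parser would call `add` each time it finishes reading a symbol or environment, and `get` whenever it meets a REFSXP.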

## Vector Formats

### Integer Vector (INTSXP)

```
4 bytes:  Length (N) as big-endian int32
N*4 bytes: N integers as big-endian int32
```

Special value: `NA_integer_` = 0x80000000 (INT32_MIN)
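
A minimal sketch of reading this layout, mapping `NA_integer_` to `None`. The function names are hypothetical, the flags word is assumed to have been consumed already, and negative lengths are simply rejected.

```rust
pub const NA_INTEGER: i32 = i32::MIN;

/// Read one big-endian i32 and return it with the remaining bytes.
fn take_i32(data: &[u8]) -> Option<(i32, &[u8])> {
    if data.len() < 4 {
        return None;
    }
    let (head, rest) = data.split_at(4);
    Some((i32::from_be_bytes([head[0], head[1], head[2], head[3]]), rest))
}

/// Read the length and payload of an INTSXP.
/// Returns the values (with NA as `None`) and the unread remainder.
pub fn read_int_vector(data: &[u8]) -> Option<(Vec<Option<i32>>, &[u8])> {
    let (len, mut rest) = take_i32(data)?;
    if len < 0 {
        return None; // negative lengths are not handled in this sketch
    }
    let len = len as usize;
    if rest.len() / 4 < len {
        return None; // truncated input
    }
    let mut out = Vec::with_capacity(len);
    for _ in 0..len {
        let (value, tail) = take_i32(rest)?;
        out.push(if value == NA_INTEGER { None } else { Some(value) });
        rest = tail;
    }
    Some((out, rest))
}
```

Using `Option<i32>` keeps NA distinct from every valid integer, at the cost of a larger in-memory representation than R's sentinel value.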

### Real Vector (REALSXP)

```
4 bytes:    Length (N) as big-endian int32
N*8 bytes:  N doubles as big-endian IEEE 754 double
```

Special values:
- `NA_real_`: Specific NaN bit pattern (0x7FF00000000007A2)
- `Inf`: IEEE 754 infinity
- `-Inf`: IEEE 754 negative infinity
- `NaN`: IEEE 754 NaN (various bit patterns)
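
Because `NA_real_` is just one NaN payload among many, a reader has to inspect the bits to tell it apart from an ordinary `NaN`. The sketch below uses hypothetical function names and tests only the low 32 bits (1954, or 0x7A2), which is the check R itself applies; that also recognizes the value if the NaN has been quieted along the way.

```rust
/// Bit pattern R uses for NA_real_.
pub const NA_REAL_BITS: u64 = 0x7FF0_0000_0000_07A2;

/// Build an NA_real_ value.
pub fn na_real() -> f64 {
    f64::from_bits(NA_REAL_BITS)
}

/// True if `x` is R's NA_real_ rather than an ordinary NaN.
pub fn is_na_real(x: f64) -> bool {
    x.is_nan() && (x.to_bits() & 0xFFFF_FFFF) == 1954
}

/// Decode one XDR double (8 big-endian bytes).
pub fn read_f64_be(bytes: [u8; 8]) -> f64 {
    f64::from_be_bytes(bytes)
}
```

Round-tripping through `to_bits`/`from_bits` or `to_be_bytes`/`from_be_bytes`, rather than through arithmetic, is the simplest way to keep the payload intact when writing values back out.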

### Logical Vector (LGLSXP)

```
4 bytes:  Length (N) as big-endian int32
N*4 bytes: N logicals as big-endian int32
           0 = FALSE
           1 = TRUE
           INT32_MIN (0x80000000) = NA
```

### Character Vector (STRSXP)

```
4 bytes:  Length (N) as big-endian int32
N*X bytes: N CHARSXP objects (see below)
```

### CHARSXP (Internal String)

```
4 bytes:  Length (L) in bytes as big-endian int32
          -1 (0xFFFFFFFF) = NA_character_
L bytes:  String data (no null terminator), in the encoding flagged for this CHARSXP
```

Special handling:
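- The length counts only the string bytes; there is no separate encoding byte
- The encoding (UTF-8, Latin-1, bytes, ASCII, or native) is recorded in the levels bits of the CHARSXP's flags word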
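
The following sketch reads a CHARSXP body and derives its encoding from the flags word. All names here are hypothetical; the mask values are the gp-field bits R defines for CHARSXPs (`BYTES_MASK`, `LATIN1_MASK`, `UTF8_MASK`, `ASCII_MASK`), and the levels field is assumed to start at bit 12 of the flags word as in R's `src/main/serialize.c`.

```rust
const BYTES_MASK: u32 = 1 << 1;
const LATIN1_MASK: u32 = 1 << 2;
const UTF8_MASK: u32 = 1 << 3;
const ASCII_MASK: u32 = 1 << 6;

/// Encoding marked on a CHARSXP (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum CharEncoding {
    Native,
    Bytes,
    Latin1,
    Utf8,
    Ascii,
}

/// Derive the encoding from a CHARSXP flags word.
pub fn char_encoding(flags: u32) -> CharEncoding {
    let levels = flags >> 12;
    if (levels & UTF8_MASK) != 0 {
        CharEncoding::Utf8
    } else if (levels & LATIN1_MASK) != 0 {
        CharEncoding::Latin1
    } else if (levels & BYTES_MASK) != 0 {
        CharEncoding::Bytes
    } else if (levels & ASCII_MASK) != 0 {
        CharEncoding::Ascii
    } else {
        CharEncoding::Native
    }
}

/// Read the length and bytes of a CHARSXP whose flags word was already consumed.
/// The first element is `None` for NA_character_; the second is the unread remainder.
pub fn read_charsxp_body(data: &[u8]) -> Option<(Option<&[u8]>, &[u8])> {
    if data.len() < 4 {
        return None;
    }
    let (head, rest) = data.split_at(4);
    let len = i32::from_be_bytes([head[0], head[1], head[2], head[3]]);
    if len == -1 {
        return Some((None, rest)); // NA_character_
    }
    if len < 0 || rest.len() < len as usize {
        return None;
    }
    let (text, tail) = rest.split_at(len as usize);
    Some((Some(text), tail))
}
```

Returning the raw bytes and the encoding separately leaves the decision about how to convert Latin-1 or native strings to the caller.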

### Raw Vector (RAWSXP)

```
4 bytes:  Length (N) as big-endian int32
N bytes:  N raw bytes
```

### Complex Vector (CPLXSXP)

```
4 bytes:     Length (N) as big-endian int32
N*16 bytes:  N complex numbers
             Each: 8 bytes real (double) + 8 bytes imaginary (double)
             Both big-endian
```
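
The interleaved real/imaginary layout can be read in 16-byte chunks. This is a sketch with a hypothetical function name; it assumes the flags word has already been consumed and represents each value as a plain `(re, im)` tuple.

```rust
/// Read the length and payload of a CPLXSXP as (real, imaginary) pairs.
pub fn read_complex_vector(data: &[u8]) -> Option<Vec<(f64, f64)>> {
    if data.len() < 4 {
        return None;
    }
    let (head, body) = data.split_at(4);
    let len = i32::from_be_bytes([head[0], head[1], head[2], head[3]]);
    if len < 0 {
        return None; // negative lengths are not handled in this sketch
    }
    let len = len as usize;
    if body.len() / 16 < len {
        return None; // truncated input
    }
    let mut out = Vec::with_capacity(len);
    for chunk in body.chunks_exact(16).take(len) {
        let mut re = [0u8; 8];
        let mut im = [0u8; 8];
        re.copy_from_slice(&chunk[..8]);
        im.copy_from_slice(&chunk[8..]);
        out.push((f64::from_be_bytes(re), f64::from_be_bytes(im)));
    }
    Some(out)
}
```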

### List (VECSXP)

```
4 bytes:   Length (N) as big-endian int32
N*X bytes: N serialized R objects (recursive)
```

### Expression Vector (EXPRSXP)

```
4 bytes:   Length (N) as big-endian int32
N*X bytes: N serialized R objects (recursive)
```

Note: EXPRSXP is structurally identical to VECSXP. The difference is semantic - EXPRSXP represents collections of unevaluated expressions (typically language objects), most often as the result of `parse()` or `expression()` in R.

## Attributes

If the HAS_ATTR_BIT_MASK flag is set, attributes follow the object data (for pairlist-like nodes they come before the tag and value instead):

```
X bytes:  Pairlist (LISTSXP) of attributes
          Each attribute is a pair: tag (symbol) + value (object)
```

Common attributes:
- `names`: Character vector of element names
- `dim`: Integer vector for matrix/array dimensions
- `dimnames`: List of dimension names
- `class`: Character vector of class names
- `row.names`: Row names for data frames

## Pairlist (LISTSXP)

```
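For each node:
  4 bytes:   Flags and type (with HAS_ATTR_BIT / HAS_TAG_BIT as needed)
  [X bytes:  Attributes (pairlist) - only if HAS_ATTR_BIT set]
  [X bytes:  TAG (symbol) - only if HAS_TAG_BIT set]
  X bytes:   CAR (value object)
  X bytes:   CDR (next node, or NILVALUE_SXP (254) at the end of the list)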
```
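
To make the node layout concrete, here is a deliberately narrow sketch that walks a pairlist whose values are all integer vectors. It follows the field order used by R's `src/main/serialize.c` (tag before CAR, then CDR) and treats the NILVALUE_SXP pseudo-type (254) as the end marker. The `Reader` type and all function names are hypothetical, and nodes with attributes, NA values, and REFSXP tags are not handled.

```rust
const SYMSXP: u32 = 1;
const LISTSXP: u32 = 2;
const CHARSXP: u32 = 9;
const INTSXP: u32 = 13;
const NILVALUE_SXP: u32 = 254;
const HAS_ATTR_BIT_MASK: u32 = 1 << 9;
const HAS_TAG_BIT_MASK: u32 = 1 << 10;

/// Minimal big-endian cursor over a byte slice (hypothetical helper).
struct Reader<'a> {
    data: &'a [u8],
}

impl<'a> Reader<'a> {
    fn take(&mut self, n: usize) -> Option<&'a [u8]> {
        if self.data.len() < n {
            return None;
        }
        let (head, tail) = self.data.split_at(n);
        self.data = tail;
        Some(head)
    }

    fn u32(&mut self) -> Option<u32> {
        let b = self.take(4)?;
        Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
    }
}

/// Read a symbol written in full: SYMSXP flags, then a CHARSXP with its name.
fn read_symbol(r: &mut Reader<'_>) -> Option<String> {
    if (r.u32()? & 0xFF) != SYMSXP || (r.u32()? & 0xFF) != CHARSXP {
        return None;
    }
    let len = r.u32()? as usize;
    String::from_utf8(r.take(len)?.to_vec()).ok()
}

/// Read an integer vector including its flags word.
fn read_int_value(r: &mut Reader<'_>) -> Option<Vec<i32>> {
    if (r.u32()? & 0xFF) != INTSXP {
        return None;
    }
    let len = r.u32()?;
    let mut values = Vec::new();
    for _ in 0..len {
        values.push(r.u32()? as i32);
    }
    Some(values)
}

/// Walk a pairlist iteratively, returning (tag, value) for each node.
pub fn read_pairlist(bytes: &[u8]) -> Option<Vec<(Option<String>, Vec<i32>)>> {
    let mut r = Reader { data: bytes };
    let mut nodes = Vec::new();
    loop {
        let flags = r.u32()?;
        match flags & 0xFF {
            NILVALUE_SXP => return Some(nodes), // end of list
            LISTSXP => {
                if (flags & HAS_ATTR_BIT_MASK) != 0 {
                    return None; // attributes on nodes are not handled here
                }
                let tag = if (flags & HAS_TAG_BIT_MASK) != 0 {
                    Some(read_symbol(&mut r)?)
                } else {
                    None
                };
                nodes.push((tag, read_int_value(&mut r)?));
                // The CDR is the next iteration of the loop.
            }
            _ => return None,
        }
    }
}
```

Handling the CDR with a loop rather than recursion keeps stack use constant for long lists, which is also how R's own writer treats the tail.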

## Compression

RDS files can be compressed. The compression is applied to the entire file after serialization:

- **gzip**: Most common, uses deflate algorithm
- **bzip2**: Better compression, slower
- **xz**: Best compression, slowest

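Compression wraps the whole stream: a compressed file starts with the container's own magic bytes (for example 0x1f 0x8b for gzip), and the RDS header described above only appears after decompression.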
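
A reader can usually tell which decompressor to apply by sniffing the first few bytes of the file. The sketch below uses the standard magic numbers of the gzip, bzip2, and xz container formats; the `Compression` enum and `detect_compression` function are hypothetical names, and the actual decompression is left to whichever library is in use.

```rust
/// Container format detected from the leading bytes (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Uncompressed,
}

/// Guess the compression of an RDS file from its first bytes.
pub fn detect_compression(bytes: &[u8]) -> Compression {
    if bytes.starts_with(&[0x1f, 0x8b]) {
        Compression::Gzip
    } else if bytes.starts_with(b"BZh") {
        Compression::Bzip2
    } else if bytes.starts_with(&[0xfd, b'7', b'z', b'X', b'Z', 0x00]) {
        Compression::Xz
    } else {
        // Anything else is treated as a raw stream; the caller should still
        // validate the RDS magic bytes before parsing.
        Compression::Uncompressed
    }
}
```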

## Implementation Notes

1. In XDR format (the default), all multi-byte integers and doubles are big-endian
2. Strings are typically ASCII or UTF-8; the encoding is flagged on each CHARSXP
3. Reference tracking must be maintained during parsing
4. Attributes are serialized as a pairlist, separately from the object data
5. Special NA values must be preserved exactly
6. Float special values (NaN, Inf) follow IEEE 754

## Example: Simple Integer Vector

For `saveRDS(42L, "test.rds", version = 2, compress = FALSE)`:

```
58 0a                 # Magic: 'X\n' (XDR format)
00 00 00 02           # Format version 2
00 04 04 01           # R version 4.4.1
00 02 03 00           # Min R version 2.3.0
00 00 00 0d           # Type: INTSXP (13)
00 00 00 01           # Length: 1
00 00 00 2a           # Value: 42
```
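
A dump in this style is convenient as a test fixture. The sketch below, with hypothetical function names, turns such a commented hex dump into bytes and then decodes the single integer from an uncompressed version-2 XDR stream; it checks only the fields it needs and ignores the two R version words.

```rust
/// Convert a hex dump into bytes; everything after `#` on a line is a comment.
pub fn parse_hex_dump(text: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for line in text.lines() {
        let code = line.split('#').next().unwrap_or("");
        for token in code.split_whitespace() {
            out.push(u8::from_str_radix(token, 16).ok()?);
        }
    }
    Some(out)
}

/// Decode a version-2 XDR stream that holds one bare length-1 integer vector.
pub fn decode_single_int(bytes: &[u8]) -> Option<i32> {
    // Read the big-endian 32-bit word starting at byte offset `i`.
    let word = |i: usize| -> Option<u32> {
        let b = bytes.get(i..i + 4)?;
        Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
    };
    if !bytes.starts_with(b"X\n") || word(2)? != 2 {
        return None; // only XDR, format version 2
    }
    // Bytes 6..14 hold the writer and minimum reader versions (not needed here).
    if (word(14)? & 0xFF) != 13 || word(18)? != 1 {
        return None; // expect INTSXP of length 1
    }
    Some(word(22)? as i32)
}
```

Feeding the text of a 26-byte dump like the one above through `parse_hex_dump` and then `decode_single_int` should recover the stored integer.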

## References

- R source code: `src/main/serialize.c`
- R Internals manual: Section on serialization
- rds2cpp: C++ implementation by LTLA