
# riegeli-rs

A pure-Rust, byte-level compatible implementation of
[Google's Riegeli/records](https://github.com/google/riegeli) file format,
a high-performance, seekable, compressed record store used in machine learning
and data pipelines.

## Why Riegeli?

Riegeli files combine fast sequential writes, random-access reads, and
transparent compression in a single container format. Records are grouped into
chunks, the file is divided into 64 KiB blocks with an integrity-hashed header
at every block boundary, and a reader can seek to any numeric position without
scanning from the start. This crate brings the format to the Rust ecosystem
with no C++ dependency.

## Quick start

```toml
[dependencies]
riegeli = "0.1"
```

Write three records and read them back:

```rust
use std::io::Cursor;
use riegeli::{RecordWriter, RecordReader, WriterOptions, ReaderOptions, CompressionType};

// Write
let mut buf = Vec::new();
let opts = WriterOptions::new().compression(CompressionType::Zstd);
let mut writer = RecordWriter::new(Cursor::new(&mut buf), opts).unwrap();
writer.write_record(b"alpha").unwrap();
writer.write_record(b"bravo").unwrap();
writer.write_record(b"charlie").unwrap();
writer.close().unwrap();

// Read
let mut reader = RecordReader::new(Cursor::new(&buf), ReaderOptions::new()).unwrap();
while let Some(record) = reader.read_record().unwrap() {
    println!("{}", std::str::from_utf8(&record).unwrap());
}
```

## File format overview

A Riegeli file is divided into fixed-size **blocks** of 65,536 bytes (64 KiB).
Every block boundary carries a 24-byte `BlockHeader` containing a HighwayHash-64
integrity hash followed by `previous_chunk` and `next_chunk` fields
(little-endian u64). In the Riegeli format these fields are distances measured
from the block boundary: back to the start of the chunk that the header
interrupts, and forward to the end of that chunk. They let a reader that lands
on any block boundary locate the nearest chunk without scanning.
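
The block arithmetic described above can be sketched with a few helper functions. This is a minimal illustration using only the two sizes stated in this section; the constant and function names are hypothetical and are not part of this crate's API.

```rust
/// Block size from the format overview: 65,536 bytes.
const BLOCK_SIZE: u64 = 1 << 16;
/// Size of the `BlockHeader` found at every block boundary.
const BLOCK_HEADER_SIZE: u64 = 24;

/// Offset of the block boundary at or before `pos`.
fn block_start(pos: u64) -> u64 {
    pos - pos % BLOCK_SIZE
}

/// True if `pos` falls inside the 24-byte header of its block.
fn is_in_block_header(pos: u64) -> bool {
    pos % BLOCK_SIZE < BLOCK_HEADER_SIZE
}

/// Number of bytes from `pos` up to the next block boundary.
fn remaining_in_block(pos: u64) -> u64 {
    BLOCK_SIZE - pos % BLOCK_SIZE
}
```

A reader that seeks to an arbitrary offset could use `block_start` to find the header to consult, and `is_in_block_header` to detect positions that cannot hold chunk data.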

The first block header is at offset 0. Immediately after it, at offset 24, sits
a 40-byte **file-signature chunk** (chunk type `'s'`).

Data is stored in **chunks**. Each chunk begins with a 40-byte `ChunkHeader`:

| Field                       | Size    | Description                              |
|-----------------------------|---------|------------------------------------------|
| `header_hash`               | 8 bytes | HighwayHash-64 of the remaining 32 bytes |
| `data_size`                 | 8 bytes | Byte length of the chunk data            |
| `data_hash`                 | 8 bytes | HighwayHash-64 of the chunk data         |
| `chunk_type_and_num_records`| 8 bytes | Low 8 bits = chunk type, high 56 = count |
| `decoded_data_size`         | 8 bytes | Uncompressed payload size                |
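
A minimal sketch of decoding the header layout in the table above, assuming the five fields are stored in table order as little-endian u64 values. `RawChunkHeader` and `parse_chunk_header` are hypothetical names used for illustration, not this crate's API, and the sketch does not verify either HighwayHash.

```rust
/// Hypothetical mirror of the 40-byte chunk header (not the crate's own type).
#[derive(Debug, PartialEq)]
struct RawChunkHeader {
    header_hash: u64,
    data_size: u64,
    data_hash: u64,
    chunk_type: u8,
    num_records: u64,
    decoded_data_size: u64,
}

/// Reads a little-endian u64 starting at `offset`.
fn read_u64_le(bytes: &[u8], offset: usize) -> u64 {
    let mut word = [0u8; 8];
    word.copy_from_slice(&bytes[offset..offset + 8]);
    u64::from_le_bytes(word)
}

/// Splits the first 40 bytes of `bytes` into header fields.
/// Returns `None` if fewer than 40 bytes are available.
/// Assumption: fields appear in table order; hashes are not checked here.
fn parse_chunk_header(bytes: &[u8]) -> Option<RawChunkHeader> {
    if bytes.len() < 40 {
        return None;
    }
    // Low 8 bits hold the chunk type, the high 56 bits the record count.
    let packed = read_u64_le(bytes, 24);
    Some(RawChunkHeader {
        header_hash: read_u64_le(bytes, 0),
        data_size: read_u64_le(bytes, 8),
        data_hash: read_u64_le(bytes, 16),
        chunk_type: (packed & 0xff) as u8,
        num_records: packed >> 8,
        decoded_data_size: read_u64_le(bytes, 32),
    })
}
```

A real reader would also recompute `header_hash` over the remaining 32 bytes and reject the header on a mismatch before trusting `data_size`.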

Chunk types:

| Type          | Byte  | Purpose                                   |
|---------------|-------|-------------------------------------------|
| Simple        | `'r'` | Records stored sequentially               |
| FileSignature | `'s'` | Marks the start of a valid Riegeli file   |
| FileMetadata  | `'m'` | Optional serialized proto metadata        |
| Padding       | `'p'` | Alignment padding                         |
| Transposed    | `'t'` | Columnar proto decomposition              |

Compression type (stored as the first byte of the chunk data for chunk types
that support compression):

| Algorithm | Byte   |
|-----------|--------|
| None      | `0x00` |
| Brotli    | `'b'`  |
| Zstd      | `'z'`  |
| Snappy    | `'s'`  |
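
The compression table maps directly onto a small lookup. The sketch below is illustrative only: `Codec`, `codec_from_byte`, and `codec_of_chunk` are hypothetical names chosen to avoid confusion with the crate's own `CompressionType`.

```rust
/// Hypothetical codec tag mirroring the compression table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Codec {
    None,
    Brotli,
    Zstd,
    Snappy,
}

/// Maps a compression byte to a codec; unknown bytes yield `None`.
fn codec_from_byte(byte: u8) -> Option<Codec> {
    match byte {
        0x00 => Some(Codec::None),
        b'b' => Some(Codec::Brotli),
        b'z' => Some(Codec::Zstd),
        b's' => Some(Codec::Snappy),
        _ => None,
    }
}

/// Reads the codec from the first byte of a chunk's data, if any.
fn codec_of_chunk(data: &[u8]) -> Option<Codec> {
    data.first().copied().and_then(codec_from_byte)
}
```

Returning `Option` rather than panicking lets a caller treat an unrecognised byte as a corrupt or unsupported chunk.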

## Public API

### `RecordWriter`

```rust
use riegeli::{CompressionType, RecordWriter, WriterOptions};

let opts = WriterOptions::new()
    .compression(CompressionType::Zstd)
    .transpose(true)          // enable columnar encoding for proto records
    .chunk_size(1 << 20)      // flush every ~1 MiB of record data
    .initial_padding(65536);  // pad file size to a multiple on close

// `dest` is the output sink, e.g. `Cursor::new(&mut buf)` as in the quick start.
let mut writer = RecordWriter::new(dest, opts)?;
writer.write_record(b"data")?;
writer.flush()?;   // ensure buffered records are written
writer.close()?;   // finalize the file; further writes return Err
```

### `RecordReader`

```rust
let mut reader = RecordReader::new(source, ReaderOptions::new())?;

// Sequential read
while let Some(record) = reader.read_record()? {
    process(&record);
}

// Position and seek
let pos = reader.last_pos();      // RecordPosition of last-read record
let n = pos.numeric();            // u64 suitable for storage
reader.seek_numeric(n)?;          // return to that record
let same = reader.read_record()?; // re-read it

// Metadata
if let Some(meta) = reader.read_metadata()? {
    // meta is a RecordsMetadata proto
}

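// Field projection (transpose chunks only).
// Projection is configured when a reader is constructed, so this opens a new
// reader; `source` here stands for a fresh input, since the first reader took
// ownership of the original.
use riegeli::{Field, FieldProjection};
let proj = FieldProjection::new()
    .add_field(Field::new(vec![1]))   // include proto field 1
    .add_field(Field::new(vec![2]));  // include proto field 2
let opts = ReaderOptions::new().field_projection(proj);
let mut reader = RecordReader::new(source, opts)?;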
```

### `RecordPosition`

Returned by `RecordReader::pos()` and `RecordReader::last_pos()`. Call
`.numeric()` to obtain a `u64` suitable for persistence, then pass it to
`seek_numeric` to restore the position.
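
Because the numeric position is a plain `u64`, persisting it needs nothing beyond the standard library. The sketch below stores it as 8 little-endian bytes; `save_checkpoint` and `load_checkpoint` are hypothetical helpers, and the byte order is an arbitrary choice for this example rather than anything the crate prescribes.

```rust
use std::io::{self, Read, Write};

/// Writes a numeric record position as 8 little-endian bytes.
fn save_checkpoint<W: Write>(mut dest: W, numeric_pos: u64) -> io::Result<()> {
    dest.write_all(&numeric_pos.to_le_bytes())
}

/// Reads back a position written by `save_checkpoint`.
/// Fails with an I/O error if fewer than 8 bytes are available.
fn load_checkpoint<R: Read>(mut src: R) -> io::Result<u64> {
    let mut raw = [0u8; 8];
    src.read_exact(&mut raw)?;
    Ok(u64::from_le_bytes(raw))
}
```

The value returned by `load_checkpoint` is what would be handed to `seek_numeric` when resuming a read.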

## Cargo features

| Feature  | Default | Crate    | Description        |
|----------|---------|----------|--------------------|
| `brotli` | no      | `brotli` | Brotli compression |
| `zstd`   | yes     | `zstd`   | Zstd compression   |
| `snappy` | no      | `snap`   | Snappy compression |

To enable Brotli and Zstd without Snappy:

```toml
riegeli = { git = "...", features = ["brotli", "zstd"] }
```

## Implementation status

All phases are complete, and the implementation is byte-level compatible with
the C++ reference implementation.

| Phase | Sprints | Scope                                                                                       |
|-------|---------|---------------------------------------------------------------------------------------------|
| 1     | 1–7     | Varint, headers, hashing, simple chunks (all compression codecs), `RecordWriter`, `RecordReader` (seek, recovery, block-boundary handling) |
| 2     | 8–13    | Proto wire parsing, transpose encoder + decoder (full state machine with NoOp bridging, implicit transitions), interop hardening |
| 3     | 14–24   | Conformance suite, performance tuning, field projection, API restriction                    |

Conformance is verified against the C++ reference implementation via an FFI
test harness (`riegeli-ffi`) that calls the C++ reader and writer over a
[cxx](https://cxx.rs) bridge. The test suite uses golden files produced by C++
to validate byte-level interoperability in both directions.

## Build requirements

This crate generates Rust code from `.proto` files at build time using
`protobuf-codegen`, which requires a compatible `protoc` binary on your `PATH`.
Download the latest release from the
[protobuf releases page](https://github.com/protocolbuffers/protobuf/releases)
(look for `protoc-<version>-<platform>.zip`).

## Dependencies

| Crate     | Version | Required | Purpose        |
|-----------|---------|----------|----------------|
| `highway` | 1.3     | always   | HighwayHash-64 |
| `brotli`  | 8       | optional | Brotli codec   |
| `zstd`    | 0.13    | optional | Zstd codec     |
| `snap`    | 1       | optional | Snappy codec   |

Dev dependencies: `proptest`, `criterion`.

## Benchmarks

See [`riegeli/benches/README.md`](riegeli/benches/README.md) for the full
head-to-head Rust vs. C++ benchmark matrix. Representative results on Linux
x86-64 (10 000 records, large payload):

| Config             | Rust write | Rust read | C++ write | C++ read |
|--------------------|----------:|----------:|----------:|---------:|
| simple+none        | 1348 MB/s | 2832 MB/s |  747 MB/s | 1343 MB/s |
| simple+zstd:3      | 2123 MB/s | 3070 MB/s | 3693 MB/s | 5914 MB/s |
| transpose+none     |  956 MB/s |  808 MB/s |  693 MB/s | 1149 MB/s |
| transpose+zstd:3   | 1605 MB/s |  845 MB/s | 3142 MB/s | 4123 MB/s |

C++ read throughput is measured through the FFI bridge and includes a per-record
copy across the boundary, making it lower than native C++ performance.

## License

Apache-2.0