riegeli 0.1.0

Rust implementation of the Riegeli/records file format

riegeli-rs

A pure-Rust, byte-level compatible implementation of Google's Riegeli/records file format — a high-performance, seekable, compressed record store used in machine learning and data pipelines.

Why Riegeli?

Riegeli files combine fast sequential writes, random-access reads, and transparent compression into a single container format. Records are grouped into chunks, chunks are aligned to 64 KiB blocks with integrity hashes at every boundary, and the whole file can be seeked by numeric position without scanning from the start. This crate brings that format to the Rust ecosystem with no C++ dependency.

Quick start

[dependencies]
riegeli = "0.1"

Write three records and read them back:

use std::io::Cursor;
use riegeli::{RecordWriter, RecordReader, WriterOptions, ReaderOptions, CompressionType};

// Write
let mut buf = Vec::new();
let opts = WriterOptions::new().compression(CompressionType::Zstd);
let mut writer = RecordWriter::new(Cursor::new(&mut buf), opts).unwrap();
writer.write_record(b"alpha").unwrap();
writer.write_record(b"bravo").unwrap();
writer.write_record(b"charlie").unwrap();
writer.close().unwrap();

// Read
let mut reader = RecordReader::new(Cursor::new(&buf), ReaderOptions::new()).unwrap();
while let Some(record) = reader.read_record().unwrap() {
    println!("{}", std::str::from_utf8(&record).unwrap());
}

File format overview

A Riegeli file is divided into fixed-size blocks of 65 536 bytes. Every block boundary carries a 24-byte BlockHeader containing a HighwayHash-64 integrity hash and previous_chunk / next_chunk file-offset pointers (little-endian u64). These pointers allow a reader that lands on any block boundary to locate the nearest chunk without scanning.

The first block header is at offset 0. Immediately after it, at offset 24, sits a 40-byte file-signature chunk (chunk type 's').
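The block layout above reduces to simple modular arithmetic. The following is a minimal sketch using only the sizes quoted in this section; the constant and function names are hypothetical and are not part of this crate's public API.

```rust
/// Block size stated in the format overview (64 KiB).
const BLOCK_SIZE: u64 = 65_536;
/// Size of the header found at every block boundary.
const BLOCK_HEADER_SIZE: u64 = 24;
/// Size of a chunk header.
const CHUNK_HEADER_SIZE: u64 = 40;

/// Returns true if `pos` falls exactly on a block boundary.
fn is_block_boundary(pos: u64) -> bool {
    pos % BLOCK_SIZE == 0
}

/// Returns the offset of the block boundary at or before `pos`.
fn block_start(pos: u64) -> u64 {
    pos - pos % BLOCK_SIZE
}

/// Returns the offset of the first block boundary strictly after `pos`.
/// (Overflow near u64::MAX is not handled in this sketch.)
fn next_block_boundary(pos: u64) -> u64 {
    block_start(pos) + BLOCK_SIZE
}

/// Byte range of the file-signature chunk, which follows the first
/// block header and consists of a chunk header only.
fn signature_chunk_range() -> std::ops::Range<u64> {
    BLOCK_HEADER_SIZE..BLOCK_HEADER_SIZE + CHUNK_HEADER_SIZE
}
```

Under these assumptions the signature chunk occupies bytes 24 through 63, so the first regular chunk can start no earlier than offset 64.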

Data is stored in chunks. Each chunk begins with a 40-byte ChunkHeader:

| Field | Size | Description |
|---|---|---|
| header_hash | 8 bytes | HighwayHash-64 of the remaining 32 bytes |
| data_size | 8 bytes | Byte length of the chunk data |
| data_hash | 8 bytes | HighwayHash-64 of the chunk data |
| chunk_type_and_num_records | 8 bytes | Low 8 bits = chunk type, high 56 = count |
| decoded_data_size | 8 bytes | Uncompressed payload size |
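To make the header layout concrete, here is a minimal decoding sketch. It assumes the fields appear in the order listed and are encoded as little-endian u64 values, as stated for the block header pointers; the struct and function names are hypothetical, and no hash verification is performed.

```rust
/// Hypothetical in-memory view of the 40-byte chunk header.
#[derive(Debug, PartialEq, Eq)]
struct ChunkHeaderSketch {
    header_hash: u64,
    data_size: u64,
    data_hash: u64,
    chunk_type: u8,
    num_records: u64,
    decoded_data_size: u64,
}

/// Reads the little-endian u64 that starts at byte `offset`.
fn read_u64_le(bytes: &[u8; 40], offset: usize) -> u64 {
    let mut word = [0u8; 8];
    word.copy_from_slice(&bytes[offset..offset + 8]);
    u64::from_le_bytes(word)
}

/// Splits a raw header into its fields. The fourth word packs the
/// chunk type into the low 8 bits and the record count into the
/// high 56 bits. Hashes are extracted but NOT checked here.
fn parse_chunk_header(bytes: &[u8; 40]) -> ChunkHeaderSketch {
    let packed = read_u64_le(bytes, 24);
    ChunkHeaderSketch {
        header_hash: read_u64_le(bytes, 0),
        data_size: read_u64_le(bytes, 8),
        data_hash: read_u64_le(bytes, 16),
        chunk_type: (packed & 0xff) as u8,
        num_records: packed >> 8,
        decoded_data_size: read_u64_le(bytes, 32),
    }
}
```

A real reader would recompute HighwayHash-64 over the last 32 bytes and compare it with header_hash before trusting any other field.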

Chunk types:

| Type | Byte | Purpose |
|---|---|---|
| Simple | 'r' | Records stored sequentially |
| FileSignature | 's' | Marks the start of a valid Riegeli file |
| FileMetadata | 'm' | Optional serialized proto metadata |
| Padding | 'p' | Alignment padding |
| Transposed | 't' | Columnar proto decomposition |
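The chunk type byte maps naturally onto an enum. This is an illustrative sketch built from the table above; `ChunkKind` and its methods are hypothetical names, not the crate's own types.

```rust
/// Hypothetical enum mirroring the chunk type table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChunkKind {
    Simple,
    FileSignature,
    FileMetadata,
    Padding,
    Transposed,
}

impl ChunkKind {
    /// Maps a chunk type byte to a kind; unknown bytes yield `None`.
    fn from_byte(byte: u8) -> Option<Self> {
        match byte {
            b'r' => Some(Self::Simple),
            b's' => Some(Self::FileSignature),
            b'm' => Some(Self::FileMetadata),
            b'p' => Some(Self::Padding),
            b't' => Some(Self::Transposed),
            _ => None,
        }
    }

    /// Returns the byte stored in the chunk header for this kind.
    fn as_byte(self) -> u8 {
        match self {
            Self::Simple => b'r',
            Self::FileSignature => b's',
            Self::FileMetadata => b'm',
            Self::Padding => b'p',
            Self::Transposed => b't',
        }
    }
}
```

Returning `Option` rather than panicking lets a caller decide how to treat chunk types it does not recognize.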

Compression (first byte of chunk data for compressed types):

| Algorithm | Byte |
|---|---|
| None | 0x00 |
| Brotli | 'b' |
| Zstd | 'z' |
| Snappy | 's' |
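Detecting the codec then amounts to inspecting the first byte of the chunk data. A minimal sketch, assuming the byte values in the table above; `CompressionByte` and `detect_compression` are hypothetical names chosen to avoid confusion with the crate's `CompressionType`.

```rust
/// Hypothetical enum mirroring the compression byte table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CompressionByte {
    Uncompressed,
    Brotli,
    Zstd,
    Snappy,
}

/// Inspects the first byte of `chunk_data`. Returns `None` when the
/// data is empty or the byte is not one of the four known values.
fn detect_compression(chunk_data: &[u8]) -> Option<CompressionByte> {
    match *chunk_data.first()? {
        0x00 => Some(CompressionByte::Uncompressed),
        b'b' => Some(CompressionByte::Brotli),
        b'z' => Some(CompressionByte::Zstd),
        b's' => Some(CompressionByte::Snappy),
        _ => None,
    }
}
```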

Public API

RecordWriter

let opts = WriterOptions::new()
    .compression(CompressionType::Zstd)
    .transpose(true)          // enable columnar encoding for proto records
    .chunk_size(1 << 20)      // flush every ~1 MiB of record data
    .initial_padding(65536);  // pad file size to a multiple on close

let mut writer = RecordWriter::new(dest, opts)?;
writer.write_record(b"data")?;
writer.flush()?;   // ensure buffered records are written
writer.close()?;   // finalize the file; further writes return Err

RecordReader

let mut reader = RecordReader::new(source, ReaderOptions::new())?;

// Sequential read
while let Some(record) = reader.read_record()? {
    process(&record);
}

// Position and seek
let pos = reader.last_pos();      // RecordPosition of last-read record
let n = pos.numeric();            // u64 suitable for storage
reader.seek_numeric(n)?;          // return to that record
let same = reader.read_record()?; // re-read it

// Metadata
if let Some(meta) = reader.read_metadata()? {
    // meta is a RecordsMetadata proto
}

// Field projection (transpose chunks only)
use riegeli::{Field, FieldProjection};
let proj = FieldProjection::new()
    .add_field(Field::new(vec![1]))   // include proto field 1
    .add_field(Field::new(vec![2]));  // include proto field 2
let opts = ReaderOptions::new().field_projection(proj);
let mut reader = RecordReader::new(source, opts)?;

RecordPosition

Returned by RecordReader::pos() and last_pos(). Call .numeric() to obtain a u64 suitable for persistence, then pass it to seek_numeric to restore the position.
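Because the numeric position is a plain u64, it can be checkpointed with the standard library alone. The sketch below shows one possible encoding (8 bytes, little-endian); the function names and the on-disk layout are assumptions for illustration, not something this crate prescribes.

```rust
use std::io::{self, Read, Write};

/// Writes a numeric record position as 8 little-endian bytes.
fn save_position<W: Write>(mut dest: W, numeric_pos: u64) -> io::Result<()> {
    dest.write_all(&numeric_pos.to_le_bytes())
}

/// Reads back a position written by `save_position`.
/// Fails with an I/O error if fewer than 8 bytes are available.
fn load_position<R: Read>(mut src: R) -> io::Result<u64> {
    let mut raw = [0u8; 8];
    src.read_exact(&mut raw)?;
    Ok(u64::from_le_bytes(raw))
}
```

In a resumable pipeline, the value passed to `save_position` would come from `reader.last_pos().numeric()`, and the value returned by `load_position` would be handed to `reader.seek_numeric` on restart.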

Cargo features

| Feature | Default | Crate | Description |
|---|---|---|---|
| brotli | no | brotli | Brotli compression |
| zstd | yes | zstd | Zstd compression |
| snappy | no | snap | Snappy compression |

To enable Brotli and Zstd without Snappy:

riegeli = { git = "...", features = ["brotli", "zstd"] }

Implementation status

All phases are complete and byte-level compatible with the C++ reference implementation.

| Phase | Sprints | Scope |
|---|---|---|
| 1 | 1–7 | Varint, headers, hashing, simple chunks (all compression codecs), RecordWriter, RecordReader (seek, recovery, block-boundary handling) |
| 2 | 8–13 | Proto wire parsing, transpose encoder + decoder (full state machine with NoOp bridging, implicit transitions), interop hardening |
| 3 | 14–24 | Conformance suite, performance tuning, field projection, API restriction |

Conformance is verified against the C++ reference implementation via an FFI test harness (riegeli-ffi) that calls the C++ reader and writer over a cxx bridge. The test suite uses golden files produced by C++ to validate byte-level interoperability in both directions.

Build requirements

This crate generates Rust code from .proto files at build time using protobuf-codegen, which requires a compatible protoc binary on your PATH. Download the latest release from the protobuf releases page (look for protoc-<version>-<platform>.zip).
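Before building, it can help to confirm that protoc is actually reachable. The following is a small standard-library sketch; `tool_version` is a hypothetical helper, not part of this crate or of protobuf-codegen.

```rust
use std::process::Command;

/// Runs `<tool> --version` and returns its trimmed stdout.
/// Returns `None` if the tool cannot be launched (for example, it is
/// not on PATH) or if it exits with a non-zero status.
fn tool_version(tool: &str) -> Option<String> {
    let output = Command::new(tool).arg("--version").output().ok()?;
    if !output.status.success() {
        return None;
    }
    Some(String::from_utf8_lossy(&output.stdout).trim().to_string())
}
```

Calling `tool_version("protoc")` yields `None` when no protoc binary is found on PATH, which is the situation that would make the build step fail.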

Dependencies

| Crate | Version | Required | Purpose |
|---|---|---|---|
| highway | 1.3 | always | HighwayHash-64 |
| brotli | 8 | optional | Brotli codec |
| zstd | 0.13 | optional | Zstd codec |
| snap | 1 | optional | Snappy codec |

Dev dependencies: proptest, criterion.

Benchmarks

See riegeli/benches/README.md for the full head-to-head Rust vs. C++ benchmark matrix. Representative results on Linux x86-64 (10 000 records, large payload):

| Config | Rust write | Rust read | C++ write | C++ read |
|---|---|---|---|---|
| simple+none | 1348 MB/s | 2832 MB/s | 747 MB/s | 1343 MB/s |
| simple+zstd:3 | 2123 MB/s | 3070 MB/s | 3693 MB/s | 5914 MB/s |
| transpose+none | 956 MB/s | 808 MB/s | 693 MB/s | 1149 MB/s |
| transpose+zstd:3 | 1605 MB/s | 845 MB/s | 3142 MB/s | 4123 MB/s |

C++ read throughput is measured through the FFI bridge and includes a per-record copy across the boundary, making it lower than native C++ performance.

License

Apache-2.0