Module byte_stream_split


§Byte Stream Split (BSS) Miniblock Format

Byte Stream Split is a data transformation technique that improves compression by reorganizing multi-byte values to group bytes from the same position together. This is particularly effective for data where some byte positions have low entropy.

§How It Works

BSS splits multi-byte values by byte position, creating separate streams for each byte position across all values. This transformation is most beneficial when certain byte positions have low entropy (e.g., high-order bytes that are mostly zeros, sign-extended bytes, or floating-point sign/exponent bytes that cluster around common values).

§Example

Input data (f32): [1.0, 2.0, 3.0, 4.0]

In little-endian bytes:

  • 1.0 = [00, 00, 80, 3F]
  • 2.0 = [00, 00, 00, 40]
  • 3.0 = [00, 00, 40, 40]
  • 4.0 = [00, 00, 80, 40]

After BSS transformation:

  • Byte stream 0: [00, 00, 00, 00] (all first bytes)
  • Byte stream 1: [00, 00, 00, 00] (all second bytes)
  • Byte stream 2: [80, 00, 40, 80] (all third bytes)
  • Byte stream 3: [3F, 40, 40, 40] (all fourth bytes)

Output: [00, 00, 00, 00, 00, 00, 00, 00, 80, 00, 40, 80, 3F, 40, 40, 40]
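The transformation in this example can be sketched in a few lines of Rust. This is a minimal illustration only, not the module's actual implementation: the function names `bss_encode_f32` and `bss_decode_f32` are hypothetical, and the sketch assumes little-endian byte order as in the example.

```rust
/// Hypothetical helper: split `values` into one stream per byte position
/// (little-endian order) and concatenate the streams.
pub fn bss_encode_f32(values: &[f32]) -> Vec<u8> {
    const WIDTH: usize = std::mem::size_of::<f32>();
    let n = values.len();
    let mut out = vec![0u8; n * WIDTH];
    for (i, v) in values.iter().enumerate() {
        let raw = v.to_le_bytes();
        for (b, &byte) in raw.iter().enumerate() {
            // Byte `b` of value `i` lands at position `i` of stream `b`.
            out[b * n + i] = byte;
        }
    }
    out
}

/// Hypothetical helper: inverse of `bss_encode_f32`.
pub fn bss_decode_f32(bytes: &[u8]) -> Vec<f32> {
    const WIDTH: usize = std::mem::size_of::<f32>();
    assert!(bytes.len() % WIDTH == 0, "length must be a multiple of 4");
    let n = bytes.len() / WIDTH;
    (0..n)
        .map(|i| {
            let mut raw = [0u8; WIDTH];
            for b in 0..WIDTH {
                // Gather byte `b` of value `i` back from stream `b`.
                raw[b] = bytes[b * n + i];
            }
            f32::from_le_bytes(raw)
        })
        .collect()
}
```

Calling `bss_encode_f32(&[1.0, 2.0, 3.0, 4.0])` yields the 16-byte output listed above, and `bss_decode_f32` restores the original values from it.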

§Compression Benefits

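BSS itself doesn't compress data; it only reorders bytes, so the output is the same size as the input. The benefit comes when BSS is followed by general-purpose compression (e.g., LZ4), which can exploit the resulting runs of similar bytes. Data that typically benefits: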

  1. Timestamps: Sequential timestamps have similar high-order bytes
  2. Sensor data: Readings often vary in a small range, sharing exponent bits
  3. Financial data: Prices may cluster around certain values
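One rough way to see why the reordering helps a downstream compressor is to count runs of identical bytes before and after the split. The sketch below is only a proxy, since it uses no real compressor, and every name in it (`count_runs`, `interleaved_le`, `split_streams_f32`) is hypothetical rather than part of this module.

```rust
/// Number of maximal runs of identical consecutive bytes.
/// Fewer runs generally means more redundancy for a compressor to exploit.
pub fn count_runs(bytes: &[u8]) -> usize {
    if bytes.is_empty() {
        return 0;
    }
    1 + bytes.windows(2).filter(|w| w[0] != w[1]).count()
}

/// Plain layout: each value's little-endian bytes stored back to back.
pub fn interleaved_le(values: &[f32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// BSS layout: all first bytes, then all second bytes, and so on.
pub fn split_streams_f32(values: &[f32]) -> Vec<u8> {
    let n = values.len();
    let mut out = vec![0u8; n * 4];
    for (i, v) in values.iter().enumerate() {
        let raw = v.to_le_bytes();
        for (b, &byte) in raw.iter().enumerate() {
            out[b * n + i] = byte;
        }
    }
    out
}
```

For the `[1.0, 2.0, 3.0, 4.0]` example, `count_runs(&split_streams_f32(..))` is smaller than `count_runs(&interleaved_le(..))`, because the zero bytes that were scattered through the plain layout become one contiguous run.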

§Supported Types

  • 32-bit floating point (f32)
  • 64-bit floating point (f64)

§Chunk Handling

  • Maximum chunk size depends on data type:
    • f32: 1024 values (4 KiB per chunk)
    • f64: 512 values (4 KiB per chunk)
  • All chunks share a single global buffer
  • Every chunk except the last contains a power-of-2 number of values
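The chunk limits above can be illustrated with a small planning function. This is a sketch under stated assumptions, not the encoder's real logic: `plan_chunks` and `CHUNK_BYTES` are hypothetical names, and the sketch assumes every chunk except the last is filled to the maximum size (which is itself a power of 2). The returned offsets index into one shared buffer of values.

```rust
/// Assumed per-chunk byte budget (4 KiB, matching the limits listed above).
const CHUNK_BYTES: usize = 4096;

/// Hypothetical helper: returns `(offset, len)` pairs, both counted in values,
/// describing how `num_values` values are divided into chunks.
pub fn plan_chunks(num_values: usize, bytes_per_value: usize) -> Vec<(usize, usize)> {
    assert!(
        bytes_per_value == 4 || bytes_per_value == 8,
        "only f32 (4 bytes) and f64 (8 bytes) are supported"
    );
    // 1024 values for f32, 512 values for f64.
    let max_values = CHUNK_BYTES / bytes_per_value;
    let mut chunks = Vec::new();
    let mut offset = 0;
    while offset < num_values {
        // Assumption: non-last chunks are always full; only the last may be short.
        let len = max_values.min(num_values - offset);
        chunks.push((offset, len));
        offset += len;
    }
    chunks
}
```

For example, 2500 `f32` values would be planned as two chunks of 1024 values followed by a final chunk of 452 values.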

Structs§

ByteStreamSplitDecompressor
Byte Stream Split decompressor
ByteStreamSplitEncoder
Byte Stream Split encoder for floating point values

Functions§

should_use_bss
Determine if BSS should be used based on mode and data characteristics