Module vlen

The vlen array to bytes codec (Experimental).

Encodes the offsets and bytes of variable-sized data through independent codec chains. This codec is compatible with any variable-sized data type.

This codec is experimental and may be incompatible with other Zarr V3 implementations.

§Compatible Implementations

None

§Specification

Based on https://github.com/zarr-developers/zeps/pull/47#issuecomment-1710505141 by Jeremy Maitin-Shepard. Additional discussion:

This is an alternative vlen codec to the vlen-utf8, vlen-bytes, and vlen-array codecs that were introduced in Zarr V2. Rather than interleaving element bytes and lengths, element bytes (data) and offsets (indexes) are encoded separately and concatenated. Unlike the legacy vlen-* codecs, this new vlen codec is suited to partial decoding. Additionally, it is not coupled to the array data type and can utilise the full potential of the Zarr V3 codec system.

Before encoding, the index is structured using the Apache Arrow variable-size binary layout with the validity bitmap elided. The index holds length + 1 offsets, which are monotonically increasing such that

element_position = offsets[j]
element_length = offsets[j + 1] - offsets[j]  // (for 0 <= j < length)

where length is the number of chunk elements. The index can be encoded with either uint32 or uint64 offsets, depending on the index_data_type configuration parameter.
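
The offset relations above can be illustrated with a minimal, standard-library-only sketch. The function names build_index and element_at are hypothetical and are not part of the zarrs API; the sketch assumes the first offset is 0, as in the Arrow layout, and it works on unencoded offsets and data, before any codec chain is applied.

```rust
/// Build an Arrow-style offsets index (validity bitmap elided) and the
/// concatenated data buffer from a list of variable-sized elements.
/// Hypothetical helper for illustration only.
fn build_index(elements: &[&[u8]]) -> (Vec<u64>, Vec<u8>) {
    let mut offsets = Vec::with_capacity(elements.len() + 1);
    let mut data = Vec::new();
    offsets.push(0u64); // assumed: the first offset is zero
    for element in elements {
        data.extend_from_slice(element);
        // Each subsequent offset is the running end position of the data.
        offsets.push(data.len() as u64);
    }
    (offsets, data)
}

/// Recover element `j` (0 <= j < length) using only two adjacent offsets.
/// Hypothetical helper for illustration only.
fn element_at<'a>(offsets: &[u64], data: &'a [u8], j: usize) -> &'a [u8] {
    let element_position = offsets[j] as usize;
    let element_length = (offsets[j + 1] - offsets[j]) as usize;
    &data[element_position..element_position + element_length]
}

fn main() {
    let elements: [&[u8]; 3] = [b"zarr", b"", b"vlen"];
    let (offsets, data) = build_index(&elements);
    // Three elements produce length + 1 = 4 offsets.
    assert_eq!(offsets, vec![0u64, 4, 4, 8]);
    assert_eq!(element_at(&offsets, &data, 2), &b"vlen"[..]);
}
```

Because an element is located from two adjacent offsets alone, a reader does not need to walk the preceding elements, which is what makes this layout amenable to partial decoding.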

The data and index can use their own independent codec chain with support for any Zarr V3 codecs. The codecs are specified by data_codecs and index_codecs parameters in the codec configuration.

The index length and index can be encoded at the start or end of each chunk. If index_location is start:

  • The first 8 bytes hold a u64 little-endian indicating the length of the encoded index.
  • This is followed by the encoded index and then the encoded bytes with no padding.

If index_location is end:

  • The last 8 bytes hold the length of the encoded index.
  • The encoded index lies between the encoded data and the index length.
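
The two chunk layouts above can be sketched as follows. This is not the zarrs implementation: IndexLocation, assemble_chunk, and split_chunk are hypothetical names, the encoded index and encoded data are treated as opaque byte slices, and the sketch assumes the trailing index length for the end location is also a little-endian u64 (the text above states the encoding explicitly only for start).

```rust
/// Hypothetical stand-in for the index_location parameter.
#[derive(Clone, Copy, Debug, PartialEq)]
enum IndexLocation {
    Start,
    End,
}

/// Concatenate an already-encoded index and already-encoded data into a chunk.
fn assemble_chunk(index: &[u8], data: &[u8], location: IndexLocation) -> Vec<u8> {
    // 8-byte little-endian length of the encoded index.
    let index_len = (index.len() as u64).to_le_bytes();
    let mut chunk = Vec::with_capacity(8 + index.len() + data.len());
    match location {
        IndexLocation::Start => {
            chunk.extend_from_slice(&index_len);
            chunk.extend_from_slice(index);
            chunk.extend_from_slice(data);
        }
        IndexLocation::End => {
            chunk.extend_from_slice(data);
            chunk.extend_from_slice(index);
            // Assumed little-endian u64, mirroring the start layout.
            chunk.extend_from_slice(&index_len);
        }
    }
    chunk
}

/// Split a chunk into (encoded index, encoded data); None if it is malformed.
fn split_chunk(chunk: &[u8], location: IndexLocation) -> Option<(&[u8], &[u8])> {
    if chunk.len() < 8 {
        return None;
    }
    // Separate the 8 length bytes from the remaining payload.
    let (len_bytes, rest) = match location {
        IndexLocation::Start => chunk.split_at(8),
        IndexLocation::End => {
            let (rest, len_bytes) = chunk.split_at(chunk.len() - 8);
            (len_bytes, rest)
        }
    };
    let mut buf = [0u8; 8];
    buf.copy_from_slice(len_bytes);
    let index_len = u64::from_le_bytes(buf);
    if index_len > rest.len() as u64 {
        return None;
    }
    let index_len = index_len as usize;
    match location {
        IndexLocation::Start => {
            let (index, data) = rest.split_at(index_len);
            Some((index, data))
        }
        IndexLocation::End => {
            let (data, index) = rest.split_at(rest.len() - index_len);
            Some((index, data))
        }
    }
}

fn main() {
    let chunk = assemble_chunk(&[1, 2, 3], &[9, 8], IndexLocation::End);
    let (index, data) = split_chunk(&chunk, IndexLocation::End).unwrap();
    assert_eq!(index, &[1u8, 2, 3][..]);
    assert_eq!(data, &[9u8, 8][..]);
}
```

In either layout, the index length is the only fixed-position field, so a decoder reads those 8 bytes first and derives the index and data spans from them.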

§Codec name Aliases (Zarr V3)

  • zarrs.vlen
  • https://codec.zarrs.dev/array_to_bytes/vlen

§Codec id Aliases (Zarr V2)

None

§Codec configuration Example - VlenCodecConfiguration:

{
  "data_codecs": [
    {
      "name": "bytes"
    },
    {
      "name": "blosc",
      "configuration": {
        "cname": "zstd",
        "clevel": 5,
        "shuffle": "bitshuffle",
        "typesize": 1,
        "blocksize": 0
      }
    }
  ],
  "index_codecs": [
    {
      "name": "bytes",
      "configuration": {
        "endian": "little"
      }
    },
    {
      "name": "blosc",
      "configuration": {
        "cname": "zstd",
        "clevel": 5,
        "shuffle": "shuffle",
        "typesize": 4,
        "blocksize": 0
      }
    }
  ],
  "index_data_type": "uint32",
  "index_location": "end"
}
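
To make the index side of this example concrete, the sketch below shows what a uint32 index would look like after a little-endian bytes codec and before compression. It is an illustration under stated assumptions, not the zarrs implementation: encode_offsets_u32_le is a hypothetical name, and the observation that the 4-byte element width matches the typesize of 4 given to blosc in index_codecs is an inference from the example rather than a documented rule.

```rust
/// Serialise offsets as little-endian uint32 values, 4 bytes per offset.
/// Returns None if any offset does not fit in a uint32, in which case a
/// uint64 index_data_type would be needed instead.
/// Hypothetical helper for illustration only.
fn encode_offsets_u32_le(offsets: &[u64]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(offsets.len() * 4);
    for &offset in offsets {
        if offset > u32::MAX as u64 {
            return None;
        }
        out.extend_from_slice(&(offset as u32).to_le_bytes());
    }
    Some(out)
}

fn main() {
    // Offsets for three elements of lengths 4, 0, and 4.
    let bytes = encode_offsets_u32_le(&[0, 4, 4, 8]).unwrap();
    assert_eq!(bytes.len(), 16);
    assert_eq!(&bytes[4..8], &[4u8, 0, 0, 0][..]);
}
```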

Structs§

VlenCodec
A vlen codec implementation.
VlenCodecConfigurationV0
vlen codec configuration parameters (version 0.0 draft).
VlenCodecConfigurationV0_1
vlen codec configuration parameters (version 0.1 draft).

Enums§

VlenCodecConfiguration
A wrapper to handle various versions of vlen codec configuration parameters.