zarr Driver

Zarr v2 is a chunked array storage format based on the NumPy data type model.

The zarr driver provides access to Zarr-v2-format arrays backed by any supported Key-Value Storage Layer. It supports reading, writing, creating new arrays, and resizing arrays.

Zarr supports arrays with structured data types specifying multiple named fields that are packed together. TensorStore fully supports such arrays, but each field must be opened separately.
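For instance, the following sketch (using a hypothetical in-memory store and field names "x" and "y") creates a two-field array and then opens each field as a separate TensorStore:

import tensorstore as ts

# A minimal sketch: each field of a structured dtype is a separate TensorStore.
# The shared context ensures both opens see the same in-memory store.
context = ts.Context()
spec = {
    'driver': 'zarr',
    'kvstore': {'driver': 'memory'},
    'metadata': {'shape': [100], 'dtype': [['x', '<u2'], ['y', '<f4']]},
}
x = ts.open(dict(spec, field='x'), create=True, context=context).result()
y = ts.open(dict(spec, field='y'), open=True, context=context).result()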

json driver/zarr : object
Required members:
driver : "zarr"
kvstore : KvStore | KvStoreUrl

Specifies the underlying storage mechanism.

Optional members:
context : Context

Specifies context resources that augment/override the parent context.

dtype : dtype

Specifies the data type.

rank : integer[0, 32]

Specifies the rank of the TensorStore.

If transform is also specified, the input rank must match. Otherwise, the rank constraint applies to the driver directly.

transform : IndexTransform

Specifies a transform.

schema : Schema

Specifies constraints on the schema.

When opening an existing array, specifies constraints on the existing schema; opening will fail if the constraints do not match. Any soft constraints specified in the chunk_layout are ignored. When creating a new array, a suitable schema will be selected automatically based on the specified schema constraints in combination with any driver-specific constraints.

path : string = ""

Additional path within the KvStore specified by kvstore.

This is joined as an additional "/"-separated path component after any path member directly within kvstore. This is supported for backwards compatibility only; the KvStore.path member should be used instead.

Example

"path/to/data"
open : boolean

Open an existing TensorStore. If neither open nor create is specified, defaults to true.

create : boolean = false

Create a new TensorStore. Specify true for both open and create to permit either opening an existing TensorStore or creating a new TensorStore if it does not already exist.
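For example, the following sketch (against a hypothetical in-memory store) opens the array if it already exists and otherwise creates it:

import tensorstore as ts

# Open-or-create: succeeds whether or not the array already exists.
arr = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'memory'},
    'open': True,
    'create': True,
}, dtype=ts.uint16, shape=[100, 100]).result()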

delete_existing : boolean = false

Delete any existing data at the specified path before creating a new TensorStore. Requires that create is true, and that open is false.

assume_metadata : boolean = false

Neither read nor write stored metadata. Instead, just assume any necessary metadata based on constraints in the spec, using the same defaults for any unspecified metadata as when creating a new TensorStore. The stored metadata need not even exist. Operations such as resizing that modify the stored metadata are not supported. Requires that open is true and delete_existing is false. This option takes precedence over assume_cached_metadata if that option is also specified.

Warning

This option can lead to data corruption if the assumed metadata does not match the stored metadata, or multiple concurrent writers use different assumed metadata.
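As a sketch (the shape, chunks, and dtype shown are hypothetical), an open with assume_metadata never reads the stored metadata and instead trusts the constraints in the spec:

import tensorstore as ts

# The stored .zarray is never read; metadata is assumed from the spec.
# Data corruption is possible if these constraints do not match reality.
arr = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'memory'},
    'assume_metadata': True,
    'metadata': {'shape': [100, 100], 'chunks': [10, 10], 'dtype': '<u2'},
}, open=True).result()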

assume_cached_metadata : boolean = false

Skip reading the metadata when opening. Instead, just assume any necessary metadata based on constraints in the spec, using the same defaults for any unspecified metadata as when creating a new TensorStore. The stored metadata may still be accessed by subsequent operations that need to re-validate or modify the metadata. Requires that open is true and delete_existing is false. The assume_metadata option takes precedence if also specified.

Note

Unlike the assume_metadata option, operations such as resizing that modify the stored metadata are supported (and access the stored metadata).

Warning

This option can lead to data corruption if the assumed metadata does not match the stored metadata, or multiple concurrent writers use different assumed metadata.

cache_pool : ContextResource = "cache_pool"

Cache pool for data.

Specifies or references a previously defined Context.cache_pool. It is normally more convenient to specify a default cache_pool in the context.

metadata_cache_pool : ContextResource

Cache pool for metadata only.

Specifies or references a previously defined Context.cache_pool. If not specified, defaults to the value of cache_pool.

data_copy_concurrency : ContextResource = "data_copy_concurrency"

Specifies or references a previously defined Context.data_copy_concurrency. It is normally more convenient to specify a default data_copy_concurrency in the context.

recheck_cached_metadata : CacheRevalidationBound = "open"

Time after which cached metadata is assumed to be fresh. Cached metadata older than the specified time is revalidated prior to use. The metadata is used to check the bounds of every read or write operation.

Specifying true means that the metadata will be revalidated prior to every read or write operation. With the default value of "open", any cached metadata is revalidated when the TensorStore is opened but is not rechecked for each read or write operation.

recheck_cached_data : CacheRevalidationBound = true

Time after which cached data is assumed to be fresh. Cached data older than the specified time is revalidated prior to being returned from a read operation. Partial chunk writes are always consistent regardless of the value of this option.

The default value of true means that cached data is revalidated on every read. To enable in-memory data caching, you must both specify a cache_pool with a non-zero total_bytes_limit and also specify false, "open", or an explicit time bound for recheck_cached_data.
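For example, the following sketch (with a hypothetical 100 MB limit) enables in-memory caching and revalidates cached data only when the TensorStore is opened:

import tensorstore as ts

# Chunk data is cached in memory and revalidated only at open time.
arr = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'memory'},
    'context': {'cache_pool': {'total_bytes_limit': 100_000_000}},
    'recheck_cached_data': 'open',
}, create=True, dtype=ts.uint16, shape=[100, 100]).result()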

fill_missing_data_reads : boolean = true

Replace missing chunks with the fill value when reading.

If disabled, reading a missing chunk will result in an error. Note that the fill value may still be used when writing a partial chunk. Typically this should only be set to false in the case that store_data_equal_to_fill_value was enabled when writing.

store_data_equal_to_fill_value : boolean = false

Store all explicitly written data, even if it is equal to the fill value.

This ensures that explicitly written data, even if it is equal to the fill value, can be distinguished from missing data. If disabled, chunks equal to the fill value may be represented as missing chunks.
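The two options above are typically used together, as in this sketch (the chunk size and values are hypothetical):

import tensorstore as ts

# Fill-value-equal chunks are stored, and truly missing chunks are errors.
arr = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'memory'},
    'store_data_equal_to_fill_value': True,
    'fill_missing_data_reads': False,
    'metadata': {'chunks': [5]},
}, create=True, dtype=ts.uint16, shape=[10]).result()
arr[0:5].write(0).result()  # chunk 0 is stored even though 0 equals the fill value
arr[0:5].read().result()    # succeeds
# arr[5:10].read() would fail, because chunk 1 was never written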

field : string | null = null

Name of field to open.

Must be specified if the metadata.dtype specified in the array metadata has more than one field.

metadata : object

Zarr array metadata.

Specifies constraints on the metadata, exactly as in the .zarray metadata file, except that all members are optional. When creating a new array, the new metadata is obtained by combining these metadata constraints with any Schema constraints.

Optional members:
zarr_format : 2
shape : array of integer[0, +∞)

Chunked dimensions of the array.

Required when creating a new array if the Schema.domain is not otherwise specified.

Example

[500, 500, 500]
chunks : array of integer[1, +∞)

Chunk dimensions.

Specifies the chunk size for each chunked dimension. Must have the same length as shape. If not specified when creating a new array, the chunk dimensions are chosen automatically according to the Schema.chunk_layout.

Example

[64, 64, 64]
dtype

Specifies the scalar or structured data type.

Refer to the Zarr data type encoding specification. As an extension, TensorStore also supports "bfloat16" for specifying the bfloat16 data type with little endian byte order. TensorStore also supports the experimental 8-bit floating-point types "float8_e4m3fn", "float8_e4m3fnuz", "float8_e4m3b11fnuz", "float8_e5m2", and "float8_e5m2fnuz" described in ml_dtypes, with little endian byte order. The 4-bit integer type, padded to 1 byte, is supported by specifying "int4".
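As a sketch, an array using the bfloat16 extension dtype can be created by specifying it in the metadata (the store and shape are hypothetical):

import tensorstore as ts

# "bfloat16" is a TensorStore extension and may not be readable by other
# zarr implementations without a registered bfloat16 dtype.
arr = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'memory'},
    'metadata': {'dtype': 'bfloat16'},
}, create=True, shape=[100]).result()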

fill_value

Specifies the fill value.

When creating a new array, defaults to null.

order : "C" | "F"

Specifies the data layout for encoded chunks.

"C" for C order, "F" for Fortran order. When creating a new array, defaults to "C".

compressor : driver/zarr/Compressor | null

Specifies the chunk compression method.

Specifies the chunk compressor. Specifying null disables compression. When creating a new array, if not specified, the default compressor of {"id": "blosc"} is used.

filters : null

Specifies the filters to apply to chunks.

When encoding a chunk, filters are applied before the compressor. Currently, filters are not supported.

dimension_separator : "." | "/"

Specifies the encoding of chunk indices into key-value store keys.

The default value of "." corresponds to the default encoding used by Zarr, while "/" corresponds to the encoding used by zarr.storage.NestedDirectoryStore.
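For example, with a hypothetical chunk grid of [10, 10], the chunk covering index [0, 10] is chunk (0, 1); it is stored under the key "0.1" with the default separator and "0/1" with "/", as this sketch illustrates:

import tensorstore as ts

# With "/", chunk (0, 1) is stored under the key "0/1" instead of "0.1".
arr = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'memory'},
    'metadata': {'dimension_separator': '/', 'chunks': [10, 10]},
}, create=True, dtype=ts.uint16, shape=[20, 20]).result()
arr[0, 10].write(1).result()  # writes chunk (0, 1), i.e. key "0/1"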

metadata_key : string = ".zarray"

Specifies the key under which to store the array metadata in JSON format.

By default, the array metadata is stored under the .zarray key as required by the Zarr storage specification. In rare cases it may be useful to specify a non-default value, e.g. "zarray" to avoid problems caused by the leading dot. However, be aware that specifying a non-default value breaks compatibility with other zarr implementations.

key_encoding : "." | "/" = "."

Specifies the encoding of chunk indices into key-value store keys.

Deprecated. Equivalent to specifying metadata.dimension_separator.

Example

{
  "driver": "zarr",
  "kvstore": {"driver": "gcs", "bucket": "my-bucket", "path": "path/to/array/"},
  "key_encoding": ".",
  "metadata": {
    "shape": [1000, 1000],
    "chunks": [100, 100],
    "dtype": "<i2",
    "order": "C",
    "compressor": {"id": "blosc", "shuffle": -1, "clevel": 5, "cname": "lz4"}
  }
}

Compressors

Chunk data is encoded according to the driver/zarr.metadata.compressor specified in the metadata.

json driver/zarr/Compressor : object

Compressor

The id member identifies the compressor. The remaining members are specific to the compressor.

Subtypes:
driver/zarr/Compressor/zlib
driver/zarr/Compressor/blosc
driver/zarr/Compressor/bz2
driver/zarr/Compressor/zstd
Required members:
id : string

Identifies the compressor.

The following compressors are supported:

json driver/zarr/Compressor/zlib : object

Specifies zlib compression with a zlib or gzip header.

Extends: driver/zarr/Compressor
Required members:
id : "zlib" | "gzip"
Optional members:
level : integer[0, 9] = 1

Specifies the zlib compression level to use.

Level 0 indicates no compression (fastest), while level 9 indicates the best compression ratio (slowest).

Example

{"id": "gzip", "level": 9}
json driver/zarr/Compressor/blosc : object

Specifies Blosc compression.

Extends: driver/zarr/Compressor
Required members:
id : "blosc"
Optional members:
cname : "blosclz" | "lz4" | "lz4hc" | "snappy" | "zlib" | "zstd" = "lz4"

Specifies the compression method used by Blosc.

clevel : integer[0, 9] = 5

Specifies the Blosc compression level to use.

Higher values are slower but achieve a higher compression ratio.

shuffle : -1 | 0 | 1 | 2 = -1

One of:

-1
  Automatic shuffle: bit-wise shuffle if the element size is 1 byte, otherwise byte-wise shuffle.

0
  No shuffle.

1
  Byte-wise shuffle.

2
  Bit-wise shuffle.

blocksize : integer[0, +∞)

Specifies the Blosc blocksize.

The default value of 0 causes the block size to be chosen automatically.

Example

{"id": "blosc", "cname": "blosclz", "clevel": 9, "shuffle": 2}
json driver/zarr/Compressor/bz2 : object

Specifies bzip2 compression.

Extends: driver/zarr/Compressor
Required members:
id : "bz2"
Optional members:
level : integer[1, 9] = 1

Specifies the bzip2 buffer size/compression level to use.

A level of 1 indicates the smallest buffer (fastest), while level 9 indicates the best compression ratio (slowest).

json driver/zarr/Compressor/zstd : object

Specifies Zstd compression.

Extends: driver/zarr/Compressor
Required members:
id : "zstd"
Optional members:
level : integer[-131072, 22] = 1

Specifies the compression level to use.

A higher compression level provides improved density but reduced compression speed.

Example

{"id": "zstd", "level": 6}

Mapping to TensorStore Schema

Example with scalar data type

For the following zarr metadata:

{
  "zarr_format": 2,
  "shape": [1000, 2000, 3000],
  "chunks": [100, 200, 300],
  "dtype": "<u2",
  "compressor": null,
  "fill_value": 42,
  "order": "C",
  "filters": null
}

the corresponding Schema is:

{
  "chunk_layout": {
    "grid_origin": [0, 0, 0],
    "inner_order": [0, 1, 2],
    "read_chunk": {"shape": [100, 200, 300]},
    "write_chunk": {"shape": [100, 200, 300]}
  },
  "codec": {"compressor": null, "driver": "zarr", "filters": null},
  "domain": {"exclusive_max": [[1000], [2000], [3000]], "inclusive_min": [0, 0, 0]},
  "dtype": "uint16",
  "fill_value": 42,
  "rank": 3
}

Example with structured data type

For the following zarr metadata:

{
  "zarr_format": 2,
  "shape": [1000, 2000, 3000],
  "chunks": [100, 200, 300],
  "dtype": [["x", "<u2", [2, 3]], ["y", "<f4", [5]]],
  "compressor": {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1},
  "fill_value": "AQACAAMABAAFAAYAAAAgQQAAMEEAAEBBAABQQQAAYEE=",
  "order": "F",
  "filters": null
}

the corresponding Schema for the "x" field is:

{
  "chunk_layout": {
    "grid_origin": [0, 0, 0, 0, 0],
    "inner_order": [2, 1, 0, 3, 4],
    "read_chunk": {"shape": [100, 200, 300, 2, 3]},
    "write_chunk": {"shape": [100, 200, 300, 2, 3]}
  },
  "codec": {
    "compressor": {"blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1},
    "driver": "zarr",
    "filters": null
  },
  "domain": {
    "exclusive_max": [[1000], [2000], [3000], 2, 3],
    "inclusive_min": [0, 0, 0, 0, 0]
  },
  "dtype": "uint16",
  "fill_value": [[1, 2, 3], [4, 5, 6]],
  "rank": 5
}

and the corresponding Schema for the "y" field is:

{
  "chunk_layout": {
    "grid_origin": [0, 0, 0, 0],
    "inner_order": [2, 1, 0, 3],
    "read_chunk": {"shape": [100, 200, 300, 5]},
    "write_chunk": {"shape": [100, 200, 300, 5]}
  },
  "codec": {
    "compressor": {"blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1},
    "driver": "zarr",
    "filters": null
  },
  "domain": {"exclusive_max": [[1000], [2000], [3000], 5], "inclusive_min": [0, 0, 0, 0]},
  "dtype": "float32",
  "fill_value": [10.0, 11.0, 12.0, 13.0, 14.0],
  "rank": 4
}

Data type

Zarr scalar data types map to TensorStore data types as follows:

Supported data types:

TensorStore data type   Zarr data type (little endian)   Zarr data type (big endian)
bool                    "|b1"                            "|b1"
int4                    "int4"                           n/a
uint8                   "|u1"                            "|u1"
int8                    "|i1"                            "|i1"
uint16                  "<u2"                            ">u2"
int16                   "<i2"                            ">i2"
uint32                  "<u4"                            ">u4"
int32                   "<i4"                            ">i4"
uint64                  "<u8"                            ">u8"
int64                   "<i8"                            ">i8"
float8_e4m3fn           "float8_e4m3fn"                  n/a
float8_e4m3fnuz         "float8_e4m3fnuz"                n/a
float8_e4m3b11fnuz      "float8_e4m3b11fnuz"             n/a
float8_e5m2             "float8_e5m2"                    n/a
float8_e5m2fnuz         "float8_e5m2fnuz"                n/a
float16                 "<f2"                            ">f2"
bfloat16                "bfloat16"                       n/a
float32                 "<f4"                            ">f4"
float64                 "<f8"                            ">f8"
complex64               "<c8"                            ">c8"
complex128              "<c16"                           ">c16"

Data types with the "|" prefix are byte-order independent; the extension types "int4", "bfloat16", and the float8 variants are always little endian.

Zarr structured data types are supported, but are represented in TensorStore as scalar arrays with additional dimensions.

When creating a new array, if a driver/zarr.metadata.dtype is not specified explicitly, a scalar Zarr data type with the native endianness is chosen based on the Schema.dtype. To create an array with non-native endianness or a structured data type, the driver/zarr.metadata.dtype must be specified explicitly.
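For example, this sketch creates a big-endian uint16 array by giving the explicit encoding ">u2" (the store and shape are hypothetical):

import tensorstore as ts

# The schema dtype alone would select native endianness; metadata.dtype
# pins the stored encoding to big endian.
arr = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'memory'},
    'metadata': {'dtype': '>u2'},
}, create=True, shape=[100, 100]).result()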

Note

TensorStore supports the non-standard bfloat16 data type as an extension. On little endian platforms, the official Zarr Python library is capable of reading arrays created with the bfloat16 data type, provided that a bfloat16 NumPy data type has been registered. The TensorStore Python library registers such a data type, as do TensorFlow and JAX.

Warning

zarr datetime/timedelta data types are not currently supported.

Domain

The Schema.domain includes both the chunked dimensions as well as any subarray dimensions in the case of a structured data type.

Example with scalar data type

If the driver/zarr.metadata.dtype is "<u2" and the driver/zarr.metadata.shape is [100, 200], then the Schema.domain is {"shape": [[100], [200]]}.

Example with structured data type

If the driver/zarr.metadata.dtype is [["x", "<u2", [2, 3]]], and the driver/zarr.metadata.shape is [100, 200], then the Schema.domain is {"shape": [[100], [200], 2, 3]}.

As zarr does not natively support a non-zero origin, the underlying domain always has a zero origin (IndexDomain.inclusive_min is all zero), but it may be translated by the transform.

The upper bounds of the chunked dimensions are resizable (i.e. implicit), while the upper bounds of any subarray dimensions are not resizable.

The zarr metadata format does not support persisting dimension labels, but dimension labels may still be specified when opening using a transform.
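As a sketch, both a non-zero origin and dimension labels can be applied through an index transform after opening (the labels and offsets are hypothetical):

import tensorstore as ts

# The stored zarr domain has a zero origin; the view below translates it
# to [10, 20] and attaches labels, without changing the stored data.
arr = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'memory'},
}, create=True, dtype=ts.uint16, shape=[100, 200]).result()
view = arr[ts.d[0, 1].label['x', 'y'].translate_to[10, 20]]
print(view.domain)  # origin [10, 20], labels ['x', 'y']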

Chunk layout

The zarr format supports a single driver/zarr.metadata.chunks property that corresponds to the ChunkLayout/Grid.shape constraint. As with the Schema.domain, the Schema.chunk_layout includes both the chunked dimensions as well as any subarray dimensions in the case of a structured data type. The chunk size for subarray dimensions is always the full extent.

Example with scalar data type

If the driver/zarr.metadata.dtype is "<u2" and driver/zarr.metadata.chunks is [100, 200], then the ChunkLayout/Grid.shape is [100, 200].

Example with structured data type

If the driver/zarr.metadata.dtype is [["x", "<u2", [2, 3]]], and driver/zarr.metadata.chunks is [100, 200], then the ChunkLayout/Grid.shape is [100, 200, 2, 3].

As the zarr format supports only a single level of chunking, the ChunkLayout.read_chunk and ChunkLayout.write_chunk constraints are combined, and hard constraints on ChunkLayout.codec_chunk must not be specified.

The ChunkLayout.grid_origin is always all-zero.

The ChunkLayout.inner_order corresponds to driver/zarr.metadata.order, but also includes the subarray dimensions, which are always the inner-most dimensions.

Example with scalar data type and C order

If the driver/zarr.metadata.dtype is "<u2", driver/zarr.metadata.order is "C", and there are 3 chunked dimensions, then the ChunkLayout.inner_order is [0, 1, 2].

Example with scalar data type and Fortran order

If the driver/zarr.metadata.dtype is "<u2", driver/zarr.metadata.order is "F", and there are 3 chunked dimensions, then the ChunkLayout.inner_order is [2, 1, 0].

Example with structured data type and C order

If the driver/zarr.metadata.dtype is [["x", "<u2", [2, 3]]], driver/zarr.metadata.order is "C", and there are 3 chunked dimensions, then the ChunkLayout.inner_order is [0, 1, 2, 3, 4].

Example with structured data type and Fortran order

If the driver/zarr.metadata.dtype is [["x", "<u2", [2, 3]]], driver/zarr.metadata.order is "F", and there are 3 chunked dimensions, then the ChunkLayout.inner_order is [2, 1, 0, 3, 4].

Selection of chunk layout when creating a new array

When creating a new array, the chunk sizes may be specified explicitly via ChunkLayout/Grid.shape or implicitly via ChunkLayout/Grid.aspect_ratio and ChunkLayout/Grid.elements. In the latter case, a suitable chunk shape is chosen automatically. If ChunkLayout/Grid.elements is not specified, the default is 1 million elements per chunk:

Example of unconstrained chunk layout

>>> ts.open({
...     'driver': 'zarr',
...     'kvstore': {
...         'driver': 'memory'
...     }
... },
...         create=True,
...         dtype=ts.uint16,
...         shape=[1000, 2000, 3000]).result().chunk_layout
ChunkLayout({
  'grid_origin': [0, 0, 0],
  'inner_order': [0, 1, 2],
  'read_chunk': {'shape': [101, 101, 101]},
  'write_chunk': {'shape': [101, 101, 101]},
})

Example of explicit chunk shape constraint

>>> ts.open({
...     'driver': 'zarr',
...     'kvstore': {
...         'driver': 'memory'
...     }
... },
...         create=True,
...         dtype=ts.uint16,
...         shape=[1000, 2000, 3000],
...         chunk_layout=ts.ChunkLayout(
...             chunk_shape=[100, 200, 300])).result().chunk_layout
ChunkLayout({
  'grid_origin': [0, 0, 0],
  'inner_order': [0, 1, 2],
  'read_chunk': {'shape': [100, 200, 300]},
  'write_chunk': {'shape': [100, 200, 300]},
})

Example of chunk aspect ratio constraint

>>> ts.open({
...     'driver': 'zarr',
...     'kvstore': {
...         'driver': 'memory'
...     }
... },
...         create=True,
...         dtype=ts.uint16,
...         shape=[1000, 2000, 3000],
...         chunk_layout=ts.ChunkLayout(
...             chunk_aspect_ratio=[1, 2, 2])).result().chunk_layout
ChunkLayout({
  'grid_origin': [0, 0, 0],
  'inner_order': [0, 1, 2],
  'read_chunk': {'shape': [64, 128, 128]},
  'write_chunk': {'shape': [64, 128, 128]},
})

Example of chunk aspect ratio and elements constraint

>>> ts.open({
...     'driver': 'zarr',
...     'kvstore': {
...         'driver': 'memory'
...     }
... },
...         create=True,
...         dtype=ts.uint16,
...         shape=[1000, 2000, 3000],
...         chunk_layout=ts.ChunkLayout(
...             chunk_aspect_ratio=[1, 2, 2],
...             chunk_elements=2000000)).result().chunk_layout
ChunkLayout({
  'grid_origin': [0, 0, 0],
  'inner_order': [0, 1, 2],
  'read_chunk': {'shape': [79, 159, 159]},
  'write_chunk': {'shape': [79, 159, 159]},
})

Codec

Within the Schema.codec, the compression parameters are represented in the same way as in the metadata:

json driver/zarr/Codec : object
Extends: Codec
Required members:
driver : "zarr"
Optional members:
compressor : driver/zarr/Compressor | null

Specifies the chunk compression method.

Specifies the chunk compressor. Specifying null disables compression. When creating a new array, if not specified, the default compressor of {"id": "blosc"} is used.

filters : null

Specifies the filters to apply to chunks.

When encoding a chunk, filters are applied before the compressor. Currently, filters are not supported.

It is an error to specify any other Codec.driver.
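As a sketch, the codec can be supplied as a schema constraint when creating an array (the compressor choice here is hypothetical):

import tensorstore as ts

# The zstd compressor is requested through the schema codec rather than
# through the zarr metadata directly.
arr = ts.open(
    {'driver': 'zarr', 'kvstore': {'driver': 'memory'}},
    create=True, dtype=ts.uint16, shape=[100, 100],
    codec=ts.CodecSpec({'driver': 'zarr', 'compressor': {'id': 'zstd', 'level': 6}}),
).result()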

Fill value

For scalar zarr data types, the Schema.fill_value must be a scalar (rank 0). For structured data types, the Schema.fill_value must be broadcastable to the subarray shape.

As an optimization, chunks that are entirely equal to the fill value are not stored.

The zarr format allows the fill value to be unspecified, indicated by a driver/zarr.metadata.fill_value of null. In that case, TensorStore always uses a fill value of 0. However, in this case explicitly-written all-zero chunks are still stored.
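The following sketch shows the fill-value behavior (the value 42 is hypothetical): unwritten regions read back as the fill value without any chunk being stored.

import tensorstore as ts

# No chunk covering [0, 0] exists, yet the read returns the fill value 42.
arr = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'memory'},
    'metadata': {'fill_value': 42},
}, create=True, dtype=ts.uint16, shape=[10, 10]).result()
print(arr[0, 0].read().result())  # 42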

Limitations

Filters are not supported.