zarr2 Driver¶
Zarr v2 is a chunked array storage format based on the NumPy data type model.
The zarr2 driver provides access to Zarr-v2-format arrays backed by any
supported Key-Value Storage Layer. It supports reading, writing, creating new
arrays, and resizing arrays.
Zarr supports arrays with structured data types specifying multiple named fields that are packed together. TensorStore fully supports such arrays, but each field must be opened separately.
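For example, a structured array with fields "x" and "y" can be created and its "x" field opened as follows. This is a minimal sketch, assuming the tensorstore Python package and an in-memory kvstore; the field spec member is documented below.
>>> import tensorstore as ts
>>> # Each field of a structured dtype is a separate TensorStore,
>>> # selected via the `field` spec member (sketch).
>>> store_x = ts.open({
...     'driver': 'zarr2',
...     'kvstore': 'memory://',
...     'field': 'x',
...     'metadata': {
...         'shape': [100],
...         'dtype': [['x', '<u2'], ['y', '<f4']],
...     },
... }, create=True).result()
>>> store_x.dtype
dtype("uint16")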
- json driver/zarr2 : object¶
- Extends:¶
- Required members:¶
- driver : "zarr2" | "zarr"¶
- kvstore : KvStore | KvStoreUrl¶
Base key-value store for the TensorStore.
- Optional members:¶
- rank : integer[0, 32]¶
Specifies the rank of the TensorStore.
If transform is also specified, the input rank must match. Otherwise, the rank constraint applies to the driver directly.
- transform : IndexTransform¶
Specifies a transform.
- schema : Schema¶
Specifies constraints on the schema.
When opening an existing array, specifies constraints on the existing schema; opening will fail if the constraints do not match. Any soft constraints specified in the chunk_layout are ignored. When creating a new array, a suitable schema will be selected automatically based on the specified schema constraints in combination with any driver-specific constraints.
- path : string = ""¶
Additional path relative to kvstore.
This is joined as an additional "/"-separated path component after any path member directly within kvstore. This is supported for backwards compatibility only; the KvStore.path member should be used instead.
Example
"path/to/data"
- cache_pool : ContextResource = "cache_pool"¶
Cache pool for data.
Specifies or references a previously defined Context.cache_pool. It is normally more convenient to specify a default cache_pool in the context.
- data_copy_concurrency : ContextResource = "data_copy_concurrency"¶
Specifies or references a previously defined Context.data_copy_concurrency. It is normally more convenient to specify a default data_copy_concurrency in the context.
- recheck_cached_data : CacheRevalidationBound = true¶
Time after which cached data is assumed to be fresh. Cached data older than the specified time is revalidated prior to being returned from a read operation. Partial chunk writes are always consistent regardless of the value of this option.
The default value of true means that cached data is revalidated on every read. To enable in-memory data caching, you must both specify a cache_pool with a non-zero total_bytes_limit and also specify false, "open", or an explicit time bound for recheck_cached_data.
- open : boolean¶
Open an existing TensorStore. If neither open nor create is specified, defaults to true.
- create : boolean = false¶
Create a new TensorStore. Specify true for both open and create to permit either opening an existing TensorStore or creating a new TensorStore if it does not already exist.
- delete_existing : boolean = false¶
Delete any existing data at the specified path before creating a new TensorStore. Requires that create is true, and that open is false.
- assume_metadata : boolean = false¶
Neither read nor write stored metadata. Instead, just assume any necessary metadata based on constraints in the spec, using the same defaults for any unspecified metadata as when creating a new TensorStore. The stored metadata need not even exist. Operations such as resizing that modify the stored metadata are not supported. Requires that open is true and delete_existing is false. This option takes precedence over assume_cached_metadata if that option is also specified.
Warning
This option can lead to data corruption if the assumed metadata does not match the stored metadata, or multiple concurrent writers use different assumed metadata.
- assume_cached_metadata : boolean = false¶
Skip reading the metadata when opening. Instead, just assume any necessary metadata based on constraints in the spec, using the same defaults for any unspecified metadata as when creating a new TensorStore. The stored metadata may still be accessed by subsequent operations that need to re-validate or modify the metadata. Requires that open is true and delete_existing is false. The assume_metadata option takes precedence if also specified.
Note
Unlike the assume_metadata option, operations such as resizing that modify the stored metadata are supported (and access the stored metadata).
Warning
This option can lead to data corruption if the assumed metadata does not match the stored metadata, or multiple concurrent writers use different assumed metadata.
- metadata_cache_pool : ContextResource¶
Cache pool for metadata only.
Specifies or references a previously defined Context.cache_pool. If not specified, defaults to the value of cache_pool.
- recheck_cached_metadata : CacheRevalidationBound = "open"¶
Time after which cached metadata is assumed to be fresh. Cached metadata older than the specified time is revalidated prior to use. The metadata is used to check the bounds of every read or write operation.
Specifying true means that the metadata will be revalidated prior to every read or write operation. With the default value of "open", any cached metadata is revalidated when the TensorStore is opened but is not rechecked for each read or write operation.
- fill_missing_data_reads : boolean = true¶
Replace missing chunks with the fill value when reading.
If disabled, reading a missing chunk will result in an error. Note that the fill value may still be used when writing a partial chunk. Typically this should only be set to false in the case that store_data_equal_to_fill_value was enabled when writing.
- store_data_equal_to_fill_value : boolean = false¶
Store all explicitly written data, even if it is equal to the fill value.
This ensures that explicitly written data, even if it is equal to the fill value, can be distinguished from missing data. If disabled, chunks equal to the fill value may be represented as missing chunks.
- field : string | null = null¶
Name of field to open.
Must be specified if the metadata.dtype specified in the array metadata has more than one field.
- metadata : object¶
Zarr array metadata.
Specifies constraints on the metadata, exactly as in the .zarray metadata file, except that all members are optional. When creating a new array, the new metadata is obtained by combining these metadata constraints with any Schema constraints.
- Optional members:¶
- zarr_format : 2¶
- shape : array of integer[0, +∞)¶
Chunked dimensions of the array.
Required when creating a new array if the Schema.domain is not otherwise specified.
Example
[500, 500, 500]
- chunks : array of integer[1, +∞)¶
Chunk dimensions.
Specifies the chunk size for each chunked dimension. Must have the same length as shape. If not specified when creating a new array, the chunk dimensions are chosen automatically according to the Schema.chunk_layout.
Example
[64, 64, 64]
- dtype¶
Specifies the scalar or structured data type.
Refer to the Zarr data type encoding specification. As an extension, TensorStore also supports "bfloat16" for specifying the bfloat16 data type with little endian byte order. TensorStore also supports the experimental 8-bit floating-point types "float8_e3m4", "float8_e4m3fn", "float8_e4m3fnuz", "float8_e4m3b11fnuz", "float8_e5m2", and "float8_e5m2fnuz" described in ml_dtypes, with little endian byte order. The 2-bit and 4-bit integer types, padded to 1 byte, are supported by specifying "int2" and "int4".
- fill_value¶
Specifies the fill value.
When creating a new array, defaults to null.
- order : "C" | "F"¶
Specifies the data layout for encoded chunks.
"C" for C order, "F" for Fortran order. When creating a new array, defaults to "C".
- compressor : driver/zarr2/Compressor | null¶
Specifies the chunk compression method.
Specifying null disables compression. When creating a new array, if not specified, the default compressor of {"id": "blosc"} is used.
- filters : null¶
Specifies the filters to apply to chunks.
When encoding a chunk, filters are applied before the compressor. Currently, filters are not supported.
- dimension_separator : "." | "/"¶
Specifies the encoding of chunk indices into key-value store keys.
The default value of "." corresponds to the default encoding used by Zarr, while "/" corresponds to the encoding used by zarr.storage.NestedDirectoryStore.
- metadata_key : string = ".zarray"¶
Specifies the key under which to store the array metadata in JSON format.
By default, the array metadata is stored under the .zarray key as required by the Zarr storage specification. In rare cases it may be useful to specify a non-default value, e.g. "zarray" to avoid problems caused by the leading dot. However, be aware that specifying a non-default value breaks compatibility with other Zarr implementations.
- key_encoding : "." | "/" = "."¶
Specifies the encoding of chunk indices into key-value store keys.
Deprecated. Equivalent to specifying metadata.dimension_separator.
Example
{ "driver": "zarr2", "kvstore": {"driver": "gcs", "bucket": "my-bucket", "path": "path/to/array/"}, "key_encoding": ".", "metadata": { "shape": [1000, 1000], "chunks": [100, 100], "dtype": "<i2", "order": "C", "compressor": {"id": "blosc", "shuffle": -1, "clevel": 5, "cname": "lz4"} } }
- json TensorStoreUrl/zarr2 : string¶
zarr2: TensorStore URL scheme
Zarr v2 arrays may be specified using the zarr2:path URL syntax.
The path, if any, specified within the zarr2:path URL component is simply joined with the path specified by the base KvStore URL, but is intended to be used only for specifying the path to an array within a zarr v2 hierarchy.
Examples
URL representation: "file:///tmp/dataset.zarr/|zarr2:"
JSON representation: {"driver": "zarr2", "kvstore": {"driver": "file", "path": "/tmp/dataset.zarr/"}}
URL representation: "file:///tmp/dataset.zarr|zarr2:path/within/hierarchy"
JSON representation: {"driver": "zarr2", "kvstore": {"driver": "file", "path": "/tmp/dataset.zarr/path/within/hierarchy/"}}
- Extends:¶
TensorStoreUrl — URL representation of a TensorStore to open.
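As a sketch (assuming /tmp/dataset.zarr is writable), such a URL can be passed directly to ts.open in place of a JSON spec:
>>> # Open via the TensorStore URL syntax rather than a JSON spec.
>>> store = ts.open('file:///tmp/dataset.zarr/|zarr2:',
...                 create=True,
...                 dtype=ts.uint8,
...                 shape=[100]).result()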
Compressors¶
Chunk data is encoded according to the driver/zarr2.metadata.compressor specified in the metadata.
- json driver/zarr2/Compressor : object¶
Compressor
The id member identifies the compressor. The remaining members are specific to the compressor.
- Subtypes:¶
The following compressors are supported:
- json driver/zarr2/Compressor/zlib : object¶
Specifies zlib compression with a zlib or gzip header.
- Extends:¶
driver/zarr2/Compressor — Compressor
- Optional members:¶
- level : integer[0, 9] = 1¶
Specifies the zlib compression level to use.
Level 0 indicates no compression (fastest), while level 9 indicates the best compression ratio (slowest).
Example
{"id": "gzip", "level": 9}
- json driver/zarr2/Compressor/blosc : object¶
Specifies Blosc compression.
- Extends:¶
driver/zarr2/Compressor — Compressor
- Optional members:¶
- cname : "blosclz" | "lz4" | "lz4hc" | "snappy" | "zlib" | "zstd" = "lz4"¶
Specifies the compression method used by Blosc.
- clevel : integer[0, 9] = 5¶
Specifies the Blosc compression level to use.
Higher values are slower but achieve a higher compression ratio.
- shuffle : -1 | 0 | 1 | 2 = -1¶
- One of:¶
- -1: Automatic shuffle. Bit-wise shuffle if the element size is 1 byte, otherwise byte-wise shuffle.
- 0: No shuffle.
- 1: Byte-wise shuffle.
- 2: Bit-wise shuffle.
- blocksize : integer[0, +∞)¶
Specifies the Blosc blocksize.
The default value of 0 causes the block size to be chosen automatically.
Example
{"id": "blosc", "cname": "blosclz", "clevel": 9, "shuffle": 2}
- json driver/zarr2/Compressor/bz2 : object¶
Specifies bzip2 compression.
- Extends:¶
driver/zarr2/Compressor — Compressor
- json driver/zarr2/Compressor/zstd : object¶
Specifies Zstd compression.
- Extends:¶
driver/zarr2/Compressor — Compressor
- Optional members:¶
- level : integer[-131072, 22] = 1¶
Specifies the compression level to use.
A higher compression level provides improved density but reduced compression speed.
Example
{"id": "zstd", "level": 6}
Mapping to TensorStore Schema¶
Example with scalar data type
For the following zarr metadata:
{
"zarr_format": 2,
"shape": [1000, 2000, 3000],
"chunks": [100, 200, 300],
"dtype": "<u2",
"compressor": null,
"fill_value": 42,
"order": "C",
"filters": null
}
the corresponding Schema is:
{
"chunk_layout": {
"grid_origin": [0, 0, 0],
"inner_order": [0, 1, 2],
"read_chunk": {"shape": [100, 200, 300]},
"write_chunk": {"shape": [100, 200, 300]}
},
"codec": {"compressor": null, "driver": "zarr", "filters": null},
"domain": {"exclusive_max": [[1000], [2000], [3000]], "inclusive_min": [0, 0, 0]},
"dtype": "uint16",
"fill_value": 42,
"rank": 3
}
Example with structured data type
For the following zarr metadata:
{
"zarr_format": 2,
"shape": [1000, 2000, 3000],
"chunks": [100, 200, 300],
"dtype": [["x", "<u2", [2, 3]], ["y", "<f4", [5]]],
"compressor": {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1},
"fill_value": "AQACAAMABAAFAAYAAAAgQQAAMEEAAEBBAABQQQAAYEE=",
"order": "F",
"filters": null
}
the corresponding Schema for the "x"
field is:
{
"chunk_layout": {
"grid_origin": [0, 0, 0, 0, 0],
"inner_order": [2, 1, 0, 3, 4],
"read_chunk": {"shape": [100, 200, 300, 2, 3]},
"write_chunk": {"shape": [100, 200, 300, 2, 3]}
},
"codec": {
"compressor": {"blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1},
"driver": "zarr",
"filters": null
},
"domain": {
"exclusive_max": [[1000], [2000], [3000], 2, 3],
"inclusive_min": [0, 0, 0, 0, 0]
},
"dtype": "uint16",
"fill_value": [[1, 2, 3], [4, 5, 6]],
"rank": 5
}
and the corresponding Schema for the "y"
field is:
{
"chunk_layout": {
"grid_origin": [0, 0, 0, 0],
"inner_order": [2, 1, 0, 3],
"read_chunk": {"shape": [100, 200, 300, 5]},
"write_chunk": {"shape": [100, 200, 300, 5]}
},
"codec": {
"compressor": {"blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1},
"driver": "zarr",
"filters": null
},
"domain": {"exclusive_max": [[1000], [2000], [3000], 5], "inclusive_min": [0, 0, 0, 0]},
"dtype": "float32",
"fill_value": [10.0, 11.0, 12.0, 13.0, 14.0],
"rank": 4
}
Data type¶
Zarr scalar data types map to TensorStore data types as follows:
- bool: "|b1"
- int2: "int2" (experimental extension)
- int4: "int4" (experimental extension)
- int8: "|i1"
- uint8: "|u1"
- int16: "<i2" (little endian) or ">i2" (big endian)
- uint16: "<u2" or ">u2"
- int32: "<i4" or ">i4"
- uint32: "<u4" or ">u4"
- int64: "<i8" or ">i8"
- uint64: "<u8" or ">u8"
- float16: "<f2" or ">f2"
- bfloat16: "bfloat16" (extension; little endian only)
- float32: "<f4" or ">f4"
- float64: "<f8" or ">f8"
- complex64: "<c8" or ">c8"
- complex128: "<c16" or ">c16"
- float8_e3m4, float8_e4m3fn, float8_e4m3fnuz, float8_e4m3b11fnuz, float8_e5m2, float8_e5m2fnuz: specified by name (experimental extension; little endian only)
Zarr structured data types are supported, but are represented in TensorStore as scalar arrays with additional dimensions.
When creating a new array, if a driver/zarr2.metadata.dtype is not specified explicitly, a scalar Zarr data type with the native endianness is chosen based on the Schema.dtype. To create an array with non-native endianness or with a structured data type, the driver/zarr2.metadata.dtype must be specified explicitly.
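For example, on a little endian platform a big endian array can be created by specifying the metadata dtype explicitly (a sketch assuming an in-memory kvstore):
>>> # ">u2" forces big endian storage; the in-memory data type is still uint16.
>>> store = ts.open({
...     'driver': 'zarr2',
...     'kvstore': 'memory://',
...     'metadata': {'dtype': '>u2'},
... }, create=True, shape=[100]).result()
>>> store.dtype
dtype("uint16")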
Note
TensorStore supports the non-standard bfloat16 data type as
an extension. On little endian platforms, the official Zarr Python library is capable of reading
arrays created with the bfloat16 data type provided that a bfloat16 numpy
data type has been registered. The TensorStore Python library registers such
a data type, as do TensorFlow and JAX.
Warning
zarr datetime/timedelta data types are not currently supported.
Domain¶
The Schema.domain includes both the chunked dimensions as well as
any subarray dimensions in the case of a structured data type.
Example with scalar data type
If the driver/zarr2.metadata.dtype is "<u2" and the
driver/zarr2.metadata.shape is [100, 200], then the
Schema.domain is {"shape": [[100], [200]]}.
Example with structured data type
If the driver/zarr2.metadata.dtype is [["x", "<u2", [2,
3]]], and the driver/zarr2.metadata.shape is [100,
200], then the Schema.domain is {"shape": [[100],
[200], 2, 3]}.
As zarr does not natively support a non-zero origin, the underlying domain
always has a zero origin (IndexDomain.inclusive_min is all zero),
but it may be translated by the transform.
The upper bounds of the chunked dimensions are resizable (i.e. implicit), while the upper bounds of any subarray dimensions are not resizable.
The zarr metadata format does not support persisting dimension
labels, but dimension labels may still be specified when
opening using a transform.
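For example, labels can be supplied via the schema domain when opening; they exist only in-process and are not persisted to the .zarray metadata (a sketch assuming an in-memory kvstore):
>>> # zarr v2 cannot store dimension labels; these live only in the spec.
>>> store = ts.open({
...     'driver': 'zarr2',
...     'kvstore': 'memory://',
... },
...     create=True,
...     dtype=ts.uint16,
...     domain=ts.IndexDomain(shape=[100, 200], labels=['x', 'y'])).result()
>>> store.domain.labels
('x', 'y')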
Chunk layout¶
The zarr format supports a single driver/zarr2.metadata.chunks
property that corresponds to the ChunkLayout/Grid.shape
constraint. As with the Schema.domain, the
Schema.chunk_layout includes both the chunked dimensions as well
as any subarray dimensions in the case of a structured data type. The
chunk size for subarray dimensions is always the full extent.
Example with scalar data type
If the driver/zarr2.metadata.dtype is "<u2" and
driver/zarr2.metadata.chunks is [100, 200], then the
ChunkLayout/Grid.shape is [100, 200].
Example with structured data type
If the driver/zarr2.metadata.dtype is [["x", "<u2", [2,
3]]], and driver/zarr2.metadata.chunks is [100, 200],
then the ChunkLayout/Grid.shape is [100, 200, 2, 3].
As the zarr format supports only a single level of chunking, the
ChunkLayout.read_chunk and ChunkLayout.write_chunk
constraints are combined, and hard constraints on
ChunkLayout.codec_chunk must not be specified.
The ChunkLayout.grid_origin is always all-zero.
The ChunkLayout.inner_order corresponds to
driver/zarr2.metadata.order, but also includes the subarray
dimensions, which are always the inner-most dimensions.
Example with scalar data type and C order
If the driver/zarr2.metadata.dtype is "<u2",
driver/zarr2.metadata.order is "C", and there are 3
chunked dimensions, then the ChunkLayout.inner_order is
[0, 1, 2].
Example with scalar data type and Fortran order
If the driver/zarr2.metadata.dtype is "<u2",
driver/zarr2.metadata.order is "F", and there are 3
chunked dimensions, then the ChunkLayout.inner_order is
[2, 1, 0].
Example with structured data type and C order
If the driver/zarr2.metadata.dtype is [["x", "<u2", [2,
3]]], driver/zarr2.metadata.order is "C", and there
are 3 chunked dimensions, then the ChunkLayout.inner_order is
[0, 1, 2, 3, 4].
Example with structured data type and Fortran order
If the driver/zarr2.metadata.dtype is [["x", "<u2", [2,
3]]], driver/zarr2.metadata.order is "F", and there
are 3 chunked dimensions, then the ChunkLayout.inner_order is
[2, 1, 0, 3, 4].
Selection of chunk layout when creating a new array¶
When creating a new array, the chunk sizes may be specified explicitly via
ChunkLayout/Grid.shape or implicitly via
ChunkLayout/Grid.aspect_ratio and
ChunkLayout/Grid.elements. In the latter case, a suitable chunk
shape is chosen automatically. If ChunkLayout/Grid.elements is
not specified, the default is 1 million elements per chunk:
Example of unconstrained chunk layout
>>> ts.open({
... 'driver': 'zarr',
... 'kvstore': {
... 'driver': 'memory'
... }
... },
... create=True,
... dtype=ts.uint16,
... shape=[1000, 2000, 3000]).result().chunk_layout
ChunkLayout({
'grid_origin': [0, 0, 0],
'inner_order': [0, 1, 2],
'read_chunk': {'shape': [101, 101, 101]},
'write_chunk': {'shape': [101, 101, 101]},
})
Example of explicit chunk shape constraint
>>> ts.open({
... 'driver': 'zarr',
... 'kvstore': {
... 'driver': 'memory'
... }
... },
... create=True,
... dtype=ts.uint16,
... shape=[1000, 2000, 3000],
... chunk_layout=ts.ChunkLayout(
... chunk_shape=[100, 200, 300])).result().chunk_layout
ChunkLayout({
'grid_origin': [0, 0, 0],
'inner_order': [0, 1, 2],
'read_chunk': {'shape': [100, 200, 300]},
'write_chunk': {'shape': [100, 200, 300]},
})
Example of chunk aspect ratio constraint
>>> ts.open({
... 'driver': 'zarr',
... 'kvstore': {
... 'driver': 'memory'
... }
... },
... create=True,
... dtype=ts.uint16,
... shape=[1000, 2000, 3000],
... chunk_layout=ts.ChunkLayout(
... chunk_aspect_ratio=[1, 2, 2])).result().chunk_layout
ChunkLayout({
'grid_origin': [0, 0, 0],
'inner_order': [0, 1, 2],
'read_chunk': {'shape': [64, 128, 128]},
'write_chunk': {'shape': [64, 128, 128]},
})
Example of chunk aspect ratio and elements constraint
>>> ts.open({
... 'driver': 'zarr',
... 'kvstore': {
... 'driver': 'memory'
... }
... },
... create=True,
... dtype=ts.uint16,
... shape=[1000, 2000, 3000],
... chunk_layout=ts.ChunkLayout(
... chunk_aspect_ratio=[1, 2, 2],
... chunk_elements=2000000)).result().chunk_layout
ChunkLayout({
'grid_origin': [0, 0, 0],
'inner_order': [0, 1, 2],
'read_chunk': {'shape': [79, 159, 159]},
'write_chunk': {'shape': [79, 159, 159]},
})
Codec¶
Within the Schema.codec, the compression parameters are
represented in the same way as in the metadata:
- json driver/zarr2/Codec : object¶
- Optional members:¶
- compressor : driver/zarr2/Compressor | null¶
Specifies the chunk compression method.
Specifying null disables compression. When creating a new array, if not specified, the default compressor of {"id": "blosc"} is used.
- filters : null¶
Specifies the filters to apply to chunks.
When encoding a chunk, filters are applied before the compressor. Currently, filters are not supported.
It is an error to specify any other Codec.driver.
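For example, compression can be constrained through the schema instead of the metadata. A minimal sketch, assuming an in-memory kvstore:
>>> # Disable compression via a schema codec constraint.
>>> store = ts.open({
...     'driver': 'zarr2',
...     'kvstore': 'memory://',
... },
...     create=True,
...     dtype=ts.uint16,
...     shape=[100],
...     codec=ts.CodecSpec({'driver': 'zarr', 'compressor': None})).result()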
Fill value¶
For scalar zarr data types, the Schema.fill_value must be a
scalar (rank 0). For structured data types, the
Schema.fill_value must be broadcastable to the subarray shape.
As an optimization, chunks that are entirely equal to the fill value are not stored.
The zarr format allows the fill value to be unspecified, indicated by a
driver/zarr2.metadata.fill_value of null. In that case,
TensorStore always uses a fill value of 0. However, in this case
explicitly-written all-zero chunks are still stored.
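As a sketch of this behavior (assuming an in-memory kvstore), a chunk that was never written reads back as the fill value:
>>> # Missing chunks read back as the fill value (42 here).
>>> store = ts.open({
...     'driver': 'zarr2',
...     'kvstore': 'memory://',
...     'metadata': {'fill_value': 42},
... },
...     create=True,
...     dtype=ts.uint16,
...     shape=[10]).result()
>>> int(store[0].read().result())
42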
Auto detection¶
This driver supports auto-detection based on the
presence of the .zarray file.
Limitations¶
Filters are not supported.