zarr Driver
Zarr v2 is a chunked array storage format based on the NumPy data type model. The zarr driver provides access to Zarr-v2-format arrays backed by any supported Key-Value Storage Layer. It supports reading, writing, creating new arrays, and resizing arrays.
Zarr supports arrays with structured data types specifying multiple named fields that are packed together. TensorStore fully supports such arrays, but each field must be opened separately.
- json driver/zarr : object
  - Required members:
    - driver : "zarr"
    - kvstore : KvStore | KvStoreUrl
      Specifies the underlying storage mechanism.
  - Optional members:
    - rank : integer[0, 32]
      Specifies the rank of the TensorStore.
      If transform is also specified, the input rank must match. Otherwise, the rank constraint applies to the driver directly.
    - transform : IndexTransform
      Specifies a transform.
    - schema : Schema
      Specifies constraints on the schema.
      When opening an existing array, specifies constraints on the existing schema; opening will fail if the constraints do not match. Any soft constraints specified in the chunk_layout are ignored. When creating a new array, a suitable schema will be selected automatically based on the specified schema constraints in combination with any driver-specific constraints.
    - path : string = ""
      This is joined as an additional "/"-separated path component after any path member directly within kvstore. This is supported for backwards compatibility only; the KvStore.path member should be used instead.
      Example: "path/to/data"
    - open : boolean
      Open an existing TensorStore. If neither open nor create is specified, defaults to true.
    - create : boolean = false
      Create a new TensorStore. Specify true for both open and create to permit either opening an existing TensorStore or creating a new TensorStore if it does not already exist.
    - delete_existing : boolean = false
      Delete any existing data at the specified path before creating a new TensorStore. Requires that create is true, and that open is false.
    - assume_metadata : boolean = false
      Neither read nor write stored metadata. Instead, just assume any necessary metadata based on constraints in the spec, using the same defaults for any unspecified metadata as when creating a new TensorStore. The stored metadata need not even exist. Operations such as resizing that modify the stored metadata are not supported. Requires that open is true and delete_existing is false. This option takes precedence over assume_cached_metadata if that option is also specified.
      Warning
      This option can lead to data corruption if the assumed metadata does not match the stored metadata, or multiple concurrent writers use different assumed metadata.
    - assume_cached_metadata : boolean = false
      Skip reading the metadata when opening. Instead, just assume any necessary metadata based on constraints in the spec, using the same defaults for any unspecified metadata as when creating a new TensorStore. The stored metadata may still be accessed by subsequent operations that need to re-validate or modify the metadata. Requires that open is true and delete_existing is false. The assume_metadata option takes precedence if also specified.
      Note
      Unlike the assume_metadata option, operations such as resizing that modify the stored metadata are supported (and access the stored metadata).
      Warning
      This option can lead to data corruption if the assumed metadata does not match the stored metadata, or multiple concurrent writers use different assumed metadata.
    - cache_pool : ContextResource = "cache_pool"
      Cache pool for data.
      Specifies or references a previously defined Context.cache_pool. It is normally more convenient to specify a default cache_pool in the context.
    - metadata_cache_pool : ContextResource
      Cache pool for metadata only.
      Specifies or references a previously defined Context.cache_pool. If not specified, defaults to the value of cache_pool.
    - data_copy_concurrency : ContextResource = "data_copy_concurrency"
      Specifies or references a previously defined Context.data_copy_concurrency. It is normally more convenient to specify a default data_copy_concurrency in the context.
    - recheck_cached_metadata : CacheRevalidationBound = "open"
      Time after which cached metadata is assumed to be fresh. Cached metadata older than the specified time is revalidated prior to use. The metadata is used to check the bounds of every read or write operation.
      Specifying true means that the metadata will be revalidated prior to every read or write operation. With the default value of "open", any cached metadata is revalidated when the TensorStore is opened but is not rechecked for each read or write operation.
    - recheck_cached_data : CacheRevalidationBound = true
      Time after which cached data is assumed to be fresh. Cached data older than the specified time is revalidated prior to being returned from a read operation. Partial chunk writes are always consistent regardless of the value of this option.
      The default value of true means that cached data is revalidated on every read. To enable in-memory data caching, you must both specify a cache_pool with a non-zero total_bytes_limit and also specify false, "open", or an explicit time bound for recheck_cached_data.
    - fill_missing_data_reads = true
      Replace missing chunks with the fill value when reading.
      If disabled, reading a missing chunk will result in an error. Note that the fill value may still be used when writing a partial chunk. Typically this should only be set to false in the case that store_data_equal_to_fill_value was enabled when writing.
    - store_data_equal_to_fill_value = false
      Store all explicitly written data, even if it is equal to the fill value.
      This ensures that explicitly written data, even if it is equal to the fill value, can be distinguished from missing data. If disabled, chunks equal to the fill value may be represented as missing chunks.
    - field : string | null = null
      Name of field to open.
      Must be specified if the metadata.dtype specified in the array metadata has more than one field.
    - metadata : object
      Zarr array metadata.
      Specifies constraints on the metadata, exactly as in the .zarray metadata file, except that all members are optional. When creating a new array, the new metadata is obtained by combining these metadata constraints with any Schema constraints.
      - Optional members:
        - zarr_format : 2
        - shape : array of integer[0, +∞)
          Chunked dimensions of the array.
          Required when creating a new array if the Schema.domain is not otherwise specified.
          Example: [500, 500, 500]
        - chunks : array of integer[1, +∞)
          Chunk dimensions.
          Specifies the chunk size for each chunked dimension. Must have the same length as shape. If not specified when creating a new array, the chunk dimensions are chosen automatically according to the Schema.chunk_layout.
          Example: [64, 64, 64]
        - dtype
          Specifies the scalar or structured data type.
          Refer to the Zarr data type encoding specification. As an extension, TensorStore also supports "bfloat16" for specifying the bfloat16 data type with little endian byte order. TensorStore also supports experimental 8-bit floating-point types, including "float8_e4m3fn", "float8_e4m3fnuz", "float8_e4m3b11fnuz", "float8_e5m2", and "float8_e5m2fnuz", as described in ml_dtypes, with little endian byte order. A 4-bit integer type, padded to 1 byte, is supported as "int4".
        - fill_value
          Specifies the fill value.
          When creating a new array, defaults to null.
        - order : "C" | "F"
          Specifies the data layout for encoded chunks.
          "C" for C order, "F" for Fortran order. When creating a new array, defaults to "C".
        - compressor : driver/zarr/Compressor | null
          Specifies the chunk compression method.
          Specifying null disables compression. When creating a new array, if not specified, the default compressor of {"id": "blosc"} is used.
        - filters : null
          Specifies the filters to apply to chunks.
          When encoding a chunk, filters are applied before the compressor. Currently, filters are not supported.
        - dimension_separator : "." | "/"
          Specifies the encoding of chunk indices into key-value store keys.
          The default value of "." corresponds to the default encoding used by Zarr, while "/" corresponds to the encoding used by zarr.storage.NestedDirectoryStore.
    - metadata_key : string = ".zarray"
      Specifies the key under which to store the array metadata in JSON format.
      By default, the array metadata is stored under the .zarray key as required by the Zarr storage specification. In rare cases it may be useful to specify a non-default value, e.g. "zarray" to avoid problems caused by the leading dot. However, be aware that specifying a non-default value breaks compatibility with other zarr implementations.
    - key_encoding : "." | "/" = "."
      Specifies the encoding of chunk indices into key-value store keys.
      Deprecated. Equivalent to specifying metadata.dimension_separator.
Example
{
  "driver": "zarr",
  "kvstore": {"driver": "gcs", "bucket": "my-bucket", "path": "path/to/array/"},
  "key_encoding": ".",
  "metadata": {
    "shape": [1000, 1000],
    "chunks": [100, 100],
    "dtype": "<i2",
    "order": "C",
    "compressor": {"id": "blosc", "shuffle": -1, "clevel": 5, "cname": "lz4"}
  }
}
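The requirements that the open, create, delete_existing, assume_metadata, and assume_cached_metadata members place on each other can be easier to follow as code. The following is a minimal sketch that restates those rules over a plain spec dictionary. The helper name resolve_open_mode is hypothetical and is not part of TensorStore, and the sketch assumes that open is false whenever create is given without open.

```python
def resolve_open_mode(spec: dict) -> dict:
    """Resolve and check the open-mode flags of a zarr driver spec (illustrative only)."""
    create = spec.get("create", False)
    # `open` defaults to true only when neither `open` nor `create` is specified.
    # Assumption: when `create` is given without `open`, `open` is treated as false.
    if "open" not in spec and "create" not in spec:
        open_ = True
    else:
        open_ = spec.get("open", False)
    delete_existing = spec.get("delete_existing", False)
    assume_metadata = spec.get("assume_metadata", False)
    assume_cached = spec.get("assume_cached_metadata", False)

    # delete_existing requires create=true and open=false.
    if delete_existing and not (create and not open_):
        raise ValueError("delete_existing requires create=true and open=false")

    # Both assume_* options require open=true and delete_existing=false.
    for name, flag in (("assume_metadata", assume_metadata),
                       ("assume_cached_metadata", assume_cached)):
        if flag and not (open_ and not delete_existing):
            raise ValueError(f"{name} requires open=true and delete_existing=false")

    # assume_metadata takes precedence when both are specified.
    if assume_metadata:
        assume_cached = False

    return {
        "open": open_,
        "create": create,
        "delete_existing": delete_existing,
        "assume_metadata": assume_metadata,
        "assume_cached_metadata": assume_cached,
    }


print(resolve_open_mode({"driver": "zarr", "open": True, "create": True}))
```

Passing both open and create as true models the "open or create" mode described above.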
Compressors
Chunk data is encoded according to the driver/zarr.metadata.compressor specified in the metadata.
- json driver/zarr/Compressor : object
  The id member identifies the compressor. The remaining members are specific to the compressor.
The following compressors are supported:
- json driver/zarr/Compressor/zlib : object
  Specifies zlib compression with a zlib or gzip header.
  - Extends: driver/zarr/Compressor
  - Optional members:
    - level : integer[0, 9] = 1
      Specifies the zlib compression level to use.
      Level 0 indicates no compression (fastest), while level 9 indicates the best compression ratio (slowest).
  Example
  {"id": "gzip", "level": 9}
- json driver/zarr/Compressor/blosc : object
  Specifies Blosc compression.
  - Extends: driver/zarr/Compressor
  - Optional members:
    - cname : "blosclz" | "lz4" | "lz4hc" | "snappy" | "zlib" | "zstd" = "lz4"
      Specifies the compression method used by Blosc.
    - clevel : integer[0, 9] = 5
      Specifies the Blosc compression level to use.
      Higher values are slower but achieve a higher compression ratio.
    - shuffle : -1 | 0 | 1 | 2 = -1
      One of:
      - -1: Automatic shuffle. Bit-wise shuffle if the element size is 1 byte, otherwise byte-wise shuffle.
      - 0: No shuffle
      - 1: Byte-wise shuffle
      - 2: Bit-wise shuffle
    - blocksize : integer[0, +∞)
      Specifies the Blosc blocksize.
      The default value of 0 causes the block size to be chosen automatically.
  Example
  {"id": "blosc", "cname": "blosclz", "clevel": 9, "shuffle": 2}
- json driver/zarr/Compressor/bz2 : object
  Specifies bzip2 compression.
  - Extends: driver/zarr/Compressor
- json driver/zarr/Compressor/zstd : object
  Specifies Zstd compression.
  - Extends: driver/zarr/Compressor
  - Optional members:
    - level : integer[-131072, 22] = 1
      Specifies the compression level to use.
      A higher compression level provides improved density but reduced compression speed.
  Example
  {"id": "zstd", "level": 6}
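To illustrate how a compressor specification selects the chunk encoding, here is a rough sketch using only the Python standard library. It is not TensorStore's implementation. The function names encode_chunk and decode_chunk are hypothetical, and the sketch assumes that the "zlib" id produces a zlib header and the "gzip" id a gzip header. Blosc and Zstd are omitted because this sketch does not rely on third-party packages, and bz2 uses the Python module's default level because no level member is listed above.

```python
import bz2
import gzip
import zlib
from typing import Optional


def encode_chunk(raw: bytes, compressor: Optional[dict]) -> bytes:
    """Encode raw chunk bytes according to a compressor spec (illustrative only)."""
    if compressor is None:
        return raw  # null compressor: chunk bytes are stored uncompressed
    cid = compressor["id"]
    level = compressor.get("level", 1)  # documented default level for zlib/gzip
    if cid == "zlib":
        return zlib.compress(raw, level)
    if cid == "gzip":
        return gzip.compress(raw, compresslevel=level)
    if cid == "bz2":
        return bz2.compress(raw)
    raise ValueError(f"compressor {cid!r} is not covered by this sketch")


def decode_chunk(encoded: bytes, compressor: Optional[dict]) -> bytes:
    """Inverse of encode_chunk."""
    if compressor is None:
        return encoded
    cid = compressor["id"]
    if cid in ("zlib", "gzip"):
        # wbits=47 (32 + 15) lets zlib auto-detect a zlib or gzip header.
        return zlib.decompress(encoded, wbits=47)
    if cid == "bz2":
        return bz2.decompress(encoded)
    raise ValueError(f"compressor {cid!r} is not covered by this sketch")


raw = bytes(range(256)) * 16
for spec in (None, {"id": "zlib"}, {"id": "gzip", "level": 9}, {"id": "bz2"}):
    assert decode_chunk(encode_chunk(raw, spec), spec) == raw
```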
Mapping to TensorStore Schema
Example with scalar data type
For the following zarr metadata:
{
  "zarr_format": 2,
  "shape": [1000, 2000, 3000],
  "chunks": [100, 200, 300],
  "dtype": "<u2",
  "compressor": null,
  "fill_value": 42,
  "order": "C",
  "filters": null
}
the corresponding Schema is:
{
  "chunk_layout": {
    "grid_origin": [0, 0, 0],
    "inner_order": [0, 1, 2],
    "read_chunk": {"shape": [100, 200, 300]},
    "write_chunk": {"shape": [100, 200, 300]}
  },
  "codec": {"compressor": null, "driver": "zarr", "filters": null},
  "domain": {"exclusive_max": [[1000], [2000], [3000]], "inclusive_min": [0, 0, 0]},
  "dtype": "uint16",
  "fill_value": 42,
  "rank": 3
}
Example with structured data type
For the following zarr metadata:
{
  "zarr_format": 2,
  "shape": [1000, 2000, 3000],
  "chunks": [100, 200, 300],
  "dtype": [["x", "<u2", [2, 3]], ["y", "<f4", [5]]],
  "compressor": {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1},
  "fill_value": "AQACAAMABAAFAAYAAAAgQQAAMEEAAEBBAABQQQAAYEE=",
  "order": "F",
  "filters": null
}
the corresponding Schema for the "x" field is:
{
  "chunk_layout": {
    "grid_origin": [0, 0, 0, 0, 0],
    "inner_order": [2, 1, 0, 3, 4],
    "read_chunk": {"shape": [100, 200, 300, 2, 3]},
    "write_chunk": {"shape": [100, 200, 300, 2, 3]}
  },
  "codec": {
    "compressor": {"blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1},
    "driver": "zarr",
    "filters": null
  },
  "domain": {
    "exclusive_max": [[1000], [2000], [3000], 2, 3],
    "inclusive_min": [0, 0, 0, 0, 0]
  },
  "dtype": "uint16",
  "fill_value": [[1, 2, 3], [4, 5, 6]],
  "rank": 5
}
and the corresponding Schema for the "y" field is:
{
  "chunk_layout": {
    "grid_origin": [0, 0, 0, 0],
    "inner_order": [2, 1, 0, 3],
    "read_chunk": {"shape": [100, 200, 300, 5]},
    "write_chunk": {"shape": [100, 200, 300, 5]}
  },
  "codec": {
    "compressor": {"blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1},
    "driver": "zarr",
    "filters": null
  },
  "domain": {"exclusive_max": [[1000], [2000], [3000], 5], "inclusive_min": [0, 0, 0, 0]},
  "dtype": "float32",
  "fill_value": [10.0, 11.0, 12.0, 13.0, 14.0],
  "rank": 4
}
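The structured fill_value in the metadata above is a base64 string, while the per-field schemas show plain numbers. As a sanity check, the following sketch decodes that string with the Python standard library, assuming the bytes are the packed little endian fields in declaration order: six uint16 values for "x" (shape [2, 3]) followed by five float32 values for "y" (shape [5]).

```python
import base64
import struct

encoded = "AQACAAMABAAFAAYAAAAgQQAAMEEAAEBBAABQQQAAYEE="
raw = base64.b64decode(encoded)

# "<6H5f": little endian, no padding; six uint16 values, then five float32 values.
values = struct.unpack("<6H5f", raw)

x_flat = list(values[:6])
x = [x_flat[0:3], x_flat[3:6]]  # reshape to the [2, 3] subarray in C order
y = list(values[6:])

print(x)  # [[1, 2, 3], [4, 5, 6]]
print(y)  # [10.0, 11.0, 12.0, 13.0, 14.0]
```

The decoded values match the fill_value entries of the "x" and "y" schemas.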
Data type
Zarr scalar data types map to TensorStore data types as follows:
TensorStore data type | Zarr data type (little endian) | Zarr data type (big endian)
---|---|---
Zarr structured data types are supported, but are represented in TensorStore as scalar arrays with additional dimensions.
When creating a new array, if a driver/zarr.metadata.dtype is not specified explicitly, a scalar Zarr data type with the native endianness is chosen based on the Schema.dtype. To create an array with non-native endianness or a structured data type, the driver/zarr.metadata.dtype must be specified explicitly.
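The choice of a native-endian Zarr data type can be sketched as follows. This is an illustration only, limited to boolean, integer, and floating-point types. It assumes the NumPy-style type strings used by Zarr v2, where "<" and ">" mark little and big endian and "|" marks single-byte types for which byte order does not apply. The helper name native_zarr_dtype is hypothetical.

```python
import sys

# TensorStore dtype name -> (NumPy kind character, size in bytes)
_KIND_AND_SIZE = {
    "bool": ("b", 1),
    "int8": ("i", 1), "uint8": ("u", 1),
    "int16": ("i", 2), "uint16": ("u", 2),
    "int32": ("i", 4), "uint32": ("u", 4),
    "int64": ("i", 8), "uint64": ("u", 8),
    "float16": ("f", 2), "float32": ("f", 4), "float64": ("f", 8),
}


def native_zarr_dtype(ts_dtype: str) -> str:
    """Return a Zarr v2 type string with the byte order of the current machine."""
    kind, size = _KIND_AND_SIZE[ts_dtype]
    if size == 1:
        prefix = "|"  # byte order is not applicable to single-byte types
    else:
        prefix = "<" if sys.byteorder == "little" else ">"
    return f"{prefix}{kind}{size}"


print(native_zarr_dtype("uint8"))
print(native_zarr_dtype("uint16"))
```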
Note
TensorStore supports the non-standard bfloat16 data type as an extension. On little endian platforms, the official Zarr Python library is capable of reading arrays created with the bfloat16 data type provided that a bfloat16 numpy data type has been registered. The TensorStore Python library registers such a data type, as do TensorFlow and JAX.
Warning
zarr datetime/timedelta data types are not currently supported.
Domain
The Schema.domain includes both the chunked dimensions as well as any subarray dimensions in the case of a structured data type.
Example with scalar data type
If the driver/zarr.metadata.dtype is "<u2" and the driver/zarr.metadata.shape is [100, 200], then the Schema.domain is {"shape": [[100], [200]]}.
Example with structured data type
If the driver/zarr.metadata.dtype is [["x", "<u2", [2, 3]]], and the driver/zarr.metadata.shape is [100, 200], then the Schema.domain is {"shape": [[100], [200], 2, 3]}.
As zarr does not natively support a non-zero origin, the underlying domain always has a zero origin (IndexDomain.inclusive_min is all zero), but it may be translated by the transform.
The upper bounds of the chunked dimensions are resizable (i.e. implicit), while the upper bounds of any subarray dimensions are not resizable.
The zarr metadata format does not support persisting dimension labels, but dimension labels may still be specified when opening using a transform.
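The two domain examples above follow one rule: chunked dimensions get implicit (resizable) upper bounds, written as single-element lists, and subarray dimensions are appended with explicit bounds. A minimal sketch of that rule, using a hypothetical helper named zarr_domain:

```python
def zarr_domain(shape, subarray_shape=()):
    """Build the Schema.domain "shape" for a zarr array field (illustrative only).

    shape: the chunked dimensions from metadata.shape.
    subarray_shape: the subarray shape of the selected field, empty for scalars.
    """
    implicit = [[n] for n in shape]   # resizable upper bounds
    explicit = list(subarray_shape)   # fixed upper bounds
    return {"shape": implicit + explicit}


print(zarr_domain([100, 200]))          # {'shape': [[100], [200]]}
print(zarr_domain([100, 200], [2, 3]))  # {'shape': [[100], [200], 2, 3]}
```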
Chunk layout
The zarr format supports a single driver/zarr.metadata.chunks property that corresponds to the ChunkLayout/Grid.shape constraint. As with the Schema.domain, the Schema.chunk_layout includes both the chunked dimensions as well as any subarray dimensions in the case of a structured data type. The chunk size for subarray dimensions is always the full extent.
Example with scalar data type
If the driver/zarr.metadata.dtype is "<u2" and driver/zarr.metadata.chunks is [100, 200], then the ChunkLayout/Grid.shape is [100, 200].
Example with structured data type
If the driver/zarr.metadata.dtype is [["x", "<u2", [2, 3]]], and driver/zarr.metadata.chunks is [100, 200], then the ChunkLayout/Grid.shape is [100, 200, 2, 3].
As the zarr format supports only a single level of chunking, the ChunkLayout.read_chunk and ChunkLayout.write_chunk constraints are combined, and hard constraints on ChunkLayout.codec_chunk must not be specified.
The ChunkLayout.grid_origin is always all-zero.
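Because the grid origin is zero and there is a single chunk shape, the chunk that holds a given element can be found by integer division over the chunked dimensions. The sketch below also joins the resulting grid indices with the dimension_separator to form the chunk key relative to the array path, following the Zarr v2 key convention. The helper name chunk_key is hypothetical.

```python
def chunk_key(index, chunks, dimension_separator="."):
    """Return the key of the chunk containing `index` (illustrative only).

    index: position along each chunked dimension.
    chunks: metadata.chunks, the chunk size along each chunked dimension.
    """
    if len(index) != len(chunks):
        raise ValueError("index and chunks must have the same length")
    grid_cell = [i // c for i, c in zip(index, chunks)]
    return dimension_separator.join(str(g) for g in grid_cell)


print(chunk_key([250, 399], [100, 200]))       # 2.1
print(chunk_key([250, 399], [100, 200], "/"))  # 2/1
```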
The ChunkLayout.inner_order corresponds to driver/zarr.metadata.order, but also includes the subarray dimensions, which are always the inner-most dimensions.
Example with scalar data type and C order
If the driver/zarr.metadata.dtype is "<u2", driver/zarr.metadata.order is "C", and there are 3 chunked dimensions, then the ChunkLayout.inner_order is [0, 1, 2].
Example with scalar data type and Fortran order
If the driver/zarr.metadata.dtype is "<u2", driver/zarr.metadata.order is "F", and there are 3 chunked dimensions, then the ChunkLayout.inner_order is [2, 1, 0].
Example with structured data type and C order
If the driver/zarr.metadata.dtype is [["x", "<u2", [2, 3]]], driver/zarr.metadata.order is "C", and there are 3 chunked dimensions, then the ChunkLayout.inner_order is [0, 1, 2, 3, 4].
Example with structured data type and Fortran order
If the driver/zarr.metadata.dtype is [["x", "<u2", [2, 3]]], driver/zarr.metadata.order is "F", and there are 3 chunked dimensions, then the ChunkLayout.inner_order is [2, 1, 0, 3, 4].
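The four inner_order examples above can be reproduced with a short function: Fortran order reverses only the chunked dimensions, and the subarray dimensions are appended in their natural order. This is a sketch of the rule as described here, with a hypothetical helper name.

```python
def inner_order(num_chunked_dims, order="C", num_subarray_dims=0):
    """Compute ChunkLayout.inner_order from metadata.order (illustrative only)."""
    chunked = list(range(num_chunked_dims))
    if order == "F":
        chunked.reverse()  # Fortran order reverses the chunked dimensions only
    elif order != "C":
        raise ValueError('order must be "C" or "F"')
    # Subarray dimensions are always inner-most and keep their natural order.
    subarray = list(range(num_chunked_dims, num_chunked_dims + num_subarray_dims))
    return chunked + subarray


print(inner_order(3, "C"))     # [0, 1, 2]
print(inner_order(3, "F"))     # [2, 1, 0]
print(inner_order(3, "C", 2))  # [0, 1, 2, 3, 4]
print(inner_order(3, "F", 2))  # [2, 1, 0, 3, 4]
```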
Selection of chunk layout when creating a new array
When creating a new array, the chunk sizes may be specified explicitly via ChunkLayout/Grid.shape or implicitly via ChunkLayout/Grid.aspect_ratio and ChunkLayout/Grid.elements. In the latter case, a suitable chunk shape is chosen automatically. If ChunkLayout/Grid.elements is not specified, the default is 1 million elements per chunk:
Example of unconstrained chunk layout
>>> import tensorstore as ts
>>> ts.open({
...     'driver': 'zarr',
...     'kvstore': {
...         'driver': 'memory'
...     }
... },
...         create=True,
...         dtype=ts.uint16,
...         shape=[1000, 2000, 3000]).result().chunk_layout
ChunkLayout({
  'grid_origin': [0, 0, 0],
  'inner_order': [0, 1, 2],
  'read_chunk': {'shape': [101, 101, 101]},
  'write_chunk': {'shape': [101, 101, 101]},
})
Example of explicit chunk shape constraint
>>> ts.open({
...     'driver': 'zarr',
...     'kvstore': {
...         'driver': 'memory'
...     }
... },
...         create=True,
...         dtype=ts.uint16,
...         shape=[1000, 2000, 3000],
...         chunk_layout=ts.ChunkLayout(
...             chunk_shape=[100, 200, 300])).result().chunk_layout
ChunkLayout({
  'grid_origin': [0, 0, 0],
  'inner_order': [0, 1, 2],
  'read_chunk': {'shape': [100, 200, 300]},
  'write_chunk': {'shape': [100, 200, 300]},
})
Example of chunk aspect ratio constraint
>>> ts.open({
...     'driver': 'zarr',
...     'kvstore': {
...         'driver': 'memory'
...     }
... },
...         create=True,
...         dtype=ts.uint16,
...         shape=[1000, 2000, 3000],
...         chunk_layout=ts.ChunkLayout(
...             chunk_aspect_ratio=[1, 2, 2])).result().chunk_layout
ChunkLayout({
  'grid_origin': [0, 0, 0],
  'inner_order': [0, 1, 2],
  'read_chunk': {'shape': [64, 128, 128]},
  'write_chunk': {'shape': [64, 128, 128]},
})
Example of chunk aspect ratio and elements constraint
>>> ts.open({
...     'driver': 'zarr',
...     'kvstore': {
...         'driver': 'memory'
...     }
... },
...         create=True,
...         dtype=ts.uint16,
...         shape=[1000, 2000, 3000],
...         chunk_layout=ts.ChunkLayout(
...             chunk_aspect_ratio=[1, 2, 2],
...             chunk_elements=2000000)).result().chunk_layout
ChunkLayout({
  'grid_origin': [0, 0, 0],
  'inner_order': [0, 1, 2],
  'read_chunk': {'shape': [79, 159, 159]},
  'write_chunk': {'shape': [79, 159, 159]},
})
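To see how the chunk shapes in the examples above relate to the elements constraint, the following sketch computes the number of elements per chunk and the uncompressed chunk size for the uint16 data type (2 bytes per element). It only checks the arithmetic of the outputs shown above and does not reproduce the way TensorStore selects a chunk shape.

```python
import math

# Chunk shapes copied from the example outputs above.
chunk_shapes = {
    "unconstrained": [101, 101, 101],
    "explicit chunk shape": [100, 200, 300],
    "aspect ratio [1, 2, 2]": [64, 128, 128],
    "aspect ratio [1, 2, 2], 2000000 elements": [79, 159, 159],
}

BYTES_PER_ELEMENT = 2  # uint16

elements_per_chunk = {name: math.prod(shape) for name, shape in chunk_shapes.items()}

for name, elements in elements_per_chunk.items():
    print(f"{name}: {elements} elements, {elements * BYTES_PER_ELEMENT} bytes uncompressed")
```

The unconstrained and aspect-ratio-only cases come out close to the default of 1 million elements (1030301 and 1048576), and the last case comes out close to the requested 2000000 (1997199).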
Codec
Within the Schema.codec, the compression parameters are represented in the same way as in the metadata:
- json driver/zarr/Codec : object
  - Optional members:
    - compressor : driver/zarr/Compressor | null
      Specifies the chunk compression method.
      Specifying null disables compression. When creating a new array, if not specified, the default compressor of {"id": "blosc"} is used.
    - filters : null
      Specifies the filters to apply to chunks.
      When encoding a chunk, filters are applied before the compressor. Currently, filters are not supported.
It is an error to specify any other Codec.driver.
Fill value
For scalar zarr data types, the Schema.fill_value must be a scalar (rank 0). For structured data types, the Schema.fill_value must be broadcastable to the subarray shape.
As an optimization, chunks that are entirely equal to the fill value are not stored.
The zarr format allows the fill value to be unspecified, indicated by a driver/zarr.metadata.fill_value of null. In that case, TensorStore always uses a fill value of 0. However, in this case explicitly-written all-zero chunks are still stored.
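The storage rule for fill values can be summarised in a few lines. The sketch below models a chunk as a flat list of scalar values and decides whether an explicitly written chunk would be stored, combining the fill value behaviour described here with the store_data_equal_to_fill_value option. The function name should_store_chunk is hypothetical, and this is a reading of the documented behaviour rather than TensorStore's code.

```python
def should_store_chunk(values, fill_value, store_data_equal_to_fill_value=False):
    """Decide whether an explicitly written chunk is stored (illustrative only).

    values: flat list of the chunk's elements.
    fill_value: metadata.fill_value, or None when it is unspecified (null).
    """
    if store_data_equal_to_fill_value:
        return True  # every explicitly written chunk is kept
    if fill_value is None:
        # Unspecified fill value: reads of missing chunks yield 0, but
        # explicitly written all-zero chunks are still stored.
        return True
    # Otherwise a chunk entirely equal to the fill value is not stored.
    return any(v != fill_value for v in values)


print(should_store_chunk([42, 42, 42], fill_value=42))
print(should_store_chunk([42, 7, 42], fill_value=42))
print(should_store_chunk([0, 0, 0], fill_value=None))
```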
Limitations
Filters are not supported.