ocdbt
Key-Value Store driver¶
The ocdbt
driver implements an Optionally-Cooperative Distributed B+Tree
(OCDBT) on top of a base key-value store.
- json kvstore/ocdbt : object¶
Read/write adapter for the OCDBT format.
JSON specification of the key-value store.
- Optional members:¶
- path : string¶
Key prefix within the key-value store.
If the prefix is intended to correspond to a Unix-style directory path, it should end with
"/"
.
- coordinator : ContextResource¶
Specifies or references a previously defined
Context.ocdbt_coordinator
.
- config : object¶
Constrains the database configuration.
When creating a new database, a configuration is chosen subject to any constraints specified, using the default values indicated below for any unspecified options. When opening an existing database, the configuration must match any specified constraints.
- Optional members:¶
- uuid : string¶
Unique 128-bit identifier for the database, specified as 32 hex digits.
If not specified, it is randomly generated when the database is first created.
-
manifest_kind :
"single"
|"numbered"
¶ Manifest format to use.
If not specified, the format is chosen automatically when creating the database based on the capabilities of the underlying kvstore. If the capabilities of the underlying kvstore are such that no format safely supports concurrent writes, then it is an error not to specify a value when creating the database.
-
max_inline_value_bytes : integer[
0
,1048576
] =100
¶ Maximum number of value bytes to store inline in a B+tree leaf node.
Values larger than the specified limit are stored out-of-line in a data file.
-
max_decoded_node_bytes : integer[
0
,4294967295
] =83951616
¶ Maximum size of an (uncompressed) B+tree node.
Nodes are split to ensure they do not exceed this limit.
-
version_tree_arity_log2 : integer[
1
,16
] =4
¶ Base-2 logarithm of the arity of the tree of versions.
The list of database versions is stored as a tree data structure of the specified arity.
-
compression : kvstore/ocdbt/Compression/zstd |
null
={"id": "zstd", "level": 0}
¶ Compression method used to encode the manifest and B+Tree nodes.
- assume_config : boolean¶
Permits data files to be written before the initial manifest.
Normally, even when performing unconditional writes to an OCDBT database, the existing manifest is first read before writing any data files, in order to determine the configuration (as this affects how data files are written).
If there is no existing manifest, an initial manifest containing the configuration but no B+tree versions is written before any data files are written. This ensures that, in the case of multiple concurrent writers attempting to create the OCDBT database with inconsistent configurations, the correct configuration is used to write any data files.
For many applications, the additional one-time latency from reading or writing the initial manifest is negligible, but for applications sensitive to the latency of initially writing and/or creating an OCDBT database,
assume_config
may be specified.In this case, data files may be written before confirming the configuration in the manifest. Instead, an assumed configuration, equal to the configuration that would be used for a new OCDBT database according to the constraints specified in
config
, is used.If this option is specified and the assumed configuration does not match the actual, existing configuration, reading and writing will fail with an error. In the case of writes, the OCDBT database will remain unmodified, but unreferenced garbage data files may be left behind.
-
value_data_prefix : string =
"d/"
¶ Prefix for writing data files containing indirect values.
Values that are not stored inline within B+tree nodes are written to data files under this prefix. This option has no effect when reading, and different (even concurrent) writers to the same OCDBT database can safely use different values of
value_data_prefix
.Shared data files are used if
value_data_prefix
is equal tobtree_node_data_prefix
orversion_tree_node_data_prefix
.
-
btree_node_data_prefix : string =
"d/"
¶ Prefix for writing data files containing B+tree nodes.
-
version_tree_node_data_prefix : string =
"d/"
¶ Prefix for writing data files containing version tree nodes.
-
target_data_file_size : integer =
2147483648
¶ Target size of each ocdbt data file.
OCDBT will flush data files to the base key-value store once they reach the target size. When set to 0, data flles may be an arbitrary size.
-
cache_pool : ContextResource =
"cache_pool"
¶ Specifies or references a previously defined
Context.cache_pool
. It is typically more convenient to specify a defaultcache_pool
in thecontext
.
-
data_copy_concurrency : ContextResource =
"data_copy_concurrency"
¶ Specifies or references a previously defined
Context.data_copy_concurrency
. It is typically more convenient to specify a defaultdata_copy_concurrency
in thecontext
.
- json Context.ocdbt_coordinator : object¶
Enables distributed coordination for OCDBT.
Note
Atomic multi-key transactions are supported when not using
Context.ocdbt_coordinator
.
Concepts¶
An OCDBT database is stored as a collection of entries/files within a prefix/directory of a base key-value store.
It is a versioned key-value store:
Each version is identified by:
a 64-bit generation number, which increases sequentially from 1, and;
a nanosecond-precision commit timestamp that increases monotonically with the generation number.
Versioning is managed automatically: each batch of writes results in the creation of a new version.
Storage format¶
The database is represented by the following entries within a prefix/directory of an underlying key-value store:
manifest.ocdbt
Stores the encoded manifest, which specifies the database configuration and, depending on the manifest kind, optionally the tree of versions.
manifest.xxxxxxxxxxxxxxxx
Stores encoded manifests when using the numbered manifest kind. The
xxxxxxxxxxxxxxxx
portion of the filename specifies the latest generation number referenced from the stored manifest, as a 16-digit (0-padded) lowercase hexadecimal number.d/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Log-structured data files that store:
Out-of-line raw values too large to store inline in a B+tree node;
Encoded version tree nodes;
Encoded B+Tree nodes.
The
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
portion of the filename is the lowercase hex representation of a 128-bit random identifier.The
d/
prefix is the default used when writing but can be overridden by thekvstore/ocdbt.value_data_prefix
,kvstore/ocdbt.btree_node_data_prefix
, andkvstore/ocdbt.version_tree_node_data_prefix
options. When reading, the format allows the data files to have any arbitrary relative path and these options have no effect.
To read a key from the database, a client first reads the manifest file, then traverses the version tree to locate the root B+tree node of the desired version, then traverses the B+tree to locate the leaf node entry for the desired key. If the value is small and is stored inline in the leaf B+tree node, it is immediately available from the leaf node. Otherwise, the leaf node contains a pointer to the value and it must be read separately from a data file.
Manifest kinds¶
Several different ways of storing the manifest are supported, in order to support atomic updates despite the various limitations of underlying key-value stores.
Single file¶
The single file method simply stores the manifest as a single key,
manifest.ocdbt
, in the underlying key-value store, that stores both the
database configuration and the version tree. This manifest file is replaced on
each commit to the database.
This is the most efficient method, but is only safe for concurrent writes if the underlying key-value store supports atomic writes to a single key.
Supported base key-value stores include: - file - gcs
Numbered file¶
The numbered file method stores the database configuration in the
manifest.ocdbt
file, while the version tree is stored in
manifest.xxxxxxxxxxxxxxxx
files that are written for each commit.
Only a small number of manifests are retained at any given time; older manifests are deleted automatically.
This method is safe for concurrent writes if the underlying key-value store supports atomic writes to a single key, conditioned on the key not already being present.
Manifest format¶
An encoded manifest consists of:
Body compressed according to the specified compression_format:
Manifest version tree, present only if manifest_kind is 0 (single).
Manifest header¶
Field |
Binary format |
---|---|
|
|
|
|
|
|
|
magic_value
Must equal
0x0cdb3a2a
length
Length in bytes of entire manifest, including this header.
version
Must equal
0
.
crc32c_checksum
CRC32C checksum of entire manifest, Length in bytes of entire manifest, including this header.
compression_format
0
for uncompressed,1
for zstd.
Manifest configuration¶
Field |
Binary format |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
uuid
Unique 128-bit identifier for the database. If not specified explicitly, is randomly generated when the database is first created.
manifest_kind
Specifies the kind of manifest that is present. Valid values are:
0
(single
)Both the configuration and version tree are present in the manifest. When using the single file manifest kind, this is set in the
manifest.ocdbt
file. When using numbered file manifest kind, this is set in themanifest.xxxxxxxxxxxxxxxx
files.
1
(numbered
)Indicates the numbered file manifest kind. This manifest stores only the configuration. The version tree must be retrieved from the numbered
manifest.xxxxxxxxxxxxxxxx
files.
max_inline_value_bytes
Maximum size of a value to store inline within a B+Tree node.
max_decoded_note_bytes
Maximum (uncompressed) size of a B+Tree node.
version_tree_arity_log2
Base-2 logarithm of the arity of the version tree.
compression_method
0
for compressed,1
for Zstandard.
Compression configuration¶
If the compression_method is not 0
, it is followed by
the method-specific configuration.
Zstd compression configuration¶
Field |
Binary format |
---|---|
|
level
Compresion level to use when writing.
Manifest version tree¶
Following the compression configuration, the manifest specifies references to B+tree roots and version tree nodes.
Field |
Binary format |
---|---|
data_file_table
Table specifying the data files referenced by inline_versions and version_nodes.
inline_versions
References to the most recent versions.
Older versions are referenced indirectly via version_nodes.
version_nodes
References to version tree interior nodes for versions older than those referenced from inline_versions.
Manifest footer¶
Field |
Binary format |
---|---|
|
crc32c_checksum
CRC-32C checksum of the entire manifest, excluding the checksum itself.
Data file table format¶
Logically, the data file table is a list of (base_path[i], relative_path[i])
pairs of byte strings. The full data file path, relative to the root of the
OCDBT database, is full_path[i] = transitive_path + base_path[i] + relative_path[i]
, where
transitive_path
is the transitive file path specified by the parent node:
For the data file table specified in the manifest, the transitive path is the empty string.
If a node is accessed using a data file path of
transitive_path + base_path[i] + relative_path[i]
, the transitive path that applies to any child nodes is equal totransitive_path + base_path[i]
; that is, thebase_path[i]
is the additional transitive portion of the path.
The maximum length of full_path[i]
is 65535 bytes.
Prefix compression is used to encode the combined path[i] = base_path[i] +
relative_path[i]
.
Field |
Binary format |
Count |
---|---|---|
|
1 |
|
|
num_files - 1 |
|
|
||
|
||
|
num_files
Number of data files specified in the table.
path_prefix_length[i]
Length in bytes of common prefix of
path[i]
andpath[i+1]
. For the first path, no common prefix is stored, and implicitlypath_prefix_length[-1]
is defined to be0
.
path_suffix_length[i]
Length in bytes of
path_suffix[i]
. This is equal tolength(path[i]) - path_prefix_length[i-1]
.
base_path_length[i]
Length in bytes of
base_path[i]
. To simplify decoding, it is required that ifpath_prefix_length[i-1] > min(base_path_length[i], base_path_length[i-1])
, thenbase_path[i] = base_path[i-1]
. That is, the common prefix must not extend past the end of the current or previous base path unless the base path is equal to the previous base path.
path_suffix[i]
Path suffix value. This is equal to
path[i]
with the firstpath_prefix_length[i-1]
bytes excluded. Fori = 0
,path_suffix[i] = path[i]
.
Version tree node format¶
An encoded version tree node consists of:
Body compressed according to the specified compression_format:
Leaf node entries or Interior node entries, depending on the height.
Version tree node outer header¶
Field |
Binary format |
---|---|
|
|
|
|
|
|
|
magic_value
Must equal
0x0cdb1234
.
length
Length in bytes of entire version tree node, including this header.
version
Must equal
0
.
compression_format
0
for uncompressed,1
for zstd.
The remaining data is encoded according to the specified compression_format.
Version tree node inner header¶
Field |
Binary format |
---|---|
|
|
|
|
version_tree_arity_log2
Base-2 logarithm of the version tree node arity. Must match the arity specified in the manifest from which this node was reached.
height
Height of this version tree node. Leaf nodes have a height of 0.
It is required that
(height + 1) * version_tree_arity_log2 < 64
data_file_table
Table specifying the data files referenced by the node entries.
The format of the remaining data depends on the value of height
.
Version tree leaf node entries format (height = 0
)¶
The same encoded representation is used for both the entries of a leaf version tree node and for the inline_versions specified in the manfiest.
Field |
Binary format |
Count |
---|---|---|
|
1 |
|
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
num_versions
Number of B+tree roots that are referenced. The value is constrained based on the value of
generation_number[num_versions-1]
, the latest generation number referenced from the version tree node, and version_tree_arity_log2:1 <= num_versions <= (generation_number[num_versions-1] - 1) % (1 << version_tree_arity_log2) + 1
Note
The same computation of
num_versions
applies to both leaf node entries included in a version tree node, and inline_versions included in the manfiest. In the former case, the and version_tree_arity_log2 value is obtained from the version node. In the latter case, the version_tree_arity_log2 value is taken from the manifest.
generation_number[i]
Generation number of the referenced B+tree root. Must not be 0. The generation numbers must be strictly increasing, i.e. if
i < j
, thengeneration_number[i] < generation_number[j]
.
root_height[i]
Height of the referenced B+tree root. Must be 0 if there is no root node.
commit_time[i]
Time at which the generation was created, in nanoseconds since the Unix epoch (excluding leap seconds).
data_file_id[i]
Specifies the data file containing the encoded root B+tree node, as an index into the data file table.
data_file_offset[i]
Specifies the starting byte offset within data_file_id[i] of the encoded root B+tree node.
data_file_length[i]
Specifies the byte length within data_file_id[i] of the encoded root B+tree node.
num_keys[i]
Specifies the total number of (leaf-node) keys within the B+tree. Note that if there is more than one path from the root to a given leaf node, the leaf node’s keys are counted more than once.
num_tree_bytes[i]
Specifies the total encoded size in bytes of all B+tree nodes reachable from the root, including the root itself. A given node is counted once for each unique path within the tree to it; if there is more than one path to a node, its size is counted multiple times.
num_indirect_value_bytes[i]
Specifies the total size in bytes of all indirectly-stored values in the B+tree. If the same stored value is referenced from multiple keys, its size is counted multiple times.
Interior version tree node entries (height > 0
)¶
The same encoded representation is used for both the entries of an interior version tree node and for the version_nodes version tree nodes specified in the manfiest, but the interpretation differs, as described below.
Field |
Binary format |
Count |
---|---|---|
|
1 |
|
|
||
|
||
|
||
|
||
|
||
|
When the encoded representation is used to specify version_nodes in the manifest, there is one additional field:
Field |
Binary format |
Count |
---|---|---|
|
num_children
Number of version tree nodes that are referenced.
When this encoded representation is used to specify the entries of an interior version tree node,
num_children
is constrained by:1 <= num_children <= ((generation_number[num_children-1] >> (version_tree_arity_log2 * height)) - 1) % (1 << version_tree_arity_log2) + 1
version_tree_arity_log2 and height are obtained from the version tree node.
generation_number[i]
Latest B+tree root generation number referenced within this subtree. Must not be 0. The generation numbers must be strictly increasing, i.e. if
i < j
, thengeneration_number[i] < generation_number[j]
.
data_file_id[i]
Specifies the data file containing the encoded version tree node, as an index into the data file table.
data_file_offset[i]
Specifies the starting byte offset within
data_file_id[i]
of the encoded version tree node.
data_file_length[i]
Specifies the byte length within
data_file_id[i]
of the encoded version tree node.
num_generations[i]
Total number of B+tree roots referenced within this subtree.
commit_time[i]
Commit time, in milliseconds since the Unix epoch (excluding leap seconds), of the earlier B+tree root referenced within this subtree.
Note
This is the earliest commit time referenced within this subtree, in contrast with
generation_number[i]
, which specifies the latest generation number referenced within this subtree. Storing the earliest commit time, rather than the latest commit time, enables more efficient queries for the latest generation withcommit_time<=T
.
entry_height[i]
Specifies the height of the referenced version tree node.
This field is only present when the encoded representation is used to specify version_nodes in the manifest. The heights must be decreasing, i.e.
entry_height[i] > entry_height[j]
ifi < j
.When the encoded representaiton is used to specify the entries of an interior version tree node, this field is not present and instead, for the purpose of this specification,
entry_height[i]
is implicitly equal toheight - 1
, where height is obtained from the version tree node.
Version tree node footer¶
Field |
Binary format |
---|---|
|
crc32c_checksum
CRC-32C checksum of the entire version tree node, excluding the checksum itself.
B+tree node format¶
An encoded B+tree node consists of:
Body compressed according to the specified compression_format:
Leaf node entries or Interior node entries, depending on the height.
B+tree node outer header¶
Field |
Binary format |
---|---|
|
|
|
|
|
|
|
magic_value
Must equal
0x0cdb20de
.
length
Length in bytes of entire B+tree node, including this header.
version
Must equal
0
.
compression_format
0
for uncompressed,1
for zstd.
The remaining data is encoded according to the specified compression_format.
B+tree node inner header¶
Field |
Binary format |
---|---|
|
|
|
height
Height of this B+tree node. Leaf nodes have a height of 0.
data_file_table
Table specifying the data files referenced by the node entries.
num_entries
Number of children (if
height > 0
) or key/value pairs (ifheight == 0
) referenced from this node.
The format of the remaining data depends on the value of height
.
Leaf B+tree node format (height = 0
)¶
Note
B+tree leaf nodes do not directly store the full key for each key/value
entry. Instead, only the relative_key
for each entry is stored; the
prefix that must be prepended to this relative_key
to obtain the full
key
is defined by the path from the root node to the leaf node.
Field |
Binary format |
Count |
|
num_entries - 1 |
|
|
||
|
||
|
||
|
||
|
||
|
||
|
key_prefix_length[i]
Length in bytes of common prefix of
relative_key[i]
andrelative_key[i+1]
. For the first key, no common prefix is stored.
key_suffix_length[i]
Length in bytes of each relative key, excluding the length of the common prefix with the previous key. For the first key, the total length is stored, since there is no previous key.
key_suffix[i]
Relative key for the entry, excluding the common prefix with the previous key. For the first key, the entire key is stored. Note that the suffixes for all keys are concatenated in the encoded representation.
value_length[i]
Length in bytes of the value for this key/value entry.
value_kind[i]
Indicates how the value is stored:
0
if the value is stored inline in this leaf node,1
if the value is stored out-of-line.
Based on this column, the following derived values are defined:
num_direct_entries
The number of entries for which
value_kind[i] == 0
.direct_entries
The array of indices of direct values.
direct_value_length[j]
Equal to
value_length[direct_values[j]]
.
num_indirect_entries
Equal to
num_entries - num_direct_entries
.indirect_entries
The array of indices of indirect values.
data_file_id[k]
Specifies the data file containing the value for entry
indirect_values[k]
, as an index into the data file table. Only stored for entries with out-of-line values.
data_file_offset[k]
Specifies the starting byte offset within
data_file_id[k]
of the value for entryindirect_values[k]
. Only stored for entries with out-of-line values.
value[j]
Specifies the value for entry
direct_values[j]
. Only stored for entries with inline values. Note that all direct values are concatenated in the encoded representation.
Interior B+tree node format (height > 0
)¶
Note
B+tree interior nodes do not directly store the full starting key for each
child node. Instead, only the relative_key
for each child node is
stored; the prefix that must be prepended to this relative_key
to obtain
the full key
is defined by the path from the root node.
Field |
Binary format |
Count |
|
num_entries - 1 |
|
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
key_prefix_length[i]
Length in bytes of common prefix of
relative_key[i]
andrelative_key[i+1]
. For the first key, no common prefix is stored.
key_suffix_length[i]
Length in bytes of each relative key, excluding the length of the common prefix with the previous key. For the first key, the total length is stored, since there is no previous key.
subtree_common_prefix_length[i]
Length in bytes of the prefix of
relative_key[i]
that is common to all keys within the subtree rooted at this child node. This prefix serves as an implicit prefix of all keys within the subtree rooted at the child.
key_suffix[i]
Relative key for the entry, excluding the common prefix with the previous key. For the first key, the entire key is stored. Note that the suffixes for all keys are concatenated in the encoded representation.
data_file_id[i]
Specifies the data file containing the encoded child B+tree node, as an index into the data file table.
data_file_offset[i]
Specifies the starting byte offset within
data_file_id[i]
of the encoded child B+tree node.
data_file_length[i]
Specifies the byte length within
data_file_id[i]
of the encoded child B+tree node.
num_keys[i]
Specifies the total number of (leaf-node) keys within the subtree rooted at the child node. Note that if there is more than one path from the child node to a given leaf node, the leaf node’s keys are counted more than once.
num_tree_bytes[i]
Specifies the total encoded size in bytes of all B+tree nodes reachable from the child, including the child itself. A given node is counted once for each unique path within the tree to it; if there is more than one path to a node, its size is counted multiple times.
num_indirect_value_bytes[i]
Specifies the total size in bytes of all indirectly-stored values in the subtree rooted at the child node. If the same stored value is referenced from multiple keys, its size is counted multiple times.
B+tree node footer¶
Field |
Binary format |
---|---|
|
crc32c_checksum
CRC-32C checksum of the entire B+tree node, excluding the checksum itself.