ocdbt Key-Value Store driver

The ocdbt driver implements an Optionally-Cooperative Distributed B+Tree (OCDBT) on top of a base key-value store.

json kvstore/ocdbt : object

Read/write adapter for the OCDBT format.

JSON specification of the key-value store.

Extends:
  • KvStore — Key-value store specification.

Required members:
driver : "ocdbt"
base : KvStore

Underlying key-value store.

Optional members:
path : string

Key prefix within the key-value store.

If the prefix is intended to correspond to a Unix-style directory path, it should end with "/".

context : Context

Specifies context resources that augment/override the parent context.

coordinator : ContextResource

Specifies or references a previously defined Context.ocdbt_coordinator.

config : object

Constrains the database configuration.

When creating a new database, a configuration is chosen subject to any constraints specified, using the default values indicated below for any unspecified options. When opening an existing database, the configuration must match any specified constraints.

Optional members:
uuid : string

Unique 128-bit identifier for the database, specified as 32 hex digits.

If not specified, it is randomly generated when the database is first created.

manifest_kind : "single" | "numbered"

Manifest format to use.

If not specified, the format is chosen automatically when creating the database based on the capabilities of the underlying kvstore. If the capabilities of the underlying kvstore are such that no format safely supports concurrent writes, then it is an error not to specify a value when creating the database.

One of:
"single"

Single-file manifest format

"numbered"

Numbered-file manifest format

max_inline_value_bytes : integer[0, 1048576] = 100

Maximum number of value bytes to store inline in a B+tree leaf node.

Values larger than the specified limit are stored out-of-line in a data file.

max_decoded_node_bytes : integer[0, 4294967295] = 83951616

Maximum size of an (uncompressed) B+tree node.

Nodes are split to ensure they do not exceed this limit.

version_tree_arity_log2 : integer[1, 16] = 4

Base-2 logarithm of the arity of the tree of versions.

The list of database versions is stored as a tree data structure of the specified arity.

compression : kvstore/ocdbt/Compression/zstd | null = {"id": "zstd", "level": 0}

Compression method used to encode the manifest and B+Tree nodes.

assume_config : boolean

Permits data files to be written before the initial manifest.

Normally, even when performing unconditional writes to an OCDBT database, the existing manifest is first read before writing any data files, in order to determine the configuration (as this affects how data files are written).

If there is no existing manifest, an initial manifest containing the configuration but no B+tree versions is written before any data files are written. This ensures that, in the case of multiple concurrent writers attempting to create the OCDBT database with inconsistent configurations, the correct configuration is used to write any data files.

For many applications, the additional one-time latency from reading or writing the initial manifest is negligible, but for applications sensitive to the latency of initially writing and/or creating an OCDBT database, assume_config may be specified.

In this case, data files may be written before confirming the configuration in the manifest. Instead, an assumed configuration, equal to the configuration that would be used for a new OCDBT database according to the constraints specified in config, is used.

If this option is specified and the assumed configuration does not match the actual, existing configuration, reading and writing will fail with an error. In the case of writes, the OCDBT database will remain unmodified, but unreferenced garbage data files may be left behind.

value_data_prefix : string = "d/"

Prefix for writing data files containing indirect values.

Values that are not stored inline within B+tree nodes are written to data files under this prefix. This option has no effect when reading, and different (even concurrent) writers to the same OCDBT database can safely use different values of value_data_prefix.

Shared data files are used if value_data_prefix is equal to btree_node_data_prefix or version_tree_node_data_prefix.

btree_node_data_prefix : string = "d/"

Prefix for writing data files containing B+tree nodes.

version_tree_node_data_prefix : string = "d/"

Prefix for writing data files containing version tree nodes.

target_data_file_size : integer = 2147483648

Target size of each ocdbt data file.

OCDBT will flush data files to the base key-value store once they reach the target size. When set to 0, data flles may be an arbitrary size.

cache_pool : ContextResource = "cache_pool"

Specifies or references a previously defined Context.cache_pool. It is typically more convenient to specify a default cache_pool in the context.

data_copy_concurrency : ContextResource = "data_copy_concurrency"

Specifies or references a previously defined Context.data_copy_concurrency. It is typically more convenient to specify a default data_copy_concurrency in the context.

json kvstore/ocdbt/Compression/zstd : object

Specifies Zstandard compression.

Required members:
id : "zstd"
Optional members:
level : integer

Compression level.

json Context.ocdbt_coordinator : object

Enables distributed coordination for OCDBT.

Optional members:
address : string

Address of gRPC coordinator server.

Must be specified to use distributed coordination.

lease_duration : string = "10s"

Duration of lease to request from coordinator for B+tree key ranges.

Note

Atomic multi-key transactions are supported when not using Context.ocdbt_coordinator.

Concepts

An OCDBT database is stored as a collection of entries/files within a prefix/directory of a base key-value store.

It is a versioned key-value store:

  • Each version is identified by:

    • a 64-bit generation number, which increases sequentially from 1, and;

    • a nanosecond-precision commit timestamp that increases monotonically with the generation number.

  • Versioning is managed automatically: each batch of writes results in the creation of a new version.

Storage format

The database is represented by the following entries within a prefix/directory of an underlying key-value store:

  • manifest.ocdbt

    Stores the encoded manifest, which specifies the database configuration and, depending on the manifest kind, optionally the tree of versions.

  • manifest.xxxxxxxxxxxxxxxx

    Stores encoded manifests when using the numbered manifest kind. The xxxxxxxxxxxxxxxx portion of the filename specifies the latest generation number referenced from the stored manifest, as a 16-digit (0-padded) lowercase hexadecimal number.

  • d/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    Log-structured data files that store:

    • Out-of-line raw values too large to store inline in a B+tree node;

    • Encoded version tree nodes;

    • Encoded B+Tree nodes.

    The xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx portion of the filename is the lowercase hex representation of a 128-bit random identifier.

    The d/ prefix is the default used when writing but can be overridden by the kvstore/ocdbt.value_data_prefix, kvstore/ocdbt.btree_node_data_prefix, and kvstore/ocdbt.version_tree_node_data_prefix options. When reading, the format allows the data files to have any arbitrary relative path and these options have no effect.

To read a key from the database, a client first reads the manifest file, then traverses the version tree to locate the root B+tree node of the desired version, then traverses the B+tree to locate the leaf node entry for the desired key. If the value is small and is stored inline in the leaf B+tree node, it is immediately available from the leaf node. Otherwise, the leaf node contains a pointer to the value and it must be read separately from a data file.

Manifest kinds

Several different ways of storing the manifest are supported, in order to support atomic updates despite the various limitations of underlying key-value stores.

Single file

The single file method simply stores the manifest as a single key, manifest.ocdbt, in the underlying key-value store, that stores both the database configuration and the version tree. This manifest file is replaced on each commit to the database.

This is the most efficient method, but is only safe for concurrent writes if the underlying key-value store supports atomic writes to a single key.

Supported base key-value stores include: - file - gcs

Numbered file

The numbered file method stores the database configuration in the manifest.ocdbt file, while the version tree is stored in manifest.xxxxxxxxxxxxxxxx files that are written for each commit.

Only a small number of manifests are retained at any given time; older manifests are deleted automatically.

This method is safe for concurrent writes if the underlying key-value store supports atomic writes to a single key, conditioned on the key not already being present.

Manifest format

An encoded manifest consists of:

Manifest header

Field

Binary format

magic_value

uint32be

length

uint64le

version

varint

compression_format

varint

magic_value

Must equal 0x0cdb3a2a

length

Length in bytes of entire manifest, including this header.

version

Must equal 0.

crc32c_checksum

CRC32C checksum of entire manifest, Length in bytes of entire manifest, including this header.

compression_format

0 for uncompressed, 1 for zstd.

Manifest configuration

Field

Binary format

uuid

ubyte[16]

manifest_kind

varint

max_inline_value_bytes

varint

max_decoded_note_bytes

varint

version_tree_arity_log2

uint8

compression_method

varint

Compression configuration

uuid

Unique 128-bit identifier for the database. If not specified explicitly, is randomly generated when the database is first created.

manifest_kind

Specifies the kind of manifest that is present. Valid values are:

0 (single)

Both the configuration and version tree are present in the manifest. When using the single file manifest kind, this is set in the manifest.ocdbt file. When using numbered file manifest kind, this is set in the manifest.xxxxxxxxxxxxxxxx files.

1 (numbered)

Indicates the numbered file manifest kind. This manifest stores only the configuration. The version tree must be retrieved from the numbered manifest.xxxxxxxxxxxxxxxx files.

max_inline_value_bytes

Maximum size of a value to store inline within a B+Tree node.

max_decoded_note_bytes

Maximum (uncompressed) size of a B+Tree node.

version_tree_arity_log2

Base-2 logarithm of the arity of the version tree.

compression_method

0 for compressed, 1 for Zstandard.

Compression configuration

If the compression_method is not 0, it is followed by the method-specific configuration.

Zstd compression configuration

Field

Binary format

level

int32le

level

Compresion level to use when writing.

Manifest version tree

Following the compression configuration, the manifest specifies references to B+tree roots and version tree nodes.

Field

Binary format

data_file_table

Data file table format

inline_versions

Version tree leaf node entries format (height = 0)

version_nodes

Interior version tree node entries (height > 0)

data_file_table

Table specifying the data files referenced by inline_versions and version_nodes.

inline_versions

References to the most recent versions.

Older versions are referenced indirectly via version_nodes.

version_nodes

References to version tree interior nodes for versions older than those referenced from inline_versions.

Field

Binary format

crc32c_checksum

uint32le

crc32c_checksum

CRC-32C checksum of the entire manifest, excluding the checksum itself.

Data file table format

Logically, the data file table is a list of (base_path[i], relative_path[i]) pairs of byte strings. The full data file path, relative to the root of the OCDBT database, is full_path[i] = transitive_path + base_path[i] + relative_path[i], where transitive_path is the transitive file path specified by the parent node:

  • For the data file table specified in the manifest, the transitive path is the empty string.

  • If a node is accessed using a data file path of transitive_path + base_path[i] + relative_path[i], the transitive path that applies to any child nodes is equal to transitive_path + base_path[i]; that is, the base_path[i] is the additional transitive portion of the path.

The maximum length of full_path[i] is 65535 bytes.

Prefix compression is used to encode the combined path[i] = base_path[i] + relative_path[i].

Field

Binary format

Count

num_files

varint

1

path_prefix_length[i]

varint

num_files - 1

path_suffix_length[i]

varint

num_files

base_path_length[i]

varint

num_files

path_suffix[i]

byte[path_suffix_length[i]]

num_files

num_files

Number of data files specified in the table.

path_prefix_length[i]

Length in bytes of common prefix of path[i] and path[i+1]. For the first path, no common prefix is stored, and implicitly path_prefix_length[-1] is defined to be 0.

path_suffix_length[i]

Length in bytes of path_suffix[i]. This is equal to length(path[i]) - path_prefix_length[i-1].

base_path_length[i]

Length in bytes of base_path[i]. To simplify decoding, it is required that if path_prefix_length[i-1] > min(base_path_length[i], base_path_length[i-1]), then base_path[i] = base_path[i-1]. That is, the common prefix must not extend past the end of the current or previous base path unless the base path is equal to the previous base path.

path_suffix[i]

Path suffix value. This is equal to path[i] with the first path_prefix_length[i-1] bytes excluded. For i = 0, path_suffix[i] = path[i].

Version tree node format

An encoded version tree node consists of:

Version tree node outer header

Field

Binary format

magic_value

uint32be

length

uint64le

version

varint

compression_format

varint

magic_value

Must equal 0x0cdb1234.

length

Length in bytes of entire version tree node, including this header.

version

Must equal 0.

compression_format

0 for uncompressed, 1 for zstd.

The remaining data is encoded according to the specified compression_format.

Version tree node inner header

Field

Binary format

version_tree_arity_log2

uint8

height

uint8

data_file_table

Data file table format

version_tree_arity_log2

Base-2 logarithm of the version tree node arity. Must match the arity specified in the manifest from which this node was reached.

height

Height of this version tree node. Leaf nodes have a height of 0.

It is required that

(height + 1) * version_tree_arity_log2 < 64
data_file_table

Table specifying the data files referenced by the node entries.

The format of the remaining data depends on the value of height.

Version tree leaf node entries format (height = 0)

The same encoded representation is used for both the entries of a leaf version tree node and for the inline_versions specified in the manfiest.

Field

Binary format

Count

num_versions

varint

1

generation_number[i]

varint

num_versions

root_height[i]

uint8

num_versions

data_file_id[i]

varint

num_versions

data_file_offset[i]

varint

num_versions

data_file_length[i]

varint

num_versions

num_keys[i]

varint

num_versions

num_tree_bytes[i]

varint

num_versions

num_indirect_value_bytes[i]

varint

num_versions

commit_time[i]

uint64le

num_versions

num_versions

Number of B+tree roots that are referenced. The value is constrained based on the value of generation_number[num_versions-1], the latest generation number referenced from the version tree node, and version_tree_arity_log2:

1 <= num_versions <= (generation_number[num_versions-1] - 1)
                   % (1 << version_tree_arity_log2)
                   + 1

Note

The same computation of num_versions applies to both leaf node entries included in a version tree node, and inline_versions included in the manfiest. In the former case, the and version_tree_arity_log2 value is obtained from the version node. In the latter case, the version_tree_arity_log2 value is taken from the manifest.

generation_number[i]

Generation number of the referenced B+tree root. Must not be 0. The generation numbers must be strictly increasing, i.e. if i < j, then generation_number[i] < generation_number[j].

root_height[i]

Height of the referenced B+tree root. Must be 0 if there is no root node.

commit_time[i]

Time at which the generation was created, in nanoseconds since the Unix epoch (excluding leap seconds).

data_file_id[i]

Specifies the data file containing the encoded root B+tree node, as an index into the data file table.

data_file_offset[i]

Specifies the starting byte offset within data_file_id[i] of the encoded root B+tree node.

data_file_length[i]

Specifies the byte length within data_file_id[i] of the encoded root B+tree node.

num_keys[i]

Specifies the total number of (leaf-node) keys within the B+tree. Note that if there is more than one path from the root to a given leaf node, the leaf node’s keys are counted more than once.

num_tree_bytes[i]

Specifies the total encoded size in bytes of all B+tree nodes reachable from the root, including the root itself. A given node is counted once for each unique path within the tree to it; if there is more than one path to a node, its size is counted multiple times.

num_indirect_value_bytes[i]

Specifies the total size in bytes of all indirectly-stored values in the B+tree. If the same stored value is referenced from multiple keys, its size is counted multiple times.

Interior version tree node entries (height > 0)

The same encoded representation is used for both the entries of an interior version tree node and for the version_nodes version tree nodes specified in the manfiest, but the interpretation differs, as described below.

Field

Binary format

Count

num_children

varint

1

generation_number[i]

varint

num_children

data_file_id[i]

varint

num_children

data_file_offset[i]

varint

num_children

data_file_length[i]

varint

num_children

num_generations[i]

varint

num_children

commit_time[i]

uint64le

num_children

When the encoded representation is used to specify version_nodes in the manifest, there is one additional field:

Field

Binary format

Count

entry_height[i]

uint8

num_children

num_children

Number of version tree nodes that are referenced.

  • When this encoded representation is used to specify the entries of an interior version tree node, num_children is constrained by:

    1 <= num_children
      <= ((generation_number[num_children-1]
           >> (version_tree_arity_log2 * height))
          - 1)
          % (1 << version_tree_arity_log2)
         + 1
    

    version_tree_arity_log2 and height are obtained from the version tree node.

generation_number[i]

Latest B+tree root generation number referenced within this subtree. Must not be 0. The generation numbers must be strictly increasing, i.e. if i < j, then generation_number[i] < generation_number[j].

data_file_id[i]

Specifies the data file containing the encoded version tree node, as an index into the data file table.

data_file_offset[i]

Specifies the starting byte offset within data_file_id[i] of the encoded version tree node.

data_file_length[i]

Specifies the byte length within data_file_id[i] of the encoded version tree node.

num_generations[i]

Total number of B+tree roots referenced within this subtree.

commit_time[i]

Commit time, in milliseconds since the Unix epoch (excluding leap seconds), of the earlier B+tree root referenced within this subtree.

Note

This is the earliest commit time referenced within this subtree, in contrast with generation_number[i], which specifies the latest generation number referenced within this subtree. Storing the earliest commit time, rather than the latest commit time, enables more efficient queries for the latest generation with commit_time<=T.

entry_height[i]

Specifies the height of the referenced version tree node.

This field is only present when the encoded representation is used to specify version_nodes in the manifest. The heights must be decreasing, i.e. entry_height[i] > entry_height[j] if i < j.

When the encoded representaiton is used to specify the entries of an interior version tree node, this field is not present and instead, for the purpose of this specification, entry_height[i] is implicitly equal to height - 1, where height is obtained from the version tree node.

Field

Binary format

crc32c_checksum

uint32le

crc32c_checksum

CRC-32C checksum of the entire version tree node, excluding the checksum itself.

B+tree node format

An encoded B+tree node consists of:

B+tree node outer header

Field

Binary format

magic_value

uint32be

length

uint64le

version

varint

compression_format

varint

magic_value

Must equal 0x0cdb20de.

length

Length in bytes of entire B+tree node, including this header.

version

Must equal 0.

compression_format

0 for uncompressed, 1 for zstd.

The remaining data is encoded according to the specified compression_format.

B+tree node inner header

Field

Binary format

height

uint8

data_file_table

Data file table format

num_entries

varint

height

Height of this B+tree node. Leaf nodes have a height of 0.

data_file_table

Table specifying the data files referenced by the node entries.

num_entries

Number of children (if height > 0) or key/value pairs (if height == 0) referenced from this node.

The format of the remaining data depends on the value of height.

Leaf B+tree node format (height = 0)

Note

B+tree leaf nodes do not directly store the full key for each key/value entry. Instead, only the relative_key for each entry is stored; the prefix that must be prepended to this relative_key to obtain the full key is defined by the path from the root node to the leaf node.

Field

Binary format

Count

key_prefix_length[i]

varint

num_entries - 1

key_suffix_length[i]

varint

num_entries

key_suffix[i]

byte[key_suffix_length[i]]

num_entries

value_length[i]

varint

num_entries

value_kind[i]

varint

num_entries

data_file_id[k]

varint

num_indirect_entries

data_file_offset[k]

varint

num_indirect_entries

value[j]

byte[direct_value_length[j]]

num_direct_entries

key_prefix_length[i]

Length in bytes of common prefix of relative_key[i] and relative_key[i+1]. For the first key, no common prefix is stored.

key_suffix_length[i]

Length in bytes of each relative key, excluding the length of the common prefix with the previous key. For the first key, the total length is stored, since there is no previous key.

key_suffix[i]

Relative key for the entry, excluding the common prefix with the previous key. For the first key, the entire key is stored. Note that the suffixes for all keys are concatenated in the encoded representation.

value_length[i]

Length in bytes of the value for this key/value entry.

value_kind[i]

Indicates how the value is stored:

  • 0 if the value is stored inline in this leaf node,

  • 1 if the value is stored out-of-line.

Based on this column, the following derived values are defined:

num_direct_entries

The number of entries for which value_kind[i] == 0.

direct_entries

The array of indices of direct values.

direct_value_length[j]

Equal to value_length[direct_values[j]].

num_indirect_entries

Equal to num_entries - num_direct_entries.

indirect_entries

The array of indices of indirect values.

data_file_id[k]

Specifies the data file containing the value for entry indirect_values[k], as an index into the data file table. Only stored for entries with out-of-line values.

data_file_offset[k]

Specifies the starting byte offset within data_file_id[k] of the value for entry indirect_values[k]. Only stored for entries with out-of-line values.

value[j]

Specifies the value for entry direct_values[j]. Only stored for entries with inline values. Note that all direct values are concatenated in the encoded representation.

Interior B+tree node format (height > 0)

Note

B+tree interior nodes do not directly store the full starting key for each child node. Instead, only the relative_key for each child node is stored; the prefix that must be prepended to this relative_key to obtain the full key is defined by the path from the root node.

Field

Binary format

Count

key_prefix_length[i]

varint

num_entries - 1

key_suffix_length[i]

varint

num_entries

subtree_common_prefix_length[i]

varint

num_entries

key_suffix[i]

byte[key_suffix_length[i]]

num_entries

data_file_id[i]

varint

num_entries

data_file_offset[i]

varint

num_entries

data_file_length[i]

varint

num_entries

num_keys[i]

varint

num_entries

num_tree_bytes[i]

varint

num_entries

num_indirect_value_bytes[i]

varint

num_entries

key_prefix_length[i]

Length in bytes of common prefix of relative_key[i] and relative_key[i+1]. For the first key, no common prefix is stored.

key_suffix_length[i]

Length in bytes of each relative key, excluding the length of the common prefix with the previous key. For the first key, the total length is stored, since there is no previous key.

subtree_common_prefix_length[i]

Length in bytes of the prefix of relative_key[i] that is common to all keys within the subtree rooted at this child node. This prefix serves as an implicit prefix of all keys within the subtree rooted at the child.

key_suffix[i]

Relative key for the entry, excluding the common prefix with the previous key. For the first key, the entire key is stored. Note that the suffixes for all keys are concatenated in the encoded representation.

data_file_id[i]

Specifies the data file containing the encoded child B+tree node, as an index into the data file table.

data_file_offset[i]

Specifies the starting byte offset within data_file_id[i] of the encoded child B+tree node.

data_file_length[i]

Specifies the byte length within data_file_id[i] of the encoded child B+tree node.

num_keys[i]

Specifies the total number of (leaf-node) keys within the subtree rooted at the child node. Note that if there is more than one path from the child node to a given leaf node, the leaf node’s keys are counted more than once.

num_tree_bytes[i]

Specifies the total encoded size in bytes of all B+tree nodes reachable from the child, including the child itself. A given node is counted once for each unique path within the tree to it; if there is more than one path to a node, its size is counted multiple times.

num_indirect_value_bytes[i]

Specifies the total size in bytes of all indirectly-stored values in the subtree rooted at the child node. If the same stored value is referenced from multiple keys, its size is counted multiple times.

Field

Binary format

crc32c_checksum

uint32le

crc32c_checksum

CRC-32C checksum of the entire B+tree node, excluding the checksum itself.