koladata

Home
Overview
Fundamentals
Glossary
Cheatsheet
API Reference
Quick Recipes
Deep Dive
Common Pitfalls and Gotchas
Persistent Storage

View the Project on GitHub google/koladata

Koda Overview

This guide gives a quick 60-minute overview of Koda, geared mainly for new users. See the 10 Minute Introduction for a short introduction walking through an example, and the Koda Fundamentals guide for a more detailed introduction.

Also see Koda Cheatsheet for quick references.

Why Koda

Koda Distinguishing Features

Why to Use Koda

DataSlices and Items

Koda enables vectorization by utilizing DataSlices. DataSlices are arrays with partition trees. They are stored and manipulated as jagged arrays (irregular multi-dimensional arrays). Such partition trees are called JaggedShape.

For example, the following DataSlice has 2 dimensions and 5 items. The first dimension has 2 items and the second dimension has 5 items partitioned as [3, 2].

ds = kd.slice([["one", "two", "three"], ["four", "five"]])

ds.get_ndim() # 2
ds.get_shape() # JaggedShape(2, [3, 2])

Conceptually, it can be thought as partition tree + flattened array as shown in the graph below.

digraph {
  Root -> "dim_1:0"
  Root -> "dim_1:1"
  "dim_1:0" -> "dim_2:0"
  "dim_1:0" -> "dim_2:1"
  "dim_1:0" -> "dim_2:2"
  "dim_2:0 bis" [label = "dim_2:0"]
  "dim_2:1 bis" [label = "dim_2:1"]
  "dim_1:1" -> "dim_2:0 bis"
  "dim_1:1" -> "dim_2:1 bis"
  "dim_2:0" -> one
  "dim_2:1" -> two
  "dim_2:2" -> three
  "dim_2:0 bis" -> four
  "dim_2:1 bis" -> five

  subgraph cluster_x {
    graph [style="dashed", label="Partition tree"]
    Root;"dim_1:0";"dim_1:1";"dim_2:0";"dim_2:1";"dim_2:2";"dim_2:0 bis";"dim_2:1 bis";
  }

  subgraph cluster_y {
    graph [style="dashed", label="Flattened array"]
    one;two;three;four;five
  }
}

Items are the elements of a DataSlice, and can be primitives (e.g. integers or strings), or more complex data structures (e.g. lists, dicts and entities).

A zero-dimensional DataSlice is a scalar item. It is called a DataItem.

kd.item(1, schema=kd.FLOAT32)  # 1.
kd.item(1, schema=kd.FLOAT32).to_py()  # python 1.
kd.list([10, 20, 30, 40])[2]  # 30
kd.dict({1: 'a', 2: 'b'})[2]  # 'b'
kd.from_py([{'a': [1, 2, 3], 'b': [4, 5, 6]}, {'a': 3, 'b': 4}])
kd.to_py(kd.dict({'a': [1, 2, 3], 'b': [4, 5, 6]})) # {'a': [1, 2, 3], 'b': [4, 5, 6]}

kd.slice([kd.list([1, 2, 3]), kd.list([4, 5])])  # DataSlice of lists
kd.slice([kd.dict({'a':1, 'b':2}), kd.dict({'c':3})])  # DataSlice of dicts

The DataSlice kd.slice([kd.list([1, 2, 3]), kd.list([4, 5])]) is different from the DataSlice kd.slice([[1, 2, 3], [4, 5]]), as the following example shows. However, they can be converted from/to each other as we will see later.

l1 = kd.list([1, 2, 3])
l2 = kd.list([4, 5])
list_ds = kd.slice([l1, l2])
list_ds.get_ndim()  # 1
list_ds.get_size()  # 2

int_ds = kd.slice([[1, 2, 3], [4, 5]])
int_ds.get_ndim()  # 2
int_ds.get_size()  # 5

DataSlices have different mechanisms around accessing and broadcasting compared to tensors and nested lists. That is, they specialize in aggregation from inner dimensions into outer ones.

ds = kd.slice([[1, 2, 3], [4, 5]])
# Use .S for indexing
ds.S[1, 0]  # 4
ds.S[..., :2]  # [[1, 2],[4, 5]]

# Use .take for getting items in the last dimension
ds.take(1)  # [2, 5] - 1st item in the last dimension

# Use .L to work with DataSlices as with python lists
ds.L[0]  # kd.slice([1, 2, 3])
[int(y) for x in ds.L for y in x.L]  # [1, 2, 3, 4, 5]

kd.slice([5, 6]).expand_to(ds)  # [[5, 5, 5], [6, 6]]
ds + kd.slice([6, 7]) + kd.item(8)  # [[15, 16, 17], [19, 20]]
kd.agg_max(ds)  # [3, 5]
kd.map_py(lambda x, y: x*y, ds, 2)  # [[2, 4, 6], [8, 10]]

ds = kd.slice([4, 3, 4, 2, 2, 1, 4, 1, 2])
kd.group_by(ds)  # [[4, 4, 4], [3], [2, 2, 2], [1, 1]]
kd.group_by(ds).take(0)  # [4, 3, 2, 1]
kd.unique(ds)  # the same as above

# Group_by can be used to swap dimensions, which can be used to transpose the
# final dimension of a slice.
ds = kd.slice([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
kd.group_by(ds.flatten(-2), kd.index(ds).flatten(-2))  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

Primitives

Koda supports INT32, INT64, FLOAT32, FLOAT64, STRING, BYTES, BOOLEAN, MASK as primitive types.

kd.int32(1) # INT32 DataItem
kd.int64([2, 3]) # 1-dim INT64 DataSlice
kd.float32([[1., 2.], [3.]]) # 2-dim FLOAT32 DataSlice
kd.str('string')
kd.bytes(b'bytes')
kd.bool(True)

# There is no native MASK type in Python, thus we have them directly in kd
kd.present
kd.missing

Entities, Schemas, ItemIds and Universally-Unique ItemIds

To work with structured data, Koda utilizes a concept of entities that are identified by ItemIds (128-bit ids) and can have attributes, which can be primitives or other entities. Entities can have schema assigned, and multiple entities can share the same schema, which improves performance for vectorized operations.

# kd.new creates new entities and assigns schemas to them
kd.new(x=1, y=2, schema='Point')

# Can also explicitly create schema with attributes before using.
my_schema = kd.named_schema('Point', x=kd.INT32, y=kd.INT32)
x = kd.new(x=1, y=2, schema=my_schema)
x.get_schema() == my_schema  # yes

# When converting from py, can specify schema
kd.from_py({'x': 1, 'y': 2}, schema=my_schema)

# It's possible to create nested entities
x = kd.new(a=1, b=kd.new(c=3, schema='Inner'), schema='Outer')

Entities are immutable by default, and it’s possible to create multiple slightly different versions with O(1) cost.

x = x.with_attrs(d=4)  # add an attribute
x = x.updated(kd.attrs(x.b, c=5))  # update nested attribute 'c'

# Entities can be cloned or deep cloned
x = kd.new(a=1, b=kd.new(c=3, schema='Inner'), schema='Outer')
x1 = x.clone(a=2)
x1.get_itemid() != x.get_itemid()  # yes
x1.b.get_itemid() == x.b.get_itemid()  # yes
x2 = x.deep_clone(a=2)
x2.b.get_itemid() != x.b.get_itemid()  # yes

Both entities and schemas can be dynamically allocated or be universally-unique (i.e. have the same ItemIds across processes/machines).

# Instead of specifying schemas, can auto-allocate them
x = kd.new(a=1, b=kd.new(c=3))

# Entities with auto-allocated schemas cannot be mixed together in vectorized ops
kd.new(x=1).get_schema() != kd.new(x=1).get_schema()  # yes

# Auto-allocated schemas can be cast to have the same schema
x, y = kd.new(a=1), kd.new(b=2)  # two entites with different schemas
kd.slice([x, y.with_schema(x.get_schema())])

# Universally unique entities can be used similarly to named tuples
kd.uu(x=1, y=kd.uu(z=3))
kd.dict({kd.uu(x=1, y=2): 10, kd.uu(x=2, y=3): 20})[kd.uu(x=1, y=2)]  # 10

# Dynamically allocated entities have different ids
kd.new(x=1, y=2).get_itemid() != kd.new(x=1, y=2).get_itemid()  # yes
# Universally-uniquely allocated entities have always the same ids
kd.uu(x=1, y=2).get_itemid() == kd.uu(x=1, y=2).get_itemid()  # yes

# Can encode itemid's into strings
kd.encode_itemid(kd.new(x=1, y=2))  # always different, as ids are allocated
kd.encode_itemid(kd.uu(x=1, y=2)) == '07aXeaqDy6UJNv8EUfA0jz'  # always the same

DataSlices of Structured Data, Explosion/Implosion and Attribute Access

When working with DataSlices of structured data (entities, lists, and dicts), the operation is applied to all the items in the DataSlice simultaneously.

# Root
# ├── dim_1:0
# │   ├── dim_2:0 -> kd.new(x=1, y=20, schema='Point')
# │   └── dim_2:1 -> kd.new(x=2, y=30, schema='Point')
# └── dim_1:1
#     ├── dim_2:0 -> kd.new(x=3, y=40, schema='Point')
#     ├── dim_2:1 -> kd.new(x=4, y=50, schema='Point')
#     └── dim_2:2 -> kd.new(x=5, y=60, schema='Point')
kd.slice([
    [
      kd.new(x=1, y=20, schema='Point'),
      kd.new(x=2, y=30, schema='Point')
    ],
    [
      kd.new(x=3, y=40, schema='Point'),
      kd.new(x=4, y=50, schema='Point'),
      kd.new(x=5, y=60, schema='Point')
    ]
])

# Root
# ├── dim_1:0 -> kd.list([20, 30])
# └── dim_1:1 -> kd.list([40, 50, 60])
kd.slice([kd.list([20, 30]), kd.list([40, 50, 60])])

# Root
# ├── dim_1:0
# │   ├── dim_2:0 -> kd.dict({'a': 1,'b': 2})
# │   └── dim_2:1 -> kd.dict({'b': 3,'c': 4})
# └── dim_1:1
#     └── dim_2:0 -> kd.dict({'a': 5,'b': 6,'c': 7})
kd.slice([[kd.dict({'a': 1,'b': 2}), kd.dict({'b': 3,'c': 4})],
 [kd.dict({'a': 5,'b': 6,'c': 7})]])

As a result of attribute access of a DataSlice of entities, a new DataSlice is returned, which contains attributes of every corresponding entity in the original DataSlice.

a = kd.slice([kd.new(x=1, schema='Foo'),
              kd.new(x=2, schema='Foo'),
              kd.new(x=3, schema='Foo')])
a  # [Entity(x=1), Entity(x=2), Entity(x=3)]
a.x  # [1, 2, 3]

a = kd.new(x=kd.slice([1, 2, 3]), schema='Foo')  # The same as above, but more compact

b = kd.slice([kd.new(x=1, schema='Foo'),
              kd.new(schema='Foo'),
              kd.new(x=3, schema='Foo')])
b  # [Entity(x=1), Entity(), Entity(x=3)]

b.maybe('x')  # [1, None, 3] - only the first one has an attribute 'x'

When accessing a single element of a DataSlice of lists or a key of a DataSlice of dicts, a new DataSlice is returned with the corresponding values in the original lists and dicts.

a = kd.slice([kd.list([1, 2, 3]), kd.list([4, 5])])
# Access 1st item in each list
a[1]  # [2, 5] == [list0[1], list1[1]]

a = kd.slice([kd.dict({'a': 1, 'b': 2}), kd.dict({'b': 3, 'c': 4})])
a['c']  # [None, 4] == [dict0['c'], dict1['c']]

A common operation is explosion of DataSlices of lists, when we return a new DataSlice with an extra dimension, where the innermost dimension is composed of the values of the original lists.

a = kd.slice([kd.list([1, 2, 3]), kd.list([4, 5])])
# Access 1st item in each list
a[1]  # [2, 5] == [list0[1], list1[1]]

# "Explosion": add another dimension to the DataSlice
# That is, a 1-dim DataSlice of lists becomes a 2-dim DataSlice
a[:]  # [[1, 2, 3],[4, 5]]

# "Explosion" of the first two items in each list
a[:2]  # [[1, 2], [4, 5]]
a[:].get_ndim() == a.get_ndim() + 1  # explosion adds one dimension

An opposite operation is implosion, when we return a DataSlice of lists with one fewer dimension, where each list contains the values of the innermost dimension of the original DataSlice.

# Implode replaces the last dimension with lists
a = kd.slice([[1, 2, 3], [4, 5]])
kd.implode(a)  # kd.slice([kd.list([1,2,3]), kd.list([4,5])])
kd.implode(a)[:]  # == a

Getting all keys or values of a DataSlice of dicts will return a DataSlice with one more dimension.

a = kd.slice([kd.dict({'a': 1, 'b': 2}), kd.dict({'b': 3, 'c': 4})])

a.get_keys() # [['a', 'b'], ['b', 'c']]
a.get_values() # [[1, 2], [3, 4]]
# shortcut for get_value
a[:] # [[1, 2], [3, 4]]

# note, get_keys() doesn't guarantee to preserve the order, but we can sort before lookup
a[kd.sort(a.get_keys())]  # [[1, 2], [3, 4]]
a.get_keys().get_ndim() == a.get_ndim() + 1  # the keys DataSlice has one more dimension

Here is an example that puts everything together.

a = kd.from_py([{'x': 1}, {'x': 3}], dict_as_obj=True)
b = kd.from_py([{'y': 2}, {'y': 4}])

a[:].x + b[:]['y']  # [3, 7]
kd.zip(kd.agg_sum(a[:].x), kd.agg_sum(b[:]['y']))  # [4, 6]

Objects

To make possible mixing different primitives or entities/lists/dicts with different schemas in a single DataSlice, Koda uses objects, which store their schema in their data.

There are two main kinds of objects in Koda:

kd.obj(x=2, y=kd.obj(z=3))

x = kd.uuobj(x=2, y=kd.uuobj(z=3))  # universally unique (always the same id)
kd.encode_itemid(x) == '07ZWVFWxz9lNirDW8RXBlw'  # always the same id

x = kd.from_py([{'a': 1, 'b': 2}, {'c': 3, 'd': 4}], dict_as_obj=True)
x[0]  # Obj(a=1, b=2)
x[:].maybe('a')  # [1, None]

# Mix objects with different schemas
kd.slice([kd.obj(1), kd.obj("hello"), kd.obj([1, 2, 3])])
kd.slice([kd.obj(x=1, y=2), kd.obj(x="hello", y="world"), kd.obj(1)])

kd.obj(x=1).get_schema() # kd.OBJECT
kd.obj(x=1).get_schema() == kd.obj(1).get_schema()  # yes

# Get per-item schemas stored in every object
kd.obj(x=1).get_obj_schema() # IMPLICIT_SCHEMA(x=INT32)
kd.obj(x=1).get_obj_schema() != kd.obj(1).get_obj_schema()  # yes, different actual schemas
kd.slice([kd.obj(x=1,y=2), kd.obj(x="hello", y="world"), kd.obj(1)]).get_obj_schema()
# [IMPLICIT_SCHEMA(x=INT32, y=INT32), IMPLICIT_SCHEMA(x=STRING, y=STRING), INT32]

Similar to entities, objects can be modified with a cost of O(1), cloned or deep cloned.

x = kd.obj(x=2, y=kd.obj(z=3))
x = x.with_attrs(a=4)  # add attribute
x = x.updated(kd.attrs(x.y, z=5))  # update nested attribute

x1 = x.clone(z=4)
x2 = x.deep_clone(z=5)

Entities and objects can be converted to each other.

x, y = kd.new(a=1), kd.new(b=2)
kd.slice([kd.obj(x), kd.obj(y)])  # convert both entities to objects

# Objects can be converted to entities
my_schema = kd.named_schema('Point', x=kd.INT32, y=kd.INT32)
kd.obj(x=1, y=2).with_schema(my_schema)
kd.from_py({'x': 1, 'y': 2}, dict_as_obj=True).with_schema(my_schema)  # the same as above

Note: Compared to entities, objects have a higher performance overhead during vectorized operations, as each object in a DataSlice has its own schema, and different objects in the same DataSlice might have different sets of attributes. For large data, the use of entities with explicit schemas is recommended for faster execution.

Similar to entities, lists and dicts can be objects too.

l1 = kd.list([1, 2])
l2 = kd.list(['3', '4'])
l_objs = kd.slice([kd.obj(l1), kd.obj(l2)])
l_objs[:]  # [[1, 2], ['3', '4']]
assert l_objs.get_schema() == kd.OBJECT
l_objs.get_obj_schema()  # [LIST[INT32], LIST[STRING]]

d1 = kd.dict({'a': 1})
d2 = kd.dict({2: True})
d_objs = kd.slice([kd.obj(d1), kd.obj(d2)])
d_objs[:]  # [{'a': 1}, {2: True}]
assert d_objs.get_schema() == kd.OBJECT
d_objs.get_obj_schema()  # [DICT{STRING, INT32}, DICT{INT32, BOOLEAN}]

Primitives are also objects. Their schemas are inferred from their values.

kd.obj(1)
kd.obj(kd.int64(1))
kd.obj('hello')

assert kd.obj(1).get_schema() == kd.OBJECT
assert kd.obj(1).get_obj_schema() == kd.INT32

# Dict values are objects
# No need to wrap them using kd.obj
d = kd.dict({'a': 1, 'b': '2'})
d.get_schema()  # DICT{STRING, OBJECT}

Sparsity and Masks

Sparsity is a first-class concept in Koda. Every item in a DataSlice can be present or missing and all operators support missing values.

a = kd.slice([[1, None], [4]])
b = kd.slice([[None, kd.obj(x=1)], [kd.obj(x=2)]])
a + b.x  # [[None, None], [6]]
kd.agg_any(kd.has(a)) # [present, present]
kd.agg_all(kd.has(a)) # [missing, present]

Masks are used to represent present/missing state. They are also used in comparison and logical operations.

# Get the sparsity of a DataSlice
kd.has(kd.slice([[1, None], [4]])) # [[present, missing], [present]]
kd.slice([1, None, 3, 4]) != 3  # [present, missing, missing, present]
kd.slice([1, 2, 3, 4]) > 2  # [missing, missing, present, present]

Using masks instead of Booleans in comparison and logical operations is useful because masks have a 2-valued logic. In the presence of missing values, the Booleans have a 3-valued logic (over True, False, missing), which is more complex and confusing.

Masks can be used to filter or select items in a DataSlice. The difference is that filtering does not change the shape of the DataSlice and filtered out items become missing, while selection changes the shape by only keeping selected items in the resulting DataSlice.

x = kd.slice([1, 2, 3, 4])
y = kd.slice([4, 5, 6, 7])

# To filter a DataSlice based on masks, use kd.apply_mask or & as shortcut
kd.apply_mask(x, y >= 6)
x & (y >= 6) # ([None, None, 3, 4]

a = kd.obj(x=kd.slice([1,2,3]))
a.x >= 2  # [missing, present, present]
a &= a.x >= 2  # [None, kd.obj(x=2), kd.obj(x=3)]

# Can use 'select' to filter DataSlices
a = kd.slice([kd.obj(x=1), kd.obj(x=2), kd.obj(x=3)])
a1 = a.select(a.x >= 2)  # [Obj(x=2), Obj(x=3)]
a1 = a.select(lambda u: u.x >= 2)  # the same as above
a1 = (a & (a.x >= 2)).select_present()  # the same as above

# Can update attributes only of selected entities/objects
a.updated(kd.attrs(a1, y=a1.x*2))  # [Obj(x=1), Obj(x=2, y=4), Obj(x=3, y=6)]

DataSlices with compatible shapes can coalesced (i.e., missing items are replaced by corresponding items of the other DataSlice).

kd.str(None)  # "None" string
kd.str(None) | 'hello'  # 'hello'

kd.coalesce(kd.slice([1, None, 3]), kd.slice([4,5,6]))  # [1,5,3]
kd.slice([1, None, 3]) | kd.slice([4,5,6])  # the same as above

Immutable Workflows and Bags (Collections of Attributes)

Immutability

Koda’s data structures are immutable. However, Koda offers various efficient ways to create modified copies of your data, including complex edits and joins. These copies share the same underlying memory whenever possible, and many operations, such as edits and joins, are performed in O(1) time. Immutability means that modifying an entity/list/dict does not automatically modify the original entity/list/dict, even if they share the same ItemId.

NOTE: Mutable APIs is available only in advanced, high-performance workflows, but with trade-offs. They require a deeper understanding of Koda data model and it is easier to make mistakes which can be hard to debug.

a = kd.new(x=2, y=kd.new(z=3))
# update existing attribute and add a new attribute
a1 = a.with_attrs(x=4, u=5)  # Entity(u=5, x=4, y=Entity(z=3))
# a stays the same as it is immutable
a # Entity(x=2, y=Entity(z=3))

b = kd.dict({'a': 1, 'b': 2})
# update with a new key/value pair
b1 = b.with_dict_update('a', 2) # Dict{'a'=2, 'b'=2}
# update with another dict
b2 = b.with_dict_update(kd.dict({'a': 3, 'c': 4})) # Dict{'c'=4, 'a'=3, 'b'=2}
# b stays the same as it is immutable
b # Dict{'a'=1, 'b'=2}

c = kd.list([1, 2, 3])
# Create a new list with a distinct ItemId by concatenating two lists
c1 = kd.concat_lists(c, kd.list([4, 5]))  # List[1, 2, 3, 4, 5]
# Or create a new list with a distinct ItemId by appending a DataSlice
c2 = kd.appended_list(c, kd.slice([4, 5]))  # List[1, 2, 3, 4, 5]
# c stays the same as it is immutable
c # List[1, 2, 3]

Bags

To support modifications and joins in an immutable environment, Koda utilizes bags. Bags are collections of attributes. Each attribute within a bag is a mapping: (itemid, attribute_name) -> value. All data structures (including entities, dicts, and lists) are represented in this manner, and associated bags are accessible via the get_bag() method. These mappings are stored using a combination of hash maps and arrays. This hybrid approach enables fast, vectorized performance for table-like data while supporting data with complex structure and sparsity. Bags are merged for O(1) by utilizing a concept of fallbacks (when we don’t find a mapping in one bag, we look it up in the other ones). Such a chain of fallback bags can be merged into a single bag when higher lookup performance is required.

NOTE: Almost all data (e.g. entities, dicts, lists, objects, schemas) are stored as attributes in bags.

a = kd.obj(x=2, y=kd.obj(z=3))

# Get the bag associated with a DataSlice
db = a.get_bag()
# Get quick stats of a bag, use its repr
db
# Print out all attributes
db.contents_repr()
# Print out only data attributes
db.data_triples_repr()
# Print out only schema attributes
db.schema_triples_repr()
# Get approximate size (e.g. number of attributes)
db.get_approx_size()

Optional: Understand how entities are represented as attributes.

a = kd.obj(x=2, y=kd.obj(z=3))

a.get_bag()
# DataBag $b933:
#   2 Entities/Objects with 3 values in 3 attrs
#   0 non empty Lists with 0 items
#   0 non empty Dicts with 0 key/value entries
#   2 schemas with 3 values
#
# Top attrs:
#   z: 1 values
#   y: 1 values
#   x: 1 values


a.get_bag().contents_repr()
# DataBag $b933:
# $004QVkgdETIyelSHQw2OHs.get_obj_schema() => #6wL3PlezPQ4FnldO95VeD6
# $004QVkgdETIyelSHQw2OHs.z => 3
# $004QVkgdETIyelSHQw2OHt.get_obj_schema() => #6wK1lU2l9FsTJjliNOPBih
# $004QVkgdETIyelSHQw2OHt.x => 2
# $004QVkgdETIyelSHQw2OHt.y => $004QVkgdETIyelSHQw2OHs
#
# SchemaBag:
# #6wK1lU2l9FsTJjliNOPBih.x => INT32
# #6wK1lU2l9FsTJjliNOPBih.y => OBJECT
# #6wL3PlezPQ4FnldO95VeD6.z => INT32

Optional: Understand how dicts are represented as attributes.

b = kd.dict({'a': 1, 'b': 2})

b.get_bag()
# DataBag $4863:
#   0 Entities/Objects with 0 values in 0 attrs
#   0 non empty Lists with 0 items
#   1 non empty Dicts with 2 key/value entries
#   1 schemas with 2 values
#
# Top attrs:


b.get_bag().contents_repr()
# DataBag $4863:
# $0UGItpaGKCXaPsOxEscFOA['a'] => 1
# $0UGItpaGKCXaPsOxEscFOA['b'] => 2
#
# SchemaBag:
# #7QRy2BAblHHNytHxFmpaGL.get_key_schema() => STRING
# #7QRy2BAblHHNytHxFmpaGL.get_value_schema() => INT32

Optional: Understand how lists are represented as attributes.

c = kd.list([1, 2, 3])

c.get_bag()
# DataBag $10cc:
#  0 Entities/Objects with 0 values in 0 attrs
#  1 non empty Lists with 3 items
#  0 non empty Dicts with 0 key/value entries
#  1 schemas with 1 values
#
# Top attrs:


c.get_bag().contents_repr()
# DataBag $10cc:
# $0FAMhn8RmKvHXJvcKuKJq3[:] => [1, 2, 3]
#
# SchemaBag:
# #7QUhePdHCvsoCyAWAHwtzx.get_item_schema() => INT32

Representing Updates as Bags

Instead of creating a modified object/dict/list directly, we typically create a bag that contains the data updates. Updates are applied in O(1) time by the fallback mechanism described above. Updates overwrite existing attributes or add new ones.

a = kd.obj(x=2, y=kd.obj(z=3))
# update existing attribute and add a new attribute
upd = kd.attrs(a, x=4, u=5)
# To see its contents, you can do
upd.contents_repr()
a1 = a.updated(upd) # Obj(u=5, x=4, y=Obj(z=3))

b = kd.dict({'a': 1, 'b': 2})
# update with a new key/value pair
upd = kd.dict_update(b, 'a', 2)
b1 = b.updated(upd) # Dict{'a'=2, 'b'=2}
# update with another dict
upd = kd.dict_update(b, kd.dict({'a': 3, 'c': 4}))
b2 = b.updated(upd) # Dict{'c'=4, 'a'=3, 'b'=2}

# Schemas are stored and can be updated in the same way
a = kd.new(x=1, schema='MySchema')
a.updated(kd.attrs(a.get_schema(), y=kd.INT32))  # update the schema

Optional: Understand how entity/object updates are represented as attributes.

a = kd.obj(x=2, y=kd.obj(z=3))
upd = kd.attrs(a, x=4, u=5)

# Only modification is stored as attributes in the update bag
upd
# DataBag $7de8:
#   1 Entities/Objects with 2 values in 2 attrs
#   0 non empty Lists with 0 items
#   0 non empty Dicts with 0 key/value entries
#   1 schemas with 2 values
#
# Top attrs:
#   x: 1 values
#   u: 1 values


upd.contents_repr()
# DataBag $7de8:
# $004QVkgdETIyelSHQw2OHz.get_obj_schema() => #6wGECIg4UYvAsDXKvqtJQC
# $004QVkgdETIyelSHQw2OHz.u => 5
# $004QVkgdETIyelSHQw2OHz.x => 4
#
# SchemaBag:
# #6wGECIg4UYvAsDXKvqtJQC.u => INT32
# #6wGECIg4UYvAsDXKvqtJQC.x => INT32

Optional: Understand how dict update is represented as attributes.

b = kd.dict({'a': 1, 'b': 2})
upd = kd.dict_update(b, kd.dict({'a': 3, 'c': 4}))

upd
# DataBag $7806:
#   0 Entities/Objects with 0 values in 0 attrs
#   0 non empty Lists with 0 items
#   1 non empty Dicts with 2 key/value entries
#  1 schemas with 2 values
#
# Top attrs:


upd.contents_repr()
# DataBag $7806:
# $0UGItpaGKCXaPsOxEscFOG['a'] => 3
# $0UGItpaGKCXaPsOxEscFOG['c'] => 4
#
# SchemaBag:
# #7QRy2BAblHHNytHxFmpaGL.get_key_schema() => STRING
# #7QRy2BAblHHNytHxFmpaGL.get_value_schema() => INT32

Here is a more complex example that puts everything together.

a = kd.obj(x=2, y=kd.obj(z=3), z=kd.dict({'a': 1, 'b': 2}), t=kd.list([1,2,3]))
upd = kd.attrs(a, x=4, u=5)  # create data update
a1 = a.updated(upd)
a1 = a.with_attrs(x=4, u=5)  # a shortcut for above
a2 = a.updated(kd.attrs(a.y, z=5))  # update a deep attribute a.y.z
a3 = a.updated(kd.dict_update(a.z, 'b', 3))  # update dict
a4 = a.updated(kd.dict_update(a.z, kd.dict({'b': 3, 'c': 4})))  # multi-update dict
a5 = a.with_attrs(t=kd.concat_lists(a.t, kd.list([4, 5])))  # lists need to be updated as whole

Instead of being applied immediately, updates can be accumulated and applied later.

# Updates can be composed using << and >>, which defines what overwrites what
kd.attrs(a, x=3, y=4) << kd.attrs(a, x=5, u=6)  # == kd.attrs(a, x=5, y=4, u=6)
kd.attrs(a, x=3, y=4) >> kd.attrs(a, x=5, u=6)  # == kd.attrs(a, x=3, y=4, u=6)

# Updates can be accumulated and applied later
a = kd.obj(x=2, y=kd.obj(z=3))
upd = kd.bag()  # empty update
upd <<= kd.attrs(a, x=a.x + 1)
upd <<= kd.attrs(a, x=a.updated(upd).x + 2)  # can use a.x with an update
upd <<= kd.attrs(a, u=a.y.z + a.updated(upd).x)
a6 = a.updated(upd)  # Obj(x=5, y=Obj(z=3), u=8)

All APIs and concepts mentioned above support vectorization using DataSlice.

a = kd.new(x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6]))  # [Entity(x=1, y=4), Entity(x=2, y=5), Entity(x=3, y=6)]
a.with_attrs(z=kd.slice([7, 8, 9]))  # [Entity(x=1, y=4, z=7), Entity(x=2, y=5, z=8), Entity(x=3, y=6, z=9)]
a.updated(kd.attrs(a, x=kd.slice([10, 11, 12])))  # [Entity(x=10, y=4), Entity(x=11, y=5), Entity(x=12, y=6)]
# Updates can target a subset of entities by utilizing sparsity
a.updated(kd.attrs(a & (a.y >=5), z=kd.slice([7, 8, 9]))).z  # [None, 8, 9]

Enrichments

Data can be enriched with data from another source. Enrichment only adds missing attributes without overwriting existing ones. This operation is also O(1).

TIP: The key difference between update and enrichment is that update overrides existing attributes while enrichment does not. Enrichment augments the attributes using the fallback mechanism described above.

a = kd.obj(x=2, y=kd.obj(z=3))
a_attrs = kd.attrs(a, x=1, u=5)

a.updated(a_attrs) # Obj(u=5, x=1, y=Obj(z=3))
a.enriched(a_attrs) # Obj(u=5, x=2, y=Obj(z=3))

Extraction

Updates and enrichments merge multiple bags together. After enrichments and updates, merged bags may contain attributes irrelevant to a specific object or entity (e.g. enrichment with unrelated data). The ds.extract_bag() method allows extracting a bag containing only relevant attributes (those recursively accessible from ds). ds.extract() is equivalent to ds.with_bag(ds.extract_bag()).

a = kd.obj(x=2, y=kd.obj(z=3), z=kd.dict({'a': 1, 'b': 2}), t=kd.list([1, 2, 3]))
a.get_bag().get_approx_size()  # 20

ay1 = a.y.with_attrs(z=4, zz=5)  # Modify ay1 on its own
ay1.get_bag().get_approx_size()  # 25, as it contains all attributes from a

extracted_bag = ay1.extract_bag()
extracted_bag.get_approx_size()  # 5, only it only contain ay1 attributes

a.updated(extracted_bag)  # Apply the attributes from ay1 to a
a.enriched(extracted_bag)  # Instead, this augments the data without overwriting

# If data is enriched with unrelated data, we can use extract to remove it

b = kd.obj(x=1, y=2)
a9 = a.enriched(kd.attrs(b, z=3))  # adding an unrelated attribute
a9.extract()  # == a9 with inaccessible attributes removed

Cloning

Cloning is another way to work in immutable way, but it allocates new ItemIds. Thus, it is more expensive and data cannot be joined later.

a = kd.obj(x=2, y=kd.obj(z=3))
a.clone(x=3).get_itemid() != a.get_itemid()  # yes
a.with_attrs(x=3).get_itemid() == a.get_itemid()  # yes
a.updated(a.clone(x=3).get_bag())  # doesn't overwrite x
a.updated(a.with_attrs(x=3).get_bag())  # overwrites x

Functors and Lazy Evaluation

Koda supports both eager computation and lazy evaluation similar to TF, JAX and PyTorch.

Lazy evaluation has the following benefits:

Koda uses tracing and converts Python functions into Koda functors representing computational graphs. Koda functors are special Koda objects that can be used for evaluation or could be stored together with data.

As is the case in JAX, in order for tracing to work correctly, the Python function can only contain Koda operators. Python control flow (i.e. if, for and while) is executed only during tracing - the resulting functors will depend on the Python control flow but will not include operators that mimic the Python control flow. To trace a Python function, we wrap it with kd.fn(py_fn).

# Convert python functions into functors
fn = kd.fn(lambda x, y: x+y, y=1)
fn(kd.slice([1, 2, 3]))  # [2, 3, 4]
fn(kd.slice([1, 2, 3]), y=10)  # [11, 12, 13]

# Functors can be used as normal Koda objects (assigned and stored)
fns = kd.obj(fn1=fn, fn2=fn.bind(y=5))
fns.fn2(kd.slice([1, 2, 3]))  # [6, 7, 8]

# Functors can be serialized
kd.dumps(fn)

kd.py_fn can be used in interactive workflows to wrap python functions without tracing, which can make debugging in certain situations easier.

# kd.fn uses tracing, and kd.py_fn wraps a Python functions "as-is", which is
# useful because not everything can be traced
fn = kd.py_fn(lambda x, y: x if kd.sum(x) > kd.sum(y) else y)
fn(x=kd.slice([1, 2, 3]), y=kd.slice([4, 5]))  # [4, 5]

You can annotate functions that call other functors with @kd.trace_as_fn. When such an annotated function is traced to produce a functor, then the inner functors can be accessed as attributes.

# functor_factory=kd.py_fn because the Python `while` cannot be traced properly.
@kd.trace_as_fn(functor_factory=kd.py_fn)
def my_op(x, y):
  while (x > 0): y += x; x -= 1
  return y

@kd.trace_as_fn()
def final(x, y, z): return my_op(my_op(x, y), z)

fn = kd.fn(final)
fn(2, 3, 4)  # 9 = (2 + 3) + 4
fn.final(2, 3, 4)  # the same as above
fn.final.my_op(2, 3)  # 5 - access of the deeper functor
# "replace" the inner functor my_op
fn.updated(kd.attrs(fn.final, my_op=kd.fn(lambda x, y: x * y)))(2, 3, 4)  # 24 = (2 * 3) * 4

Convenience Features

Koda provides a comprehensive list of convenience features including:

String manipulations

x, y = kd.slice([1, 2, 3]), kd.slice(["a", "b", "c"])
kd.fstr(f"i{x:i}-{y:s}")  # ['i1-a', 'i2-b', 'i3-c']
kd.strings.format("i{x}-{y}", x=x, y=y)  # the same as above
kd.strings.split(kd.slice(["a b", "c d e"]))  # [['a', 'b'], ['c', 'd', 'e']]
ds = kd.slice([['aa', 'bbb'], ['ccc', 'dd']])
kd.strings.agg_join(ds, '-')  # ['aa-bbb', 'ccc-dd']
kd.strings.agg_join(ds, '-', ndim=2)  # ['aa-bbb-ccc-dd']
kd.strings.length(ds)  # [[2, 3], [3, 2]]

Math operators

# Math
x = kd.slice([[3., -1., 2.], [0.5, -0.7]])
y = kd.slice([[1., 2., 0.5], [0.9, 0.3]])
x * y
kd.math.agg_mean(x)
kd.math.log10(x)
kd.math.pow(x,y)

Random Numbers & Sampling

x = kd.slice([[1., 2., 3.], [4., 5.]])
# Generate random integers
kd.randint_like(x, seed=123)  # fix seed
kd.sample(x, ratio=0.7, seed=42)
kd.sample_n(x, n=2, seed=342)

Ranking

x = kd.slice([[5., 4., 6., 4., 5.], [8., None, 2.]])
kd.ordinal_rank(x)  # [[2, 0, 4, 1, 3], [1, None, 0]]
kd.ordinal_rank(x, descending=True)  # [[1, 3, 0, 4, 2], [0, None, 1]]
kd.dense_rank(x)  # [[1, 0, 2, 0, 1], [1, None, 0]]

Containers to manage data

# Editable containers
x = kd.container()
x.a = 1
x.d = kd.container()
x.d.e = 4
x.d.f = kd.list([1, 2, 3])

Serialization

# DataSlice
serialized_bytes = kd.dumps(ds)
ds = kd.loads(serialized_bytes)

# Bag
serialized_bytes = kd.dumps(db)
db = kd.loads(serialized_bytes)

Multi-threading

def call_slow_fn(prompt):
  return slow_fn(prompt)
kd.map_py(call_slow_fn, kd.slice(['hello', None, 'world']), max_threads=16)

Mutable Workflows

Immutable workflows are recommended for most cases, but mutable data structures can be useful in some situations. In particular, when a Koda data structure is updated frequently, then a mutable version might be more efficient. An example of this is a Koda data structure that implements a cache.

The fork_bag method returns a new mutable version of the underlying data bag at the cost of O(1) (it uses copy-on-write and does not modify the original bag), while freeze_bag returns an immutable version of the underlying data bag (also for O(1)). The extract and clone methods extract and return immutable pieces of a bag.

Please keep in mind that mutable workflows are not supported in tracing. They consequently have more limited options for productionization.

# Modify the same dict many times
d = kd.dict()  # immutable
d = d.fork_bag()  # mutable
for i in range(100):
  # Insert random x=>y mappings (10 at a time)
  k = kd.random.randint_shaped(kd.shapes.new(10))
  v = kd.random.randint_shaped(kd.shapes.new(10))
  d[k] = v
d = d.freeze_bag()  # immutable

a = kd.obj(x=1, y=2).fork_bag()
a.y = 3
a.z = 4
a = a.freeze_bag()

Interoperability

Koda can easily interoperate with normal Python, Pandas, Numpy and Proto.

From/to standard Python data structures

kd.obj(x=1, y=2).x.to_py()
kd.slice([[1, 2], [3, 4]]).to_py()
kd.obj(x=1, y=kd.obj(z=kd.obj(a=1))).to_py(max_depth=-1)
kd.list([1, 2, 3]).to_py()
kd.from_py([{'a': [1, 2, 3], 'b': [4, 5, 6]}, {'a': 3, 'b': 4}])
kd.to_py(kd.dict({'a': [1, 2, 3], 'b': [4, 5, 6]}))  # {'a': [1, 2, 3], 'b': [4, 5, 6]}

From/to Pandas DataFrames

import pandas as pd
from koladata.ext import pdkd

pdkd.from_dataframe(pd.DataFrame(dict(x=[1, 2, 3], y=[10, 20, 30])))
pdkd.to_dataframe(kd.obj(x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6])), cols=['x', 'y'])

From/to Numpy Arrays

import numpy as np
from koladata.ext import npkd

npkd.to_array(kd.slice([1, 2, None, 3]))
npkd.from_array(np.array([1, 2, 0, 3]))

From/to Proto

message MessageA {
  optional string text = 1;
  optional MessageB b = 2;
  repeated MessageB b_list = 3;
}

message MessageB {
  optional int32 int = 1;
}
p1 = MessageA(text='txt1', b_list=[MessageB(int=1), MessageB(int=2)])
p2 = MessageA(ext='txt2', b=MessageB(int=3))

ds1 = kd.from_proto(p1) # Entity(text='txt1', b_list=List[Entity(int=1), Entity(int=2)])
ds2 = kd.from_proto([p1, None, p2])
# [
#   Entity(text='txt1', b_list=List[Entity(int=1), Entity(int=2)]),
#   missing,
#   Entity(text='txt2', b=Entity(int=3)),
# ]

p = kd.to_proto(ds1, MessageA)
p_list = kd.to_proto(ds2, MessageA)