Koda is a Python + C++ library that provides a data representation and a domain-specific language (DSL) for high-performance, vectorized data transformations and for computations on the transformed data. The data can take the form of protos, tables, interconnected objects, graphs, tensors, and more.
Such transformations and computations can be designed in an interactive REPL (e.g. Colab), as well as compiled into computation graphs and served in production (i.e. C++ environments).
Here are code examples demonstrating some of Koda's capabilities.
Koda supports irregular multi-dimensional arrays (jagged arrays) without padding.
>>> from koladata import kd
>>> scores = kd.slice([[85, 90], [70], [95, 100, 88]])
>>> kd.agg_max(scores)
DataSlice([90, 70, 100], schema: INT32, ...)
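As a cross-check on the result above, here is a minimal plain-Python sketch of the same per-row aggregation. It does not use Koda and says nothing about how Koda implements `kd.agg_max`; it only shows that each row keeps its own length and that no padding values are needed.

```python
# Plain-Python reference for the aggregation above (not Koda code).
scores = [[85, 90], [70], [95, 100, 88]]

# Aggregate over the innermost dimension: one maximum per row.
# Rows have lengths 2, 1 and 3, and no padding is required.
row_max = [max(row) for row in scores]

print(row_max)  # [90, 70, 100]
```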
Koda provides C++-grade performance for vectorized operations on nested data structures and jagged arrays.
>>> users = kd.slice([
...     kd.obj(name='Alice', age=20, interests=kd.list(['tennis', 'alcohol', 'chess'])),
...     kd.obj(name='Bob', age=30, interests=kd.list(['music', 'chess'])),
...     kd.obj(name='Charlie', age=25, interests=kd.list(['tennis', 'alcohol'])),
... ])
# Simple filtering: find users older than 21. The lambda is traced and
# executed as a vectorized Koda functor (not pointwise Python code).
>>> users.select(lambda x: x.age > 21).name
DataSlice(['Bob', 'Charlie'], schema: STRING, ...)
You can perform more advanced operations on jagged arrays, such as conditional filtering across dimensions and grouping:
# Remove 'alcohol' interest for users under 21.
>>> interests = users.interests[:]
>>> interests
DataSlice([['tennis', 'alcohol', 'chess'], ['music', 'chess'], ['tennis', 'alcohol']], schema: STRING, ...)
>>> is_underage_alcohol = (interests == 'alcohol') & (users.age < 21)
>>> filtered_interests = interests.select(~is_underage_alcohol)
>>> filtered_interests
DataSlice([['tennis', 'chess'], ['music', 'chess'], ['tennis', 'alcohol']], schema: STRING, ...)
# Group users by their filtered interests.
>>> pairs = kd.new(user=users, interest=filtered_interests).flatten()
>>> grouped = kd.group_by(pairs, pairs.interest)
>>> interest_details = kd.new(interest=kd.collapse(grouped.interest),
...                           names=grouped.user.name.implode())
>>> interest_details
DataSlice([
Entity(interest='tennis', names=List['Alice', 'Charlie']),
Entity(interest='chess', names=List['Alice', 'Bob']),
Entity(interest='music', names=List['Bob']),
Entity(interest='alcohol', names=List['Charlie']),
], schema: ENTITY(interest=STRING, names=LIST[STRING]), ...)
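The filter-then-group logic above can be checked against a pointwise plain-Python sketch. This is only a reference for the expected result, using ordinary dicts and lists as stand-ins for the Koda objects; it is not vectorized and does not reflect how Koda executes these operations.

```python
# Plain-Python stand-ins for the Koda objects above (not Koda code).
users = [
    {"name": "Alice", "age": 20, "interests": ["tennis", "alcohol", "chess"]},
    {"name": "Bob", "age": 30, "interests": ["music", "chess"]},
    {"name": "Charlie", "age": 25, "interests": ["tennis", "alcohol"]},
]

# Step 1: drop 'alcohol' only for users under 21; row lengths may differ.
filtered_interests = [
    [i for i in u["interests"] if not (i == "alcohol" and u["age"] < 21)]
    for u in users
]

# Step 2: flatten to (user, interest) pairs and group user names by interest.
# dict preserves insertion order, so groups appear in first-seen order.
names_by_interest = {}
for user, interests in zip(users, filtered_interests):
    for interest in interests:
        names_by_interest.setdefault(interest, []).append(user["name"])

for interest, names in names_by_interest.items():
    print(interest, names)
```

The groups come out in first-seen order (tennis, chess, music, alcohol), which matches the order of the entities in the Koda output above.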
Koda structures are immutable, so a modified version can be created at O(1) cost by sharing the underlying data with the original.
>>> users = kd.new(
...     name=kd.slice(['Alice', 'Bob', 'Charlie']),
...     age=kd.slice([30, 25, 35]),
... )
# Create a new version with updated ages for all users.
>>> aged_users = users.with_attrs(age=users.age + 1)
>>> users.age
DataSlice([30, 25, 35], schema: INT32, ...)
>>> aged_users.age
DataSlice([31, 26, 36], schema: INT32, ...)
# Alternative: `updates` can hold one or multiple updates across different
# levels and be passed around before applying with <<.
>>> updates = kd.attrs(users, age=users.age + 1)
>>> aged_users = users << updates
>>> aged_users.age
DataSlice([31, 26, 36], schema: INT32, ...)
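The idea of an immutable structure whose modified version shares unchanged data can be illustrated with a small plain-Python analogy. The column-oriented layout and the use of `MappingProxyType` are assumptions made for illustration only. This sketch does not describe Koda's internal representation or its cost model.

```python
from types import MappingProxyType

# Hypothetical column-oriented stand-in for `users` (not Koda code).
# MappingProxyType gives a read-only view; tuples make the columns immutable.
users = MappingProxyType({
    "name": ("Alice", "Bob", "Charlie"),
    "age": (30, 25, 35),
})

# Build a new version with updated ages; the original is left untouched.
aged_users = MappingProxyType({
    **users,
    "age": tuple(age + 1 for age in users["age"]),
})

print(users["age"])       # (30, 25, 35)
print(aged_users["age"])  # (31, 26, 36)

# The unchanged "name" column is the same object in both versions, not a copy.
print(aged_users["name"] is users["name"])  # True
```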
Koda allows tracing Python functions to create reusable computational graphs (Functors). Functors are normal DataSlices and can be stored in object attributes, serialized, and served in production.
# Define a Python function for Z-score normalization.
>>> def z_score(x):
...     return (x - kd.math.agg_mean(x)) / kd.math.agg_std(x)
# Trace it into a Koda functor.
>>> normalize_fn = kd.fn(z_score)
# Store the functor in a Koda object attribute.
>>> model = kd.obj(preprocess=normalize_fn)
# Evaluate the functor stored in the object.
>>> scores = kd.slice([10.0, 20.0, 30.0])
>>> model.preprocess(scores)
DataSlice([-1.0, 0.0, 1.0], schema: FLOAT32, ...)
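The arithmetic behind this result can be verified with Python's standard `statistics` module. This is a plain-Python check, not Koda code. It assumes the sample standard deviation (divisor n - 1), which is the variant consistent with the output shown above; a population standard deviation would give roughly ±1.22 for the outer values instead.

```python
import statistics

# Plain-Python check of the z-score result above (not Koda code).
scores = [10.0, 20.0, 30.0]

mean = statistics.mean(scores)   # 20.0
std = statistics.stdev(scores)   # sample std (n - 1 divisor): 10.0

z_scores = [(x - mean) / std for x in scores]
print(z_scores)  # [-1.0, 0.0, 1.0]
```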