This is a 10-minute introduction to Koda intended for absolute beginners. See the Getting Started guides for more in-depth material.
DataSlice is the most basic data structure
in Koda. A DataSlice is a multi-dimensional jagged array holding data of any
type (e.g. integers, strings, structured data), including mixed data. This
section showcases this fundamental data structure through a motivating example.
Imagine a school with two classes, each having a different number of students. The scorebook, containing the scores of individual students, can be represented using a nested Python list of integers:
```py {.pycon-doctest}
>>> from koladata import kd
>>> scores_of_students_in_a_school = [
...   [10, 20, 30],  # class A: three students.
...   [40, 50, None, 69]  # class B: four students, the score for one is unknown.
... ]
```
We can use this scorebook directly in Koda:
```py {.pycon-doctest}
>>> scores_slice = kd.slice(
...   scores_of_students_in_a_school)
>>> scores_slice
DataSlice([[10, 20, 30], [40, 50, None, 69]],…)
>>> scores_slice.get_shape()
JaggedShape(2, [3, 4])
>>> scores_slice.get_schema()
DataItem(INT32, schema: SCHEMA)
```
The scores_slice variable holds a Koda DataSlice. The DataSlice keeps
track of the scores, and also of the shape
and schema (type) of the data. The associated
JaggedShape(2, [3, 4]) shape indicates that there are two classes, with the
first class having three students and the second having four. The associated
INT32 schema indicates that the score values are 32-bit integers. Missing data
is natively supported and is represented as None.
The JaggedShape allows each row to have a different number of associated
columns. This versatility distinguishes it from a traditional multi-dimensional
matrix, which is restricted to uniform dimensions and a regular shape. A
JaggedShape is better viewed as a hierarchical partitioning of the data. The
dimensionality of the shape corresponds to the depth of the nested input (which
is required to be the same for all elements) and is the same as the number of
arguments in the JaggedShape(2, [3, 4]) representation: two.
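The hierarchical partitioning described above can be sketched in plain Python. This is only an illustration of how the group sizes in a shape such as JaggedShape(2, [3, 4]) relate to a nested list; the helper `jagged_sizes` is hypothetical and is not part of Koda, nor a description of how Koda computes shapes.

```python
def jagged_sizes(nested):
    """Return the group sizes per dimension of a uniformly nested list.

    Hypothetical helper for illustration only; not a Koda API.
    """
    sizes = []
    level = [nested]  # all groups found at the current depth
    while level and all(isinstance(group, list) for group in level):
        sizes.append([len(group) for group in level])
        # Descend one level: the items of every group become the next groups.
        level = [item for group in level for item in group]
    return sizes


scorebook = [[10, 20, 30], [40, 50, None, 69]]
print(jagged_sizes(scorebook))  # [[2], [3, 4]]
```

Each entry of the result partitions the entry that follows it: one group of two classes, then two groups of three and four students.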
Most of Koda’s operations are pointwise vectorized, meaning that they operate on each element in isolation and preserve the shape:
```py {.pycon-doctest}
>>> scores_slice + 10
DataSlice([[20, 30, 40], [50, 60, None, 79]], schema: INT32,…)
```
Other operations make use of the hierarchical information stored in the jagged shape. For example, we can compute summary statistics for the scores:
```py {.pycon-doctest}
>>> class_avg = kd.math.agg_mean(scores_slice)
>>> class_avg
DataSlice([20.0, 53.0], schema: FLOAT32…)
>>> class_avg_max = kd.math.agg_max(class_avg)
>>> class_avg_max
DataItem(53.0, schema: FLOAT32)
>>> # Aggregate over the last ndim=2 dimensions.
>>> school_avg = kd.math.agg_mean(scores_slice, ndim=2)
>>> school_avg
DataItem(36.5, schema: FLOAT32)
```
We can naturally extend the representation to cover a collection of schools:
```py {.pycon-doctest}
>>> scores_from_school_1 = [[1, 2, 3], [4, 5, None, 6]]
>>> scores_from_school_2 = [[80, 90], [100], [110, 120]]
>>> kd.slice(scores_from_school_1).get_shape()
JaggedShape(2, [3, 4])
>>> kd.slice(scores_from_school_2).get_shape()
JaggedShape(3, [2, 1, 2])
>>> many_schools_slice = kd.slice([scores_from_school_1, scores_from_school_2])
>>> many_schools_slice.get_shape()
JaggedShape(2, [2, 3], [3, 4, 2, 1, 2])
```
As we can see, the slice for the collection of schools has an additional outer dimension, but the inner two dimensions have the same meaning as before (they represent classes and students, respectively), and hence the exact same code used to compute summary statistics for a single school can be reused for the collection of schools:
```py {.pycon-doctest}
>>> class_avg = kd.math.agg_mean(
...   many_schools_slice
... )
>>> class_avg
DataSlice([[2.0, 5.0], [85.0, 100.0, 115.0]], schema: FLOAT32,…)
>>> class_avg_max = kd.math.agg_max(class_avg)
>>> class_avg_max
DataSlice([5.0, 115.0], schema: FLOAT32,…)
```
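To make the "aggregate over the last dimension" behaviour concrete, here is a minimal plain-Python sketch. It assumes nested Python lists with None for missing values; the function `agg_mean_last_dim` is a hypothetical stand-in written for this illustration and is not Koda's implementation of kd.math.agg_mean.

```python
def agg_mean_last_dim(nested):
    """Average the innermost lists, skipping None; outer nesting is preserved.

    Hypothetical illustration only; not a Koda API.
    """
    if all(not isinstance(item, list) for item in nested):
        # Innermost dimension reached: collapse it into a single value.
        present = [item for item in nested if item is not None]
        return sum(present) / len(present) if present else None
    # Otherwise recurse, so that only the last dimension is removed.
    return [agg_mean_last_dim(group) for group in nested]


one_school = [[10, 20, 30], [40, 50, None, 69]]
many_schools = [
    [[1, 2, 3], [4, 5, None, 6]],
    [[80, 90], [100], [110, 120]],
]
print(agg_mean_last_dim(one_school))    # [20.0, 53.0]
print(agg_mean_last_dim(many_schools))  # [[2.0, 5.0], [85.0, 100.0, 115.0]]
```

The same function handles both inputs because it only ever touches the innermost level, which is why the extra outer "schools" dimension requires no code changes.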
Consider the case where we again wish to compute statistics for a school. We are interested in the ratio of students per class that achieved a score of at least some minimum value. To account for various differences between classes, the threshold differs per class. In Koda, this is natural to express due to its broadcasting logic:
```py {.pycon-doctest}
>>> scores_slice.get_shape()
JaggedShape(2, [3, 4])
>>> thresholds = kd.slice([30, 50])
>>> thresholds.get_shape()
JaggedShape(2)
>>> # thresholds is broadcasted to the shape of scores_slice, which is possible
>>> # because the shape of thresholds is a "prefix" of the shape of scores_slice.
>>> # The broadcasted thresholds are equivalent to [[30, 30, 30], [50, 50, 50, 50]].
>>> acceptable_scores = scores_slice >= thresholds
>>> acceptable_scores
DataSlice([[missing, missing, present], [missing, present, missing, present]], schema: MASK…)
>>> number_of_acceptable_scores = kd.agg_count(acceptable_scores)
>>> number_of_acceptable_scores
DataSlice([1, 2], schema: INT64,…)
>>> # agg_count only counts the present items of scores_slice.
>>> number_of_total_scores = kd.agg_count(scores_slice)
>>> number_of_total_scores
DataSlice([3, 3], schema: INT64,…)
>>> number_of_acceptable_scores / number_of_total_scores
DataSlice([0.33…, 0.66…], schema: FLOAT32,…)
```
As seen above, Koda broadcasts data from the outer dimensions to the inner ones (i.e. from “left-to-right”), which requires that one shape is a prefix of the other. This plays well with Koda’s hierarchical data model, and facilitates broadcasting of jagged data, which otherwise becomes tricky with alternative broadcasting rules.
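The prefix rule can be illustrated with a small plain-Python sketch. It assumes nested Python lists and None for missing scores; `broadcast_to` is a hypothetical helper written for this illustration, not a Koda function, and it makes no claim about how Koda implements broadcasting.

```python
def broadcast_to(values, target):
    """Repeat `values` along the inner dimensions of `target`.

    Hypothetical illustration only; not a Koda API.
    """
    if not isinstance(values, list):
        # A single value is repeated for every item nested below this point.
        if isinstance(target, list):
            return [broadcast_to(values, item) for item in target]
        return values
    if not isinstance(target, list) or len(values) != len(target):
        raise ValueError("shape of values is not a prefix of the target shape")
    return [broadcast_to(v, t) for v, t in zip(values, target)]


scores = [[10, 20, 30], [40, 50, None, 69]]
thresholds = broadcast_to([30, 50], scores)
print(thresholds)  # [[30, 30, 30], [50, 50, 50, 50]]

# Ratio of present scores per class that reach the class threshold.
ratios = []
for class_scores, class_thresholds in zip(scores, thresholds):
    present = [(s, t) for s, t in zip(class_scores, class_thresholds) if s is not None]
    ratios.append(sum(s >= t for s, t in present) / len(present))
```

A list such as [1, 2, 3] cannot be broadcast to this scorebook, because its outer size (3) does not match the number of classes (2); the sketch raises ValueError in that case.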
DataSlices are able to represent structured data, beyond the tabular form of
pandas or the multidimensional array form of NumPy, while preserving the
parallelism in computation that ensures good performance. Elements are not
limited to primitives such as INT32 or STRING. Just like Python classes, one
can define tailored schemas for objects representing
structured data.
Continuing the previous example, a scorebook for a school’s students typically contains more data than just the bare scores. A student has a name and a score:
```py {.pycon-doctest}
>>> Student = kd.named_schema('Student', student_name=kd.STRING, score=kd.INT32)
```
Next we define the schema for a class and a school. A class has a class_name,
and a list of Students, represented through a Koda List.
```py {.pycon-doctest}
>>> Class = kd.named_schema(
...   'Class', class_name=kd.STRING, students=kd.list_schema(Student)
... )
>>> School = kd.named_schema(
...   'School', school_name=kd.STRING, classes=kd.list_schema(Class)
... )
```
We can represent the scorebook of a school as follows (the code here is intentionally verbose for educational purposes; it can be written more concisely):
```py {.pycon-doctest}
>>> s1 = Student.new(student_name="Alice", score=10)
>>> s2 = Student.new(student_name="Bob", score=20)
>>> s3 = Student.new(student_name="Carol", score=30)
>>> s4 = Student.new(student_name="Dan", score=40)
>>> s5 = Student.new(student_name="Erin", score=50)
>>> s6 = Student.new(student_name="Frank", score=None)
>>> s7 = Student.new(student_name="Grace", score=70)
>>> class_a = Class.new(class_name="A", students=kd.list([s1, s2, s3]))
>>> class_b = Class.new(class_name="B", students=kd.list([s4, s5, s6, s7]))
>>> school_s1 = School.new(school_name="S1", classes=kd.list([class_a, class_b]))
```
school_s1 is a 0-dimensional DataSlice, i.e. a scalar. In Koda, this is
known as a DataItem. This approach naturally associates attributes in the same
DataItem with each other, simplifying subsequent inspection and manipulation
of the data.
The name of school_s1 is accessed as school_s1.school_name, which returns a
DataItem with schema STRING. As Koda operations are vectorized, we can
access an attribute of multiple items in a single call, where the output has the
same shape as the input:
```py {.pycon-doctest}
>>> school_s1.school_name
DataItem('S1', schema: STRING, bag_id:…)
>>> kd.slice([s1, s2]).student_name
DataSlice(['Alice', 'Bob'], schema: STRING,…)
```
The classes of the school are accessed as school_s1.classes. Since attribute
access preserves the shape of the input, and school_s1 is a scalar, this
returns a scalar List DataItem. The List is a container holding several
Class items and does not itself have a classes attribute. We can explode
the List with the operator [:] to get a DataSlice with the Class items
from the List. Attributes can then be accessed as usual with the dot operator,
for instance school_s1.classes[:].class_name, which returns a DataSlice of
attributes with the same shape:
```py {.pycon-doctest}
>>> school_s1.classes.get_schema()  # list with classes:
DataItem(LIST[Class(class_name=STRING, students=LIST[Student(score=INT32, student_name=STRING)])], schema: SCHEMA, bag_id:…)
>>> school_s1.classes[:]  # one-dimensional DataSlice:
DataSlice([
  Entity(
    class_name='A',
    students=List[
      Entity(score=10, student_name='Alice'),
      Entity(score=20, student_name='Bob'),
      Entity(score=30, student_name='Carol'),
    ],
  ),
  Entity(
    class_name='B',
    students=List[
      Entity(score=40, student_name='Dan'),
      Entity(score=50, student_name='Erin'),
      Entity(student_name='Frank'),
      Entity(score=70, student_name='Grace'),
    ],
  ),
],…)
>>> school_s1.classes[:].class_name
DataSlice(['A', 'B'], schema: STRING,…)
>>> school_s1.classes[:].students[:].score
DataSlice([[10, 20, 30], [40, 50, None, 70]], schema: INT32,…)
```
Introducing Lists may initially seem redundant. Why are the classes in
school_s1 packed as a List, rather than a DataSlice, especially since a
DataSlice would conveniently allow direct access to class attributes without
an extra unpacking (“explosion”) step?
The primary purpose of List is organizational: it turns many individual items
(e.g. classes) into a single manageable item (a List of classes) that is
atomic for the purposes of vectorized computations. This allows Lists to be
counted, broadcasted, and in general to behave like other kinds of primitive
data (e.g. a scalar school item can have an attribute that stores multiple
classes in exactly the same way as it can have an attribute that stores a scalar
school_name). Additionally, it allows a single school, which is a scalar
DataItem, to have multiple classes thus modelling a one-to-many relationship.
The consequences of this are more easily shown through an example:
```py {.pycon-doctest}
>>> # students is a 1-dimensional DataSlice:
>>> students = school_s1.classes[:].students
>>> students.get_shape()
JaggedShape(2)
>>> kd.count(school_s1.classes[:].students)
DataItem(2, schema: INT64)
>>> kd.count(school_s1.classes[:].students[:])
DataItem(7, schema: INT64)
>>> # With kd.agg_count we can compute aggregate statistics over the last dimension:
>>> kd.agg_count(school_s1.classes[:].students[:])
DataSlice([3, 4], schema: INT64,…)
```
Koda Lists and DataSlices are important concepts in Koda. Users will
frequently convert between these two structures: a List explodes into a
DataSlice, and a DataSlice can conversely implode its innermost level of
items into Lists. They are perfectly isomorphic and the conversion preserves
all information. As such, a List is a “packed” version of a DataSlice, and a
DataSlice is an “unpacked” version of a List. The following example
illustrates this relationship:
```py {.pycon-doctest}
>>> school_s1.classes.get_shape()
JaggedShape()
>>> school_s1.classes[:].get_shape()
JaggedShape(2)
>>> school_s1.classes[:].students.get_shape()
JaggedShape(2)
>>> school_s1.classes[:].students[:].get_shape()
JaggedShape(2, [3, 4])
>>> # Imploding once gives the same shape as school_s1.classes[:].students
>>> school_s1.classes[:].students[:].implode().get_shape()
JaggedShape(2)
>>> school_s1.classes[:].students[:].implode().implode().get_shape()
JaggedShape()
```
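The packed/unpacked relationship can be mimicked in plain Python as a rough analogy. In this sketch, a tuple stands in for a Koda List (an atomic, packed item) and a Python list stands in for a DataSlice dimension. The functions `explode`, `implode` and `count` are hypothetical helpers written for this illustration; they only borrow the names of the Koda operations and do not reproduce their implementation.

```python
def explode(nested):
    """Unpack the innermost tuples into lists, adding one dimension."""
    if isinstance(nested, tuple):
        return list(nested)
    if isinstance(nested, list):
        return [explode(item) for item in nested]
    raise TypeError("only packed items (tuples) can be exploded")


def implode(nested):
    """Pack the innermost lists into tuples, removing one dimension."""
    if all(not isinstance(item, list) for item in nested):
        return tuple(nested)
    return [implode(item) for item in nested]


def count(nested):
    """Count items, treating every tuple as one atomic item."""
    if isinstance(nested, list):
        return sum(count(item) for item in nested)
    return 1


# One dimension holding two packed "Lists" of student names.
students = [("Alice", "Bob", "Carol"), ("Dan", "Erin", "Frank", "Grace")]
exploded = explode(students)

print(count(students))                # 2
print(count(exploded))                # 7
print(implode(exploded) == students)  # True
```

Packing and unpacking lose no information in this model: exploding and then imploding returns the original value, which mirrors the round trip described in the text.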
Note that Koda supports Dicts in addition to Lists and the structured items
such as school_s1 (known as Entities).
The logic expressed using Koda operators can be packaged into servable Functors, which are Koda’s way to natively represent computations. Say, for example, that the school district regularly wants to get an overview of all the schools in the district. At the end of the year, they wish to find the student with the highest score in each class to award them with a prize.
In Koda, we can represent this through:
```py {.pycon-doctest}
>>> def get_student_with_highest_score(schools):
...   students = schools.classes[:].students[:]
...   # .S[...] is syntactic sugar for indexing into a DataSlice.
...   return students.S[kd.argmax(students.score)].student_name
```
Because Koda is vectorized, this can be evaluated with one school, a flat
DataSlice of schools, or even with a multidimensional DataSlice of schools:
```py {.pycon-doctest}
>>> get_student_with_highest_score(school_s1)
DataSlice(['Carol', 'Grace'],…)
>>> get_student_with_highest_score(
...   kd.slice([school_s1, school_s1]))
DataSlice([['Carol', 'Grace'], ['Carol', 'Grace']]…)
```
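For contrast, a rough plain-Python counterpart shows what the vectorized version saves. It assumes the school is stored as nested dicts and lists, a hypothetical layout chosen only for this sketch (it is not a Koda format), and `top_student_per_class` is likewise a hypothetical helper.

```python
def top_student_per_class(school):
    """Return the name of the highest-scoring student in each class."""
    winners = []
    for cls in school["classes"]:
        # Students without a score cannot win the prize.
        scored = [s for s in cls["students"] if s["score"] is not None]
        best = max(scored, key=lambda s: s["score"], default=None)
        winners.append(best["student_name"] if best else None)
    return winners


school = {
    "school_name": "S1",
    "classes": [
        {"class_name": "A", "students": [
            {"student_name": "Alice", "score": 10},
            {"student_name": "Bob", "score": 20},
            {"student_name": "Carol", "score": 30},
        ]},
        {"class_name": "B", "students": [
            {"student_name": "Dan", "score": 40},
            {"student_name": "Erin", "score": 50},
            {"student_name": "Frank", "score": None},
            {"student_name": "Grace", "score": 70},
        ]},
    ],
}

print(top_student_per_class(school))  # ['Carol', 'Grace']

# Each extra level of nesting needs another explicit loop.
many = [top_student_per_class(s) for s in [school, school]]
```

Unlike the Koda function, this version has to be wrapped in an additional loop or comprehension for every extra dimension of schools.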
This function can be converted into a Functor through tracing, and saved to disk. This Functor can then be loaded again and evaluated with data:
```py {.pycon-doctest}
>>> functor = kd.fn(get_student_with_highest_score)
>>> functor(school_s1)  # Can be evaluated directly
DataSlice(['Carol', 'Grace'],…)
>>> serialized_functor = kd.dumps(functor)  # serializes the functor to bytes.
>>> deserialized_functor = kd.loads(serialized_functor)  # loads the functor.
>>> deserialized_functor(school_s1)
DataSlice(['Carol', 'Grace'],…)
```
Internally, the Functor specifies the Koda operations to apply. The operations are implemented in C++ and made available in Python, so the Functor can be loaded and executed in both C++ and Python. This way, Koda supports workflows where an interactive Python environment is used for experimenting with data modeling and data manipulation, and exactly the same computation is thereafter executed in a production environment with C++.
The concepts shown above offer just a brief introduction to the world of Koda. Koda has many more features and operators for data modeling and manipulation. Moreover, it supports conversion to and from other numerical libraries such as pandas, has native support for conversion to and from Protos, and can be traced into a C++-compatible computational graph representation, allowing code written in Python to be served in a production environment.
Next up: see the Overview guide for a more comprehensive and detailed introduction.