API Reference
This document lists public Koda APIs, including operators (accessible from the `kd` and `kde` packages) and methods of the main abstractions (e.g. DataSlice, DataBag, etc.).
| Category | Subcategory | Description |
|---|---|---|
| kd | | kd and kde operators |
| | allocation | Operators that allocate new ItemIds. |
| | annotation | Annotation operators. |
| | assertion | Operators that assert properties of DataSlices. |
| | bitwise | Bitwise operators. |
| | bags | Operators that work on DataBags. |
| | comparison | Operators that compare DataSlices. |
| | core | Core operators that are not part of other categories. |
| | curves | Operators working with curves. |
| | dicts | Operators working with dictionaries. |
| | entities | Operators that work solely with entities. |
| | expr | Expr utilities. |
| | extension_types | Extension type functionality. |
| | functor | Operators to create and call functors. |
| | ids | Operators that work with ItemIds. |
| | iterables | Operators that work with iterables. These APIs are in active development and might change often. |
| | json | JSON serialization operators. |
| | lists | Operators working with lists. |
| | masking | Masking operators. |
| | math | Arithmetic operators. |
| | objs | Operators that work solely with objects. |
| | optools | Operator definition and registration tooling. |
| | proto | Protocol buffer serialization operators. |
| | parallel | Operators for parallel computation. |
| | py | Operators that call Python functions. |
| | random | Random and sampling operators. |
| | schema | Schema-related operators. |
| | shapes | Operators that work on shapes. |
| | slices | Operators that perform DataSlice transformations. |
| | strings | Operators that work with string data. |
| | streams | Operators that work with streams of items. These APIs are in active development and might change often (b/424742492). |
| | tuples | Operators to create tuples. |
| kd_ext | | kd_ext operators |
| | contrib | External contributions not necessarily endorsed by Koda. |
| | nested_data | Utilities for manipulating nested data. |
| | npkd | Tools for NumPy <-> Koda interoperability. |
| | pdkd | Tools for Pandas <-> Koda interoperability. |
| | vis | Koda visualization functionality. |
| DataSlice | | DataSlice methods |
| DataBag | | DataBag methods |
| DataItem | | DataItem methods |
kd and kde operators

The `kd` and `kde` modules are containers for eager and lazy operators,
respectively. While most operators below have both eager and lazy versions
(e.g. `kd.agg_sum` vs `kde.agg_sum`), some operators (e.g.
`kd.sub(expr, *subs)`) only have an eager version. Such operators often take
Exprs or Functors as inputs, and a lazy version would not make sense for them.

Note that operators from extension modules (e.g. `kd_ext.npkd`) are not
included.
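For illustration, a minimal sketch of the eager/lazy split, assuming the common `from koladata import kd` import with lazy operators available under `kd.lazy` (referred to as `kde` below) and expression inputs under `kd.I`:

```python
from koladata import kd

kde = kd.lazy  # lazy counterparts of the eager `kd` operators
I = kd.I       # expression inputs

x = kd.slice([[1, 2], [3]])

kd.agg_sum(x)            # eager: evaluates immediately -> kd.slice([3, 3])

expr = kde.agg_sum(I.x)  # lazy: builds an Expr
expr.eval(x=x)           # evaluates with inputs bound -> kd.slice([3, 3])
```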
Namespaces
Operators that allocate new ItemIds.
Operators
kd.allocation.new_dictid()
Aliases:
Allocates new Dict ItemId.
kd.allocation.new_dictid_like(shape_and_mask_from)
Aliases:
Allocates new Dict ItemIds with the shape and sparsity of shape_and_mask_from.
kd.allocation.new_dictid_shaped(shape)
Aliases:
Allocates new Dict ItemIds of the given shape.
kd.allocation.new_dictid_shaped_as(shape_from)
Aliases:
Allocates new Dict ItemIds with the shape of shape_from.
kd.allocation.new_itemid()
Aliases:
Allocates new ItemId.
kd.allocation.new_itemid_like(shape_and_mask_from)
Aliases:
Allocates new ItemIds with the shape and sparsity of shape_and_mask_from.
kd.allocation.new_itemid_shaped(shape)
Aliases:
Allocates new ItemIds of the given shape without any DataBag attached.
kd.allocation.new_itemid_shaped_as(shape_from)
Aliases:
Allocates new ItemIds with the shape of shape_from.
kd.allocation.new_listid()
Aliases:
Allocates new List ItemId.
kd.allocation.new_listid_like(shape_and_mask_from)
Aliases:
Allocates new List ItemIds with the shape and sparsity of shape_and_mask_from.
kd.allocation.new_listid_shaped(shape)
Aliases:
Allocates new List ItemIds of the given shape.
kd.allocation.new_listid_shaped_as(shape_from)
Aliases:
Allocates new List ItemIds with the shape of shape_from.
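A short usage sketch (assuming `from koladata import kd`):

```python
from koladata import kd

itemid = kd.allocation.new_itemid()  # a single new ItemId
ids = kd.allocation.new_itemid_shaped_as(kd.slice([1, 2, 3]))
dict_ids = kd.allocation.new_dictid_like(kd.slice([1, None, 3]))  # keeps sparsity
list_ids = kd.allocation.new_listid_shaped(ids.get_shape())
```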
Annotation operators.
Operators
kd.annotation.source_location(expr, function_name, file_name, line, column, line_text)
Annotation for source location where the expr node was created.
The annotation is considered "best effort", so any of the
arguments may be missing.
Args:
function_name: name of the function where the expr node was
created
file_name: name of the file where the expr node was created
line: line number where the expr node was created. 0 indicates
an unknown line number.
column: column number where the expr node was created. 0
indicates an unknown column number.
line_text: text of the line where the expr node was created
kd.annotation.with_name(obj, name)
Aliases:
Checks that the `name` is a string and returns `obj` unchanged.
This method is useful in tracing workflows: when tracing, we will assign
the given name to the subexpression computing `obj`. In eager mode, this
method is effectively a no-op.
Args:
obj: Any object.
name: The name to be used for this sub-expression when tracing this code.
Must be a string.
Returns:
obj unchanged.
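A minimal sketch (assuming `from koladata import kd` and the `kd.with_name` alias):

```python
from koladata import kd

def double_plus_one(x):
  # Named subexpression when this function is traced; a no-op in eager mode.
  doubled = kd.with_name(x * 2, 'doubled')
  return doubled + 1

double_plus_one(kd.slice([1, 2]))  # -> kd.slice([3, 5])
```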
Operators that assert properties of DataSlices.
Operators
kd.assertion.assert_present_scalar(arg_name, ds, primitive_schema)
Returns the present scalar `ds` if it's implicitly castable to `primitive_schema`.
It raises an exception if:
1) `ds`'s schema is not primitive_schema (including NONE) or OBJECT
2) `ds` is not a scalar
3) `ds` is not present
4) `ds` is not castable to `primitive_schema`
The following examples will pass:
assert_present_scalar('x', kd.present, kd.MASK)
assert_present_scalar('x', 1, kd.INT32)
assert_present_scalar('x', 1, kd.FLOAT64)
The following examples will fail:
assert_present_scalar('x', kd.missing, kd.MASK)
assert_present_scalar('x', kd.slice([kd.present]), kd.MASK)
assert_present_scalar('x', kd.present, kd.INT32)
Args:
arg_name: The name of `ds`.
ds: DataSlice to assert the dtype, presence and rank of.
primitive_schema: The expected primitive schema.
kd.assertion.assert_primitive(arg_name, ds, primitive_schema)
Returns `ds` if its data is implicitly castable to `primitive_schema`.
It raises an exception if:
1) `ds`'s schema is not primitive_schema (including NONE) or OBJECT
2) `ds` has present items and not all of them are castable to
`primitive_schema`
The following examples will pass:
assert_primitive('x', kd.present, kd.MASK)
assert_primitive('x', kd.slice([kd.present, kd.missing]), kd.MASK)
assert_primitive('x', kd.slice(None, schema=kd.OBJECT), kd.MASK)
assert_primitive('x', kd.slice([], schema=kd.OBJECT), kd.MASK)
assert_primitive('x', kd.slice([1, 3.14], schema=kd.OBJECT), kd.FLOAT32)
assert_primitive('x', kd.slice([1, 2]), kd.FLOAT32)
The following examples will fail:
assert_primitive('x', 1, kd.MASK)
assert_primitive('x', kd.slice([kd.present, 1]), kd.MASK)
assert_primitive('x', kd.slice(1, schema=kd.OBJECT), kd.MASK)
Args:
arg_name: The name of `ds`.
ds: DataSlice to assert the dtype of.
primitive_schema: The expected primitive schema.
kd.assertion.with_assertion(x, condition, message_or_fn, *args)
Returns `x` if `condition` is present, else raises error `message_or_fn`.
`message_or_fn` should either be a STRING message or a functor taking the
provided `*args` and creating an error message from it. If `message_or_fn` is
a STRING, the `*args` should be omitted. If `message_or_fn` is a functor, it
will only be invoked if `condition` is `missing`.
Example:
x = kd.slice(1)
y = kd.slice(2)
kd.assertion.with_assertion(x, x < y, 'x must be less than y') # -> x.
kd.assertion.with_assertion(
x, x > y, 'x must be greater than y'
) # -> error: 'x must be greater than y'.
kd.assertion.with_assertion(
x, x > y, lambda: 'x must be greater than y'
) # -> error: 'x must be greater than y'.
kd.assertion.with_assertion(
x,
x > y,
lambda x, y: kd.format('x={x} must be greater than y={y}', x=x, y=y),
x,
y,
) # -> error: 'x=1 must be greater than y=2'.
Args:
x: The value to return if `condition` is present.
condition: A unit scalar, unit optional, or DataItem holding a mask.
message_or_fn: The error message to raise if `condition` is not present, or
a functor producing such an error message.
*args: Auxiliary data to be passed to the `message_or_fn` functor.
Bitwise operators
Operators
kd.bitwise.bitwise_and(x, y)
Aliases:
Computes pointwise bitwise x & y.
kd.bitwise.bitwise_or(x, y)
Aliases:
Computes pointwise bitwise x | y.
kd.bitwise.bitwise_xor(x, y)
Aliases:
Computes pointwise bitwise x ^ y.
kd.bitwise.count(x)
Aliases:
Computes the number of bits set to 1 in the given input.
kd.bitwise.invert(x)
Aliases:
Computes pointwise bitwise ~x.
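A usage sketch on integer DataSlices (assuming `from koladata import kd`):

```python
from koladata import kd

x = kd.slice([0b1100, 0b1010])
y = kd.slice([0b1010, 0b0110])

kd.bitwise.bitwise_and(x, y)  # -> kd.slice([8, 2])
kd.bitwise.bitwise_or(x, y)   # -> kd.slice([14, 14])
kd.bitwise.bitwise_xor(x, y)  # -> kd.slice([6, 12])
kd.bitwise.count(x)           # -> kd.slice([2, 2])
kd.bitwise.invert(x)          # -> kd.slice([-13, -11]) (two's complement)
```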
Operators that work on DataBags.
Operators
kd.bags.enriched(*bags)
Aliases:
Creates a new immutable DataBag enriched by `bags`.
It adds `bags` as fallbacks rather than merging the underlying data, thus
the cost is O(1).
DataBags earlier in the list have higher priority.
`enriched_bag(bag1, bag2, bag3)` is equivalent to
`enriched_bag(enriched_bag(bag1, bag2), bag3)`, and so on for additional
DataBag args.
Args:
*bags: DataBag(s) for enriching.
Returns:
An immutable DataBag enriched by `bags`.
kd.bags.is_null_bag(bag)
Aliases:
Returns `present` if DataBag `bag` is a NullDataBag.
kd.bags.new()
Aliases:
Returns an empty immutable DataBag.
kd.bags.updated(*bags)
Aliases:
Creates a new immutable DataBag updated by `bags`.
It adds `bags` as fallbacks rather than merging the underlying data, thus
the cost is O(1).
DataBags later in the list have higher priority.
`updated_bag(bag1, bag2, bag3)` is equivalent to
`updated_bag(bag1, updated_bag(bag2, bag3))`, and so on for additional
DataBag args.
Args:
*bags: DataBag(s) for updating.
Returns:
An immutable DataBag updated by `bags`.
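A sketch of the precedence rules (assuming `from koladata import kd`; `kd.attrs` builds an update DataBag, see kd.core.attrs below):

```python
from koladata import kd

x = kd.new(a=1)
upd = kd.attrs(x, a=2)

# `updated`: later bags win; `enriched`: earlier bags win.
x.with_bag(kd.bags.updated(x.get_bag(), upd)).a   # -> 2
x.with_bag(kd.bags.enriched(x.get_bag(), upd)).a  # -> 1
```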
Operators that compare DataSlices.
Operators
kd.comparison.equal(x, y)
Aliases:
Returns present iff `x` and `y` are equal.
Pointwise operator which takes a DataSlice and returns a MASK indicating
iff `x` and `y` are equal. Returns `kd.present` for equal items and
`kd.missing` in other cases.
Args:
x: DataSlice.
y: DataSlice.
kd.comparison.full_equal(x, y)
Aliases:
Returns present iff all present items in `x` and `y` are equal.
The result is a zero-dimensional DataItem. Note that it is different from
`kd.all(x == y)`.
For example,
kd.full_equal(kd.slice([1, 2, 3]), kd.slice([1, 2, 3])) -> kd.present
kd.full_equal(kd.slice([1, 2, 3]), kd.slice([1, 2, None])) -> kd.missing
kd.full_equal(kd.slice([1, 2, None]), kd.slice([1, 2, None])) -> kd.present
Args:
x: DataSlice.
y: DataSlice.
kd.comparison.greater(x, y)
Aliases:
Returns present iff `x` is greater than `y`.
Pointwise operator which takes a DataSlice and returns a MASK indicating
iff `x` is greater than `y`. Returns `kd.present` when `x` is greater and
`kd.missing` when `x` is less than or equal to `y`.
Args:
x: DataSlice.
y: DataSlice.
kd.comparison.greater_equal(x, y)
Aliases:
Returns present iff `x` is greater than or equal to `y`.
Pointwise operator which takes a DataSlice and returns a MASK indicating
iff `x` is greater than or equal to `y`. Returns `kd.present` when `x` is
greater than or equal to `y` and `kd.missing` when `x` is less than `y`.
Args:
x: DataSlice.
y: DataSlice.
kd.comparison.less(x, y)
Aliases:
Returns present iff `x` is less than `y`.
Pointwise operator which takes a DataSlice and returns a MASK indicating
iff `x` is less than `y`. Returns `kd.present` when `x` is less and
`kd.missing` when `x` is greater than or equal to `y`.
Args:
x: DataSlice.
y: DataSlice.
kd.comparison.less_equal(x, y)
Aliases:
Returns present iff `x` is less than or equal to `y`.
Pointwise operator which takes a DataSlice and returns a MASK indicating
iff `x` is less than or equal to `y`. Returns `kd.present` when `x` is
less than or equal to `y` and `kd.missing` when `x` is greater than `y`.
Args:
x: DataSlice.
y: DataSlice.
kd.comparison.not_equal(x, y)
Aliases:
Returns present iff `x` and `y` are not equal.
Pointwise operator which takes a DataSlice and returns a MASK indicating
iff `x` and `y` are not equal. Returns `kd.present` for not equal items and
`kd.missing` in other cases.
Args:
x: DataSlice.
y: DataSlice.
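A usage sketch (assuming `from koladata import kd`; the infix forms `x == y`, `x < y`, etc. are equivalent):

```python
from koladata import kd

x = kd.slice([1, 2, 3, None])
y = kd.slice([1, 3, 2, 4])

kd.comparison.equal(x, y)       # -> [present, missing, missing, missing]
kd.comparison.less(x, y)        # -> [missing, present, missing, missing]
kd.comparison.full_equal(x, x)  # -> present (all present items are equal)
```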
Core operators that are not part of other categories.
Operators
kd.core.attr(x, attr_name, value, overwrite_schema=False)
Aliases:
Returns a new DataBag containing attribute `attr_name` update for `x`.
This operator is useful if attr_name cannot be used as a key in keyword
arguments. E.g.: "123-f", "5", "%#$", etc. It still has to be a valid utf-8
unicode.
See the kd.attrs docstring for more details on the rules and the
`overwrite_schema` argument.
Args:
x: Entity / Object for which the attribute update is being created.
attr_name: utf-8 unicode representing the attribute name.
value: new value for attribute `attr_name`.
overwrite_schema: if True, schema for attribute is always updated.
kd.core.attrs(x, /, *, overwrite_schema=False, **attrs)
Aliases:
Returns a new DataBag containing attribute updates for `x`.
Most common usage is to build an update using kd.attrs and then attach it as a
DataBag update to the DataSlice.
Example:
x = ...
attr_update = kd.attrs(x, foo=..., bar=...)
x = x.updated(attr_update)
In case some attribute "foo" already exists and the update contains "foo",
either:
1) the schema of "foo" in the update must be implicitly castable to
`x.foo.get_schema()`; or
2) `x` is an OBJECT, in which case schema for "foo" will be overwritten.
An exception to (2) is if it was an Entity that was cast to an OBJECT using
kd.obj; in that case the update for "foo" must also be castable to
`x.foo.get_schema()`. If this is not the case, an Error is raised.
This behavior can be overridden by passing `overwrite_schema=True`, which will
cause the schema for attributes to always be updated.
Args:
x: Entity / Object for which the attributes update is being created.
overwrite_schema: if True, schema for attributes is always updated.
**attrs: attrs to set in the update.
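A concrete version of the example above (a sketch, assuming `from koladata import kd`):

```python
from koladata import kd

x = kd.new(foo=1)
attr_update = kd.attrs(x, foo=2, bar='hello')  # a DataBag of updates
x = x.updated(attr_update)
x.foo  # -> 2
x.bar  # -> 'hello'
```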
kd.core.clone(x, /, *, itemid=unspecified, schema=unspecified, **overrides)
Aliases:
Creates a DataSlice with clones of provided entities in a new DataBag.
The entities themselves are cloned (with new ItemIds) and their attributes are
extracted (with the same ItemIds).
Also see kd.shallow_clone and kd.deep_clone.
Note that unlike kd.deep_clone, if there are multiple references to the same
entity, the returned DataSlice will have multiple clones of it rather than
references to the same clone.
Args:
x: The DataSlice to copy.
itemid: The ItemId to assign to cloned entities. If not specified, new
ItemIds will be allocated.
schema: The schema to resolve attributes, and also to assign the schema to
the resulting DataSlice. If not specified, will use the schema of `x`.
**overrides: attribute overrides.
Returns:
A copy of the entities where entities themselves are cloned (new ItemIds)
and all of the rest extracted.
kd.core.container(**attrs)
Aliases:
Creates new Objects with an implicit stored schema.
Returned DataSlice has OBJECT schema and mutable DataBag.
Args:
**attrs: attrs to set on the returned object.
Returns:
data_slice.DataSlice with the given attrs and kd.OBJECT schema.
kd.core.deep_clone(x, /, schema=unspecified, **overrides)
Aliases:
Creates a slice with a (deep) copy of the given slice.
The entities themselves and all their attributes including both top-level and
non-top-level attributes are cloned (with new ItemIds).
Also see kd.shallow_clone and kd.clone.
Note that unlike kd.clone, if there are multiple references to the same entity
in `x`, or multiple ways to reach one entity through attributes, there will be
exactly one clone made per entity.
Args:
x: The slice to copy.
schema: The schema to use to find attributes to clone, and also to assign
the schema to the resulting DataSlice. If not specified, will use the
schema of 'x'.
**overrides: attribute overrides.
Returns:
A (deep) copy of the given DataSlice.
All referenced entities will be copied with newly allocated ItemIds. Note
that UUIDs will be copied as ItemIds.
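A sketch contrasting the three clone operators (assuming `from koladata import kd`):

```python
from koladata import kd

inner = kd.new(v=1)
x = kd.new(a=inner, b=2)

c = kd.clone(x)          # x gets a new ItemId; x.a keeps its ItemId (extracted)
d = kd.deep_clone(x)     # x and everything reachable get new ItemIds
s = kd.shallow_clone(x)  # x gets a new ItemId; attributes shared by reference
d.a.v                    # -> 1, but d.a has a different ItemId than x.a
```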
kd.core.enriched(ds, *bag)
Aliases:
Returns a copy of a DataSlice with additional fallback DataBag(s).
Values in the original DataBag of `ds` take precedence over the ones in
`*bag`.
The DataBag attached to the result is a new immutable DataBag that falls back
to the DataBag of `ds` if present and then to `*bag`.
`enriched(x, a, b)` is equivalent to `enriched(enriched(x, a), b)`, and so on
for additional DataBag args.
Args:
ds: DataSlice.
*bag: additional fallback DataBag(s).
Returns:
DataSlice with additional fallbacks.
kd.core.extract(ds, schema=unspecified)
Aliases:
Creates a DataSlice with a new DataBag containing only reachable attrs.
Args:
ds: DataSlice to extract.
schema: schema of the extracted DataSlice.
Returns:
A DataSlice with a new immutable DataBag attached.
kd.core.extract_bag(ds, schema=unspecified)
Aliases:
Creates a new DataBag containing only reachable attrs from 'ds'.
Args:
ds: DataSlice to extract.
schema: schema of the extracted DataSlice.
Returns:
A new immutable DataBag with only the reachable attrs from 'ds'.
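A sketch (assuming `from koladata import kd`):

```python
from koladata import kd

root = kd.new(a=kd.new(v=1), b=2)
small = kd.core.extract(root.a)    # DataSlice with a new bag holding only
                                   # the data reachable from root.a
bag = kd.core.extract_bag(root.a)  # just the new DataBag
```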
kd.core.follow(x)
Aliases:
Returns the original DataSlice from a NoFollow DataSlice.
When a DataSlice is wrapped into a NoFollow DataSlice, its attributes
are not further traversed during extract, clone, deep_clone, etc.
The `kd.follow` operator converts the DataSlice back to a traversable DataSlice.
Inverse of `nofollow`.
Args:
x: DataSlice to unwrap, if nofollowed.
kd.core.freeze(x)
Aliases:
Returns a frozen version of `x`.
kd.core.freeze_bag(x)
Aliases:
Returns a DataSlice with an immutable DataBag with the same data.
kd.core.get_attr(x, attr_name, default=unspecified)
Aliases:
Resolves (ObjectId(s), attr_name) => (Value|ObjectId)s.
In case attr points to Lists or Maps, the result is a DataSlice that
contains "pointers" to the beginning of lists/dicts.
For simple values ((entity, attr) => values), just returns
DataSlice(primitive values)
Args:
x: DataSlice to get attribute from.
attr_name: name of the attribute to access.
default: default value to use when `x` does not have such attribute. In case
default is specified, this will not warn/raise if the attribute does not
exist in the schema, so one can use `default=None` to suppress the missing
attribute warning/error. When `default=None` and the attribute is missing
on all entities, this will return an empty slice with NONE schema.
Returns:
DataSlice
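A sketch (assuming `from koladata import kd` and the top-level `kd.get_attr` / `kd.maybe` aliases):

```python
from koladata import kd

x = kd.new(a=1)
kd.get_attr(x, 'a')                # -> 1 (same as x.a)
kd.get_attr(x, 'b', default=None)  # no error; empty item with NONE schema
kd.maybe(x, 'b')                   # shortcut for default=None
```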
kd.core.get_attr_names(x, intersection)
Returns a DataSlice with sorted unique attribute names of `x`.
In case of OBJECT schema, attribute names are fetched from the `__schema__`
attribute. In case of Entity schema, the attribute names are fetched from the
schema. In case of primitives, an empty list is returned.
Args:
x: A DataSlice.
intersection: If True, the intersection of all object attributes is
returned. Otherwise, the union is returned.
kd.core.get_bag(ds)
Aliases:
Returns the attached DataBag.
It raises an Error if there is no DataBag attached.
Args:
ds: DataSlice to get DataBag from.
Returns:
The attached DataBag.
kd.core.get_item(x, key_or_index)
Aliases:
Get items from Lists or Dicts in `x` by `key_or_index`.
Examples:
l = kd.list([1, 2, 3])
# Get List items by range slice from 1 to -1
kd.get_item(l, slice(1, -1)) -> kd.slice([2, 3])
# Get List items by indices
kd.get_item(l, kd.slice([2, 5])) -> kd.slice([3, None])
d = kd.dict({'a': 1, 'b': 2})
# Get Dict values by keys
kd.get_item(d, kd.slice(['a', 'c'])) -> kd.slice([1, None])
Args:
x: List or Dict DataSlice.
key_or_index: DataSlice or Slice.
Returns:
Result DataSlice.
kd.core.get_metadata(x)
Aliases:
Gets a metadata from a DataSlice.
Args:
x: DataSlice to get metadata from.
Returns:
Metadata DataSlice.
kd.core.has_attr(x, attr_name)
Aliases:
Indicates whether the items in `x` DataSlice have the given attribute.
This function checks for attributes based on data rather than "schema" and may
be slow in some cases.
Args:
x: DataSlice
attr_name: Name of the attribute to check.
Returns:
A MASK DataSlice with the same shape as `x` that contains present if the
attribute exists for the corresponding item.
kd.core.has_bag(ds)
Aliases:
Returns `present` if DataSlice `ds` has a DataBag attached.
kd.core.has_entity(x)
Aliases:
Returns present for each item in `x` that is an Entity.
Note that this is a pointwise operation.
Also see `kd.is_entity` for checking if `x` is an Entity DataSlice. But
note that `kd.all(kd.has_entity(x))` is not always equivalent to
`kd.is_entity(x)`. For example,
kd.is_entity(kd.item(None, kd.OBJECT)) -> kd.present
kd.all(kd.has_entity(kd.item(None, kd.OBJECT))) -> invalid for kd.all
kd.is_entity(kd.item([None], kd.OBJECT)) -> kd.present
kd.all(kd.has_entity(kd.item([None], kd.OBJECT))) -> kd.missing
Args:
x: DataSlice to check.
Returns:
A MASK DataSlice with the same shape as `x`.
kd.core.has_primitive(x)
Aliases:
Returns present for each item in `x` that is primitive.
Note that this is a pointwise operation.
Also see `kd.is_primitive` for checking if `x` is a primitive DataSlice. But
note that `kd.all(kd.has_primitive(x))` is not always equivalent to
`kd.is_primitive(x)`. For example,
kd.is_primitive(kd.int32(None)) -> kd.present
kd.all(kd.has_primitive(kd.int32(None))) -> invalid for kd.all
kd.is_primitive(kd.int32([None])) -> kd.present
kd.all(kd.has_primitive(kd.int32([None]))) -> kd.missing
Args:
x: DataSlice to check.
Returns:
A MASK DataSlice with the same shape as `x`.
kd.core.is_entity(x)
Aliases:
Returns whether x is an Entity DataSlice.
`x` is an Entity DataSlice if it meets one of the following conditions:
1) it has an Entity schema
2) it has OBJECT schema and only has Entity items
Also see `kd.has_entity` for a pointwise version. But note that
`kd.all(kd.has_entity(x))` is not always equivalent to
`kd.is_entity(x)`. For example,
kd.is_entity(kd.item(None, kd.OBJECT)) -> kd.present
kd.all(kd.has_entity(kd.item(None, kd.OBJECT))) -> invalid for kd.all
kd.is_entity(kd.item([None], kd.OBJECT)) -> kd.present
kd.all(kd.has_entity(kd.item([None], kd.OBJECT))) -> kd.missing
Args:
x: DataSlice to check.
Returns:
A MASK DataItem.
kd.core.is_primitive(x)
Aliases:
Returns whether x is a primitive DataSlice.
`x` is a primitive DataSlice if it meets one of the following conditions:
1) it has a primitive schema
2) it has OBJECT/SCHEMA schema and only has primitives
Also see `kd.has_primitive` for a pointwise version. But note that
`kd.all(kd.has_primitive(x))` is not always equivalent to
`kd.is_primitive(x)`. For example,
kd.is_primitive(kd.int32(None)) -> kd.present
kd.all(kd.has_primitive(kd.int32(None))) -> invalid for kd.all
kd.is_primitive(kd.int32([None])) -> kd.present
kd.all(kd.has_primitive(kd.int32([None]))) -> kd.missing
Args:
x: DataSlice to check.
Returns:
A MASK DataItem.
kd.core.maybe(x, attr_name)
Aliases:
A shortcut for kd.get_attr(x, attr_name, default=None).
kd.core.metadata(x, /, **attrs)
Aliases:
Returns a new DataBag containing metadata updates for `x`.
Most common usage is to build an update using kd.metadata and then attach it
as a DataBag update to the DataSlice.
Example:
x = ...
metadata_update = kd.metadata(x, foo=..., bar=...)
x = x.updated(metadata_update)
Note that if the metadata attribute name is not a valid Python identifier, it
might be set by `with_attr` instead:
metadata_update = kd.metadata(x).with_attr('123', value)
Args:
x: Schema for which the metadata update is being created.
**attrs: attrs to set in the metadata update.
kd.core.no_bag(ds)
Aliases:
Returns DataSlice without any DataBag attached.
kd.core.nofollow(x)
Aliases:
Returns a nofollow DataSlice targeting the given slice.
When a slice is wrapped into a nofollow, its attributes are not further
traversed during extract, clone, deep_clone, etc.
`nofollow` is reversible.
Args:
x: DataSlice to wrap.
kd.core.ref(ds)
Aliases:
Returns `ds` with the DataBag removed.
Unlike `no_bag`, `ds` is required to hold ItemIds and no primitives are
allowed.
The result DataSlice still has the original schema. If the schema is an Entity
schema (including List/Dict schema), it is treated as an ItemId after the DataBag
is removed.
Args:
ds: DataSlice of ItemIds.
kd.core.reify(ds, source)
Aliases:
Assigns a bag and schema from `source` to the slice `ds`.
kd.core.shallow_clone(x, /, *, itemid=unspecified, schema=unspecified, **overrides)
Aliases:
Creates a DataSlice with shallow clones of immediate attributes.
The entities themselves get new ItemIds and their top-level attributes are
copied by reference.
Also see kd.clone and kd.deep_clone.
Note that unlike kd.deep_clone, if there are multiple references to the same
entity, the returned DataSlice will have multiple clones of it rather than
references to the same clone.
Args:
x: The DataSlice to copy.
itemid: The ItemId to assign to cloned entities. If not specified, will
allocate new ItemIds.
schema: The schema to resolve attributes, and also to assign the schema to
the resulting DataSlice. If not specified, will use the schema of 'x'.
**overrides: attribute overrides.
Returns:
A copy of the entities with new ItemIds where all top-level attributes are
copied by reference.
kd.core.strict_attrs(x, /, **attrs)
Aliases:
Returns a new DataBag containing attribute updates for `x`.
Strict version of kd.attrs disallowing adding new attributes.
Args:
x: Entity for which the attributes update is being created.
**attrs: attrs to set in the update.
kd.core.strict_with_attrs(x, /, **attrs)
Aliases:
Returns a DataSlice with a new DataBag containing updated attrs in `x`.
Strict version of kd.attrs disallowing adding new attributes.
Args:
x: Entity for which the attributes update is being created.
**attrs: attrs to set in the update.
kd.core.stub(x, attrs=DataSlice([], schema: NONE))
Aliases:
Copies a DataSlice's schema stub to a new DataBag.
The "schema stub" of a DataSlice is a subset of its schema (including embedded
schemas) that contains just enough information to support direct updates to
that DataSlice.
Optionally copies `attrs` schema attributes to the new DataBag as well.
This method works for items, objects, and for lists and dicts stored as items
or objects. The intended usage is to add new attributes to the object in the
new bag, or new items to the dict in the new bag, and then to be able
to merge the bags to obtain a union of attributes/values. For lists, we
extract the list with stubs for list items, which also works recursively so
nested lists are deep-extracted. Note that if you modify the list afterwards
by appending or removing items, you will no longer be able to merge the result
with the original bag.
Args:
x: DataSlice to extract the schema stub from.
attrs: Optional list of additional schema attribute names to copy. The
schemas for those attributes will be copied recursively (so including
attributes of those attributes etc).
Returns:
DataSlice with the same schema stub in the new DataBag.
kd.core.updated(ds, *bag)
Aliases:
Returns a copy of a DataSlice with DataBag(s) of updates applied.
Values in `*bag` take precedence over the ones in the original DataBag of
`ds`.
The DataBag attached to the result is a new immutable DataBag that falls back
to the DataBag of `ds` if present and then to `*bag`.
`updated(x, a, b)` is equivalent to `updated(updated(x, b), a)`, and so on
for additional DataBag args.
Args:
ds: DataSlice.
*bag: DataBag(s) of updates.
Returns:
DataSlice with additional fallbacks.
kd.core.with_attr(x, attr_name, value, overwrite_schema=False)
Aliases:
Returns a DataSlice with a new DataBag containing a single updated attribute.
This operator is useful if attr_name cannot be used as a key in keyword
arguments. E.g.: "123-f", "5", "%#$", etc. It still has to be a valid utf-8
unicode.
See the kd.with_attrs docstring for more details on the rules and the
`overwrite_schema` argument.
Args:
x: Entity / Object for which the attribute update is being created.
attr_name: utf-8 unicode representing the attribute name.
value: new value for attribute `attr_name`.
overwrite_schema: if True, schema for attribute is always updated.
kd.core.with_attrs(x, /, *, overwrite_schema=False, **attrs)
Aliases:
Returns a DataSlice with a new DataBag containing updated attrs in `x`.
This is a shorter version of `x.updated(kd.attrs(x, ...))`.
Example:
x = x.with_attrs(foo=..., bar=...)
# Or equivalent:
# x = kd.with_attrs(x, foo=..., bar=...)
In case some attribute "foo" already exists and the update contains "foo",
either:
1) the schema of "foo" in the update must be implicitly castable to
`x.foo.get_schema()`; or
2) `x` is an OBJECT, in which case schema for "foo" will be overwritten.
An exception to (2) is if it was an Entity that was cast to an OBJECT using
kd.obj; in that case the update for "foo" must also be castable to
`x.foo.get_schema()`. If this is not the case, an Error is raised.
This behavior can be overridden by passing `overwrite_schema=True`, which will
cause the schema for attributes to always be updated.
Args:
x: Entity / Object for which the attributes update is being created.
overwrite_schema: if True, schema for attributes is always updated.
**attrs: attrs to set in the update.
kd.core.with_bag(ds, bag)
Aliases:
Returns a DataSlice with the given DataBag attached.
kd.core.with_merged_bag(ds)
Aliases:
Returns a DataSlice with the DataBag of `ds` merged with its fallbacks.
Note that a DataBag has multiple fallback DataBags and fallback DataBags can
have fallbacks as well. This operator merges all of them into a new immutable
DataBag.
If `ds` has no attached DataBag, it raises an exception. If the DataBag of
`ds` does not have fallback DataBags, it is equivalent to `ds.freeze_bag()`.
Args:
ds: DataSlice to merge fallback DataBags of.
Returns:
A new DataSlice with an immutable DataBag.
kd.core.with_metadata(x, /, **attrs)
Aliases:
Returns a DataSlice with a new DataBag containing updated metadata for `x`.
This is a shorter version of `x.updated(kd.metadata(x, ...))`.
Example:
x = kd.with_metadata(x, foo=..., bar=...)
Note that if the metadata attribute name is not a valid Python identifier, it
might be set by `with_attr` instead:
x = kd.with_metadata(x).with_attr('123', value)
Args:
x: Entity / Object for which the metadata update is being created.
**attrs: attrs to set in the update.
kd.core.with_print(x, *args, sep=' ', end='\n')
Aliases:
Prints *args to stdout and returns `x`.
The operator uses str(arg) for each of the *args, i.e. it is not pointwise,
and too long arguments may be truncated.
Args:
x: Value to propagate (unchanged).
*args: DataSlice(s) to print.
sep: Separator to use between DataSlice(s).
end: End string to use after the last DataSlice.
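A sketch (assuming `from koladata import kd`); this is handy for debugging intermediate values inside traced computations:

```python
from koladata import kd

x = kd.slice([1, 2, 3])
y = kd.core.with_print(x, 'x is', x) * 2  # prints "x is [1, 2, 3]"; y -> [2, 4, 6]
```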
Operators working with curves.
Operators
kd.curves.log_p1_pwl_curve(p, adjustments)
Aliases:
Specialization of PWLCurve with log(x + 1) transformation.
Args:
p: (DataSlice) input points to the curve
adjustments: (DataSlice) 2D data slice with points used for interpolation.
The second dimension must have regular size of 2. E.g., [[1, 1.7], [2,
3.6], [7, 5.7]]
Returns:
FLOAT64 DataSlice with the same dimensions as p with interpolation results.
kd.curves.log_pwl_curve(p, adjustments)
Aliases:
Specialization of PWLCurve with log(x) transformation.
Args:
p: (DataSlice) input points to the curve
adjustments: (DataSlice) 2D data slice with points used for interpolation.
The second dimension must have regular size of 2. E.g., [[1, 1.7], [2,
3.6], [7, 5.7]]
Returns:
FLOAT64 DataSlice with the same dimensions as p with interpolation results.
kd.curves.pwl_curve(p, adjustments)
Aliases:
Piecewise Linear (PWL) curve interpolation operator.
Args:
p: (DataSlice) input points to the curve
adjustments: (DataSlice) 2D data slice with points used for interpolation.
The second dimension must have regular size of 2. E.g., [[1, 1.7], [2,
3.6], [7, 5.7]]
Returns:
FLOAT64 DataSlice with the same dimensions as p with interpolation results.
kd.curves.symmetric_log_p1_pwl_curve(p, adjustments)
Aliases:
Specialization of PWLCurve with symmetric log(x + 1) transformation.
Args:
p: (DataSlice) input points to the curve
adjustments: (DataSlice) 2D data slice with points used for interpolation.
The second dimension must have regular size of 2. E.g., [[1, 1.7], [2,
3.6], [7, 5.7]]
Returns:
FLOAT64 DataSlice with the same dimensions as p with interpolation results.
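A sketch for the plain PWL curve (assuming `from koladata import kd`; the log variants only differ in how `p` is transformed before interpolation):

```python
from koladata import kd

p = kd.slice([1.0, 1.5, 7.0])
adjustments = kd.slice([[1.0, 1.7], [2.0, 3.6], [7.0, 5.7]])  # [x, y] points

kd.curves.pwl_curve(p, adjustments)
# expected: [1.7, 2.65, 5.7] (1.5 is the midpoint between the first two points)
```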
Operators working with dictionaries.
Operators
kd.dicts.dict_update(x, keys, values=unspecified)
Aliases:
Returns DataBag containing updates to a DataSlice of dicts.
This operator has two forms:
kd.dict_update(x, keys, values) where keys and values are slices
kd.dict_update(x, dict_updates) where dict_updates is a DataSlice of dicts
If both keys and values are specified, they must both be broadcastable to the
shape of `x`. If only keys is specified (as dict_updates), it must be
broadcastable to 'x'.
Args:
x: DataSlice of dicts to update.
keys: A DataSlice of keys, or a DataSlice of dicts of updates.
values: A DataSlice of values, or unspecified if `keys` contains dicts.
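A sketch of both steps (assuming `from koladata import kd`):

```python
from koladata import kd

d = kd.dict({'a': 1})
upd = kd.dicts.dict_update(d, 'b', 2)  # a DataBag holding the update
d = d.updated(upd)                     # d now maps 'a' -> 1 and 'b' -> 2

d = kd.dicts.with_dict_update(d, 'c', 3)  # the same, in one step
```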
kd.dicts.get_item(x, key_or_index)
Alias for kd.core.get_item operator.
kd.dicts.get_keys(dict_ds)
Aliases:
Returns keys of all Dicts in `dict_ds`.
The result DataSlice has one more dimension used to represent keys in each
dict than `dict_ds`. While the order of keys within a dict is arbitrary, it is
the same as get_values().
Args:
dict_ds: DataSlice of Dicts.
Returns:
A DataSlice of keys.
kd.dicts.get_values(dict_ds, key_ds=unspecified)
Aliases:
Returns values corresponding to `key_ds` for dicts in `dict_ds`.
When `key_ds` is specified, it is equivalent to dict_ds[key_ds].
When `key_ds` is unspecified, it returns all values in `dict_ds`. The result
DataSlice has one more dimension used to represent values in each dict than
`dict_ds`. While the order of values within a dict is arbitrary, it is the
same as get_keys().
Args:
dict_ds: DataSlice of Dicts.
key_ds: DataSlice of keys or unspecified.
Returns:
A DataSlice of values.
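A sketch (assuming `from koladata import kd`):

```python
from koladata import kd

d = kd.dict({'a': 1, 'b': 2})
keys = kd.dicts.get_keys(d)    # e.g. kd.slice(['a', 'b']); order is arbitrary
vals = kd.dicts.get_values(d)  # aligned with the get_keys() order
d[keys]                        # same as kd.dicts.get_values(d, keys)
```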
kd.dicts.has_dict(x)
Aliases:
Returns present for each item in `x` that is Dict.
Note that this is a pointwise operation.
Also see `kd.is_dict` for checking if `x` is a Dict DataSlice. But note that
`kd.all(kd.has_dict(x))` is not always equivalent to `kd.is_dict(x)`. For
example,
kd.is_dict(kd.item(None, kd.OBJECT)) -> kd.present
kd.all(kd.has_dict(kd.item(None, kd.OBJECT))) -> invalid for kd.all
kd.is_dict(kd.item([None], kd.OBJECT)) -> kd.present
kd.all(kd.has_dict(kd.item([None], kd.OBJECT))) -> kd.missing
Args:
x: DataSlice to check.
Returns:
A MASK DataSlice with the same shape as `x`.
kd.dicts.is_dict(x)
Aliases:
Returns whether x is a Dict DataSlice.
`x` is a Dict DataSlice if it meets one of the following conditions:
1) it has a Dict schema
2) it has OBJECT schema and only has Dict items
Also see `kd.has_dict` for a pointwise version. But note that
`kd.all(kd.has_dict(x))` is not always equivalent to `kd.is_dict(x)`. For
example,
kd.is_dict(kd.item(None, kd.OBJECT)) -> kd.present
kd.all(kd.has_dict(kd.item(None, kd.OBJECT))) -> invalid for kd.all
kd.is_dict(kd.item([None], kd.OBJECT)) -> kd.present
kd.all(kd.has_dict(kd.item([None], kd.OBJECT))) -> kd.missing
Args:
x: DataSlice to check.
Returns:
A MASK DataItem.
kd.dicts.like(shape_and_mask_from, /, items_or_keys=None, values=None, *, key_schema=None, value_schema=None, schema=None, itemid=None)
Aliases:
Creates new Koda dicts with shape and sparsity of `shape_and_mask_from`.
Returns immutable dicts.
If items_or_keys and values are not provided, creates empty dicts. Otherwise,
the function assigns the given keys and values to the newly created dicts. So
the keys and values must be either broadcastable to shape_and_mask_from
shape, or one dimension higher.
Args:
shape_and_mask_from: a DataSlice with the shape and sparsity for the
desired dicts.
items_or_keys: either a Python dict (if `values` is None) or a DataSlice
with keys. The Python dict case is supported only for scalar
shape_and_mask_from.
values: a DataSlice of values, when `items_or_keys` represents keys.
key_schema: the schema of the dict keys. If not specified, it will be
deduced from keys or defaulted to OBJECT.
value_schema: the schema of the dict values. If not specified, it will be
deduced from values or defaulted to OBJECT.
schema: The schema to use for the newly created Dict. If specified, then
key_schema and value_schema must not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting dicts.
Returns:
A DataSlice with the dicts.
kd.dicts.new(items_or_keys=None, values=None, *, key_schema=None, value_schema=None, schema=None, itemid=None)
Aliases:
Creates a Koda dict.
Returns an immutable dict.
Acceptable arguments are:
1) no argument: a single empty dict
2) a Python dict whose keys are either primitives or DataItems and values
are primitives, DataItems, Python list/dict which can be converted to a
List/Dict DataItem, or a DataSlice which can folded into a List DataItem:
a single dict
3) two DataSlices/DataItems as keys and values: a DataSlice of dicts whose
shape is the last N-1 dimensions of keys/values DataSlice
Examples:
dict() -> returns a single new dict
dict({1: 2, 3: 4}) -> returns a single new dict
dict({1: [1, 2]}) -> returns a single dict, mapping 1->List[1, 2]
dict({1: kd.slice([1, 2])}) -> returns a single dict, mapping 1->List[1, 2]
dict({db.uuobj(x=1, y=2): 3}) -> returns a single dict, mapping uuid->3
dict(kd.slice([1, 2]), kd.slice([3, 4]))
-> returns a dict ({1: 3, 2: 4})
dict(kd.slice([[1], [2]]), kd.slice([3, 4]))
-> returns a 1-D DataSlice that holds two dicts ({1: 3} and {2: 4})
dict('key', 12) -> returns a single dict mapping 'key'->12
Args:
items_or_keys: a Python dict in case of items and a DataSlice in case of
keys.
values: a DataSlice. If provided, `items_or_keys` must be a DataSlice as
keys.
key_schema: the schema of the dict keys. If not specified, it will be
deduced from keys or defaulted to OBJECT.
value_schema: the schema of the dict values. If not specified, it will be
deduced from values or defaulted to OBJECT.
schema: The schema to use for the newly created Dict. If specified, then
key_schema and value_schema must not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting dicts.
Returns:
A DataSlice with the dict.
kd.dicts.select_keys(ds, fltr)
Aliases:
Selects Dict keys by filtering out missing items in `fltr`.
Also see kd.select.
Args:
ds: Dict DataSlice to be filtered
fltr: filter DataSlice with dtype as kd.MASK or a Koda Functor or a Python
function which can be evaluated to such DataSlice. A Python function will
be traced for evaluation, so it cannot have Python control flow operations
such as `if` or `while`.
Returns:
Filtered DataSlice.
kd.dicts.select_values(ds, fltr)
Aliases:
Selects Dict values by filtering out missing items in `fltr`.
Also see kd.select.
Args:
ds: Dict DataSlice to be filtered
fltr: filter DataSlice with dtype as kd.MASK or a Koda Functor or a Python
function which can be evaluated to such DataSlice. A Python function will
be traced for evaluation, so it cannot have Python control flow operations
such as `if` or `while`.
Returns:
Filtered DataSlice.
kd.dicts.shaped(shape, /, items_or_keys=None, values=None, key_schema=None, value_schema=None, schema=None, itemid=None)
Aliases:
Creates new Koda dicts with the given shape.
Returns immutable dicts.
If items_or_keys and values are not provided, creates empty dicts. Otherwise,
the function assigns the given keys and values to the newly created dicts. So
the keys and values must be either broadcastable to `shape` or one dimension
higher.
Args:
shape: the desired shape.
items_or_keys: either a Python dict (if `values` is None) or a DataSlice
with keys. The Python dict case is supported only for scalar shape.
values: a DataSlice of values, when `items_or_keys` represents keys.
key_schema: the schema of the dict keys. If not specified, it will be
deduced from keys or defaulted to OBJECT.
value_schema: the schema of the dict values. If not specified, it will be
deduced from values or defaulted to OBJECT.
schema: The schema to use for the newly created Dict. If specified, then
key_schema and value_schema must not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting dicts.
Returns:
A DataSlice with the dicts.
kd.dicts.shaped_as(shape_from, /, items_or_keys=None, values=None, key_schema=None, value_schema=None, schema=None, itemid=None)
Aliases:
Creates new Koda dicts with shape of the given DataSlice.
Returns immutable dicts.
If items_or_keys and values are not provided, creates empty dicts. Otherwise,
the function assigns the given keys and values to the newly created dicts. So
the keys and values must be either broadcastable to `shape` or one dimension
higher.
Args:
shape_from: mandatory DataSlice, whose shape the returned DataSlice will
have.
items_or_keys: either a Python dict (if `values` is None) or a DataSlice
with keys. The Python dict case is supported only for scalar shape.
values: a DataSlice of values, when `items_or_keys` represents keys.
key_schema: the schema of the dict keys. If not specified, it will be
deduced from keys or defaulted to OBJECT.
value_schema: the schema of the dict values. If not specified, it will be
deduced from values or defaulted to OBJECT.
schema: The schema to use for the newly created Dict. If specified, then
key_schema and value_schema must not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting dicts.
Returns:
A DataSlice with the dicts.
kd.dicts.size(dict_slice)
Aliases:
Returns size of a Dict.
kd.dicts.with_dict_update(x, keys, values=unspecified)
Aliases:
Returns a DataSlice with a new DataBag containing updated dicts.
This operator has two forms:
kd.with_dict_update(x, keys, values) where keys and values are slices
kd.with_dict_update(x, dict_updates) where dict_updates is a DataSlice of
dicts
If both keys and values are specified, they must both be broadcastable to the
shape of `x`. If only keys is specified (as dict_updates), it must be
broadcastable to 'x'.
Args:
x: DataSlice of dicts to update.
keys: A DataSlice of keys, or a DataSlice of dicts of updates.
values: A DataSlice of values, or unspecified if `keys` contains dicts.
Operators that work solely with entities.
Operators
kd.entities.like(shape_and_mask_from, /, *, schema=None, overwrite_schema=False, itemid=None, **attrs)
Aliases:
Creates new Entities with the shape and sparsity from shape_and_mask_from.
Returns immutable Entities.
Args:
shape_and_mask_from: DataSlice, whose shape and sparsity the returned
DataSlice will have.
schema: optional DataSlice schema. If not specified, a new explicit schema
will be automatically created based on the schemas of the passed **attrs.
You can also pass schema='name' as a shortcut for
schema=kd.named_schema('name').
overwrite_schema: if schema attribute is missing and the attribute is being
set through `attrs`, schema is successfully updated.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting entities.
**attrs: attrs to set in the returned Entity.
Returns:
data_slice.DataSlice with the given attrs.
kd.entities.new(arg=unspecified, /, *, schema=None, overwrite_schema=False, itemid=None, **attrs)
Aliases:
Creates Entities with given attrs.
Returns an immutable Entity.
Args:
arg: optional Python object to be converted to an Entity.
schema: optional DataSlice schema. If not specified, a new explicit schema
will be automatically created based on the schemas of the passed **attrs.
You can also pass schema='name' as a shortcut for
schema=kd.named_schema('name').
overwrite_schema: if schema attribute is missing and the attribute is being
set through `attrs`, schema is successfully updated.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting entities.
itemid is only set when `arg` is present and is not a primitive or a
primitive slice.
**attrs: attrs to set in the returned Entity.
Returns:
data_slice.DataSlice with the given attrs.
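A sketch (assuming `from koladata import kd`):

```python
from koladata import kd

e = kd.entities.new(x=1, y='a')  # new explicit schema derived from attrs
e2 = kd.entities.new(x=2, y='b', schema=e.get_schema())  # reuse a schema
e3 = kd.entities.new(x=3, y='c', schema='Point')  # kd.named_schema('Point')
```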
kd.entities.shaped(shape, /, *, schema=None, overwrite_schema=False, itemid=None, **attrs)
Aliases:
Creates new Entities with the given shape.
Returns immutable Entities.
Args:
shape: JaggedShape that the returned DataSlice will have.
schema: optional DataSlice schema. If not specified, a new explicit schema
will be automatically created based on the schemas of the passed **attrs.
You can also pass schema='name' as a shortcut for
schema=kd.named_schema('name').
overwrite_schema: if schema attribute is missing and the attribute is being
set through `attrs`, schema is successfully updated.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting entities.
**attrs: attrs to set in the returned Entity.
Returns:
data_slice.DataSlice with the given attrs.
kd.entities.shaped_as(shape_from, /, *, schema=None, overwrite_schema=False, itemid=None, **attrs)
Aliases:
Creates new Koda entities with shape of the given DataSlice.
Returns immutable Entities.
Args:
shape_from: DataSlice, whose shape the returned DataSlice will have.
schema: optional DataSlice schema. If not specified, a new explicit schema
will be automatically created based on the schemas of the passed **attrs.
You can also pass schema='name' as a shortcut for
schema=kd.named_schema('name').
overwrite_schema: if schema attribute is missing and the attribute is being
set through `attrs`, schema is successfully updated.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting entities.
**attrs: attrs to set in the returned Entity.
Returns:
data_slice.DataSlice with the given attrs.
kd.entities.uu(seed=None, *, schema=None, overwrite_schema=False, **attrs)
Aliases:
Creates UuEntities with given attrs.
Returns an immutable UU Entity.
Args:
seed: string to seed the uuid computation with.
schema: optional DataSlice schema. If not specified, a UuSchema
will be automatically created based on the schemas of the passed **attrs.
overwrite_schema: if schema attribute is missing and the attribute is being
set through `attrs`, schema is successfully updated.
**attrs: attrs to set in the returned Entity.
Returns:
data_slice.DataSlice with the given attrs.
Expr utilities.
Operators
kd.expr.as_expr(arg)
Converts Python values into Exprs.
kd.expr.get_input_names(expr, container=InputContainer('I'))
Returns names of `container` inputs used in `expr`.
kd.expr.get_name(expr)
Returns the name of the given Expr, or None if it does not have one.
kd.expr.is_input(expr)
Returns True if `expr` is an input `I`.
kd.expr.is_literal(expr)
Returns True if `expr` is a Koda Literal.
kd.expr.is_packed_expr(ds)
Returns kd.present if the argument is a DataItem containing an Expr.
kd.expr.is_variable(expr)
Returns True if `expr` is a variable `V`.
kd.expr.literal(value)
Constructs an expr with a LiteralOperator wrapping the provided QValue.
kd.expr.pack_expr(expr)
Packs the given Expr into a DataItem.
kd.expr.sub(expr, *subs)
Returns `expr` with provided expressions replaced.
Example usage:
kd.sub(expr, (from_1, to_1), (from_2, to_2), ...)
For the special case of a single substitution, you can also do:
kd.sub(expr, from, to)
It does the substitution by traversing 'expr' post-order and comparing
fingerprints of sub-Exprs in the original expression and those in 'subs'.
For example,
kd.sub(I.x + I.y, (I.x, I.z), (I.x + I.y, I.k)) -> I.k
kd.sub(I.x + I.y, (I.x, I.y), (I.y + I.y, I.z)) -> I.y + I.y
It does not do deep transformation recursively. For example,
kd.sub(I.x + I.y, (I.x, I.z), (I.y, I.x)) -> I.z + I.x
Args:
expr: Expr which substitutions are applied to
*subs: Either zero or more (sub_from, sub_to) tuples, or exactly two
arguments from and to. The keys should be expressions, and the values
should be possible to convert to expressions using kd.as_expr.
Returns:
A new Expr with substitutions.
kd.expr.sub_by_name(expr, /, **subs)
Returns `expr` with named subexpressions replaced.
Use `kde.with_name(expr, name)` to create a named subexpression.
Example:
foo = kde.with_name(I.x, 'foo')
bar = kde.with_name(I.y, 'bar')
expr = foo + bar
kd.sub_by_name(expr, foo=I.z)
# -> I.z + kde.with_name(I.y, 'bar')
Args:
expr: an expression.
**subs: mapping from subexpression name to replacement node.
kd.expr.sub_inputs(expr, container=InputContainer('I'), /, **subs)
Returns an expression with `container` inputs replaced with Expr(s).
kd.expr.unpack_expr(ds)
Unpacks an Expr stored in a DataItem.
kd.expr.unwrap_named(expr)
Unwraps a named Expr, raising if it is not named.
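A short tour of the Expr utilities (assuming `from koladata import kd` with inputs under `kd.I`):

```python
from koladata import kd

I = kd.I
expr = I.x + I.y

kd.expr.get_input_names(expr)     # -> ['x', 'y']
packed = kd.expr.pack_expr(expr)  # a DataItem holding the Expr
kd.expr.is_packed_expr(packed)    # -> kd.present
kd.expr.unpack_expr(packed)       # -> I.x + I.y
kd.expr.sub(expr, (I.y, I.z))     # -> I.x + I.z
```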
Extension type functionality.
Operators
kd.extension_types.dynamic_cast(value, qtype)
Up-, down-, and side-casts `value` to `qtype`.
kd.extension_types.extension_type(unsafe_override=False)
Aliases:
Creates a Koda extension type from the given original class.
This function is intended to be used as a class decorator. The decorated class
serves as a schema for the new extension type.
Internally, this function creates the following:
- A new `QType` for the extension type, which is a labeled `QType` on top of
an arolla::Object.
- A `QValue` class for representing evaluated instances of the extension type.
- An `ExprView` class for representing expressions that will evaluate to the
extension type.
It replaces the decorated class with a new class that acts as a factory. This
factory's `__new__` method dispatches to either create an `Expr` or a `QValue`
instance, depending on the types of the arguments provided.
The fields of the dataclass are exposed as properties on both the `QValue` and
`ExprView` classes. Any methods defined on the dataclass are also carried
over.
Note:
- The decorated class must not have its own `__new__` method.
- The type annotations on the fields of the dataclass are used to determine
the schema of the underlying `DataSlice` (if relevant).
- All fields must have type annotations.
- Supported annotations include `SchemaItem`, `DataSlice`, `DataBag`,
`JaggedShape`, and other extension types. Additionally, any QType can be
used as an annotation.
- The `with_attrs` method is automatically added, allowing for attributes to
be dynamically updated.
Example:
@extension_type()
class MyPoint:
x: kd.FLOAT32
y: kd.FLOAT32
def norm(self):
return (self.x**2 + self.y**2)**0.5
# Creates a QValue instance of MyPoint.
p1 = MyPoint(x=1.0, y=2.0)
Extension type inheritance is supported through Python inheritance. Passing an
extension type argument to a functor will automatically upcast / downcast the
argument to the correct extension type based on the argument annotation. To
support calling a child class's methods after upcasting, the parent method
must be annotated with @kd.extension_types.virtual() and the child method
must be annotated with @kd.extension_types.override(). Internally, this traces
the methods into Functors. Virtual methods _require_ proper return
annotations (and if relevant, input argument annotations).
Example:
@kd.extension_type(unsafe_override=True)
class A:
x: kd.INT32
def fn(self, y): # Normal method.
return self.x + y
@kd.extension_types.virtual()
def virt_fn(self, y): # Virtual method.
return self.x * y
@kd.extension_type(unsafe_override=True)
class B(A): # Inherits from A.
y: kd.FLOAT32
def fn(self, y):
return self.x + self.y + y
@kd.extension_types.override()
def virt_fn(self, y):
return self.x * self.y * y
@kd.fn
def call_a_fn(a: A): # Automatically casts to A.
return a.fn(4) # Calls non-virtual method.
@kd.fn
def call_a_virt_fn(a: A): # Automatically casts to A.
return a.virt_fn(4) # Calls virtual method.
b = B(2, 3)
# -> 6. `fn` is _not_ marked as virtual, so the parent method is invoked.
call_a_fn(b)
# -> 24.0. `virt_fn` is marked as virtual, so the child method is invoked.
call_a_virt_fn(b)
Args:
unsafe_override: Overrides existing registered extension types.
Returns:
A new class that serves as a factory for the extension type.
kd.extension_types.get_extension_qtype(cls)
Returns the QType for the given extension type class.
kd.extension_types.is_koda_extension(x)
Returns True iff the given object is an instance of a Koda extension type.
kd.extension_types.is_koda_extension_type(cls)
Returns True iff the given type is a registered Koda extension type.
kd.extension_types.override()
Marks the method as overriding a virtual method.
kd.extension_types.unwrap(ext)
Unwraps the extension type `ext` into an arolla::Object.
kd.extension_types.virtual()
Marks the method as virtual, allowing it to be overridden.
kd.extension_types.with_attrs(ext, **attrs)
No description
kd.extension_types.wrap(x, qtype)
Wraps `x` into an instance of the given extension type.
Operators to create and call functors.
Operators
kd.functor.FunctorFactory(*args, **kwargs)
`functor_factory` argument protocol for `trace_as_fn`.
Implements:
(py_types.FunctionType, return_type_as: arolla.QValue) -> DataItem
kd.functor.allow_arbitrary_unused_inputs(fn_def)
Returns a functor that allows unused inputs but otherwise behaves the same.
This is done by adding a `**__extra_inputs__` argument to the signature if
there is no existing variadic keyword argument there. If there is a variadic
keyword argument, this function will return the original functor.
This means that if the functor already accepts arbitrary inputs but fails
on unknown inputs further down the line (for example, when calling another
functor), this method will not fix it. In particular, this method has no
effect on the return values of kd.py_fn or kd.bind. It does however work
on the output of kd.trace_py_fn.
Args:
fn_def: The input functor.
Returns:
The input functor if it already has a variadic keyword argument, or its copy
but with an additional `**__extra_inputs__` variadic keyword argument if
there is no existing variadic keyword argument.
kd.functor.bind(fn_def, /, *, return_type_as=<class 'koladata.types.data_slice.DataSlice'>, **kwargs)
Aliases:
Returns a Koda functor that partially binds a function to `kwargs`.
This function is intended to work the same as functools.partial in Python.
More specifically, for every "k=something" argument that you pass to this
function, whenever the resulting functor is called, if the user did not
provide "k=something_else" at call time, we will add "k=something".
Note that you can only provide defaults for the arguments passed as keyword
arguments this way. Positional arguments must still be provided at call time.
Moreover, if the user provides a value for a positional-or-keyword argument
positionally, and it was previously bound using this method, an exception
will occur.
You can pass expressions with their own inputs as values in `kwargs`. Those
inputs will become inputs of the resulting functor, will be used to compute
those expressions, _and_ they will also be passed to the underlying functor.
Use kd.functor.call_fn for a clearer separation of those inputs.
Example:
f = kd.bind(kd.fn(I.x + I.y), x=0)
kd.call(f, y=1) # 1
Args:
fn_def: A Koda functor.
return_type_as: The return type of the functor is expected to be the same as
the type of this value. This needs to be specified if the functor does not
return a DataSlice. kd.types.DataSlice and kd.types.DataBag can also be
passed here.
**kwargs: Partial parameter binding. The values in this map may be Koda
expressions or DataItems. When they are expressions, they must evaluate to
a DataSlice/DataItem or a primitive that will be automatically wrapped
into a DataItem. This function creates auxiliary variables with names
starting with '_aux_fn', so it is not recommended to pass variables with
such names.
Returns:
A new Koda functor with some parameters bound.
kd.functor.call(fn, *args, return_type_as=DataItem(None, schema: NONE), **kwargs)
Aliases:
Calls a functor.
See the docstring of `kd.fn` on how to create a functor.
Example:
kd.call(kd.fn(I.x + I.y), x=2, y=3)
# returns kd.item(5)
kd.lazy.call(I.fn, x=2, y=3).eval(fn=kd.fn(I.x + I.y))
# returns kd.item(5)
Args:
fn: The functor to be called, typically created via kd.fn().
*args: The positional arguments to pass to the call. Scalars will be
auto-boxed to DataItems.
return_type_as: The return type of the call is expected to be the same as
the return type of this expression. In most cases, this will be a literal
of the corresponding type. This needs to be specified if the functor does
not return a DataSlice. kd.types.DataSlice, kd.types.DataBag and
kd.types.JaggedShape can also be passed here.
**kwargs: The keyword arguments to pass to the call. Scalars will be
auto-boxed to DataItems.
Returns:
The result of the call.
kd.functor.call_and_update_namedtuple(fn, *args, namedtuple_to_update, **kwargs)
Calls a functor which returns a namedtuple and applies it as an update.
This operator exists to avoid the need to specify return_type_as for the inner
call (since the returned namedtuple may have a subset of fields of the
original namedtuple, potentially in a different order).
Example:
kd.functor.call_and_update_namedtuple(
kd.fn(lambda x: kd.namedtuple(x=x * 2)),
x=2,
namedtuple_to_update=kd.namedtuple(x=1, y=2))
# returns kd.namedtuple(x=4, y=2)
Args:
fn: The functor to be called, typically created via kd.fn().
*args: The positional arguments to pass to the call. Scalars will be
auto-boxed to DataItems.
namedtuple_to_update: The namedtuple to be updated with the result of the
call. The returned namedtuple must have a subset (possibly empty or full)
of fields of this namedtuple, with the same types.
**kwargs: The keyword arguments to pass to the call. Scalars will be
auto-boxed to DataItems.
Returns:
The updated namedtuple.
kd.functor.call_fn_normally_when_parallel(fn, *args, return_type_as=DataItem(None, schema: NONE), **kwargs)
Special call that will invoke the functor normally in parallel mode.
Normally, nested functor calls are also parallelized in parallel mode. This
operator can be used to disable the nested parallelization for a specific
call. Instead, the parallel evaluation will first wait for all inputs of
the call to be ready, and then call the functor normally on them. Note that
the functor must accept and return non-parallel types (DataSlices, DataBags)
and not futures. The functor may return an iterable, which will be converted
to a stream in parallel mode, or another non-parallel value, which will
be converted to a future in parallel mode. The functor must not accept
iterables as inputs; doing so will result in an error.
Outside of the parallel mode, this operator behaves exactly like
`functor.call`.
Args:
fn: The functor to be called, typically created via kd.fn().
*args: The positional arguments to pass to the call.
return_type_as: The return type of the call is expected to be the same as
the return type of this expression. In most cases, this will be a literal
of the corresponding type. This needs to be specified if the functor does
not return a DataSlice. kd.types.DataSlice, kd.types.DataBag and
kd.types.JaggedShape can also be passed here.
**kwargs: The keyword arguments to pass to the call.
Returns:
The result of the call.
kd.functor.call_fn_returning_stream_when_parallel(fn, *args, return_type_as=ITERABLE[DATA_SLICE]{sequence(value_qtype=DATA_SLICE)}, **kwargs)
Special call that will be transformed to expect fn to return a stream.
It should be used only if the functor is provided externally in a production
environment. Prefer `functor.call` for functors fully implemented in Python.
Args:
fn: The function to be called. Should return an Iterable in interactive mode
and a Stream in parallel mode.
*args: positional args to pass to the function.
return_type_as: The return type of the call is expected to be the same as
the return type of this expression. In most cases, this will be a literal
of the corresponding type. This needs to be specified if the functor does
not return an Iterable[DataSlice].
**kwargs: The keyword arguments to pass to the call.
kd.functor.expr_fn(returns, *, signature=None, auto_variables=False, **variables)
Creates a functor.
Args:
returns: What should calling a functor return. Will typically be an Expr to
be evaluated, but can also be a DataItem in which case calling will just
return this DataItem, or a primitive that will be wrapped as a DataItem.
When this is an Expr, it either must evaluate to a DataSlice/DataItem, or
the return_type_as= argument should be specified at kd.call time.
signature: The signature of the functor. Will be used to map from args/
kwargs passed at calling time to I.smth inputs of the expressions. When
None, the default signature will be created based on the inputs from the
expressions involved.
auto_variables: When true, we create additional variables automatically
based on the provided expressions for 'returns' and user-provided
variables. All non-scalar-primitive DataSlice literals become their own
variables, and all named subexpressions become their own variables. This
helps readability and manipulation of the resulting functor.
**variables: The variables of the functor. Each variable can either be an
expression to be evaluated, or a DataItem, or a primitive that will be
wrapped as a DataItem. The result of evaluating the variable can be
accessed as V.smth in other expressions.
Returns:
A DataItem representing the functor.
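For illustration, a minimal sketch of a functor with a variable (assuming
`from koladata import kd` and the standard input containers `I = kd.I`,
`V = kd.V`):
```
# `offset` becomes a functor variable, referenced as V.offset in `returns`.
fn = kd.functor.expr_fn(I.x + V.offset, offset=10)
kd.call(fn, x=1)  # -> kd.item(11)
```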
kd.functor.flat_map_chain(iterable, fn, value_type_as=DataItem(None, schema: NONE))
Aliases:
Executes flat maps over the given iterable.
`fn` is called for each item in the iterable, and must return an iterable.
The resulting iterable is then chained to get the final result.
If `fn=lambda x: kd.iterables.make(f(x), g(x))` and
`iterable=kd.iterables.make(x1, x2)`, the resulting iterable will be
`kd.iterables.make(f(x1), g(x1), f(x2), g(x2))`.
Example:
```
kd.functor.flat_map_chain(
kd.iterables.make(1, 10),
lambda x: kd.iterables.make(x, x * 2, x * 3),
)
```
result: `kd.iterables.make(1, 2, 3, 10, 20, 30)`.
Args:
iterable: The iterable to iterate over.
fn: The function to be executed for each item in the iterable. It will
receive the iterable item as the positional argument and must return an
iterable.
value_type_as: The type to use as element type of the resulting iterable.
Returns:
The resulting iterable as chained output of `fn`.
kd.functor.flat_map_interleaved(iterable, fn, value_type_as=DataItem(None, schema: NONE))
Aliases:
Executes flat maps over the given iterable.
`fn` is called for each item in the iterable, and must return an
iterable. The resulting iterables are then interleaved to get the final
result. Please note that the order of the items in each functor output
iterable is preserved, while the order of these iterables is not preserved.
If `fn=lambda x: kd.iterables.make(f(x), g(x))` and
`iterable=kd.iterables.make(x1, x2)`, the resulting iterable will be
`kd.iterables.make(f(x1), g(x1), f(x2), g(x2))` or `kd.iterables.make(f(x1),
f(x2), g(x1), g(x2))` or `kd.iterables.make(g(x1), f(x1), f(x2), g(x2))` or
`kd.iterables.make(g(x1), g(x2), f(x1), f(x2))`.
Example:
```
kd.functor.flat_map_interleaved(
kd.iterables.make(1, 10),
lambda x: kd.iterables.make(x, x * 2, x * 3),
)
```
result (one possible order): `kd.iterables.make(1, 10, 2, 3, 20, 30)`.
Args:
iterable: The iterable to iterate over.
fn: The function to be executed for each item in the iterable. It will
receive the iterable item as the positional argument and must return an
iterable.
value_type_as: The type to use as element type of the resulting iterable.
Returns:
The resulting iterable as interleaved output of `fn`.
kd.functor.fn(f, *, use_tracing=True, **kwargs)
Aliases:
Returns a Koda functor representing `f`.
This is the most generic of the functor builder functions. It accepts all
supported function types, including Python functions and Koda Exprs.
Args:
f: Python function, Koda Expr, Expr packed into a DataItem, or a Koda
functor (the latter will be just returned unchanged).
use_tracing: Whether tracing should be used for Python functions.
**kwargs: Either variables or defaults to pass to the function. See the
documentation of `expr_fn` and `py_fn` for more details.
Returns:
A Koda functor representing `f`.
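A minimal sketch contrasting the accepted input types (assuming
`from koladata import kd` and `I = kd.I`):
```
fn_from_py = kd.fn(lambda x, y: x + y)  # the lambda is traced by default
fn_from_expr = kd.fn(I.x + I.y)         # a Koda Expr
kd.call(fn_from_py, x=1, y=2)    # -> kd.item(3)
kd.call(fn_from_expr, x=1, y=2)  # -> kd.item(3)
```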
kd.functor.for_(iterable, body_fn, *, finalize_fn=unspecified, condition_fn=unspecified, returns=unspecified, yields=unspecified, yields_interleaved=unspecified, **initial_state)
Aliases:
Executes a loop over the given iterable.
Exactly one of `returns`, `yields`, `yields_interleaved` must be specified,
and that dictates what this operator returns.
When `returns` is specified, it is one more variable added to `initial_state`,
and the value of that variable at the end of the loop is returned.
When `yields` is specified, it must be an iterable, and the value
passed there, as well as the values set to this variable in each
iteration of the loop, are chained to get the resulting iterable.
When `yields_interleaved` is specified, the behavior is the same as `yields`,
but the values are interleaved instead of chained.
The behavior of the loop is equivalent to the following pseudocode:
```
state = initial_state  # Also add `returns` to it if specified.
while condition_fn(state):
  item = next(iterable)
  if item == <end-of-iterable>:
    upd = finalize_fn(**state)
  else:
    upd = body_fn(item, **state)
  if yields/yields_interleaved is specified:
    yield the corresponding data from upd, and remove it from upd.
  state.update(upd)
  if item == <end-of-iterable>:
    break
if returns is specified:
  return state['returns']
```
Args:
iterable: The iterable to iterate over.
body_fn: The function to be executed for each item in the iterable. It will
receive the iterable item as the positional argument, and the loop
variables as keyword arguments (excluding `yields`/`yields_interleaved` if
those are specified), and must return a namedtuple with the new values for
some or all loop variables (including `yields`/`yields_interleaved` if
those are specified).
finalize_fn: The function to be executed when the iterable is exhausted. It
will receive the same arguments as `body_fn` except the positional
argument, and must return the same namedtuple. If not specified, the state
at the end will be the same as the state after processing the last item.
Note that finalize_fn is not called if condition_fn ever returns a missing
mask.
condition_fn: The function to be executed to determine whether to continue
the loop. It will receive the loop variables as keyword arguments, and
must return a MASK scalar. Can be used to terminate the loop early without
processing all items in the iterable. If not specified, the loop will
continue until the iterable is exhausted.
returns: The loop variable that holds the return value of the loop.
yields: The loop variable that holds the values to yield at each iteration,
to be chained together.
yields_interleaved: The loop variable that holds the values to yield at
each iteration, to be interleaved.
**initial_state: The initial state of the loop variables.
Returns:
Either the return value or the iterable of yielded values.
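As an illustrative sketch of the `returns` mode (based on the semantics
above):
```
# Sum the items of an iterable; `returns` is the accumulator state variable.
total = kd.functor.for_(
    kd.iterables.make(1, 2, 3),
    # body_fn receives the item positionally and state variables as kwargs,
    # and returns a namedtuple updating some of them.
    kd.fn(lambda item, returns: kd.namedtuple(returns=returns + item)),
    returns=0,
)
# total -> kd.item(6)
```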
kd.functor.fstr_fn(returns, **kwargs)
Returns a Koda functor from format string.
The format string must be created via Python f-string syntax. It must contain
at least one formatted expression.
kwargs are used to assign values to the functor variables and can be used in
the formatted expressions using the V. syntax.
Each formatted expression must have a custom format specification,
e.g. `{I.x:s}` or `{V.y:.2f}`.
Examples:
kd.call(fstr_fn(f'{I.x:s} {I.y:s}'), x=1, y=2) # kd.slice('1 2')
kd.call(fstr_fn(f'{V.x:s} {I.y:s}', x=1), y=2) # kd.slice('1 2')
kd.call(fstr_fn(f'{(I.x + I.y):s}'), x=1, y=2) # kd.slice('3')
kd.call(fstr_fn('abc')) # error - no substitutions
kd.call(fstr_fn('{I.x}'), x=1) # error - format should be f-string
Args:
returns: A format string.
**kwargs: variable assignments.
kd.functor.get_signature(fn_def)
Retrieves the signature attached to the given functor.
Args:
fn_def: The functor to retrieve the signature for, or a slice thereof.
Returns:
The signature(s) attached to the functor(s).
kd.functor.has_fn(x)
Aliases:
Returns `present` for each item in `x` that is a functor.
Note that this is a pointwise operator. See `kd.functor.is_fn` for the
corresponding scalar version.
Args:
x: DataSlice to check.
kd.functor.if_(cond, yes_fn, no_fn, *args, return_type_as=DataItem(None, schema: NONE), **kwargs)
Aliases:
Calls either `yes_fn` or `no_fn` depending on the value of `cond`.
This is a short-circuit sibling of kd.cond which executes only one of the two
functors, and therefore requires that `cond` is a MASK scalar.
Example:
x = kd.item(5)
kd.if_(x > 3, lambda x: x + 1, lambda x: x - 1, x)
# returns 6, note that both lambdas were traced into functors.
Args:
cond: The condition to branch on. Must be a MASK scalar.
yes_fn: The functor to be called if `cond` is present.
no_fn: The functor to be called if `cond` is missing.
*args: The positional argument(s) to pass to the functor.
return_type_as: The return type of the call is expected to be the same as
the return type of this expression. In most cases, this will be a literal
of the corresponding type. This needs to be specified if the functor does
not return a DataSlice. kd.types.DataSlice, kd.types.DataBag and
kd.types.JaggedShape can also be passed here.
**kwargs: The keyword argument(s) to pass to the functor.
Returns:
The result of the call of either `yes_fn` or `no_fn`.
kd.functor.is_fn(obj)
Aliases:
Checks if `obj` represents a functor.
Args:
obj: The value to check.
Returns:
kd.present if `obj` is a DataSlice representing a functor, kd.missing
otherwise (for example if obj has wrong type).
kd.functor.map(fn, *args, include_missing=False, **kwargs)
Aliases:
Aligns fn and args/kwargs and calls corresponding fn on corresponding arg.
If certain items of `fn` are missing, the corresponding items of the result
will also be missing.
If certain items of `args`/`kwargs` are missing then:
- when include_missing=False (the default), the corresponding item of the
result will be missing.
- when include_missing=True, we are still calling the functor on those missing
args/kwargs.
`fn`, `args`, `kwargs` will all be broadcast to the common shape. The return
values of the functors will be converted to a common schema, or an exception
will be raised if the schemas are not compatible. In that case, you can add
the appropriate cast inside the functor.
Example:
fn1 = kdf.fn(lambda x, y: x + y)
fn2 = kdf.fn(lambda x, y: x - y)
fn = kd.slice([fn1, fn2])
x = kd.slice([[1, None, 3], [4, 5, 6]])
y = kd.slice(1)
kd.map(kd.slice([fn1, fn2]), x=x, y=y)
# returns kd.slice([[2, None, 4], [3, 4, 5]])
kd.map(kd.slice([fn1, None]), x=x, y=y)
# returns kd.slice([[2, None, 4], [None, None, None]])
Args:
fn: DataSlice containing the functor(s) to evaluate. All functors must
return a DataItem.
*args: The positional argument(s) to pass to the functors.
include_missing: Whether to call the functors on missing items of
args/kwargs.
**kwargs: The keyword argument(s) to pass to the functors.
Returns:
The evaluation result.
kd.functor.map_py_fn(f, *, schema=None, max_threads=1, ndim=0, include_missing=None, **defaults)
Returns a Koda functor wrapping a python function for kd.map_py.
See kd.map_py for detailed APIs, and kd.py_fn for details about function
wrapping. `schema`, `max_threads` and `ndim` cannot be a Koda Expr or a Koda
functor.
Args:
f: Python function.
schema: The schema to use for resulting DataSlice.
max_threads: maximum number of threads to use.
ndim: Dimensionality of items to pass to `f`.
include_missing: Specifies whether `f` applies to all items (`=True`) or
only to items present in all `args` and `kwargs` (`=False`, valid only
when `ndim=0`); defaults to `False` when `ndim=0`.
**defaults: Keyword defaults to pass to the function. The values in this map
may be kde expressions, format strings, or 0-dim DataSlices. See the
docstring for py_fn for more details.
kd.functor.py_fn(f, *, return_type_as=<class 'koladata.types.data_slice.DataSlice'>, **defaults)
Aliases:
Returns a Koda functor wrapping a python function.
This is the most flexible way to wrap a python function for large, complex
code that doesn't require serialization. Note that functions wrapped with
py_fn are not serializable. See register_py_fn for an alternative that is
serializable.
Note that unlike the functors created by kd.functor.expr_fn from an Expr, this
functor will have exactly the same signature as the original function. In
particular, if the original function does not accept variadic keyword
arguments and an unknown argument is passed when calling the functor, an
exception will occur.
Args:
f: Python function. It is required that this function returns a
DataSlice/DataItem or a primitive that will be automatically wrapped into
a DataItem.
return_type_as: The return type of the function is expected to be the same
as the type of this value. This needs to be specified if the function does
not return a DataSlice/DataItem or a primitive that would be auto-boxed
into a DataItem. kd.types.DataSlice, kd.types.DataBag and
kd.types.JaggedShape can also be passed here.
**defaults: Keyword defaults to bind to the function. The values in this map
may be Koda expressions or DataItems (see docstring for kd.bind for more
details). Defaults can be overridden through kd.call arguments. **defaults
and inputs to kd.call will be combined and passed through to the function.
If a parameter that is not passed does not have a default value defined by
the function then an exception will occur.
Returns:
A DataItem representing the functor.
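A minimal sketch (the function name is illustrative):
```
# Arbitrary Python, including control flow, is allowed inside a py_fn.
def describe(x):
  return 'big' if x.to_py() > 10 else 'small'

fn = kd.functor.py_fn(describe)
kd.call(fn, x=42)  # -> kd.item('big'); the str result is auto-boxed
```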
kd.functor.reduce(fn, items, initial_value)
Reduces an iterable using the given functor.
The result is a DataSlice that has the value: fn(fn(fn(initial_value,
items[0]), items[1]), ...), where the fn calls are done in the order of the
items in the iterable.
Args:
fn: A binary function or functor to be applied to each item of the iterable;
its return type must be the same as the first argument.
items: An iterable to be reduced.
initial_value: The initial value to be passed to the functor.
Returns:
Result of the reduction as a single value.
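A minimal sketch:
```
# Evaluates as fn(fn(fn(0, 1), 2), 3).
kd.functor.reduce(
    kd.fn(lambda acc, item: acc + item),
    kd.iterables.make(1, 2, 3),
    initial_value=0,
)  # -> kd.item(6)
```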
kd.functor.register_py_fn(f, *, return_type_as=<class 'koladata.types.data_slice.DataSlice'>, unsafe_override=False, **defaults)
Aliases:
Returns a Koda functor wrapping a function registered as an operator.
This is the recommended way to wrap a non-traceable python function into a
functor.
`f` will be wrapped into an operator and registered with the name taken from
the `__module__` and `__qualname__` attributes. This requires `f` to be named,
and not to be a locally defined function. Furthermore, repeated calls to
`register_py_fn` on the same function will fail unless unsafe_override is True
(not recommended).
The resulting functor can be serialized and loaded from a different process
(unlike `py_fn`). In order for the serialized functor to be deserialized, an
equivalent call to `register_py_fn` is required to have been made in the
process that performs the deserialization. In practice, this is often
implemented through `f` being a top-level function that is registered at
library import time.
Note that unlike the functors created by kd.functor.expr_fn from an Expr, this
functor will have exactly the same signature as the original function. In
particular, if the original function does not accept variadic keyword
arguments and an unknown argument is passed when calling the functor, an
exception will occur.
Args:
f: Python function. It is required that this function returns a
DataSlice/DataItem or a primitive that will be automatically wrapped into
a DataItem. The function must also have `__qualname__` and `__module__`
attributes set.
return_type_as: The return type of the function is expected to be the same
as the type of this value. This needs to be specified if the function does
not return a DataSlice/DataItem or a primitive that would be auto-boxed
into a DataItem. kd.types.DataSlice, kd.types.DataBag and
kd.types.JaggedShape can also be passed here.
unsafe_override: Whether to override an existing operator. Not recommended
unless you know what you are doing.
**defaults: Keyword defaults to bind to the function. The values in this map
may be Koda expressions or DataItems (see docstring for kd.bind for more
details). Defaults can be overridden through kd.call arguments. **defaults
and inputs to kd.call will be combined and passed through to the function.
If a parameter that is not passed does not have a default value defined by
the function then an exception will occur.
Returns:
A DataItem representing the functor.
kd.functor.trace_as_fn(*, name=None, return_type_as=None, functor_factory=None)
Aliases:
A decorator to customize the tracing behavior for a particular function.
A function with this decorator is converted to an internally-stored functor.
In traced expressions that call the function, that functor is invoked as a
sub-functor via 'kde.call', rather than the function being re-traced.
Additionally, the functor passed to 'kde.call' is assigned a name, so that
when auto_variables=True is used (which is the default in kd.trace_py_fn),
the functor for the decorated function will become an attribute of the
functor for the outer function being traced.
The result of 'kde.call' is also assigned a name with a '_result' suffix, so
that it also becomes a separate variable in the outer function being traced.
This is useful for debugging.
This can be used to avoid excessive re-tracing and recompilation of shared
python functions, to quickly add structure to the functor produced by tracing
for complex computations, or to conveniently embed a py_fn into a traced
expression.
When using kd.parallel.call_multithreaded, using this decorator on
sub-functors can improve parallelism, since all sub-functor calls
are treated as separate tasks to be parallelized there.
This decorator is intended to be applied to standalone functions.
When applying it to a lambda, consider specifying an explicit name, otherwise
it will be called '<lambda>' or '<lambda>_0' etc, which is not very useful.
When applying it to a class method, it is likely to fail in tracing mode
because it will try to auto-box the class instance into an expr, which is
likely not supported.
When executing the resulting function in eager mode, we will evaluate the
underlying function directly instead of evaluating the functor, to have
nicer stack traces in case of an exception. However, we will still apply
the boxing rules on the returned value (for example, convert Python primitives
to DataItems) to better emulate what will happen in tracing mode.
kd.functor.trace_py_fn(f, *, auto_variables=True, **defaults)
Aliases:
Returns a Koda functor created by tracing a given Python function.
When 'f' has variadic positional (*args) or variadic keyword
(**kwargs) arguments, their names must start with 'unused', and they
must actually be unused inside 'f'.
'f' must not use Python control flow operations such as if or for.
Args:
f: Python function.
auto_variables: When true, we create additional variables automatically
based on the traced expression. All DataSlice literals become their own
variables, and all named subexpressions become their own variables. This
helps readability and manipulation of the resulting functor. Note that
this defaults to True here, while it defaults to False in
kd.functor.expr_fn.
**defaults: Keyword defaults to bind to the function. The values in this map
may be Koda expressions or DataItems (see docstring for kd.bind for more
details). Defaults can be overridden through kd.call arguments. **defaults
and inputs to kd.call will be combined and passed through to the function.
If a parameter that is not passed does not have a default value defined by
the function then an exception will occur.
Returns:
A DataItem representing the functor.
kd.functor.while_(condition_fn, body_fn, *, returns=unspecified, yields=unspecified, yields_interleaved=unspecified, **initial_state)
Aliases:
While a condition functor returns present, runs a body functor repeatedly.
The items in `initial_state` (and `returns`, if specified) are used to
initialize a dict of state variables, which are passed as keyword arguments
to `condition_fn` and `body_fn` on each loop iteration, and updated from the
namedtuple (see kd.namedtuple) return value of `body_fn`.
Exactly one of `returns`, `yields`, or `yields_interleaved` must be specified.
The return value of this operator depends on which one is present:
- `returns`: the value of `returns` when the loop ends. The initial value of
`returns` must have the same qtype (e.g. DataSlice, DataBag) as the final
return value.
- `yields`: a single iterable chained (using `kd.iterables.chain`) from the
value of `yields` returned from each invocation of `body_fn`. The value of
`yields` must always be an iterable, including initially.
- `yields_interleaved`: the same as for `yields`, but the iterables are
interleaved (using `kd.iterables.interleave`) instead of being chained.
Args:
condition_fn: A functor with keyword argument names matching the state
variable names and returning a MASK DataItem.
body_fn: A functor with argument names matching the state variable names and
returning a namedtuple (see kd.namedtuple) with a subset of the keys
of `initial_state`.
returns: If present, the initial value of the 'returns' state variable.
yields: If present, the initial value of the 'yields' state variable.
yields_interleaved: If present, the initial value of the
`yields_interleaved` state variable.
**initial_state: A dict of the initial values for state variables.
Returns:
If `returns` is a state variable, the value of `returns` when the loop
ended. Otherwise, an iterable combining the values of `yields` or
`yields_interleaved` from each body invocation.
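As an illustrative sketch of the `returns` mode:
```
# Factorial via state variables `n` and `returns`.
kd.functor.while_(
    kd.fn(lambda n, returns: n > 0),  # MASK condition
    kd.fn(lambda n, returns: kd.namedtuple(n=n - 1, returns=returns * n)),
    returns=1,
    n=5,
)  # -> kd.item(120)
```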
Operators that work with ItemIds.
Operators
kd.ids.agg_uuid(x, ndim=unspecified)
Aliases:
Computes aggregated uuid of elements over the last `ndim` dimensions.
Args:
x: A DataSlice.
ndim: The number of dimensions to aggregate over. Requires 0 <= ndim <=
get_ndim(x).
Returns:
DataSlice that has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
kd.ids.decode_itemid(ds)
Aliases:
Returns ItemIds decoded from the base62 strings.
kd.ids.deep_uuid(x, /, schema=unspecified, *, seed='')
Aliases:
Recursively computes uuid for x.
Args:
x: The slice to take uuid on.
schema: The schema to use to resolve '*' and '**' tokens. If not specified,
will use the schema of the 'x' DataSlice.
seed: The seed to use for uuid computation.
Returns:
Result of recursive uuid application to `x`.
kd.ids.encode_itemid(ds)
Aliases:
Returns the base62 encoded ItemIds in `ds` as strings.
kd.ids.has_uuid(x)
Returns present for each item in `x` that has a UUID.
Also see `kd.ids.is_uuid` for checking if `x` is a UUID DataSlice. But note
that `kd.all(kd.has_uuid(x))` is not always equivalent to `kd.is_uuid(x)`. For
example,
kd.ids.is_uuid(kd.item(None, kd.OBJECT)) -> kd.present
kd.all(kd.ids.has_uuid(kd.item(None, kd.OBJECT))) -> invalid for kd.all
kd.ids.is_uuid(kd.slice([None], kd.OBJECT)) -> kd.present
kd.all(kd.ids.has_uuid(kd.slice([None], kd.OBJECT))) -> kd.missing
Args:
x: DataSlice to check.
Returns:
A MASK DataSlice with the same shape as `x`.
kd.ids.hash_itemid(x)
Aliases:
Returns an INT64 DataSlice of hash values of `x`.
The hash values are in the range of [0, 2**63-1].
The hash algorithm is subject to change. It is not guaranteed to be stable in
future releases.
Args:
x: DataSlice of ItemIds.
Returns:
A DataSlice of INT64 hash values.
kd.ids.is_uuid(x)
Returns whether x is a UUID DataSlice.
Note that the operator returns `kd.present` even for missing values, as long
as their schema does not prevent containing UUIDs.
Also see `kd.ids.has_uuid` for a pointwise version. But note that
`kd.all(kd.ids.has_uuid(x))` is not always equivalent to `kd.is_uuid(x)`. For
example,
kd.ids.is_uuid(kd.item(None, kd.OBJECT)) -> kd.present
kd.all(kd.ids.has_uuid(kd.item(None, kd.OBJECT))) -> invalid for kd.all
kd.ids.is_uuid(kd.slice([None], kd.OBJECT)) -> kd.present
kd.all(kd.ids.has_uuid(kd.slice([None], kd.OBJECT))) -> kd.missing
Args:
x: DataSlice to check.
Returns:
A MASK DataItem.
kd.ids.uuid(seed='', **kwargs)
Aliases:
Creates a DataSlice whose items are Fingerprints identifying arguments.
Args:
seed: text seed for the uuid computation.
**kwargs: a named tuple mapping attribute names to DataSlices. The DataSlice
values must be alignable.
Returns:
DataSlice of Uuids. The i-th uuid is computed by taking the i-th (aligned)
item from each kwarg value.
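A minimal sketch of the fingerprinting behavior (values are illustrative):
```
id1 = kd.ids.uuid(seed='s', x=kd.item(1))
id2 = kd.ids.uuid(seed='s', x=kd.item(1))
id3 = kd.ids.uuid(seed='s', x=kd.item(2))
# id1 and id2 are the same fingerprint; id3 differs because an input differs.
```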
kd.ids.uuid_for_dict(seed='', **kwargs)
Aliases:
Creates a DataSlice whose items are Fingerprints identifying arguments.
To be used for keying dict items.
e.g.
kd.dict(['a', 'b'], [1, 2], itemid=kd.uuid_for_dict(seed='seed', a=ds(1)))
Args:
seed: text seed for the uuid computation.
**kwargs: a named tuple mapping attribute names to DataSlices. The DataSlice
values must be alignable.
Returns:
DataSlice of Uuids. The i-th uuid is computed by taking the i-th (aligned)
item from each kwarg value.
kd.ids.uuid_for_list(seed='', **kwargs)
Aliases:
Creates a DataSlice whose items are Fingerprints identifying arguments.
To be used for keying list items.
e.g.
kd.list([1, 2, 3], itemid=kd.uuid_for_list(seed='seed', a=ds(1)))
Args:
seed: text seed for the uuid computation.
**kwargs: a named tuple mapping attribute names to DataSlices. The DataSlice
values must be alignable.
Returns:
DataSlice of Uuids. The i-th uuid is computed by taking the i-th (aligned)
item from each kwarg value.
kd.ids.uuids_with_allocation_size(seed='', *, size)
Aliases:
Creates a DataSlice whose items are uuids.
The uuids are allocated in a single allocation. They are all distinct.
You can think of the result as a DataSlice created with:
[fingerprint(seed, size, i) for i in range(size)]
Args:
seed: text seed for the uuid computation.
size: the size of the allocation. It will also be used for the uuid
computation.
Returns:
A 1-dimensional DataSlice with `size` distinct uuids.
Operators that work with iterables. These APIs are in active development and might change often.
Operators
kd.iterables.chain(*iterables, value_type_as=unspecified)
Creates an iterable that chains the given iterables, in the given order.
The iterables must all have the same value type. If value_type_as is
specified, it must be the same as the value type of the iterables, if any.
Args:
*iterables: A list of iterables to be chained (concatenated).
value_type_as: A value that has the same type as the iterables. It is useful
to specify this explicitly if the list of iterables may be empty. If this
is not specified and the list of iterables is empty, the iterable will
have DataSlice as the value type.
Returns:
An iterable that chains the given iterables, in the given order.
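A minimal sketch:
```
it = kd.iterables.chain(kd.iterables.make(1, 2), kd.iterables.make(3))
# Equivalent to kd.iterables.make(1, 2, 3), preserving the given order.
```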
kd.iterables.from_1d_slice(slice_)
Converts a 1D DataSlice to a Koda iterable of DataItems.
Args:
slice_: A 1D DataSlice to be converted to an iterable.
Returns:
A Koda iterable of DataItems, in the order of the slice. All returned
DataItems point to the same DataBag as the input DataSlice.
kd.iterables.interleave(*iterables, value_type_as=unspecified)
Creates an iterable that interleaves the given iterables.
The resulting iterable has all items from all input iterables, and the order
within each iterable is preserved. But the order of interleaving of different
iterables can be arbitrary.
Having unspecified order allows the parallel execution to put the items into
the result in the order they are computed, potentially increasing the amount
of parallel processing done.
The iterables must all have the same value type. If value_type_as is
specified, it must be the same as the value type of the iterables, if any.
Args:
*iterables: A list of iterables to be interleaved.
value_type_as: A value that has the same type as the iterables. It is useful
to specify this explicitly if the list of iterables may be empty. If this
is not specified and the list of iterables is empty, the iterable will
have DataSlice as the value type.
Returns:
An iterable that interleaves the given iterables, in arbitrary order.
kd.iterables.make(*items, value_type_as=unspecified)
Creates an iterable from the provided items, in the given order.
The items must all have the same type (for example data slice, or data bag).
However, in case of data slices, the items can have different shapes or
schemas.
Args:
*items: Items to be put into the iterable.
value_type_as: A value that has the same type as the items. It is useful to
specify this explicitly if the list of items may be empty. If this is not
specified and the list of items is empty, the iterable will have data
slice as the value type.
Returns:
An iterable with the given items.
kd.iterables.make_unordered(*items, value_type_as=unspecified)
Creates an iterable from the provided items, in an arbitrary order.
Having unspecified order allows the parallel execution to put the items into
the iterable in the order they are computed, potentially increasing the amount
of parallel processing done.
When used with the non-parallel evaluation, we intentionally randomize the
order to prevent user code from depending on the order, and avoid
discrepancies when switching to parallel evaluation.
Args:
*items: Items to be put into the iterable.
value_type_as: A value that has the same type as the items. It is useful to
specify this explicitly if the list of items may be empty. If this is not
specified and the list of items is empty, the iterable will have data
slice as the value type.
Returns:
An iterable with the given items, in an arbitrary order.
kd.iterables.reduce_concat(items, initial_value, ndim=1)
Concatenates the values of the given iterable.
This operator is a concrete case of the more general kd.functor.reduce; it
exists to speed up such concatenation from the O(N^2) cost that the general
reduce would incur to O(N). See the docstring of kd.concat for more details
about the concatenation semantics.
Args:
items: An iterable of data slices to be concatenated.
initial_value: The initial value to be concatenated before items.
ndim: The number of last dimensions to concatenate.
Returns:
The concatenated data slice.
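A minimal sketch:
```
kd.iterables.reduce_concat(
    kd.iterables.make(kd.slice([1, 2]), kd.slice([3])),
    initial_value=kd.slice([0]),
)  # -> kd.slice([0, 1, 2, 3]); concatenation along the last dimension
```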
kd.iterables.reduce_updated_bag(items, initial_value)
Merges the bags from the given iterable into one.
This operator is a concrete case of the more general kd.functor.reduce; it
exists to speed up such merging from the O(N^2) cost that the general reduce
would incur to O(N). See the docstring of kd.updated_bag for more details
about the merging semantics.
Args:
items: An iterable of data bags to be merged.
initial_value: The data bag to be merged with the items. Note that the items
will be merged as updates to this bag, meaning that they will take
precedence over the initial_value on conflicts.
Returns:
The merged data bag.
JSON serialization operators.
Operators
kd.json.from_json(x, /, schema=OBJECT, default_number_schema=OBJECT, *, on_invalid=DataSlice([], schema: NONE), keys_attr='json_object_keys', values_attr='json_object_values')
Aliases:
Parses a DataSlice `x` of JSON strings.
The result will have the same shape as `x`, and missing items in `x` will be
missing in the result. The result will use a new immutable DataBag.
If `schema` is OBJECT (the default), the schema is inferred from the JSON
data, and the result will have an OBJECT schema. The decoded data will only
have BOOLEAN, numeric, STRING, LIST[OBJECT], and entity schemas, corresponding
to JSON primitives, arrays, and objects.
If `default_number_schema` is OBJECT (the default), then the inferred schema
of each JSON number will be INT32, INT64, or FLOAT32, depending on its value
and on whether it contains a decimal point or exponent, matching the combined
behavior of python json and `kd.from_py`. Otherwise, `default_number_schema`
must be a numeric schema, and the inferred schema of all JSON numbers will be
that schema.
For example:
kd.from_json(None) -> kd.obj(None)
kd.from_json('null') -> kd.obj(None)
kd.from_json('true') -> kd.obj(True)
kd.from_json('[true, false, null]') -> kd.obj([True, False, None])
kd.from_json('[1, 2.0]') -> kd.obj([1, 2.0])
kd.from_json('[1, 2.0]', kd.OBJECT, kd.FLOAT64)
-> kd.obj([kd.float64(1.0), kd.float64(2.0)])
JSON objects parsed using an OBJECT schema will record the object key order on
the attribute specified by `keys_attr` as a LIST[STRING], and also redundantly
record a copy of the object values as a parallel LIST on the attribute
specified by `values_attr`. If there are duplicate keys, the last value is the
one stored on the Koda object attribute. If a key conflicts with `keys_attr`
or `values_attr`, it is only available in the `values_attr` list. These
behaviors can be disabled by setting `keys_attr` and/or `values_attr` to None.
For example:
kd.from_json('{"a": 1, "b": "y", "c": null}') ->
kd.obj(a=1.0, b='y', c=None,
json_object_keys=kd.list(['a', 'b', 'c']),
json_object_values=kd.list([1.0, 'y', None]))
kd.from_json('{"a": 1, "b": "y", "c": null}',
keys_attr=None, values_attr=None) ->
kd.obj(a=1.0, b='y', c=None)
kd.from_json('{"a": 1, "b": "y", "c": null}',
keys_attr='my_keys', values_attr='my_values') ->
kd.obj(a=1.0, b='y', c=None,
my_keys=kd.list(['a', 'b', 'c']),
my_values=kd.list([1.0, 'y', None]))
kd.from_json('{"a": 1, "a": 2", "a": 3}') ->
kd.obj(a=3.0,
json_object_keys=kd.list(['a', 'a', 'a']),
json_object_values=kd.list([1.0, 2.0, 3.0]))
kd.from_json('{"json_object_keys": ["x", "y"]}') ->
kd.obj(json_object_keys=kd.list(['json_object_keys']),
json_object_values=kd.list([["x", "y"]]))
If `schema` is explicitly specified, it is used to validate the JSON data,
and the result DataSlice will have `schema` as its schema.
OBJECT schemas inside subtrees of `schema` are allowed, and will use the
inference behavior described above.
Primitive schemas in `schema` will attempt to cast any JSON primitives using
normal Koda explicit casting rules, and if those fail, using the following
additional rules:
- BYTES will accept JSON strings containing base64 (RFC 4648 section 4)
If entity schemas in `schema` have attributes matching `keys_attr` and/or
`values_attr`, then the object key and/or value order (respectively) will be
recorded as lists on those attributes, similar to the behavior for OBJECT
described above. These attributes must have schemas LIST[STRING] and
LIST[T] (for a T compatible with the contained values) if present.
For example:
kd.from_json('null', kd.MASK) -> kd.missing
kd.from_json('null', kd.STRING) -> kd.str(None)
kd.from_json('123', kd.INT32) -> kd.int32(123)
kd.from_json('123', kd.FLOAT32) -> kd.float32(123.0)
kd.from_json('"123"', kd.STRING) -> kd.string('123')
kd.from_json('"123"', kd.INT32) -> kd.int32(123)
kd.from_json('"123"', kd.FLOAT32) -> kd.float32(123.0)
kd.from_json('"MTIz"', kd.BYTES) -> kd.bytes(b'123')
kd.from_json('"inf"', kd.FLOAT32) -> kd.float32(float('inf'))
kd.from_json('"1e100"', kd.FLOAT32) -> kd.float32(float('inf'))
kd.from_json('[1, 2, 3]', kd.list_schema(kd.INT32)) -> kd.list([1, 2, 3])
kd.from_json('{"a": 1}', kd.schema.new_schema(a=kd.INT32)) -> kd.new(a=1)
kd.from_json('{"a": 1}', kd.dict_schema(kd.STRING, kd.INT32)
-> kd.dict({"a": 1})
kd.from_json('{"b": 1, "a": 2}',
kd.new_schema(
a=kd.INT32, json_object_keys=kd.list_schema(kd.STRING))) ->
kd.new(a=1, json_object_keys=kd.list(['b', 'a', 'c']))
kd.from_json('{"b": 1, "a": 2, "c": 3}',
kd.new_schema(a=kd.INT32,
json_object_keys=kd.list_schema(kd.STRING),
json_object_values=kd.list_schema(kd.OBJECT))) ->
kd.new(a=1, c=3.0,
json_object_keys=kd.list(['b', 'a', 'c']),
json_object_values=kd.list([1, 2.0, 3.0]))
In general:
`kd.to_json(kd.from_json(x))` is equivalent to `x`, ignoring differences in
JSON number formatting and padding.
`kd.from_json(kd.to_json(x), kd.get_schema(x))` is equivalent to `x` if `x`
has a concrete (no OBJECT) schema, ignoring differences in Koda itemids.
In other words, `to_json` doesn't capture the full information of `x`, but
the original schema of `x` has enough additional information to recover it.
Args:
x: A DataSlice of STRING containing JSON strings to parse.
schema: A SCHEMA DataItem containing the desired result schema. Defaults to
kd.OBJECT.
default_number_schema: A SCHEMA DataItem containing a numeric schema, or
None to infer all number schemas using python-boxing-like rules.
on_invalid: If specified, a DataItem to use in the result wherever the
corresponding JSON string in `x` was invalid. If unspecified, any invalid
JSON strings in `x` will cause an operator error.
keys_attr: A STRING DataItem that controls which entity attribute is used to
record json object key order, if it is present on the schema.
values_attr: A STRING DataItem that controls which entity attribute is used
to record json object values, if it is present on the schema.
Returns:
A DataSlice with the same shape as `x` and schema `schema`.
kd.json.to_json(x, /, *, indent=DataItem(None, schema: NONE), ensure_ascii=True, keys_attr='json_object_keys', values_attr='json_object_values', include_missing_values=True)
Aliases:
Converts `x` to a DataSlice of JSON strings.
The following schemas are allowed:
- STRING, BYTES, INT32, INT64, FLOAT32, FLOAT64, MASK, BOOLEAN
- LIST[T] where T is an allowed schema
- DICT{K, V} where K is one of {STRING, BYTES, INT32, INT64}, and V is an
allowed schema
- Entity schemas where all attribute values have allowed schemas
- OBJECT schemas resolving to allowed schemas
Itemid cycles are not allowed.
Missing DataSlice items in the input are missing in the result. Missing values
inside of lists/entities/etc. are encoded as JSON `null` (or `false` for
`kd.missing`). If `include_missing_values` is `False`, entity attributes with
missing values are omitted from the JSON output.
For example:
kd.to_json(None) -> kd.str(None)
kd.to_json(kd.missing) -> kd.str(None)
kd.to_json(kd.present) -> 'true'
kd.to_json(True) -> 'true'
kd.to_json(kd.slice([1, None, 3])) -> ['1', None, '3']
kd.to_json(kd.list([1, None, 3])) -> '[1, null, 3]'
kd.to_json(kd.dict({'a': 1, 'b': '2'})) -> '{"a": 1, "b": "2"}'
kd.to_json(kd.new(a=1, b='2')) -> '{"a": 1, "b": "2"}'
kd.to_json(kd.new(x=None)) -> '{"x": null}'
kd.to_json(kd.new(x=kd.missing)) -> '{"x": false}'
kd.to_json(kd.new(a=1, b=None), include_missing_values=False)
-> '{"a": 1}'
Koda BYTES values are converted to base64 strings (RFC 4648 section 4).
Integers are always stored exactly in decimal. Finite floating point values
are formatted similar to python format string `%.17g`, except that a decimal
point and at least one decimal digit are always present if the format doesn't
use scientific notation. This appears to match the behavior of python json.
Non-finite floating point values are stored as the strings "inf", "-inf" and
"nan". This differs from python json, which emits non-standard JSON tokens
`Infinity` and `NaN`. This also differs from javascript, which stores these
values as `null`, which would be ambiguous with Koda missing values. There is
unfortunately no standard way to express these values in JSON.
By default, JSON objects are written with keys in sorted order. However, it is
also possible to control the key order of JSON objects using the `keys_attr`
argument. If an entity has the attribute specified by `keys_attr`, then that
attribute must have schema LIST[STRING], and the JSON object will have exactly
the key order specified in that list, including duplicate keys.
To write duplicate JSON object keys with different values, use `values_attr`
to designate an attribute to hold a parallel list of values to write.
For example:
kd.to_json(kd.new(x=1, y=2)) -> '{"x": 1, "y": 2}'
kd.to_json(kd.new(x=1, y=2, json_object_keys=kd.list(['y', 'x'])))
-> '{"y": 2, "x": 1}'
kd.to_json(kd.new(x=1, y=2, foo=kd.list(['y', 'x'])), keys_attr='foo')
-> '{"y": 2, "x": 1}'
kd.to_json(kd.new(x=1, y=2, z=3, json_object_keys=kd.list(['x', 'z', 'x'])))
-> '{"x": 1, "z": 3, "x": 1}'
kd.to_json(kd.new(json_object_keys=kd.list(['x', 'z', 'x']),
json_object_values=kd.list([1, 2, 3])))
-> '{"x": 1, "z": 2, "x": 3}'
kd.to_json(kd.new(a=kd.list(['x', 'z', 'x']), b=kd.list([1, 2, 3])),
keys_attr='a', values_attr='b')
-> '{"x": 1, "z": 2, "x": 3}'
The `indent` and `ensure_ascii` arguments control JSON formatting:
- If `indent` is negative, then the JSON is formatted without any whitespace.
- If `indent` is None (the default), the JSON is formatted with a single
padding space only after ',' and ':' and no other whitespace.
- If `indent` is zero or positive, the JSON is pretty-printed, with that
number of spaces used for indenting each level.
- If `ensure_ascii` is True (the default) then all non-ASCII code points in
strings will be escaped, and the result strings will be ASCII-only.
Otherwise, they will be left as-is.
For example:
kd.to_json(kd.list([1, 2, 3]), indent=-1) -> '[1,2,3]'
kd.to_json(kd.list([1, 2, 3]), indent=2) -> '[\n 1,\n 2,\n 3\n]'
kd.to_json('✨', ensure_ascii=True) -> '"\\u2728"'
kd.to_json('✨', ensure_ascii=False) -> '"✨"'
Args:
x: The DataSlice to convert.
indent: An INT32 DataItem that describes how the result should be indented.
ensure_ascii: A BOOLEAN DataItem that controls non-ASCII escaping.
keys_attr: A STRING DataItem that controls which entity attribute controls
json object key order, or None to always use sorted order. Defaults to
`json_object_keys`.
values_attr: A STRING DataItem that can be used with `keys_attr` to give
full control over json object contents. Defaults to
`json_object_values`.
include_missing_values: A BOOLEAN DataItem. If `False`, attributes with
missing values will be omitted from entity JSON objects. Defaults to
`True`.
Operators working with lists.
Operators
kd.lists.appended_list(x, append)
Aliases:
Appends items in `append` to the end of each list in `x`.
`x` and `append` must have compatible shapes.
The resulting lists have different ItemIds from the original lists.
Args:
x: DataSlice of lists.
append: DataSlice of values to append to each list in `x`.
Returns:
DataSlice of lists with new ItemIds in a new immutable DataBag.
kd.lists.concat(*lists)
Aliases:
Returns a DataSlice of Lists concatenated from the List items of `lists`.
Returned lists are immutable.
Each input DataSlice must contain only present List items, and the item
schemas of each input must be compatible. Input DataSlices are aligned (see
`kd.align`) automatically before concatenation.
If `lists` is empty, this returns a single empty list with OBJECT item schema.
Args:
*lists: the DataSlices of Lists to concatenate
Returns:
DataSlice of concatenated Lists
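A minimal sketch:
```
merged = kd.lists.concat(kd.list([1, 2]), kd.list([3]))
merged[:]  # -> kd.slice([1, 2, 3])
```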
kd.lists.explode(x, ndim=1)
Aliases:
Explodes a List DataSlice `x` a specified number of times.
A single list "explosion" converts a rank-K DataSlice of LIST[T] to a
rank-(K+1) DataSlice of T, by unpacking the items in the Lists in the original
DataSlice as a new DataSlice dimension in the result. Missing values in the
original DataSlice are treated as empty lists.
A single list explosion can also be done with `x[:]`.
If `ndim` is set to a non-negative integer, explodes recursively `ndim` times.
An `ndim` of zero is a no-op.
If `ndim` is set to a negative integer, explodes as many times as possible,
until at least one of the items of the resulting DataSlice is not a List.
Args:
x: DataSlice of Lists to explode
ndim: the number of explosion operations to perform, defaults to 1
Returns:
DataSlice
kd.lists.get_item(x, key_or_index)
Alias for kd.core.get_item operator.
kd.lists.has_list(x)
Aliases:
Returns present for each item in `x` that is a List.
Note that this is a pointwise operation.
Also see `kd.is_list` for checking if `x` is a List DataSlice. But note that
`kd.all(kd.has_list(x))` is not always equivalent to `kd.is_list(x)`. For
example,
kd.is_list(kd.item(None, kd.OBJECT)) -> kd.present
kd.all(kd.has_list(kd.item(None, kd.OBJECT))) -> invalid for kd.all
kd.is_list(kd.slice([None], kd.OBJECT)) -> kd.present
kd.all(kd.has_list(kd.slice([None], kd.OBJECT))) -> kd.missing
Args:
x: DataSlice to check.
Returns:
A MASK DataSlice with the same shape as `x`.
kd.lists.implode(x, /, ndim=1, itemid=None)
Aliases:
Implodes a DataSlice `x` a specified number of times.
Returned lists are immutable.
A single list "implosion" converts a rank-(K+1) DataSlice of T to a rank-K
DataSlice of LIST[T], by folding the items in the last dimension of the
original DataSlice into newly-created Lists.
If `ndim` is set to a non-negative integer, implodes recursively `ndim` times.
If `ndim` is set to a negative integer, implodes as many times as possible,
until the result is a DataItem (i.e. a rank-0 DataSlice) containing a single
nested List.
Args:
x: the DataSlice to implode
ndim: the number of implosion operations to perform
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting lists.
Returns:
DataSlice of nested Lists
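A minimal sketch:
```
x = kd.slice([[1, 2], [3]])
kd.lists.implode(x)           # rank-1 DataSlice of two Lists
kd.lists.implode(x, ndim=-1)  # a single DataItem holding [[1, 2], [3]]
```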
kd.lists.is_list(x)
Aliases:
Returns whether x is a List DataSlice.
`x` is a List DataSlice if it meets one of the following conditions:
1) it has a List schema
2) it has OBJECT schema and only has List items
Also see `kd.has_list` for a pointwise version. But note that
`kd.all(kd.has_list(x))` is not always equivalent to `kd.is_list(x)`. For
example,
kd.is_list(kd.item(None, kd.OBJECT)) -> kd.present
kd.all(kd.has_list(kd.item(None, kd.OBJECT))) -> invalid for kd.all
kd.is_list(kd.slice([None], kd.OBJECT)) -> kd.present
kd.all(kd.has_list(kd.slice([None], kd.OBJECT))) -> kd.missing
Args:
x: DataSlice to check.
Returns:
A MASK DataItem.
kd.lists.like(shape_and_mask_from, /, items=None, *, item_schema=None, schema=None, itemid=None)
Aliases:
Creates new Koda lists with shape and sparsity of `shape_and_mask_from`.
Returns immutable lists.
Args:
shape_and_mask_from: a DataSlice with the shape and sparsity for the
desired lists.
items: optional items to assign to the newly created lists. If not
given, the function returns empty lists.
item_schema: the schema of the list items. If not specified, it will be
deduced from `items` or defaulted to OBJECT.
schema: The schema to use for the list. If specified, then item_schema must
not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting lists.
Returns:
A DataSlice with the lists.
kd.lists.list_append_update(x, append)
Aliases:
Returns a DataBag containing an update to a DataSlice of lists.
The updated lists are the lists in `x` with the specified items appended at
the end.
`x` and `append` must have compatible shapes.
The resulting lists maintain the same ItemIds. Also see kd.appended_list()
which works similarly but resulting lists have new ItemIds.
Args:
x: DataSlice of lists.
append: DataSlice of values to append to each list in `x`.
Returns:
A new immutable DataBag containing the list with the appended items.
kd.lists.select_items(ds, fltr)
Aliases:
Selects List items by filtering out missing items in fltr.
Also see kd.select.
Args:
ds: List DataSlice to be filtered
fltr: filter can be a DataSlice with dtype as kd.MASK. It can also be a Koda
Functor or a Python function which can be evaluated to such a DataSlice. A
Python function will be traced for evaluation, so it cannot have Python
control flow operations such as `if` or `while`.
Returns:
Filtered DataSlice.
kd.lists.shaped(shape, /, items=None, *, item_schema=None, schema=None, itemid=None)
Aliases:
Creates new Koda lists with the given shape.
Returns immutable lists.
Args:
shape: the desired shape.
items: optional items to assign to the newly created lists. If not
given, the function returns empty lists.
item_schema: the schema of the list items. If not specified, it will be
deduced from `items` or defaulted to OBJECT.
schema: The schema to use for the list. If specified, then item_schema must
not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting lists.
Returns:
A DataSlice with the lists.
kd.lists.shaped_as(shape_from, /, items=None, *, item_schema=None, schema=None, itemid=None)
Aliases:
Creates new Koda lists with shape of the given DataSlice.
Returns immutable lists.
Args:
shape_from: mandatory DataSlice, whose shape the returned DataSlice will
have.
items: optional items to assign to the newly created lists. If not given,
the function returns empty lists.
item_schema: the schema of the list items. If not specified, it will be
deduced from `items` or defaulted to OBJECT.
schema: The schema to use for the list. If specified, then item_schema must
not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting lists.
Returns:
A DataSlice with the lists.
kd.lists.size(list_slice)
Aliases:
Returns the size of each List in `list_slice`.
kd.lists.with_list_append_update(x, append)
Aliases:
Returns a DataSlice with a new DataBag containing updated appended lists.
The updated lists are the lists in `x` with the specified items appended at
the end.
`x` and `append` must have compatible shapes.
The resulting lists maintain the same ItemIds. Also see kd.appended_list()
which works similarly but resulting lists have new ItemIds.
Args:
x: DataSlice of lists.
append: DataSlice of values to append to each list in `x`.
Returns:
A DataSlice of lists in a new immutable DataBag.
Masking operators.
Operators
kd.masking.agg_all(x, ndim=unspecified)
Aliases:
Returns present if all elements are present along the last ndim dimensions.
`x` must have MASK dtype.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Args:
x: A DataSlice.
ndim: The number of dimensions to aggregate over. Requires 0 <= ndim
<= get_ndim(x).
kd.masking.agg_any(x, ndim=unspecified)
Aliases:
Returns present if any element is present along the last ndim dimensions.
`x` must have MASK dtype.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Args:
x: A DataSlice.
ndim: The number of dimensions to aggregate over. Requires 0 <= ndim
<= get_ndim(x).
kd.masking.agg_has(x, ndim=unspecified)
Aliases:
Returns present iff any element is present along the last ndim dimensions.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
It is equivalent to `kd.agg_any(kd.has(x))`.
Args:
x: A DataSlice.
ndim: The number of dimensions to aggregate over. Requires 0 <= ndim
<= get_ndim(x).
kd.masking.all(x)
Aliases:
Returns present iff all elements are present over all dimensions.
`x` must have MASK dtype.
The result is a zero-dimensional DataItem.
Args:
x: A DataSlice.
kd.masking.any(x)
Aliases:
Returns present iff any element is present over all dimensions.
`x` must have MASK dtype.
The result is a zero-dimensional DataItem.
Args:
x: A DataSlice.
kd.masking.apply_mask(x, y)
Aliases:
Filters `x` to items where `y` is present.
Pointwise masking operator that replaces items in DataSlice `x` by None
if corresponding items in DataSlice `y` of MASK dtype are `kd.missing`.
Args:
x: DataSlice.
y: DataSlice.
Returns:
Masked DataSlice.
kd.masking.coalesce(x, y)
Aliases:
Fills in missing values of `x` with values of `y`.
Pointwise masking operator that replaces missing items (i.e. None) in
DataSlice `x` by corresponding items in DataSlice `y`.
`x` and `y` do not need to have the same type.
Args:
x: DataSlice.
y: DataSlice used to fill missing items in `x`.
Returns:
Coalesced DataSlice.
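A minimal sketch:
```
x = kd.slice([1, None, 3])
y = kd.slice([10, 20, 30])
kd.coalesce(x, y)  # -> kd.slice([1, 20, 3]); same as x | y
```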
kd.masking.cond(condition, yes, no=DataItem(None, schema: NONE))
Aliases:
Returns `yes` where `condition` is present, otherwise `no`.
Pointwise operator that selects items in `yes` if the corresponding items of
`condition` are `kd.present`, or items in `no` otherwise. `condition` must
have MASK dtype.
If `no` is unspecified corresponding items in result are missing.
Note that there is _no_ short-circuiting based on the `condition` - both `yes`
and `no` branches will be evaluated irrespective of its value. See `kd.if_`
for a short-circuiting version of this operator.
Args:
condition: DataSlice.
yes: DataSlice.
no: DataSlice or unspecified.
Returns:
DataSlice of items from `yes` and `no` based on `condition`.
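A minimal sketch:
```
x = kd.slice([1, None, 3])
kd.cond(kd.has(x), x + 1, 0)  # -> kd.slice([2, 0, 4])
```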
kd.masking.disjoint_coalesce(x, y)
Aliases:
Fills in missing values of `x` with values of `y`.
Raises if `x` and `y` intersect. It is equivalent to `x | y` with additional
assertion that `x` and `y` are disjoint.
Args:
x: DataSlice.
y: DataSlice used to fill missing items in `x`.
Returns:
Coalesced DataSlice.
kd.masking.has(x)
Aliases:
Returns presence of `x`.
Pointwise operator which takes a DataSlice and returns a MASK indicating the
presence of each item in `x`. Returns `kd.present` for present items and
`kd.missing` for missing items.
Args:
x: DataSlice.
Returns:
DataSlice representing the presence of `x`.
kd.masking.has_not(x)
Aliases:
Returns present iff `x` is missing element-wise.
Pointwise operator which takes a DataSlice and returns a MASK indicating
whether `x` is missing element-wise. Returns `kd.present` for missing
items and `kd.missing` for present items.
Args:
x: DataSlice.
Returns:
DataSlice representing the non-presence of `x`.
kd.masking.mask_and(x, y)
Aliases:
Applies pointwise MASK_AND operation on `x` and `y`.
Both `x` and `y` must have MASK dtype. MASK_AND operation is defined as:
kd.mask_and(kd.present, kd.present) -> kd.present
kd.mask_and(kd.present, kd.missing) -> kd.missing
kd.mask_and(kd.missing, kd.present) -> kd.missing
kd.mask_and(kd.missing, kd.missing) -> kd.missing
It is equivalent to `x & y`.
Args:
x: DataSlice.
y: DataSlice.
Returns:
DataSlice.
kd.masking.mask_equal(x, y)
Aliases:
Applies pointwise MASK_EQUAL operation on `x` and `y`.
Both `x` and `y` must have MASK dtype. MASK_EQUAL operation is defined as:
kd.mask_equal(kd.present, kd.present) -> kd.present
kd.mask_equal(kd.present, kd.missing) -> kd.missing
kd.mask_equal(kd.missing, kd.present) -> kd.missing
kd.mask_equal(kd.missing, kd.missing) -> kd.present
Note that this is different from `x == y`. For example,
kd.missing == kd.missing -> kd.missing
Args:
x: DataSlice.
y: DataSlice.
Returns:
DataSlice.
kd.masking.mask_not_equal(x, y)
Aliases:
Applies pointwise MASK_NOT_EQUAL operation on `x` and `y`.
Both `x` and `y` must have MASK dtype. MASK_NOT_EQUAL operation is defined as:
kd.mask_not_equal(kd.present, kd.present) -> kd.missing
kd.mask_not_equal(kd.present, kd.missing) -> kd.present
kd.mask_not_equal(kd.missing, kd.present) -> kd.present
kd.mask_not_equal(kd.missing, kd.missing) -> kd.missing
Note that this is different from `x != y`. For example,
kd.present != kd.missing -> kd.missing
kd.missing != kd.present -> kd.missing
Args:
x: DataSlice.
y: DataSlice.
Returns:
DataSlice.
kd.masking.mask_or(x, y)
Aliases:
Applies pointwise MASK_OR operation on `x` and `y`.
Both `x` and `y` must have MASK dtype. MASK_OR operation is defined as:
kd.mask_or(kd.present, kd.present) -> kd.present
kd.mask_or(kd.present, kd.missing) -> kd.present
kd.mask_or(kd.missing, kd.present) -> kd.present
kd.mask_or(kd.missing, kd.missing) -> kd.missing
It is equivalent to `x | y`.
Args:
x: DataSlice.
y: DataSlice.
Returns:
DataSlice.
kd.masking.present_like(x)
Aliases:
Creates a DataSlice of present masks with the shape and sparsity of `x`.
Example:
x = kd.slice([[0], [0, None]])
kd.present_like(x) -> kd.slice([[present], [present, None]])
Args:
x: DataSlice to match the shape and sparsity of.
Returns:
A DataSlice with the same shape and sparsity as `x`.
kd.masking.present_shaped(shape)
Aliases:
Creates a DataSlice of present masks with the given shape.
Example:
shape = kd.shapes.new([2], [1, 2])
kd.masking.present_shaped(shape) -> kd.slice([[present], [present,
present]])
Args:
shape: shape to expand to.
Returns:
A DataSlice with the same shape as `shape`.
kd.masking.present_shaped_as(x)
Aliases:
Creates a DataSlice of present masks with the shape of `x`.
Example:
x = kd.slice([[0], [0, 0]])
kd.masking.present_shaped_as(x) -> kd.slice([[present], [present, present]])
Args:
x: DataSlice to match the shape of.
Returns:
A DataSlice with the same shape as `x`.
kd.masking.xor(x, y)
Aliases:
Applies pointwise XOR operation on `x` and `y`.
Both `x` and `y` must have MASK dtype. XOR operation is defined as:
kd.xor(kd.present, kd.present) -> kd.missing
kd.xor(kd.present, kd.missing) -> kd.present
kd.xor(kd.missing, kd.present) -> kd.present
kd.xor(kd.missing, kd.missing) -> kd.missing
It is equivalent to `x ^ y`.
Args:
x: DataSlice.
y: DataSlice.
Returns:
DataSlice.
Arithmetic operators.
Operators
kd.math.abs(x)
Computes pointwise absolute value of the input.
kd.math.add(x, y)
Aliases:
Computes pointwise x + y.
kd.math.agg_inverse_cdf(x, cdf_arg, ndim=unspecified)
Returns the value with CDF (in [0, 1]) approximately equal to the input.
The value is computed along the last ndim dimensions.
The return value will have an offset of floor((cdf - 1e-6) * size()) in the
(ascendingly) sorted array.
Args:
x: a DataSlice of numbers.
cdf_arg: (float) CDF value.
ndim: The number of dimensions to compute inverse CDF over. Requires 0 <=
ndim <= get_ndim(x).
kd.math.agg_max(x, ndim=unspecified)
Aliases:
Returns the maximum of items along the last ndim dimensions.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Example:
ds = kd.slice([[2, None, 1], [3, 4], [None, None]])
kd.agg_max(ds) # -> kd.slice([2, 4, None])
kd.agg_max(ds, ndim=1) # -> kd.slice([2, 4, None])
kd.agg_max(ds, ndim=2) # -> kd.slice(4)
Args:
x: A DataSlice of numbers.
ndim: The number of dimensions to compute the maximum over. Requires 0 <= ndim
<= get_ndim(x).
kd.math.agg_mean(x, ndim=unspecified)
Returns the means along the last ndim dimensions.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Example:
ds = kd.slice([[1, None, None], [3, 4], [None, None]])
kd.agg_mean(ds) # -> kd.slice([1, 3.5, None])
kd.agg_mean(ds, ndim=1) # -> kd.slice([1, 3.5, None])
kd.agg_mean(ds, ndim=2) # -> kd.slice(2.6666666666666) # (1 + 3 + 4) / 3
Args:
x: A DataSlice of numbers.
ndim: The number of dimensions to compute the mean over. Requires 0 <= ndim
<= get_ndim(x).
kd.math.agg_median(x, ndim=unspecified)
Returns the medians along the last ndim dimensions.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Please note that for an even number of elements, the median is the next value
down from the middle, e.g. median([1, 2]) == 1.
That is made by design to fulfill the following property:
1. type of median(x) == type of elements of x;
2. median(x) ∈ x.
Args:
x: A DataSlice of numbers.
ndim: The number of dimensions to compute the median over. Requires 0 <= ndim
<= get_ndim(x).
kd.math.agg_min(x, ndim=unspecified)
Aliases:
Returns the minimum of items along the last ndim dimensions.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Example:
ds = kd.slice([[2, None, 1], [3, 4], [None, None]])
kd.agg_min(ds) # -> kd.slice([1, 3, None])
kd.agg_min(ds, ndim=1) # -> kd.slice([1, 3, None])
kd.agg_min(ds, ndim=2) # -> kd.slice(1)
Args:
x: A DataSlice of numbers.
ndim: The number of dimensions to compute the minimum over. Requires 0 <= ndim
<= get_ndim(x).
kd.math.agg_std(x, unbiased=True, ndim=unspecified)
Returns the standard deviation along the last ndim dimensions.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Example:
ds = kd.slice([10, 9, 11])
kd.agg_std(ds) # -> kd.slice(1.0)
kd.agg_std(ds, unbiased=False) # -> kd.slice(0.8164966)
Args:
x: A DataSlice of numbers.
unbiased: A boolean flag indicating whether to subtract 1 from the number
of elements in the denominator.
ndim: The number of dimensions to compute the standard deviation over. Requires 0 <= ndim
<= get_ndim(x).
kd.math.agg_sum(x, ndim=unspecified)
Aliases:
Returns the sums along the last ndim dimensions.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Example:
ds = kd.slice([[1, None, 1], [3, 4], [None, None]])
kd.agg_sum(ds) # -> kd.slice([2, 7, None])
kd.agg_sum(ds, ndim=1) # -> kd.slice([2, 7, None])
kd.agg_sum(ds, ndim=2) # -> kd.slice(9)
Args:
x: A DataSlice of numbers.
ndim: The number of dimensions to compute the sum over. Requires 0 <= ndim
<= get_ndim(x).
kd.math.agg_var(x, unbiased=True, ndim=unspecified)
Returns the variance along the last ndim dimensions.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Example:
ds = kd.slice([10, 9, 11])
kd.agg_var(ds) # -> kd.slice(1.0)
kd.agg_var(ds, unbiased=False) # -> kd.slice([0.6666667])
Args:
x: A DataSlice of numbers.
unbiased: A boolean flag indicating whether to subtract 1 from the number
of elements in the denominator.
ndim: The number of dimensions to compute the variance over. Requires 0 <= ndim
<= get_ndim(x).
kd.math.argmax(x, ndim=unspecified)
Aliases:
Returns indices of the maximum of items along the last ndim dimensions.
The resulting DataSlice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Returns the index of NaN in case there is a NaN present.
Example:
ds = kd.slice([[2, None, 1], [3, 4], [None, None], [2, NaN, 1]])
kd.argmax(ds) # -> kd.slice([0, 1, None, 1])
kd.argmax(ds, ndim=1) # -> kd.slice([0, 1, None, 1])
kd.argmax(ds, ndim=2) # -> kd.slice(8) # index of NaN
Args:
x: A DataSlice of numbers.
ndim: The number of dimensions to compute indices over. Requires 0 <= ndim
<= get_ndim(x).
kd.math.argmin(x, ndim=unspecified)
Aliases:
Returns indices of the minimum of items along the last ndim dimensions.
The resulting DataSlice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Returns the index of NaN in case there is a NaN present.
Example:
ds = kd.slice([[2, None, 1], [3, 4], [None, None], [2, NaN, 1]])
kd.argmin(ds) # -> kd.slice([2, 0, None, 1])
kd.argmin(ds, ndim=1) # -> kd.slice([2, 0, None, 1])
kd.argmin(ds, ndim=2) # -> kd.slice(8) # index of NaN
Args:
x: A DataSlice of numbers.
ndim: The number of dimensions to compute indices over. Requires 0 <= ndim
<= get_ndim(x).
kd.math.cdf(x, weights=unspecified, ndim=unspecified)
Returns the CDF of x in the last ndim dimensions of x element-wise.
The CDF is an array of floating-point values of the same shape as x and
weights, where each element represents which percentile the corresponding
element in x is situated at in its sorted group, i.e. the percentage of values
in the group that are smaller than or equal to it.
Args:
x: a DataSlice of numbers.
weights: if provided, will compute weighted CDF: each output value will
correspond to the weight percentage of values smaller than or equal to x.
ndim: The number of dimensions to compute CDF over.
kd.math.ceil(x)
Computes pointwise ceiling of the input, e.g.
rounding up: returns the smallest integer value that is not less than the
input.
kd.math.cum_max(x, ndim=unspecified)
Aliases:
Returns the cumulative max of items along the last ndim dimensions.
kd.math.cum_min(x, ndim=unspecified)
Returns the cumulative minimum of items along the last ndim dimensions.
kd.math.cum_sum(x, ndim=unspecified)
Returns the cumulative sum of items along the last ndim dimensions.
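For illustration, a minimal sketch of the three cumulative operators
(accumulation runs along the last dimension by default):
ds = kd.slice([[1, 3, 2], [4, 5]])
kd.math.cum_sum(ds)  # -> kd.slice([[1, 4, 6], [4, 9]])
kd.math.cum_max(ds)  # -> kd.slice([[1, 3, 3], [4, 5]])
kd.math.cum_min(ds)  # -> kd.slice([[1, 1, 1], [4, 4]])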
kd.math.divide(x, y)
Computes pointwise x / y.
kd.math.exp(x)
Computes pointwise exponential of the input.
kd.math.floor(x)
Computes pointwise floor of the input, e.g.
rounding down: returns the largest integer value that is not greater than the
input.
kd.math.floordiv(x, y)
Computes pointwise x // y.
kd.math.inverse_cdf(x, cdf_arg)
Returns the value with CDF (in [0, 1]) approximately equal to the input.
The return value is computed over all dimensions. It will have an offset of
floor((cdf - 1e-6) * size()) in the (ascendingly) sorted array.
Args:
x: a DataSlice of numbers.
cdf_arg: (float) CDF value.
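For illustration, a worked instance of the offset formula above:
x = kd.slice([30, 10, 20])
kd.math.inverse_cdf(x, 0.5)
# offset = floor((0.5 - 1e-6) * 3) = 1; sorted x = [10, 20, 30]
# -> kd.slice(20)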
kd.math.is_nan(x)
Aliases:
Returns pointwise `kd.present` if the input is NaN and `kd.missing` otherwise.
kd.math.log(x)
Computes pointwise natural logarithm of the input.
kd.math.log10(x)
Computes pointwise logarithm in base 10 of the input.
kd.math.max(x)
Aliases:
Returns the maximum of items over all dimensions.
The result is a zero-dimensional DataItem.
Args:
x: A DataSlice of numbers.
kd.math.maximum(x, y)
Aliases:
Computes pointwise max(x, y).
kd.math.mean(x)
Returns the mean of elements over all dimensions.
The result is a zero-dimensional DataItem.
Args:
x: A DataSlice of numbers.
kd.math.median(x)
Returns the median of elements over all dimensions.
The result is a zero-dimensional DataItem.
Please note that for an even number of elements, the median is the next value
down from the middle, e.g. median([1, 2]) == 1.
That is made by design to fulfill the following property:
1. type of median(x) == type of elements of x;
2. median(x) ∈ x.
Args:
x: A DataSlice of numbers.
kd.math.min(x)
Aliases:
Returns the minimum of items over all dimensions.
The result is a zero-dimensional DataItem.
Args:
x: A DataSlice of numbers.
kd.math.minimum(x, y)
Aliases:
Computes pointwise min(x, y).
kd.math.mod(x, y)
Computes pointwise x % y.
kd.math.multiply(x, y)
Computes pointwise x * y.
kd.math.neg(x)
Computes pointwise negation of the input, i.e. -x.
kd.math.pos(x)
Computes pointwise positive of the input, i.e. +x.
kd.math.pow(x, y)
Computes pointwise x ** y.
kd.math.round(x)
Computes pointwise rounding of the input.
Please note that this is NOT banker's rounding, unlike the Python built-in or
TensorFlow round(). If the first decimal is exactly 0.5, the result is
rounded to the number with a higher absolute value:
round(1.4) == 1.0
round(1.5) == 2.0
round(1.6) == 2.0
round(2.5) == 3.0 # not 2.0
round(-1.4) == -1.0
round(-1.5) == -2.0
round(-1.6) == -2.0
round(-2.5) == -3.0 # not -2.0
kd.math.sigmoid(x, half=0.0, slope=1.0)
Computes sigmoid of the input.
sigmoid(x) = 1 / (1 + exp(-slope * (x - half)))
Args:
x: A DataSlice of numbers.
half: A DataSlice of numbers.
slope: A DataSlice of numbers.
Returns:
sigmoid(x) computed with the formula above.
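For illustration, a worked instance of the formula above:
kd.math.sigmoid(kd.slice([0.0, 2.0]), half=2.0, slope=1.0)
# -> approx. kd.slice([0.1192, 0.5])
# since 1 / (1 + exp(2)) ≈ 0.1192 and 1 / (1 + exp(0)) = 0.5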
kd.math.sign(x)
Computes the sign of the input.
Args:
x: A DataSlice of numbers.
Returns:
A DataSlice with values in {-1, 0, 1} of the same shape and type as the input.
kd.math.softmax(x, beta=1.0, ndim=unspecified)
Returns the softmax of x along the last ndim dimensions.
The softmax represents Exp(x * beta) / Sum(Exp(x * beta)) over last ndim
dimensions of x.
Args:
x: An array of numbers.
beta: A floating point scalar number that controls the smoothness of the
softmax.
ndim: The number of last dimensions to compute softmax over.
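For illustration, a worked instance (equal inputs yield equal weights):
kd.math.softmax(kd.slice([1.0, 1.0]))  # -> kd.slice([0.5, 0.5])
# since exp(1) / (exp(1) + exp(1)) = 0.5 for each item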
kd.math.subtract(x, y)
Computes pointwise x - y.
kd.math.sum(x)
Aliases:
Returns the sum of elements over all dimensions.
The result is a zero-dimensional DataItem.
Args:
x: A DataSlice of numbers.
kd.math.t_distribution_inverse_cdf(x, degrees_of_freedom)
Student's t-distribution inverse CDF.
Args:
x: A DataSlice of numbers.
degrees_of_freedom: A DataSlice of numbers.
Returns:
t_distribution_inverse_cdf(x).
Operators that work solely with objects.
Operators
kd.objs.like(shape_and_mask_from, /, *, itemid=None, **attrs)
Aliases:
Creates Objects with shape and sparsity from shape_and_mask_from.
Returned DataSlice has OBJECT schema and is immutable.
Args:
shape_and_mask_from: DataSlice, whose shape and sparsity the returned
DataSlice will have.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting obj(s).
**attrs: attrs to set on the returned Objects.
Returns:
data_slice.DataSlice with the given attrs.
kd.objs.new(arg=unspecified, /, *, itemid=None, **attrs)
Aliases:
Creates new Objects with an implicit stored schema.
Returned DataSlice has OBJECT schema and is immutable.
Args:
arg: optional Python object to be converted to an Object.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting obj(s).
itemid will only be set when `arg` is not a primitive or a primitive
slice, and only if `arg` is present.
**attrs: attrs to set on the returned object.
Returns:
data_slice.DataSlice with the given attrs and kd.OBJECT schema.
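For illustration, a minimal sketch (the attribute values are arbitrary):
obj = kd.objs.new(x=1, y='a')
obj.get_schema()  # -> kd.OBJECT
obj.x             # -> 1, as a DataItem (resolved via the implicit schema)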
kd.objs.shaped(shape, /, *, itemid=None, **attrs)
Aliases:
Creates Objects with the given shape.
Returned DataSlice has OBJECT schema and is immutable.
Args:
shape: JaggedShape that the returned DataSlice will have.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting obj(s).
**attrs: attrs to set on the returned Objects.
Returns:
data_slice.DataSlice with the given attrs.
kd.objs.shaped_as(shape_from, /, *, itemid=None, **attrs)
Aliases:
Creates Objects with the shape of the given DataSlice.
Returned DataSlice has OBJECT schema and is immutable.
Args:
shape_from: DataSlice, whose shape the returned DataSlice will have.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting obj(s).
**attrs: attrs to set on the returned Objects.
Returns:
data_slice.DataSlice with the given attrs.
kd.objs.uu(seed=None, **attrs)
Aliases:
Creates object(s) whose ids are uuid(s) with the provided attributes.
Returned DataSlice has OBJECT schema and is immutable.
In order to create a different "Type" from the same arguments, use the
`seed` argument with the desired value, e.g.
kd.uuobj(seed='type_1', x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6]))
and
kd.uuobj(seed='type_2', x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6]))
have different ids.
Args:
seed: (str) Allows different uuobj(s) to have different ids when created
from the same inputs.
**attrs: key-value pairs of object attributes where values are DataSlices
or can be converted to DataSlices using kd.new / kd.obj.
Returns:
data_slice.DataSlice
Operator definition and registration tooling.
Operators
kd.optools.add_alias(name, alias)
No description
kd.optools.add_to_registry(name=None, *, aliases=(), unsafe_override=False, view=<class 'koladata.expr.view.KodaView'>, repr_fn=None)
Wrapper around Arolla's add_to_registry with Koda functionality.
Args:
name: Optional name of the operator. Otherwise, inferred from the op.
aliases: Optional aliases for the operator.
unsafe_override: Whether to override an existing operator.
view: Optional view to use for the operator. If None, the default arolla
ExprView will be used.
repr_fn: Optional repr function to use for the operator and its aliases. In
case of None, a default repr function will be used.
Returns:
Registered operator.
kd.optools.add_to_registry_as_overload(name=None, *, overload_condition_expr, unsafe_override=False)
Koda wrapper around Arolla's add_to_registry_as_overload.
Note that for e.g. `name = "foo.bar.baz"`, the wrapped operator will
be registered as an overload `"baz"` of the overloadable operator `"foo.bar"`.
Performs no additional Koda-specific registration.
Args:
name: Optional name of the operator. Otherwise, inferred from the op.
overload_condition_expr: Condition for the overload.
unsafe_override: Whether to override an existing operator.
Returns:
A decorator that registers an overload for the operator with the
corresponding name. Returns the original operator (unlike the arolla
equivalent).
kd.optools.add_to_registry_as_overloadable(name, *, unsafe_override=False, view=<class 'koladata.expr.view.KodaView'>, repr_fn=None, aux_policy='koladata_default_boxing')
Koda wrapper around Arolla's add_to_registry_as_overloadable.
Performs additional Koda-specific registration, such as setting the view and
repr function.
Args:
name: Name of the operator.
unsafe_override: Whether to override an existing operator.
view: Optional view to use for the operator.
repr_fn: Optional repr function to use for the operator and its aliases. In
case of None, a default repr function will be used.
aux_policy: Aux policy for the operator.
Returns:
An overloadable registered operator.
kd.optools.as_backend_operator(name, *, qtype_inference_expr=DATA_SLICE, qtype_constraints=(), deterministic=True, custom_boxing_fn_name_per_parameter=None)
Decorator for Koladata backend operators with a unified binding policy.
Args:
name: The name of the operator.
qtype_inference_expr: Expression that computes operator's output type.
Argument types can be referenced using `arolla.P.arg_name`.
qtype_constraints: List of `(predicate_expr, error_message)` pairs.
`predicate_expr` may refer to the argument QType using
`arolla.P.arg_name`. If a type constraint is not fulfilled, the
corresponding `error_message` is used. Placeholders, like `{arg_name}`,
get replaced with the actual type names during the error message
formatting.
deterministic: If set to False, a hidden parameter (with the name
`optools.UNIFIED_NON_DETERMINISTIC_PARAM_NAME`) is added to the end of the
signature. This parameter receives special handling by the binding policy
implementation.
custom_boxing_fn_name_per_parameter: A dictionary specifying a custom boxing
function per parameter (constants with the boxing functions look like:
`koladata.types.py_boxing.WITH_*`, e.g. `WITH_PY_FUNCTION_TO_PY_OBJECT`).
Returns:
A decorator that constructs a backend operator based on the provided Python
function signature.
kd.optools.as_lambda_operator(name, *, qtype_constraints=(), deterministic=None, custom_boxing_fn_name_per_parameter=None, suppress_unused_parameter_warning=False)
Decorator for Koladata lambda operators with a unified binding policy.
Args:
name: The name of the operator.
qtype_constraints: List of `(predicate_expr, error_message)` pairs.
`predicate_expr` may refer to the argument QType using
`arolla.P.arg_name`. If a type constraint is not fulfilled, the
corresponding `error_message` is used. Placeholders, like `{arg_name}`,
get replaced with the actual type names during the error message
formatting.
deterministic: If True, the resulting operator will be deterministic and may
only use deterministic operators. If False, the operator will be declared
non-deterministic. By default, the decorator attempts to detect the
operator's determinism.
custom_boxing_fn_name_per_parameter: A dictionary specifying a custom boxing
function per parameter (constants with the boxing functions look like:
`koladata.types.py_boxing.WITH_*`, e.g. `WITH_PY_FUNCTION_TO_PY_OBJECT`).
suppress_unused_parameter_warning: If True, unused parameters will not cause
a warning.
Returns:
A decorator that constructs a lambda operator by tracing a Python function.
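A schematic sketch of the typical definition pattern, combining this decorator
with `kd.optools.add_to_registry` above (the operator name `my.add_one` is
hypothetical):
@kd.optools.add_to_registry()
@kd.optools.as_lambda_operator('my.add_one')
def add_one(x):
  return x + 1  # traced into a lambda operator body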
kd.optools.as_py_function_operator(name, *, qtype_inference_expr=DATA_SLICE, qtype_constraints=(), codec=None, deterministic=True, custom_boxing_fn_name_per_parameter=None)
Returns a decorator for defining Koladata-specific py-function operators.
The decorated function should accept QValues as input and returns a single
QValue. Variadic positional and keyword arguments are passed as tuples and
dictionaries of QValues, respectively.
Importantly, it is recommended that the function on which the operator is
based be pure -- that is, deterministic and without side effects.
If the function is not pure, please specify deterministic=False.
Args:
name: The name of the operator.
qtype_inference_expr: expression that computes operator's output qtype; an
argument qtype can be referenced as P.arg_name.
qtype_constraints: QType constraints for the operator.
codec: A PyObject serialization codec for the wrapped function, compatible
with `arolla.types.encode_py_object`. The resulting operator is
serializable only if the codec is specified.
deterministic: Set this to `False` if the wrapped function is not pure
(i.e., non-deterministic or has side effects).
custom_boxing_fn_name_per_parameter: A dictionary specifying a custom boxing
function per parameter (constants with the boxing functions look like:
`koladata.types.py_boxing.WITH_*`, e.g. `WITH_PY_FUNCTION_TO_PY_OBJECT`).
kd.optools.as_qvalue(arg)
Converts Python values into QValues.
kd.optools.as_qvalue_or_expr(arg)
Converts Python values into QValues or Exprs.
kd.optools.equiv_to_op(this_op, that_op)
Returns true iff the impl of `this_op` equals the impl of `that_op`.
kd.optools.make_operators_container(*namespaces)
Returns an OperatorsContainer for the given namespaces.
Note that the returned container accesses the global namespace. A common
pattern is therefore:
foo = make_operators_container('foo', 'foo.bar', 'foo.baz').foo
Args:
*namespaces: Namespaces to make available in the returned container.
kd.optools.unified_non_deterministic_arg()
Returns a non-deterministic token for use with `bind_op(..., arg)`.
kd.optools.unified_non_deterministic_kwarg()
Returns a non-deterministic token for use with `bind_op(..., **kwarg)`.
Protocol buffer serialization operators.
Operators
kd.proto.from_proto_bytes(x, proto_path, /, *, extensions=unspecified, itemids=unspecified, schema=unspecified, on_invalid=unspecified)
Aliases:
Parses a DataSlice `x` of binary proto messages.
This is equivalent to parsing `x.to_py()` as a binary proto message in Python,
and then converting the parsed message to a DataSlice using `kd.from_proto`,
but bypasses Python, is traceable, supports any shape and sparsity, and
can handle parse errors.
`x` must be a DataSlice of BYTES. Missing elements of `x` will be missing in
the result.
`proto_path` must be a DataItem containing a STRING fully-qualified proto
message name, which will be used to look up the message descriptor in the C++
generated descriptor pool. For this to work, the C++ proto message needs to
be compiled into the binary that executes this operator, which is not the
same as the proto message being available in Python.
See kd.from_proto for a detailed explanation of the `extensions`, `itemids`,
and `schema` arguments.
If `on_invalid` is unset, this operator will throw an error if any input
fails to parse. If `on_invalid` is set, it must be broadcastable to `x`, and
will be used in place of the result wherever the input fails to parse.
Args:
x: DataSlice of BYTES
proto_path: DataItem containing STRING
extensions: 1D DataSlice of STRING
itemids: DataSlice of ITEMID with the same shape as `x` (optional)
schema: DataItem containing SCHEMA (optional)
on_invalid: DataSlice broadcastable to the result (optional)
Returns:
A DataSlice representing the proto data.
kd.proto.from_proto_json(x, proto_path, /, *, extensions=unspecified, itemids=unspecified, schema=unspecified, on_invalid=unspecified)
Aliases:
Parses a DataSlice `x` of proto JSON-format strings.
This is equivalent to parsing `x.to_py()` as a JSON-format proto message in
Python, and then converting the parsed message to a DataSlice using
`kd.from_proto`, but bypasses Python, is traceable, supports any shape and
sparsity, and can handle parse errors.
`x` must be a DataSlice of STRING. Missing elements of `x` will be missing in
the result.
`proto_path` must be a DataItem containing a STRING fully-qualified proto
message name, which will be used to look up the message descriptor in the C++
generated descriptor pool. For this to work, the C++ proto message needs to
be compiled into the binary that executes this operator, which is not the
same as the proto message being available in Python.
See kd.from_proto for a detailed explanation of the `extensions`, `itemids`,
and `schema` arguments.
If `on_invalid` is unset, this operator will throw an error if any input
fails to parse. If `on_invalid` is set, it must be broadcastable to `x`, and
will be used in place of the result wherever the input fails to parse.
Args:
x: DataSlice of STRING
proto_path: DataItem containing STRING
extensions: 1D DataSlice of STRING
itemids: DataSlice of ITEMID with the same shape as `x` (optional)
schema: DataItem containing SCHEMA (optional)
on_invalid: DataSlice broadcastable to the result (optional)
Returns:
A DataSlice representing the proto data.
kd.proto.schema_from_proto_path(proto_path, /, *, extensions=DataItem(Entity:#5ikYYvXepp19g47QDLnJR2, schema: ITEMID))
Aliases:
Returns a Koda schema representing a proto message class.
This is equivalent to `kd.schema_from_proto(message_cls)` if `message_cls` is
the Python proto class with full name `proto_path`, but bypasses Python and
is traceable.
`proto_path` must be a DataItem containing a STRING fully-qualified proto
message name, which will be used to look up the message descriptor in the C++
generated descriptor pool. For this to work, the C++ proto message needs to
be compiled into the binary that executes this operator, which is not the
same as the proto message being available in Python.
See `kd.schema_from_proto` for a detailed explanation of the `extensions`
argument.
Args:
proto_path: DataItem containing STRING
extensions: 1D DataSlice of STRING
kd.proto.to_proto_bytes(x, proto_path, /)
Aliases:
Serializes a DataSlice `x` as binary proto messages.
This is equivalent to using `kd.to_proto` to serialize `x` as a proto message
in Python, then serializing that message into a binary proto, but bypasses
Python, is traceable, and supports any shape and sparsity.
`x` must be serializable as the proto message with full name `proto_path`.
Missing elements of `x` will be missing in the result.
`proto_path` must be a DataItem containing a STRING fully-qualified proto
message name, which will be used to look up the message descriptor in the C++
generated descriptor pool. For this to work, the C++ proto message needs to
be compiled into the binary that executes this operator, which is not the
same as the proto message being available in Python.
Args:
x: DataSlice
proto_path: DataItem containing STRING
Returns:
A DataSlice of BYTES with the same shape and sparsity as `x`.
kd.proto.to_proto_json(x, proto_path, /)
Aliases:
Serializes a DataSlice `x` as JSON-format proto messages.
This is equivalent to using `kd.to_proto` to serialize `x` as a proto message
in Python, then serializing that message into a JSON-format proto, but
bypasses Python, is traceable, and supports any shape and sparsity.
`x` must be serializable as the proto message with full name `proto_path`.
Missing elements of `x` will be missing in the result.
`proto_path` must be a DataItem containing a STRING fully-qualified proto
message name, which will be used to look up the message descriptor in the C++
generated descriptor pool. For this to work, the C++ proto message needs to
be compiled into the binary that executes this operator, which is not the
same as the proto message being available in Python.
Args:
x: DataSlice
proto_path: DataItem containing STRING
Returns:
A DataSlice of STRING with the same shape and sparsity as `x`.
Operators for parallel computation.
Operators
kd.parallel.call_multithreaded(fn, /, *args, return_type_as=<class 'koladata.types.data_slice.DataSlice'>, max_threads=None, timeout=None, **kwargs)
Calls a functor with the given arguments.
Variables of the functor or of its sub-functors will be computed in parallel
when they don't depend on each other. If the internal computation involves
iterables, the corresponding computations will be done in a streaming fashion.
Note that you should not use this function inside another functor (via py_fn),
as it will block the thread executing it, which can lead to deadlock when we
don't have enough threads in the thread pool. Instead, please compose all
functors first into one functor and then use one call to call_multithreaded to
execute them all in parallel.
Args:
fn: The functor to call.
*args: The positional arguments to pass to the functor.
return_type_as: The return type of the call is expected to be the same as
the return type of this expression. In most cases, this will be a literal
of the corresponding type. This needs to be specified if the functor does
not return a DataSlice. kd.types.DataSlice, kd.types.DataBag and
kd.types.JaggedShape can also be passed here.
max_threads: The maximum number of threads to use. None means to use the
default executor.
timeout: The maximum time to wait for the call to finish. None means to wait
indefinitely.
**kwargs: The keyword arguments to pass to the functor.
Returns:
The result of the call. Iterables and tuples/namedtuples of iterables are
not yet supported for the result, since that would mean that the result
is/has a stream, and this method needs to return multiple values at
different times instead of one value at the end.
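For illustration, a minimal sketch, assuming `kd.fn` (from the functor
category) to build the functor:
fn = kd.fn(lambda x, y: x + y)
kd.parallel.call_multithreaded(fn, kd.slice([1, 2]), y=kd.slice([3, 4]))
# -> kd.slice([4, 6])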
kd.parallel.transform(fn, *, allow_runtime_transforms=False)
Transforms a functor to run in parallel.
The resulting functor will take and return parallel versions of the arguments
and return values of `fn`. Currently there is no public API to create a
parallel version (DataSlice -> future[DataSlice]); this is work in progress.
Args:
fn: The functor to transform.
allow_runtime_transforms: Whether to allow sub-functors to be not literals,
but computed expressions, which will therefore have to be transformed at
runtime. This can be slow.
Returns:
The transformed functor.
kd.parallel.yield_multithreaded(fn, /, *args, value_type_as=<class 'koladata.types.data_slice.DataSlice'>, max_threads=None, timeout=None, **kwargs)
Calls a functor returning an iterable, and yields the results as they go.
Variables of the functor or of its sub-functors will be computed in parallel
when they don't depend on each other. If the internal computation involves
iterables, the corresponding computations will be done in a streaming fashion.
The functor must return an iterable.
Note that you should not use this function inside another functor (via py_fn),
as it will block the thread executing it, which can lead to deadlock when we
don't have enough threads in the thread pool. Instead, please compose all
functors first into one functor and then use
one call to call_multithreaded/yield_multithreaded to execute them all in
parallel.
Args:
fn: The functor to call.
*args: The positional arguments to pass to the functor.
value_type_as: The return type of the call is expected to be an iterable of
the return type of this expression. In most cases, this will be a literal
of the corresponding type. This needs to be specified if the functor does
not return an iterable of DataSlice. kd.types.DataSlice, kd.types.DataBag
and kd.types.JaggedShape can also be passed here.
max_threads: The maximum number of threads to use. None means to use the
default executor.
timeout: The maximum time to wait for the computation of all items of the
output iterable to finish. None means to wait indefinitely.
**kwargs: The keyword arguments to pass to the functor.
Returns:
Yields the items of the output iterable as soon as they are available.
Operators that call Python functions.
Operators
kd.py.apply_py(fn, *args, return_type_as=unspecified, **kwargs)
Aliases:
Applies Python function `fn` on args.
It is equivalent to fn(*args, **kwargs).
Args:
fn: function to apply to `*args` and `**kwargs`. It is required that this
function returns a DataSlice/DataItem or a primitive that will be
automatically wrapped into a DataItem.
*args: positional arguments to pass to `fn`.
return_type_as: The return type of the function is expected to be the same
as the return type of this expression. In most cases, this will be a
literal of the corresponding type. This needs to be specified if the
function does not return a DataSlice/DataItem or a primitive that would be
auto-boxed into a DataItem. kd.types.DataSlice, kd.types.DataBag and
kd.types.JaggedShape can also be passed here.
**kwargs: keyword arguments to pass to `fn`.
Returns:
Result of fn applied on the arguments.
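For illustration, a minimal sketch:
kd.apply_py(lambda x, y: x + y, kd.slice([1, 2]), y=kd.slice([3, 4]))
# -> kd.slice([4, 6])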
kd.py.apply_py_on_cond(yes_fn, no_fn, cond, *args, **kwargs)
Aliases:
Applies Python functions on args filtered with `cond` and `~cond`.
It is equivalent to
yes_fn(
*( x & cond for x in args ),
**{ k: (v & cond) for k, v in kwargs.items() },
) | no_fn(
*( x & ~cond for x in args ),
**{ k: (v & ~cond) for k, v in kwargs.items() },
)
Args:
yes_fn: function to apply on filtered args.
no_fn: function to apply on inverse filtered args (this parameter can be
None).
cond: filter dataslice.
*args: arguments to filter and then pass to yes_fn and no_fn.
**kwargs: keyword arguments to filter and then pass to yes_fn and no_fn.
Returns:
The union of results of yes_fn and no_fn applied on filtered args.
kd.py.apply_py_on_selected(fn, cond, *args, **kwargs)
Aliases:
Applies Python function `fn` on args filtered with cond.
It is equivalent to
fn(
*( x & cond for x in args ),
**{ k: (v & cond) for k, v in kwargs.items() },
)
Args:
fn: function to apply on filtered args.
cond: filter dataslice.
*args: arguments to filter and then pass to fn.
**kwargs: keyword arguments to filter and then pass to fn.
Returns:
Result of fn applied on filtered args.
kd.py.map_py(fn, *args, schema=DataItem(None, schema: NONE), max_threads=1, ndim=0, include_missing=DataItem(None, schema: NONE), item_completed_callback=DataItem(None, schema: NONE), **kwargs)
Aliases:
Apply the python function `fn` on provided `args` and `kwargs`.
Example:
def my_fn(x, y):
if x is None or y is None:
return None
return x * y
kd.map_py(my_fn, slice_1, slice_2)
# Via keyword
kd.map_py(my_fn, x=slice_1, y=slice_2)
All DataSlices in `args` and `kwargs` must have compatible shapes.
Lambdas also work for object inputs/outputs.
In this case, objects are wrapped as DataSlices.
For example:
def my_fn_object_inputs(x):
return x.y + x.z
def my_fn_object_outputs(x):
return db.obj(x=1, y=2) if x.z > 3 else db.obj(x=2, y=1)
The `ndim` argument controls how many dimensions should be passed to `fn` in
each call. If `ndim = 0` then `0`-dimensional values will be passed, if
`ndim = 1` then python `list`s will be passed, if `ndim = 2` then lists of
python `list`s will be passed and so on.
`0`-dimensional (non-`list`) values passed to `fn` are either python
primitives (`float`, `int`, `str`, etc.) or single-valued `DataSlices`
containing `ItemId`s in the non-primitive case.
In this way, `ndim` can be used for aggregation.
For example:
def my_agg_count(x):
return len([i for i in x if i is not None])
kd.map_py(my_agg_count, data_slice, ndim=1)
`fn` may return any objects that kd.from_py can handle, in other words
primitives, lists, dicts and dataslices. They will be converted to
the corresponding Koda data structures.
For example:
def my_expansion(x):
return [[y, y] for y in x]
res = kd.map_py(my_expansion, data_slice, ndim=1)
# Each item of res is a list of lists, so we can get a slice with
# the inner items like this:
print(res[:][:])
It's also possible to set custom serialization for the fn (i.e. if you want to
serialize the expression and later deserialize it in a different process). For
example, to serialize the function using cloudpickle you can use
`kd_ext.py_cloudpickle(fn)` instead of fn.
Args:
fn: Function.
*args: Input DataSlices.
schema: The schema to use for resulting DataSlice.
max_threads: maximum number of threads to use.
ndim: Dimensionality of items to pass to `fn`.
include_missing: Specifies whether `fn` applies to all items (`=True`) or
only to items present in all `args` and `kwargs` (`=False`, valid only
when `ndim=0`); defaults to `False` when `ndim=0`.
item_completed_callback: A callback that will be called after each item is
processed. It will be called in the original thread that called `map_py`
in case `max_threads` is greater than 1, as we rely on this property for
cases like progress reporting. As such, it can not be attached to the `fn`
itself.
**kwargs: Input DataSlices.
Returns:
Result DataSlice.
kd.py.map_py_on_cond(true_fn, false_fn, cond, *args, schema=DataItem(None, schema: NONE), max_threads=1, item_completed_callback=DataItem(None, schema: NONE), **kwargs)
Aliases:
Apply python functions on `args` and `kwargs` based on `cond`.
`cond`, `args` and `kwargs` are first aligned. `cond` cannot have higher
dimensions than `args` or `kwargs`.
Also see kd.map_py().
This function supports only pointwise, not aggregational, operations.
`true_fn` is applied when `cond` is kd.present. Otherwise, `false_fn` is
applied.
Args:
true_fn: Function.
false_fn: Function.
cond: Conditional DataSlice.
*args: Input DataSlices.
schema: The schema to use for resulting DataSlice.
max_threads: maximum number of threads to use.
item_completed_callback: A callback that will be called after each item is
processed. It will be called in the original thread that called
`map_py_on_cond` in case `max_threads` is greater than 1, as we rely on
this property for cases like progress reporting. As such, it can not be
attached to the `true_fn` and `false_fn` themselves.
**kwargs: Input DataSlices.
Returns:
Result DataSlice.
kd.py.map_py_on_selected(fn, cond, *args, schema=DataItem(None, schema: NONE), max_threads=1, item_completed_callback=DataItem(None, schema: NONE), **kwargs)
Aliases:
Apply python function `fn` on `args` and `kwargs` based on `cond`.
`cond`, `args` and `kwargs` are first aligned. `cond` cannot have higher
dimensions than `args` or `kwargs`.
Also see kd.map_py().
This function supports only pointwise, not aggregational, operations. `fn` is
applied when `cond` is kd.present.
Args:
fn: Function.
cond: Conditional DataSlice.
*args: Input DataSlices.
schema: The schema to use for resulting DataSlice.
max_threads: maximum number of threads to use.
item_completed_callback: A callback that will be called after each item is
processed. It will be called in the original thread that called
`map_py_on_selected` in case `max_threads` is greater than 1, as we rely
on this property for cases like progress reporting. As such, it can not be
attached to the `fn` itself.
**kwargs: Input DataSlices.
Returns:
Result DataSlice.
Random and sampling operators.
Operators
kd.random.cityhash(x, seed)
Aliases:
Returns a hash value of 'x' for given seed.
The hash value is generated using CityHash library. The result will have the
same shape and sparsity as `x`. The output values are INT64.
Args:
x: DataSlice for hash.
seed: seed for hash, must be a scalar.
Returns:
The hash values as INT64 DataSlice.
kd.random.mask(x, ratio, seed, key=unspecified)
Returns a mask with near size(x) * ratio present values at random indices.
The sampling of indices is performed on the flattened `x` rather than on the last
dimension.
The sampling is stable given the same inputs. Optional `key` can be used to
provide additional stability. That is, `key` is used for sampling if set and
items corresponding to empty keys are never sampled. Otherwise, the indices of
`x` are used.
Note that the sampling is performed as follows:
hash(key, seed) < ratio * 2^63
Therefore, the exact sampled count is not guaranteed. E.g. the result of
sampling an array of 1000 items with a 0.1 ratio has close to 100 present items
(e.g. 98) rather than exactly 100. However, this provides per-item stability:
the sampling result for an item is deterministic given the same key, regardless
of which other keys are provided.
Examples:
# Select 50% from last dimension.
ds = kd.slice([[1, 2, None, 4], [5, None, None, 8]])
kd.random.mask(ds, 0.5, 123)
-> kd.slice([
[None, None, kd.present, None],
[kd.present, None, None, kd.present]
])
# Use 'key' for stability
ds_1 = kd.slice([[1, 2, None, 4], [5, None, None, 8]])
key_1 = kd.slice([['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'd']])
kd.random.mask(ds_1, 0.5, 123, key_1)
-> kd.slice([
[None, None, None, kd.present],
[None, None, None, kd.present],
])
ds_2 = kd.slice([[4, 3, 2, 1], [5, 6, 7, 8]])
key_2 = kd.slice([['c', 'd', 'b', 'a'], ['a', 'b', 'c', 'd']])
kd.random.mask(ds_2, 0.5, 123, key_2)
-> kd.slice([
[None, kd.present, None, None],
[None, None, None, kd.present],
])
Args:
x: DataSlice whose shape is used for sampling.
ratio: float number in [0, 1].
seed: seed for random sampling.
key: keys used to generate random numbers. The same key generates the same
random number.
kd.random.randint_like(x, low=unspecified, high=unspecified, seed=unspecified)
Aliases:
Returns a DataSlice of random INT64 numbers with the same sparsity as `x`.
When `seed` is not specified, the results are different across multiple
invocations given the same input.
Args:
x: used to determine the shape and sparsity of the resulting DataSlice.
low: Lowest (signed) integers to be drawn (unless high=None, in which case
this parameter is 0 and this value is used for high), inclusive.
high: If provided, the largest integer to be drawn (see above behavior if
high=None), exclusive.
seed: Seed for the random number generator. The same input with the same
seed generates the same random numbers.
Returns:
A DataSlice of random numbers.
kd.random.randint_shaped(shape, low=unspecified, high=unspecified, seed=unspecified)
Aliases:
Returns a DataSlice of random INT64 numbers with the given shape.
When `seed` is not specified, the results are different across multiple
invocations given the same input.
Args:
shape: used for the shape of the resulting DataSlice.
low: Lowest (signed) integers to be drawn (unless high=None, in which case
this parameter is 0 and this value is used for high), inclusive.
high: If provided, the largest integer to be drawn (see above behavior if
high=None), exclusive.
seed: Seed for the random number generator. The same input with the same
seed generates the same random numbers.
Returns:
A DataSlice of random numbers.
kd.random.randint_shaped_as(x, low=unspecified, high=unspecified, seed=unspecified)
Aliases:
Returns a DataSlice of random INT64 numbers with the same shape as `x`.
When `seed` is not specified, the results are different across multiple
invocations given the same input.
Args:
x: used to determine the shape of the resulting DataSlice.
low: Lowest (signed) integers to be drawn (unless high=None, in which case
this parameter is 0 and this value is used for high), inclusive.
high: If provided, the largest integer to be drawn (see above behavior if
high=None), exclusive.
seed: Seed for the random number generator. The same input with the same
seed generates the same random numbers.
Returns:
A DataSlice of random numbers.
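For illustration, a minimal sketch of the randint family (the drawn values in
the comments are made up; only the shape and sparsity behavior is the point):
x = kd.slice([[1, None], [3]])
kd.randint_like(x, 0, 10, seed=42)       # e.g. kd.slice([[7, None], [2]])
kd.randint_shaped_as(x, 0, 10, seed=42)  # e.g. kd.slice([[7, 4], [2]])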
kd.random.sample(x, ratio, seed, key=unspecified)
Aliases:
Randomly sample items in `x` based on ratio.
The sampling is performed on the flattened `x` rather than on the last dimension.
All items including missing items in `x` are eligible for sampling.
The sampling is stable given the same inputs. Optional `key` can be used to
provide additional stability. That is, `key` is used for sampling if set and
items corresponding to empty keys are never sampled. Otherwise, the indices of
`x` are used.
Note that the sampling is performed as follows:
hash(key, seed) < ratio * 2^63
Therefore, the exact sampled count is not guaranteed. E.g. the result of
sampling an array of 1000 items with a 0.1 ratio has close to 100 present items
(e.g. 98) rather than exactly 100. However, this provides per-item stability:
the sampling result for an item is deterministic given the same key, regardless
of which other keys are provided.
Examples:
# Select 50% from last dimension.
ds = kd.slice([[1, 2, None, 4], [5, None, None, 8]])
kd.sample(ds, 0.5, 123) -> kd.slice([[None, 4], [None, 8]])
# Use 'key' for stability
ds_1 = kd.slice([[1, 2, None, 4], [5, None, None, 8]])
key_1 = kd.slice([['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'd']])
kd.sample(ds_1, 0.5, 123, key_1) -> kd.slice([[None, 2], [None, None]])
ds_2 = kd.slice([[4, 3, 2, 1], [5, 6, 7, 8]])
key_2 = kd.slice([['c', 'a', 'b', 'd'], ['a', 'b', 'c', 'd']])
kd.sample(ds_2, 0.5, 123, key_2) -> kd.slice([[4, 2], [6, 7]])
Args:
x: DataSlice to sample.
ratio: float number in [0, 1].
seed: seed for random sampling.
key: keys used to generate random numbers. The same key generates the same
random number.
Returns:
Sampled DataSlice.
kd.random.sample_n(x, n, seed, key=unspecified)
Aliases:
Randomly sample n items in `x` from the last dimension.
The sampling is performed over the last dimension rather than on the flattened `x`.
`n` can either be a scalar integer or a DataSlice. If it is a DataSlice, it
must have compatible shape with `x.get_shape()[:-1]`. All items including
missing items in `x` are eligible for sampling.
The sampling is stable given the same inputs. Optional `key` can be used to
provide additional stability. That is, `key` is used for sampling if set.
Otherwise, the indices of `x` are used.
Examples:
# Select 2 items from last dimension.
ds = kd.slice([[1, 2, None, 4], [5, None, None, 8]])
kd.sample_n(ds, 2, 123) -> kd.slice([[2, 4], [None, 8]])
# Select 1 item from the first and 2 items from the second.
ds = kd.slice([[1, 2, None, 4], [5, None, None, 8]])
kd.sample_n(ds, [1, 2], 123) -> kd.slice([[4], [None, 5]])
# Use 'key' for stability
ds_1 = kd.slice([[1, 2, None, 4], [5, None, None, 8]])
key_1 = kd.slice([['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'd']])
kd.sample_n(ds_1, 2, 123, key_1) -> kd.slice([[None, 2], [None, None]])
ds_2 = kd.slice([[4, 3, 2, 1], [5, 6, 7, 8]])
key_2 = kd.slice([['c', 'a', 'b', 'd'], ['a', 'b', 'c', 'd']])
kd.sample_n(ds_2, 2, 123, key_2) -> kd.slice([[4, 2], [6, 7]])
Args:
x: DataSlice to sample.
n: number of items to sample. Either an integer or a DataSlice.
seed: seed for random sampling.
key: keys used to generate random numbers. The same key generates the same
random number.
Returns:
Sampled DataSlice.
kd.random.shuffle(x, /, ndim=unspecified, seed=unspecified)
Aliases:
Randomly shuffles a DataSlice along a single dimension (last by default).
If `ndim` is not specified, items are shuffled in the last dimension.
If `ndim` is specified, then the dimension `ndim` from the last is shuffled,
equivalent to `kd.explode(kd.shuffle(kd.implode(x, ndim)), ndim)`.
When `seed` is not specified, the results are different across multiple
invocations given the same input.
For example:
kd.shuffle(kd.slice([[1, 2, 3], [4, 5], [6]]))
-> kd.slice([[3, 1, 2], [5, 4], [6]]) (possible output)
kd.shuffle(kd.slice([[1, 2, 3], [4, 5], [6]]), ndim=1)
-> kd.slice([[4, 5], [6], [1, 2, 3]]) (possible output)
Args:
x: DataSlice to shuffle.
ndim: The index of the dimension to shuffle, from the end (0 = last dim).
The last dimension is shuffled if this is unspecified.
seed: Seed for the random number generator. The same input with the same
seed generates the same random numbers.
Returns:
Shuffled DataSlice.
Schema-related operators.
Operators
kd.schema.agg_common_schema(x, ndim=unspecified)
Returns the common schema of `x` along the last `ndim` dimensions.
The "common schema" is defined according to go/koda-type-promotion.
Examples:
kd.agg_common_schema(kd.slice([kd.INT32, None, kd.FLOAT32]))
# -> kd.FLOAT32
kd.agg_common_schema(kd.slice([[kd.INT32, None], [kd.FLOAT32, kd.FLOAT64]]))
# -> kd.slice([kd.INT32, kd.FLOAT64])
kd.agg_common_schema(
kd.slice([[kd.INT32, None], [kd.FLOAT32, kd.FLOAT64]]), ndim=2)
# -> kd.FLOAT64
Args:
x: DataSlice of schemas.
ndim: The number of last dimensions to aggregate over.
kd.schema.cast_to(x, schema)
Aliases:
Returns `x` casted to the provided `schema` using explicit casting rules.
Dispatches to the relevant `kd.to_...` operator. Performs permissive casting,
e.g. allowing FLOAT32 -> INT32 casting through `kd.cast_to(slice, INT32)`.
Args:
x: DataSlice to cast.
schema: Schema to cast to. Must be a scalar.
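For illustration, a minimal sketch:
kd.cast_to(kd.slice([1, 2, 3]), kd.FLOAT32)  # -> kd.slice([1.0, 2.0, 3.0])
kd.cast_to(kd.slice([1.5, 2.5]), kd.INT32)   # allowed: explicit casting is permissive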
kd.schema.cast_to_implicit(x, schema)
Returns `x` casted to the provided `schema` using implicit casting rules.
Note that `schema` must be the common schema of `schema` and `x.get_schema()`
according to go/koda-type-promotion.
Args:
x: DataSlice to cast.
schema: Schema to cast to. Must be a scalar.
kd.schema.cast_to_narrow(x, schema)
Returns `x` casted to the provided `schema`.
Allows for schema narrowing, where OBJECT types can be casted to primitive
schemas as long as the data is implicitly castable to the schema. Follows the
casting rules of `kd.cast_to_implicit` for the narrowed schema.
Args:
x: DataSlice to cast.
schema: Schema to cast to. Must be a scalar.
kd.schema.common_schema(x)
Returns the common schema of `x` as a scalar DataItem.
The "common schema" is defined according to go/koda-type-promotion.
Args:
x: DataSlice of schemas.
kd.schema.dict_schema(key_schema, value_schema)
Aliases:
Returns a Dict schema with the provided `key_schema` and `value_schema`.
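For illustration, a minimal sketch (see `kd.schema.get_key_schema` and
`kd.schema.get_value_schema` below):
s = kd.dict_schema(kd.STRING, kd.INT32)
kd.schema.get_key_schema(s)    # -> kd.STRING
kd.schema.get_value_schema(s)  # -> kd.INT32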
kd.schema.get_dtype(ds)
Aliases:
Returns a primitive schema representing the underlying items' dtype.
If `ds` has a primitive schema, this returns that primitive schema, even if
all items in `ds` are missing. If `ds` has an OBJECT schema but contains
primitive values of a single dtype, it returns the schema for that primitive
dtype.
If items in `ds` have non-primitive types or mixed dtypes, this returns
a missing schema (i.e. `kd.item(None, kd.SCHEMA)`).
Examples:
kd.get_primitive_schema(kd.slice([1, 2, 3])) -> kd.INT32
kd.get_primitive_schema(kd.slice([None, None, None], kd.INT32)) -> kd.INT32
kd.get_primitive_schema(kd.slice([1, 2, 3], kd.OBJECT)) -> kd.INT32
kd.get_primitive_schema(kd.slice([1, 'a', 3], kd.OBJECT)) -> missing schema
kd.get_primitive_schema(kd.obj()) -> missing schema
Args:
ds: DataSlice to get dtype from.
Returns:
a primitive schema DataSlice.
kd.schema.get_item_schema(list_schema)
Aliases:
Returns the item schema of a List schema.
kd.schema.get_itemid(x)
Aliases:
Casts `x` to ITEMID using explicit (permissive) casting rules.
kd.schema.get_key_schema(dict_schema)
Aliases:
Returns the key schema of a Dict schema.
kd.schema.get_nofollowed_schema(schema)
Aliases:
Returns the original schema from nofollow schema.
Requires `schema` to be a nofollow schema, i.e. that it wraps some
other schema.
Args:
schema: nofollow schema DataSlice.
kd.schema.get_obj_schema(x)
Aliases:
Returns a DataSlice of schemas for Objects and primitives in `x`.
DataSlice `x` must have OBJECT schema.
Examples:
db = kd.bag()
s = db.new_schema(a=kd.INT32)
obj = s(a=1).embed_schema()
kd.get_obj_schema(kd.slice([1, None, 2.0, obj]))
-> kd.slice([kd.INT32, NONE, kd.FLOAT32, s])
Args:
x: OBJECT DataSlice
Returns:
A DataSlice of schemas.
kd.schema.get_primitive_schema(ds)
Alias for kd.schema.get_dtype operator.
kd.schema.get_repr(schema)
Returns a string representation of the schema.
Named schemas are only represented by their name. Other schemas are
represented by their content.
Args:
schema: A scalar schema DataSlice.
Returns:
A scalar string DataSlice. A repr of the given schema.
kd.schema.get_schema(x)
Aliases:
Returns the schema of `x`.
kd.schema.get_value_schema(dict_schema)
Aliases:
Returns the value schema of a Dict schema.
kd.schema.internal_maybe_named_schema(name_or_schema)
Converts a string to a named schema, passes through schema otherwise.
The operator also passes through arolla.unspecified, and raises when
it receives anything else except unspecified, string or schema DataItem.
This operator exists to support kd.core.new* family of operators.
Args:
name_or_schema: The input name or schema.
Returns:
The schema unchanged, or a named schema with the given name.
kd.schema.is_dict_schema(x)
Returns true iff `x` is a Dict schema DataItem.
kd.schema.is_entity_schema(x)
Returns true iff `x` is an Entity schema DataItem.
kd.schema.is_list_schema(x)
Returns true iff `x` is a List schema DataItem.
kd.schema.is_primitive_schema(x)
Returns true iff `x` is a primitive schema DataItem.
kd.schema.is_struct_schema(x)
Returns true iff `x` is a Struct schema DataItem.
kd.schema.list_schema(item_schema)
Aliases:
Returns a List schema with the provided `item_schema`.
kd.schema.named_schema(name, /, **kwargs)
Aliases:
Creates a named entity schema.
A named schema will have its item id derived only from its name, which means
that two named schemas with the same name will have the same item id, even in
different DataBags, or with different kwargs passed to this method.
Args:
name: The name to use to derive the item id of the schema.
**kwargs: a named tuple mapping attribute names to DataSlices. The DataSlice
values must be schemas themselves.
Returns:
data_slice.DataSlice with the item id of the required schema and kd.SCHEMA
schema, with a new immutable DataBag attached containing the provided
kwargs.
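For illustration, a minimal sketch of the name-derived item id property:
s1 = kd.named_schema('Point', x=kd.FLOAT32, y=kd.FLOAT32)
s2 = kd.named_schema('Point')
s1 == s2  # -> present, since the item id depends only on the name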
kd.schema.new_schema(**kwargs)
Creates a new allocated schema.
Args:
**kwargs: a named tuple mapping attribute names to DataSlices. The DataSlice
values must be schemas themselves.
Returns:
(DataSlice) containing the schema id.
kd.schema.nofollow_schema(schema)
Aliases:
Returns a NoFollow schema of the provided schema.
`nofollow_schema` is reversible with `get_actual_schema`.
`nofollow_schema` can only be called on implicit and explicit schemas and
OBJECT. It raises an Error if called on primitive schemas, ITEMID, etc.
Args:
schema: Schema DataSlice to wrap.
kd.schema.schema_from_py(tpe)
Aliases:
Creates a Koda entity schema corresponding to the given Python type.
This method supports the following Python types / type annotations
recursively:
- Primitive types: int, float, bool, str, bytes.
- Collections: list[...], dict[...], Sequence[...], Mapping[...], etc.
- Unions: only "smth | None" or "Optional[smth]" is supported.
- Dataclasses.
This can be used in conjunction with kd.from_py to convert lists of Python
objects to efficient Koda DataSlices. Because of the 'efficient' goal, we
create an entity schema and do not use kd.OBJECT inside, which also results
in strict type checking. If you do not care
about efficiency or type safety, you can use kd.from_py(..., schema=kd.OBJECT)
directly.
Args:
tpe: The Python type to create a schema for.
Returns:
A Koda entity schema corresponding to the given Python type. The returned
schema is a uu-schema, in other words we always return the same output for
the same input. For dataclasses, we use the module name and the class name
to derive the itemid for the uu-schema.
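For illustration, a minimal sketch with a dataclass:
import dataclasses

@dataclasses.dataclass
class Point:
  x: int
  y: str | None

s = kd.schema_from_py(Point)   # entity schema with attrs 'x' and 'y'
s == kd.schema_from_py(Point)  # -> present: a uu-schema is stable per input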
kd.schema.to_bool(x)
Casts `x` to BOOLEAN using explicit (permissive) casting rules.
kd.schema.to_bytes(x)
Casts `x` to BYTES using explicit (permissive) casting rules.
kd.schema.to_expr(x)
Aliases:
Casts `x` to EXPR using explicit (permissive) casting rules.
kd.schema.to_float32(x)
Casts `x` to FLOAT32 using explicit (permissive) casting rules.
kd.schema.to_float64(x)
Casts `x` to FLOAT64 using explicit (permissive) casting rules.
kd.schema.to_int32(x)
Casts `x` to INT32 using explicit (permissive) casting rules.
kd.schema.to_int64(x)
Casts `x` to INT64 using explicit (permissive) casting rules.
kd.schema.to_itemid(x)
Alias for kd.schema.get_itemid operator.
kd.schema.to_mask(x)
Casts `x` to MASK using explicit (permissive) casting rules.
kd.schema.to_none(x)
Aliases:
Casts `x` to NONE using explicit (permissive) casting rules.
kd.schema.to_object(x)
Aliases:
Casts `x` to OBJECT using explicit (permissive) casting rules.
kd.schema.to_schema(x)
Aliases:
Casts `x` to SCHEMA using explicit (permissive) casting rules.
kd.schema.to_str(x)
Casts `x` to STRING using explicit (permissive) casting rules.
kd.schema.uu_schema(seed='', **kwargs)
Aliases:
Creates a UUSchema, i.e. a schema keyed by a uuid.
In order to create a different id from the same arguments, use
`seed` argument with the desired value, e.g.
kd.uu_schema(seed='type_1', x=kd.INT32, y=kd.FLOAT32)
and
kd.uu_schema(seed='type_2', x=kd.INT32, y=kd.FLOAT32)
have different ids.
Args:
seed: string seed for the uuid computation.
**kwargs: a named tuple mapping attribute names to DataSlices. The DataSlice
values must be schemas themselves.
Returns:
(DataSlice) containing the schema uuid.
kd.schema.with_schema(x, schema)
Aliases:
Returns a copy of `x` with the provided `schema`.
If `schema` is an Entity schema, it must have no DataBag or the same DataBag
as `x`. To set schema with a different DataBag, use `kd.set_schema` instead.
It only changes the schemas of `x` and does not change the items in `x`. To
change the items in `x`, use `kd.cast_to` instead. For example,
kd.with_schema(kd.ds([1, 2, 3]), kd.FLOAT32) -> fails because the items in
`x` are not compatible with FLOAT32.
kd.cast_to(kd.ds([1, 2, 3]), kd.FLOAT32) -> kd.ds([1.0, 2.0, 3.0])
When items in `x` are primitives or `schemas` is a primitive schema, it checks
items and schema are compatible. When items are ItemIds and `schema` is a
non-primitive schema, it does not check the underlying data matches the
schema. For example,
kd.with_schema(kd.ds([1, 2, 3], schema=kd.OBJECT), kd.INT32) ->
kd.ds([1, 2, 3])
kd.with_schema(kd.ds([1, 2, 3]), kd.INT64) -> fail
db = kd.bag()
kd.with_schema(kd.ds(1).with_bag(db), db.new_schema(x=kd.INT32)) -> fail due
to incompatible schema
kd.with_schema(db.new(x=1), kd.INT32) -> fail due to incompatible schema
kd.with_schema(db.new(x=1), kd.schema.new_schema(x=kd.INT32)) -> fail due to
different DataBag
kd.with_schema(db.new(x=1), kd.schema.new_schema(x=kd.INT32).no_bag()) ->
work
kd.with_schema(db.new(x=1), db.new_schema(x=kd.INT64)) -> work
Args:
x: DataSlice to change the schema of.
schema: DataSlice containing the new schema.
Returns:
DataSlice with the new schema.
kd.schema.with_schema_from_obj(x)
Aliases:
Returns `x` with its embedded common schema set as the schema.
* `x` must have OBJECT schema.
* All items in `x` must have a common schema.
* If `x` is empty, the schema is set to NONE.
* If `x` contains mixed primitives without a common primitive type, the output
will have OBJECT schema.
Args:
x: An OBJECT DataSlice.
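A minimal sketch of the expected behavior (the exact outputs below are
assumptions that follow from the rules above):
  x = kd.slice([1, 2], schema=kd.OBJECT)
  kd.schema.with_schema_from_obj(x).get_schema()  # -> INT32, the common schema
  y = kd.slice([], schema=kd.OBJECT)
  kd.schema.with_schema_from_obj(y).get_schema()  # -> NONE for an empty slice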
Operators that work on shapes
Operators
kd.shapes.dim_mapping(shape, dim)
Returns the parent-to-child mapping of the dimension in the given shape.
Example:
shape = kd.shapes.new([2], [3, 2], [1, 2, 0, 2, 1])
kd.shapes.dim_mapping(shape, 0) # -> kd.slice([0, 0])
kd.shapes.dim_mapping(shape, 1) # -> kd.slice([0, 0, 0, 1, 1])
kd.shapes.dim_mapping(shape, 2) # -> kd.slice([0, 1, 1, 3, 3, 4])
Args:
shape: a JaggedShape.
dim: the dimension to get the parent-to-child mapping for.
kd.shapes.dim_sizes(shape, dim)
Returns the row sizes at the provided dimension in the given shape.
Example:
shape = kd.shapes.new([2], [2, 1])
kd.shapes.dim_sizes(shape, 0) # -> kd.slice([2])
kd.shapes.dim_sizes(shape, 1) # -> kd.slice([2, 1])
Args:
shape: a JaggedShape.
dim: the dimension to get the sizes for.
kd.shapes.expand_to_shape(x, shape, ndim=unspecified)
Aliases:
Expands `x` based on the provided `shape`.
When `ndim` is not set, expands `x` to `shape`. The dimensions
of `x` must be the same as the first N dimensions of `shape` where N is the
number of dimensions of `x`. For example,
Example 1:
x: [[1, 2], [3]]
shape: JaggedShape(3, [2, 1], [1, 2, 3])
result: [[[1], [2, 2]], [[3, 3, 3]]]
Example 2:
x: [[1, 2], [3]]
shape: JaggedShape(3, [1, 1], [1, 3])
result: incompatible shapes
Example 3:
x: [[1, 2], [3]]
shape: JaggedShape(2)
result: incompatible shapes
When `ndim` is set, the expansion is performed in 3 steps:
1) the last N dimensions of `x` are first imploded into lists
2) the expansion operation is performed on the DataSlice of lists
3) the lists in the expanded DataSlice are exploded
The result will have M + ndim dimensions where M is the number
of dimensions of `shape`.
For example,
Example 4:
x: [[1, 2], [3]]
shape: JaggedShape(2, [1, 2])
ndim: 1
result: [[[1, 2]], [[3], [3]]]
Example 5:
x: [[1, 2], [3]]
shape: JaggedShape(2, [1, 2])
ndim: 2
result: [[[[1, 2], [3]]], [[[1, 2], [3]], [[1, 2], [3]]]]
Args:
x: DataSlice to expand.
shape: JaggedShape.
ndim: the number of dimensions to implode during expansion.
Returns:
Expanded DataSlice
kd.shapes.flatten(x, from_dim=0, to_dim=unspecified)
Aliases:
Returns `x` with dimensions `[from_dim:to_dim]` flattened.
Indexing works as in python:
* If `to_dim` is unspecified, `to_dim = rank()` is used.
* If `to_dim < from_dim`, `to_dim = from_dim` is used.
* If `to_dim < 0`, `max(0, to_dim + rank())` is used. The same goes for
`from_dim`.
* If `to_dim > rank()`, `rank()` is used. The same goes for `from_dim`.
The above-mentioned adjustments place both `from_dim` and `to_dim` in the
range `[0, rank()]`. After adjustments, the new DataSlice has `rank() ==
old_rank - (to_dim - from_dim) + 1`. Note that if `from_dim == to_dim`, a
"unit" dimension is inserted at `from_dim`.
Example:
# Flatten the last two dimensions into a single dimension, producing a
# DataSlice with `rank = old_rank - 1`.
kd.get_shape(x) # -> JaggedShape(..., [2, 1], [7, 5, 3])
flat_x = kd.flatten(x, -2)
kd.get_shape(flat_x) # -> JaggedShape(..., [12, 3])
# Flatten all dimensions except the last, producing a DataSlice with
# `rank = 2`.
kd.get_shape(x) # -> JaggedShape(..., [7, 5, 3])
flat_x = kd.flatten(x, 0, -1)
kd.get_shape(flat_x) # -> JaggedShape([3], [7, 5, 3])
# Flatten all dimensions.
kd.get_shape(x) # -> JaggedShape([3], [7, 5, 3])
flat_x = kd.flatten(x)
kd.get_shape(flat_x) # -> JaggedShape([15])
Args:
x: a DataSlice.
from_dim: start of dimensions to flatten. Defaults to `0` if unspecified.
to_dim: end of dimensions to flatten. Defaults to `rank()` if unspecified.
kd.shapes.flatten_end(x, n_times=1)
Aliases:
Returns `x` with a shape flattened `n_times` from the end.
The new shape has x.get_ndim() - n_times dimensions.
Given that flattening happens from the end, only positive integers are
allowed. For more control over flattening, please use `kd.flatten`, instead.
Args:
x: a DataSlice.
n_times: number of dimensions to flatten from the end
(0 <= n_times <= rank).
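A minimal sketch (the results follow from flattening the trailing dimensions,
as described above):
  x = kd.slice([[[1, 2], [3]], [[4]]])
  kd.shapes.flatten_end(x)             # -> kd.slice([[1, 2, 3], [4]])
  kd.shapes.flatten_end(x, n_times=2)  # -> kd.slice([1, 2, 3, 4])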
kd.shapes.get_shape(x)
Aliases:
Returns the shape of `x`.
kd.shapes.get_sizes(x)
Returns a DataSlice of sizes of a given shape.
Example:
kd.shapes.get_sizes(kd.shapes.new([2], [2, 1])) -> kd.slice([[2], [2, 1]])
kd.shapes.get_sizes(kd.slice([['a', 'b'], ['c']])) -> kd.slice([[2], [2, 1]])
Args:
x: a shape or a DataSlice from which the shape will be taken.
Returns:
A 2-dimensional DataSlice where the first dimension's size corresponds to
the shape's rank and the n-th subslice corresponds to the sizes of the n-th
dimension of the original shape.
kd.shapes.is_expandable_to_shape(x, target_shape, ndim=unspecified)
Returns true if `x` is expandable to `target_shape`.
See `expand_to_shape` for a detailed description of expansion.
Args:
x: DataSlice that would be expanded.
target_shape: JaggedShape that would be expanded to.
ndim: The number of dimensions to implode before expansion. If unset,
defaults to 0.
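A minimal sketch (a MASK result is assumed, as for other predicate operators):
  shape = kd.shapes.new(2, [2, 1])
  kd.shapes.is_expandable_to_shape(kd.slice([1, 2]), shape)     # -> present
  kd.shapes.is_expandable_to_shape(kd.slice([1, 2, 3]), shape)  # -> missing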
kd.shapes.ndim(shape)
Aliases:
Returns the rank of the jagged shape.
kd.shapes.new(*dimensions)
Returns a JaggedShape from the provided dimensions.
Example:
# Creates a scalar shape (i.e. no dimension).
kd.shapes.new() # -> JaggedShape()
# Creates a 3-dimensional shape with all uniform dimensions.
kd.shapes.new(2, 3, 1) # -> JaggedShape(2, 3, 1)
# Creates a 3-dimensional shape with 2 sub-values in the first dimension.
#
# The second dimension is jagged with 2 values. The first value in the
# second dimension has 2 sub-values, and the second value has 1 sub-value.
#
# The third dimension is jagged with 3 values. The first value in the third
# dimension has 1 sub-value, the second has 2 sub-values, and the third has
# 3 sub-values.
kd.shapes.new(2, [2, 1], [1, 2, 3])
# -> JaggedShape(2, [2, 1], [1, 2, 3])
Args:
*dimensions: A combination of Edges and DataSlices representing the
dimensions of the JaggedShape. Edges are used as is, while DataSlices are
treated as sizes. DataItems (of ints) are interpreted as uniform
dimensions which have the same child size for all parent elements.
DataSlices (of ints) are interpreted as a list of sizes, where `ds[i]` is
the child size of parent `i`. Only rank-0 or rank-1 int DataSlices are
supported.
kd.shapes.rank(shape)
Alias for kd.shapes.ndim operator.
kd.shapes.reshape(x, shape)
Aliases:
Returns a DataSlice with the provided shape.
Examples:
x = kd.slice([1, 2, 3, 4])
# Using a shape.
kd.reshape(x, kd.shapes.new(2, 2)) # -> kd.slice([[1, 2], [3, 4]])
# Using a tuple of sizes.
kd.reshape(x, kd.tuple(2, 2)) # -> kd.slice([[1, 2], [3, 4]])
# Using a tuple of sizes and a placeholder dimension.
kd.reshape(x, kd.tuple(-1, 2)) # -> kd.slice([[1, 2], [3, 4]])
# Using a tuple of slices and a placeholder dimension.
kd.reshape(x, kd.tuple(-1, kd.slice([3, 1])))
# -> kd.slice([[1, 2, 3], [4]])
# Reshaping a scalar.
kd.reshape(1, kd.tuple(1, 1)) # -> kd.slice([[1]])
# Reshaping an empty slice.
kd.reshape(kd.slice([]), kd.tuple(2, 0)) # -> kd.slice([[], []])
Args:
x: a DataSlice.
shape: a JaggedShape or a tuple of dimensions that forms a shape through
`kd.shapes.new`, with additional support for a `-1` placeholder dimension.
kd.shapes.reshape_as(x, shape_from)
Aliases:
Returns a DataSlice x reshaped to the shape of DataSlice shape_from.
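A minimal sketch (assuming the kd.reshape_as alias; the total sizes of `x`
and `shape_from` must match, as with kd.reshape):
  x = kd.slice([1, 2, 3, 4])
  y = kd.slice([[0, 0], [0, 0]])
  kd.reshape_as(x, y)  # -> kd.slice([[1, 2], [3, 4]])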
kd.shapes.size(shape)
Returns the total number of elements the jagged shape represents.
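A minimal sketch (the result is presumably an integer DataItem):
  # 2 groups with sizes 2 and 1 -> 3 elements in total.
  kd.shapes.size(kd.shapes.new(2, [2, 1]))  # -> 3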
Operators that perform DataSlice transformations.
Operators
kd.slices.agg_count(x, ndim=unspecified)
Aliases:
Returns counts of present items over the last ndim dimensions.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Example:
ds = kd.slice([[1, None, 1], [3, 4, 5], [None, None]])
kd.agg_count(ds) # -> kd.slice([2, 3, 0])
kd.agg_count(ds, ndim=1) # -> kd.slice([2, 3, 0])
kd.agg_count(ds, ndim=2) # -> kd.slice(5)
Args:
x: A DataSlice.
ndim: The number of dimensions to aggregate over. Requires 0 <= ndim <=
get_ndim(x).
kd.slices.agg_size(x, ndim=unspecified)
Aliases:
Returns number of items in `x` over the last ndim dimensions.
Note that it counts missing items, which is different from `kd.count`.
The resulting DataSlice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Example:
ds = kd.slice([[1, None, 1], [3, 4, 5], [None, None]])
kd.agg_size(ds) # -> kd.slice([3, 3, 2])
kd.agg_size(ds, ndim=1) # -> kd.slice([3, 3, 2])
kd.agg_size(ds, ndim=2) # -> kd.slice(8)
Args:
x: A DataSlice.
ndim: The number of dimensions to aggregate over. Requires 0 <= ndim <=
get_ndim(x).
Returns:
A DataSlice of number of items in `x` over the last `ndim` dimensions.
kd.slices.align(*args)
Aliases:
Expands all of the DataSlices in `args` to the same common shape.
All DataSlices must be expandable to the shape of the DataSlice with the
largest number of dimensions.
Example:
kd.align(kd.slice([[1, 2, 3], [4, 5]]), kd.slice('a'), kd.slice([1, 2]))
# Returns:
# (
# kd.slice([[1, 2, 3], [4, 5]]),
# kd.slice([['a', 'a', 'a'], ['a', 'a']]),
# kd.slice([[1, 1, 1], [2, 2]]),
# )
Args:
*args: DataSlices to align.
Returns:
A tuple of aligned DataSlices, matching `args`.
kd.slices.at(x, indices)
Aliases:
Returns a new DataSlice with items at provided indices.
`indices` must have INT32 or INT64 dtype or OBJECT schema holding INT32 or
INT64 items.
Indices in the DataSlice `indices` are based on the last dimension of the
DataSlice `x`. Negative indices are supported and out-of-bound indices result
in missing items.
If ndim(x) - 1 > ndim(indices), indices are broadcasted to shape(x)[:-1].
If ndim(x) <= ndim(indices), indices are unchanged but shape(x)[:-1] must be
broadcastable to shape(indices).
Example:
x = kd.slice([[1, None, 2], [3, 4]])
kd.take(x, kd.item(1)) # -> kd.slice([None, 4])
kd.take(x, kd.slice([0, 1])) # -> kd.slice([1, 4])
kd.take(x, kd.slice([[0, 1], [1]])) # -> kd.slice([[1, None], [4]])
kd.take(x, kd.slice([[[0, 1], []], [[1], [0]]]))
# -> kd.slice([[[1, None], []], [[4], [3]]])
kd.take(x, kd.slice([3, -3])) # -> kd.slice([None, None])
kd.take(x, kd.slice([-1, -2])) # -> kd.slice([2, 3])
kd.take(x, kd.slice('1')) # -> dtype mismatch error
kd.take(x, kd.slice([1, 2, 3])) # -> incompatible shape
Args:
x: DataSlice to be indexed
indices: indices used to select items
Returns:
A new DataSlice with items selected by indices.
kd.slices.bool(x)
Aliases:
Returns kd.slice(x, kd.BOOLEAN).
kd.slices.bytes(x)
Aliases:
Returns kd.slice(x, kd.BYTES).
kd.slices.collapse(x, ndim=unspecified)
Aliases:
Collapses the same items over the last ndim dimensions.
Missing items are ignored. For each collapse aggregation, the result is
present if and only if there is at least one present item and all present
items are the same.
The resulting slice has `rank = rank - ndim` and shape: `shape =
shape[:-ndim]`.
Example:
ds = kd.slice([[1, None, 1], [3, 4, 5], [None, None]])
kd.collapse(ds) # -> kd.slice([1, None, None])
kd.collapse(ds, ndim=1) # -> kd.slice([1, None, None])
kd.collapse(ds, ndim=2) # -> kd.slice(None)
Args:
x: A DataSlice.
ndim: The number of dimensions to collapse into. Requires 0 <= ndim <=
get_ndim(x).
Returns:
Collapsed DataSlice.
kd.slices.concat(*args, ndim=1)
Aliases:
Returns the concatenation of the given DataSlices on dimension `rank-ndim`.
All given DataSlices must have the same rank, and the shapes of the first
`rank-ndim` dimensions must match. If they have incompatible shapes, consider
using `kd.align(*args)`, `arg.repeat(...)`, or `arg.expand_to(other_arg, ...)`
to bring them to compatible shapes first.
The shape of the concatenated result is the following:
1) the shape of the first `rank-ndim` dimensions remains the same
2) the shape of the concatenation dimension is the element-wise sum of the
shapes of the arguments' concatenation dimensions
3) the shapes of the last `ndim-1` dimensions are interleaved within the
groups implied by the concatenation dimension
Alternatively, if we think of each input DataSlice as a nested Python list,
this operator simultaneously iterates over the inputs at depth `rank-ndim`,
concatenating the root lists of the corresponding nested sub-lists from each
input.
For example,
a = kd.slice([[[1, 2], [3]], [[5], [7, 8]]])
b = kd.slice([[[1], [2]], [[3], [4]]])
kd.concat(a, b, ndim=1) -> [[[1, 2, 1], [3, 2]], [[5, 3], [7, 8, 4]]]
kd.concat(a, b, ndim=2) -> [[[1, 2], [3], [1], [2]], [[5], [7, 8], [3], [4]]]
kd.concat(a, b, ndim=3) -> [[[1, 2], [3]], [[5], [7, 8]],
[[1], [2]], [[3], [4]]]
kd.concat(a, b, ndim=4) -> raise an exception
kd.concat(a, b) -> the same as kd.concat(a, b, ndim=1)
The reason auto-broadcasting is not supported is that such behavior can be
confusing and often not what users want. For example,
a = kd.slice([[[1, 2], [3]], [[5], [7, 8]]])
b = kd.slice([[1, 2], [3, 4]])
kd.concat(a, b) -> which of the following should it be?
[[[1, 2, 1, 2], [3, 1, 2]], [[5, 3, 4], [7, 8, 3, 4]]]
[[[1, 2, 1, 1], [3, 2]], [[5, 3], [7, 8, 4, 4]]]
[[[1, 2, 1], [3, 2]], [[5, 3], [7, 8, 4]]]
Args:
*args: The DataSlices to concatenate.
ndim: The number of last dimensions to concatenate (default 1).
Returns:
The concatenation of the input DataSlices on dimension `rank-ndim`. In case
the input DataSlices come from different DataBags, this will refer to a
new merged immutable DataBag.
kd.slices.count(x)
Aliases:
Returns the count of present items over all dimensions.
The result is a zero-dimensional DataItem.
Args:
x: A DataSlice of numbers.
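A minimal sketch (assuming the kd.count alias):
  ds = kd.slice([[1, None], [3]])
  kd.count(ds)  # -> kd.item(2): two present items across all dimensions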
kd.slices.cum_count(x, ndim=unspecified)
Aliases:
Computes a partial count of present items over the last `ndim` dimensions.
If `ndim` isn't specified, it defaults to 1 (count over the last dimension).
Example:
x = kd.slice([[1, None, 1, 1], [3, 4, 5]])
kd.cum_count(x, ndim=1) # -> kd.slice([[1, None, 2, 3], [1, 2, 3]])
kd.cum_count(x, ndim=2) # -> kd.slice([[1, None, 2, 3], [4, 5, 6]])
Args:
x: A DataSlice.
ndim: The number of trailing dimensions to count within. Requires 0 <= ndim
<= get_ndim(x).
Returns:
A DataSlice of INT64 with the same shape and sparsity as `x`.
kd.slices.dense_rank(x, descending=False, ndim=unspecified)
Aliases:
Returns dense ranks of items in `x` over the last `ndim` dimensions.
Items are grouped over the last `ndim` dimensions and ranked within the group.
`ndim` is set to 1 by default if unspecified. Ranks are integers starting from
0, assigned to values in ascending order by default.
By dense ranking ("1 2 2 3" ranking), equal items are assigned to the same
rank and the next items are assigned to that rank plus one (i.e. no gap
between the rank numbers).
NaN values are ranked lowest regardless of the order of ranking. Ranks of
missing items are missing in the result.
Example:
x = kd.slice([[4, 3, None, 3], [3, None, 2, 1]])
kd.dense_rank(x) -> kd.slice([[1, 0, None, 0], [2, None, 1, 0]])
kd.dense_rank(x, descending=True) ->
kd.slice([[0, 1, None, 1], [0, None, 1, 2]])
kd.dense_rank(x, ndim=0) -> kd.slice([[0, 0, None, 0], [0, None, 0, 0]])
kd.dense_rank(x, ndim=2) -> kd.slice([[3, 2, None, 2], [2, None, 1, 0]])
Args:
x: DataSlice to rank.
descending: If true, items are compared in descending order.
ndim: The number of dimensions to rank over. Requires 0 <= ndim <=
get_ndim(x).
Returns:
A DataSlice of dense ranks.
kd.slices.empty_shaped(shape, schema=MASK)
Aliases:
Returns a DataSlice of missing items with the given shape.
If `schema` is a Struct schema, an empty DataBag is created and attached to
the resulting DataSlice and `schema` is adopted into that DataBag.
Args:
shape: Shape of the resulting DataSlice.
schema: optional schema of the resulting DataSlice.
kd.slices.empty_shaped_as(shape_from, schema=MASK)
Aliases:
Returns a DataSlice of missing items with the shape of `shape_from`.
If `schema` is a Struct schema, an empty DataBag is created and attached to
the resulting DataSlice and `schema` is adopted into that DataBag.
Args:
shape_from: used for the shape of the resulting DataSlice.
schema: optional schema of the resulting DataSlice.
kd.slices.expand_to(x, target, ndim=unspecified)
Aliases:
Expands `x` based on the shape of `target`.
When `ndim` is not set, expands `x` to the shape of
`target`. The dimensions of `x` must be the same as the first N
dimensions of `target` where N is the number of dimensions of `x`. For
example,
Example 1:
x: kd.slice([[1, 2], [3]])
target: kd.slice([[[0], [0, 0]], [[0, 0, 0]]])
result: kd.slice([[[1], [2, 2]], [[3, 3, 3]]])
Example 2:
x: kd.slice([[1, 2], [3]])
target: kd.slice([[[0]], [[0, 0, 0]]])
result: incompatible shapes
Example 3:
x: kd.slice([[1, 2], [3]])
target: kd.slice([0, 0])
result: incompatible shapes
When `ndim` is set, the expansion is performed in 3 steps:
1) the last N dimensions of `x` are first imploded into lists
2) the expansion operation is performed on the DataSlice of lists
3) the lists in the expanded DataSlice are exploded
The result will have M + ndim dimensions where M is the number
of dimensions of `target`.
For example,
Example 4:
x: kd.slice([[1, 2], [3]])
target: kd.slice([[1], [2, 3]])
ndim: 1
result: kd.slice([[[1, 2]], [[3], [3]]])
Example 5:
x: kd.slice([[1, 2], [3]])
target: kd.slice([[1], [2, 3]])
ndim: 2
result: kd.slice([[[[1, 2], [3]]], [[[1, 2], [3]], [[1, 2], [3]]]])
Args:
x: DataSlice to expand.
target: target DataSlice.
ndim: the number of dimensions to implode during expansion.
Returns:
Expanded DataSlice
kd.slices.expr_quote(x)
Aliases:
Returns kd.slice(x, kd.EXPR).
kd.slices.float32(x)
Aliases:
Returns kd.slice(x, kd.FLOAT32).
kd.slices.float64(x)
Aliases:
Returns kd.slice(x, kd.FLOAT64).
kd.slices.get_ndim(x)
Aliases:
Returns the number of dimensions of DataSlice `x`.
kd.slices.get_repr(x, /, *, depth=25, item_limit=200, item_limit_per_dimension=25, format_html=False, max_str_len=100, show_attributes=True, show_databag_id=False, show_shape=False, show_schema=False)
Aliases:
Returns a string representation of the DataSlice `x`.
Args:
x: DataSlice to represent.
depth: Maximum depth when printing nested DataSlices.
item_limit: When `x` is a DataSlice, the maximum number of items to
show across all dimensions. When `x` is a DataItem, the maximum
number of entity/object attributes, list items, or dict key/value pairs to
show. Must be non-negative, or an error will be raised.
item_limit_per_dimension: The maximum number of items to show per dimension
in a DataSlice. It is only enforced when the size of DataSlice is larger
than `item_limit`. Must be non-negative, or an error will be raised.
format_html: When true, attributes and object ids are wrapped in HTML tags
to make it possible to style with CSS and interpret interactions with JS.
max_str_len: Maximum length of the repr string to show for text and
bytes, if non-negative.
show_attributes: When true, shows the attributes of entities/objects in a
non-DataItem DataSlice.
show_databag_id: When true, the repr will show the databag id.
show_shape: When true, the repr will show the shape.
show_schema: When true, the repr will show the schema.
Returns:
A string representation of the DataSlice `x`.
kd.slices.group_by(x, *args, sort=False)
Aliases:
Returns a permutation of `x` with an injected grouped_by dimension.
The resulting DataSlice has `get_ndim(x) + 1` dimensions. The first
`get_ndim(x) - 1` dimensions are unchanged. The last two dimensions correspond
to the groups and the items within the groups.
The values of the result are a permutation of `x`. `args` are used as the
grouping keys. If the length of `args` is greater than 1, the key is a tuple.
If `args` is empty, the key is `x`.
If sort=True, groups are ordered by value; otherwise, groups are ordered by
the appearance of the first object in the group.
Example 1:
x: kd.slice([1, 3, 2, 1, 2, 3, 1, 3])
result: kd.slice([[1, 1, 1], [3, 3, 3], [2, 2]])
Example 2:
x: kd.slice([1, 3, 2, 1, 2, 3, 1, 3])
sort: True
result: kd.slice([[1, 1, 1], [2, 2], [3, 3, 3]])
Example 3:
x: kd.slice([[1, 2, 1, 3, 1, 3], [1, 3, 1]])
result: kd.slice([[[1, 1, 1], [2], [3, 3]], [[1, 1], [3]]])
Example 4:
x: kd.slice([1, 3, 2, 1, None, 3, 1, None])
result: kd.slice([[1, 1, 1], [3, 3], [2]])
Missing values are not listed in the result.
Example 5:
x: kd.slice([1, 2, 3, 4, 5, 6, 7, 8]),
y: kd.slice([7, 4, 0, 9, 4, 0, 7, 0]),
result: kd.slice([[1, 7], [2, 5], [3, 6, 8], [4]])
When *args is present, `x` is not used for the key.
Example 6:
x: kd.slice([1, 2, 3, 4, None, 6, 7, 8]),
y: kd.slice([7, 4, 0, 9, 4, 0, 7, None]),
result: kd.slice([[1, 7], [2, None], [3, 6], [4]])
Items with a missing key are not listed in the result.
Missing `x` values are missing in the result.
Example 7:
x: kd.slice([1, 2, 3, 4, 5, 6, 7, 8]),
y: kd.slice([7, 4, 0, 9, 4, 0, 7, 0]),
z: kd.slice(['A', 'D', 'B', 'A', 'D', 'C', 'A', 'B']),
result: kd.slice([[1, 7], [2, 5], [3, 8], [4], [6]])
When *args has two or more values, the key is a tuple.
In this example we have the following groups:
(7, A), (4, D), (0, B), (9, A), (0, C)
Args:
x: DataSlice to group.
*args: DataSlices keys to group by. All data slices must have the same shape
as x. Scalar DataSlices are not supported. If not present, `x` is used as
the key.
sort: Whether groups should be ordered by value.
Returns:
DataSlice with the same shape and schema as `x` with injected grouped
by dimension.
kd.slices.group_by_indices(*args, sort=False)
Aliases:
Returns an indices DataSlice with an injected grouped_by dimension.
The resulting DataSlice has one more dimension than the inputs. The first
`get_ndim() - 1` dimensions are unchanged. The last two dimensions correspond
to the groups and the items within the groups.
Values of the DataSlice are the indices of the items within the parent
dimension. `kd.take(x, kd.group_by_indices(x))` would group the items in
`x` by their values.
If sort=True, groups are ordered by value; otherwise, groups are ordered by
the appearance of the first object in the group.
Example 1:
x: kd.slice([1, 3, 2, 1, 2, 3, 1, 3])
result: kd.slice([[0, 3, 6], [1, 5, 7], [2, 4]])
We have three groups in order: 1, 3, 2. Each sublist contains the indices of
the items in the original DataSlice.
Example 2:
x: kd.slice([1, 3, 2, 1, 2, 3, 1, 3])
sort: True
result: kd.slice([[0, 3, 6], [2, 4], [1, 5, 7]])
Groups are now ordered by value.
Example 3:
x: kd.slice([[1, 2, 1, 3, 1, 3], [1, 3, 1]])
result: kd.slice([[[0, 2, 4], [1], [3, 5]], [[0, 2], [1]]])
We have three groups in the first sublist in order: 1, 2, 3 and two groups
in the second sublist in order: 1, 3.
Each sublist contains the indices of the items in the original sublist.
Example 4:
x: kd.slice([1, 3, 2, 1, None, 3, 1, None])
result: kd.slice([[0, 3, 6], [1, 5], [2]])
Missing values are not listed in the result.
Example 5:
x: kd.slice([1, 2, 3, 1, 2, 3, 1, 3]),
y: kd.slice([7, 4, 0, 9, 4, 0, 7, 0]),
result: kd.slice([[0, 6], [1, 4], [2, 5, 7], [3]])
With several arguments, the key is a tuple.
In this example we have the following groups: (1, 7), (2, 4), (3, 0), (1, 9)
Args:
*args: DataSlices keys to group by. All data slices must have the same
shape. Scalar DataSlices are not supported.
sort: Whether groups should be ordered by value.
Returns:
INT64 DataSlice with indices and injected grouped_by dimension.
kd.slices.index(x, dim=-1)
Aliases:
Returns the indices of the elements computed over dimension `dim`.
The resulting slice has the same shape as the input.
Example:
ds = kd.slice([
[
['a', None, 'c'],
['d', 'e']
],
[
[None, 'g'],
['h', 'i', 'j']
]
])
kd.index(ds, dim=0)
# -> kd.slice([[[0, None, 0], [0, 0]], [[None, 1], [1, 1, 1]]])
kd.index(ds, dim=1)
# -> kd.slice([[[0, None, 0], [1, 1]], [[None, 0], [1, 1, 1]]])
kd.index(ds, dim=2) # (same as kd.index(ds, -1) or kd.index(ds))
# -> kd.slice([[[0, None, 2], [0, 1]], [[None, 1], [0, 1, 2]]])
kd.index(ds) -> kd.index(ds, dim=ds.get_ndim() - 1)
Args:
x: A DataSlice.
dim: The dimension to compute indices over.
Requires -get_ndim(x) <= dim < get_ndim(x).
If dim < 0 then dim = get_ndim(x) + dim.
kd.slices.int32(x)
Aliases:
Returns kd.slice(x, kd.INT32).
kd.slices.int64(x)
Aliases:
Returns kd.slice(x, kd.INT64).
kd.slices.internal_is_compliant_attr_name(attr_name, /)
Returns true iff `attr_name` can be accessed through `getattr(slice, attr_name)`.
kd.slices.internal_select_by_slice(ds, fltr, expand_filter=True)
A version of kd.select that does not support lambdas/functors.
kd.slices.inverse_mapping(x, ndim=unspecified)
Aliases:
Returns inverse permutations of indices over the last `ndim` dimension.
It interprets the values over the last `ndim` dimensions as a permutation and
substitutes them with the corresponding inverse permutation. `ndim` is set to 1
by default if unspecified. It fails when the values do not form a valid
permutation.
Example:
indices = kd.slice([[1, 2, 0], [1, None]])
kd.inverse_mapping(indices) -> kd.slice([[2, 0, 1], [None, 0]])
Explanation:
indices = [[1, 2, 0], [1, None]]
inverse_permutation[1, 2, 0] = [2, 0, 1]
inverse_permutation[1, None] = [None, 0]
indices = kd.slice([[1, 2, 0], [3, None]])
kd.inverse_mapping(indices, ndim=1) -> raise
kd.inverse_mapping(indices, ndim=2) -> kd.slice([[2, 0, 1], [3, None]])
Args:
x: A DataSlice of indices.
ndim: The number of dimensions to compute inverse permutations over.
Requires 0 <= ndim <= get_ndim(x).
Returns:
An inverse permutation of indices.
kd.slices.inverse_select(ds, fltr)
Aliases:
Creates a DataSlice by putting items in `ds` into the present positions in `fltr`.
The shape of `ds` and the shape of `fltr` must have the same rank and the same
first N-1 dimensions. That is, only the last dimension can be different. The
shape of `ds` must be the same as the shape of the DataSlice after applying
`fltr` using kd.select. That is,
ds.get_shape() == kd.select(fltr, fltr).get_shape().
Example:
ds = kd.slice([[1, None], [2]])
fltr = kd.slice([[None, kd.present, kd.present], [kd.present, None]])
kd.inverse_select(ds, fltr) -> [[None, 1, None], [2, None]]
ds = kd.slice([1, None, 2])
fltr = kd.slice([[None, kd.present, kd.present], [kd.present, None]])
kd.inverse_select(ds, fltr) -> error due to different ranks
ds = kd.slice([[1, None, 2]])
fltr = kd.slice([[None, kd.present, kd.present], [kd.present, None]])
kd.inverse_select(ds, fltr) -> error due to different N-1 dimensions
ds = kd.slice([[1], [2]])
fltr = kd.slice([[None, kd.present, kd.present], [kd.present, None]])
kd.inverse_select(ds, fltr) -> error due to incompatible shapes
Note, in most cases, kd.inverse_select is not a strict reverse operation of
kd.select as kd.select operation is lossy and does not require `ds` and `fltr`
to have the same rank. That is,
kd.inverse_select(kd.select(ds, fltr), fltr) != ds.
The most common use case of kd.inverse_select is to restore the shape of the
original DataSlice after applying kd.select and performing some operations on
the subset of items in the original DataSlice. E.g.
filtered_ds = kd.select(ds, fltr)
# do something on filtered_ds
ds = kd.inverse_select(filtered_ds, fltr) | ds
Args:
ds: DataSlice to be reverse filtered
fltr: filter DataSlice with dtype as kd.MASK.
Returns:
Reverse filtered DataSlice.
kd.slices.is_empty(x)
Aliases:
Returns kd.present if all items in the DataSlice are missing.
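A minimal sketch (assuming the kd.is_empty alias):
  kd.is_empty(kd.slice([None, None]))  # -> kd.present
  kd.is_empty(kd.slice([1, None]))     # -> kd.missing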
kd.slices.is_expandable_to(x, target, ndim=unspecified)
Aliases:
Returns true if `x` is expandable to `target`.
Args:
x: DataSlice to expand.
target: target DataSlice.
ndim: the number of dimensions to implode before expansion.
See `expand_to` for a detailed description of expansion.
kd.slices.is_shape_compatible(x, y)
Aliases:
Returns present if the shapes of `x` and `y` are compatible.
Two DataSlices have compatible shapes if the dimensions of one DataSlice are
equal to, or a prefix of, the dimensions of the other.
Args:
x: DataSlice to check.
y: DataSlice to check.
Returns:
A MASK DataItem indicating whether 'x' and 'y' are compatible.
kd.slices.isin(x, y)
Aliases:
Returns a DataItem indicating whether DataItem x is present in y.
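A minimal sketch (assuming the kd.isin alias and a MASK result):
  kd.isin(kd.item(1), kd.slice([1, 2]))  # -> kd.present
  kd.isin(kd.item(3), kd.slice([1, 2]))  # -> kd.missing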
kd.slices.item(x, /, schema=None)
Aliases:
Returns a DataItem created from `x`.
If `schema` is set, that schema is used, otherwise the schema is inferred from
`x`. The Python value must be convertible to a Koda scalar and the result
cannot be a multidimensional DataSlice.
Args:
x: a Python value or a DataItem.
schema: schema DataItem to set. If `x` is already a DataItem, this will cast
it to the given schema.
kd.slices.mask(x)
Aliases:
Returns kd.slice(x, kd.MASK).
kd.slices.ordinal_rank(x, tie_breaker=unspecified, descending=False, ndim=unspecified)
Aliases:
Returns ordinal ranks of items in `x` over the last `ndim` dimensions.
Items are grouped over the last `ndim` dimensions and ranked within the group.
`ndim` is set to 1 by default if unspecified. Ranks are integers starting from
0, assigned to values in ascending order by default.
By ordinal ranking ("1 2 3 4" ranking), equal items receive distinct ranks.
Items are compared by the triple (value, tie_breaker, position) to resolve
ties. When descending=True, values are ranked in descending order but
tie_breaker and position are ranked in ascending order.
NaN values are ranked lowest regardless of the order of ranking. Ranks of
missing items are missing in the result. If `tie_breaker` is specified, it
cannot be more sparse than `x`.
Example:
x = kd.slice([[0, 3, None, 6], [5, None, 2, 1]])
kd.ordinal_rank(x) -> kd.slice([[0, 1, None, 2], [2, None, 1, 0]])
kd.ordinal_rank(x, descending=True) ->
kd.slice([[2, 1, None, 0], [0, None, 1, 2]])
kd.ordinal_rank(x, ndim=0) -> kd.slice([[0, 0, None, 0], [0, None, 0, 0]])
kd.ordinal_rank(x, ndim=2) -> kd.slice([[0, 3, None, 5], [4, None, 2, 1]])
Args:
x: DataSlice to rank.
tie_breaker: If specified, used to break ties. If `tie_breaker` does not
fully resolve all ties, then the remaining ties are resolved by their
positions in the DataSlice.
descending: If true, items are compared in descending order. Does not affect
the order of tie breaker and position in the tie-breaking comparison.
ndim: The number of dimensions to rank over. Requires 0 <= ndim <=
get_ndim(x).
Returns:
A DataSlice of ordinal ranks.
kd.slices.range(start, end=unspecified)
Aliases:
Returns a DataSlice of INT64s with range [start, end).
`start` and `end` must be broadcastable to the same shape. The resulting
DataSlice has one more dimension than the broadcasted shape.
When `end` is unspecified, `start` is used as `end` and 0 is used as `start`.
For example,
kd.range(5) -> kd.slice([0, 1, 2, 3, 4])
kd.range(2, 5) -> kd.slice([2, 3, 4])
kd.range(5, 2) -> kd.slice([]) # empty range
kd.range(kd.slice([2, 4])) -> kd.slice([[0, 1], [0, 1, 2, 3]])
kd.range(kd.slice([2, 4]), 6) -> kd.slice([[2, 3, 4, 5], [4, 5]])
Args:
start: A DataSlice for start (inclusive) of intervals (unless `end` is
unspecified, in which case this parameter is used as `end`).
end: A DataSlice for end (exclusive) of intervals.
Returns:
A DataSlice of INT64s with range [start, end).
kd.slices.repeat(x, sizes)
Aliases:
Returns `x` with values repeated according to `sizes`.
The resulting DataSlice has `rank = rank + 1`. The input `sizes` are
broadcasted to `x`, and each value is repeated the given number of times.
Example:
ds = kd.slice([[1, None], [3]])
sizes = kd.slice([[1, 2], [3]])
kd.repeat(ds, sizes) # -> kd.slice([[[1], [None, None]], [[3, 3, 3]]])
ds = kd.slice([[1, None], [3]])
sizes = kd.slice([2, 3])
kd.repeat(ds, sizes) # -> kd.slice([[[1, 1], [None, None]], [[3, 3, 3]]])
ds = kd.slice([[1, None], [3]])
size = kd.item(2)
kd.repeat(ds, size) # -> kd.slice([[[1, 1], [None, None]], [[3, 3]]])
Args:
x: A DataSlice of data.
sizes: A DataSlice of sizes that each value in `x` should be repeated for.
kd.slices.repeat_present(x, sizes)
Aliases:
Returns `x` with present values repeated according to `sizes`.
The resulting DataSlice has `rank = rank + 1`. The input `sizes` are
broadcasted to `x`, and each value is repeated the given number of times.
Example:
ds = kd.slice([[1, None], [3]])
sizes = kd.slice([[1, 2], [3]])
kd.repeat_present(ds, sizes) # -> kd.slice([[[1], []], [[3, 3, 3]]])
ds = kd.slice([[1, None], [3]])
sizes = kd.slice([2, 3])
kd.repeat_present(ds, sizes) # -> kd.slice([[[1, 1], []], [[3, 3, 3]]])
ds = kd.slice([[1, None], [3]])
size = kd.item(2)
kd.repeat_present(ds, size) # -> kd.slice([[[1, 1], []], [[3, 3]]])
Args:
x: A DataSlice of data.
sizes: A DataSlice of sizes that each value in `x` should be repeated for.
kd.slices.reverse(ds)
Aliases:
Returns a DataSlice with items reversed on the last dimension.
Example:
ds = kd.slice([[1, None], [2, 3, 4]])
kd.reverse(ds) -> [[None, 1], [4, 3, 2]]
ds = kd.slice([1, None, 2])
kd.reverse(ds) -> [2, None, 1]
Args:
ds: DataSlice to be reversed.
Returns:
Reversed on the last dimension DataSlice.
kd.slices.reverse_select(ds, fltr)
Alias for kd.slices.inverse_select operator.
kd.slices.select(ds, fltr, expand_filter=True)
Aliases:
Creates a new DataSlice by filtering out missing items in fltr.
It is not supported for DataItems because their sizes are always 1.
The dimensions of `fltr` need to be compatible with the dimensions of `ds`.
By default, `fltr` is expanded to `ds` and items in `ds` corresponding to
missing items in `fltr` are removed. The last dimension of the resulting
DataSlice is changed while the first N-1 dimensions are the same as those in
`ds`.
Example:
val = kd.slice([[1, None, 4], [None], [2, 8]])
kd.select(val, val > 3) -> [[4], [], [8]]
fltr = kd.slice(
[[None, kd.present, kd.present], [kd.present], [kd.present, None]])
kd.select(val, fltr) -> [[None, 4], [None], [2]]
fltr = kd.slice([kd.present, kd.present, None])
kd.select(val, fltr) -> [[1, None, 4], [None], []]
kd.select(val, fltr, expand_filter=False) -> [[1, None, 4], [None]]
Args:
ds: DataSlice with ndim > 0 to be filtered.
fltr: filter DataSlice with dtype as kd.MASK. It can also be a Koda Functor
or a Python function which can be evaluated to such a DataSlice. A Python
function will be traced for evaluation, so it cannot have Python control
flow operations such as `if` or `while`.
expand_filter: flag indicating if the 'filter' should be expanded to 'ds'
Returns:
Filtered DataSlice.
kd.slices.select_present(ds)
Aliases:
Creates a new DataSlice by removing missing items.
It is not supported for DataItems because their sizes are always 1.
Example:
val = kd.slice([[1, None, 4], [None], [2, 8]])
kd.select_present(val) -> [[1, 4], [], [2, 8]]
Args:
ds: DataSlice with ndim > 0 to be filtered.
Returns:
Filtered DataSlice.
kd.slices.size(x)
Aliases:
Returns the number of items in `x`, including missing items.
Args:
x: A DataSlice.
Returns:
The size of `x`.
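A minimal sketch (assuming the kd.size and kd.count aliases), contrasting
kd.size with kd.count, which skips missing items:
  ds = kd.slice([[1, None], [3]])
  kd.size(ds)   # -> 3: missing items are counted
  kd.count(ds)  # -> 2: only present items are counted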
kd.slices.slice(x, /, schema=None)
Aliases:
Returns a DataSlice created from `x`.
If `schema` is set, that schema is used, otherwise the schema is inferred from
`x`.
Args:
x: a Python value or a DataSlice. If it is a (nested) Python list or tuple,
a multidimensional DataSlice is created.
schema: schema DataItem to set. If `x` is already a DataSlice, this will
cast it to the given schema.
kd.slices.sort(x, sort_by=unspecified, descending=False)
Aliases:
Sorts the items in `x` over the last dimension.
When `sort_by` is specified, it is used to sort items in `x`. `sort_by` must
have the same shape as `x` and cannot be more sparse than `x`. Otherwise,
items in `x` are compared by their values. Missing items are put at the end of
the sorted list regardless of the value of `descending`.
Examples:
ds = kd.slice([[[2, 1, None, 4], [4, 1]], [[5, 4, None]]])
kd.sort(ds) -> kd.slice([[[1, 2, 4, None], [1, 4]], [[4, 5, None]]])
kd.sort(ds, descending=True) ->
kd.slice([[[4, 2, 1, None], [4, 1]], [[5, 4, None]]])
sort_by = kd.slice([[[9, 2, 1, 3], [2, 3]], [[9, 7, 9]]])
kd.sort(ds, sort_by) ->
kd.slice([[[None, 1, 4, 2], [4, 1]], [[4, 5, None]]])
kd.sort(kd.slice([1, 2, 3]), kd.slice([5, 4])) ->
raise due to different shapes
kd.sort(kd.slice([1, 2, 3]), kd.slice([5, 4, None])) ->
raise as `sort_by` is more sparse than `x`
Args:
x: DataSlice to sort.
sort_by: DataSlice used for comparisons.
descending: whether to do descending sort.
Returns:
DataSlice with last dimension sorted.
kd.slices.stack(*args, ndim=0)
Aliases:
Stacks the given DataSlices, creating a new dimension at index `rank-ndim`.
The given DataSlices must have the same rank, and the shapes of the first
`rank-ndim` dimensions must match. If they have incompatible shapes, consider
using `kd.align(*args)`, `arg.repeat(...)`, or `arg.expand_to(other_arg, ...)`
to bring them to compatible shapes first.
The result has the following shape:
1) the shape of the first `rank-ndim` dimensions remains the same
2) a new dimension is inserted at `rank-ndim` with uniform shape `len(args)`
3) the shapes of the last `ndim` dimensions are interleaved within the
groups implied by the newly-inserted dimension
Alternatively, if we think of each input DataSlice as a nested Python list,
this operator simultaneously iterates over the inputs at depth `rank-ndim`,
wrapping the corresponding nested sub-lists from each input in new lists.
For example,
a = kd.slice([[1, None, 3], [4]])
b = kd.slice([[7, 7, 7], [7]])
kd.stack(a, b, ndim=0) -> [[[1, 7], [None, 7], [3, 7]], [[4, 7]]]
kd.stack(a, b, ndim=1) -> [[[1, None, 3], [7, 7, 7]], [[4], [7]]]
kd.stack(a, b, ndim=2) -> [[[1, None, 3], [4]], [[7, 7, 7], [7]]]
kd.stack(a, b, ndim=4) -> raise an exception
kd.stack(a, b) -> the same as kd.stack(a, b, ndim=0)
Args:
*args: The DataSlices to stack.
ndim: The number of last dimensions to stack (default 0).
Returns:
The stacked DataSlice. If the input DataSlices come from different DataBags,
this will refer to a merged immutable DataBag.
kd.slices.str(x)
Aliases:
Returns kd.slice(x, kd.STRING).
kd.slices.subslice(x, *slices)
Aliases:
Slices `x` across all of its dimensions based on the provided `slices`.
`slices` is a variadic argument for slicing arguments, where each individual
slicing argument can be one of the following:
1) INT32/INT64 DataItem or Python integer wrapped into INT32 DataItem. It is
used to select a single item in one dimension. It reduces the number of
dimensions in the resulting DataSlice by 1.
2) INT32/INT64 DataSlice. It is used to select multiple items in one
dimension.
3) Python slice (e.g. slice(1), slice(1, 3), slice(2, -1)). It is used to
select a slice of items in one dimension. 'step' is not supported and it
results in no item if 'start' is larger than or equal to 'stop'. 'start'
and 'stop' can be either Python integers, DataItems or DataSlices, in
the latter case we can select a different range for different items,
or even select multiple ranges for the same item if the 'start'
or 'stop' have more dimensions. If an item is missing either in 'start'
or in 'stop', the corresponding slice is considered empty.
4) .../Ellipsis. It can appear at most once in `slices` and is used to fill
corresponding dimensions in `x` but missing in `slices`. It means
selecting all items in these dimensions.
If the Ellipsis is not provided, it is added to the **beginning** of `slices`
by default, which is different from Numpy. Each individual slicing argument is
used to slice the corresponding dimension in `x`.
The slicing algorithm can be thought of as:
1) implode `x` recursively to a List DataItem
2) explode the List DataItem recursively with the slicing arguments (i.e.
imploded_x[slice])
Example 1:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, 0)
=> kd.slice([[1, 3], [4], [7, 8]])
Example 2:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, 0, 1, kd.item(0))
=> kd.item(3)
Example 3:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, slice(0, -1))
=> kd.slice([[[1], []], [[4, 5]], [[], [8]]])
Example 4:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, slice(0, -1), slice(0, 1), slice(1, None))
=> kd.slice([[[2], []], [[5, 6]]])
Example 5 (also see Example 6/7 for using DataSlices for subslicing):
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, kd.slice([1, 2]), kd.slice([[0, 0], [1, 0]]), kd.slice(0))
=> kd.slice([[4, 4], [8, 7]])
Example 6:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, kd.slice([1, 2]), ...)
=> kd.slice([[[4, 5, 6]], [[7], [8, 9]]])
Example 7:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, kd.slice([1, 2]), kd.slice([[0, 0], [1, 0]]), ...)
=> kd.slice([[[4, 5, 6], [4, 5, 6]], [[8, 9], [7]]])
Example 8:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, ..., slice(1, None))
=> kd.slice([[[2], []], [[5, 6]], [[], [9]]])
Example 9:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, 2, ..., slice(1, None))
=> kd.slice([[], [9]])
Example 10:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, ..., 2, ...)
=> error as ellipsis can only appear once
Example 11:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, 1, 2, 3, 4)
=> error as at most 3 slicing arguments can be provided
Example 12:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, slice(kd.slice([0, 1, 2]), None))
=> kd.slice([[[1, 2], [3]], [[5, 6]], [[], []]])
Example 13:
x = kd.slice([[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]])
kd.subslice(x, slice(kd.slice([0, 1, 2]), kd.slice([2, 3, None])), ...)
=> kd.slice([[[[1, 2], [3]], [[4, 5, 6]]], [[[4, 5, 6]], [[7], [8, 9]]],
[]])
Note that there is a shortcut `ds.S[*slices]` for this operator which is more
commonly used, where the Python slice can be written in [start:end] format. For
example:
kd.subslice(x, 0) == x.S[0]
kd.subslice(x, 0, 1, kd.item(0)) == x.S[0, 1, kd.item(0)]
kd.subslice(x, slice(0, -1)) == x.S[0:-1]
kd.subslice(x, slice(0, -1), slice(0, 1), slice(1, None))
== x.S[0:-1, 0:1, 1:]
kd.subslice(x, ..., slice(1, None)) == x.S[..., 1:]
kd.subslice(x, slice(1, None)) == x.S[1:]
Args:
x: DataSlice to slice.
*slices: variadic slicing argument.
Returns:
A DataSlice with selected items
kd.slices.take(x, indices)
Alias for kd.slices.at operator.
kd.slices.tile(x, shape)
Aliases:
Nests the whole `x` under `shape`.
Example 1:
x: [1, 2]
shape: JaggedShape([3])
result: [[1, 2], [1, 2], [1, 2]]
Example 2:
x: [1, 2]
shape: JaggedShape([2], [2, 1])
result: [[[1, 2], [1, 2]], [[1, 2]]]
Args:
x: DataSlice to expand.
shape: JaggedShape.
Returns:
Expanded DataSlice.
kd.slices.translate(keys_to, keys_from, values_from)
Aliases:
Translates `keys_to` based on `keys_from`->`values_from` mapping.
The translation is done by matching keys from `keys_from` to `keys_to` over
the last dimension of `keys_to`. `keys_from` cannot have duplicate keys within
each group of the last dimension. Also see kd.translate_group.
`values_from` is first broadcasted to `keys_from` and the first N-1 dimensions
of `keys_from` and `keys_to` must be the same. The resulting DataSlice has the
same shape as `keys_to` and the same DataBag as `values_from`.
Missing items or items with no matching keys in `keys_from` result in missing
items in the resulting DataSlice.
For example:
keys_to = kd.slice([['a', 'd'], ['c', None]])
keys_from = kd.slice([['a', 'b'], ['c', None]])
values_from = kd.slice([[1, 2], [3, 4]])
kd.translate(keys_to, keys_from, values_from) ->
kd.slice([[1, None], [3, None]])
Args:
keys_to: DataSlice of keys to be translated.
keys_from: DataSlice of keys to be matched.
values_from: DataSlice of values to be matched.
Returns:
A DataSlice of translated values.
kd.slices.translate_group(keys_to, keys_from, values_from)
Aliases:
Translates `keys_to` based on `keys_from`->`values_from` mapping.
The translation is done by creating an additional dimension under `keys_to`
and putting items in `values_from` to this dimension by matching keys from
`keys_from` to `keys_to` over the last dimension of `keys_to`.
`keys_from` can have duplicate keys within each group of the last
dimension.
`values_from` and `keys_from` must have the same shape and the first N-1
dimensions of `keys_from` and `keys_to` must be the same. The shape of
resulting DataSlice is the combination of the shape of `keys_to` and an
injected group_by dimension.
Missing items or items with no matching keys in `keys_from` result in empty
groups in the resulting DataSlice.
For example:
keys_to = kd.slice(['a', 'c', None, 'd', 'e'])
keys_from = kd.slice(['a', 'c', 'b', 'c', 'a', 'e'])
values_from = kd.slice([1, 2, 3, 4, 5, 6])
kd.translate_group(keys_to, keys_from, values_from) ->
kd.slice([[1, 5], [2, 4], [], [], [6]])
Args:
keys_to: DataSlice of keys to be translated.
keys_from: DataSlice of keys to be matched.
values_from: DataSlice of values to be matched.
Returns:
A DataSlice of translated values.
kd.slices.unique(x, sort=False)
Aliases:
Returns a DataSlice with unique values within each dimension.
The resulting DataSlice has the same rank as `x`, but a different shape.
The first `get_ndim(x) - 1` dimensions are unchanged. The last dimension
contains the unique values.
If `sort` is False, elements are ordered by the appearance of the first item.
If `sort` is True:
1. Elements are ordered by the value.
2. Mixed types are not supported.
3. ExprQuote and DType are not supported.
Example 1:
x: kd.slice([1, 3, 2, 1, 2, 3, 1, 3])
sort: False
result: kd.slice([1, 3, 2])
Example 2:
x: kd.slice([[1, 2, 1, 3, 1, 3], [3, 1, 1]])
sort: False
result: kd.slice([[1, 2, 3], [3, 1]])
Example 3:
x: kd.slice([1, 3, 2, 1, None, 3, 1, None])
sort: False
result: kd.slice([1, 3, 2])
Missing values are ignored.
Example 4:
x: kd.slice([[1, 3, 2, 1, 3, 1, 3], [3, 1, 1]])
sort: True
result: kd.slice([[1, 2, 3], [1, 3]])
Args:
x: DataSlice to find unique values in.
sort: whether elements must be ordered by the value.
Returns:
DataSlice with the same rank and schema as `x` with unique values in the
last dimension.
kd.slices.val_like(x, val)
Aliases:
Creates a DataSlice with `val` masked and expanded to the shape of `x`.
Example:
x = kd.slice([[0], [0, None]])
kd.slices.val_like(x, 1) -> kd.slice([[1], [1, None]])
kd.slices.val_like(x, kd.slice([1, 2])) -> kd.slice([[1], [2, None]])
kd.slices.val_like(x, kd.slice([None, 2])) -> kd.slice([[None], [2, None]])
Args:
x: DataSlice to match the shape and sparsity of.
val: DataSlice to expand.
Returns:
A DataSlice with the same shape as `x` and masked by `x`.
kd.slices.val_shaped(shape, val)
Aliases:
Creates a DataSlice with `val` expanded to the given shape.
Example:
shape = kd.shapes.new([2], [1, 2])
kd.slices.val_shaped(shape, 1) -> kd.slice([[1], [1, 1]])
kd.slices.val_shaped(shape, kd.slice([None, 2])) -> kd.slice([[None], [2, 2]])
Args:
shape: shape to expand to.
val: value to expand.
Returns:
A DataSlice with the same shape as `shape`.
kd.slices.val_shaped_as(x, val)
Aliases:
Creates a DataSlice with `val` expanded to the shape of `x`.
Example:
x = kd.slice([[0], [0, 0]])
kd.slices.val_shaped_as(x, 1) -> kd.slice([[1], [1, 1]])
kd.slices.val_shaped_as(x, kd.slice([None, 2])) -> kd.slice([[None], [2, 2]])
Args:
x: DataSlice to match the shape of.
val: DataSlice to expand.
Returns:
A DataSlice with the same shape as `x`.
kd.slices.zip(*args)
Aliases:
Zips the given DataSlices into a new DataSlice with a new last dimension.
Input DataSlices are automatically aligned. The result has the shape of the
aligned inputs, plus a new last dimension with uniform shape `len(args)`
containing the values from each input.
For example,
a = kd.slice([1, 2, 3, 4])
b = kd.slice([5, 6, 7, 8])
c = kd.slice(['a', 'b', 'c', 'd'])
kd.zip(a, b, c) -> [[1, 5, 'a'], [2, 6, 'b'], [3, 7, 'c'], [4, 8, 'd']]
a = kd.slice([[1, None, 3], [4]])
b = kd.slice([7, None])
kd.zip(a, b) -> [[[1, 7], [None, 7], [3, 7]], [[4, None]]]
Args:
*args: The DataSlices to zip.
Returns:
The zipped DataSlice. If the input DataSlices come from different DataBags,
this will refer to a merged immutable DataBag.
Operators that work with strings data.
Operators
kd.strings.agg_join(x, sep=DataItem(None, schema: NONE), ndim=unspecified)
Returns a DataSlice of strings joined on last ndim dimensions.
Example:
ds = kd.slice([['el', 'psy', 'congroo'], ['a', 'b', 'c']])
kd.agg_join(ds, ' ') # -> kd.slice(['el psy congroo', 'a b c'])
kd.agg_join(ds, ' ', ndim=2) # -> kd.slice('el psy congroo a b c')
Args:
x: String or bytes DataSlice
sep: If specified, joins with the given separator; otherwise, joins with an
empty string.
ndim: The number of dimensions to aggregate over. Requires 0 <= ndim
<= get_ndim(x).
kd.strings.contains(s, substr)
Returns present iff `s` contains `substr`.
Examples:
kd.strings.contains(kd.slice(['Hello', 'Goodbye']), 'lo')
# -> kd.slice([kd.present, kd.missing])
kd.strings.contains(
kd.slice([b'Hello', b'Goodbye']),
kd.slice([b'lo', b'Go']))
# -> kd.slice([kd.present, kd.present])
Args:
s: The strings to consider. Must have schema STRING or BYTES.
substr: The substrings to look for in `s`. Must have the same schema as `s`.
Returns:
The DataSlice of present/missing values with schema MASK.
kd.strings.count(s, substr)
Counts the number of occurrences of `substr` in `s`.
Examples:
kd.strings.count(kd.slice(['Hello', 'Goodbye']), 'l')
# -> kd.slice([2, 0])
kd.strings.count(
kd.slice([b'Hello', b'Goodbye']),
kd.slice([b'Hell', b'o']))
# -> kd.slice([1, 2])
Args:
s: The strings to consider.
substr: The substrings to count in `s`. Must have the same schema as `s`.
Returns:
The DataSlice of INT32 counts.
kd.strings.decode(x)
Decodes `x` as STRING using UTF-8 decoding.
kd.strings.decode_base64(x, /, *, on_invalid=unspecified)
Decodes BYTES from `x` using base64 encoding (RFC 4648 section 4).
The input strings may either have no padding, or must have the correct amount
of padding. ASCII whitespace characters anywhere in the string are ignored.
Args:
x: DataSlice of STRING or BYTES containing base64-encoded strings.
on_invalid: If unspecified (the default), any invalid base64 strings in `x`
will cause an error. Otherwise, this must be a DataSlice broadcastable to
`x` with a schema compatible with BYTES, and will be used in the result
wherever the input string was not valid base64.
Returns:
DataSlice of BYTES.
kd.strings.encode(x)
Encodes `x` as BYTES using UTF-8 encoding.
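A minimal UTF-8 round-trip sketch for kd.strings.encode / kd.strings.decode:
  kd.strings.encode(kd.item('abc'))   # -> kd.item(b'abc')
  kd.strings.decode(kd.item(b'abc'))  # -> kd.item('abc')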
kd.strings.encode_base64(x)
Encodes BYTES `x` using base64 encoding (RFC 4648 section 4), with padding.
Args:
x: DataSlice of BYTES to encode.
Returns:
DataSlice of STRING.
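A minimal base64 round-trip sketch ('Zm9v' is the standard base64 encoding of
b'foo'):
  encoded = kd.strings.encode_base64(kd.slice([b'foo']))  # -> kd.slice(['Zm9v'])
  kd.strings.decode_base64(encoded)                       # -> kd.slice([b'foo'])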
kd.strings.find(s, substr, start=0, end=None)
Returns the offset of the first occurrence of `substr` in `s`.
The units of `start`, `end`, and the return value are all byte offsets if `s`
is `BYTES` and codepoint offsets if `s` is `STRING`.
Args:
s: (STRING or BYTES) Strings to search in.
substr: (STRING or BYTES) Strings to search for in `s`. Should have the same
dtype as `s`.
start: (optional int) Offset to start the search, defaults to 0.
end: (optional int) Offset to stop the search, defaults to end of the string.
Returns:
The offset of the first occurrence of `substr` in `s`, or missing if there
are no occurrences.
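A minimal sketch (offsets follow Python str.find semantics):
  kd.strings.find(kd.slice(['hello', 'world']), 'o')  # -> kd.slice([4, 1])
  kd.strings.find(kd.item('hello'), 'z')              # -> missing (no match)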
kd.strings.format(fmt, /, **kwargs)
Aliases:
Formats strings according to python str.format style.
Format support is slightly different from Python:
1. {x:v} is equivalent to {x} and supported for all types as default string
format.
2. Only float and integers support other format specifiers.
E.g., {x:.1f} and {x:04d}.
3. If the format is missing the type specifier `f` or `d` at the end, it is
added automatically based on the type of the argument.
Note: only keyword arguments are supported.
Examples:
kd.strings.format(kd.slice(['Hello {n}!', 'Goodbye {n}!']), n='World')
# -> kd.slice(['Hello World!', 'Goodbye World!'])
kd.strings.format('{a} + {b} = {c}', a=1, b=2, c=3)
# -> kd.slice('1 + 2 = 3')
kd.strings.format(
'{a} + {b} = {c}',
a=kd.slice([1, 2]),
b=kd.slice([2, 3]),
c=kd.slice([3, 5]))
# -> kd.slice(['1 + 2 = 3', '2 + 3 = 5'])
kd.strings.format(
'({a:03} + {b:e}) * {c:.2f} ='
' {a:02d} * {c:3d} + {b:07.3f} * {c:08.4f}',
a=5, b=5.7, c=75)
# -> kd.slice(
# '(005 + 5.700000e+00) * 75.00 = 05 * 75 + 005.700 * 075.0000')
Args:
fmt: Format string (String or Bytes).
**kwargs: Arguments to format.
Returns:
The formatted string.
kd.strings.fstr(x)
Aliases:
Evaluates Koladata f-string into DataSlice.
f-string must be created via Python f-string syntax. It must contain at least
one formatted DataSlice.
Each DataSlice must have a custom format specification,
e.g. `{ds:s}` or `{ds:.2f}`.
Find more about format specification in kd.strings.format docs.
NOTE: `{ds:s}` can be used for any type to achieve default string conversion.
Examples:
countries = kd.slice(['USA', 'Schweiz'])
kd.fstr(f'Hello, {countries:s}!')
# -> kd.slice(['Hello, USA!', 'Hello, Schweiz!'])
greetings = kd.slice(['Hello', 'Gruezi'])
kd.fstr(f'{greetings:s}, {countries:s}!')
# -> kd.slice(['Hello, USA!', 'Gruezi, Schweiz!'])
states = kd.slice([['California', 'Arizona', 'Nevada'], ['Zurich', 'Bern']])
kd.fstr(f'{greetings:s}, {states:s} in {countries:s}!')
# -> kd.slice([
#      ['Hello, California in USA!',
#       'Hello, Arizona in USA!',
#       'Hello, Nevada in USA!'],
#      ['Gruezi, Zurich in Schweiz!',
#       'Gruezi, Bern in Schweiz!']])
prices = kd.slice([35.5, 49.2])
currencies = kd.slice(['USD', 'CHF'])
kd.fstr(f'Lunch price in {countries:s} is {prices:.2f} {currencies:s}.')
# -> kd.slice(['Lunch price in USA is 35.50 USD.',
#              'Lunch price in Schweiz is 49.20 CHF.'])
Args:
x: f-string to evaluate.
Returns:
DataSlice with evaluated f-string.
kd.strings.join(*args)
Concatenates the given strings.
Examples:
kd.strings.join(kd.slice(['Hello ', 'Goodbye ']), 'World')
# -> kd.slice(['Hello World', 'Goodbye World'])
kd.strings.join(kd.slice([b'foo']), kd.slice([b' ']), kd.slice([b'bar']))
# -> kd.slice([b'foo bar'])
Args:
*args: The inputs to concatenate in the given order.
Returns:
The string concatenation of all the inputs.
kd.strings.length(x)
Returns a DataSlice of lengths: in bytes for BYTES, in codepoints for STRING.
For example,
kd.strings.length(kd.slice(['abc', None, ''])) -> kd.slice([3, None, 0])
kd.strings.length(kd.slice([b'abc', None, b''])) -> kd.slice([3, None, 0])
kd.strings.length(kd.item('你好')) -> kd.item(2)
kd.strings.length(kd.item('你好'.encode())) -> kd.item(6)
Note that the result DataSlice always has INT32 schema.
Args:
x: String or Bytes DataSlice.
Returns:
A DataSlice of lengths.
kd.strings.lower(x)
Returns a DataSlice with the lowercase version of each string in the input.
For example,
kd.strings.lower(kd.slice(['AbC', None, ''])) -> kd.slice(['abc', None, ''])
kd.strings.lower(kd.item('FOO')) -> kd.item('foo')
Note that the result DataSlice always has STRING schema.
Args:
x: String DataSlice.
Returns:
A String DataSlice of lowercase strings.
kd.strings.lstrip(s, chars=DataItem(None, schema: NONE))
Strips whitespaces or the specified characters from the left side of `s`.
If `chars` is missing, then whitespaces are removed.
If `chars` is present, then it will strip all leading characters from `s`
that are present in the `chars` set.
Examples:
kd.strings.lstrip(kd.slice([' spacious ', '\t text \n']))
# -> kd.slice(['spacious ', 'text \n'])
kd.strings.lstrip(kd.slice(['www.example.com']), kd.slice(['cmowz.']))
# -> kd.slice(['example.com'])
kd.strings.lstrip(kd.slice([['#... Section 3.1 Issue #32 ...'], ['# ...']]),
kd.slice('.#! '))
# -> kd.slice([['Section 3.1 Issue #32 ...'], ['']])
Args:
s: (STRING or BYTES) Original string.
chars: (Optional STRING or BYTES, the same as `s`): The set of chars to
remove.
Returns:
Stripped string.
kd.strings.printf(fmt, *args)
Formats strings according to printf-style (C++) format strings.
See absl::StrFormat documentation for the format string details.
Example:
kd.strings.printf(kd.slice(['Hello %s!', 'Goodbye %s!']), 'World')
# -> kd.slice(['Hello World!', 'Goodbye World!'])
kd.strings.printf('%v + %v = %v', 1, 2, 3) # -> kd.slice('1 + 2 = 3')
Args:
fmt: Format string (String or Bytes).
*args: Arguments to format (primitive types compatible with `fmt`).
Returns:
The formatted string.
kd.strings.regex_extract(text, regex)
Extracts a substring from `text` with the capturing group of `regex`.
Regular expression matches are partial, which means `regex` is matched against
a substring of `text`.
For full matches, where the whole string must match a pattern, please enclose
the pattern in `^` and `$` characters.
The pattern must contain exactly one capturing group.
Examples:
kd.strings.regex_extract(kd.item('foo'), kd.item('f(.)'))
# kd.item('o')
kd.strings.regex_extract(kd.item('foobar'), kd.item('o(..)'))
# kd.item('ob')
kd.strings.regex_extract(kd.item('foobar'), kd.item('^o(..)$'))
# kd.item(None).with_schema(kd.STRING)
kd.strings.regex_extract(kd.item('foobar'), kd.item('^.o(..)a.$'))
# kd.item('ob')
kd.strings.regex_extract(kd.item('foobar'), kd.item('.*(b.*r)$'))
# kd.item('bar')
kd.strings.regex_extract(kd.slice(['abcd', None, '']), kd.slice('b(.*)'))
# -> kd.slice(['cd', None, None])
Args:
text: (STRING) A string.
regex: (STRING) A scalar string that represents a regular expression (RE2
syntax) with exactly one capturing group.
Returns:
For the first partial match of `regex` and `text`, returns the substring of
`text` that matches the capturing group of `regex`.
kd.strings.regex_find_all(text, regex)
Returns the captured groups of all matches of `regex` in `text`.
The strings in `text` are scanned left-to-right to find all non-overlapping
matches of `regex`. The order of the matches is preserved in the result. For
each match, the substring matched by each capturing group of `regex` is
recorded. For each item of `text`, the result contains a 2-dimensional value,
where the first dimension captures the number of matches, and the second
dimension captures the captured groups.
Examples:
# No capturing groups, but two matches:
kd.strings.regex_find_all(kd.item('foo'), kd.item('o'))
# -> kd.slice([[], []])
# One capturing group, three matches:
kd.strings.regex_find_all(kd.item('foo'), kd.item('(.)'))
# -> kd.slice([['f'], ['o'], ['o']])
# Two capturing groups:
kd.strings.regex_find_all(
kd.slice(['fooz', 'bar', '', None]),
kd.item('(.)(.)')
)
# -> kd.slice([[['f', 'o'], ['o', 'z']], [['b', 'a']], [], []])
# Get information about the entire substring of each non-overlapping match
# by enclosing the pattern in additional parentheses:
kd.strings.regex_find_all(
kd.slice([['fool', 'solo'], ['bar', 'boat']]),
kd.item('((.*)o)')
)
# -> kd.slice([[[['foo', 'fo']], [['solo', 'sol']]], [[], [['bo', 'b']]]])
Args:
text: (STRING) A string.
regex: (STRING) A scalar string that represents a regular expression (RE2
syntax).
Returns:
A DataSlice where each item of `text` is associated with a 2-dimensional
representation of its matches' captured groups.
kd.strings.regex_match(text, regex)
Returns `present` if `text` matches the regular expression `regex`.
Matches are partial, which means a substring of `text` matches the pattern.
For full matches, where the whole string must match a pattern, please enclose
the pattern in `^` and `$` characters.
Examples:
kd.strings.regex_match(kd.item('foo'), kd.item('oo'))
# -> kd.present
kd.strings.regex_match(kd.item('foo'), '^oo$')
# -> kd.missing
kd.strings.regex_match(kd.item('foo'), '^foo$')
# -> kd.present
kd.strings.regex_match(kd.slice(['abc', None, '']), 'b')
# -> kd.slice([kd.present, kd.missing, kd.missing])
kd.strings.regex_match(kd.slice(['abcd', None, '']), kd.slice('b.d'))
# -> kd.slice([kd.present, kd.missing, kd.missing])
Args:
text: (STRING) A string.
regex: (STRING) A scalar string that represents a regular expression (RE2
syntax).
Returns:
`present` if `text` matches `regex`.
kd.strings.regex_replace_all(text, regex, replacement)
Replaces all non-overlapping matches of `regex` in `text`.
Examples:
# Basic with match:
kd.strings.regex_replace_all(
kd.item('banana'),
kd.item('ana'),
kd.item('ono')
) # -> kd.item('bonona')
# Basic with no match:
kd.strings.regex_replace_all(
kd.item('banana'),
kd.item('x'),
kd.item('a')
) # -> kd.item('banana')
# Reference the first capturing group in the replacement:
kd.strings.regex_replace_all(
kd.item('banana'),
kd.item('a(.)a'),
kd.item(r'o\1\1o')
) # -> kd.item('bonnona')
# Reference the whole match in the replacement with \0:
kd.strings.regex_replace_all(
kd.item('abcd'),
kd.item('(.)(.)'),
kd.item(r'\2\1\0')
) # -> kd.item('baabdccd')
# With broadcasting:
kd.strings.regex_replace_all(
kd.item('foopo'),
kd.item('o'),
kd.slice(['a', 'e']),
) # -> kd.slice(['faapa', 'feepe'])
# With missing values:
kd.strings.regex_replace_all(
kd.slice(['foobor', 'foo', None, 'bar']),
kd.item('o(.)'),
kd.slice([r'\0x\1', 'ly', 'a', 'o']),
) # -> kd.slice(['fooxoborxr', 'fly', None, 'bar'])
Args:
text: (STRING) A string.
regex: (STRING) A scalar string that represents a regular expression (RE2
syntax).
replacement: (STRING) A string that should replace each match.
Backslash-escaped digits (\1 to \9) can be used to reference the text that
matched the corresponding capturing group from the pattern, while \0
refers to the entire match. Replacements are not subject to re-matching.
Since it only replaces non-overlapping matches, replacing "ana" within
"banana" makes only one replacement, not two.
Returns:
The text string where the replacements have been made.
kd.strings.replace(s, old, new, max_subs=None)
Replaces up to `max_subs` occurrences of `old` within `s` with `new`.
If `max_subs` is missing or negative, then there is no limit on the number of
substitutions. If it is zero, then `s` is returned unchanged.
If the search string is empty, the original string is fenced with the
replacement string, for example: replace("ab", "", "-") returns "-a-b-". That
behavior is similar to Python's string replace.
Args:
s: (STRING or BYTES) Original string.
old: (STRING or BYTES, the same as `s`) String to replace.
new: (STRING or BYTES, the same as `s`) Replacement string.
max_subs: (optional INT32) Max number of substitutions. If unspecified or
negative, then there is no limit on the number of substitutions.
Returns:
String with applied substitutions.
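Example (an illustrative sketch based on the semantics above; not from the
original docstring):
kd.strings.replace(kd.item('banana'), 'an', 'un')
# -> kd.item('bununa')
kd.strings.replace(kd.item('banana'), 'an', 'un', max_subs=1)
# -> kd.item('bunana')
kd.strings.replace(kd.item('ab'), '', '-')
# -> kd.item('-a-b-')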
kd.strings.rfind(s, substr, start=0, end=None)
Returns the offset of the last occurrence of `substr` in `s`.
The units of `start`, `end`, and the return value are all byte offsets if `s`
is `BYTES` and codepoint offsets if `s` is `STRING`.
Args:
s: (STRING or BYTES) Strings to search in.
substr: (STRING or BYTES) Strings to search for in `s`. Should have the same
dtype as `s`.
start: (optional int) Offset to start the search, defaults to 0.
end: (optional int) Offset to stop the search, defaults to end of the string.
Returns:
The offset of the last occurrence of `substr` in `s`, or missing if there
are no occurrences.
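Example (an illustrative sketch based on the semantics above):
kd.strings.rfind(kd.item('abcabc'), kd.item('b'))
# -> kd.item(4)
kd.strings.rfind(kd.slice(['hello', 'world']), 'o')
# -> kd.slice([4, 1])
kd.strings.rfind(kd.item('hello'), 'x')
# -> missing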
kd.strings.rstrip(s, chars=DataItem(None, schema: NONE))
Strips whitespaces or the specified characters from the right side of `s`.
If `chars` is missing, then whitespaces are removed.
If `chars` is present, then it will strip all trailing characters from `s` that
are present in the `chars` set.
Examples:
kd.strings.rstrip(kd.slice([' spacious ', '\t text \n']))
# -> kd.slice([' spacious', '\t text'])
kd.strings.rstrip(kd.slice(['www.example.com']), kd.slice(['cmowz.']))
# -> kd.slice(['www.example'])
kd.strings.rstrip(kd.slice([['#... Section 3.1 Issue #32 ...'], ['# ...']]),
kd.slice('.#! '))
# -> kd.slice([['#... Section 3.1 Issue #32'], ['']])
Args:
s: (STRING or BYTES) Original string.
chars: (Optional STRING or BYTES, the same as `s`): The set of chars to
remove.
Returns:
Stripped string.
kd.strings.split(x, sep=DataItem(None, schema: NONE))
Returns x split by the provided separator.
Example:
ds = kd.slice(['Hello world!', 'Goodbye world!'])
kd.split(ds) # -> kd.slice([['Hello', 'world!'], ['Goodbye', 'world!']])
Args:
x: DataSlice (STRING or BYTES).
sep: If specified, splits by the given string and does not omit empty
strings; otherwise splits by whitespace and omits empty strings.
kd.strings.strip(s, chars=DataItem(None, schema: NONE))
Strips whitespaces or the specified characters from both sides of `s`.
If `chars` is missing, then whitespaces are removed.
If `chars` is present, then it will strip all leading and trailing characters
from `s` that are present in the `chars` set.
Examples:
kd.strings.strip(kd.slice([' spacious ', '\t text \n']))
# -> kd.slice(['spacious', 'text'])
kd.strings.strip(kd.slice(['www.example.com']), kd.slice(['cmowz.']))
# -> kd.slice(['example'])
kd.strings.strip(kd.slice([['#... Section 3.1 Issue #32 ...'], ['# ...']]),
kd.slice('.#! '))
# -> kd.slice([['Section 3.1 Issue #32'], ['']])
Args:
s: (STRING or BYTES) Original string.
chars: (Optional STRING or BYTES, the same as `s`): The set of chars to
remove.
Returns:
Stripped string.
kd.strings.substr(x, start=0, end=None)
Returns a DataSlice of substrings with indices [start, end).
The usual Python rules apply:
* A negative index is computed from the end of the string.
* An empty range yields an empty string, for example when start >= end and
both are positive.
The result is broadcasted to the common shape of all inputs.
Examples:
ds = kd.slice([['Hello World!', 'Ciao bella'], ['Dolly!']])
kd.substr(ds) # -> kd.slice([['Hello World!', 'Ciao bella'],
['Dolly!']])
kd.substr(ds, 5) # -> kd.slice([[' World!', 'bella'], ['!']])
kd.substr(ds, -2) # -> kd.slice([['d!', 'la'], ['y!']])
kd.substr(ds, 1, 5) # -> kd.slice([['ello', 'iao '], ['olly']])
kd.substr(ds, 5, -1) # -> kd.slice([[' World', 'bell'], ['']])
kd.substr(ds, 4, 100) # -> kd.slice([['o World!', ' bella'], ['y!']])
kd.substr(ds, -1, -2) # -> kd.slice([['', ''], ['']])
kd.substr(ds, -2, -1) # -> kd.slice([['d', 'l'], ['y']])
# Start and end may also be multidimensional.
ds = kd.slice('Hello World!')
start = kd.slice([1, 2])
end = kd.slice([[2, 3], [4]])
kd.substr(ds, start, end) # -> kd.slice([['e', 'el'], ['ll']])
Args:
x: Text or Bytes DataSlice. If text, then `start` and `end` are codepoint
offsets. If bytes, then `start` and `end` are byte offsets.
start: The start index of the substring. Inclusive. Assumed to be 0 if
unspecified.
end: The end index of the substring. Exclusive. Assumed to be the length of
the string if unspecified.
kd.strings.upper(x)
Returns a DataSlice with the uppercase version of each string in the input.
For example,
kd.strings.upper(kd.slice(['abc', None, ''])) -> kd.slice(['ABC', None, ''])
kd.strings.upper(kd.item('foo')) -> kd.item('FOO')
Note that the result DataSlice always has STRING schema.
Args:
x: String DataSlice.
Returns:
A String DataSlice of uppercase strings.
Operators that work with streams of items. These APIs are in active development and might change often (b/424742492).
Operators
kd.streams.await_(arg)
Indicates to kd.streams.call that the argument should be awaited.
This operator acts as a marker. When the returned value is passed to
`kd.streams.call`, it signals that `kd.streams.call` should await
the underlying stream to yield a single item. This single item is then
passed to the functor.
Importantly, `kd.streams.await_` itself does not perform any awaiting or
blocking. If the input `arg` is not a stream, this operator returns `arg`
unchanged.
Note: `kd.streams.call` expects an awaited stream to yield exactly one item.
Producing zero or more than one item from an awaited stream will result in
an error during the `kd.streams.call` evaluation.
Args:
arg: The input argument (the operator has effect only if `arg` is a stream).
Returns:
If `arg` was a stream, it gets labeled with 'AWAIT'. If `arg` was not
a stream, `arg` is returned without modification.
kd.streams.call(fn, *args, executor=unspecified, return_type_as=DataItem(None, schema: NONE), **kwargs)
Calls a functor on the given executor and yields the result(s) as a stream.
For stream arguments tagged with `kd.streams.await_`, `kd.streams.call` first
awaits the corresponding input streams. Each of these streams is expected to
yield exactly one item, which is then passed as the argument to the functor
`fn`. If a labeled stream is empty or yields more than one item, it is
considered an error.
The `return_type_as` parameter specifies the return type of the functor `fn`.
Unless the return type is already a stream, the result of `kd.streams.call` is
a `STREAM[return_type]` storing a single value returned by the functor.
However, if `return_type_as` is a stream, the result of `kd.streams.call` is
of the same stream type, holding the same items as the stream returned by
the functor.
It's recommended to specify the same `return_type_as` for `kd.streams.call`
as you would for a regular `kd.call`.
Importantly, `kd.streams.call` supports the case when `return_type_as` is
non-stream while the functor actually returns `STREAM[return_type]`. This
enables nested `kd.streams.call` calls.
Args:
fn: The functor to be called, typically created via kd.fn().
*args: The positional arguments to pass to the call. The stream arguments
tagged with `kd.streams.await_` will be awaited before the call, and
expected to yield exactly one item.
executor: The executor to use for computations.
return_type_as: The return type of the functor `fn` call.
**kwargs: The keyword arguments to pass to the call. Scalars will be
auto-boxed to DataItems.
Returns:
If the return type of the functor (as specified by `return_type_as`) is
a non-stream type, the result of `kd.streams.call` is a single-item stream
with the functor's return value. Otherwise, the result is a stream of
the same type as `return_type_as`, containing the same items as the stream
returned by the functor.
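Example (a minimal sketch of awaiting a stream argument; assumes a functor
traced via `kd.fn`):
```
kd.streams.call(
    kd.fn(lambda x, y: x + y),
    kd.streams.await_(kd.streams.make(1)),  # must yield exactly one item
    y=2,
)
```
result: A single-item stream containing 3.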
kd.streams.chain(*streams, value_type_as=unspecified)
Creates a stream that chains the given streams, in the given order.
The streams must all have the same value type. If value_type_as is
specified, it must be the same as the value type of the streams, if any.
Args:
*streams: A list of streams to be chained (concatenated).
value_type_as: A value that has the same type as the items in the streams.
It is useful to specify this explicitly if the list of streams may be
empty. If this is not specified and the list of streams is empty, the
stream will have DATA_SLICE as the value type.
Returns:
A stream that chains the given streams in the given order.
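Example (illustrative):
```
kd.streams.chain(
    kd.streams.make(1, 2, 3),
    kd.streams.make(4, 5),
)
```
result: A stream with items [1, 2, 3, 4, 5].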
kd.streams.chain_from_stream(stream_of_streams)
Creates a stream that chains the given streams.
The resulting stream has all items from the first sub-stream, then all items
from the second sub-stream, and so on.
Example:
```
kd.streams.chain_from_stream(
kd.streams.make(
kd.streams.make(1, 2, 3),
kd.streams.make(4),
kd.streams.make(5, 6),
)
)
```
result: A stream with items [1, 2, 3, 4, 5, 6].
Args:
stream_of_streams: A stream of input streams.
Returns:
A stream that chains the input streams.
kd.streams.current_executor()
Returns the current executor.
If the current computation is running on an executor, this operator
returns it. If no executor is set for the current context, this operator
returns an error.
Note: For convenience, in Python environments the default executor
(see `get_default_executor`) is implicitly set as the current executor.
However, this might not be the case in other environments.
kd.streams.flat_map_chained(stream, fn, *, executor=unspecified, value_type_as=DataItem(None, schema: NONE))
Executes flat maps over the given stream.
`fn` is called for each item in the input stream, and it must return a new
stream. The streams returned by `fn` are then chained to produce the final
result.
Example:
```
kd.streams.flat_map_chained(
kd.streams.make(1, 10),
lambda x: kd.streams.make(x, x * 2, x * 3),
)
```
result: A stream with items [1, 2, 3, 10, 20, 30].
Args:
stream: The stream to iterate over.
fn: The function to be executed for each item in the stream. It will receive
the stream item as the positional argument and must return a stream of
values compatible with value_type_as.
executor: An executor for scheduling asynchronous operations.
value_type_as: The type to use as element type of the resulting stream.
Returns:
The chained results of the `fn` calls.
kd.streams.flat_map_interleaved(stream, fn, *, executor=unspecified, value_type_as=DataItem(None, schema: NONE))
Executes flat maps over the given stream.
`fn` is called for each item in the input stream, and it must return a new
stream. The streams returned by `fn` are then interleaved to produce the final
result. Note that while the internal order of items within each stream
returned by `fn` is preserved, the overall order of items from different
streams is not guaranteed.
Example:
```
kd.streams.flat_map_interleaved(
kd.streams.make(1, 10),
lambda x: kd.streams.make(x, x * 2, x * 3),
)
```
result: A stream with items {1, 2, 3, 10, 20, 30}. While the relative
order within {1, 2, 3} and {10, 20, 30} is guaranteed, the overall order
of items is unspecified. For instance, the following orderings are both
possible:
* [1, 10, 2, 20, 3, 30]
* [10, 20, 30, 1, 2, 3]
Args:
stream: The stream to iterate over.
fn: The function to be executed for each item in the stream. It will receive
the stream item as the positional argument and must return a stream of
values compatible with value_type_as.
executor: An executor for scheduling asynchronous operations.
value_type_as: The type to use as element type of the resulting stream.
Returns:
The resulting interleaved results of `fn` calls.
kd.streams.foreach(stream, body_fn, *, finalize_fn=unspecified, condition_fn=unspecified, executor=unspecified, returns=unspecified, yields=unspecified, yields_interleaved=unspecified, **initial_state)
Executes a loop over the given stream.
Exactly one of `returns`, `yields`, `yields_interleaved` must be specified,
and that dictates what this operator returns.
When `returns` is specified, it is one more variable added to `initial_state`,
and the value of that variable at the end of the loop is returned in a single-
item stream.
When `yields` is specified, it must be a stream, and the value
passed there, as well as the values set to this variable in each
iteration of the loop, are chained to produce the resulting stream.
When `yields_interleaved` is specified, the behavior is the same as `yields`,
but the values are interleaved instead of chained.
The behavior of the loop is equivalent to the following pseudocode (with
a simplification that `stream` is an `iterable`):
state = initial_state # Also add `returns` to it if specified.
while condition_fn(state):
item = next(iterable)
if item == <end-of-iterable>:
upd = finalize_fn(**state)
else:
upd = body_fn(item, **state)
if yields/yields_interleaved is specified:
yield the corresponding data from upd, and remove it from upd.
state.update(upd)
if item == <end-of-iterable>:
break
if returns is specified:
yield state['returns']
Args:
stream: The stream to iterate over.
body_fn: The function to be executed for each item in the stream. It will
receive the stream item as the positional argument, and the loop variables
as keyword arguments (excluding `yields`/`yields_interleaved` if those are
specified), and must return a namedtuple with the new values for some or
all loop variables (including `yields`/`yields_interleaved` if those are
specified).
finalize_fn: The function to be executed when the stream is exhausted. It
will receive the same arguments as `body_fn` except the positional
argument, and must return the same namedtuple. If not specified, the state
at the end will be the same as the state after processing the last item.
Note that finalize_fn is not called if condition_fn ever returns false.
condition_fn: The function to be executed to determine whether to continue
the loop. It will receive the loop variables as keyword arguments, and
must return a MASK scalar. Can be used to terminate the loop early without
processing all items in the stream. If not specified, the loop will
continue until the stream is exhausted.
executor: The executor to use for computations.
returns: The loop variable that holds the return value of the loop.
yields: The loop variable that holds the values to yield at each iteration,
to be chained together.
yields_interleaved: The loop variable that holds the values to yield at
each iteration, to be interleaved.
**initial_state: The initial state of the loop variables.
Returns:
Either a stream with a single returns value or a stream of yielded values.
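Example (an illustrative sketch that sums the stream via `returns`; assumes
the body functor is created with `kd.fn` and returns a namedtuple):
```
kd.streams.foreach(
    kd.streams.make(1, 2, 3),
    kd.fn(lambda item, returns: kd.namedtuple(returns=returns + item)),
    returns=0,
)
```
result: A single-item stream containing 6.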
kd.streams.get_default_executor()
Returns the default executor.
kd.streams.get_eager_executor()
Returns an executor that runs tasks right away on the same thread.
kd.streams.get_stream_qtype(value_qtype)
Returns the stream qtype for the given value qtype.
kd.streams.interleave(*streams, value_type_as=unspecified)
Creates a stream that interleaves the given streams.
The resulting stream has all items from all input streams, and the order of
items from each stream is preserved. But the order of interleaving of
different streams can be arbitrary.
Having unspecified order allows the parallel execution to put the items into
the result in the order they are computed, potentially increasing the amount
of parallel processing done.
The input streams must all have the same value type. If value_type_as is
specified, it must be the same as the value type of the streams, if any.
Args:
*streams: Input streams.
value_type_as: A value that has the same type as the items in the streams.
It is useful to specify this explicitly if the list of streams may be
empty. If this is not specified and the list of streams is empty, the
resulting stream will have DATA_SLICE as the value type.
Returns:
A stream that interleaves the input streams in an unspecified order.
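Example (illustrative; the interleaving order is unspecified):
```
kd.streams.interleave(
    kd.streams.make(1, 2),
    kd.streams.make(10),
)
```
result: A stream with items {1, 2, 10}, where 1 is guaranteed to appear
before 2, but 10 may appear at any position.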
kd.streams.interleave_from_stream(stream_of_streams)
Creates a stream that interleaves the given streams.
The resulting stream has all items from all input streams, and the order of
items from each stream is preserved. But the order of interleaving of
different streams can be arbitrary.
Having unspecified order allows the parallel execution to put the items into
the result in the order they are computed, potentially increasing the amount
of parallel processing done.
Args:
stream_of_streams: A stream of input streams.
Returns:
A stream that interleaves the input streams in an unspecified order.
kd.streams.make(*items, value_type_as=unspecified)
Creates a stream from the given items, in the given order.
The items must all have the same type (for example, DataSlice or DataBag).
However, in case of data slices, the items can have different shapes or
schemas.
Args:
*items: Items to be put into the stream.
value_type_as: A value that has the same type as the items. It is useful to
specify this explicitly if the list of items may be empty. If this is not
specified and the list of items is empty, the stream will have DATA_SLICE
as the value type.
Returns:
A stream with the given items.
kd.streams.make_executor(thread_limit=0)
Returns a new executor.
Note: The `thread_limit` limits the concurrency; however, the executor may
have no dedicated threads, and the actual concurrency limit might be lower.
Args:
thread_limit: The number of threads to use. Must be non-negative; 0 means
that the number of threads is selected automatically.
kd.streams.map(stream, fn, *, executor=unspecified, value_type_as=DataItem(None, schema: NONE))
Returns a new stream by applying `fn` to each item in the input stream.
For each item of the input `stream`, the `fn` is called. The single
resulting item from each call is then written into the new output stream.
Args:
stream: The input stream.
fn: The function to be executed for each item of the input stream. It will
receive an item as the positional argument and its result must be of the
same type as `value_type_as`.
executor: An executor for scheduling asynchronous operations.
value_type_as: The type to use as value type of the resulting stream.
Returns:
The resulting stream.
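Example (a minimal sketch, assuming a functor created via `kd.fn`):
```
kd.streams.map(
    kd.streams.make(1, 2, 3),
    kd.fn(lambda x: x * 10),
)
```
result: A stream with items [10, 20, 30], in the input order.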
kd.streams.map_unordered(stream, fn, *, executor=unspecified, value_type_as=DataItem(None, schema: NONE))
Returns a new stream by applying `fn` to each item in the input `stream`.
For each item of the input `stream`, the `fn` is called. The single
resulting item from each call is then written into the new output stream.
IMPORTANT: The order of the items in the resulting stream is not guaranteed.
Args:
stream: The input stream.
fn: The function to be executed for each item of the input stream. It will
receive an item as the positional argument and its result must be of the
same type as `value_type_as`.
executor: An executor for scheduling asynchronous operations.
value_type_as: The type to use as value type of the resulting stream.
Returns:
The resulting stream.
kd.streams.reduce(fn, stream, initial_value, *, executor=unspecified)
Reduces a stream by iteratively applying a functor `fn`.
This operator applies `fn` sequentially to an accumulating value and each
item of the `stream`. The process begins with `initial_value`, then follows
this pattern:
value_0 = initial_value
value_1 = fn(value_0, stream[0])
value_2 = fn(value_1, stream[1])
...
The result of the reduction is the final computed value.
Args:
fn: A binary function that takes two positional arguments -- the current
accumulating value and the next item from the stream -- and returns a new
value. It's expected to return a value of the same type as
`initial_value`.
stream: The input stream.
initial_value: The initial value.
executor: The executor to use for computations.
Returns:
A stream with a single item containing the final result of the reduction.
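Example (a minimal sketch that sums a stream):
```
kd.streams.reduce(
    kd.fn(lambda acc, x: acc + x),
    kd.streams.make(1, 2, 3),
    initial_value=0,
)
```
result: A single-item stream containing 6.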
kd.streams.reduce_concat(stream, initial_value, *, ndim=1, executor=unspecified)
Concatenates data slices from the stream.
A specialized version of kd.streams.reduce() designed to speed up
the concatenation of data slices.
Using a standard kd.streams.reduce() with kd.concat() would result in
an O(N**2) computational complexity. This implementation, however, achieves
an O(N) complexity.
See the docstring for `kd.concat` for more details about the concatenation
semantics.
Args:
stream: A stream of data slices to be concatenated.
initial_value: The initial value to be concatenated before items.
ndim: The number of last dimensions to concatenate.
executor: The executor to use for computations.
Returns:
A single-item stream with the concatenated data slice.
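Example (illustrative):
```
kd.streams.reduce_concat(
    kd.streams.make(kd.slice([1, 2]), kd.slice([3])),
    initial_value=kd.slice([0]),
)
```
result: A single-item stream containing kd.slice([0, 1, 2, 3]).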
kd.streams.reduce_stack(stream, initial_value, *, ndim=0, executor=unspecified)
Stacks data slices from the stream.
A specialized version of kd.streams.reduce() designed to speed up
the concatenation of data slices.
Using a standard kd.streams.reduce() with kd.stack() would result in
an O(N**2) computational complexity. This implementation, however,
achieves an O(N) complexity.
See the docstring for `kd.stack` for more details about the stacking
semantics.
Args:
stream: A stream of data slices to be stacked.
initial_value: The initial value to be stacked before items.
ndim: The number of last dimensions to stack (default 0).
executor: The executor to use for computations.
Returns:
A single-item stream with the stacked data slice.
kd.streams.unsafe_blocking_await(stream)
Blocks until the given stream yields a single item.
IMPORTANT: This operator is inherently unsafe and should be used with extreme
caution. It's primarily intended for transitional periods when migrating
a complex, synchronous computation to a concurrent model, enabling incremental
changes instead of a complete migration in one step.
The main danger stems from its blocking nature: it blocks the calling thread
until the stream is ready. However, if the task responsible for filling
the stream is also scheduled on the same executor, and all executor threads
become blocked, that task may never execute, leading to a deadlock.
While seemingly acceptable initially, prolonged or widespread use of this
operator will eventually cause deadlocks, requiring a non-trivial refactoring
of your computation.
Args:
stream: A single-item input stream.
Returns:
The single item from the stream.
kd.streams.while_(condition_fn, body_fn, *, executor=unspecified, returns=unspecified, yields=unspecified, yields_interleaved=unspecified, **initial_state)
Repeatedly applies a body functor while a condition is met.
Each iteration, the operator passes current state variables (including
`returns`, if specified) as keyword arguments to `condition_fn` and `body_fn`.
The loop continues if `condition_fn` returns `present`. State variables are
then updated from `body_fn`'s namedtuple return value.
This operator always returns a stream, with the concrete behavior
depending on whether `returns`, `yields`, or `yields_interleaved` was
specified (exactly one of them must be specified):
- `returns`: a single-item stream with the final value of the `returns` state
variable;
- `yields`: a stream created by chaining the initial `yields` stream with any
subsequent streams produced by the `body_fn`;
- `yields_interleaved`: the same as for `yields`, but instead of being chained
the streams are interleaved.
Args:
condition_fn: A functor that accepts state variables (including `returns`,
if specified) as keyword arguments and returns a MASK data-item, either
directly or as a single-item stream. A `present` value indicates the loop
should continue; `missing` indicates it should stop.
body_fn: A functor that accepts state variables (including `returns`, if
specified) as keyword arguments and returns a namedtuple (see
`kd.make_namedtuple`) containing updated values for a subset of the state
variables. These updated values must retain their original types.
executor: The executor to use for computations.
returns: If present, the initial value of the 'returns' state variable.
yields: If present, the initial value of the 'yields' state variable.
yields_interleaved: If present, the initial value of the
`yields_interleaved` state variable.
**initial_state: Initial values for state variables.
Returns:
If `returns` is a state variable, the value of `returns` when the loop
ended. Otherwise, a stream combining the values of `yields` or
`yields_interleaved` from each body invocation.
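Example (an illustrative countdown computing 3 + 2 + 1; assumes functors
created via `kd.fn` that return namedtuple updates):
```
kd.streams.while_(
    kd.fn(lambda n, returns: n > 0),
    kd.fn(lambda n, returns: kd.namedtuple(n=n - 1, returns=returns + n)),
    returns=0,
    n=3,
)
```
result: A single-item stream containing 6.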
Operators to create tuples.
Operators
kd.tuples.get_namedtuple_field(namedtuple, field_name)
Returns the value of the specified `field_name` from the `namedtuple`.
Args:
namedtuple: a namedtuple.
field_name: the name of the field to return.
kd.tuples.get_nth(x, n)
Returns the nth element of the tuple `x`.
Args:
x: a tuple.
n: the index of the element to return. Must be in the range [0, len(x)).
kd.tuples.namedtuple(**kwargs)
Aliases:
Returns a namedtuple-like object containing the given `**kwargs`.
kd.tuples.slice(start=unspecified, stop=unspecified, step=unspecified)
Returns a slice for the Python indexing syntax foo[start:stop:step].
Args:
start: (optional) Indexing start.
stop: (optional) Indexing stop.
step: (optional) Indexing step size.
kd.tuples.tuple(*args)
Aliases:
Returns a tuple-like object containing the given `*args`.
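Example (illustrative; scalar inputs are shown auto-boxed to DataItems):
t = kd.tuples.tuple(1, 'a')
kd.tuples.get_nth(t, 1)  # -> kd.item('a')
nt = kd.tuples.namedtuple(x=1, y=2)
kd.tuples.get_namedtuple_field(nt, 'y')  # -> kd.item(2)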
Operators
kd.add(x, y)
Alias for kd.math.add operator.
kd.agg_all(x, ndim=unspecified)
Alias for kd.masking.agg_all operator.
kd.agg_any(x, ndim=unspecified)
Alias for kd.masking.agg_any operator.
kd.agg_count(x, ndim=unspecified)
Alias for kd.slices.agg_count operator.
kd.agg_has(x, ndim=unspecified)
Alias for kd.masking.agg_has operator.
kd.agg_max(x, ndim=unspecified)
Alias for kd.math.agg_max operator.
kd.agg_min(x, ndim=unspecified)
Alias for kd.math.agg_min operator.
kd.agg_size(x, ndim=unspecified)
Alias for kd.slices.agg_size operator.
kd.agg_sum(x, ndim=unspecified)
Alias for kd.math.agg_sum operator.
kd.agg_uuid(x, ndim=unspecified)
Alias for kd.ids.agg_uuid operator.
kd.align(*args)
Alias for kd.slices.align operator.
kd.all(x)
Alias for kd.masking.all operator.
kd.any(x)
Alias for kd.masking.any operator.
kd.appended_list(x, append)
Alias for kd.lists.appended_list operator.
kd.apply_mask(x, y)
Alias for kd.masking.apply_mask operator.
kd.apply_py(fn, *args, return_type_as=unspecified, **kwargs)
Alias for kd.py.apply_py operator.
kd.apply_py_on_cond(yes_fn, no_fn, cond, *args, **kwargs)
Alias for kd.py.apply_py_on_cond operator.
kd.apply_py_on_selected(fn, cond, *args, **kwargs)
Alias for kd.py.apply_py_on_selected operator.
kd.argmax(x, ndim=unspecified)
Alias for kd.math.argmax operator.
kd.argmin(x, ndim=unspecified)
Alias for kd.math.argmin operator.
kd.at(x, indices)
Alias for kd.slices.at operator.
kd.attr(x, attr_name, value, overwrite_schema=False)
Alias for kd.core.attr operator.
kd.attrs(x, /, *, overwrite_schema=False, **attrs)
Alias for kd.core.attrs operator.
kd.bag()
Alias for kd.bags.new operator.
kd.bind(fn_def, /, *, return_type_as=<class 'koladata.types.data_slice.DataSlice'>, **kwargs)
Alias for kd.functor.bind operator.
kd.bitwise_and(x, y)
Alias for kd.bitwise.bitwise_and operator.
kd.bitwise_count(x)
Alias for kd.bitwise.count operator.
kd.bitwise_invert(x)
Alias for kd.bitwise.invert operator.
kd.bitwise_or(x, y)
Alias for kd.bitwise.bitwise_or operator.
kd.bitwise_xor(x, y)
Alias for kd.bitwise.bitwise_xor operator.
kd.bool(x)
Alias for kd.slices.bool operator.
kd.bytes(x)
Alias for kd.slices.bytes operator.
kd.call(fn, *args, return_type_as=DataItem(None, schema: NONE), **kwargs)
Alias for kd.functor.call operator.
kd.cast_to(x, schema)
Alias for kd.schema.cast_to operator.
kd.check_inputs(**kw_constraints)
Decorator factory for adding runtime input type checking to Koda functions.
Resulting decorators will check the schemas of DataSlice inputs of
a function at runtime, and raise TypeError in case of mismatch.
Decorated functions will preserve the original function's signature and
docstring.
Decorated functions can be traced using `kd.fn` and the inputs to the
resulting functor will be wrapped in kd.assertion.with_assertion nodes that
match the assertions of the eager version.
Example for primitive schemas:
@kd.check_inputs(hours=kd.INT32, minutes=kd.INT32)
@kd.check_output(kd.STRING)
def timestamp(hours, minutes):
return kd.str(hours) + ':' + kd.str(minutes)
timestamp(kd.slice([10, 10, 10]), kd.slice([15, 30, 45])) # Does not raise.
timestamp(kd.slice([10, 10, 10]), kd.slice([15.35, 30.12, 45.1])) # Raises TypeError.
Example for complex schemas:
Doc = kd.schema.named_schema('Doc', doc_id=kd.INT64, score=kd.FLOAT32)
Query = kd.schema.named_schema(
'Query',
query_text=kd.STRING,
query_id=kd.INT32,
docs=kd.list_schema(Doc),
)
@kd.check_inputs(query=Query)
@kd.check_output(Doc)
def get_docs(query):
return query.docs[:]
Example for an argument that should not be an Expr at tracing time:
@kd.check_inputs(x=kd.constant_when_tracing(kd.INT32))
def f(x):
return x
Args:
**kw_constraints: mapping of parameter names to type constraints. Names must
match parameter names in the decorated function. Arguments for the given
parameters must be DataSlices/DataItems that match the given type
constraint (in particular, for SchemaItems, they must have the
corresponding schema).
Returns:
A decorator that can be used to type annotate a function that accepts
DataSlices/DataItem inputs.
kd.check_output(constraint)
Decorator factory for adding runtime output type checking to Koda functions.
Resulting decorators will check the schema of the DataSlice output of
a function at runtime, and raise TypeError in case of mismatch.
Decorated functions will preserve the original function's signature and
docstring.
Decorated functions can be traced using `kd.fn` and the output of the
resulting functor will be wrapped in a kd.assertion.with_assertion node that
matches the assertion of the eager version.
Example for primitive schemas:
@kd.check_inputs(hours=kd.INT32, minutes=kd.INT32)
@kd.check_output(kd.STRING)
def timestamp(hours, minutes):
return kd.to_str(hours) + ':' + kd.to_str(minutes)
timestamp(kd.slice([10, 10, 10]), kd.slice([15, 30, 45])) # Does not raise.
timestamp(kd.slice([10, 10, 10]), kd.slice([15.35, 30.12, 45.1])) # Raises TypeError.
Example for complex schemas:
Doc = kd.schema.named_schema('Doc', doc_id=kd.INT64, score=kd.FLOAT32)
Query = kd.schema.named_schema(
'Query',
query_text=kd.STRING,
query_id=kd.INT32,
docs=kd.list_schema(Doc),
)
@kd.check_inputs(query=Query)
@kd.check_output(Doc)
def get_docs(query):
return query.docs[:]
Args:
constraint: A type constraint for the output. Output of the decorated
function must be a DataSlice/DataItem that matches the constraint (in
particular, for SchemaItems, they must have the corresponding schema).
Returns:
A decorator that can be used to annotate a function returning a
DataSlice/DataItem.
kd.cityhash(x, seed)
Alias for kd.random.cityhash operator.
kd.clear_eval_cache()
Clears Koda-specific eval caches.
kd.clone(x, /, *, itemid=unspecified, schema=unspecified, **overrides)
Alias for kd.core.clone operator.
kd.coalesce(x, y)
Alias for kd.masking.coalesce operator.
kd.collapse(x, ndim=unspecified)
Alias for kd.slices.collapse operator.
kd.concat(*args, ndim=1)
Alias for kd.slices.concat operator.
kd.concat_lists(*lists)
Alias for kd.lists.concat operator.
kd.cond(condition, yes, no=DataItem(None, schema: NONE))
Alias for kd.masking.cond operator.
kd.container(**attrs)
Alias for kd.core.container operator.
kd.count(x)
Alias for kd.slices.count operator.
kd.cum_count(x, ndim=unspecified)
Alias for kd.slices.cum_count operator.
kd.cum_max(x, ndim=unspecified)
Alias for kd.math.cum_max operator.
kd.decode_itemid(ds)
Alias for kd.ids.decode_itemid operator.
kd.deep_clone(x, /, schema=unspecified, **overrides)
Alias for kd.core.deep_clone operator.
kd.deep_uuid(x, /, schema=unspecified, *, seed='')
Alias for kd.ids.deep_uuid operator.
kd.del_attr(x, attr_name)
Deletes an attribute `attr_name` from `x`.
kd.dense_rank(x, descending=False, ndim=unspecified)
Alias for kd.slices.dense_rank operator.
kd.dict(items_or_keys=None, values=None, *, key_schema=None, value_schema=None, schema=None, itemid=None)
Alias for kd.dicts.new operator.
kd.dict_like(shape_and_mask_from, /, items_or_keys=None, values=None, *, key_schema=None, value_schema=None, schema=None, itemid=None)
Alias for kd.dicts.like operator.
kd.dict_schema(key_schema, value_schema)
Alias for kd.schema.dict_schema operator.
kd.dict_shaped(shape, /, items_or_keys=None, values=None, key_schema=None, value_schema=None, schema=None, itemid=None)
Alias for kd.dicts.shaped operator.
kd.dict_shaped_as(shape_from, /, items_or_keys=None, values=None, key_schema=None, value_schema=None, schema=None, itemid=None)
Alias for kd.dicts.shaped_as operator.
kd.dict_size(dict_slice)
Alias for kd.dicts.size operator.
kd.dict_update(x, keys, values=unspecified)
Alias for kd.dicts.dict_update operator.
kd.dir(x)
Returns a sorted list of unique attribute names of the given DataSlice.
This is equivalent to `kd.get_attr_names(ds, intersection=True)`. For more
fine-grained control, use `kd.get_attr_names` directly instead.
In case of OBJECT schema, attribute names are fetched from the `__schema__`
attribute. In case of Entity schema, the attribute names are fetched from the
schema. In case of primitives, an empty list is returned.
Args:
x: A DataSlice.
Returns:
A list of unique attributes sorted by alphabetical order.
kd.disjoint_coalesce(x, y)
Alias for kd.masking.disjoint_coalesce operator.
kd.duck_dict(key_constraint, value_constraint)
Creates a duck dict constraint to be used in kd.check_inputs/output.
A duck_dict constraint will assert a DataSlice is a dict, checking the
key_constraint on the keys and value_constraint on the values. Use it if you
need to nest duck type constraints in dict constraints.
Example:
@kd.check_inputs(mapping=kd.duck_dict(kd.STRING,
kd.duck_type(doc_id=kd.INT64, score=kd.FLOAT32)))
def f(query):
pass
Args:
key_constraint: DuckType or SchemaItem representing the constraint to be
checked on the keys of the dict.
value_constraint: DuckType or SchemaItem representing the constraint to be
checked on the values of the dict.
Returns:
A duck type constraint to be used in kd.check_inputs or kd.check_output.
kd.duck_list(item_constraint)
Creates a duck list constraint to be used in kd.check_inputs/output.
A duck_list constraint will assert a DataSlice is a list, checking the
item_constraint on the items. Use it if you need to nest
duck type constraints in list constraints.
Example:
@kd.check_inputs(query=kd.duck_type(docs=kd.duck_list(
kd.duck_type(doc_id=kd.INT64, score=kd.FLOAT32)
)))
def f(query):
pass
Args:
item_constraint: DuckType or SchemaItem representing the constraint to be
checked on the items of the list.
Returns:
A duck type constraint to be used in kd.check_inputs or kd.check_output.
kd.duck_type(**kwargs)
Creates a duck type constraint to be used in kd.check_inputs/output.
A duck type constraint will assert that the DataSlice input/output of a
function has (at least) a certain set of attributes, as well as to specify
recursive constraints for those attributes.
Example:
@kd.check_inputs(query=kd.duck_type(query_text=kd.STRING,
docs=kd.duck_type()))
def f(query):
pass
Checks that the DataSlice input parameter `query` has a STRING attribute
`query_text`, and an attribute `docs` of any schema. `query` may also have
additional unspecified attributes.
Args:
**kwargs: mapping of attribute names to constraints. The constraints must be
either DuckTypes or SchemaItems. To assert only the presence of an
attribute, without specifying additional constraints on that attribute,
pass an empty duck type for that attribute.
Returns:
A duck type constraint to be used in kd.check_inputs or kd.check_output.
kd.dumps(x, /, *, riegeli_options=None)
Serializes a DataSlice or a DataBag.
In case of a DataSlice, we try to use `x.extract()` to avoid serializing
unnecessary DataBag data. If this is undesirable, consider serializing the
DataBag directly.
Due to current limitations of the underlying implementation, this can
only serialize data slices with up to roughly 10**8 items.
Args:
x: DataSlice or DataBag to serialize.
riegeli_options: A string with riegeli/records writer options. See
https://github.com/google/riegeli/blob/master/doc/record_writer_options.md
for details. If not provided, 'snappy' will be used.
Returns:
Serialized data.
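Example (an illustrative round trip with `kd.loads`):
x = kd.obj(a=1, b='text')
data = kd.dumps(x)  # -> serialized bytes
y = kd.loads(data)
y.a  # -> kd.item(1)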
kd.embed_schema(x)
Returns a DataSlice with OBJECT schema.
* For primitives no data change is done.
* For Entities schema is stored as '__schema__' attribute.
* Embedding Entities requires a DataSlice to be associated with a DataBag.
Args:
x: (DataSlice) whose schema is embedded.
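Example (illustrative):
e = kd.new(x=1)  # Entity with an explicit schema.
o = kd.embed_schema(e)  # Same data, now with OBJECT schema.
o.get_schema()  # -> OBJECT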
kd.empty_shaped(shape, schema=MASK)
Alias for kd.slices.empty_shaped operator.
kd.empty_shaped_as(shape_from, schema=MASK)
Alias for kd.slices.empty_shaped_as operator.
kd.encode_itemid(ds)
Alias for kd.ids.encode_itemid operator.
kd.enriched(ds, *bag)
Alias for kd.core.enriched operator.
kd.enriched_bag(*bags)
Alias for kd.bags.enriched operator.
kd.equal(x, y)
Alias for kd.comparison.equal operator.
kd.eval(expr, self_input=DataItem(Entity(self_not_specified=present), schema: ENTITY(self_not_specified=MASK)), /, **input_values)
Returns the expr evaluated on the given `input_values`.
Only Koda Inputs from container `I` (e.g. `I.x`) can be evaluated. Other
input types must be substituted before calling this function.
Args:
expr: Koda expression with inputs from container `I`.
self_input: The value for I.self input. When not provided, it will still
have a default value that can be passed to a subroutine.
**input_values: Values to evaluate `expr` with. Note that all inputs in
`expr` must be present in the input values. All input values should either
be DataSlices or convertible to DataSlices.
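Example (illustrative; assumes the input container `I` is available, e.g. as
`kd.I`):
expr = kd.I.x + kd.I.y
kd.eval(expr, x=kd.slice([1, 2]), y=kd.slice([10, 20]))
# -> kd.slice([11, 22])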
kd.expand_to(x, target, ndim=unspecified)
Alias for kd.slices.expand_to operator.
kd.expand_to_shape(x, shape, ndim=unspecified)
Alias for kd.shapes.expand_to_shape operator.
kd.explode(x, ndim=1)
Alias for kd.lists.explode operator.
kd.expr_quote(x)
Alias for kd.slices.expr_quote operator.
kd.extension_type(unsafe_override=False)
Alias for kd.extension_types.extension_type operator.
kd.extract(ds, schema=unspecified)
Alias for kd.core.extract operator.
kd.extract_bag(ds, schema=unspecified)
Alias for kd.core.extract_bag operator.
kd.flat_map_chain(iterable, fn, value_type_as=DataItem(None, schema: NONE))
Alias for kd.functor.flat_map_chain operator.
kd.flat_map_interleaved(iterable, fn, value_type_as=DataItem(None, schema: NONE))
Alias for kd.functor.flat_map_interleaved operator.
kd.flatten(x, from_dim=0, to_dim=unspecified)
Alias for kd.shapes.flatten operator.
kd.flatten_end(x, n_times=1)
Alias for kd.shapes.flatten_end operator.
kd.float32(x)
Alias for kd.slices.float32 operator.
kd.float64(x)
Alias for kd.slices.float64 operator.
kd.fn(f, *, use_tracing=True, **kwargs)
Alias for kd.functor.fn operator.
kd.follow(x)
Alias for kd.core.follow operator.
kd.for_(iterable, body_fn, *, finalize_fn=unspecified, condition_fn=unspecified, returns=unspecified, yields=unspecified, yields_interleaved=unspecified, **initial_state)
Alias for kd.functor.for_ operator.
kd.format(fmt, /, **kwargs)
Alias for kd.strings.format operator.
kd.freeze(x)
Alias for kd.core.freeze operator.
kd.freeze_bag(x)
Alias for kd.core.freeze_bag operator.
kd.from_json(x, /, schema=OBJECT, default_number_schema=OBJECT, *, on_invalid=DataSlice([], schema: NONE), keys_attr='json_object_keys', values_attr='json_object_values')
Alias for kd.json.from_json operator.
kd.from_proto(messages, /, *, extensions=None, itemid=None, schema=None)
Returns a DataSlice representing proto data.
Messages, primitive fields, repeated fields, and maps are converted to
equivalent Koda structures: objects/entities, primitives, lists, and dicts,
respectively. Enums are converted to INT32. The attribute names on the Koda
objects match the field names in the proto definition. See below for methods
to convert proto extensions to attributes alongside regular fields.
Only present values in `messages` are added. Default and missing values are
not used.
Proto extensions are ignored by default unless `extensions` is specified (or
if an explicit entity schema with parenthesized attrs is used).
The format of each extension specified in `extensions` is a dot-separated
sequence of field names and/or extension names, where extension names are
fully-qualified extension paths surrounded by parentheses. This sequence of
fields and extensions is traversed during conversion, in addition to the
default behavior of traversing all fields. For example:
"path.to.field.(package_name.some_extension)"
"path.to.repeated_field.(package_name.some_extension)"
"path.to.map_field.values.(package_name.some_extension)"
"path.(package_name.some_extension).(package_name2.nested_extension)"
Extensions are looked up using the C++ generated descriptor pool, using
`DescriptorPool::FindExtensionByName`, which requires that all extensions are
compiled in as C++ protos. The Koda attribute names for the extension fields
are parenthesized fully-qualified extension paths (e.g.
"(package_name.some_extension)" or
"(package_name.SomeMessage.some_extension)".) As the names contain '()' and
'.' characters, they cannot be directly accessed using '.name' syntax but can
be accessed using `.get_attr(name)`. For example,
ds.get_attr('(package_name.AbcExtension.abc_extension)')
ds.optional_field.get_attr('(package_name.DefExtension.def_extension)')
If `messages` is a single proto Message, the result is a DataItem. If it is a
list of proto Messages, the result is a 1D DataSlice.
Args:
messages: Message or list of Message of the same type. Any of the messages
may be None, which will produce missing items in the result.
extensions: List of proto extension paths.
itemid: The ItemId(s) to use for the root object(s). If not specified, will
allocate new id(s). If specified, will also infer the ItemIds for all
child items such as List items from this id, so that repeated calls to
this method on the same input will produce the same id(s) for everything.
Use this with care to avoid unexpected collisions.
schema: The schema to use for the return value. Can be set to kd.OBJECT to
(recursively) create an object schema. Can be set to None (default) to
create an uuschema based on the proto descriptor. When set to an entity
schema, some fields may be set to kd.OBJECT to create objects from that
point.
Returns:
A DataSlice representing the proto data.
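Example (a hypothetical sketch; `MyMessage` stands in for a compiled Python
proto class with a string field `name` and a repeated string field `tags`):
msg = MyMessage(name='a', tags=['x', 'y'])
ds = kd.from_proto(msg)
ds.name  # -> kd.item('a')
ds.tags[:]  # -> kd.slice(['x', 'y'])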
kd.from_proto_bytes(x, proto_path, /, *, extensions=unspecified, itemids=unspecified, schema=unspecified, on_invalid=unspecified)
Alias for kd.proto.from_proto_bytes operator.
kd.from_proto_json(x, proto_path, /, *, extensions=unspecified, itemids=unspecified, schema=unspecified, on_invalid=unspecified)
Alias for kd.proto.from_proto_json operator.
kd.from_py(py_obj, *, dict_as_obj=False, itemid=None, schema=None, from_dim=0)
Aliases:
Converts Python object into DataSlice.
Can convert nested lists/dicts into Koda objects recursively as well.
Args:
py_obj: Python object to convert.
dict_as_obj: If True, will convert dicts with string keys into Koda objects
instead of Koda dicts.
itemid: The ItemId to use for the root object. If not specified, will
allocate a new id. If specified, will also infer the ItemIds for all child
items such as list items from this id, so that repeated calls to this
method on the same input will produce the same id for everything. Use this
with care to avoid unexpected collisions.
schema: The schema to use for the return value. When this schema or one of
its attributes is OBJECT (which is also the default), recursively creates
objects from that point on.
from_dim: The dimension to start creating Koda objects/lists/dicts from.
`py_obj` must be a nested list of at least from_dim depth, and the outer
from_dim dimensions will become the returned DataSlice dimensions. When
from_dim is 0, the return value is therefore a DataItem.
Returns:
A DataSlice with the converted data.
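Example (illustrative):
kd.from_py([[1, 2], [3]])
# -> DataItem holding the nested Koda List [[1, 2], [3]]
kd.from_py([[1, 2], [3]], from_dim=1)
# -> 1-dim DataSlice of two Koda Lists
kd.from_py({'a': 1}, dict_as_obj=True).a
# -> kd.item(1)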
kd.from_pytree(py_obj, *, dict_as_obj=False, itemid=None, schema=None, from_dim=0)
Alias for kd.from_py operator.
kd.fstr(x)
Alias for kd.strings.fstr operator.
kd.full_equal(x, y)
Alias for kd.comparison.full_equal operator.
kd.get_attr(x, attr_name, default=unspecified)
Alias for kd.core.get_attr operator.
kd.get_attr_names(x, *, intersection)
Returns a sorted list of unique attribute names of the given DataSlice.
In case of OBJECT schema, attribute names are fetched from the `__schema__`
attribute. In case of Entity schema, the attribute names are fetched from the
schema. In case of primitives, an empty list is returned.
Args:
x: A DataSlice.
intersection: If True, the intersection of all object attributes is
returned. Otherwise, the union is returned.
Returns:
A list of unique attributes sorted by alphabetical order.
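Example (illustrative):
x = kd.slice([kd.obj(a=1, b=2), kd.obj(a=3, c=4)])
kd.get_attr_names(x, intersection=True)  # -> ['a']
kd.get_attr_names(x, intersection=False)  # -> ['a', 'b', 'c']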
kd.get_bag(ds)
Alias for kd.core.get_bag operator.
kd.get_dtype(ds)
Alias for kd.schema.get_dtype operator.
kd.get_item(x, key_or_index)
Alias for kd.core.get_item operator.
kd.get_item_schema(list_schema)
Alias for kd.schema.get_item_schema operator.
kd.get_itemid(x)
Alias for kd.schema.get_itemid operator.
kd.get_key_schema(dict_schema)
Alias for kd.schema.get_key_schema operator.
kd.get_keys(dict_ds)
Alias for kd.dicts.get_keys operator.
kd.get_metadata(x)
Alias for kd.core.get_metadata operator.
kd.get_ndim(x)
Alias for kd.slices.get_ndim operator.
kd.get_nofollowed_schema(schema)
Alias for kd.schema.get_nofollowed_schema operator.
kd.get_obj_schema(x)
Alias for kd.schema.get_obj_schema operator.
kd.get_primitive_schema(ds)
Alias for kd.schema.get_dtype operator.
kd.get_repr(x, /, *, depth=25, item_limit=200, item_limit_per_dimension=25, format_html=False, max_str_len=100, show_attributes=True, show_databag_id=False, show_shape=False, show_schema=False)
Alias for kd.slices.get_repr operator.
kd.get_schema(x)
Alias for kd.schema.get_schema operator.
kd.get_shape(x)
Alias for kd.shapes.get_shape operator.
kd.get_value_schema(dict_schema)
Alias for kd.schema.get_value_schema operator.
kd.get_values(dict_ds, key_ds=unspecified)
Alias for kd.dicts.get_values operator.
kd.greater(x, y)
Alias for kd.comparison.greater operator.
kd.greater_equal(x, y)
Alias for kd.comparison.greater_equal operator.
kd.group_by(x, *args, sort=False)
Alias for kd.slices.group_by operator.
kd.group_by_indices(*args, sort=False)
Alias for kd.slices.group_by_indices operator.
kd.has(x)
Alias for kd.masking.has operator.
kd.has_attr(x, attr_name)
Alias for kd.core.has_attr operator.
kd.has_bag(ds)
Alias for kd.core.has_bag operator.
kd.has_dict(x)
Alias for kd.dicts.has_dict operator.
kd.has_entity(x)
Alias for kd.core.has_entity operator.
kd.has_fn(x)
Alias for kd.functor.has_fn operator.
kd.has_list(x)
Alias for kd.lists.has_list operator.
kd.has_not(x)
Alias for kd.masking.has_not operator.
kd.has_primitive(x)
Alias for kd.core.has_primitive operator.
kd.hash_itemid(x)
Alias for kd.ids.hash_itemid operator.
kd.if_(cond, yes_fn, no_fn, *args, return_type_as=DataItem(None, schema: NONE), **kwargs)
Alias for kd.functor.if_ operator.
kd.implode(x, /, ndim=1, itemid=None)
Alias for kd.lists.implode operator.
kd.index(x, dim=-1)
Alias for kd.slices.index operator.
kd.int32(x)
Alias for kd.slices.int32 operator.
kd.int64(x)
Alias for kd.slices.int64 operator.
kd.inverse_mapping(x, ndim=unspecified)
Alias for kd.slices.inverse_mapping operator.
kd.inverse_select(ds, fltr)
Alias for kd.slices.inverse_select operator.
kd.is_dict(x)
Alias for kd.dicts.is_dict operator.
kd.is_empty(x)
Alias for kd.slices.is_empty operator.
kd.is_entity(x)
Alias for kd.core.is_entity operator.
kd.is_expandable_to(x, target, ndim=unspecified)
Alias for kd.slices.is_expandable_to operator.
kd.is_expr(obj)
Returns kd.present if the given object is an Expr and kd.missing otherwise.
kd.is_fn(obj)
Alias for kd.functor.is_fn operator.
kd.is_item(obj)
Returns kd.present if the given object is a scalar DataItem and kd.missing otherwise.
kd.is_list(x)
Alias for kd.lists.is_list operator.
kd.is_nan(x)
Alias for kd.math.is_nan operator.
kd.is_null_bag(bag)
Alias for kd.bags.is_null_bag operator.
kd.is_primitive(x)
Alias for kd.core.is_primitive operator.
kd.is_shape_compatible(x, y)
Alias for kd.slices.is_shape_compatible operator.
kd.is_slice(obj)
Returns kd.present if the given object is a DataSlice and kd.missing otherwise.
kd.isin(x, y)
Alias for kd.slices.isin operator.
kd.item(x, /, schema=None)
Alias for kd.slices.item operator.
kd.less(x, y)
Alias for kd.comparison.less operator.
kd.less_equal(x, y)
Alias for kd.comparison.less_equal operator.
kd.list(items=None, *, item_schema=None, schema=None, itemid=None)
Creates list(s) by collapsing `items` into an immutable list.
If there is no argument, returns an empty Koda List.
If the argument is a Python list, creates a nested Koda List.
Examples:
list() -> a single empty Koda List
list([1, 2, 3]) -> Koda List with items 1, 2, 3
list([[1, 2, 3], [4, 5]]) -> nested Koda List [[1, 2, 3], [4, 5]]
# items are Koda lists.
Args:
items: The items to use. If not specified, an empty list of OBJECTs will be
created.
item_schema: the schema of the list items. If not specified, it will be
deduced from `items` or defaulted to OBJECT.
schema: The schema to use for the list. If specified, then item_schema must
not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting lists.
Returns:
The slice with list/lists.
kd.list_append_update(x, append)
Alias for kd.lists.list_append_update operator.
kd.list_like(shape_and_mask_from, /, items=None, *, item_schema=None, schema=None, itemid=None)
Alias for kd.lists.like operator.
kd.list_schema(item_schema)
Alias for kd.schema.list_schema operator.
kd.list_shaped(shape, /, items=None, *, item_schema=None, schema=None, itemid=None)
Alias for kd.lists.shaped operator.
kd.list_shaped_as(shape_from, /, items=None, *, item_schema=None, schema=None, itemid=None)
Alias for kd.lists.shaped_as operator.
kd.list_size(list_slice)
Alias for kd.lists.size operator.
kd.loads(x)
Deserializes a DataSlice or a DataBag.
kd.map(fn, *args, include_missing=False, **kwargs)
Alias for kd.functor.map operator.
kd.map_py(fn, *args, schema=DataItem(None, schema: NONE), max_threads=1, ndim=0, include_missing=DataItem(None, schema: NONE), item_completed_callback=DataItem(None, schema: NONE), **kwargs)
Alias for kd.py.map_py operator.
kd.map_py_on_cond(true_fn, false_fn, cond, *args, schema=DataItem(None, schema: NONE), max_threads=1, item_completed_callback=DataItem(None, schema: NONE), **kwargs)
Alias for kd.py.map_py_on_cond operator.
kd.map_py_on_selected(fn, cond, *args, schema=DataItem(None, schema: NONE), max_threads=1, item_completed_callback=DataItem(None, schema: NONE), **kwargs)
Alias for kd.py.map_py_on_selected operator.
kd.mask(x)
Alias for kd.slices.mask operator.
kd.mask_and(x, y)
Alias for kd.masking.mask_and operator.
kd.mask_equal(x, y)
Alias for kd.masking.mask_equal operator.
kd.mask_not_equal(x, y)
Alias for kd.masking.mask_not_equal operator.
kd.mask_or(x, y)
Alias for kd.masking.mask_or operator.
kd.max(x)
Alias for kd.math.max operator.
kd.maximum(x, y)
Alias for kd.math.maximum operator.
kd.maybe(x, attr_name)
Alias for kd.core.maybe operator.
kd.metadata(x, /, **attrs)
Alias for kd.core.metadata operator.
kd.min(x)
Alias for kd.math.min operator.
kd.minimum(x, y)
Alias for kd.math.minimum operator.
kd.mutable_bag()
Aliases:
Returns an empty mutable DataBag. Only works in eager mode.
kd.named_container()
Container that automatically names Exprs.
For non-Expr inputs, in tracing mode the value will be converted to an Expr,
while in non-tracing mode it will be stored as is. This allows using
a NamedContainer in eager code that will later be traced.
For example:
c = kd.ext.expr_container.NamedContainer()
c.x_plus_y = I.x + I.y
c.x_plus_y # Returns (I.x + I.y).with_name('x_plus_y')
c.foo = 5
c.foo # Returns 5
Functions and lambdas are automatically traced in tracing mode.
For example:
def foo(x):
c = kd.ext.expr_container.NamedContainer()
c.x = x
c.update = lambda x: x + 1
return c.update(c.x)
fn = kd.fn(foo)
fn(x=5) # Returns 6
kd.named_schema(name, /, **kwargs)
Alias for kd.schema.named_schema operator.
kd.namedtuple(**kwargs)
Alias for kd.tuples.namedtuple operator.
kd.new(arg=unspecified, /, *, schema=None, overwrite_schema=False, itemid=None, **attrs)
Alias for kd.entities.new operator.
kd.new_dictid()
Alias for kd.allocation.new_dictid operator.
kd.new_dictid_like(shape_and_mask_from)
Alias for kd.allocation.new_dictid_like operator.
kd.new_dictid_shaped(shape)
Alias for kd.allocation.new_dictid_shaped operator.
kd.new_dictid_shaped_as(shape_from)
Alias for kd.allocation.new_dictid_shaped_as operator.
kd.new_itemid()
Alias for kd.allocation.new_itemid operator.
kd.new_itemid_like(shape_and_mask_from)
Alias for kd.allocation.new_itemid_like operator.
kd.new_itemid_shaped(shape)
Alias for kd.allocation.new_itemid_shaped operator.
kd.new_itemid_shaped_as(shape_from)
Alias for kd.allocation.new_itemid_shaped_as operator.
kd.new_like(shape_and_mask_from, /, *, schema=None, overwrite_schema=False, itemid=None, **attrs)
Alias for kd.entities.like operator.
kd.new_listid()
Alias for kd.allocation.new_listid operator.
kd.new_listid_like(shape_and_mask_from)
Alias for kd.allocation.new_listid_like operator.
kd.new_listid_shaped(shape)
Alias for kd.allocation.new_listid_shaped operator.
kd.new_listid_shaped_as(shape_from)
Alias for kd.allocation.new_listid_shaped_as operator.
kd.new_shaped(shape, /, *, schema=None, overwrite_schema=False, itemid=None, **attrs)
Alias for kd.entities.shaped operator.
kd.new_shaped_as(shape_from, /, *, schema=None, overwrite_schema=False, itemid=None, **attrs)
Alias for kd.entities.shaped_as operator.
kd.no_bag(ds)
Alias for kd.core.no_bag operator.
kd.nofollow(x)
Alias for kd.core.nofollow operator.
kd.nofollow_schema(schema)
Alias for kd.schema.nofollow_schema operator.
kd.not_equal(x, y)
Alias for kd.comparison.not_equal operator.
kd.obj(arg=unspecified, /, *, itemid=None, **attrs)
Alias for kd.objs.new operator.
kd.obj_like(shape_and_mask_from, /, *, itemid=None, **attrs)
Alias for kd.objs.like operator.
kd.obj_shaped(shape, /, *, itemid=None, **attrs)
Alias for kd.objs.shaped operator.
kd.obj_shaped_as(shape_from, /, *, itemid=None, **attrs)
Alias for kd.objs.shaped_as operator.
kd.ordinal_rank(x, tie_breaker=unspecified, descending=False, ndim=unspecified)
Alias for kd.slices.ordinal_rank operator.
kd.present_like(x)
Alias for kd.masking.present_like operator.
kd.present_shaped(shape)
Alias for kd.masking.present_shaped operator.
kd.present_shaped_as(x)
Alias for kd.masking.present_shaped_as operator.
kd.pwl_curve(p, adjustments)
Alias for kd.curves.pwl_curve operator.
kd.py_fn(f, *, return_type_as=<class 'koladata.types.data_slice.DataSlice'>, **defaults)
Alias for kd.functor.py_fn operator.
kd.py_reference(obj)
Wraps a Python object into an Arolla QValue, using a reference for serialization.
py_reference can be used to pass arbitrary python objects through
kd.apply_py/kd.py_fn.
Note that using reference for serialization means that the resulting
QValue (and Exprs created using it) will only be valid within the
same process. Trying to deserialize it in a different process
will result in an exception.
Args:
obj: the python object to wrap.
Returns:
The wrapped python object as an Arolla QValue.
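For illustration, a minimal sketch of passing a plain Python object through kd.apply_py. The `config` object and the results in the comments are illustrative, and this assumes the reference is unwrapped back into the original Python object inside the function, as the note above implies:
```
from koladata import kd

config = {'factor': 3}          # an arbitrary Python object
ref = kd.py_reference(config)   # wrap it so it can cross the operator boundary

def scale(x, cfg):
  # cfg arrives as the original Python dict wrapped by py_reference above.
  return x * cfg['factor']

kd.apply_py(scale, kd.slice([1, 2, 3]), ref)  # -> kd.slice([3, 6, 9])
```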
kd.randint_like(x, low=unspecified, high=unspecified, seed=unspecified)
Alias for kd.random.randint_like operator.
kd.randint_shaped(shape, low=unspecified, high=unspecified, seed=unspecified)
Alias for kd.random.randint_shaped operator.
kd.randint_shaped_as(x, low=unspecified, high=unspecified, seed=unspecified)
Alias for kd.random.randint_shaped_as operator.
kd.range(start, end=unspecified)
Alias for kd.slices.range operator.
kd.ref(ds)
Alias for kd.core.ref operator.
kd.register_py_fn(f, *, return_type_as=<class 'koladata.types.data_slice.DataSlice'>, unsafe_override=False, **defaults)
Alias for kd.functor.register_py_fn operator.
kd.reify(ds, source)
Alias for kd.core.reify operator.
kd.repeat(x, sizes)
Alias for kd.slices.repeat operator.
kd.repeat_present(x, sizes)
Alias for kd.slices.repeat_present operator.
kd.reshape(x, shape)
Alias for kd.shapes.reshape operator.
kd.reshape_as(x, shape_from)
Alias for kd.shapes.reshape_as operator.
kd.reverse(ds)
Alias for kd.slices.reverse operator.
kd.reverse_select(ds, fltr)
Alias for kd.slices.inverse_select operator.
kd.sample(x, ratio, seed, key=unspecified)
Alias for kd.random.sample operator.
kd.sample_n(x, n, seed, key=unspecified)
Alias for kd.random.sample_n operator.
kd.schema_from_proto(message_class, /, *, extensions=None)
Returns a Koda schema representing a proto message class.
This is similar to `from_proto(x).get_schema()` when `x` is an instance of
`message_class`, except that it eagerly adds all non-extension fields to the
schema instead of only adding fields that have data populated in `x`.
The returned schema is a uuschema whose itemid is a function of the proto
message class' fully qualified name, and any child message classes' schemas
are also uuschemas derived in the same way. The returned schema has the same
itemid as `from_proto(message_class()).get_schema()`.
The format of each extension specified in `extensions` is a dot-separated
sequence of field names and/or extension names, where extension names are
fully-qualified extension paths surrounded by parentheses. For example:
"path.to.field.(package_name.some_extension)"
"path.to.repeated_field.(package_name.some_extension)"
"path.to.map_field.values.(package_name.some_extension)"
"path.(package_name.some_extension).(package_name2.nested_extension)"
Args:
message_class: A proto message class to convert.
extensions: List of proto extension paths.
Returns:
A DataItem containing the converted schema.
kd.schema_from_proto_path(proto_path, /, *, extensions=DataItem(Entity:#5ikYYvXepp19g47QDLnJR2, schema: ITEMID))
Alias for kd.proto.schema_from_proto_path operator.
kd.schema_from_py(tpe)
Alias for kd.schema.schema_from_py operator.
kd.select(ds, fltr, expand_filter=True)
Alias for kd.slices.select operator.
kd.select_items(ds, fltr)
Alias for kd.lists.select_items operator.
kd.select_keys(ds, fltr)
Alias for kd.dicts.select_keys operator.
kd.select_present(ds)
Alias for kd.slices.select_present operator.
kd.select_values(ds, fltr)
Alias for kd.dicts.select_values operator.
kd.set_attr(x, attr_name, value, overwrite_schema=False)
Sets an attribute `attr_name` to `value`.
If `overwrite_schema` is True and `x` is either an Entity with explicit schema
or an Object where some items are entities with explicit schema, it will get
updated with `value`'s schema first.
Args:
x: a DataSlice on which to set the attribute. Must have DataBag attached.
attr_name: attribute name
value: a DataSlice or convertible to a DataSlice that will be assigned as an
attribute.
overwrite_schema: whether to overwrite the schema before setting an
attribute.
kd.set_attrs(x, *, overwrite_schema=False, **attrs)
Sets multiple attributes on an object / entity.
Args:
x: a DataSlice on which attributes are set. Must have DataBag attached.
overwrite_schema: whether to overwrite the schema before setting an
attribute.
**attrs: attribute values that are converted to DataSlices with DataBag
adoption.
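A minimal sketch of both operators (assuming fork_bag() to obtain a mutable DataBag, as documented for DataSlice.fork_bag; the results in the comment are indicative):
```
from koladata import kd

x = kd.new(a=1).fork_bag()   # mutable copy, since set_attr mutates in place
kd.set_attr(x, 'b', 2)       # add a single attribute
kd.set_attrs(x, a=10, c=3)   # set several attributes at once
x.a, x.b, x.c                # -> 10, 2, 3
```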
kd.set_schema(x, schema)
Returns a copy of `x` with the provided `schema`.
If `schema` is an Entity schema and has a different DataBag than `x`, it is
merged into the DataBag of `x`.
It only changes the schemas of `x` and does not change the items in `x`. To
change the items in `x`, use `kd.cast_to` instead. For example,
kd.set_schema(kd.ds([1, 2, 3]), kd.FLOAT32) -> fails because the items in
`x` are not compatible with FLOAT32.
kd.cast_to(kd.ds([1, 2, 3]), kd.FLOAT32) -> kd.ds([1.0, 2.0, 3.0])
When items in `x` are primitives or `schemas` is a primitive schema, it checks
items and schema are compatible. When items are ItemIds and `schema` is a
non-primitive schema, it does not check the underlying data matches the
schema. For example,
kd.set_schema(kd.ds([1, 2, 3], schema=kd.OBJECT), kd.INT32)
-> kd.ds([1, 2, 3])
kd.set_schema(kd.ds([1, 2, 3]), kd.INT64) -> fail
kd.set_schema(kd.ds(1).with_bag(kd.bag()), kd.schema.new_schema(x=kd.INT32)) -> fail
kd.set_schema(kd.new(x=1), kd.INT32) -> fail
kd.set_schema(kd.new(x=1), kd.schema.new_schema(x=kd.INT64)) -> works
Args:
x: DataSlice to change the schema of.
schema: DataSlice containing the new schema.
Returns:
DataSlice with the new schema.
kd.shallow_clone(x, /, *, itemid=unspecified, schema=unspecified, **overrides)
Alias for kd.core.shallow_clone operator.
kd.shuffle(x, /, ndim=unspecified, seed=unspecified)
Alias for kd.random.shuffle operator.
kd.size(x)
Alias for kd.slices.size operator.
kd.slice(x, /, schema=None)
Alias for kd.slices.slice operator.
kd.sort(x, sort_by=unspecified, descending=False)
Alias for kd.slices.sort operator.
kd.stack(*args, ndim=0)
Alias for kd.slices.stack operator.
kd.static_when_tracing(base_type=None)
A constraint that the argument is static when tracing.
It is used to check that the argument is not an expression during tracing to
prevent a common mistake.
Examples:
- combined with checking the type:
@type_checking.check_inputs(value=kd.static_when_tracing(kd.INT32))
- without checking the type:
@type_checking.check_inputs(pick_a=kd.static_when_tracing())
Args:
base_type: (optional) The base type to check against. If not specified, only
checks that the argument is static when tracing.
Returns:
A constraint that the argument is static when tracing.
kd.str(x)
Alias for kd.slices.str operator.
kd.strict_attrs(x, /, **attrs)
Alias for kd.core.strict_attrs operator.
kd.strict_with_attrs(x, /, **attrs)
Alias for kd.core.strict_with_attrs operator.
kd.stub(x, attrs=DataSlice([], schema: NONE))
Alias for kd.core.stub operator.
kd.subslice(x, *slices)
Alias for kd.slices.subslice operator.
kd.sum(x)
Alias for kd.math.sum operator.
kd.take(x, indices)
Alias for kd.slices.at operator.
kd.tile(x, shape)
Alias for kd.slices.tile operator.
kd.to_expr(x)
Alias for kd.schema.to_expr operator.
kd.to_itemid(x)
Alias for kd.schema.get_itemid operator.
kd.to_json(x, /, *, indent=DataItem(None, schema: NONE), ensure_ascii=True, keys_attr='json_object_keys', values_attr='json_object_values', include_missing_values=True)
Alias for kd.json.to_json operator.
kd.to_none(x)
Alias for kd.schema.to_none operator.
kd.to_object(x)
Alias for kd.schema.to_object operator.
kd.to_proto(x, /, message_class)
Converts a DataSlice or DataItem to one or more proto messages.
If `x` is a DataItem, this returns a single proto message object. Otherwise,
`x` must be a 1-D DataSlice, and this returns a list of proto message objects
with the same size as the input. Missing items in the input are returned as
python None in place of a message.
Koda data structures are converted to equivalent proto messages, primitive
fields, repeated fields, maps, and enums, based on the proto schema. Koda
entity attributes are converted to message fields with the same name, if
those fields exist, otherwise they are ignored.
Koda slices with mixed underlying dtypes are tolerated wherever the proto
conversion is defined for all dtypes, regardless of schema.
Koda entity attributes that are parenthesized fully-qualified extension
paths (e.g. "(package_name.some_extension)") are converted to extensions,
if those extensions exist in the descriptor pool of the messages' common
descriptor, otherwise they are ignored.
Args:
x: DataSlice to convert.
message_class: A proto message class.
Returns:
A converted proto message or list of converted proto messages.
kd.to_proto_bytes(x, proto_path, /)
Alias for kd.proto.to_proto_bytes operator.
kd.to_proto_json(x, proto_path, /)
Alias for kd.proto.to_proto_json operator.
kd.to_py(ds, max_depth=2, obj_as_dict=False, include_missing_attrs=True)
Returns a readable python object from a DataSlice.
Attributes, lists, and dicts are recursively converted to Python objects.
Args:
ds: A DataSlice
max_depth: Maximum depth for recursive conversion. Each attribute, list item,
and dict key/value access represents one depth increment. Use -1 for
unlimited depth.
obj_as_dict: Whether to convert objects to python dicts. By default objects
are converted to automatically constructed 'Obj' dataclass instances.
include_missing_attrs: whether to include attributes with None value in
objects.
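For example, a small sketch (the printed forms in the comments are indicative):
```
from koladata import kd

ds = kd.obj(x=1, y=kd.list([1, 2]))
kd.to_py(ds)                    # -> an 'Obj' dataclass instance with x=1, y=[1, 2]
kd.to_py(ds, obj_as_dict=True)  # -> {'x': 1, 'y': [1, 2]}
```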
kd.to_pylist(x)
Expands the outermost DataSlice dimension into a list of DataSlices.
kd.to_pytree(ds, max_depth=2, include_missing_attrs=True)
Returns a readable python object from a DataSlice, with objects converted to
Python dicts. Same as kd.to_py(..., obj_as_dict=True).
kd.to_schema(x)
Alias for kd.schema.to_schema operator.
kd.trace_as_fn(*, name=None, return_type_as=None, functor_factory=None)
Alias for kd.functor.trace_as_fn operator.
kd.trace_py_fn(f, *, auto_variables=True, **defaults)
Alias for kd.functor.trace_py_fn operator.
kd.translate(keys_to, keys_from, values_from)
Alias for kd.slices.translate operator.
kd.translate_group(keys_to, keys_from, values_from)
Alias for kd.slices.translate_group operator.
kd.tuple(*args)
Alias for kd.tuples.tuple operator.
kd.unique(x, sort=False)
Alias for kd.slices.unique operator.
kd.update_schema(obj, **attr_schemas)
Updates the schema of `obj` DataSlice using given schemas for attrs.
kd.updated(ds, *bag)
Alias for kd.core.updated operator.
kd.updated_bag(*bags)
Alias for kd.bags.updated operator.
kd.uu(seed=None, *, schema=None, overwrite_schema=False, **attrs)
Alias for kd.entities.uu operator.
kd.uu_schema(seed='', **kwargs)
Alias for kd.schema.uu_schema operator.
kd.uuid(seed='', **kwargs)
Alias for kd.ids.uuid operator.
kd.uuid_for_dict(seed='', **kwargs)
Alias for kd.ids.uuid_for_dict operator.
kd.uuid_for_list(seed='', **kwargs)
Alias for kd.ids.uuid_for_list operator.
kd.uuids_with_allocation_size(seed='', *, size)
Alias for kd.ids.uuids_with_allocation_size operator.
kd.uuobj(seed=None, **attrs)
Alias for kd.objs.uu operator.
kd.val_like(x, val)
Alias for kd.slices.val_like operator.
kd.val_shaped(shape, val)
Alias for kd.slices.val_shaped operator.
kd.val_shaped_as(x, val)
Alias for kd.slices.val_shaped_as operator.
kd.while_(condition_fn, body_fn, *, returns=unspecified, yields=unspecified, yields_interleaved=unspecified, **initial_state)
Alias for kd.functor.while_ operator.
kd.with_attr(x, attr_name, value, overwrite_schema=False)
Alias for kd.core.with_attr operator.
kd.with_attrs(x, /, *, overwrite_schema=False, **attrs)
Alias for kd.core.with_attrs operator.
kd.with_bag(ds, bag)
Alias for kd.core.with_bag operator.
kd.with_dict_update(x, keys, values=unspecified)
Alias for kd.dicts.with_dict_update operator.
kd.with_list_append_update(x, append)
Alias for kd.lists.with_list_append_update operator.
kd.with_merged_bag(ds)
Alias for kd.core.with_merged_bag operator.
kd.with_metadata(x, /, **attrs)
Alias for kd.core.with_metadata operator.
kd.with_name(obj, name)
Alias for kd.annotation.with_name operator.
kd.with_print(x, *args, sep=' ', end='\n')
Alias for kd.core.with_print operator.
kd.with_schema(x, schema)
Alias for kd.schema.with_schema operator.
kd.with_schema_from_obj(x)
Alias for kd.schema.with_schema_from_obj operator.
kd.xor(x, y)
Alias for kd.masking.xor operator.
kd.zip(*args)
Alias for kd.slices.zip operator.
kd_ext operators
Operators under the kd_ext.xxx modules for extension utilities. Importing from
the following module is needed:
from koladata import kd_ext
Namespaces
External contributions not necessarily endorsed by Koda.
Operators
kd_ext.contrib.value_counts(x)
Returns Dicts mapping entries in `x` to their count over the last dim.
Similar to Pandas' `value_counts`.
The output is a `x.get_ndim() - 1`-dimensional DataSlice containing one
Dict per aggregated row in `x`. Each Dict maps the values to the number of
occurrences (as an INT64) in the final dimension.
Example:
x = kd.slice([[4, 3, 4], [None, 2], [2, 1, 4, 1], [None]])
kd_ext.contrib.value_counts(x)
# -> [Dict{4: 2, 3: 1}, Dict{2: 1}, Dict{2: 1, 1: 2, 4: 1}, Dict{}]
Args:
x: the non-scalar DataSlice to compute occurrences for.
Utilities for manipulating nested data.
Operators
kd_ext.nested_data.selected_path_update(root_ds, selection_ds_path, selection_ds)
Returns a DataBag where only the selected items are present in child lists.
The selection_ds_path must contain at least one list attribute. In general,
all lists must use an explicit list schema; this function does not work for
lists stored as kd.OBJECT.
Example:
```
selection_ds = root_ds.a[:].b.c[:].x > 1
ds = root_ds.updated(selected_path_update(root_ds, ['a', 'b', 'c'], selection_ds))
assert not kd.any(ds.a[:].b.c[:].x <= 1)
```
Args:
root_ds: the DataSlice to be filtered / selected.
selection_ds_path: the path in root_ds where selection_ds should be applied.
selection_ds: the DataSlice defining what is filtered / selected, or a
functor or a Python function that can be evaluated to this DataSlice
passing the given root_ds as its argument.
Returns:
A DataBag where child items along the given path are filtered according to
the @selection_ds. When all items at a level are removed, their parent is
also removed. The output DataBag only contains modified lists, and it may
need to be combined with the @root_ds via
@root_ds.updated(selected_path_update(....)).
Tools for Numpy <-> Koda interoperability.
Operators
kd_ext.npkd.from_array(arr)
Converts a numpy array to a DataSlice.
kd_ext.npkd.get_elements_indices_from_ds(ds)
Returns a list of np arrays representing the DataSlice's indices.
You can consider these as the n-dimensional coordinates of the items, e.g. for a
two-dimensional DataSlice:
[[a, b],
[],
[c, d]] -> [[0, 0, 2, 2], [0, 1, 0, 1]]
Let's explain this:
- 'a' is in the first row and first column, its coordinates are (0, 0)
- 'b' is in the first row and second column, its coordinates are (0, 1)
- 'c' is in the third row and first column, its coordinates are (2, 0)
- 'd' is in the third row and second column, its coordinates are (2, 1)
If we write the y-coordinates first and then the x-coordinates, we get:
[[0, 0, 2, 2], [0, 1, 0, 1]]
The following conditions are satisfied:
- result is always a two-dimensional array;
- number of rows of the result equals the dimensionality of the input;
- each row of the result has the same length and it corresponds to the total
number of items in the DataSlice.
Args:
ds: DataSlice to get indices for.
Returns:
list of np arrays representing the DataSlice's elements indices.
kd_ext.npkd.reshape_based_on_indices(ds, indices)
Reshapes a DataSlice corresponding to the given indices.
Inverse operation to get_elements_indices_from_ds.
Let's explain this based on the following example:
ds: [a, b, c, d]
indices: [[0, 0, 2, 2], [0, 1, 0, 1]]
result: [[a, b], [], [c, d]]
Indices represent y- and x-coordinates of the items in the DataSlice.
- 'a': according to the indices, its coordinates are (0, 0) (the first
elements of the first and second rows of indices, respectively);
it will be placed in the first row and first column of the result;
- 'b': its coordinates are (0, 1); it will be placed in the first row and
second column of the result;
- 'c': its coordinates are (2, 0); it will be placed in the third row and
first column of the result;
- 'd': its coordinates are (2, 1); it will be placed in the third row and
second column of the result.
The result DataSlice will have the same number of items as the original
DataSlice. Its dimensionality will be equal to the number of rows in the
indices.
Args:
ds: DataSlice to reshape; can only be 1D.
indices: list of np arrays representing the DataSlice's indices; it has to
be a list of one-dimensional arrays where each array has the same number of
elements, equal to the number of items in the DataSlice.
Returns:
DataSlice reshaped based on the given indices.
kd_ext.npkd.to_array(ds)
Converts a DataSlice to a numpy array.
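A minimal round-trip sketch for the NumPy interop helpers above (values in the comments are indicative):
```
import numpy as np
from koladata import kd_ext

arr = np.array([1, 2, 3])
ds = kd_ext.npkd.from_array(arr)   # numpy -> 1-D DataSlice
back = kd_ext.npkd.to_array(ds)    # DataSlice -> np.array([1, 2, 3])
kd_ext.npkd.get_elements_indices_from_ds(ds)  # -> [array([0, 1, 2])]
```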
Tools for Pandas <-> Koda interoperability.
Operators
kd_ext.pdkd.df(ds, cols=None, include_self=False)
Aliases:
Creates a pandas DataFrame from the given DataSlice.
If `ds` has no dimensions, it will be converted to a single-row DataFrame. If
it has one dimension, it will be converted to a 1D DataFrame. If it has more than
one dimension, it will be converted to a MultiIndex DataFrame with index
columns corresponding to each dimension.
When `cols` is not specified, DataFrame columns are inferred from `ds`.
1) If `ds` has primitives, lists, dicts or ITEMID schema, a single
column named 'self_' is used and items themselves are extracted.
2) If `ds` has entity schema, all attributes from `ds` are extracted as
columns.
3) If `ds` has OBJECT schema, the union of attributes from all objects in
`ds` are used as columns. Missing values are filled if objects do not
have corresponding attributes.
For example,
ds = kd.slice([1, 2, 3])
to_dataframe(ds) -> extract 'self_'
ds = kd.new(x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6]))
to_dataframe(ds) -> extract 'x' and 'y'
ds = kd.slice([kd.obj(x=1, y='a'), kd.obj(x=2), kd.obj(x=3, y='c')])
to_dataframe(ds) -> extract 'x', 'y'
`cols` can be used to specify which data from the DataSlice should be
extracted as DataFrame columns. It can contain either the string names of
attributes or Exprs which can be evaluated on the DataSlice. If `ds` has
OBJECT schema, specified attributes must be present in all objects in `ds`. To
ignore objects which do not have specific attributes, one can use
`S.maybe(attr)` in `cols`. For example,
ds = kd.slice([1, 2, 3])
to_dataframe(ds) -> extract 'self_'
ds = kd.new(x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6]))
to_dataframe(ds, ['x']) -> extract 'x'
to_dataframe(ds, [S.x, S.x + S.y]) -> extract 'S.x' and 'S.x + S.y'
ds = kd.slice([kd.obj(x=1, y='a'), kd.obj(x=2), kd.obj(x=3, y='c')])
to_dataframe(ds, ['x']) -> extract 'x'
to_dataframe(ds, [S.y]) -> raise an exception as 'y' does not exist in
kd.obj(x=2)
to_dataframe(ds, [S.maybe('y')]) -> extract 'y' but ignore items which
do not have the 'y' attribute.
If extracted column DataSlices have different shapes, they will be aligned to
the same dimensions. For example,
ds = kd.new(
x = kd.slice([1, 2, 3]),
y=kd.list(kd.new(z=kd.slice([[4], [5], [6]]))),
z=kd.list(kd.new(z=kd.slice([[4, 5], [], [6]]))),
)
to_dataframe(ds, cols=[S.x, S.y[:].z]) -> extract 'S.x' and 'S.y[:].z':
       'x' 'y[:].z'
  0 0    1        4
  1 0    2        5
  2 0    3        6
to_dataframe(ds, cols=[S.y[:].z, S.z[:].z]) -> error: shapes mismatch
The conversion adheres to:
* All output data will be of nullable types (e.g. `Int64Dtype()` rather than
`np.int64`)
* `pd.NA` is used for missing values.
* Numeric dtypes, booleans and strings will use corresponding pandas dtypes.
* MASK will be converted to pd.BooleanDtype(), with `kd.present => True` and
`kd.missing => pd.NA`.
* All other dtypes (including a mixed DataSlice) will use the `object` dtype
holding python data, with missing values represented through `pd.NA`.
`kd.present` is converted to True.
Args:
ds: DataSlice to convert.
cols: list of columns to extract from DataSlice. If None all attributes will
be extracted.
include_self: whether to include the 'self_' column. 'self_' column is
always included if `cols` is None and `ds` contains primitives/lists/dicts
or it has ITEMID schema.
Returns:
DataFrame with columns from DataSlice fields.
kd_ext.pdkd.from_dataframe(df_, as_obj=False)
Creates a DataSlice from the given pandas DataFrame.
The DataFrame must have at least one column. It will be converted to a
DataSlice of entities/objects with attributes corresponding to the DataFrame
columns. Supported column dtypes include all primitive dtypes and ItemId.
If the DataFrame has MultiIndex, it will be converted to a DataSlice with
the shape derived from the MultiIndex.
When `as_obj` is set, the resulting DataSlice will be a DataSlice of objects
instead of entities.
The conversion adheres to:
* All missing values (according to `pd.isna`) become missing values in the
resulting DataSlice.
* Data with `object` dtype is converted to an OBJECT DataSlice.
* Data with other dtypes is converted to a DataSlice with corresponding
schema.
Args:
df_: pandas DataFrame to convert.
as_obj: whether to convert the resulting DataSlice to Objects.
Returns:
DataSlice of items with attributes from DataFrame columns.
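For example, a minimal sketch of the round trip (column values in the comments are indicative):
```
import pandas as pd
from koladata import kd_ext

df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
ds = kd_ext.pdkd.from_dataframe(df)   # entities with attributes 'x' and 'y'
ds.x                                  # -> kd.slice([1, 2, 3])
df2 = kd_ext.pdkd.to_dataframe(ds)    # back to a DataFrame with 'x' and 'y'
```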
kd_ext.pdkd.to_dataframe(ds, cols=None, include_self=False)
Alias for kd_ext.pdkd.df operator.
Koda visualization functionality.
Operators
kd_ext.vis.AccessType(*values)
Aliases:
Types of accesses that can appear in an access path.
kd_ext.vis.DataSliceVisOptions(num_items=48, unbounded_type_max_len=256, detail_width=None, detail_height=300, attr_limit=20, item_limit=20)
Aliases:
Options for visualizing a DataSlice.
kd_ext.vis.DescendMode(*values)
Aliases:
kd_ext.vis.register_formatters()
Aliases:
Register DataSlice visualization in IPython.
kd_ext.vis.visualize_slice(ds, options=None)
Aliases:
Visualizes a DataSlice as a html widget.
Operators
kd_ext.Fn(f, *, use_tracing=True, **kwargs)
Alias for kd.functor.fn operator.
kd_ext.PyFn(f, *, return_type_as=<class 'koladata.types.data_slice.DataSlice'>, **defaults)
Alias for kd.functor.py_fn operator.
kd_ext.py_cloudpickle(obj)
Wraps into a Arolla QValue using cloudpickle for serialization.
DataSlice methods
DataSlice represents a jagged array of items (i.e. primitive values, ItemIds).
Operators
DataSlice.L
ListSlicing helper for DataSlice.
x.L on DataSlice returns a ListSlicingHelper, which treats the first dimension
of DataSlice x as a list.
DataSlice.S
Slicing helper for DataSlice.
It is syntactic sugar for kd.subslice. That is, kd.subslice(ds, *slices)
is equivalent to ds.S[*slices]. For example,
kd.subslice(x, 0) == x.S[0]
kd.subslice(x, 0, 1, kd.item(0)) == x.S[0, 1, kd.item(0)]
kd.subslice(x, slice(0, -1)) == x.S[0:-1]
kd.subslice(x, slice(0, -1), slice(0, 1), slice(1, None))
== x.S[0:-1, 0:1, 1:]
kd.subslice(x, ..., slice(1, None)) == x.S[..., 1:]
kd.subslice(x, slice(1, None)) == x.S[1:]
Please see kd.subslice for more detailed explanations and examples.
DataSlice.append(value, /)
Aliases:
Append a value to each list in this DataSlice
DataSlice.clear()
Aliases:
Clears all dicts or lists in this DataSlice
DataSlice.clone(self, *, itemid=unspecified, schema=unspecified, **overrides)
Aliases:
Creates a DataSlice with clones of provided entities in a new DataBag.
The entities themselves are cloned (with new ItemIds) and their attributes are
extracted (with the same ItemIds).
Also see kd.shallow_clone and kd.deep_clone.
Note that unlike kd.deep_clone, if there are multiple references to the same
entity, the returned DataSlice will have multiple clones of it rather than
references to the same clone.
Args:
x: The DataSlice to copy.
itemid: The ItemId to assign to cloned entities. If not specified, new
ItemIds will be allocated.
schema: The schema to resolve attributes, and also to assign the schema to
the resulting DataSlice. If not specified, will use the schema of `x`.
**overrides: attribute overrides.
Returns:
A copy of the entities where entities themselves are cloned (new ItemIds)
and all of the rest extracted.
DataSlice.deep_clone(self, schema=unspecified, **overrides)
Aliases:
Creates a slice with a (deep) copy of the given slice.
The entities themselves and all their attributes including both top-level and
non-top-level attributes are cloned (with new ItemIds).
Also see kd.shallow_clone and kd.clone.
Note that unlike kd.clone, if there are multiple references to the same entity
in `x`, or multiple ways to reach one entity through attributes, there will be
exactly one clone made per entity.
Args:
x: The slice to copy.
schema: The schema to use to find attributes to clone, and also to assign
the schema to the resulting DataSlice. If not specified, will use the
schema of 'x'.
**overrides: attribute overrides.
Returns:
A (deep) copy of the given DataSlice.
All referenced entities will be copied with newly allocated ItemIds. Note
that UUIDs will be copied as ItemIds.
DataSlice.deep_uuid(self, schema=unspecified, *, seed=DataItem('', schema: STRING))
Aliases:
Recursively computes uuid for x.
Args:
x: The slice to take uuid on.
schema: The schema to use to resolve '*' and '**' tokens. If not specified,
will use the schema of the 'x' DataSlice.
seed: The seed to use for uuid computation.
Returns:
Result of recursive uuid application `x`.
DataSlice.dict_size(self)
Aliases:
Returns size of a Dict.
DataSlice.display(self, options=None)
Aliases:
Visualizes a DataSlice as an html widget.
Args:
self: The DataSlice to visualize.
options: This should be a `koladata.ext.vis.DataSliceVisOptions`.
DataSlice.embed_schema()
Aliases:
Returns a DataSlice with OBJECT schema.
* For primitives no data change is done.
* For Entities schema is stored as '__schema__' attribute.
* Embedding Entities requires a DataSlice to be associated with a DataBag.
DataSlice.enriched(self, *bag)
Aliases:
Returns a copy of a DataSlice with additional fallback DataBag(s).
Values in the original DataBag of `ds` take precedence over the ones in
`*bag`.
The DataBag attached to the result is a new immutable DataBag that falls back
to the DataBag of `ds` if present and then to `*bag`.
`enriched(x, a, b)` is equivalent to `enriched(enriched(x, a), b)`, and so on
for additional DataBag args.
Args:
ds: DataSlice.
*bag: additional fallback DataBag(s).
Returns:
DataSlice with additional fallbacks.
DataSlice.expand_to(self, target, ndim=unspecified)
Aliases:
Expands `x` based on the shape of `target`.
When `ndim` is not set, expands `x` to the shape of
`target`. The dimensions of `x` must be the same as the first N
dimensions of `target` where N is the number of dimensions of `x`. For
example,
Example 1:
x: kd.slice([[1, 2], [3]])
target: kd.slice([[[0], [0, 0]], [[0, 0, 0]]])
result: kd.slice([[[1], [2, 2]], [[3, 3, 3]]])
Example 2:
x: kd.slice([[1, 2], [3]])
target: kd.slice([[[0]], [[0, 0, 0]]])
result: incompatible shapes
Example 3:
x: kd.slice([[1, 2], [3]])
target: kd.slice([0, 0])
result: incompatible shapes
When `ndim` is set, the expansion is performed in 3 steps:
1) the last N dimensions of `x` are first imploded into lists
2) the expansion operation is performed on the DataSlice of lists
3) the lists in the expanded DataSlice are exploded
The result will have M + ndim dimensions where M is the number
of dimensions of `target`.
For example,
Example 4:
x: kd.slice([[1, 2], [3]])
target: kd.slice([[1], [2, 3]])
ndim: 1
result: kd.slice([[[1, 2]], [[3], [3]]])
Example 5:
x: kd.slice([[1, 2], [3]])
target: kd.slice([[1], [2, 3]])
ndim: 2
result: kd.slice([[[[1, 2], [3]]], [[[1, 2], [3]], [[1, 2], [3]]]])
Args:
x: DataSlice to expand.
target: target DataSlice.
ndim: the number of dimensions to implode during expansion.
Returns:
Expanded DataSlice
DataSlice.explode(self, ndim=DataItem(1, schema: INT64))
Aliases:
Explodes a List DataSlice `x` a specified number of times.
A single list "explosion" converts a rank-K DataSlice of LIST[T] to a
rank-(K+1) DataSlice of T, by unpacking the items in the Lists in the original
DataSlice as a new DataSlice dimension in the result. Missing values in the
original DataSlice are treated as empty lists.
A single list explosion can also be done with `x[:]`.
If `ndim` is set to a non-negative integer, explodes recursively `ndim` times.
An `ndim` of zero is a no-op.
If `ndim` is set to a negative integer, explodes as many times as possible,
until at least one of the items of the resulting DataSlice is not a List.
Args:
x: DataSlice of Lists to explode
ndim: the number of explosion operations to perform, defaults to 1
Returns:
DataSlice
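A short sketch of the explosion levels (results in the comments are indicative):
```
from koladata import kd

x = kd.list([[1, 2], [3]])   # a DataItem holding a nested Koda List
x.explode()                  # one explosion: a 1-D DataSlice of two Lists
x.explode(ndim=-1)           # explode fully -> kd.slice([[1, 2], [3]])
x[:][:]                      # equivalent to x.explode(ndim=2)
```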
DataSlice.extract(self, schema=unspecified)
Aliases:
Creates a DataSlice with a new DataBag containing only reachable attrs.
Args:
ds: DataSlice to extract.
schema: schema of the extracted DataSlice.
Returns:
A DataSlice with a new immutable DataBag attached.
DataSlice.extract_bag(self, schema=unspecified)
Aliases:
Creates a new DataBag containing only reachable attrs from 'ds'.
Args:
ds: DataSlice to extract.
schema: schema of the extracted DataSlice.
Returns:
A new immutable DataBag with only the reachable attrs from 'ds'.
DataSlice.flatten(self, from_dim=DataItem(0, schema: INT64), to_dim=unspecified)
Aliases:
Returns `x` with dimensions `[from_dim:to_dim]` flattened.
Indexing works as in python:
* If `to_dim` is unspecified, `to_dim = rank()` is used.
* If `to_dim < from_dim`, `to_dim = from_dim` is used.
* If `to_dim < 0`, `max(0, to_dim + rank())` is used. The same goes for
`from_dim`.
* If `to_dim > rank()`, `rank()` is used. The same goes for `from_dim`.
The above-mentioned adjustments place both `from_dim` and `to_dim` in the
range `[0, rank()]`. After adjustments, the new DataSlice has `rank() ==
old_rank - (to_dim - from_dim) + 1`. Note that if `from_dim == to_dim`, a
"unit" dimension is inserted at `from_dim`.
Example:
# Flatten the last two dimensions into a single dimension, producing a
# DataSlice with `rank = old_rank - 1`.
kd.get_shape(x) # -> JaggedShape(..., [2, 1], [7, 5, 3])
flat_x = kd.flatten(x, -2)
kd.get_shape(flat_x) # -> JaggedShape(..., [12, 3])
# Flatten all dimensions except the last, producing a DataSlice with
# `rank = 2`.
kd.get_shape(x) # -> JaggedShape(..., [7, 5, 3])
flat_x = kd.flatten(x, 0, -1)
kd.get_shape(flat_x) # -> JaggedShape([3], [7, 5, 3])
# Flatten all dimensions.
kd.get_shape(x) # -> JaggedShape([3], [7, 5, 3])
flat_x = kd.flatten(x)
kd.get_shape(flat_x) # -> JaggedShape([15])
Args:
x: a DataSlice.
from_dim: start of dimensions to flatten. Defaults to `0` if unspecified.
to_dim: end of dimensions to flatten. Defaults to `rank()` if unspecified.
DataSlice.flatten_end(self, n_times=DataItem(1, schema: INT64))
Aliases:
Returns `x` with a shape flattened `n_times` from the end.
The new shape has x.get_ndim() - n_times dimensions.
Given that flattening happens from the end, only positive integers are
allowed. For more control over flattening, please use `kd.flatten`, instead.
Args:
x: a DataSlice.
n_times: number of dimensions to flatten from the end
(0 <= n_times <= rank).
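For example, a sketch with a 3-D input (results in the comments are indicative):
```
from koladata import kd

x = kd.slice([[[1, 2], [3]], [[4]]])
x.flatten_end()    # -> kd.slice([[1, 2, 3], [4]]), one dimension fewer
x.flatten_end(2)   # -> kd.slice([1, 2, 3, 4]), two dimensions fewer
```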
DataSlice.follow(self)
Aliases:
Returns the original DataSlice from a NoFollow DataSlice.
When a DataSlice is wrapped into a NoFollow DataSlice, its attributes
are not further traversed during extract, clone, deep_clone, etc.
The `kd.follow` operator converts the DataSlice back to a traversable DataSlice.
Inverse of `nofollow`.
Args:
x: DataSlice to unwrap, if nofollowed.
DataSlice.fork_bag(self)
Aliases:
Returns a copy of the DataSlice with a forked mutable DataBag.
DataSlice.freeze_bag()
Aliases:
Returns a frozen DataSlice equivalent to `self`.
DataSlice.from_vals(x, /, schema=None)
Alias for kd.slices.slice operator.
DataSlice.get_attr(attr_name, /, default=None)
Aliases:
Gets attribute `attr_name` where missing items are filled from `default`.
Args:
attr_name: name of the attribute to get.
default: optional default value to fill missing items.
Note that this value can be fully omitted.
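For example, a sketch over OBJECTs with differing attributes (results in the comments are indicative):
```
from koladata import kd

x = kd.slice([kd.obj(a=1), kd.obj(b=2)])
x.get_attr('a', default=0)   # -> kd.slice([1, 0])
x.maybe('a')                 # -> kd.slice([1, None]), i.e. default=None
```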
DataSlice.get_attr_names(*, intersection)
Aliases:
Returns a sorted list of unique attribute names of this DataSlice.
In case of OBJECT schema, attribute names are fetched from the `__schema__`
attribute. In case of Entity schema, the attribute names are fetched from the
schema. In case of primitives, an empty list is returned.
Args:
intersection: If True, the intersection of all object attributes is returned.
Otherwise, the union is returned.
Returns:
A list of unique attributes sorted by alphabetical order.
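For example (a sketch; the results follow the union/intersection rule above):
```
from koladata import kd

x = kd.slice([kd.obj(a=1, b=2), kd.obj(a=3)])
x.get_attr_names(intersection=True)    # -> ['a']
x.get_attr_names(intersection=False)   # -> ['a', 'b']
```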
DataSlice.get_bag()
Aliases:
Returns the attached DataBag.
DataSlice.get_dtype(self)
Aliases:
Returns a primitive schema representing the underlying items' dtype.
If `ds` has a primitive schema, this returns that primitive schema, even if
all items in `ds` are missing. If `ds` has an OBJECT schema but contains
primitive values of a single dtype, it returns the schema for that primitive
dtype.
If the items in `ds` have non-primitive types or mixed dtypes, it returns
a missing schema (i.e. `kd.item(None, kd.SCHEMA)`).
Examples:
kd.get_primitive_schema(kd.slice([1, 2, 3])) -> kd.INT32
kd.get_primitive_schema(kd.slice([None, None, None], kd.INT32)) -> kd.INT32
kd.get_primitive_schema(kd.slice([1, 2, 3], kd.OBJECT)) -> kd.INT32
kd.get_primitive_schema(kd.slice([1, 'a', 3], kd.OBJECT)) -> missing schema
kd.get_primitive_schema(kd.obj()) -> missing schema
Args:
ds: DataSlice to get dtype from.
Returns:
a primitive schema DataSlice.
DataSlice.get_item_schema(self)
Aliases:
Returns the item schema of a List schema.
DataSlice.get_itemid(self)
Aliases:
Casts `x` to ITEMID using explicit (permissive) casting rules.
DataSlice.get_key_schema(self)
Aliases:
Returns the key schema of a Dict schema.
DataSlice.get_keys()
Aliases:
Returns keys of all dicts in this DataSlice.
DataSlice.get_ndim(self)
Aliases:
Returns the number of dimensions of DataSlice `x`.
DataSlice.get_obj_schema(self)
Aliases:
Returns a DataSlice of schemas for Objects and primitives in `x`.
DataSlice `x` must have OBJECT schema.
Examples:
db = kd.bag()
s = db.new_schema(a=kd.INT32)
obj = s(a=1).embed_schema()
kd.get_obj_schema(kd.slice([1, None, 2.0, obj]))
-> kd.slice([kd.INT32, NONE, kd.FLOAT32, s])
Args:
x: OBJECT DataSlice
Returns:
A DataSlice of schemas.
DataSlice.get_present_count(self)
Aliases:
Returns the count of present items over all dimensions.
The result is a zero-dimensional DataItem.
Args:
x: A DataSlice of numbers.
DataSlice.get_schema()
Aliases:
Returns a schema DataItem with type information about this DataSlice.
DataSlice.get_shape()
Aliases:
Returns the shape of the DataSlice.
DataSlice.get_size(self)
Aliases:
Returns the number of items in `x`, including missing items.
Args:
x: A DataSlice.
Returns:
The size of `x`.
DataSlice.get_sizes(self)
Aliases:
Returns a DataSlice of sizes of the DataSlice's shape.
DataSlice.get_value_schema(self)
Aliases:
Returns the value schema of a Dict schema.
DataSlice.get_values(self, key_ds=unspecified)
Aliases:
Returns values corresponding to `key_ds` for dicts in `dict_ds`.
When `key_ds` is specified, it is equivalent to dict_ds[key_ds].
When `key_ds` is unspecified, it returns all values in `dict_ds`. The result
DataSlice has one more dimension used to represent values in each dict than
`dict_ds`. While the order of values within a dict is arbitrary, it is the
same as get_keys().
Args:
dict_ds: DataSlice of Dicts.
key_ds: DataSlice of keys or unspecified.
Returns:
A DataSlice of values.
DataSlice.has_attr(self, attr_name)
Aliases:
Indicates whether the items in `x` DataSlice have the given attribute.
This function checks for attributes based on data rather than "schema" and may
be slow in some cases.
Args:
x: DataSlice
attr_name: Name of the attribute to check.
Returns:
A MASK DataSlice with the same shape as `x` that contains present if the
attribute exists for the corresponding item.
DataSlice.has_bag()
Aliases:
Returns `present` if DataSlice `ds` has a DataBag attached.
DataSlice.implode(self, ndim=DataItem(1, schema: INT64), itemid=unspecified)
Aliases:
Implodes a DataSlice `x` a specified number of times.
A single list "implosion" converts a rank-(K+1) DataSlice of T to a rank-K
DataSlice of LIST[T], by folding the items in the last dimension of the
original DataSlice into newly-created Lists.
If `ndim` is set to a non-negative integer, implodes recursively `ndim` times.
If `ndim` is set to a negative integer, implodes as many times as possible,
until the result is a DataItem (i.e. a rank-0 DataSlice) containing a single
nested List.
Args:
x: the DataSlice to implode
ndim: the number of implosion operations to perform
itemid: optional ITEMID DataSlice used as ItemIds of the resulting lists.
Returns:
DataSlice of nested Lists
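A short sketch mirroring explode (results in the comments are indicative):
```
from koladata import kd

x = kd.slice([[1, 2], [3]])
x.implode()          # folds the last dimension: a 1-D DataSlice of two Lists
x.implode(ndim=-1)   # folds fully: a DataItem holding the List [[1, 2], [3]]
```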
DataSlice.internal_as_arolla_value()
Aliases:
Converts primitive DataSlice / DataItem into an equivalent Arolla value.
DataSlice.internal_as_dense_array()
Aliases:
Converts primitive DataSlice to an Arolla DenseArray with appropriate qtype.
DataSlice.internal_as_py()
Aliases:
Returns a Python object equivalent to this DataSlice.
If the values in this DataSlice represent objects, then the returned python
structure will contain DataItems.
DataSlice.internal_is_itemid_schema()
Aliases:
Returns present iff this DataSlice is ITEMID Schema.
DataSlice.is_dict()
Aliases:
Returns present iff this DataSlice has Dict schema or contains only dicts.
DataSlice.is_dict_schema()
Aliases:
Returns present iff this DataSlice is a Dict Schema.
DataSlice.is_empty()
Aliases:
Returns present iff this DataSlice is empty.
DataSlice.is_entity()
Aliases:
Returns present iff this DataSlice has Entity schema or contains only entities.
DataSlice.is_entity_schema()
Aliases:
Returns present iff this DataSlice represents an Entity Schema.
DataSlice.is_list()
Aliases:
Returns present iff this DataSlice has List schema or contains only lists.
DataSlice.is_list_schema()
Aliases:
Returns present iff this DataSlice is a List Schema.
DataSlice.is_mutable()
Aliases:
Returns present iff the attached DataBag is mutable.
DataSlice.is_primitive(self)
Aliases:
Returns whether x is a primitive DataSlice.
`x` is a primitive DataSlice if it meets one of the following conditions:
1) it has a primitive schema
2) it has OBJECT/SCHEMA schema and only has primitives
Also see `kd.has_primitive` for a pointwise version. But note that
`kd.all(kd.has_primitive(x))` is not always equivalent to
`kd.is_primitive(x)`. For example,
kd.is_primitive(kd.int32(None)) -> kd.present
kd.all(kd.has_primitive(kd.int32(None))) -> invalid for kd.all
kd.is_primitive(kd.int32([None])) -> kd.present
kd.all(kd.has_primitive(kd.int32([None]))) -> kd.missing
Args:
x: DataSlice to check.
Returns:
A MASK DataItem.
DataSlice.is_primitive_schema()
Aliases:
Returns present iff this DataSlice is a primitive (scalar) Schema.
DataSlice.is_struct_schema()
Aliases:
Returns present iff this DataSlice represents a Struct Schema.
DataSlice.list_size(self)
Aliases:
Returns size of a List.
DataSlice.maybe(self, attr_name)
Aliases:
A shortcut for kd.get_attr(x, attr_name, default=None).
DataSlice.new(self, **attrs)
Aliases:
Returns a new Entity with this Schema.
DataSlice.no_bag()
Aliases:
Returns a copy of DataSlice without DataBag.
DataSlice.pop(index, /)
Aliases:
Pop a value from each list in this DataSlice
DataSlice.ref(self)
Aliases:
Returns `ds` with the DataBag removed.
Unlike `no_bag`, `ds` is required to hold ItemIds and no primitives are
allowed.
The result DataSlice still has the original schema. If the schema is an Entity
schema (including List/Dict schema), it is treated as an ItemId after the DataBag
is removed.
Args:
ds: DataSlice of ItemIds.
DataSlice.repeat(self, sizes)
Aliases:
Returns `x` with values repeated according to `sizes`.
The resulting DataSlice has `rank = x.get_ndim() + 1`. The input `sizes` are
broadcasted to `x`, and each value is repeated the given number of times.
Example:
ds = kd.slice([[1, None], [3]])
sizes = kd.slice([[1, 2], [3]])
kd.repeat(ds, sizes) # -> kd.slice([[[1], [None, None]], [[3, 3, 3]]])
ds = kd.slice([[1, None], [3]])
sizes = kd.slice([2, 3])
kd.repeat(ds, sizes) # -> kd.slice([[[1, 1], [None, None]], [[3, 3, 3]]])
ds = kd.slice([[1, None], [3]])
size = kd.item(2)
kd.repeat(ds, size) # -> kd.slice([[[1, 1], [None, None]], [[3, 3]]])
Args:
x: A DataSlice of data.
sizes: A DataSlice of sizes that each value in `x` should be repeated for.
DataSlice.reshape(self, shape)
Aliases:
Returns a DataSlice with the provided shape.
Examples:
x = kd.slice([1, 2, 3, 4])
# Using a shape.
kd.reshape(x, kd.shapes.new(2, 2)) # -> kd.slice([[1, 2], [3, 4]])
# Using a tuple of sizes.
kd.reshape(x, kd.tuple(2, 2)) # -> kd.slice([[1, 2], [3, 4]])
# Using a tuple of sizes and a placeholder dimension.
kd.reshape(x, kd.tuple(-1, 2)) # -> kd.slice([[1, 2], [3, 4]])
# Using a tuple of slices and a placeholder dimension.
kd.reshape(x, kd.tuple(-1, kd.slice([3, 1])))
# -> kd.slice([[1, 2, 3], [4]])
# Reshaping a scalar.
kd.reshape(1, kd.tuple(1, 1)) # -> kd.slice([[1]])
# Reshaping an empty slice.
kd.reshape(kd.slice([]), kd.tuple(2, 0)) # -> kd.slice([[], []])
Args:
x: a DataSlice.
shape: a JaggedShape or a tuple of dimensions that forms a shape through
`kd.shapes.new`, with additional support for a `-1` placeholder dimension.
DataSlice.reshape_as(self, shape_from)
Aliases:
Returns a DataSlice x reshaped to the shape of DataSlice shape_from.
DataSlice.select(self, fltr, expand_filter=DataItem(True, schema: BOOLEAN))
Aliases:
Creates a new DataSlice by filtering out missing items in fltr.
It is not supported for DataItems because their sizes are always 1.
The dimensions of `fltr` need to be compatible with the dimensions of `ds`.
By default, `fltr` is expanded to 'ds' and items in `ds` corresponding to
missing items in `fltr` are removed. The last dimension of the resulting
DataSlice is changed while the first N-1 dimensions are the same as those in
`ds`.
Example:
val = kd.slice([[1, None, 4], [None], [2, 8]])
kd.select(val, val > 3) -> [[4], [], [8]]
fltr = kd.slice(
[[None, kd.present, kd.present], [kd.present], [kd.present, None]])
kd.select(val, fltr) -> [[None, 4], [None], [2]]
fltr = kd.slice([kd.present, kd.present, None])
kd.select(val, fltr) -> [[1, None, 4], [None], []]
kd.select(val, fltr, expand_filter=False) -> [[1, None, 4], [None]]
Args:
ds: DataSlice with ndim > 0 to be filtered.
fltr: filter DataSlice with dtype as kd.MASK. It can also be a Koda Functor
or a Python function which can be evaluated to such DataSlice. A Python
function will be traced for evaluation, so it cannot have Python control
flow operations such as `if` or `while`.
expand_filter: flag indicating if the 'filter' should be expanded to 'ds'
Returns:
Filtered DataSlice.
DataSlice.select_items(self, fltr)
Aliases:
Selects List items by filtering out missing items in fltr.
Also see kd.select.
Args:
ds: List DataSlice to be filtered
fltr: filter can be a DataSlice with dtype as kd.MASK. It can also be a Koda
Functor or a Python function which can be evaluated to such DataSlice. A
Python function will be traced for evaluation, so it cannot have Python
control flow operations such as `if` or `while`.
Returns:
Filtered DataSlice.
DataSlice.select_keys(self, fltr)
Aliases:
Selects Dict keys by filtering out missing items in `fltr`.
Also see kd.select.
Args:
ds: Dict DataSlice to be filtered
fltr: filter DataSlice with dtype as kd.MASK or a Koda Functor or a Python
function which can be evaluated to such DataSlice. A Python function will
be traced for evaluation, so it cannot have Python control flow operations
such as `if` or `while`.
Returns:
Filtered DataSlice.
DataSlice.select_present(self)
Aliases:
Creates a new DataSlice by removing missing items.
It is not supported for DataItems because their sizes are always 1.
Example:
val = kd.slice([[1, None, 4], [None], [2, 8]])
kd.select_present(val) -> [[1, 4], [], [2, 8]]
Args:
ds: DataSlice with ndim > 0 to be filtered.
Returns:
Filtered DataSlice.
DataSlice.select_values(self, fltr)
Aliases:
Selects Dict values by filtering out missing items in `fltr`.
Also see kd.select.
Args:
ds: Dict DataSlice to be filtered
fltr: filter DataSlice with dtype as kd.MASK or a Koda Functor or a Python
function which can be evaluated to such DataSlice. A Python function will
be traced for evaluation, so it cannot have Python control flow operations
such as `if` or `while`.
Returns:
Filtered DataSlice.
DataSlice.set_attr(attr_name, value, /, overwrite_schema=False)
Aliases:
Sets an attribute `attr_name` to `value`.
Requires DataSlice to have a mutable DataBag attached. Compared to
`__setattr__`, it allows overwriting the schema for attribute `attr_name` when
`overwrite_schema` is True. Additionally, it allows `attr_name` to be a
non-Python-identifier (e.g. "123-f", "5", "%#$", etc.). `attr_name` still has to
be a valid UTF-8 unicode.
Args:
attr_name: UTF-8 unicode representing the attribute name.
value: new value for attribute `attr_name`.
overwrite_schema: if True, schema for attribute is always updated.
DataSlice.set_attrs(*, overwrite_schema=False, **attrs)
Aliases:
Sets multiple attributes on an object / entity.
Args:
overwrite_schema: (bool) overwrite schema if attribute schema is missing or
incompatible.
**attrs: attribute values that are converted to DataSlices with DataBag
adoption.
DataSlice.set_schema(schema, /)
Aliases:
Returns a copy of DataSlice with the provided `schema`.
If `schema` has a different DataBag than the DataSlice, `schema` is merged into
the DataBag of the DataSlice. See kd.set_schema for more details.
Args:
schema: schema DataSlice to set.
Returns:
DataSlice with the provided `schema`.
DataSlice.shallow_clone(self, *, itemid=unspecified, schema=unspecified, **overrides)
Aliases:
Creates a DataSlice with shallow clones of immediate attributes.
The entities themselves get new ItemIds and their top-level attributes are
copied by reference.
Also see kd.clone and kd.deep_clone.
Note that unlike kd.deep_clone, if there are multiple references to the same
entity, the returned DataSlice will have multiple clones of it rather than
references to the same clone.
Args:
x: The DataSlice to copy.
itemid: The ItemId to assign to cloned entities. If not specified, will
allocate new ItemIds.
schema: The schema to resolve attributes, and also to assign the schema to
the resulting DataSlice. If not specified, will use the schema of 'x'.
**overrides: attribute overrides.
Returns:
A copy of the entities with new ItemIds where all top-level attributes are
copied by reference.
DataSlice.strict_with_attrs(self, **attrs)
Aliases:
Returns a DataSlice with a new DataBag containing updated attrs in `x`.
Strict version of kd.attrs disallowing adding new attributes.
Args:
x: Entity for which the attributes update is being created.
**attrs: attrs to set in the update.
DataSlice.stub(self, attrs=DataSlice([], schema: NONE))
Aliases:
Copies a DataSlice's schema stub to a new DataBag.
The "schema stub" of a DataSlice is a subset of its schema (including embedded
schemas) that contains just enough information to support direct updates to
that DataSlice.
Optionally copies `attrs` schema attributes to the new DataBag as well.
This method works for items, objects, and for lists and dicts stored as items
or objects. The intended usage is to add new attributes to the object in the
new bag, or new items to the dict in the new bag, and then to be able
to merge the bags to obtain a union of attributes/values. For lists, we
extract the list with stubs for list items, which also works recursively so
nested lists are deep-extracted. Note that if you modify the list afterwards
by appending or removing items, you will no longer be able to merge the result
with the original bag.
Args:
x: DataSlice to extract the schema stub from.
attrs: Optional list of additional schema attribute names to copy. The
schemas for those attributes will be copied recursively (so including
attributes of those attributes etc).
Returns:
DataSlice with the same schema stub in the new DataBag.
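A sketch of the intended update workflow, assuming kd.attrs (referenced under with_attrs below) to build the attribute update:
```
from koladata import kd

x = kd.new(a=1)
upd = kd.attrs(x.stub(), b=2)   # build the update against the lightweight stub
x.updated(upd).b                # -> 2; the stub's bag merges with the original
```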
DataSlice.take(self, indices)
Aliases:
Returns a new DataSlice with items at provided indices.
`indices` must have INT32 or INT64 dtype or OBJECT schema holding INT32 or
INT64 items.
Indices in the DataSlice `indices` are based on the last dimension of the
DataSlice `x`. Negative indices are supported and out-of-bound indices result
in missing items.
If ndim(x) - 1 > ndim(indices), indices are broadcasted to shape(x)[:-1].
If ndim(x) <= ndim(indices), indices are unchanged but shape(x)[:-1] must be
broadcastable to shape(indices).
Example:
x = kd.slice([[1, None, 2], [3, 4]])
kd.take(x, kd.item(1)) # -> kd.slice([[None, 4]])
kd.take(x, kd.slice([0, 1])) # -> kd.slice([1, 4])
kd.take(x, kd.slice([[0, 1], [1]])) # -> kd.slice([[1, None], [4]])
kd.take(x, kd.slice([[[0, 1], []], [[1], [0]]]))
# -> kd.slice([[[1, None], []], [[4], [3]]])
kd.take(x, kd.slice([3, -3])) # -> kd.slice([None, None])
kd.take(x, kd.slice([-1, -2])) # -> kd.slice([2, 3])
kd.take(x, kd.slice('1')) # -> dtype mismatch error
kd.take(x, kd.slice([1, 2, 3])) -> incompatible shape
Args:
x: DataSlice to be indexed
indices: indices used to select items
Returns:
A new DataSlice with items selected by indices.
DataSlice.to_py(ds, max_depth=2, obj_as_dict=False, include_missing_attrs=True)
Aliases:
Returns a readable python object from a DataSlice.
Attributes, lists, and dicts are recursively converted to Python objects.
Args:
ds: A DataSlice
max_depth: Maximum depth for recursive printing. Each attribute, list, and
dict increments the depth by 1. Use -1 for unlimited depth.
obj_as_dict: Whether to convert objects to python dicts. By default objects
are converted to automatically constructed 'Obj' dataclass instances.
include_missing_attrs: whether to include attributes with None value in
objects.
DataSlice.to_pytree(ds, max_depth=2, include_missing_attrs=True)
Aliases:
Returns a readable python object from a DataSlice.
Attributes, lists, and dicts are recursively converted to Python objects.
Objects are converted to Python dicts.
Same as kd.to_py(..., obj_as_dict=True)
Args:
ds: A DataSlice
max_depth: Maximum depth for recursive printing. Each attribute, list, and
dict increments the depth by 1. Use -1 for unlimited depth.
include_missing_attrs: whether to include attributes with None value in
objects.
DataSlice.updated(self, *bag)
Aliases:
Returns a copy of a DataSlice with DataBag(s) of updates applied.
Values in `*bag` take precedence over the ones in the original DataBag of
`ds`.
The DataBag attached to the result is a new immutable DataBag that falls back
to the DataBag of `ds` if present and then to `*bag`.
`updated(x, a, b)` is equivalent to `updated(updated(x, b), a)`, and so on
for additional DataBag args.
Args:
ds: DataSlice.
*bag: DataBag(s) of updates.
Returns:
DataSlice with additional fallbacks.
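For example, a minimal sketch (results in the comments are indicative):
```
from koladata import kd

x = kd.new(a=1)
x2 = x.updated(kd.attrs(x, a=10, b=2))  # values in the update take precedence
x2.a, x2.b                              # -> 10, 2
x.a                                     # the original is unchanged -> 1
```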
DataSlice.with_attr(self, attr_name, value, overwrite_schema=DataItem(False, schema: BOOLEAN))
Aliases:
Returns a DataSlice with a new DataBag containing a single updated attribute.
This operator is useful if attr_name cannot be used as a key in keyword
arguments. E.g.: "123-f", "5", "%#$", etc. It still has to be a valid utf-8
unicode.
See the kd.with_attrs docstring for more details on the rules regarding the
`overwrite_schema` argument.
Args:
x: Entity / Object for which the attribute update is being created.
attr_name: utf-8 unicode representing the attribute name.
value: new value for attribute `attr_name`.
overwrite_schema: if True, schema for attribute is always updated.
DataSlice.with_attrs(self, *, overwrite_schema=DataItem(False, schema: BOOLEAN), **attrs)
Aliases:
Returns a DataSlice with a new DataBag containing updated attrs in `x`.
This is a shorter version of `x.updated(kd.attrs(x, ...))`.
Example:
x = x.with_attrs(foo=..., bar=...)
# Or equivalent:
# x = kd.with_attrs(x, foo=..., bar=...)
In case some attribute "foo" already exists and the update contains "foo",
either:
1) the schema of "foo" in the update must be implicitly castable to
`x.foo.get_schema()`; or
2) `x` is an OBJECT, in which case schema for "foo" will be overwritten.
An exception to (2) is if it was an Entity that was cast to an OBJECT using
kd.obj; then the update for "foo" must also be castable to
`x.foo.get_schema()`. If this is not the case, an Error is raised.
This behavior can be overridden by passing `overwrite_schema=True`, which will
cause the schema for attributes to always be updated.
Args:
x: Entity / Object for which the attributes update is being created.
overwrite_schema: if True, schema for attributes is always updated.
**attrs: attrs to set in the update.
DataSlice.with_bag(bag, /)
Aliases:
Returns a copy of DataSlice with the given DataBag attached.
DataSlice.with_dict_update(self, keys, values=unspecified)
Aliases:
Returns a DataSlice with a new DataBag containing updated dicts.
This operator has two forms:
kd.with_dict_update(x, keys, values) where keys and values are slices
kd.with_dict_update(x, dict_updates) where dict_updates is a DataSlice of
dicts
If both keys and values are specified, they must both be broadcastable to the
shape of `x`. If only keys is specified (as dict_updates), it must be
broadcastable to 'x'.
Args:
x: DataSlice of dicts to update.
keys: A DataSlice of keys, or a DataSlice of dicts of updates.
values: A DataSlice of values, or unspecified if `keys` contains dicts.
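For example, a sketch of the keys/values form (assuming Python scalars are auto-boxed, as elsewhere in Koda):
```
from koladata import kd

d = kd.dict({'a': 1})
d2 = d.with_dict_update('b', 2)   # keys/values broadcast to d's shape
d2['a'], d2['b']                  # -> 1, 2
d['b']                            # the original dict is unchanged -> missing
```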
DataSlice.with_list_append_update(self, append)
Aliases:
Returns a DataSlice with a new DataBag containing updated appended lists.
The updated lists are the lists in `x` with the specified items appended at
the end.
`x` and `append` must have compatible shapes.
The resulting lists maintain the same ItemIds. Also see kd.appended_list(),
which works similarly but returns lists with new ItemIds.
Args:
x: DataSlice of lists.
append: DataSlice of values to append to each list in `x`.
Returns:
A DataSlice of lists in a new immutable DataBag.
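For example (illustrative):
l = kd.list([1, 2])
l2 = l.with_list_append_update(3)
l2[:]  # -> [1, 2, 3]; l2 keeps the same ItemId as l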
DataSlice.with_merged_bag(self)
Aliases:
Returns a DataSlice with the DataBag of `ds` merged with its fallbacks.
Note that a DataBag can have multiple fallback DataBags, and those fallbacks
can have fallbacks as well. This operator merges all of them into a new immutable
DataBag.
If `ds` has no attached DataBag, it raises an exception. If the DataBag of
`ds` does not have fallback DataBags, it is equivalent to `ds.freeze_bag()`.
Args:
ds: DataSlice to merge fallback DataBags of.
Returns:
A new DataSlice with an immutable DataBag.
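For example (illustrative):
x = kd.new(a=1)
x = x.updated(kd.attrs(x, b=2))  # the result's DataBag now has fallbacks
x = x.with_merged_bag()  # a single flat, immutable DataBag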
DataSlice.with_name(obj, name)
Alias for kd.annotation.with_name operator.
DataSlice.with_schema(schema, /)
Aliases:
Returns a copy of DataSlice with the provided `schema`.
`schema` must have no DataBag or the same DataBag as the DataSlice. If `schema`
has a different DataBag, use `set_schema` instead. See kd.with_schema for more
details.
Args:
schema: schema DataSlice to set.
Returns:
DataSlice with the provided `schema`.
DataSlice.with_schema_from_obj(self)
Aliases:
Returns `x` with its embedded common schema set as the schema.
* `x` must have OBJECT schema.
* All items in `x` must have a common schema.
* If `x` is empty, the schema is set to NONE.
* If `x` contains mixed primitives without a common primitive type, the output
will have OBJECT schema.
Args:
x: An OBJECT DataSlice.
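For example (illustrative):
x = kd.obj(a=1)  # OBJECT schema with an embedded implicit schema
e = x.with_schema_from_obj()  # the embedded schema becomes the slice's schema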
DataBag methods
DataBag is a set of triples (Entity.Attribute => Value).
Operators
DataBag.adopt(slice, /)
Adopts all data reachable from the given slice into this DataBag.
Args:
slice: DataSlice to adopt data from.
Returns:
The DataSlice with this DataBag (including adopted data) attached.
DataBag.adopt_stub(slice, /)
Copies the given DataSlice's schema stub into this DataBag.
The "schema stub" of a DataSlice is a subset of its schema (including embedded
schemas) that contains just enough information to support direct updates to
that DataSlice. See kd.stub() for more details.
Args:
slice: DataSlice to extract the schema stub from.
Returns:
The "stub" with this DataBag attached.
DataBag.concat_lists(self, /, *lists)
Returns a DataSlice of Lists concatenated from the List items of `lists`.
Each input DataSlice must contain only present List items, and the item
schemas of each input must be compatible. Input DataSlices are aligned (see
`kd.align`) automatically before concatenation.
If `lists` is empty, this returns a single empty list.
This DataBag is used to create the new concatenated lists, and is the
DataBag used by the result DataSlice.
Args:
*lists: the DataSlices of Lists to concatenate
Returns:
DataSlice of concatenated Lists
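A minimal sketch (illustrative values):
db = kd.mutable_bag()
a = db.list([1, 2])
b = db.list([3])
db.concat_lists(a, b)[:]  # -> [1, 2, 3]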
DataBag.contents_repr(self, /, *, triple_limit=1000)
Returns a representation of the DataBag contents.
DataBag.data_triples_repr(self, *, triple_limit=1000)
Returns a representation of the DataBag contents, omitting schema triples.
DataBag.dict(self, /, items_or_keys=None, values=None, *, key_schema=None, value_schema=None, schema=None, itemid=None)
Creates a Koda dict.
Acceptable arguments are:
1) no argument: a single empty dict
2) a Python dict whose keys are either primitives or DataItems and values
are primitives, DataItems, Python lists/dicts which can be converted to
List/Dict DataItems, or a DataSlice which can be folded into a List DataItem:
a single dict
3) two DataSlices/DataItems as keys and values: a DataSlice of dicts whose
shape is the last N-1 dimensions of keys/values DataSlice
Examples:
dict() -> returns a single new dict
dict({1: 2, 3: 4}) -> returns a single new dict
dict({1: [1, 2]}) -> returns a single dict, mapping 1->List[1, 2]
dict({1: kd.slice([1, 2])}) -> returns a single dict, mapping 1->List[1, 2]
dict({db.uuobj(x=1, y=2): 3}) -> returns a single dict, mapping uuid->3
dict(kd.slice([1, 2]), kd.slice([3, 4])) -> returns a dict, mapping 1->3 and
2->4
dict(kd.slice([[1], [2]]), kd.slice([3, 4])) -> returns two dicts, one
mapping
1->3 and another mapping 2->4
dict('key', 12) -> returns a single dict mapping 'key'->12
Args:
items_or_keys: a Python dict in case of items and a DataSlice in case of
keys.
values: a DataSlice. If provided, `items_or_keys` must be a DataSlice as
keys.
key_schema: the schema of the dict keys. If not specified, it will be
deduced from keys or defaulted to OBJECT.
value_schema: the schema of the dict values. If not specified, it will be
deduced from values or defaulted to OBJECT.
schema: The schema to use for the newly created Dict. If specified, then
key_schema and value_schema must not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting dicts.
Returns:
A DataSlice with the dict.
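For example (mirroring the examples above; values are illustrative):
db = kd.mutable_bag()
d = db.dict({'a': 1, 'b': 2})
d['a']  # -> 1
ds = db.dict(kd.slice([[1], [2]]), kd.slice([3, 4]))  # two dicts: 1->3, 2->4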
DataBag.dict_like(self, shape_and_mask_from, /, items_or_keys=None, values=None, *, key_schema=None, value_schema=None, schema=None, itemid=None)
Creates new Koda dicts with shape and sparsity of `shape_and_mask_from`.
If items_or_keys and values are not provided, creates empty dicts. Otherwise,
the function assigns the given keys and values to the newly created dicts. So
the keys and values must be either broadcastable to the shape of
`shape_and_mask_from`, or one dimension higher.
Args:
self: the DataBag.
shape_and_mask_from: a DataSlice with the shape and sparsity for the desired
dicts.
items_or_keys: either a Python dict (if `values` is None) or a DataSlice
with keys. The Python dict case is supported only for scalar
shape_and_mask_from.
values: a DataSlice of values, when `items_or_keys` represents keys.
key_schema: the schema of the dict keys. If not specified, it will be
deduced from keys or defaulted to OBJECT.
value_schema: the schema of the dict values. If not specified, it will be
deduced from values or defaulted to OBJECT.
schema: The schema to use for the newly created Dict. If specified, then
key_schema and value_schema must not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting dicts.
Returns:
A DataSlice with the dicts.
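For example (illustrative):
db = kd.mutable_bag()
ds = db.dict_like(kd.slice([1, None, 3]))  # dicts only where the input is present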
DataBag.dict_schema(key_schema, value_schema)
Returns a dict schema from the schemas of the keys and values.
DataBag.dict_shaped(self, shape, /, items_or_keys=None, values=None, *, key_schema=None, value_schema=None, schema=None, itemid=None)
Creates new Koda dicts with the given shape.
If items_or_keys and values are not provided, creates empty dicts. Otherwise,
the function assigns the given keys and values to the newly created dicts. So
the keys and values must be either broadcastable to `shape` or one dimension
higher.
Args:
self: the DataBag.
shape: the desired shape.
items_or_keys: either a Python dict (if `values` is None) or a DataSlice
with keys. The Python dict case is supported only for scalar shape.
values: a DataSlice of values, when `items_or_keys` represents keys.
key_schema: the schema of the dict keys. If not specified, it will be
deduced from keys or defaulted to OBJECT.
value_schema: the schema of the dict values. If not specified, it will be
deduced from values or defaulted to OBJECT.
schema: The schema to use for the newly created Dict. If specified, then
key_schema and value_schema must not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting dicts.
Returns:
A DataSlice with the dicts.
DataBag.empty()
Alias for kd.bags.new operator.
DataBag.empty_mutable()
Alias for kd.mutable_bag operator.
DataBag.fork(mutable=True)
Returns a newly created DataBag with the same content as self.
Changes to either DataBag will not be reflected in the other.
Args:
mutable: If True (default), returns a mutable DataBag. If False, the DataBag
will be immutable.
Returns:
data_bag.DataBag
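For example (given an existing DataBag `db`):
db2 = db.fork()  # mutable copy; edits to db2 are not visible in db
db3 = db.fork(mutable=False)  # immutable copy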
DataBag.freeze(self)
Returns a frozen DataBag equivalent to `self`.
DataBag.get_approx_size()
Returns the approximate size of the DataBag.
DataBag.implode(self, x, /, ndim=1, itemid=None)
Implodes a DataSlice `x` a specified number of times.
A single list "implosion" converts a rank-(K+1) DataSlice of T to a rank-K
DataSlice of LIST[T], by folding the items in the last dimension of the
original DataSlice into newly-created Lists.
If `ndim` is set to a non-negative integer, implodes recursively `ndim` times.
If `ndim` is set to a negative integer, implodes as many times as possible,
until the result is a DataItem (i.e. a rank-0 DataSlice) containing a single
nested List.
This DataBag is used to create any new Lists, and is the DataBag of the
result DataSlice.
Args:
x: the DataSlice to implode
ndim: the number of implosion operations to perform
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting lists.
Returns:
DataSlice of nested Lists
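For example (illustrative):
db = kd.mutable_bag()
x = kd.slice([[1, 2], [3]])
db.implode(x)  # a DataSlice of two Lists: List[1, 2] and List[3]
db.implode(x, ndim=-1)  # a single nested List [[1, 2], [3]]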
DataBag.is_mutable()
Returns present iff this DataBag is mutable.
DataBag.list(self, /, items=None, *, item_schema=None, schema=None, itemid=None)
Creates list(s) by collapsing `items`.
If there is no argument, returns an empty Koda List.
If the argument is a Python list, creates a nested Koda List.
Examples:
list() -> a single empty Koda List
list([1, 2, 3]) -> Koda List with items 1, 2, 3
list([[1, 2, 3], [4, 5]]) -> nested Koda List [[1, 2, 3], [4, 5]]
# items are Koda lists.
Args:
items: The items to use. If not specified, an empty list of OBJECTs will be
created.
item_schema: the schema of the list items. If not specified, it will be
deduced from `items` or defaulted to OBJECT.
schema: The schema to use for the list. If specified, then item_schema must
not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting lists.
Returns:
A DataSlice with the list/lists.
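For example (illustrative):
db = kd.mutable_bag()
l = db.list([1, 2, 3])
l[:]  # -> [1, 2, 3]
nested = db.list([[1, 2], [3]])  # a nested Koda List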
DataBag.list_like(self, shape_and_mask_from, /, items=None, *, item_schema=None, schema=None, itemid=None)
Creates new Koda lists with shape and sparsity of `shape_and_mask_from`.
Args:
shape_and_mask_from: a DataSlice with the shape and sparsity for the desired
lists.
items: optional items to assign to the newly created lists. If not given,
the function returns empty lists.
item_schema: the schema of the list items. If not specified, it will be
deduced from `items` or defaulted to OBJECT.
schema: The schema to use for the list. If specified, then item_schema must
not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting lists.
Returns:
A DataSlice with the lists.
DataBag.list_schema(item_schema)
Returns a list schema from the schema of the items.
DataBag.list_shaped(self, shape, /, items=None, *, item_schema=None, schema=None, itemid=None)
Creates new Koda lists with the given shape.
Args:
shape: the desired shape.
items: optional items to assign to the newly created lists. If not given,
the function returns empty lists.
item_schema: the schema of the list items. If not specified, it will be
deduced from `items` or defaulted to OBJECT.
schema: The schema to use for the list. If specified, then item_schema must
not be specified.
itemid: Optional ITEMID DataSlice used as ItemIds of the resulting lists.
Returns:
A DataSlice with the lists.
DataBag.merge_fallbacks()
Returns a new DataBag with all the fallbacks merged.
DataBag.merge_inplace(self, other_bags, /, *, overwrite=True, allow_data_conflicts=True, allow_schema_conflicts=False)
Copies all data from `other_bags` to this DataBag.
Args:
other_bags: Either a DataBag or a list of DataBags to merge into the current
DataBag.
overwrite: In case of conflicts, whether the new value (or the rightmost of
the new values, if multiple) should be used instead of the old value. Note
that this flag has no effect when allow_data_conflicts=False and
allow_schema_conflicts=False. Note that db1.fork().inplace_merge(db2,
overwrite=False) and db2.fork().inplace_merge(db1, overwrite=True) produce
the same result.
allow_data_conflicts: Whether we allow the same attribute to have different
values in the bags being merged. When True, the overwrite= flag controls
the behavior in case of a conflict. By default, both this flag and
overwrite= are True, so we overwrite with the new values in case of a
conflict.
allow_schema_conflicts: Whether we allow the same attribute to have
different types in an explicit schema. Note that setting this flag to True
can be dangerous, as there might be some objects with the old schema that
are not overwritten, and therefore will end up in an inconsistent state
with their schema after the overwrite. When True, overwrite= flag controls
the behavior in case of a conflict.
Returns:
self, so that multiple DataBag modifications can be chained.
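A minimal sketch of a data conflict (illustrative; kd.attrs builds an update
DataBag for the same ItemId):
db1 = kd.mutable_bag()
a = db1.new(x=1)
update = kd.attrs(a, x=2)  # conflicting value for the same attribute
db1.merge_inplace(update)  # overwrite=True by default, so a.x -> 2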
DataBag.named_schema(name, /, **attrs)
Creates a named schema with ItemId derived only from its name.
DataBag.new(arg, *, schema=None, overwrite_schema=False, itemid=None, **attrs)
Creates Entities with given attrs.
Args:
arg: optional Python object to be converted to an Entity.
schema: optional DataSlice schema. If not specified, a new explicit schema
will be automatically created based on the schemas of the passed **attrs.
overwrite_schema: if True, attributes set through `attrs` may also update the
schema, e.g. when the attribute is missing from `schema` or has a
conflicting type.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting entities.
itemid is only used when `arg` is present and is not a primitive or a
DataSlice of primitives.
**attrs: attrs to set in the returned Entity.
Returns:
data_slice.DataSlice with the given attrs.
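For example (illustrative):
db = kd.mutable_bag()
x = db.new(a=1, b='hi')
x.a  # -> 1
x.get_schema()  # a new explicit schema deduced from the attr values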
DataBag.new_like(shape_and_mask_from, *, schema=None, overwrite_schema=False, itemid=None, **attrs)
Creates new Entities with the shape and sparsity from shape_and_mask_from.
Args:
shape_and_mask_from: DataSlice, whose shape and sparsity the returned
DataSlice will have.
schema: optional DataSlice schema. If not specified, a new explicit schema
will be automatically created based on the schemas of the passed **attrs.
overwrite_schema: if True, attributes set through `attrs` may also update the
schema, e.g. when the attribute is missing from `schema` or has a
conflicting type.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting entities.
**attrs: attrs to set in the returned Entity.
Returns:
data_slice.DataSlice with the given attrs.
DataBag.new_schema(**attrs)
Creates a new schema object with the given types of attrs.
DataBag.new_shaped(shape, *, schema=None, overwrite_schema=False, itemid=None, **attrs)
Creates new Entities with the given shape.
Args:
shape: JaggedShape that the returned DataSlice will have.
schema: optional DataSlice schema. If not specified, a new explicit schema
will be automatically created based on the schemas of the passed **attrs.
overwrite_schema: if True, attributes set through `attrs` may also update the
schema, e.g. when the attribute is missing from `schema` or has a
conflicting type.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting entities.
**attrs: attrs to set in the returned Entity.
Returns:
data_slice.DataSlice with the given attrs.
DataBag.obj(arg, *, itemid=None, **attrs)
Creates new Objects with an implicit stored schema.
Returned DataSlice has OBJECT schema.
Args:
arg: optional Python object to be converted to an Object.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting obj(s).
itemid is only used when `arg` is present and is not a primitive or a
DataSlice of primitives.
**attrs: attrs to set on the returned object.
Returns:
data_slice.DataSlice with the given attrs and kd.OBJECT schema.
DataBag.obj_like(shape_and_mask_from, *, itemid=None, **attrs)
Creates Objects with shape and sparsity from shape_and_mask_from.
Returned DataSlice has OBJECT schema.
Args:
shape_and_mask_from: DataSlice, whose shape and sparsity the returned
DataSlice will have.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting obj(s).
**attrs: attrs to set on the returned Object(s).
Returns:
data_slice.DataSlice with the given attrs.
DataBag.obj_shaped(shape, *, itemid=None, **attrs)
Creates Objects with the given shape.
Returned DataSlice has OBJECT schema.
Args:
shape: JaggedShape that the returned DataSlice will have.
itemid: optional ITEMID DataSlice used as ItemIds of the resulting obj(s).
**attrs: attrs to set on the returned Object(s).
Returns:
data_slice.DataSlice with the given attrs.
DataBag.schema_triples_repr(self, *, triple_limit=1000)
Returns a representation of schema triples in the DataBag.
DataBag.uu(seed, *, schema=None, overwrite_schema=False, **kwargs)
Creates item(s) whose ids are uuid(s) derived from the given attributes.
In order to create a different "Type" from the same arguments, use
`seed` key with the desired value, e.g.
kd.uu(seed='type_1', x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6]))
and
kd.uu(seed='type_2', x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6]))
have different ids.
If 'schema' is provided, the resulting DataSlice has the provided schema.
Otherwise, uses the corresponding uuschema instead.
Args:
seed: (str) Allows different item(s) to have different ids when created
from the same inputs.
schema: schema for the resulting DataSlice
overwrite_schema: if True, overwrites attributes in the schema's
corresponding DataBag based on the argument values.
**kwargs: key-value pairs of object attributes where values are DataSlices
or can be converted to DataSlices using kd.new.
Returns:
data_slice.DataSlice
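For example (illustrative):
db = kd.mutable_bag()
a = db.uu(seed='t', x=1)
b = db.uu(seed='t', x=1)
a.get_itemid() == b.get_itemid()  # present: same inputs yield the same uuid
c = db.uu(seed='other', x=1)  # a different seed yields a different uuid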
DataBag.uu_schema(seed, **attrs)
Creates a new uuschema from the given types of attrs.
DataBag.uuobj(seed, **kwargs)
Creates object(s) whose ids are uuid(s) derived from the provided attributes.
In order to create a different "Type" from the same arguments, use
`seed` key with the desired value, e.g.
kd.uuobj(seed='type_1', x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6]))
and
kd.uuobj(seed='type_2', x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6]))
have different ids.
Args:
seed: (str) Allows different uuobj(s) to have different ids when created
from the same inputs.
**kwargs: key-value pairs of object attributes where values are DataSlices
or can be converted to DataSlices using kd.new.
Returns:
data_slice.DataSlice
DataBag.with_name(obj, name)
Alias for kd.annotation.with_name operator.
DataItem methods
DataItem represents a single item (i.e. a primitive value or an ItemId).
Operators
DataItem.L
List slicing helper for DataSlice.
x.L on a DataSlice returns a ListSlicingHelper, which treats the first
dimension of DataSlice x as a list.
DataItem.S
Slicing helper for DataSlice.
It is syntactic sugar for kd.subslice. That is, kd.subslice(ds, *slices)
is equivalent to ds.S[*slices]. For example,
kd.subslice(x, 0) == x.S[0]
kd.subslice(x, 0, 1, kd.item(0)) == x.S[0, 1, kd.item(0)]
kd.subslice(x, slice(0, -1)) == x.S[0:-1]
kd.subslice(x, slice(0, -1), slice(0, 1), slice(1, None))
== x.S[0:-1, 0:1, 1:]
kd.subslice(x, ..., slice(1, None)) == x.S[..., 1:]
kd.subslice(x, slice(1, None)) == x.S[1:]
Please see kd.subslice for more detailed explanations and examples.
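A small sketch (results follow kd.subslice's convention of applying slices to
the last dimensions; values are illustrative):
x = kd.slice([[1, 2], [3, 4]])
x.S[0]  # applied to the last dimension -> [1, 3]
x.S[0, 1]  # -> 2
x.S[..., 1:]  # -> [[2], [4]]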
DataItem.append(value, /)
Alias for DataSlice.append operator.
DataItem.bind(self, *, return_type_as=<class 'koladata.types.data_slice.DataSlice'>, **kwargs)
Returns a Koda functor that partially binds a function to `kwargs`.
This function is intended to work the same as functools.partial in Python.
More specifically, for every "k=something" argument that you pass to this
function, whenever the resulting functor is called, if the user did not
provide "k=something_else" at call time, we will add "k=something".
Note that you can only provide defaults for the arguments passed as keyword
arguments this way. Positional arguments must still be provided at call time.
Moreover, if the user provides a value for a positional-or-keyword argument
positionally, and it was previously bound using this method, an exception
will be raised.
You can pass expressions with their own inputs as values in `kwargs`. Those
inputs will become inputs of the resulting functor, will be used to compute
those expressions, _and_ they will also be passed to the underlying functor.
Use kdf.call_fn for a clearer separation of those inputs.
Example:
f = kd.fn(I.x + I.y).bind(x=0)
kd.call(f, y=1) # 1
Args:
self: A Koda functor.
return_type_as: The return type of the functor is expected to be the same as
the type of this value. This needs to be specified if the functor does not
return a DataSlice. kd.types.DataSlice, kd.types.DataBag and
kd.types.JaggedShape can also be passed here.
**kwargs: Partial parameter binding. The values in this map may be Koda
expressions or DataItems. When they are expressions, they must evaluate to
a DataSlice/DataItem or a primitive that will be automatically wrapped
into a DataItem. This function creates auxiliary variables with names
starting with '_aux_fn', so it is not recommended to pass variables with
such names.
Returns:
A new Koda functor with some parameters bound.
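Extending the example above (here `I` is the input container, i.e. kd.I):
f = kd.fn(I.x + I.y).bind(x=0)
kd.call(f, y=1)  # -> 1
kd.call(f, x=2, y=1)  # -> 3: a call-time x overrides the bound value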
DataItem.clear()
Alias for DataSlice.clear operator.
DataItem.clone(self, *, itemid=unspecified, schema=unspecified, **overrides)
Alias for DataSlice.clone operator.
DataItem.deep_clone(self, schema=unspecified, **overrides)
Alias for DataSlice.deep_clone operator.
DataItem.deep_uuid(self, schema=unspecified, *, seed=DataItem('', schema: STRING))
Alias for DataSlice.deep_uuid operator.
DataItem.dict_size(self)
Alias for DataSlice.dict_size operator.
DataItem.display(self, options=None)
Alias for DataSlice.display operator.
DataItem.embed_schema()
Alias for DataSlice.embed_schema operator.
DataItem.enriched(self, *bag)
Alias for DataSlice.enriched operator.
DataItem.expand_to(self, target, ndim=unspecified)
Alias for DataSlice.expand_to operator.
DataItem.explode(self, ndim=DataItem(1, schema: INT64))
Alias for DataSlice.explode operator.
DataItem.extract(self, schema=unspecified)
Alias for DataSlice.extract operator.
DataItem.extract_bag(self, schema=unspecified)
Alias for DataSlice.extract_bag operator.
DataItem.flatten(self, from_dim=DataItem(0, schema: INT64), to_dim=unspecified)
Alias for DataSlice.flatten operator.
DataItem.flatten_end(self, n_times=DataItem(1, schema: INT64))
Alias for DataSlice.flatten_end operator.
DataItem.follow(self)
Alias for DataSlice.follow operator.
DataItem.fork_bag(self)
Alias for DataSlice.fork_bag operator.
DataItem.freeze_bag()
Alias for DataSlice.freeze_bag operator.
DataItem.from_vals(x, /, schema=None)
Alias for kd.slices.item operator.
DataItem.get_attr(attr_name, /, default=None)
Alias for DataSlice.get_attr operator.
DataItem.get_attr_names(*, intersection)
Alias for DataSlice.get_attr_names operator.
DataItem.get_bag()
Alias for DataSlice.get_bag operator.
DataItem.get_dtype(self)
Alias for DataSlice.get_dtype operator.
DataItem.get_item_schema(self)
Alias for DataSlice.get_item_schema operator.
DataItem.get_itemid(self)
Alias for DataSlice.get_itemid operator.
DataItem.get_key_schema(self)
Alias for DataSlice.get_key_schema operator.
DataItem.get_keys()
Alias for DataSlice.get_keys operator.
DataItem.get_ndim(self)
Alias for DataSlice.get_ndim operator.
DataItem.get_obj_schema(self)
Alias for DataSlice.get_obj_schema operator.
DataItem.get_present_count(self)
Alias for DataSlice.get_present_count operator.
DataItem.get_schema()
Alias for DataSlice.get_schema operator.
DataItem.get_shape()
Alias for DataSlice.get_shape operator.
DataItem.get_size(self)
Alias for DataSlice.get_size operator.
DataItem.get_sizes(self)
Alias for DataSlice.get_sizes operator.
DataItem.get_value_schema(self)
Alias for DataSlice.get_value_schema operator.
DataItem.get_values(self, key_ds=unspecified)
Alias for DataSlice.get_values operator.
DataItem.has_attr(self, attr_name)
Alias for DataSlice.has_attr operator.
DataItem.has_bag()
Alias for DataSlice.has_bag operator.
DataItem.implode(self, ndim=DataItem(1, schema: INT64), itemid=unspecified)
Alias for DataSlice.implode operator.
DataItem.internal_as_arolla_value()
Alias for DataSlice.internal_as_arolla_value operator.
DataItem.internal_as_dense_array()
Alias for DataSlice.internal_as_dense_array operator.
DataItem.internal_as_py()
Alias for DataSlice.internal_as_py operator.
DataItem.internal_is_itemid_schema()
Alias for DataSlice.internal_is_itemid_schema operator.
DataItem.is_dict()
Alias for DataSlice.is_dict operator.
DataItem.is_dict_schema()
Alias for DataSlice.is_dict_schema operator.
DataItem.is_empty()
Alias for DataSlice.is_empty operator.
DataItem.is_entity()
Alias for DataSlice.is_entity operator.
DataItem.is_entity_schema()
Alias for DataSlice.is_entity_schema operator.
DataItem.is_list()
Alias for DataSlice.is_list operator.
DataItem.is_list_schema()
Alias for DataSlice.is_list_schema operator.
DataItem.is_mutable()
Alias for DataSlice.is_mutable operator.
DataItem.is_primitive(self)
Alias for DataSlice.is_primitive operator.
DataItem.is_primitive_schema()
Alias for DataSlice.is_primitive_schema operator.
DataItem.is_struct_schema()
Alias for DataSlice.is_struct_schema operator.
DataItem.list_size(self)
Alias for DataSlice.list_size operator.
DataItem.maybe(self, attr_name)
Alias for DataSlice.maybe operator.
DataItem.new(self, **attrs)
Alias for DataSlice.new operator.
DataItem.no_bag()
Alias for DataSlice.no_bag operator.
DataItem.pop(index, /)
Alias for DataSlice.pop operator.
DataItem.ref(self)
Alias for DataSlice.ref operator.
DataItem.repeat(self, sizes)
Alias for DataSlice.repeat operator.
DataItem.reshape(self, shape)
Alias for DataSlice.reshape operator.
DataItem.reshape_as(self, shape_from)
Alias for DataSlice.reshape_as operator.
DataItem.select(self, fltr, expand_filter=DataItem(True, schema: BOOLEAN))
Alias for DataSlice.select operator.
DataItem.select_items(self, fltr)
Alias for DataSlice.select_items operator.
DataItem.select_keys(self, fltr)
Alias for DataSlice.select_keys operator.
DataItem.select_present(self)
Alias for DataSlice.select_present operator.
DataItem.select_values(self, fltr)
Alias for DataSlice.select_values operator.
DataItem.set_attr(attr_name, value, /, overwrite_schema=False)
Alias for DataSlice.set_attr operator.
DataItem.set_attrs(*, overwrite_schema=False, **attrs)
Alias for DataSlice.set_attrs operator.
DataItem.set_schema(schema, /)
Alias for DataSlice.set_schema operator.
DataItem.shallow_clone(self, *, itemid=unspecified, schema=unspecified, **overrides)
Alias for DataSlice.shallow_clone operator.
DataItem.strict_with_attrs(self, **attrs)
Alias for DataSlice.strict_with_attrs operator.
DataItem.stub(self, attrs=DataSlice([], schema: NONE))
Alias for DataSlice.stub operator.
DataItem.take(self, indices)
Alias for DataSlice.take operator.
DataItem.to_py(ds, max_depth=2, obj_as_dict=False, include_missing_attrs=True)
Alias for DataSlice.to_py operator.
DataItem.to_pytree(ds, max_depth=2, include_missing_attrs=True)
Alias for DataSlice.to_pytree operator.
DataItem.updated(self, *bag)
Alias for DataSlice.updated operator.
DataItem.with_attr(self, attr_name, value, overwrite_schema=DataItem(False, schema: BOOLEAN))
Alias for DataSlice.with_attr operator.
DataItem.with_attrs(self, *, overwrite_schema=DataItem(False, schema: BOOLEAN), **attrs)
Alias for DataSlice.with_attrs operator.
DataItem.with_bag(bag, /)
Alias for DataSlice.with_bag operator.
DataItem.with_dict_update(self, keys, values=unspecified)
Alias for DataSlice.with_dict_update operator.
DataItem.with_list_append_update(self, append)
Alias for DataSlice.with_list_append_update operator.
DataItem.with_merged_bag(self)
Alias for DataSlice.with_merged_bag operator.
DataItem.with_name(obj, name)
Alias for kd.annotation.with_name operator.
DataItem.with_schema(schema, /)
Alias for DataSlice.with_schema operator.
DataItem.with_schema_from_obj(self)
Alias for DataSlice.with_schema_from_obj operator.