koladata

kd_ext.pdkd API

Tools for Pandas <-> Koda interoperability.

kd_ext.pdkd.df(ds: DataSlice, cols: Sequence[str | Expr] | Mapping[str, str | Expr] | None = None, include_self: bool = False) -> DataFrame

Aliases: kd_ext.pdkd.to_dataframe

Creates a pandas DataFrame from the given DataSlice.

If `ds` has no dimension, it will be converted to a single-row DataFrame. If
it has one dimension, it will be converted to a 1D DataFrame. If it has more
than one dimension, it will be converted to a MultiIndex DataFrame with index
columns corresponding to each dimension.
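The mapping from dimensions to index levels can be pictured with a plain-Python model. This is only an illustrative sketch: `flatten_with_index` is a hypothetical helper that works on nested Python lists. It is not part of koladata and does not call `kd_ext.pdkd.df`.

```python
def flatten_with_index(data, prefix=()):
    """Yields (index_tuple, value) pairs, one index level per nesting depth."""
    if not isinstance(data, list):
        # A leaf value: the positions accumulated so far form its row index.
        yield prefix, data
        return
    for position, child in enumerate(data):
        yield from flatten_with_index(child, prefix + (position,))


# No dimension: a single row with an empty index tuple.
scalar_rows = list(flatten_with_index(7))
# One dimension: one index level.
flat_rows = list(flatten_with_index([1, 2, 3]))
# Two dimensions: two index levels, analogous to a pandas MultiIndex.
nested_rows = list(flatten_with_index([[4, 5], [], [6]]))
print(nested_rows)  # [((0, 0), 4), ((0, 1), 5), ((2, 0), 6)]
```

In this model an empty inner list contributes no rows, so the outer position 1 does not appear in the result.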

When `cols` is not specified, DataFrame columns are inferred from `ds`.
  1) If `ds` has primitives, lists, dicts or ITEMID schema, a single
     column named 'self_' is used and items themselves are extracted.
  2) If `ds` has entity schema, all attributes from `ds` are extracted as
     columns.
  3) If `ds` has OBJECT schema, the union of attributes from all objects in
     `ds` is used as columns. Objects that lack an attribute get a missing
     value in the corresponding column.

For example,

  ds = kd.slice([1, 2, 3])
  to_dataframe(ds) -> extract 'self_'

  ds = kd.new(x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6]))
  to_dataframe(ds) -> extract 'x' and 'y'

  ds = kd.slice([kd.obj(x=1, y='a'), kd.obj(x=2), kd.obj(x=3, y='c')])
  to_dataframe(ds) -> extract 'x', 'y'
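The OBJECT-schema rule (union of attributes, with gaps filled by missing values) can be modeled in plain Python. The sketch below uses dicts as stand-ins for `kd.obj(...)` items. `infer_columns` and `to_rows` are hypothetical names, and the first-seen column order is an assumption of this model, not a statement about koladata's ordering.

```python
def infer_columns(objects):
    """Returns the union of attribute names, in first-seen order."""
    columns = []
    for obj in objects:
        for name in obj:
            if name not in columns:
                columns.append(name)
    return columns


def to_rows(objects, columns, missing=None):
    """Builds one row per object, filling absent attributes with `missing`."""
    return [[obj.get(name, missing) for name in columns] for obj in objects]


# Plain dicts stand in for kd.obj(x=1, y='a'), kd.obj(x=2), kd.obj(x=3, y='c').
objects = [{"x": 1, "y": "a"}, {"x": 2}, {"x": 3, "y": "c"}]
columns = infer_columns(objects)
rows = to_rows(objects, columns)
print(columns)  # ['x', 'y']
print(rows)     # [[1, 'a'], [2, None], [3, 'c']]
```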

`cols` can be used to specify which data from the DataSlice should be
extracted as DataFrame columns. It can be a sequence of string names of
attributes, a sequence of Exprs, or a mapping of column names to string names
of attributes or Exprs. If `ds` has OBJECT schema, specified attributes must
be present in all objects in `ds`. To ignore objects which do not have
specific attributes, one can use `S.maybe(attr)` in `cols`. For example,

  ds = kd.slice([1, 2, 3])
  to_dataframe(ds) -> extract 'self_'

  ds = kd.new(x=kd.slice([1, 2, 3]), y=kd.slice([4, 5, 6]))
  to_dataframe(ds, ['x']) -> extract 'x'
  to_dataframe(ds, [S.x, S.x + S.y]) -> extract 'S.x' and 'S.x + S.y'
  to_dataframe(ds, {'my_x': S.x, 'my_y': 'y'}) -> extract 'my_x' and 'my_y'

  ds = kd.slice([kd.obj(x=1, y='a'), kd.obj(x=2), kd.obj(x=3, y='c')])
  to_dataframe(ds, ['x']) -> extract 'x'
  to_dataframe(ds, [S.y]) -> raise an exception as 'y' does not exist in
      kd.obj(x=2)
  to_dataframe(ds, [S.maybe('y')]) -> extract 'y' but ignore items which
      do not have the 'y' attribute.
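The difference between a strict attribute column and a `maybe`-style column, and between sequence and mapping forms of `cols`, can be sketched in plain Python. Everything here is a hypothetical model: dicts stand in for objects, callables stand in for Exprs, and `attr`, `maybe`, `normalize_cols`, and `extract` are illustrative names, not koladata APIs. The `col{i}` naming of callable columns is likewise an assumption of the sketch.

```python
from collections.abc import Mapping


def attr(name):
    """Strict accessor: raises KeyError if the attribute is absent."""
    return lambda obj: obj[name]


def maybe(name):
    """Lenient accessor: returns None if the attribute is absent."""
    return lambda obj: obj.get(name)


def normalize_cols(cols):
    """Turns a sequence or mapping spec into {column_name: accessor}."""
    if isinstance(cols, Mapping):
        items = cols.items()
    else:
        # String columns are named after the attribute; callable columns
        # are named after their position (an assumption of this sketch).
        items = [(c if isinstance(c, str) else f"col{i}", c)
                 for i, c in enumerate(cols)]
    return {name: attr(c) if isinstance(c, str) else c for name, c in items}


def extract(objects, cols):
    """Applies every accessor to every object, column by column."""
    accessors = normalize_cols(cols)
    return {name: [get(obj) for obj in objects]
            for name, get in accessors.items()}


objects = [{"x": 1, "y": "a"}, {"x": 2}, {"x": 3, "y": "c"}]

renamed = extract(objects, {"my_x": "x"})       # mapping form renames columns
lenient = extract(objects, {"y": maybe("y")})   # gaps become None

try:
    extract(objects, ["y"])                     # strict: second object lacks 'y'
    strict_failed = False
except KeyError:
    strict_failed = True
```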

If extracted column DataSlices have different shapes, they will be aligned to
the same dimensions. For example,

  ds = kd.new(
      x=kd.slice([1, 2, 3]),
      y=kd.list(kd.new(z=kd.slice([[4], [5], [6]]))),
      z=kd.list(kd.new(z=kd.slice([[4, 5], [], [6]]))),
  )
  to_dataframe(ds, cols=[S.x, S.y[:].z]) -> extract 'S.x' and 'S.y[:].z':
         'x' 'y[:].z'
    0 0   1     4
    1 0   2     5
    2 0   3     6
  to_dataframe(ds, cols=[S.y[:].z, S.z[:].z]) -> error: shapes mismatch
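Alignment of a lower-dimensional column with a higher-dimensional one can be modeled as broadcasting over nested lists. The following is a minimal sketch under that assumption. `align` is a hypothetical helper, not koladata's implementation, and it only handles the 1D-against-2D case.

```python
def align(outer, nested):
    """Broadcasts a 1D column against a 2D column.

    Returns (index_tuple, outer_value, nested_value) rows. Each outer value
    is repeated once per element of the matching inner list.
    """
    if len(outer) != len(nested):
        raise ValueError("shapes mismatch: first dimensions differ")
    rows = []
    for i, (outer_value, children) in enumerate(zip(outer, nested)):
        for j, child in enumerate(children):
            rows.append(((i, j), outer_value, child))
    return rows


x = [1, 2, 3]
z = [[4, 5], [], [6]]
rows = align(x, z)
for index, x_value, z_value in rows:
    print(index, x_value, z_value)
# (0, 0) 1 4
# (0, 1) 1 5
# (2, 0) 3 6
```

In this model the value `2` never appears, because its inner list is empty and therefore yields no rows.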

The conversion adheres to:
  * All output data will be of nullable types (e.g. `Int64Dtype()` rather than
    `np.int64`)
  * `pd.NA` is used for missing values.
  * Numeric dtypes, booleans and strings will use corresponding pandas dtypes.
  * MASK will be converted to pd.BooleanDtype(), with `kd.present => True` and
    `kd.missing => pd.NA`.
  * All other dtypes (including a mixed DataSlice) will use the `object` dtype
    holding python data, with missing values represented through `pd.NA`.
    `kd.present` is converted to True.
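The pandas side of these rules can be seen without koladata. The short sketch below uses only pandas and shows why nullable extension dtypes matter: a default integer column with a gap falls back to `float64`, while `Int64` and `boolean` keep their type and mark the gap with `pd.NA`. The variable names are illustrative, and the `mask_like` column only imitates what a converted MASK column would look like.

```python
import pandas as pd

# A default integer column cannot hold a missing value, so pandas falls back
# to float64 and represents the gap as NaN.
plain = pd.Series([1, None, 3])

# The nullable extension dtype keeps integers and marks the gap with pd.NA.
nullable = pd.Series([1, None, 3], dtype="Int64")

# A MASK-like column: present maps to True, missing maps to pd.NA.
mask_like = pd.Series([True, None, True], dtype="boolean")

print(plain.dtype, nullable.dtype, mask_like.dtype)  # float64 Int64 boolean
```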

Args:
  ds: DataSlice to convert.
  cols: list of columns to extract or a dictionary mapping output column names
    to columns to extract from DataSlice. If None, all attributes will be
    extracted.
  include_self: whether to include the 'self_' column. 'self_' column is
    always included if `cols` is None and `ds` contains primitives/lists/dicts
    or it has ITEMID schema.

Returns:
  DataFrame with columns from DataSlice fields.

kd_ext.pdkd.from_dataframe(df_: DataFrame, as_obj: bool = False) -> DataSlice

Creates a DataSlice from the given pandas DataFrame.

The DataFrame must have at least one column. It will be converted to a
DataSlice of entities/objects with attributes corresponding to the DataFrame
columns. Supported column dtypes include all primitive dtypes and ItemId.

If the DataFrame has MultiIndex, it will be converted to a DataSlice with
the shape derived from the MultiIndex.
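One way to picture deriving a shape from a two-level index is to group values by their outer index. The sketch below is a plain-Python model with a hypothetical helper, `nest_by_index`. It assumes contiguous, zero-based outer positions and says nothing about how koladata treats other index labels.

```python
def nest_by_index(index, values):
    """Rebuilds a 2D nested list from two-level (outer, inner) index tuples."""
    outer_size = max(outer for outer, _ in index) + 1 if index else 0
    nested = [[] for _ in range(outer_size)]
    for (outer, _), value in zip(index, values):
        # Rows are assumed to arrive sorted by their inner position.
        nested[outer].append(value)
    return nested


index = [(0, 0), (0, 1), (1, 0)]
values = [4, 5, 6]
nested = nest_by_index(index, values)
print(nested)  # [[4, 5], [6]]
```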

When `as_obj` is set, the resulting DataSlice will be a DataSlice of objects
instead of entities.

The conversion adheres to:
* All missing values (according to `pd.isna`) become missing values in the
  resulting DataSlice.
* Data with `object` dtype is converted to an OBJECT DataSlice.
* Data with other dtypes is converted to a DataSlice with corresponding
  schema.

Args:
  df_: pandas DataFrame to convert.
  as_obj: whether to convert the resulting DataSlice to Objects.

Returns:
  DataSlice of items with attributes from DataFrame columns.

kd_ext.pdkd.from_series(series: Series) -> DataSlice

Creates a DataSlice from the given pandas Series.

The Series is first converted to a DataFrame with a single column named
'self_', and then `from_dataframe` is used to convert it to a DataSlice.

Args:
  series: pandas Series to convert.

Returns:
  DataSlice representing the content of the Series.

kd_ext.pdkd.to_dataframe(ds: DataSlice, cols: Sequence[str | Expr] | Mapping[str, str | Expr] | None = None, include_self: bool = False) -> DataFrame

Alias for kd_ext.pdkd.df

kd_ext.pdkd.to_series(ds: DataSlice, col: str | Expr | None = None) -> Series

Creates a pandas Series from the given DataSlice.

If `col` is not provided, it behaves like `to_dataframe` with no columns
specified and extracts 'self_', or raises an error if the inference would
yield multiple columns.

Args:
  ds: DataSlice to convert.
  col: the column to extract from the DataSlice. If None, inference is used.

Returns:
  Series representing the extracted column.