koladata

kd.json API

JSON serialization and parsing operators.

kd.json.filter_json(x, field_to_extract, *, ignore_errors=False)

Extracts the requested field from each of the given JSON strings.

It automatically fixes some errors in the input JSON: replaces single quotes
with double quotes, quotes unquoted keys and values, handles linebreaks in
string literals. Also removes all spaces and linebreaks outside of string
literals.

Args:
  x: Slice of strings, each one is a separate JSON.
  field_to_extract: JSONPath string (e.g. "$.docs[*].name"), specifies a field
    to extract from the input JSONs. Only a subset of JSONPath features is
    supported. List indices can be specified only as `[*]`.
  ignore_errors: Boolean. If True, then errors are ignored -- in case of
    an error the rest of the invalid JSON is considered to have no matches.

Returns:
  A slice of strings with one dimension more than `x`. Each value is a JSON
  corresponding to the given JSONPath.
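
To illustrate what a path such as `$.docs[*].name` selects, here is a minimal pure-Python sketch. It is not Koda's implementation: `extract_field` is a hypothetical helper, it handles only `$`, `.name`, and `[*]` steps, it does none of the error repair described above, and its output formatting follows Python's `json` module rather than `filter_json`.

```python
import json


def extract_field(json_text, path):
    """Return one JSON string per match of a tiny JSONPath subset.

    Hypothetical helper for illustration only; supports `$`, `.name`, `[*]`.
    """
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")

    # Split "$.docs[*].name" into steps: ["docs", "[*]", "name"].
    steps = []
    for part in path[1:].split("."):
        if not part:
            continue
        name, _, _ = part.partition("[")
        if name:
            steps.append(name)
        steps.extend(["[*]"] * part.count("[*]"))

    # Walk the parsed value, keeping every node that matches so far.
    current = [json.loads(json_text)]
    for step in steps:
        matched = []
        for value in current:
            if step == "[*]":
                if isinstance(value, list):
                    matched.extend(value)
            elif isinstance(value, dict) and step in value:
                matched.append(value[step])
        current = matched
    return [json.dumps(value) for value in current]


doc = '{"docs": [{"name": "a"}, {"name": "b"}, {"id": 3}]}'
matches = extract_field(doc, "$.docs[*].name")
```

Applying the helper to every string in a list yields a list of lists, which mirrors the extra dimension in the documented return value.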

kd.json.from_json(x, /, schema=OBJECT, default_number_schema=OBJECT, *, on_invalid=[], keys_attr='json_object_keys', values_attr='json_object_values')

Aliases:

Parses a DataSlice `x` of JSON strings.

The result will have the same shape as `x`, and missing items in `x` will be
missing in the result. The result will use a new immutable DataBag.

If `schema` is OBJECT (the default), the schema is inferred from the JSON
data, and the result will have an OBJECT schema. The decoded data will only
have BOOLEAN, numeric, STRING, LIST[OBJECT], and entity schemas, corresponding
to JSON primitives, arrays, and objects.

If `default_number_schema` is OBJECT (the default), then the inferred schema
of each JSON number will be INT32, INT64, or FLOAT32, depending on its value
and on whether it contains a decimal point or exponent, matching the combined
behavior of python json and `kd.from_py`. Otherwise, `default_number_schema`
must be a numeric schema, and the inferred schema of all JSON numbers will be
that schema.

For example:

  kd.from_json(None) -> kd.obj(None)
  kd.from_json('null') -> kd.obj(None)
  kd.from_json('true') -> kd.obj(True)
  kd.from_json('[true, false, null]') -> kd.obj([True, False, None])
  kd.from_json('[1, 2.0]') -> kd.obj([1, 2.0])
  kd.from_json('[1, 2.0]', kd.OBJECT, kd.FLOAT64)
    -> kd.obj([kd.float64(1.0), kd.float64(2.0)])
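
The number inference rule can be sketched in plain Python. This is an illustration under stated assumptions, not Koda's code: `infer_number_schema` is a hypothetical helper, the 32-bit and 64-bit signed ranges are assumed to be the cut-offs between INT32 and INT64, and the behavior for integers outside the 64-bit range is not specified in this section, so the sketch raises.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def infer_number_schema(literal):
    """Name the schema a JSON number literal gets under this sketch."""
    # A decimal point or an exponent makes the literal a float.
    if any(ch in literal for ch in ".eE"):
        return "FLOAT32"
    value = int(literal)
    # Assumption: the smallest signed integer type that fits is chosen.
    if INT32_MIN <= value <= INT32_MAX:
        return "INT32"
    if INT64_MIN <= value <= INT64_MAX:
        return "INT64"
    raise ValueError(f"integer literal out of 64-bit range: {literal}")


schemas = [infer_number_schema(s) for s in ["1", "2.0", "1e3", "3000000000"]]
```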

JSON objects parsed using an OBJECT schema will record the object key order on
the attribute specified by `keys_attr` as a LIST[STRING], and also redundantly
record a copy of the object values as a parallel LIST on the attribute
specified by `values_attr`. If there are duplicate keys, the last value is the
one stored on the Koda object attribute. If a key conflicts with `keys_attr`
or `values_attr`, it is only available in the `values_attr` list. These
behaviors can be disabled by setting `keys_attr` and/or `values_attr` to None.

For example:

  kd.from_json('{"a": 1, "b": "y", "c": null}') ->
      kd.obj(a=1.0, b='y', c=None,
             json_object_keys=kd.list(['a', 'b', 'c']),
             json_object_values=kd.list([1.0, 'y', None]))
  kd.from_json('{"a": 1, "b": "y", "c": null}',
               keys_attr=None, values_attr=None) ->
      kd.obj(a=1.0, b='y', c=None)
  kd.from_json('{"a": 1, "b": "y", "c": null}',
               keys_attr='my_keys', values_attr='my_values') ->
      kd.obj(a=1.0, b='y', c=None,
             my_keys=kd.list(['a', 'b', 'c']),
             my_values=kd.list([1.0, 'y', None]))
  kd.from_json('{"a": 1, "a": 2, "a": 3}') ->
      kd.obj(a=3.0,
             json_object_keys=kd.list(['a', 'a', 'a']),
             json_object_values=kd.list([1.0, 2.0, 3.0]))
  kd.from_json('{"json_object_keys": ["x", "y"]}') ->
      kd.obj(json_object_keys=kd.list(['json_object_keys']),
             json_object_values=kd.list([["x", "y"]]))
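
The key-order bookkeeping described above can be imitated with Python's `json` module, which exposes every key/value pair (including duplicates) through `object_pairs_hook`. This is only an analogy using plain dicts: `parse_with_order` is a hypothetical helper, not part of Koda, and it leaves numbers as ordinary Python values.

```python
import json


def parse_with_order(json_text,
                     keys_attr="json_object_keys",
                     values_attr="json_object_values"):
    """Parse JSON, recording key order and values like the text describes."""

    def hook(pairs):
        obj = {}
        for key, value in pairs:
            # Keys that collide with the bookkeeping attributes are only
            # reachable through the recorded values list.
            if key not in (keys_attr, values_attr):
                obj[key] = value  # duplicate keys: the last value wins
        if keys_attr is not None:
            obj[keys_attr] = [key for key, _ in pairs]
        if values_attr is not None:
            obj[values_attr] = [value for _, value in pairs]
        return obj

    return json.loads(json_text, object_pairs_hook=hook)


dup = parse_with_order('{"a": 1, "a": 2, "a": 3}')
plain = parse_with_order('{"a": 1, "b": "y"}', keys_attr=None, values_attr=None)
clash = parse_with_order('{"json_object_keys": ["x", "y"]}')
```

The hook runs for nested objects as well, so each nested object gets its own pair of bookkeeping entries.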

If `schema` is explicitly specified, it is used to validate the JSON data,
and the result DataSlice will have `schema` as its schema.

OBJECT schemas inside subtrees of `schema` are allowed, and will use the
inference behavior described above.

Primitive schemas in `schema` will attempt to cast any JSON primitives using
normal Koda explicit casting rules, and if those fail, using the following
additional rules:
- BYTES will accept JSON strings containing base64 (RFC 4648 section 4)
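
RFC 4648 section 4 is the standard base64 alphabet (`+` and `/`, with `=` padding), not the URL-safe variant. A quick check with Python's standard library shows which bytes a JSON string such as `"MTIz"` stands for; this uses only `base64` and says nothing about how Koda decodes internally.

```python
import base64

# Decode the payload of the JSON string "MTIz" with the standard alphabet.
# validate=True rejects characters outside that alphabet.
decoded = base64.b64decode("MTIz", validate=True)

# Encoding the bytes again gives back the same text.
encoded = base64.b64encode(decoded).decode("ascii")
```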

If entity schemas in `schema` have attributes matching `keys_attr` and/or
`values_attr`, then the object key and/or value order (respectively) will be
recorded as lists on those attributes, similar to the behavior for OBJECT
described above. These attributes must have schemas LIST[STRING] and
LIST[T] (for a T compatible with the contained values) if present.

For example:

  kd.from_json('null', kd.MASK) -> kd.missing
  kd.from_json('null', kd.STRING) -> kd.str(None)
  kd.from_json('123', kd.INT32) -> kd.int32(123)
  kd.from_json('123', kd.FLOAT32) -> kd.float32(123.0)
  kd.from_json('"123"', kd.STRING) -> kd.str('123')
  kd.from_json('"123"', kd.INT32) -> kd.int32(123)
  kd.from_json('"123"', kd.FLOAT32) -> kd.float32(123.0)
  kd.from_json('"MTIz"', kd.BYTES) -> kd.bytes(b'123')
  kd.from_json('"inf"', kd.FLOAT32) -> kd.float32(float('inf'))
  kd.from_json('"1e100"', kd.FLOAT32) -> kd.float32(float('inf'))
  kd.from_json('[1, 2, 3]', kd.list_schema(kd.INT32)) -> kd.list([1, 2, 3])
  kd.from_json('{"a": 1}', kd.schema.new_schema(a=kd.INT32)) -> kd.new(a=1)
  kd.from_json('{"a": 1}', kd.dict_schema(kd.STRING, kd.INT32))
    -> kd.dict({"a": 1})

  kd.from_json('{"b": 1, "a": 2}',
               kd.new_schema(
                   a=kd.INT32, json_object_keys=kd.list_schema(kd.STRING))) ->
    kd.new(a=2, json_object_keys=kd.list(['b', 'a']))
  kd.from_json('{"b": 1, "a": 2, "c": 3}',
               kd.new_schema(a=kd.INT32,
                             json_object_keys=kd.list_schema(kd.STRING),
                             json_object_values=kd.list_schema(kd.OBJECT))) ->
    kd.new(a=2, c=3.0,
           json_object_keys=kd.list(['b', 'a', 'c']),
           json_object_values=kd.list([1, 2.0, 3.0]))

In general:

  `kd.to_json(kd.from_json(x))` is equivalent to `x`, ignoring differences in
  JSON number formatting and padding.

  `kd.from_json(kd.to_json(x), kd.get_schema(x))` is equivalent to `x` if `x`
  has a concrete (no OBJECT) schema, ignoring differences in Koda itemids.
  In other words, `to_json` doesn't capture the full information of `x`, but
  the original schema of `x` has enough additional information to recover it.

Args:
  x: A DataSlice of STRING containing JSON strings to parse.
  schema: A SCHEMA DataItem containing the desired result schema. Defaults to
    kd.OBJECT.
  default_number_schema: A SCHEMA DataItem containing a numeric schema, or
    kd.OBJECT (the default) to infer all number schemas using
    python-boxing-like rules.
  on_invalid: If specified, a DataItem to use in the result wherever the
    corresponding JSON string in `x` was invalid. If unspecified, any invalid
    JSON strings in `x` will cause an operator error.
  keys_attr: A STRING DataItem that controls which entity attribute is used to
    record json object key order, if it is present on the schema.
  values_attr: A STRING DataItem that controls which entity attribute is used
    to record json object values, if it is present on the schema.

Returns:
  A DataSlice with the same shape as `x` and schema `schema`.

kd.json.salvage(x, /, *, allow_nan=False, ensure_ascii=False, max_depth=100)

Normalizes a DataSlice of strings containing JSON-like syntax to JSON.

This operator tries its best to interpret the input as JSON.

Basic guarantees:
- Each present output string is the concatenation of zero or more
  '\n'-newline-separated valid JSON values.
- If the input is the concatenation of zero or more ASCII-whitespace-separated
  valid JSON values with container nesting depth at most `max_depth`, the
  output is JSON-value-equivalent to the input.
  - Strings are equivalent by sequence of represented unicode code points,
    and numbers are equivalent by numeric value with unlimited precision.

Supports the following additional syntax, to tolerate "variant" JSON:
- All of JSON5 according to https://spec.json5.org/
  - Non-decimal integer literal magnitudes (ignoring sign) are 64 bit.
  - Decimal number literals use unlimited precision.
- Additional syntax from Python (not covered by JSON5):
  - Line comments starting with a single hash character `#`.
  - False, True, and None (as false, true, and null).
  - Tuple literals (interpreted as arrays).
  - Single-triple-quoted and double-triple-quoted strings.
  - \a string escape interpreted as U+0007.
  - \o \oo \ooo octal string escapes.
  - \UXXXXXXXX 32-bit hexadecimal string escapes.
  - u"" and b"" string prefixes (accepted and ignored).
  - Underscores in numeric literals (like 123_456).
  - Octal (0o) and binary (0b) integer literals.
    - Magnitudes (ignoring sign) are 64 bit.
  - l and L integer suffixes (accepted and ignored).
- Additional syntax from JavaScript (not covered by Python/JSON5):
  - \u{...} variable-length hexadecimal string escapes.
  - n integer suffix (accepted and ignored).

All other input is handled in an implementation-defined way and is subject
to change in future versions.
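
For a feel of what "Python-flavoured JSON" means, the sketch below converts a few of the Python forms listed above using only the standard library. It is not how `salvage` works and covers a small fraction of its syntax: `python_literal_to_json` is a hypothetical helper, it raises on anything `ast.literal_eval` rejects, and the output spacing is that of Python's `json` module.

```python
import ast
import json


def python_literal_to_json(text):
    """Convert a Python literal (dicts, lists, tuples, scalars) to JSON."""
    # literal_eval understands True/False/None, tuples, single quotes,
    # underscores in numbers, and 0o/0b integer literals.
    value = ast.literal_eval(text)
    # json.dumps writes tuples as arrays and None as null.
    return json.dumps(value)


source = "{'a': (1, 2), 'b': None, 'c': True, 'd': 1_000, 'e': 0b101}"
normalized = python_literal_to_json(source)
```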

Args:
  x: A DataSlice of STRING containing JSON-like strings to parse.
  allow_nan: A BOOLEAN DataItem. If true, like in python `json.dumps`, the
    non-standard JSON number literals `NaN` and `Infinity` and `-Infinity` are
    allowed in the output.
  ensure_ascii: A BOOLEAN DataItem. If `True`, the output will contain only
    ASCII-range characters. If `False` (the default), non-ASCII code points in
    output JSON strings will use UTF-8 instead of JSON escape sequences.
  max_depth: A present INT32 or INT64 DataItem. If the input contains nested
    containers deeper than `max_depth`, the output is no longer guaranteed to
    match the input value, even if the input is valid JSON. This is mainly a
    safeguard to prevent unbounded memory usage on large inputs.

Returns:
  A DataSlice of STRING with the same shape and sparsity as `x`.

kd.json.to_json(x, /, *, indent=None, ensure_ascii=True, keys_attr='json_object_keys', values_attr='json_object_values', include_missing_values=True)

Aliases:

Converts `x` to a DataSlice of JSON strings.

The following schemas are allowed:
- STRING, BYTES, INT32, INT64, FLOAT32, FLOAT64, MASK, BOOLEAN
- LIST[T] where T is an allowed schema
- DICT{K, V} where K is one of {STRING, BYTES, INT32, INT64}, and V is an
  allowed schema
- Entity schemas where all attribute values have allowed schemas
- OBJECT schemas resolving to allowed schemas

Itemid cycles are not allowed.

Missing DataSlice items in the input are missing in the result. Missing values
inside of lists/entities/etc. are encoded as JSON `null` (or `false` for
`kd.missing`). If `include_missing_values` is `False`, entity attributes with
missing values are omitted from the JSON output.

For example:

  kd.to_json(None) -> kd.str(None)
  kd.to_json(kd.missing) -> kd.str(None)
  kd.to_json(kd.present) -> 'true'
  kd.to_json(True) -> 'true'
  kd.to_json(kd.slice([1, None, 3])) -> ['1', None, '3']
  kd.to_json(kd.list([1, None, 3])) -> '[1, null, 3]'
  kd.to_json(kd.dict({'a': 1, 'b': '2'})) -> '{"a": 1, "b": "2"}'
  kd.to_json(kd.new(a=1, b='2')) -> '{"a": 1, "b": "2"}'
  kd.to_json(kd.new(x=None)) -> '{"x": null}'
  kd.to_json(kd.new(x=kd.missing)) -> '{"x": false}'
  kd.to_json(kd.new(a=1, b=None), include_missing_values=False)
    -> '{"a": 1}'

Koda BYTES values are converted to base64 strings (RFC 4648 section 4).

Integers are always stored exactly in decimal. Finite floating point values
are formatted similar to python format string `%.17g`, except that a decimal
point and at least one decimal digit are always present if the format doesn't
use scientific notation. This appears to match the behavior of python json.

Non-finite floating point values are stored as the strings "inf", "-inf" and
"nan". This differs from python json, which emits non-standard JSON tokens
`Infinity` and `NaN`. This also differs from javascript, which stores these
values as `null`, which would be ambiguous with Koda missing values. There is
unfortunately no standard way to express these values in JSON.
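
The float rules above can be sketched in Python as follows. This applies `%.17g` literally and adds `.0` when no decimal point or exponent is present; the text only says the real formatting is "similar", so treat `format_float` (a hypothetical helper) as an approximation. The last two lines show the python json tokens that the text contrasts with.

```python
import json
import math


def format_float(value):
    """Format a float as a JSON token following the rules described above."""
    # Non-finite values become JSON strings rather than bare tokens.
    if math.isnan(value):
        return '"nan"'
    if math.isinf(value):
        return '"inf"' if value > 0 else '"-inf"'
    text = "%.17g" % value
    # Keep a decimal point unless scientific notation is used.
    if "e" not in text and "." not in text:
        text += ".0"
    return text


formatted = [format_float(v) for v in [1.0, 0.5, 1e100, float("-inf")]]

# Python's json module emits non-standard tokens instead.
python_inf = json.dumps(float("inf"))
python_nan = json.dumps(float("nan"))
```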

By default, JSON objects are written with keys in sorted order. However, it is
also possible to control the key order of JSON objects using the `keys_attr`
argument. If an entity has the attribute specified by `keys_attr`, then that
attribute must have schema LIST[STRING], and the JSON object will have exactly
the key order specified in that list, including duplicate keys.

To write duplicate JSON object keys with different values, use `values_attr`
to designate an attribute to hold a parallel list of values to write.

For example:

  kd.to_json(kd.new(x=1, y=2)) -> '{"x": 1, "y": 2}'
  kd.to_json(kd.new(x=1, y=2, json_object_keys=kd.list(['y', 'x'])))
    -> '{"y": 2, "x": 1}'
  kd.to_json(kd.new(x=1, y=2, foo=kd.list(['y', 'x'])), keys_attr='foo')
    -> '{"y": 2, "x": 1}'
  kd.to_json(kd.new(x=1, y=2, z=3, json_object_keys=kd.list(['x', 'z', 'x'])))
    -> '{"x": 1, "z": 3, "x": 1}'

  kd.to_json(kd.new(json_object_keys=kd.list(['x', 'z', 'x']),
                    json_object_values=kd.list([1, 2, 3])))
    -> '{"x": 1, "z": 2, "x": 3}'
  kd.to_json(kd.new(a=kd.list(['x', 'z', 'x']), b=kd.list([1, 2, 3])),
             keys_attr='a', values_attr='b')
    -> '{"x": 1, "z": 2, "x": 3}'
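
Python dicts cannot hold duplicate keys, which is why a parallel list of keys and values is needed to write them. The sketch below shows the idea with a hypothetical `dump_object` helper built on the standard `json` module; it is an analogy for the `keys_attr`/`values_attr` behavior, not Koda code.

```python
import json


def dump_object(keys, values):
    """Write a JSON object from parallel key and value lists, in order."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    members = [
        f"{json.dumps(key)}: {json.dumps(value)}"
        for key, value in zip(keys, values)
    ]
    return "{" + ", ".join(members) + "}"


# Keys only: look each key up in the attributes, repeating as requested.
attrs = {"x": 1, "y": 2, "z": 3}
order = ["x", "z", "x"]
keys_only = dump_object(order, [attrs[k] for k in order])

# Keys plus a parallel values list: duplicates may carry different values.
keys_and_values = dump_object(["x", "z", "x"], [1, 2, 3])

# Default behavior for comparison: sorted keys.
sorted_default = json.dumps({"y": 2, "x": 1}, sort_keys=True)
```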


The `indent` and `ensure_ascii` arguments control JSON formatting:
- If `indent` is negative, then the JSON is formatted without any whitespace.
- If `indent` is None (the default), the JSON is formatted with a single
  padding space only after ',' and ':' and no other whitespace.
- If `indent` is zero or positive, the JSON is pretty-printed, with that
  number of spaces used for indenting each level.
- If `ensure_ascii` is True (the default) then all non-ASCII code points in
  strings will be escaped, and the result strings will be ASCII-only.
  Otherwise, they will be left as-is.

For example:

  kd.to_json(kd.list([1, 2, 3]), indent=-1) -> '[1,2,3]'
  kd.to_json(kd.list([1, 2, 3]), indent=2) -> '[\n  1,\n  2,\n  3\n]'

  kd.to_json('✨', ensure_ascii=True) -> '"\\u2728"'
  kd.to_json('✨', ensure_ascii=False) -> '"✨"'
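
For readers used to Python's `json.dumps`, the formatting options map roughly as sketched below. Note that Python's own `indent` semantics differ for negative values, so the mapping is spelled out explicitly; `dumps_like_to_json` is a hypothetical helper that only mimics the whitespace rules listed above.

```python
import json


def dumps_like_to_json(value, indent=None, ensure_ascii=True):
    """Mimic the documented whitespace rules with Python's json module."""
    if indent is None:
        # Single space after ',' and ':' (json.dumps default separators).
        return json.dumps(value, ensure_ascii=ensure_ascii)
    if indent < 0:
        # No whitespace at all.
        return json.dumps(value, separators=(",", ":"),
                          ensure_ascii=ensure_ascii)
    # Pretty-print with `indent` spaces per level.
    return json.dumps(value, indent=indent, ensure_ascii=ensure_ascii)


compact = dumps_like_to_json([1, 2, 3], indent=-1)
default = dumps_like_to_json([1, 2, 3])
pretty = dumps_like_to_json([1, 2, 3], indent=2)
escaped = dumps_like_to_json("✨", ensure_ascii=True)
raw = dumps_like_to_json("✨", ensure_ascii=False)
```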

Args:
  x: The DataSlice to convert.
  indent: An INT32 DataItem that describes how the result should be indented.
  ensure_ascii: A BOOLEAN DataItem that controls non-ASCII escaping.
  keys_attr: A STRING DataItem naming the entity attribute that determines
    json object key order, or None to always use sorted order. Defaults to
    `json_object_keys`.
  values_attr: A STRING DataItem that can be used with `keys_attr` to give
    full control over json object contents. Defaults to
    `json_object_values`.
  include_missing_values: A BOOLEAN DataItem. If `False`, attributes with
    missing values will be omitted from entity JSON objects. Defaults to
    `True`.