Operators that work with string data.
kd.strings.agg_join(x, sep=None, ndim=unspecified)
Returns a DataSlice of strings joined on last ndim dimensions.
Example:
ds = kd.slice([['el', 'psy', 'congroo'], ['a', 'b', 'c']])
kd.agg_join(ds, ' ') # -> kd.slice(['el psy congroo', 'a b c'])
kd.agg_join(ds, ' ', ndim=2) # -> kd.slice('el psy congroo a b c')
Args:
x: String or bytes DataSlice
sep: If specified, the strings are joined with this separator; otherwise an
empty string is used.
ndim: The number of last dimensions to join over. Requires 0 <= ndim
<= get_ndim(x).
kd.strings.contains(s, substr)
Returns present iff `s` contains `substr`.
Examples:
kd.strings.contains(kd.slice(['Hello', 'Goodbye']), 'lo')
# -> kd.slice([kd.present, kd.missing])
kd.strings.contains(
kd.slice([b'Hello', b'Goodbye']),
kd.slice([b'lo', b'Go']))
# -> kd.slice([kd.present, kd.present])
Args:
s: The strings to consider. Must have schema STRING or BYTES.
substr: The substrings to look for in `s`. Must have the same schema as `s`.
Returns:
The DataSlice of present/missing values with schema MASK.
kd.strings.count(s, substr)
Counts the number of occurrences of `substr` in `s`.
Examples:
kd.strings.count(kd.slice(['Hello', 'Goodbye']), 'l')
# -> kd.slice([2, 0])
kd.strings.count(
kd.slice([b'Hello', b'Goodbye']),
kd.slice([b'Hell', b'o']))
# -> kd.slice([1, 2])
Args:
s: The strings to consider.
substr: The substrings to count in `s`. Must have the same schema as `s`.
Returns:
The DataSlice of INT32 counts.
kd.strings.decode(x, errors='strict')
Decodes bytes to string element-wise (using utf-8 coding).
Args:
x: DataSlice of BYTES.
errors: DataSlice of STRING, signalling how to treat utf-8 decode errors.
Supported options are 'strict': raise an error on any invalid byte,
'ignore': omit invalid bytes in the result without raising, 'replace':
replace invalid bytes with U+FFFD.
Returns:
Decoded STRING DataSlice, same dimensionality as `x`.
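The three `errors` modes carry the same names as the error handlers of Python's built-in `bytes.decode`. The sketch below uses plain Python (not Koladata) to show what each mode does with one invalid byte. It assumes `kd.strings.decode` applies the same treatment element-wise, as the description above states.

```python
# Plain-Python illustration of the three error modes; no Koladata required.
data = b'caf\xff'  # 0xFF can never appear in valid UTF-8

ignored = data.decode('utf-8', errors='ignore')    # invalid byte is dropped
replaced = data.decode('utf-8', errors='replace')  # invalid byte becomes U+FFFD

try:
    data.decode('utf-8', errors='strict')
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True  # 'strict' refuses the input

print(repr(ignored), repr(replaced), strict_raised)
```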
kd.strings.decode_base64(x, /, *, on_invalid=unspecified)
Decodes BYTES from `x` using base64 encoding (RFC 4648 section 4).
The input strings may either have no padding, or must have the correct amount
of padding. ASCII whitespace characters anywhere in the string are ignored.
Args:
x: DataSlice of STRING or BYTES containing base64-encoded strings.
on_invalid: If unspecified (the default), any invalid base64 strings in `x`
will cause an error. Otherwise, this must be a DataSlice broadcastable to
`x` with a schema compatible with BYTES, and will be used in the result
wherever the input string was not valid base64.
Returns:
DataSlice of BYTES.
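The padding and whitespace rules, together with the `on_invalid` fallback, can be approximated in plain Python. This is a rough sketch built on the standard `base64` module, not Koladata's implementation; the helper name `decode_base64_lenient` is hypothetical and it handles a single string rather than a DataSlice.

```python
import base64
import binascii
import string

def decode_base64_lenient(text, on_invalid=None):
    """Hypothetical helper approximating the rules described above."""
    # Ignore ASCII whitespace anywhere in the input.
    compact = ''.join(ch for ch in text if ch not in string.whitespace)
    # Accept input without padding by restoring the '=' characters.
    if '=' not in compact:
        compact += '=' * (-len(compact) % 4)
    try:
        return base64.b64decode(compact, validate=True)
    except (binascii.Error, ValueError):
        if on_invalid is None:
            raise  # default: invalid input is an error
        return on_invalid  # fallback value for invalid input

print(decode_base64_lenient('aGVsbG8'))             # unpadded input
print(decode_base64_lenient(' aGVs\nbG8= '))        # whitespace is skipped
print(decode_base64_lenient('!!!', on_invalid=b''))  # fallback is used
```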
kd.strings.encode(x)
Encodes `x` as BYTES using UTF-8 encoding.
kd.strings.encode_base64(x)
Encodes BYTES `x` using base64 encoding (RFC 4648 section 4), with padding.
Args:
x: DataSlice of BYTES to encode.
Returns:
DataSlice of STRING.
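The two encoding steps are often chained: text is first encoded to UTF-8 bytes, and the bytes are then base64-encoded into text again. The plain-Python sketch below (standard library only, not Koladata) shows the padded output and the round trip back to the original string.

```python
import base64

text = 'hi'
raw = text.encode('utf-8')                        # STRING -> BYTES
encoded = base64.b64encode(raw).decode('ascii')   # BYTES -> base64 text, padded
round_trip = base64.b64decode(encoded).decode('utf-8')

print(raw, encoded, round_trip)
```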
kd.strings.find(s, substr, start=0, end=None)
Returns the offset of the first occurrence of `substr` in `s`.
The units of `start`, `end`, and the return value are all byte offsets if `s`
is `BYTES` and codepoint offsets if `s` is `STRING`.
Args:
s: (STRING or BYTES) Strings to search in.
substr: (STRING or BYTES) Strings to search for in `s`. Should have the same
dtype as `s`.
start: (optional int) Offset to start the search, defaults to 0.
end: (optional int) Offset to stop the search, defaults to end of the string.
Returns:
The offset of the first occurrence of `substr` in `s`, or missing if there
are no occurrences.
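The difference between codepoint offsets and byte offsets shows up as soon as the text contains non-ASCII characters. The sketch below uses Python's own `str.find` and `bytes.find` as an analogy; it is not Koladata code, and Python signals "no occurrence" with -1 where the operator above returns a missing value.

```python
# Plain Python only: str offsets count codepoints, bytes offsets count bytes.
text = 'na\u00efve caf\u00e9'   # 'naïve café'
needle = 'caf\u00e9'            # 'café'

codepoint_offset = text.find(needle)
byte_offset = text.encode('utf-8').find(needle.encode('utf-8'))

# `start` narrows the search window, in the same units as the result.
second_a = text.find('a', 2)

# No occurrence: Python returns -1 instead of a missing value.
not_found = text.find('tea')

print(codepoint_offset, byte_offset, second_a, not_found)  # 6 7 7 -1
```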
kd.strings.format(fmt, /, **kwargs)
Formats strings according to python str.format style.
Format support is slightly different from Python:
1. {x:v} is equivalent to {x} and supported for all types as default string
format.
2. Only float and integers support other format specifiers.
E.g., {x:.1f} and {x:04d}.
3. If the format is missing the type specifier `f` or `d` at the end, it is
added automatically based on the type of the argument.
Note: only keyword arguments are supported.
Examples:
kd.strings.format(kd.slice(['Hello {n}!', 'Goodbye {n}!']), n='World')
# -> kd.slice(['Hello World!', 'Goodbye World!'])
kd.strings.format('{a} + {b} = {c}', a=1, b=2, c=3)
# -> kd.slice('1 + 2 = 3')
kd.strings.format(
'{a} + {b} = {c}',
a=kd.slice([1, 2]),
b=kd.slice([2, 3]),
c=kd.slice([3, 5]))
# -> kd.slice(['1 + 2 = 3', '2 + 3 = 5'])
kd.strings.format(
'({a:03} + {b:e}) * {c:.2f} ='
' {a:02d} * {c:3d} + {b:07.3f} * {c:08.4f}',
a=5, b=5.7, c=75)
# -> kd.slice(
# '(005 + 5.700000e+00) * 75.00 = 05 * 75 + 005.700 * 075.0000')
Args:
fmt: Format string (String or Bytes).
**kwargs: Arguments to format.
Returns:
The formatted string.
kd.strings.fstr(x)
Evaluates Koladata f-string into DataSlice.
f-string must be created via Python f-string syntax. It must contain at least
one formatted DataSlice.
Each DataSlice must have custom format specification,
e.g. `{ds:s}` or `{ds:.2f}`.
Find more about format specification in kd.strings.format docs.
NOTE: `{ds:s}` can be used for any type to achieve default string conversion.
Examples:
countries = kd.slice(['USA', 'Schweiz'])
kd.fstr(f'Hello, {countries:s}!')
# -> kd.slice(['Hello, USA!', 'Hello, Schweiz!'])
greetings = kd.slice(['Hello', 'Gruezi'])
kd.fstr(f'{greetings:s}, {countries:s}!')
# -> kd.slice(['Hello, USA!', 'Gruezi, Schweiz!'])
states = kd.slice([['California', 'Arizona', 'Nevada'], ['Zurich', 'Bern']])
kd.fstr(f'{greetings:s}, {states:s} in {countries:s}!')
# -> kd.slice([
#   ['Hello, California in USA!',
#    'Hello, Arizona in USA!',
#    'Hello, Nevada in USA!'],
#   ['Gruezi, Zurich in Schweiz!',
#    'Gruezi, Bern in Schweiz!']])
prices = kd.slice([35.5, 49.2])
currencies = kd.slice(['USD', 'CHF'])
kd.fstr(f'Lunch price in {countries:s} is {prices:.2f} {currencies:s}.')
# -> kd.slice(['Lunch price in USA is 35.50 USD.',
#              'Lunch price in Schweiz is 49.20 CHF.'])
Args:
x: f-string to evaluate.
Returns:
DataSlice with evaluated f-string.
kd.strings.join(*args)
Concatenates the given strings.
Examples:
kd.strings.join(kd.slice(['Hello ', 'Goodbye ']), 'World')
# -> kd.slice(['Hello World', 'Goodbye World'])
kd.strings.join(kd.slice([b'foo']), kd.slice([b' ']), kd.slice([b'bar']))
# -> kd.slice([b'foo bar'])
Args:
*args: The inputs to concatenate in the given order.
Returns:
The string concatenation of all the inputs.
kd.strings.length(x)
Returns a DataSlice of lengths, in bytes for BYTES or in codepoints for STRING.
For example,
kd.strings.length(kd.slice(['abc', None, ''])) -> kd.slice([3, None, 0])
kd.strings.length(kd.slice([b'abc', None, b''])) -> kd.slice([3, None, 0])
kd.strings.length(kd.item('你好')) -> kd.item(2)
kd.strings.length(kd.item('你好'.encode())) -> kd.item(6)
Note that the result DataSlice always has INT32 schema.
Args:
x: String or Bytes DataSlice.
Returns:
A DataSlice of lengths.
kd.strings.lower(x)
Returns a DataSlice with the lowercase version of each string in the input.
For example,
kd.strings.lower(kd.slice(['AbC', None, ''])) -> kd.slice(['abc', None, ''])
kd.strings.lower(kd.item('FOO')) -> kd.item('foo')
Note that the result DataSlice always has STRING schema.
Args:
x: String DataSlice.
Returns:
A String DataSlice of lowercase strings.
kd.strings.lstrip(s, chars=None)
Strips whitespaces or the specified characters from the left side of `s`.
If `chars` is missing, then whitespaces are removed.
If `chars` is present, then it will strip all leading characters from `s`
that are present in the `chars` set.
Examples:
kd.strings.lstrip(kd.slice([' spacious ', '\t text \n']))
# -> kd.slice(['spacious ', 'text \n'])
kd.strings.lstrip(kd.slice(['www.example.com']), kd.slice(['cmowz.']))
# -> kd.slice(['example.com'])
kd.strings.lstrip(kd.slice([['#... Section 3.1 Issue #32 ...'], ['# ...']]),
kd.slice('.#! '))
# -> kd.slice([['Section 3.1 Issue #32 ...'], ['']])
Args:
s: (STRING or BYTES) Original string.
chars: (Optional STRING or BYTES, the same as `s`) The set of chars to
remove.
Returns:
Stripped string.
kd.strings.printf(fmt, *args)
Formats strings according to printf-style (C++) format strings.
See absl::StrFormat documentation for the format string details.
Examples:
kd.strings.printf(kd.slice(['Hello %s!', 'Goodbye %s!']), 'World')
# -> kd.slice(['Hello World!', 'Goodbye World!'])
kd.strings.printf('%v + %v = %v', 1, 2, 3) # -> kd.slice('1 + 2 = 3')
Args:
fmt: Format string (String or Bytes).
*args: Arguments to format (primitive types compatible with `fmt`).
Returns:
The formatted string.
kd.strings.regex_extract(text, regex)
Extracts a substring from `text` with the capturing group of `regex`.
Regular expression matches are partial, which means `regex` is matched against
a substring of `text`.
For full matches, where the whole string must match a pattern, please enclose
the pattern in `^` and `$` characters.
The pattern must contain exactly one capturing group.
Examples:
kd.strings.regex_extract(kd.item('foo'), kd.item('f(.)'))
# -> kd.item('o')
kd.strings.regex_extract(kd.item('foobar'), kd.item('o(..)'))
# -> kd.item('ob')
kd.strings.regex_extract(kd.item('foobar'), kd.item('^o(..)$'))
# -> kd.item(None).with_schema(kd.STRING)
kd.strings.regex_extract(kd.item('foobar'), kd.item('^.o(..)a.$'))
# -> kd.item('ob')
kd.strings.regex_extract(kd.item('foobar'), kd.item('.*(b.*r)$'))
# -> kd.item('bar')
kd.strings.regex_extract(kd.slice(['abcd', None, '']), kd.slice('b(.*)'))
# -> kd.slice(['cd', None, None])
Args:
text: (STRING) A string.
regex: (STRING) A scalar string that represents a regular expression (RE2
syntax) with exactly one capturing group.
Returns:
For the first partial match of `regex` and `text`, returns the substring of
`text` that matches the capturing group of `regex`.
kd.strings.regex_find_all(text, regex)
Returns the captured groups of all matches of `regex` in `text`.
The strings in `text` are scanned left-to-right to find all non-overlapping
matches of `regex`. The order of the matches is preserved in the result. For
each match, the substring matched by each capturing group of `regex` is
recorded. For each item of `text`, the result contains a 2-dimensional value,
where the first dimension captures the number of matches, and the second
dimension captures the captured groups.
Examples:
# No capturing groups, but two matches:
kd.strings.regex_find_all(kd.item('foo'), kd.item('o'))
# -> kd.slice([[], []])
# One capturing group, three matches:
kd.strings.regex_find_all(kd.item('foo'), kd.item('(.)'))
# -> kd.slice([['f'], ['o'], ['o']])
# Two capturing groups:
kd.strings.regex_find_all(
kd.slice(['fooz', 'bar', '', None]),
kd.item('(.)(.)')
)
# -> kd.slice([[['f', 'o'], ['o', 'z']], [['b', 'a']], [], []])
# Get information about the entire substring of each non-overlapping match
# by enclosing the pattern in additional parentheses:
kd.strings.regex_find_all(
kd.slice([['fool', 'solo'], ['bar', 'boat']]),
kd.item('((.*)o)')
)
# -> kd.slice([[[['foo', 'fo']], [['solo', 'sol']]], [[], [['bo', 'b']]]])
Args:
text: (STRING) A string.
regex: (STRING) A scalar string that represents a regular expression (RE2
syntax).
Returns:
A DataSlice where each item of `text` is associated with a 2-dimensional
representation of its matches' captured groups.
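The "matches by captured groups" layout can be mimicked for a single string with Python's `re` module. This is only an analogy: the operator uses RE2 syntax, which differs from Python's `re` in some features, and the helper name `find_all_groups` is hypothetical. The patterns below are simple enough to behave the same in both.

```python
import re

def find_all_groups(text, pattern):
    """Hypothetical helper: one list of captured groups per match."""
    # re.finditer scans left-to-right and yields non-overlapping matches.
    return [list(m.groups()) for m in re.finditer(pattern, text)]

print(find_all_groups('foo', 'o'))          # two matches, no groups
print(find_all_groups('foo', '(.)'))        # three matches, one group each
print(find_all_groups('fooz', '(.)(.)'))    # two matches, two groups each
print(find_all_groups('fool', '((.*)o)'))   # outer group holds the whole match
```

For a missing or empty input the operator returns an empty result, as in the examples above; the helper only covers the per-string case.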
kd.strings.regex_match(text, regex)
Returns `present` if `text` matches the regular expression `regex`.
Matches are partial, which means a substring of `text` matches the pattern.
For full matches, where the whole string must match a pattern, please enclose
the pattern in `^` and `$` characters.
Examples:
kd.strings.regex_match(kd.item('foo'), kd.item('oo'))
# -> kd.present
kd.strings.regex_match(kd.item('foo'), '^oo$')
# -> kd.missing
kd.strings.regex_match(kd.item('foo'), '^foo$')
# -> kd.present
kd.strings.regex_match(kd.slice(['abc', None, '']), 'b')
# -> kd.slice([kd.present, kd.missing, kd.missing])
kd.strings.regex_match(kd.slice(['abcd', None, '']), kd.slice('b.d'))
# -> kd.slice([kd.present, kd.missing, kd.missing])
Args:
text: (STRING) A string.
regex: (STRING) A scalar string that represents a regular expression (RE2
syntax).
Returns:
`present` if `text` matches `regex`.
kd.strings.regex_replace_all(text, regex, replacement)
Replaces all non-overlapping matches of `regex` in `text`.
Examples:
# Basic with match:
kd.strings.regex_replace_all(
kd.item('banana'),
kd.item('ana'),
kd.item('ono')
) # -> kd.item('bonona')
# Basic with no match:
kd.strings.regex_replace_all(
kd.item('banana'),
kd.item('x'),
kd.item('a')
) # -> kd.item('banana')
# Reference the first capturing group in the replacement:
kd.strings.regex_replace_all(
kd.item('banana'),
kd.item('a(.)a'),
kd.item(r'o\1\1o')
) # -> kd.item('bonnona')
# Reference the whole match in the replacement with \0:
kd.strings.regex_replace_all(
kd.item('abcd'),
kd.item('(.)(.)'),
kd.item(r'\2\1\0')
) # -> kd.item('baabdccd')
# With broadcasting:
kd.strings.regex_replace_all(
kd.item('foopo'),
kd.item('o'),
kd.slice(['a', 'e']),
) # -> kd.slice(['faapa', 'feepe'])
# With missing values:
kd.strings.regex_replace_all(
kd.slice(['foobor', 'foo', None, 'bar']),
kd.item('o(.)'),
kd.slice([r'\0x\1', 'ly', 'a', 'o']),
) # -> kd.slice(['fooxoborxr', 'fly', None, 'bar'])
Args:
text: (STRING) A string.
regex: (STRING) A scalar string that represents a regular expression (RE2
syntax).
replacement: (STRING) A string that should replace each match.
Backslash-escaped digits (\1 to \9) can be used to reference the text that
matched the corresponding capturing group from the pattern, while \0
refers to the entire match. Replacements are not subject to re-matching.
Since it only replaces non-overlapping matches, replacing "ana" within
"banana" makes only one replacement, not two.
Returns:
The text string where the replacements have been made.
kd.strings.replace(s, old, new, max_subs=None)
Replaces up to `max_subs` occurrences of `old` within `s` with `new`.
If `max_subs` is missing or negative, then there is no limit on the number of
substitutions. If it is zero, then `s` is returned unchanged.
If the search string is empty, the original string is fenced with the
replacement string, for example: replace("ab", "", "-") returns "-a-b-". That
behavior is similar to Python's string replace.
Args:
s: (STRING or BYTES) Original string.
old: (STRING or BYTES, the same as `s`) String to replace.
new: (STRING or BYTES, the same as `s`) Replacement string.
max_subs: (optional INT32) Max number of substitutions. If unspecified or
negative, then there is no limit on the number of substitutions.
Returns:
String with applied substitutions.
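Since the description compares this behavior to Python's string replace, the edge cases can be checked with `str.replace` directly. The sketch below is plain Python, not Koladata; Python's third argument plays the role of `max_subs`.

```python
# Empty search string: the replacement "fences" every character.
fenced = 'ab'.replace('', '-')

# A positive limit stops after that many substitutions.
limited = 'aaaa'.replace('a', 'b', 2)

# A negative limit means no limit; zero leaves the string unchanged.
unlimited = 'aaaa'.replace('a', 'b', -1)
unchanged = 'aaaa'.replace('a', 'b', 0)

print(fenced, limited, unlimited, unchanged)  # -a-b- bbaa bbbb aaaa
```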
kd.strings.rfind(s, substr, start=0, end=None)
Returns the offset of the last occurrence of `substr` in `s`.
The units of `start`, `end`, and the return value are all byte offsets if `s`
is `BYTES` and codepoint offsets if `s` is `STRING`.
Args:
s: (STRING or BYTES) Strings to search in.
substr: (STRING or BYTES) Strings to search for in `s`. Should have the same
dtype as `s`.
start: (optional int) Offset to start the search, defaults to 0.
end: (optional int) Offset to stop the search, defaults to end of the string.
Returns:
The offset of the last occurrence of `substr` in `s`, or missing if there
are no occurrences.
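As an analogy for how `start` and `end` bound a search for the last occurrence, the sketch below uses Python's `str.rfind` and `bytes.rfind`. It is plain Python rather than Koladata, and Python returns -1 where the operator above returns a missing value.

```python
text = 'na\u00efve caf\u00e9'   # 'naïve café'

# Last 'a' in the whole string, as a codepoint offset.
last_a = text.rfind('a')

# With end=7 only text[0:7] ('naïve c') is searched.
last_a_before_7 = text.rfind('a', 0, 7)

# On bytes the offset shifts, because 'ï' occupies two bytes in UTF-8.
last_a_bytes = text.encode('utf-8').rfind(b'a')

print(last_a, last_a_before_7, last_a_bytes)  # 7 1 8
```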
kd.strings.rstrip(s, chars=None)
Strips whitespaces or the specified characters from the right side of `s`.
If `chars` is missing, then whitespaces are removed.
If `chars` is present, then it will strip all trailing characters from `s` that
are present in the `chars` set.
Examples:
kd.strings.rstrip(kd.slice([' spacious ', '\t text \n']))
# -> kd.slice([' spacious', '\t text'])
kd.strings.rstrip(kd.slice(['www.example.com']), kd.slice(['cmowz.']))
# -> kd.slice(['www.example'])
kd.strings.rstrip(kd.slice([['#... Section 3.1 Issue #32 ...'], ['# ...']]),
kd.slice('.#! '))
# -> kd.slice([['#... Section 3.1 Issue #32'], ['']])
Args:
s: (STRING or BYTES) Original string.
chars: (Optional STRING or BYTES, the same as `s`) The set of chars to
remove.
Returns:
Stripped string.
kd.strings.split(x, sep=None)
Returns x split by the provided separator.
Example:
ds = kd.slice(['Hello world!', 'Goodbye world!'])
kd.split(ds) # -> kd.slice([['Hello', 'world!'], ['Goodbye', 'world!']])
Args:
x: DataSlice of STRING or BYTES.
sep: If specified, splits by the specified string and keeps empty strings;
otherwise splits by whitespace and omits empty strings.
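The two `sep` modes differ in how they treat empty pieces. Python's `str.split` draws the same distinction, so it serves as a plain-Python analogy here (not Koladata code) for what an explicit separator changes.

```python
text = 'a  b '  # two spaces in the middle, one trailing space

# No separator: split on runs of whitespace and drop empty strings.
by_whitespace = text.split()

# Explicit separator: every occurrence splits, so empty strings are kept.
by_space = text.split(' ')

print(by_whitespace, by_space)  # ['a', 'b'] ['a', '', 'b', '']
```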
kd.strings.strip(s, chars=None)
Strips whitespaces or the specified characters from both sides of `s`.
If `chars` is missing, then whitespaces are removed.
If `chars` is present, then it will strip all leading and trailing characters
from `s` that are present in the `chars` set.
Examples:
kd.strings.strip(kd.slice([' spacious ', '\t text \n']))
# -> kd.slice(['spacious', 'text'])
kd.strings.strip(kd.slice(['www.example.com']), kd.slice(['cmowz.']))
# -> kd.slice(['example'])
kd.strings.strip(kd.slice([['#... Section 3.1 Issue #32 ...'], ['# ...']]),
kd.slice('.#! '))
# -> kd.slice([['Section 3.1 Issue #32'], ['']])
Args:
s: (STRING or BYTES) Original string.
chars: (Optional STRING or BYTES, the same as `s`) The set of chars to
remove.
Returns:
Stripped string.
kd.strings.substr(x, start=0, end=None)
Returns a DataSlice of substrings with indices [start, end).
The usual Python rules apply:
* A negative index is computed from the end of the string.
* An empty range yields an empty string, for example when start >= end and
both are positive.
The result is broadcasted to the common shape of all inputs.
Examples:
ds = kd.slice([['Hello World!', 'Ciao bella'], ['Dolly!']])
kd.substr(ds) # -> kd.slice([['Hello World!', 'Ciao bella'],
#                             ['Dolly!']])
kd.substr(ds, 5) # -> kd.slice([[' World!', 'bella'], ['!']])
kd.substr(ds, -2) # -> kd.slice([['d!', 'la'], ['y!']])
kd.substr(ds, 1, 5) # -> kd.slice([['ello', 'iao '], ['olly']])
kd.substr(ds, 5, -1) # -> kd.slice([[' World', 'bell'], ['']])
kd.substr(ds, 4, 100) # -> kd.slice([['o World!', ' bella'], ['y!']])
kd.substr(ds, -1, -2) # -> kd.slice([['', ''], ['']])
kd.substr(ds, -2, -1) # -> kd.slice([['d', 'l'], ['y']])
# Start and end may also be multidimensional.
ds = kd.slice('Hello World!')
start = kd.slice([1, 2])
end = kd.slice([[2, 3], [4]])
kd.substr(ds, start, end) # -> kd.slice([['e', 'el'], ['ll']])
Args:
x: Text or Bytes DataSlice. If text, then `start` and `end` are codepoint
offsets. If bytes, then `start` and `end` are byte offsets.
start: The start index of the substring. Inclusive. Assumed to be 0 if
unspecified.
end: The end index of the substring. Exclusive. Assumed to be the length of
the string if unspecified.
kd.strings.upper(x)
Returns a DataSlice with the uppercase version of each string in the input.
For example,
kd.strings.upper(kd.slice(['abc', None, ''])) -> kd.slice(['ABC', None, ''])
kd.strings.upper(kd.item('foo')) -> kd.item('FOO')
Note that the result DataSlice always has STRING schema.
Args:
x: String DataSlice.
Returns:
A String DataSlice of uppercase strings.