datafusion.functions

User functions for operating on Expr.

Attributes

today

Functions

abs(→ datafusion.expr.Expr)

Return the absolute value of a given number.

acos(→ datafusion.expr.Expr)

Returns the arc cosine or inverse cosine of a number.

acosh(→ datafusion.expr.Expr)

Returns inverse hyperbolic cosine.

alias(→ datafusion.expr.Expr)

Creates an alias expression with an optional metadata dictionary.

approx_distinct(→ datafusion.expr.Expr)

Returns the approximate number of distinct values.

approx_median(→ datafusion.expr.Expr)

Returns the approximate median value.

approx_percentile_cont(→ datafusion.expr.Expr)

Returns the value that is approximately at a given percentile of expr.

approx_percentile_cont_with_weight(→ datafusion.expr.Expr)

Returns the value of the weighted approximate percentile.

array(→ datafusion.expr.Expr)

Returns an array using the specified input expressions.

array_agg(→ datafusion.expr.Expr)

Aggregate values into an array.

array_any_value(→ datafusion.expr.Expr)

Returns the first non-null element in the array.

array_append(→ datafusion.expr.Expr)

Appends an element to the end of an array.

array_cat(→ datafusion.expr.Expr)

Concatenates the input arrays.

array_concat(→ datafusion.expr.Expr)

Concatenates the input arrays.

array_contains(→ datafusion.expr.Expr)

Returns true if the element appears in the array, otherwise false.

array_dims(→ datafusion.expr.Expr)

Returns an array of the array's dimensions.

array_distance(→ datafusion.expr.Expr)

Returns the Euclidean distance between two numeric arrays.

array_distinct(→ datafusion.expr.Expr)

Returns distinct values from the array after removing duplicates.

array_element(→ datafusion.expr.Expr)

Extracts the element with the index n from the array.

array_empty(→ datafusion.expr.Expr)

Returns a boolean indicating whether the array is empty.

array_except(→ datafusion.expr.Expr)

Returns the elements that appear in array1 but not in array2.

array_extract(→ datafusion.expr.Expr)

Extracts the element with the index n from the array.

array_has(→ datafusion.expr.Expr)

Returns true if the element appears in the first array, otherwise false.

array_has_all(→ datafusion.expr.Expr)

Determines whether first_array contains every element of second_array.

array_has_any(→ datafusion.expr.Expr)

Determine if there is an overlap between first_array and second_array.

array_indexof(→ datafusion.expr.Expr)

Return the position of the first occurrence of element in array.

array_intersect(→ datafusion.expr.Expr)

Returns the intersection of array1 and array2.

array_join(→ datafusion.expr.Expr)

Converts each element to its text representation.

array_length(→ datafusion.expr.Expr)

Returns the length of the array.

array_max(→ datafusion.expr.Expr)

Returns the maximum value in the array.

array_min(→ datafusion.expr.Expr)

Returns the minimum value in the array.

array_ndims(→ datafusion.expr.Expr)

Returns the number of dimensions of the array.

array_pop_back(→ datafusion.expr.Expr)

Returns the array without the last element.

array_pop_front(→ datafusion.expr.Expr)

Returns the array without the first element.

array_position(→ datafusion.expr.Expr)

Return the position of the first occurrence of element in array.

array_positions(→ datafusion.expr.Expr)

Searches for an element in the array and returns all occurrences.

array_prepend(→ datafusion.expr.Expr)

Prepends an element to the beginning of an array.

array_push_back(→ datafusion.expr.Expr)

Appends an element to the end of an array.

array_push_front(→ datafusion.expr.Expr)

Prepends an element to the beginning of an array.

array_remove(→ datafusion.expr.Expr)

Removes the first element from the array equal to the given value.

array_remove_all(→ datafusion.expr.Expr)

Removes all elements from the array equal to the given value.

array_remove_n(→ datafusion.expr.Expr)

Removes the first max elements from the array equal to the given value.

array_repeat(→ datafusion.expr.Expr)

Returns an array containing element count times.

array_replace(→ datafusion.expr.Expr)

Replaces the first occurrence of from_val with to_val.

array_replace_all(→ datafusion.expr.Expr)

Replaces all occurrences of from_val with to_val.

array_replace_n(→ datafusion.expr.Expr)

Replace n occurrences of from_val with to_val.

array_resize(→ datafusion.expr.Expr)

Returns an array with the specified size filled.

array_reverse(→ datafusion.expr.Expr)

Reverses the order of elements in the array.

array_slice(→ datafusion.expr.Expr)

Returns a slice of the array.

array_sort(→ datafusion.expr.Expr)

Sort an array.

array_to_string(→ datafusion.expr.Expr)

Converts each element to its text representation.

array_union(→ datafusion.expr.Expr)

Returns an array of the elements in the union of array1 and array2.

arrays_overlap(→ datafusion.expr.Expr)

Returns true if any element appears in both arrays.

arrays_zip(→ datafusion.expr.Expr)

Combines multiple arrays into a single array of structs.

arrow_cast(→ datafusion.expr.Expr)

Casts an expression to a specified data type.

arrow_metadata(→ datafusion.expr.Expr)

Returns the metadata of the input expression.

arrow_typeof(→ datafusion.expr.Expr)

Returns the Arrow type of the expression.

ascii(→ datafusion.expr.Expr)

Returns the numeric code of the first character of the argument.

asin(→ datafusion.expr.Expr)

Returns the arc sine or inverse sine of a number.

asinh(→ datafusion.expr.Expr)

Returns inverse hyperbolic sine.

atan(→ datafusion.expr.Expr)

Returns inverse tangent of a number.

atan2(→ datafusion.expr.Expr)

Returns inverse tangent of a division given in the argument.

atanh(→ datafusion.expr.Expr)

Returns inverse hyperbolic tangent.

avg(→ datafusion.expr.Expr)

Returns the average value.

bit_and(→ datafusion.expr.Expr)

Computes the bitwise AND of the argument.

bit_length(→ datafusion.expr.Expr)

Returns the number of bits in the string argument.

bit_or(→ datafusion.expr.Expr)

Computes the bitwise OR of the argument.

bit_xor(→ datafusion.expr.Expr)

Computes the bitwise XOR of the argument.

bool_and(→ datafusion.expr.Expr)

Computes the boolean AND of the argument.

bool_or(→ datafusion.expr.Expr)

Computes the boolean OR of the argument.

btrim(→ datafusion.expr.Expr)

Removes all characters, spaces by default, from both sides of a string.

cardinality(→ datafusion.expr.Expr)

Returns the total number of elements in the array.

case(→ datafusion.expr.CaseBuilder)

Create a case expression.

cbrt(→ datafusion.expr.Expr)

Returns the cube root of a number.

ceil(→ datafusion.expr.Expr)

Returns the nearest integer greater than or equal to argument.

char_length(→ datafusion.expr.Expr)

The number of characters in the string.

character_length(→ datafusion.expr.Expr)

Returns the number of characters in the argument.

chr(→ datafusion.expr.Expr)

Converts the Unicode code point to a UTF8 character.

coalesce(→ datafusion.expr.Expr)

Returns the value of the first expr in args which is not NULL.

col(→ datafusion.expr.Expr)

Creates a column reference expression.

concat(→ datafusion.expr.Expr)

Concatenates the text representations of all the arguments.

concat_ws(→ datafusion.expr.Expr)

Concatenates the list args with the separator.

contains(→ datafusion.expr.Expr)

Returns true if search_str is found within string (case-sensitive).

corr(→ datafusion.expr.Expr)

Returns the correlation coefficient between value1 and value2.

cos(→ datafusion.expr.Expr)

Returns the cosine of the argument.

cosh(→ datafusion.expr.Expr)

Returns the hyperbolic cosine of the argument.

cot(→ datafusion.expr.Expr)

Returns the cotangent of the argument.

count(→ datafusion.expr.Expr)

Returns the number of rows that match the given arguments.

count_star(→ datafusion.expr.Expr)

Create a COUNT(1) aggregate expression.

covar(→ datafusion.expr.Expr)

Computes the sample covariance.

covar_pop(→ datafusion.expr.Expr)

Computes the population covariance.

covar_samp(→ datafusion.expr.Expr)

Computes the sample covariance.

cume_dist(→ datafusion.expr.Expr)

Create a cumulative distribution window function.

current_date(→ datafusion.expr.Expr)

Returns current UTC date as a Date32 value.

current_time(→ datafusion.expr.Expr)

Returns current UTC time as a Time64 value.

current_timestamp(→ datafusion.expr.Expr)

Returns the current timestamp in nanoseconds.

date_bin(→ datafusion.expr.Expr)

Coerces an arbitrary timestamp to the start of the nearest specified interval.

date_format(→ datafusion.expr.Expr)

Returns a string representation of a date, time, timestamp or duration.

date_part(→ datafusion.expr.Expr)

Extracts a subfield from the date.

date_trunc(→ datafusion.expr.Expr)

Truncates the date to a specified level of precision.

datepart(→ datafusion.expr.Expr)

Return a specified part of a date.

datetrunc(→ datafusion.expr.Expr)

Truncates the date to a specified level of precision.

decode(→ datafusion.expr.Expr)

Decode the input, using the encoding. encoding can be base64 or hex.

degrees(→ datafusion.expr.Expr)

Converts the argument from radians to degrees.

dense_rank(→ datafusion.expr.Expr)

Create a dense_rank window function.

digest(→ datafusion.expr.Expr)

Computes the binary hash of an expression using the specified algorithm.

element_at(→ datafusion.expr.Expr)

Returns the value for a given key in the map.

empty(→ datafusion.expr.Expr)

Returns true if the array is empty.

encode(→ datafusion.expr.Expr)

Encode the input, using the encoding. encoding can be base64 or hex.

ends_with(→ datafusion.expr.Expr)

Returns true if the string ends with the suffix, false otherwise.

exp(→ datafusion.expr.Expr)

Returns the exponential of the argument.

extract(→ datafusion.expr.Expr)

Extracts a subfield from the date.

factorial(→ datafusion.expr.Expr)

Returns the factorial of the argument.

find_in_set(→ datafusion.expr.Expr)

Find a string in a list of strings.

first_value(→ datafusion.expr.Expr)

Returns the first value in a group of values.

flatten(→ datafusion.expr.Expr)

Flattens an array of arrays into a single array.

floor(→ datafusion.expr.Expr)

Returns the nearest integer less than or equal to the argument.

from_unixtime(→ datafusion.expr.Expr)

Converts an integer to RFC3339 timestamp format string.

gcd(→ datafusion.expr.Expr)

Returns the greatest common divisor.

gen_series(→ datafusion.expr.Expr)

Creates a list of values in the range between start and stop.

generate_series(→ datafusion.expr.Expr)

Creates a list of values in the range between start and stop.

get_field(→ datafusion.expr.Expr)

Extracts a field from a struct or map by name.

greatest(→ datafusion.expr.Expr)

Returns the greatest value from a list of expressions.

grouping(→ datafusion.expr.Expr)

Indicates whether a column is aggregated across in the current row.

ifnull(→ datafusion.expr.Expr)

Returns x if x is not NULL. Otherwise returns y.

in_list(→ datafusion.expr.Expr)

Returns whether the argument is contained within the list values.

initcap(→ datafusion.expr.Expr)

Set the initial letter of each word to capital.

isnan(→ datafusion.expr.Expr)

Returns true if a given number is +NaN or -NaN otherwise returns false.

iszero(→ datafusion.expr.Expr)

Returns true if a given number is +0.0 or -0.0 otherwise returns false.

lag(→ datafusion.expr.Expr)

Create a lag window function.

last_value(→ datafusion.expr.Expr)

Returns the last value in a group of values.

lcm(→ datafusion.expr.Expr)

Returns the least common multiple.

lead(→ datafusion.expr.Expr)

Create a lead window function.

least(→ datafusion.expr.Expr)

Returns the least value from a list of expressions.

left(→ datafusion.expr.Expr)

Returns the first n characters in the string.

length(→ datafusion.expr.Expr)

The number of characters in the string.

levenshtein(→ datafusion.expr.Expr)

Returns the Levenshtein distance between the two given strings.

list_any_value(→ datafusion.expr.Expr)

Returns the first non-null element in the array.

list_append(→ datafusion.expr.Expr)

Appends an element to the end of an array.

list_cat(→ datafusion.expr.Expr)

Concatenates the input arrays.

list_concat(→ datafusion.expr.Expr)

Concatenates the input arrays.

list_contains(→ datafusion.expr.Expr)

Returns true if the element appears in the array, otherwise false.

list_dims(→ datafusion.expr.Expr)

Returns an array of the array's dimensions.

list_distance(→ datafusion.expr.Expr)

Returns the Euclidean distance between two numeric arrays.

list_distinct(→ datafusion.expr.Expr)

Returns distinct values from the array after removing duplicates.

list_element(→ datafusion.expr.Expr)

Extracts the element with the index n from the array.

list_empty(→ datafusion.expr.Expr)

Returns a boolean indicating whether the array is empty.

list_except(→ datafusion.expr.Expr)

Returns the elements that appear in array1 but not in array2.

list_extract(→ datafusion.expr.Expr)

Extracts the element with the index n from the array.

list_has(→ datafusion.expr.Expr)

Returns true if the element appears in the array, otherwise false.

list_has_all(→ datafusion.expr.Expr)

Determines whether first_array contains every element of second_array.

list_has_any(→ datafusion.expr.Expr)

Determine if there is an overlap between first_array and second_array.

list_indexof(→ datafusion.expr.Expr)

Return the position of the first occurrence of element in array.

list_intersect(→ datafusion.expr.Expr)

Returns the intersection of array1 and array2.

list_join(→ datafusion.expr.Expr)

Converts each element to its text representation.

list_length(→ datafusion.expr.Expr)

Returns the length of the array.

list_max(→ datafusion.expr.Expr)

Returns the maximum value in the array.

list_min(→ datafusion.expr.Expr)

Returns the minimum value in the array.

list_ndims(→ datafusion.expr.Expr)

Returns the number of dimensions of the array.

list_overlap(→ datafusion.expr.Expr)

Returns true if any element appears in both arrays.

list_pop_back(→ datafusion.expr.Expr)

Returns the array without the last element.

list_pop_front(→ datafusion.expr.Expr)

Returns the array without the first element.

list_position(→ datafusion.expr.Expr)

Return the position of the first occurrence of element in array.

list_positions(→ datafusion.expr.Expr)

Searches for an element in the array and returns all occurrences.

list_prepend(→ datafusion.expr.Expr)

Prepends an element to the beginning of an array.

list_push_back(→ datafusion.expr.Expr)

Appends an element to the end of an array.

list_push_front(→ datafusion.expr.Expr)

Prepends an element to the beginning of an array.

list_remove(→ datafusion.expr.Expr)

Removes the first element from the array equal to the given value.

list_remove_all(→ datafusion.expr.Expr)

Removes all elements from the array equal to the given value.

list_remove_n(→ datafusion.expr.Expr)

Removes the first max elements from the array equal to the given value.

list_repeat(→ datafusion.expr.Expr)

Returns an array containing element count times.

list_replace(→ datafusion.expr.Expr)

Replaces the first occurrence of from_val with to_val.

list_replace_all(→ datafusion.expr.Expr)

Replaces all occurrences of from_val with to_val.

list_replace_n(→ datafusion.expr.Expr)

Replace n occurrences of from_val with to_val.

list_resize(→ datafusion.expr.Expr)

Returns an array with the specified size filled.

list_reverse(→ datafusion.expr.Expr)

Reverses the order of elements in the array.

list_slice(→ datafusion.expr.Expr)

Returns a slice of the array.

list_sort(→ datafusion.expr.Expr)

Sorts the array.

list_to_string(→ datafusion.expr.Expr)

Converts each element to its text representation.

list_union(→ datafusion.expr.Expr)

Returns an array of the elements in the union of array1 and array2.

list_zip(→ datafusion.expr.Expr)

Combines multiple arrays into a single array of structs.

ln(→ datafusion.expr.Expr)

Returns the natural logarithm (base e) of the argument.

log(→ datafusion.expr.Expr)

Returns the logarithm of a number for a particular base.

log10(→ datafusion.expr.Expr)

Base 10 logarithm of the argument.

log2(→ datafusion.expr.Expr)

Base 2 logarithm of the argument.

lower(→ datafusion.expr.Expr)

Converts a string to lowercase.

lpad(→ datafusion.expr.Expr)

Add left padding to a string.

ltrim(→ datafusion.expr.Expr)

Removes all characters, spaces by default, from the beginning of a string.

make_array(→ datafusion.expr.Expr)

Returns an array using the specified input expressions.

make_date(→ datafusion.expr.Expr)

Make a date from year, month and day component parts.

make_list(→ datafusion.expr.Expr)

Returns an array using the specified input expressions.

make_map(→ datafusion.expr.Expr)

Returns a map expression.

make_time(→ datafusion.expr.Expr)

Make a time from hour, minute and second component parts.

map_entries(→ datafusion.expr.Expr)

Returns a list of all entries (key-value struct pairs) in the map.

map_extract(→ datafusion.expr.Expr)

Returns the value for a given key in the map.

map_keys(→ datafusion.expr.Expr)

Returns a list of all keys in the map.

map_values(→ datafusion.expr.Expr)

Returns a list of all values in the map.

max(→ datafusion.expr.Expr)

Aggregate function that returns the maximum value of the argument.

md5(→ datafusion.expr.Expr)

Computes an MD5 128-bit checksum for a string expression.

mean(→ datafusion.expr.Expr)

Returns the average (mean) value of the argument.

median(→ datafusion.expr.Expr)

Computes the median of a set of numbers.

min(→ datafusion.expr.Expr)

Aggregate function that returns the minimum value of the argument.

named_struct(→ datafusion.expr.Expr)

Returns a struct with the given names and arguments pairs.

nanvl(→ datafusion.expr.Expr)

Returns x if x is not NaN. Otherwise returns y.

now(→ datafusion.expr.Expr)

Returns the current timestamp in nanoseconds.

nth_value(→ datafusion.expr.Expr)

Returns the n-th value in a group of values.

ntile(→ datafusion.expr.Expr)

Create an n-tile window function.

nullif(→ datafusion.expr.Expr)

Returns NULL if expr1 equals expr2; otherwise it returns expr1.

nvl(→ datafusion.expr.Expr)

Returns x if x is not NULL. Otherwise returns y.

nvl2(→ datafusion.expr.Expr)

Returns y if x is not NULL. Otherwise returns z.

octet_length(→ datafusion.expr.Expr)

Returns the number of bytes of a string.

order_by(→ datafusion.expr.SortExpr)

Creates a new sort expression.

overlay(→ datafusion.expr.Expr)

Replace a substring with a new substring.

percent_rank(→ datafusion.expr.Expr)

Create a percent_rank window function.

percentile_cont(→ datafusion.expr.Expr)

Computes the exact percentile of input values using continuous interpolation.

pi(→ datafusion.expr.Expr)

Returns an approximate value of π.

pow(→ datafusion.expr.Expr)

Returns base raised to the power of exponent.

power(→ datafusion.expr.Expr)

Returns base raised to the power of exponent.

quantile_cont(→ datafusion.expr.Expr)

Computes the exact percentile of input values using continuous interpolation.

radians(→ datafusion.expr.Expr)

Converts the argument from degrees to radians.

random(→ datafusion.expr.Expr)

Returns a random value in the range 0.0 <= x < 1.0.

range(→ datafusion.expr.Expr)

Create a list of values in the range between start and stop.

rank(→ datafusion.expr.Expr)

Create a rank window function.

regexp_count(→ datafusion.expr.Expr)

Returns the number of matches in a string.

regexp_instr(→ datafusion.expr.Expr)

Returns the position of a regular expression match in a string.

regexp_like(→ datafusion.expr.Expr)

Find if any regular expression (regex) matches exist.

regexp_match(→ datafusion.expr.Expr)

Perform regular expression (regex) matching.

regexp_replace(→ datafusion.expr.Expr)

Replaces substring(s) matching a PCRE-like regular expression.

regr_avgx(→ datafusion.expr.Expr)

Computes the average of the independent variable x.

regr_avgy(→ datafusion.expr.Expr)

Computes the average of the dependent variable y.

regr_count(→ datafusion.expr.Expr)

Counts the number of rows in which both expressions are not null.

regr_intercept(→ datafusion.expr.Expr)

Computes the intercept from the linear regression.

regr_r2(→ datafusion.expr.Expr)

Computes the R-squared value from linear regression.

regr_slope(→ datafusion.expr.Expr)

Computes the slope from linear regression.

regr_sxx(→ datafusion.expr.Expr)

Computes the sum of squares of the independent variable x.

regr_sxy(→ datafusion.expr.Expr)

Computes the sum of products of pairs of numbers.

regr_syy(→ datafusion.expr.Expr)

Computes the sum of squares of the dependent variable y.

repeat(→ datafusion.expr.Expr)

Repeats the string n times.

replace(→ datafusion.expr.Expr)

Replaces all occurrences of from_val with to_val in the string.

reverse(→ datafusion.expr.Expr)

Reverse the string argument.

right(→ datafusion.expr.Expr)

Returns the last n characters in the string.

round(→ datafusion.expr.Expr)

Round the argument to the nearest integer.

row(→ datafusion.expr.Expr)

Returns a struct with the given arguments.

row_number(→ datafusion.expr.Expr)

Create a row number window function.

rpad(→ datafusion.expr.Expr)

Add right padding to a string.

rtrim(→ datafusion.expr.Expr)

Removes all characters, spaces by default, from the end of a string.

sha224(→ datafusion.expr.Expr)

Computes the SHA-224 hash of a binary string.

sha256(→ datafusion.expr.Expr)

Computes the SHA-256 hash of a binary string.

sha384(→ datafusion.expr.Expr)

Computes the SHA-384 hash of a binary string.

sha512(→ datafusion.expr.Expr)

Computes the SHA-512 hash of a binary string.

signum(→ datafusion.expr.Expr)

Returns the sign of the argument (-1, 0, +1).

sin(→ datafusion.expr.Expr)

Returns the sine of the argument.

sinh(→ datafusion.expr.Expr)

Returns the hyperbolic sine of the argument.

split_part(→ datafusion.expr.Expr)

Split a string and return one part.

sqrt(→ datafusion.expr.Expr)

Returns the square root of the argument.

starts_with(→ datafusion.expr.Expr)

Returns true if string starts with prefix.

stddev(→ datafusion.expr.Expr)

Computes the standard deviation of the argument.

stddev_pop(→ datafusion.expr.Expr)

Computes the population standard deviation of the argument.

stddev_samp(→ datafusion.expr.Expr)

Computes the sample standard deviation of the argument.

string_agg(→ datafusion.expr.Expr)

Concatenates the input strings.

string_to_array(→ datafusion.expr.Expr)

Splits a string based on a delimiter and returns an array of parts.

string_to_list(→ datafusion.expr.Expr)

Splits a string based on a delimiter and returns an array of parts.

strpos(→ datafusion.expr.Expr)

Finds the position from where the substring matches the string.

struct(→ datafusion.expr.Expr)

Returns a struct with the given arguments.

substr(→ datafusion.expr.Expr)

Substring from the position to the end.

substr_index(→ datafusion.expr.Expr)

Returns an indexed substring.

substring(→ datafusion.expr.Expr)

Substring from the position with length characters.

sum(→ datafusion.expr.Expr)

Computes the sum of a set of numbers.

tan(→ datafusion.expr.Expr)

Returns the tangent of the argument.

tanh(→ datafusion.expr.Expr)

Returns the hyperbolic tangent of the argument.

to_char(→ datafusion.expr.Expr)

Returns a string representation of a date, time, timestamp or duration.

to_date(→ datafusion.expr.Expr)

Converts a value to a date (YYYY-MM-DD).

to_hex(→ datafusion.expr.Expr)

Converts an integer to a hexadecimal string.

to_local_time(→ datafusion.expr.Expr)

Converts a timestamp with a timezone to a timestamp without a timezone.

to_time(→ datafusion.expr.Expr)

Converts a value to a time. Supports strings and timestamps as input.

to_timestamp(→ datafusion.expr.Expr)

Converts a string and optional formats to a Timestamp in nanoseconds.

to_timestamp_micros(→ datafusion.expr.Expr)

Converts a string and optional formats to a Timestamp in microseconds.

to_timestamp_millis(→ datafusion.expr.Expr)

Converts a string and optional formats to a Timestamp in milliseconds.

to_timestamp_nanos(→ datafusion.expr.Expr)

Converts a string and optional formats to a Timestamp in nanoseconds.

to_timestamp_seconds(→ datafusion.expr.Expr)

Converts a string and optional formats to a Timestamp in seconds.

to_unixtime(→ datafusion.expr.Expr)

Converts a string and optional formats to a Unixtime.

translate(→ datafusion.expr.Expr)

Replaces the characters in from_val with the counterpart in to_val.

trim(→ datafusion.expr.Expr)

Removes all characters, spaces by default, from both sides of a string.

trunc(→ datafusion.expr.Expr)

Truncate the number toward zero with optional precision.

union_extract(→ datafusion.expr.Expr)

Extracts a value from a union type by field name.

union_tag(→ datafusion.expr.Expr)

Returns the tag (active field name) of a union type.

upper(→ datafusion.expr.Expr)

Converts a string to uppercase.

uuid(→ datafusion.expr.Expr)

Returns a UUID v4 as a string value.

var(→ datafusion.expr.Expr)

Computes the sample variance of the argument.

var_pop(→ datafusion.expr.Expr)

Computes the population variance of the argument.

var_population(→ datafusion.expr.Expr)

Computes the population variance of the argument.

var_samp(→ datafusion.expr.Expr)

Computes the sample variance of the argument.

var_sample(→ datafusion.expr.Expr)

Computes the sample variance of the argument.

version(→ datafusion.expr.Expr)

Returns the DataFusion version string.

when(→ datafusion.expr.CaseBuilder)

Create a case expression that has no base expression.

Module Contents

datafusion.functions.abs(arg: datafusion.expr.Expr) → datafusion.expr.Expr

Return the absolute value of a given number.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [-1, 0, 1]})
>>> result = df.select(dfn.functions.abs(dfn.col("a")).alias("abs"))
>>> result.collect_column("abs")[0].as_py()
1
datafusion.functions.acos(arg: datafusion.expr.Expr) → datafusion.expr.Expr

Returns the arc cosine or inverse cosine of a number.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0]})
>>> result = df.select(dfn.functions.acos(dfn.col("a")).alias("acos"))
>>> result.collect_column("acos")[0].as_py()
0.0
datafusion.functions.acosh(arg: datafusion.expr.Expr) → datafusion.expr.Expr

Returns inverse hyperbolic cosine.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0]})
>>> result = df.select(dfn.functions.acosh(dfn.col("a")).alias("acosh"))
>>> result.collect_column("acosh")[0].as_py()
0.0
datafusion.functions.alias(expr: datafusion.expr.Expr, name: str, metadata: dict[str, str] | None = None) → datafusion.expr.Expr

Creates an alias expression with an optional metadata dictionary.

Parameters:
  • expr – The expression to alias

  • name – The alias name

  • metadata – Optional metadata to attach to the column

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2]})
>>> result = df.select(
...     dfn.functions.alias(
...         dfn.col("a"), "b"
...     )
... )
>>> result.collect_column("b")[0].as_py()
1
>>> result = df.select(
...     dfn.functions.alias(
...         dfn.col("a"), "b", metadata={"info": "test"}
...     )
... )
>>> result.schema()
b: int64
  -- field metadata --
  info: 'test'
datafusion.functions.approx_distinct(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr

Returns the approximate number of distinct values.

This aggregate function is similar to count() with distinct set, but it will approximate the number of distinct entries. It may return significantly faster than count() for some DataFrames.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – Values to check for distinct entries

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.approx_distinct(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py() == 3
True
>>> result = df.aggregate(
...     [], [dfn.functions.approx_distinct(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py() == 2
True
datafusion.functions.approx_median(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr

Returns the approximate median value.

This aggregate function is similar to median(), but it will only approximate the median. It may return significantly faster for some DataFrames.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – Values to find the median for

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.approx_median(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> result = df.aggregate(
...     [], [dfn.functions.approx_median(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.5
datafusion.functions.approx_percentile_cont(sort_expression: datafusion.expr.Expr | datafusion.expr.SortExpr, percentile: float, num_centroids: int | None = None, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr

Returns the value that is approximately at a given percentile of expr.

This aggregate function assumes the input values form a continuous distribution. Suppose you have a DataFrame which consists of 100 different test scores. If you called this function with a percentile of 0.9, it would return the value of the test score that is above 90% of the other test scores. The returned value may be between two of the values.

This function uses the [t-digest](https://arxiv.org/abs/1902.04023) algorithm to compute the percentile. You can limit the number of bins used in this algorithm by setting the num_centroids parameter.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • sort_expression – Values for which to find the approximate percentile

  • percentile – This must be between 0.0 and 1.0, inclusive

  • num_centroids – Max bin size for the t-digest algorithm

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.approx_percentile_cont(
...         dfn.col("a"), 0.5
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3.0
>>> result = df.aggregate(
...     [], [dfn.functions.approx_percentile_cont(
...         dfn.col("a"), 0.5,
...         num_centroids=10,
...         filter=dfn.col("a") > dfn.lit(1.0),
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3.5
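To see why the returned value may fall between two input values, the following pure-Python sketch computes an exact continuous percentile by linear interpolation between neighbouring sorted values. It is only an illustration of the "continuous distribution" idea: it is not the t-digest algorithm this function uses, and the helper name `interpolated_percentile` is hypothetical.

```python
import math


def interpolated_percentile(values, percentile):
    """Exact continuous percentile via linear interpolation (illustrative only)."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be between 0.0 and 1.0, inclusive")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # Fractional position of the requested percentile within the sorted data.
    position = percentile * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Interpolate between the two neighbouring values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


scores = [1.0, 2.0, 3.0, 4.0, 5.0]
all_scores = interpolated_percentile(scores, 0.5)
filtered_scores = interpolated_percentile([s for s in scores if s > 1.0], 0.5)
print(all_scores, filtered_scores)
```

With five values the 0.5 percentile lands exactly on the middle value. With the four values that remain after filtering, it lands halfway between 3.0 and 4.0. For larger inputs the t-digest approximation may differ slightly from this exact calculation.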
datafusion.functions.approx_percentile_cont_with_weight(sort_expression: datafusion.expr.Expr | datafusion.expr.SortExpr, weight: datafusion.expr.Expr, percentile: float, num_centroids: int | None = None, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr

Returns the value of the weighted approximate percentile.

This aggregate function is similar to approx_percentile_cont() except that it uses the associated weights.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • sort_expression – Values for which to find the approximate percentile

  • weight – Relative weight for each of the values in sort_expression

  • percentile – This must be between 0.0 and 1.0, inclusive

  • num_centroids – Maximum number of bins (centroids) for the t-digest algorithm

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0], "w": [1.0, 1.0, 1.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.approx_percentile_cont_with_weight(
...         dfn.col("a"), dfn.col("w"), 0.5
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> result = df.aggregate(
...     [], [dfn.functions.approx_percentile_cont_with_weight(
...         dfn.col("a"), dfn.col("w"), 0.5,
...         num_centroids=10,
...         filter=dfn.col("a") > dfn.lit(1.0),
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.5
datafusion.functions.array(*args: datafusion.expr.Expr) datafusion.expr.Expr

Returns an array using the specified input expressions.

See also

This is an alias for make_array().

datafusion.functions.array_agg(expression: datafusion.expr.Expr, distinct: bool = False, filter: datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr

Aggregate values into an array.

Currently, distinct and order_by cannot be used together. As a workaround, consider applying array_sort() after aggregation. [Issue Tracker](https://github.com/apache/datafusion/issues/12371)

If using the builder functions described in the aggregation documentation, this function ignores the option null_treatment.

Parameters:
  • expression – Values to combine into an array

  • distinct – If True, a single entry for each distinct value will be in the result

  • filter – If provided, only compute against rows for which the filter is True

  • order_by – Order the resultant array values. Accepts column names or expressions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.array_agg(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
[1, 2, 3]
>>> df = ctx.from_pydict({"a": [3, 1, 2, 1]})
>>> result = df.aggregate(
...     [], [dfn.functions.array_agg(
...         dfn.col("a"), distinct=True,
...     ).alias("v")])
>>> sorted(result.collect_column("v")[0].as_py())
[1, 2, 3]
>>> result = df.aggregate(
...     [], [dfn.functions.array_agg(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1),
...         order_by="a",
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
[2, 3]
datafusion.functions.array_any_value(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the first non-null element in the array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[None, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_any_value(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
2
datafusion.functions.array_append(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Appends an element to the end of an array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_append(dfn.col("a"), dfn.lit(4)).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2, 3, 4]
datafusion.functions.array_cat(*args: datafusion.expr.Expr) datafusion.expr.Expr

Concatenates the input arrays.

See also

This is an alias for array_concat().

datafusion.functions.array_concat(*args: datafusion.expr.Expr) datafusion.expr.Expr

Concatenates the input arrays.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2]], "b": [[3, 4]]})
>>> result = df.select(
...     dfn.functions.array_concat(dfn.col("a"), dfn.col("b")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2, 3, 4]
datafusion.functions.array_contains(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if the element appears in the array, otherwise false.

See also

This is an alias for array_has().

datafusion.functions.array_dims(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns an array of the array’s dimensions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(dfn.functions.array_dims(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[3]
datafusion.functions.array_distance(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr

Returns the Euclidean distance between two numeric arrays.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1.0, 2.0]], "b": [[1.0, 4.0]]})
>>> result = df.select(
...     dfn.functions.array_distance(
...         dfn.col("a"), dfn.col("b"),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
2.0
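
For reference, the Euclidean distance is the square root of the summed squared differences between corresponding elements. The plain-Python sketch below reproduces the number from the example above; the helper name euclidean_distance is hypothetical, and requiring equal lengths is an assumption of this sketch.

```python
import math


def euclidean_distance(array1, array2):
    """Square root of the sum of squared element-wise differences."""
    if len(array1) != len(array2):
        raise ValueError("arrays must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(array1, array2)))


# (1.0 - 1.0)^2 + (2.0 - 4.0)^2 = 4.0, and sqrt(4.0) = 2.0
distance = euclidean_distance([1.0, 2.0], [1.0, 4.0])
```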
datafusion.functions.array_distinct(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns distinct values from the array after removing duplicates.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_distinct(
...         dfn.col("a")
...     ).alias("result")
... )
>>> sorted(
...     result.collect_column("result")[0].as_py()
... )
[1, 2, 3]
datafusion.functions.array_element(array: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr

Extracts the element with the index n from the array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[10, 20, 30]]})
>>> result = df.select(
...     dfn.functions.array_element(dfn.col("a"), dfn.lit(2)).alias("result"))
>>> result.collect_column("result")[0].as_py()
20
datafusion.functions.array_empty(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns a boolean indicating whether the array is empty.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2]]})
>>> result = df.select(dfn.functions.array_empty(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
False
datafusion.functions.array_except(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr

Returns the elements that appear in array1 but not in array2.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]], "b": [[2, 3, 4]]})
>>> result = df.select(
...     dfn.functions.array_except(dfn.col("a"), dfn.col("b")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1]
datafusion.functions.array_extract(array: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr

Extracts the element with the index n from the array.

See also

This is an alias for array_element().

datafusion.functions.array_has(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if the element, passed as second_array, appears in first_array; otherwise false.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_has(dfn.col("a"), dfn.lit(2)).alias("result"))
>>> result.collect_column("result")[0].as_py()
True
datafusion.functions.array_has_all(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) datafusion.expr.Expr

Determines if second_array is completely contained in first_array.

Returns true if each element of the second array appears in the first array. Otherwise, it returns false.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]], "b": [[1, 2]]})
>>> result = df.select(
...     dfn.functions.array_has_all(dfn.col("a"), dfn.col("b")).alias("result"))
>>> result.collect_column("result")[0].as_py()
True
datafusion.functions.array_has_any(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) datafusion.expr.Expr

Determines if there is any overlap between first_array and second_array.

Returns true if at least one element of the second array appears in the first array. Otherwise, it returns false.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]], "b": [[2, 5]]})
>>> result = df.select(
...     dfn.functions.array_has_any(dfn.col("a"), dfn.col("b")).alias("result"))
>>> result.collect_column("result")[0].as_py()
True
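
The difference between array_has_all() and array_has_any() is the difference between "every element" and "at least one element". A minimal plain-Python model of the two checks, with hypothetical helper names and no attempt to model null handling, might look like this:

```python
def has_all(first_array, second_array):
    """True if every element of second_array appears in first_array."""
    return all(element in first_array for element in second_array)


def has_any(first_array, second_array):
    """True if at least one element of second_array appears in first_array."""
    return any(element in first_array for element in second_array)


complete = has_all([1, 2, 3], [1, 2])   # both 1 and 2 are present
partial = has_all([1, 2, 3], [2, 5])    # 5 is missing
overlap = has_any([1, 2, 3], [2, 5])    # 2 is shared
disjoint = has_any([1, 2, 3], [4, 5])   # nothing is shared
```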
datafusion.functions.array_indexof(array: datafusion.expr.Expr, element: datafusion.expr.Expr, index: int | None = 1) datafusion.expr.Expr

Return the position of the first occurrence of element in array.

See also

This is an alias for array_position().

datafusion.functions.array_intersect(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr

Returns the intersection of array1 and array2.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]], "b": [[2, 3, 4]]})
>>> result = df.select(
...     dfn.functions.array_intersect(
...         dfn.col("a"), dfn.col("b")
...     ).alias("result")
... )
>>> sorted(
...     result.collect_column("result")[0].as_py()
... )
[2, 3]
datafusion.functions.array_join(expr: datafusion.expr.Expr, delimiter: datafusion.expr.Expr) datafusion.expr.Expr

Converts each element to its text representation.

See also

This is an alias for array_to_string().

datafusion.functions.array_length(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the length of the array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(dfn.functions.array_length(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
3
datafusion.functions.array_max(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the maximum value in the array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_max(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
3
datafusion.functions.array_min(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the minimum value in the array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_min(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
1
datafusion.functions.array_ndims(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the number of dimensions of the array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(dfn.functions.array_ndims(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
1
datafusion.functions.array_pop_back(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the array without the last element.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_pop_back(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2]
datafusion.functions.array_pop_front(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the array without the first element.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_pop_front(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[2, 3]
datafusion.functions.array_position(array: datafusion.expr.Expr, element: datafusion.expr.Expr, index: int | None = 1) datafusion.expr.Expr

Return the position of the first occurrence of element in array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[10, 20, 30]]})
>>> result = df.select(
...     dfn.functions.array_position(
...         dfn.col("a"), dfn.lit(20)
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
2

Use index to start searching from a given position:

>>> df = ctx.from_pydict({"a": [[10, 20, 10, 20]]})
>>> result = df.select(
...     dfn.functions.array_position(
...         dfn.col("a"), dfn.lit(20), index=3,
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
4
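
Both the returned position and the index argument are 1-based in the examples above. The plain-Python sketch below models that behaviour; position_of is a hypothetical helper, and returning None when nothing matches is an assumption of the sketch rather than something the text above states.

```python
def position_of(array, element, index=1):
    """1-based position of the first match at or after the 1-based `index`."""
    for offset in range(index - 1, len(array)):
        if array[offset] == element:
            return offset + 1  # convert Python's 0-based offset to 1-based
    return None  # assumption: no match is reported as a missing value


first_match = position_of([10, 20, 30], 20)
later_match = position_of([10, 20, 10, 20], 20, index=3)
no_match = position_of([10, 20, 30], 99)
```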
datafusion.functions.array_positions(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Searches for an element in the array and returns all occurrences.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 1]]})
>>> result = df.select(
...     dfn.functions.array_positions(dfn.col("a"), dfn.lit(1)).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 3]
datafusion.functions.array_prepend(element: datafusion.expr.Expr, array: datafusion.expr.Expr) datafusion.expr.Expr

Prepends an element to the beginning of an array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2]]})
>>> result = df.select(
...     dfn.functions.array_prepend(dfn.lit(0), dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[0, 1, 2]
datafusion.functions.array_push_back(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Appends an element to the end of an array.

See also

This is an alias for array_append().

datafusion.functions.array_push_front(element: datafusion.expr.Expr, array: datafusion.expr.Expr) datafusion.expr.Expr

Prepends an element to the beginning of an array.

See also

This is an alias for array_prepend().

datafusion.functions.array_remove(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Removes the first element from the array equal to the given value.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 1]]})
>>> result = df.select(
...     dfn.functions.array_remove(dfn.col("a"), dfn.lit(1)).alias("result"))
>>> result.collect_column("result")[0].as_py()
[2, 1]
datafusion.functions.array_remove_all(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Removes all elements from the array equal to the given value.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 1]]})
>>> result = df.select(
...     dfn.functions.array_remove_all(
...         dfn.col("a"), dfn.lit(1)
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
[2]
datafusion.functions.array_remove_n(array: datafusion.expr.Expr, element: datafusion.expr.Expr, max: datafusion.expr.Expr) datafusion.expr.Expr

Removes the first max elements from the array equal to the given value.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 1, 1]]})
>>> result = df.select(
...     dfn.functions.array_remove_n(dfn.col("a"), dfn.lit(1),
...     dfn.lit(2)).alias("result"))
>>> result.collect_column("result")[0].as_py()
[2, 1]
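
array_remove(), array_remove_n(), and array_remove_all() differ only in how many matches they drop. The sketch below models all three in plain Python with one hypothetical helper, remove_n, where a limit of None stands in for "remove every match"; it is an illustration of the semantics, not DataFusion's implementation.

```python
def remove_n(array, element, max_removals=None):
    """Drop up to max_removals matches, scanning left to right."""
    result = []
    removed = 0
    for value in array:
        if value == element and (max_removals is None or removed < max_removals):
            removed += 1  # skip this match and count it
            continue
        result.append(value)
    return result


first_only = remove_n([1, 2, 1], 1, 1)     # like array_remove
first_two = remove_n([1, 2, 1, 1], 1, 2)   # like array_remove_n with max=2
every = remove_n([1, 2, 1], 1)             # like array_remove_all
```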
datafusion.functions.array_repeat(element: datafusion.expr.Expr, count: datafusion.expr.Expr) datafusion.expr.Expr

Returns an array containing element count times.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.array_repeat(dfn.lit(3), dfn.lit(3)).alias("result"))
>>> result.collect_column("result")[0].as_py()
[3, 3, 3]
datafusion.functions.array_replace(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr

Replaces the first occurrence of from_val with to_val.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 1]]})
>>> result = df.select(
...     dfn.functions.array_replace(dfn.col("a"), dfn.lit(1),
...     dfn.lit(9)).alias("result"))
>>> result.collect_column("result")[0].as_py()
[9, 2, 1]
datafusion.functions.array_replace_all(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr

Replaces all occurrences of from_val with to_val.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 1]]})
>>> result = df.select(
...     dfn.functions.array_replace_all(dfn.col("a"), dfn.lit(1),
...     dfn.lit(9)).alias("result"))
>>> result.collect_column("result")[0].as_py()
[9, 2, 9]
datafusion.functions.array_replace_n(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr, max: datafusion.expr.Expr) datafusion.expr.Expr

Replaces up to max occurrences of from_val with to_val.

Replaces the first max occurrences of the specified element with another specified element.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 1, 1]]})
>>> result = df.select(
...     dfn.functions.array_replace_n(dfn.col("a"), dfn.lit(1), dfn.lit(9),
...     dfn.lit(2)).alias("result"))
>>> result.collect_column("result")[0].as_py()
[9, 2, 9, 1]
datafusion.functions.array_resize(array: datafusion.expr.Expr, size: datafusion.expr.Expr, value: datafusion.expr.Expr) datafusion.expr.Expr

Returns the array resized to the specified size.

If size is greater than the array length, the additional entries will be filled with the given value.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2]]})
>>> result = df.select(
...     dfn.functions.array_resize(dfn.col("a"), dfn.lit(4),
...     dfn.lit(0)).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2, 0, 0]
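
The padding rule can be modelled in a few lines of plain Python. The helper name resize is hypothetical, and truncating when size is smaller than the current length is an assumption of this sketch; the text above only describes the padding case.

```python
def resize(array, size, value=None):
    """Pad with `value` up to `size`, or cut the array down to `size`."""
    if size <= len(array):
        return array[:size]  # assumption: a smaller size truncates
    return array + [value] * (size - len(array))


padded = resize([1, 2], 4, 0)
truncated = resize([1, 2, 3], 2, 0)
```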
datafusion.functions.array_reverse(array: datafusion.expr.Expr) datafusion.expr.Expr

Reverses the order of elements in the array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_reverse(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[3, 2, 1]
datafusion.functions.array_slice(array: datafusion.expr.Expr, begin: datafusion.expr.Expr, end: datafusion.expr.Expr, stride: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Returns a slice of the array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3, 4]]})
>>> result = df.select(
...     dfn.functions.array_slice(
...         dfn.col("a"), dfn.lit(2), dfn.lit(3)
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
[2, 3]

Use stride to skip elements:

>>> result = df.select(
...     dfn.functions.array_slice(
...         dfn.col("a"), dfn.lit(1), dfn.lit(4),
...         stride=dfn.lit(2),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 3]
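
The examples above suggest that begin and end are 1-based and that end is inclusive. Under that reading, and only for positive begin, end, and stride, the slice maps onto Python slicing as sketched below; slice_1_based is a hypothetical helper.

```python
def slice_1_based(array, begin, end, stride=1):
    """Slice with positive 1-based bounds, where `end` is inclusive."""
    if begin < 1 or end < 1 or stride < 1:
        raise ValueError("this sketch only handles positive begin, end, and stride")
    # 1-based inclusive [begin, end] is 0-based half-open [begin - 1, end).
    return array[begin - 1:end:stride]


middle = slice_1_based([1, 2, 3, 4], 2, 3)
every_other = slice_1_based([1, 2, 3, 4], 1, 4, stride=2)
```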
datafusion.functions.array_sort(array: datafusion.expr.Expr, descending: bool = False, null_first: bool = False) datafusion.expr.Expr

Sort an array.

Parameters:
  • array – The input array to sort.

  • descending – If True, sorts in descending order.

  • null_first – If True, nulls will be returned at the beginning of the array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[3, 1, 2]]})
>>> result = df.select(
...     dfn.functions.array_sort(
...         dfn.col("a")
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2, 3]
>>> df = ctx.from_pydict({"a": [[3, None, 1]]})
>>> result = df.select(
...     dfn.functions.array_sort(
...         dfn.col("a"), descending=True, null_first=True,
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
[None, 3, 1]
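
The interaction of descending and null_first can be pictured with a plain-Python model. The helper sort_with_nulls is hypothetical, and it assumes that null placement is decided by null_first alone, independently of the sort direction, which is consistent with the example above.

```python
def sort_with_nulls(array, descending=False, null_first=False):
    """Sort non-null values, then place the nulls at one end."""
    nulls = [value for value in array if value is None]
    values = sorted(
        (value for value in array if value is not None), reverse=descending
    )
    return nulls + values if null_first else values + nulls


nulls_leading = sort_with_nulls([3, None, 1], descending=True, null_first=True)
nulls_trailing = sort_with_nulls([3, None, 1])
```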
datafusion.functions.array_to_string(expr: datafusion.expr.Expr, delimiter: datafusion.expr.Expr) datafusion.expr.Expr

Converts each element to its text representation.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_to_string(dfn.col("a"), dfn.lit(",")).alias("s"))
>>> result.collect_column("s")[0].as_py()
'1,2,3'
datafusion.functions.array_union(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr

Returns an array of the elements in the union of array1 and array2.

Duplicate elements will not be returned.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]], "b": [[2, 3, 4]]})
>>> result = df.select(
...     dfn.functions.array_union(
...         dfn.col("a"), dfn.col("b")
...     ).alias("result")
... )
>>> sorted(
...     result.collect_column("result")[0].as_py()
... )
[1, 2, 3, 4]
datafusion.functions.arrays_overlap(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if any element appears in both arrays.

See also

This is an alias for array_has_any().

datafusion.functions.arrays_zip(*arrays: datafusion.expr.Expr) datafusion.expr.Expr

Combines multiple arrays into a single array of structs.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2]], "b": [[3, 4]]})
>>> result = df.select(
...     dfn.functions.arrays_zip(dfn.col("a"), dfn.col("b")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[{'c0': 1, 'c1': 3}, {'c0': 2, 'c1': 4}]
datafusion.functions.arrow_cast(expr: datafusion.expr.Expr, data_type: datafusion.expr.Expr | str | pyarrow.DataType) datafusion.expr.Expr

Casts an expression to a specified data type.

The data_type can be a string, a pyarrow.DataType, or an Expr. For simple types, Expr.cast() is more concise (e.g., col("a").cast(pa.float64())). Use arrow_cast when you want to specify the target type as a string using DataFusion’s type syntax, which can be more readable for complex types like "Timestamp(Nanosecond, None)".

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.arrow_cast(dfn.col("a"), "Float64").alias("c")
... )
>>> result.collect_column("c")[0].as_py()
1.0
>>> result = df.select(
...     dfn.functions.arrow_cast(
...         dfn.col("a"), data_type=pa.float64()
...     ).alias("c")
... )
>>> result.collect_column("c")[0].as_py()
1.0
datafusion.functions.arrow_metadata(expr: datafusion.expr.Expr, key: datafusion.expr.Expr | str | None = None) datafusion.expr.Expr

Returns the metadata of the input expression.

If called with one argument, returns a Map of all metadata key-value pairs. If called with two arguments, returns the value for the specified metadata key.

Examples

>>> field = pa.field("val", pa.int64(), metadata={"k": "v"})
>>> schema = pa.schema([field])
>>> batch = pa.RecordBatch.from_arrays([pa.array([1])], schema=schema)
>>> ctx = dfn.SessionContext()
>>> df = ctx.create_dataframe([[batch]])
>>> result = df.select(
...     dfn.functions.arrow_metadata(dfn.col("val")).alias("meta")
... )
>>> ("k", "v") in result.collect_column("meta")[0].as_py()
True
>>> result = df.select(
...     dfn.functions.arrow_metadata(
...         dfn.col("val"), key="k"
...     ).alias("meta_val")
... )
>>> result.collect_column("meta_val")[0].as_py()
'v'
datafusion.functions.arrow_typeof(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the Arrow type of the expression.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(dfn.functions.arrow_typeof(dfn.col("a")).alias("t"))
>>> result.collect_column("t")[0].as_py()
'Int64'
datafusion.functions.ascii(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the numeric code of the first character of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["a","b","c"]})
>>> ascii_df = df.select(dfn.functions.ascii(dfn.col("a")).alias("ascii"))
>>> ascii_df.collect_column("ascii")[0].as_py()
97
datafusion.functions.asin(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the arc sine or inverse sine of a number.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.asin(dfn.col("a")).alias("asin"))
>>> result.collect_column("asin")[0].as_py()
0.0
datafusion.functions.asinh(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns inverse hyperbolic sine.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.asinh(dfn.col("a")).alias("asinh"))
>>> result.collect_column("asinh")[0].as_py()
0.0
datafusion.functions.atan(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns inverse tangent of a number.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.atan(dfn.col("a")).alias("atan"))
>>> result.collect_column("atan")[0].as_py()
0.0
datafusion.functions.atan2(y: datafusion.expr.Expr, x: datafusion.expr.Expr) datafusion.expr.Expr

Returns the inverse tangent of y / x.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [0.0], "x": [1.0]})
>>> result = df.select(
...     dfn.functions.atan2(dfn.col("y"), dfn.col("x")).alias("atan2"))
>>> result.collect_column("atan2")[0].as_py()
0.0
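
A two-argument arctangent usually differs from atan(y / x) in that it uses the signs of both inputs to pick the quadrant. The sketch below shows this with Python's standard math.atan2; that DataFusion's atan2 follows the same convention is an assumption here.

```python
import math

y, x = 1.0, -1.0

# Uses the signs of both y and x, so the angle lands in the second quadrant.
quadrant_aware = math.atan2(y, x)

# Dividing first discards the individual signs: y / x is simply -1.0.
ratio_only = math.atan(y / x)

# The inputs from the example above: y = 0.0, x = 1.0.
on_axis = math.atan2(0.0, 1.0)
```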
datafusion.functions.atanh(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns inverse hyperbolic tangent.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.atanh(dfn.col("a")).alias("atanh"))
>>> result.collect_column("atanh")[0].as_py()
0.0
datafusion.functions.avg(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Returns the average value.

This aggregate function expects a numeric expression and will return a float.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – Values to average

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.avg(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> result = df.aggregate(
...     [], [dfn.functions.avg(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.5
datafusion.functions.bit_and(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the bitwise AND of the argument.

This aggregate function will bitwise compare every value in the input partition.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – Argument to perform bitwise calculation on

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [7, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.bit_and(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3
>>> df = ctx.from_pydict({"a": [7, 5, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.bit_and(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(3)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
5
datafusion.functions.bit_length(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the number of bits in the string argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["a","b","c"]})
>>> bit_df = df.select(dfn.functions.bit_length(dfn.col("a")).alias("bit_len"))
>>> bit_df.collect_column("bit_len")[0].as_py()
8
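
Bit length counts encoded bits, not characters, so the two diverge for non-ASCII text. The plain-Python sketch below assumes UTF-8 encoding, which is how Arrow stores strings; bit_length_model is a hypothetical helper, not the library function.

```python
def bit_length_model(text):
    """Number of bits in the UTF-8 encoding of text."""
    return len(text.encode("utf-8")) * 8


ascii_bits = bit_length_model("a")          # one byte
accented_bits = bit_length_model("\u00e9")  # "é" takes two bytes in UTF-8
accented_chars = len("\u00e9")              # but it is a single character
```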
datafusion.functions.bit_or(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the bitwise OR of the argument.

This aggregate function will bitwise compare every value in the input partition.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – Argument to perform bitwise calculation on

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2]})
>>> result = df.aggregate(
...     [], [dfn.functions.bit_or(
...         dfn.col("a")
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
3
>>> df = ctx.from_pydict({"a": [1, 2, 4]})
>>> result = df.aggregate(
...     [], [dfn.functions.bit_or(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1)
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
6
datafusion.functions.bit_xor(expression: datafusion.expr.Expr, distinct: bool = False, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the bitwise XOR of the argument.

This aggregate function will bitwise compare every value in the input partition.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by and null_treatment.

Parameters:
  • expression – Argument to perform bitwise calculation on

  • distinct – If True, evaluate each unique value of expression only once

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [5, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.bit_xor(
...         dfn.col("a")
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
6
>>> df = ctx.from_pydict({"a": [5, 5, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.bit_xor(
...         dfn.col("a"), distinct=True,
...         filter=dfn.col("a") > dfn.lit(3),
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
5
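
The distinct option matters for XOR because a value XORed with itself cancels to zero. The plain-Python model below illustrates this; bit_xor_model and its keep parameter are hypothetical stand-ins for the aggregate and its filter, and returning None for an empty input is an assumption of the sketch.

```python
from functools import reduce
from operator import xor


def bit_xor_model(values, distinct=False, keep=lambda value: True):
    """XOR all kept values, optionally counting each unique value once."""
    kept = [value for value in values if keep(value)]
    if distinct:
        kept = list(dict.fromkeys(kept))  # drop duplicates, keep first-seen order
    return reduce(xor, kept) if kept else None


values = [5, 5, 3]
plain = bit_xor_model(values)                  # 5 ^ 5 ^ 3: the two 5s cancel
unique = bit_xor_model(values, distinct=True)  # 5 ^ 3
filtered = bit_xor_model(values, distinct=True, keep=lambda value: value > 3)
```

Without distinct, the duplicated 5 cancels itself and only 3 remains; with distinct, each unique value contributes exactly once.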
datafusion.functions.bool_and(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the boolean AND of the argument.

This aggregate function will compare every value in the input partition. These are expected to be boolean values.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – Argument to perform calculation on

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [True, True, False]})
>>> result = df.aggregate(
...     [], [dfn.functions.bool_and(
...         dfn.col("a")
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
False
>>> df = ctx.from_pydict(
...     {"a": [True, True, False], "b": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.bool_and(
...         dfn.col("a"),
...         filter=dfn.col("b") < dfn.lit(3)
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
True
datafusion.functions.bool_or(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the boolean OR of the argument.

This aggregate function will compare every value in the input partition. These are expected to be boolean values.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – Argument to perform calculation on

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [False, False, True]})
>>> result = df.aggregate(
...     [], [dfn.functions.bool_or(
...         dfn.col("a")
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
True
>>> df = ctx.from_pydict(
...     {"a": [False, False, True], "b": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.bool_or(
...         dfn.col("a"),
...         filter=dfn.col("b") < dfn.lit(3)
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
False
datafusion.functions.btrim(arg: datafusion.expr.Expr) datafusion.expr.Expr

Removes all characters, spaces by default, from both sides of a string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [" a  "]})
>>> trim_df = df.select(dfn.functions.btrim(dfn.col("a")).alias("trimmed"))
>>> trim_df.collect_column("trimmed")[0].as_py()
'a'
datafusion.functions.cardinality(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the total number of elements in the array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(dfn.functions.cardinality(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
3
datafusion.functions.case(expr: datafusion.expr.Expr) datafusion.expr.CaseBuilder

Create a case expression.

Create a CaseBuilder to match cases for the expression expr. See CaseBuilder for detailed usage.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.select(
...     dfn.functions.case(dfn.col("a")).when(dfn.lit(1),
...     dfn.lit("one")).otherwise(dfn.lit("other")).alias("c"))
>>> result.collect_column("c")[0].as_py()
'one'
datafusion.functions.cbrt(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the cube root of a number.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [27]})
>>> cbrt_df = df.select(dfn.functions.cbrt(dfn.col("a")).alias("cbrt"))
>>> cbrt_df.collect_column("cbrt")[0].as_py()
3.0
datafusion.functions.ceil(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the nearest integer greater than or equal to the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.9]})
>>> ceil_df = df.select(dfn.functions.ceil(dfn.col("a")).alias("ceil"))
>>> ceil_df.collect_column("ceil")[0].as_py()
2.0
datafusion.functions.char_length(string: datafusion.expr.Expr) datafusion.expr.Expr

Returns the number of characters in the string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.char_length(dfn.col("a")).alias("len"))
>>> result.collect_column("len")[0].as_py()
5
datafusion.functions.character_length(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the number of characters in the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["abc","b","c"]})
>>> char_len_df = df.select(
...     dfn.functions.character_length(dfn.col("a")).alias("char_len"))
>>> char_len_df.collect_column("char_len")[0].as_py()
3
datafusion.functions.chr(arg: datafusion.expr.Expr) datafusion.expr.Expr

Converts the Unicode code point to a UTF8 character.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [65]})
>>> result = df.select(dfn.functions.chr(dfn.col("a")).alias("chr"))
>>> result.collect_column("chr")[0].as_py()
'A'
datafusion.functions.coalesce(*args: datafusion.expr.Expr) datafusion.expr.Expr

Returns the value of the first expr in args which is not NULL.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [None, 1], "b": [2, 3]})
>>> result = df.select(
...     dfn.functions.coalesce(dfn.col("a"), dfn.col("b")).alias("c"))
>>> result.collect_column("c")[0].as_py()
2
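
The row-wise behaviour can be pictured with a plain-Python sketch. This is an illustration only, not DataFusion code: the helper below is hypothetical and uses None to stand in for NULL.

```python
def coalesce(*args):
    """Return the first argument that is not None, or None if all are None."""
    return next((arg for arg in args if arg is not None), None)


a = [None, 1]
b = [2, 3]

# Apply the helper to each row, mirroring coalesce(col("a"), col("b")).
rows = [coalesce(x, y) for x, y in zip(a, b)]
print(rows)  # [2, 1]
```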
datafusion.functions.col(name: str) datafusion.expr.Expr

Creates a column reference expression.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> df.select(dfn.functions.col("a")).collect_column("a")[0].as_py()
1
datafusion.functions.concat(*args: datafusion.expr.Expr) datafusion.expr.Expr

Concatenates the text representations of all the arguments.

NULL arguments are ignored.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"], "b": [" world"]})
>>> result = df.select(
...     dfn.functions.concat(dfn.col("a"), dfn.col("b")).alias("c")
... )
>>> result.collect_column("c")[0].as_py()
'hello world'
datafusion.functions.concat_ws(separator: str, *args: datafusion.expr.Expr) datafusion.expr.Expr

Concatenates the list args with the separator.

NULL arguments are ignored. separator should not be NULL.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"], "b": ["world"]})
>>> result = df.select(
...     dfn.functions.concat_ws("-", dfn.col("a"), dfn.col("b")).alias("c"))
>>> result.collect_column("c")[0].as_py()
'hello-world'
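
To illustrate how NULL arguments are skipped, here is a plain-Python sketch of the same idea. The helper is hypothetical and is not part of DataFusion; None stands in for NULL.

```python
def concat_ws(separator, *args):
    """Join the text form of every non-None argument with separator."""
    return separator.join(str(arg) for arg in args if arg is not None)


# The None in the middle is ignored, so no doubled separator appears.
print(concat_ws("-", "hello", None, "world"))  # hello-world
```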
datafusion.functions.contains(string: datafusion.expr.Expr, search_str: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if search_str is found within string (case-sensitive).

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["the quick brown fox"]})
>>> result = df.select(
...     dfn.functions.contains(dfn.col("a"), dfn.lit("brown")).alias("c"))
>>> result.collect_column("c")[0].as_py()
True
datafusion.functions.corr(value_y: datafusion.expr.Expr, value_x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Returns the correlation coefficient between value_y and value_x.

This aggregate function expects both values to be numeric and will return a float.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • value_y – The dependent variable for correlation

  • value_x – The independent variable for correlation

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0], "b": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.corr(
...         dfn.col("a"), dfn.col("b")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.0
>>> result = df.aggregate(
...     [], [dfn.functions.corr(
...         dfn.col("a"), dfn.col("b"),
...         filter=dfn.col("a") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.0
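
As a cross-check of the values above, the following plain-Python sketch computes a correlation coefficient by hand. It assumes the Pearson definition, the helper name pearson_corr is hypothetical, and the code is independent of DataFusion.

```python
from math import sqrt


def pearson_corr(ys, xs):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_x = sum(xs) / n
    cov = sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs))
    var_y = sum((y - mean_y) ** 2 for y in ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / sqrt(var_y * var_x)


a = [1.0, 2.0, 3.0]
b = [1.0, 2.0, 3.0]
print(pearson_corr(a, b))  # 1.0

# Emulate filter=col("a") > lit(1.0) by dropping rows first.
kept = [(y, x) for y, x in zip(a, b) if y > 1.0]
print(pearson_corr([y for y, _ in kept], [x for _, x in kept]))  # 1.0
```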
datafusion.functions.cos(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the cosine of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0,-1,1]})
>>> cos_df = df.select(dfn.functions.cos(dfn.col("a")).alias("cos"))
>>> cos_df.collect_column("cos")[0].as_py()
1.0
datafusion.functions.cosh(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the hyperbolic cosine of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0,-1,1]})
>>> cosh_df = df.select(dfn.functions.cosh(dfn.col("a")).alias("cosh"))
>>> cosh_df.collect_column("cosh")[0].as_py()
1.0
datafusion.functions.cot(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the cotangent of the argument.

Examples

>>> from math import pi
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [pi / 4]})
>>> result = df.select(
...     dfn.functions.cot(dfn.col("a")).alias("cot")
... )
>>> result.collect_column("cot")[0].as_py()
1.0...
datafusion.functions.count(expressions: datafusion.expr.Expr | list[datafusion.expr.Expr] | None = None, distinct: bool = False, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Returns the number of rows that match the given arguments.

This aggregate function will count the non-null rows provided in the expression.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by and null_treatment.

Parameters:
  • expressions – Expression or list of expressions to count

  • distinct – If True, a single entry for each distinct value will be in the result

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.count(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3
>>> df = ctx.from_pydict({"a": [1, 1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.count(
...         dfn.col("a"), distinct=True,
...         filter=dfn.col("a") > dfn.lit(1),
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2
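
The interaction of distinct and filter can be sketched in plain Python. This is an illustration under stated assumptions rather than DataFusion's implementation: the helper and its keep parameter are hypothetical, and None stands in for NULL.

```python
def count(values, distinct=False, keep=None):
    """Count non-None values, optionally filtered and de-duplicated."""
    kept = [v for v in values if v is not None and (keep is None or keep(v))]
    return len(set(kept)) if distinct else len(kept)


print(count([1, 2, 3]))  # 3

# Rows with a > 1 are [2, 3]; both are distinct, so the result is 2.
print(count([1, 1, 2, 3], distinct=True, keep=lambda v: v > 1))  # 2
```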
datafusion.functions.count_star(filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Create a COUNT(1) aggregate expression.

This aggregate function will count all of the rows in the partition.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, distinct, and null_treatment.

Parameters:
  • filter – If provided, only count rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.count_star(
...     ).alias("cnt")])
>>> result.collect_column("cnt")[0].as_py()
3
>>> result = df.aggregate(
...     [], [dfn.functions.count_star(
...         filter=dfn.col("a") > dfn.lit(1)
...     ).alias("cnt")])
>>> result.collect_column("cnt")[0].as_py()
2
datafusion.functions.covar(value_y: datafusion.expr.Expr, value_x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the sample covariance.

See also

This is an alias for covar_samp().

datafusion.functions.covar_pop(value_y: datafusion.expr.Expr, value_x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the population covariance.

This aggregate function expects both values to be numeric and will return a float.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • value_y – The dependent variable for covariance

  • value_x – The independent variable for covariance

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 5.0, 10.0], "b": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [],
...     [dfn.functions.covar_pop(
...         dfn.col("a"), dfn.col("b")
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
3.0
>>> df = ctx.from_pydict(
...     {"a": [0.0, 1.0, 3.0], "b": [0.0, 1.0, 3.0]})
>>> result = df.aggregate(
...     [],
...     [dfn.functions.covar_pop(
...         dfn.col("a"), dfn.col("b"),
...         filter=dfn.col("a") > dfn.lit(0.0)
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
1.0
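
The results above can be reproduced by hand with the usual population formula, which divides the sum of products of deviations by n. The helper below is a hypothetical plain-Python sketch and does not call DataFusion.

```python
def covar_pop(ys, xs):
    """Population covariance: mean product of deviations from the means."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_x = sum(xs) / n
    return sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs)) / n


# Rounded for display because the mean of [1, 5, 10] is not exact in binary.
print(round(covar_pop([1.0, 5.0, 10.0], [1.0, 2.0, 3.0]), 6))  # 3.0

# Second example with the a > 0.0 filter already applied to the rows.
print(covar_pop([1.0, 3.0], [1.0, 3.0]))  # 1.0
```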
datafusion.functions.covar_samp(value_y: datafusion.expr.Expr, value_x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the sample covariance.

This aggregate function expects both values to be numeric and will return a float.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • value_y – The dependent variable for covariance

  • value_x – The independent variable for covariance

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.covar_samp(
...         dfn.col("a"), dfn.col("b")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.0
>>> result = df.aggregate(
...     [], [dfn.functions.covar_samp(
...         dfn.col("a"), dfn.col("b"),
...         filter=dfn.col("a") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.5
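
The sample covariance differs from the population covariance only in its divisor, n - 1 instead of n. A plain-Python sketch with a hypothetical helper, independent of DataFusion and assuming at least two rows:

```python
def covar_samp(ys, xs):
    """Sample covariance: sum of deviation products divided by n - 1."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_x = sum(xs) / n
    total = sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs))
    return total / (n - 1)


print(covar_samp([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 1.0

# With the a > 1.0 filter applied, only the last two rows remain.
print(covar_samp([2.0, 3.0], [5.0, 6.0]))  # 0.5
```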
datafusion.functions.cume_dist(partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr

Create a cumulative distribution window function.

This window function is similar to rank() except that the returned values are the ratio of the row number to the total number of rows. Here is an example of a dataframe with a window ordered by descending points and the associated cumulative distribution:

+--------+-----------+
| points | cume_dist |
+--------+-----------+
| 100    | 0.5       |
| 100    | 0.5       |
| 50     | 0.75      |
| 25     | 1.0       |
+--------+-----------+
Parameters:
  • partition_by – Expressions to partition the window frame on.

  • order_by – Set ordering within the window frame. Accepts column names or expressions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1., 2., 2., 3.]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.cume_dist(
...         order_by="a"
...     ).alias("cd")
... )
>>> result.collect_column("cd").to_pylist()
[0.25..., 0.75..., 0.75..., 1.0...]
>>> df = ctx.from_pydict(
...     {"g": ["a", "a", "b", "b"], "v": [1, 2, 3, 4]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.cume_dist(
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("cd"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("cd").to_pylist()
[0.5, 1.0, 0.5, 1.0]
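
One way to read these values: for each row, count the rows that sort at or before it (ties included) and divide by the number of rows in the partition. The plain-Python sketch below assumes a single partition ordered ascending; the helper is hypothetical and does not use DataFusion.

```python
def cume_dist(values):
    """Fraction of rows whose value is <= the current row's value."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


print(cume_dist([1.0, 2.0, 2.0, 3.0]))  # [0.25, 0.75, 0.75, 1.0]

# A descending order can be emulated by negating the values.
points = [100, 100, 50, 25]
print(cume_dist([-p for p in points]))  # [0.5, 0.5, 0.75, 1.0]
```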
datafusion.functions.current_date() datafusion.expr.Expr

Returns current UTC date as a Date32 value.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.current_date().alias("d")
... )
>>> result.collect_column("d")[0].as_py() is not None
True
datafusion.functions.current_time() datafusion.expr.Expr

Returns current UTC time as a Time64 value.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.current_time().alias("t")
... )

Use .value instead of .as_py() because nanosecond timestamps require pandas to convert to Python datetime objects.

>>> result.collect_column("t")[0].value > 0
True
datafusion.functions.current_timestamp() datafusion.expr.Expr

Returns the current timestamp in nanoseconds.

See also

This is an alias for now().

datafusion.functions.date_bin(stride: datafusion.expr.Expr, source: datafusion.expr.Expr, origin: datafusion.expr.Expr) datafusion.expr.Expr

Coerces an arbitrary timestamp to the start of the nearest specified interval.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"timestamp": ['2021-07-15 12:34:56', '2021-01-01']})
>>> result = df.select(
...     dfn.functions.date_bin(
...         dfn.string_literal("15 minutes"),
...         dfn.col("timestamp"),
...         dfn.string_literal("2001-01-01 00:00:00")
...     ).alias("b")
... )
>>> str(result.collect_column("b")[0].as_py())
'2021-07-15 12:30:00'
>>> str(result.collect_column("b")[1].as_py())
'2021-01-01 00:00:00'
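
The binning arithmetic can be sketched with the standard datetime module. This is an assumption-labeled illustration, not DataFusion's implementation: it takes the result to be the start of the stride-wide bin, counted from origin, that contains the timestamp. The helper name simply mirrors the function above.

```python
from datetime import datetime, timedelta


def date_bin(stride, source, origin):
    """Start of the stride-wide bin (aligned to origin) containing source."""
    whole_bins = (source - origin) // stride  # floor division yields an int
    return origin + whole_bins * stride


origin = datetime(2001, 1, 1)
stride = timedelta(minutes=15)

print(date_bin(stride, datetime(2021, 7, 15, 12, 34, 56), origin))
# 2021-07-15 12:30:00
print(date_bin(stride, datetime(2021, 1, 1), origin))
# 2021-01-01 00:00:00
```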
datafusion.functions.date_format(arg: datafusion.expr.Expr, formatter: datafusion.expr.Expr) datafusion.expr.Expr

Returns a string representation of a date, time, timestamp or duration.

See also

This is an alias for to_char().

datafusion.functions.date_part(part: datafusion.expr.Expr, date: datafusion.expr.Expr) datafusion.expr.Expr

Extracts a subfield from the date.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-07-15T00:00:00"]})
>>> df = df.select(dfn.functions.to_timestamp(dfn.col("a")).alias("a"))
>>> result = df.select(
...     dfn.functions.date_part(dfn.lit("year"), dfn.col("a")).alias("y"))
>>> result.collect_column("y")[0].as_py()
2021
datafusion.functions.date_trunc(part: datafusion.expr.Expr, date: datafusion.expr.Expr) datafusion.expr.Expr

Truncates the date to a specified level of precision.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-07-15T12:34:56"]})
>>> df = df.select(dfn.functions.to_timestamp(dfn.col("a")).alias("a"))
>>> result = df.select(
...     dfn.functions.date_trunc(
...         dfn.lit("month"), dfn.col("a")
...     ).alias("t")
... )
>>> str(result.collect_column("t")[0].as_py())
'2021-07-01 00:00:00'
datafusion.functions.datepart(part: datafusion.expr.Expr, date: datafusion.expr.Expr) datafusion.expr.Expr

Return a specified part of a date.

See also

This is an alias for date_part().

datafusion.functions.datetrunc(part: datafusion.expr.Expr, date: datafusion.expr.Expr) datafusion.expr.Expr

Truncates the date to a specified level of precision.

See also

This is an alias for date_trunc().

datafusion.functions.decode(expr: datafusion.expr.Expr, encoding: datafusion.expr.Expr) datafusion.expr.Expr

Decode the input, using the encoding. encoding can be base64 or hex.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["aGVsbG8="]})
>>> result = df.select(
...     dfn.functions.decode(dfn.col("a"), dfn.lit("base64")).alias("dec"))
>>> result.collect_column("dec")[0].as_py()
b'hello'
datafusion.functions.degrees(arg: datafusion.expr.Expr) datafusion.expr.Expr

Converts the argument from radians to degrees.

Examples

>>> from math import pi
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0,pi,2*pi]})
>>> deg_df = df.select(dfn.functions.degrees(dfn.col("a")).alias("deg"))
>>> deg_df.collect_column("deg")[2].as_py()
360.0
datafusion.functions.dense_rank(partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr

Create a dense_rank window function.

This window function is similar to rank() except that the returned values will be consecutive. Here is an example of a dataframe with a window ordered by descending points and the associated dense rank:

+--------+------------+
| points | dense_rank |
+--------+------------+
| 100    | 1          |
| 100    | 1          |
| 50     | 2          |
| 25     | 3          |
+--------+------------+
Parameters:
  • partition_by – Expressions to partition the window frame on.

  • order_by – Set ordering within the window frame. Accepts column names or expressions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 10, 20]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.dense_rank(
...         order_by="a"
...     ).alias("dr"))
>>> result.sort(dfn.col("a")).collect_column("dr").to_pylist()
[1, 1, 2]
>>> df = ctx.from_pydict(
...     {"g": ["a", "a", "b", "b"], "v": [1, 1, 2, 3]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.dense_rank(
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("dr"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("dr").to_pylist()
[1, 1, 1, 2]
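
Dense ranking can be pictured as numbering the distinct values in sort order and giving every row the number of its value. The following plain-Python sketch assumes a single partition; the helper and its descending flag are hypothetical and not part of DataFusion.

```python
def dense_rank(values, descending=False):
    """Rank rows by value with no gaps between consecutive ranks."""
    distinct = sorted(set(values), reverse=descending)
    rank_of = {value: rank for rank, value in enumerate(distinct, start=1)}
    return [rank_of[value] for value in values]


print(dense_rank([10, 10, 20]))  # [1, 1, 2]

# The points table above, ordered by descending points.
print(dense_rank([100, 100, 50, 25], descending=True))  # [1, 1, 2, 3]
```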
datafusion.functions.digest(value: datafusion.expr.Expr, method: datafusion.expr.Expr) datafusion.expr.Expr

Computes the binary hash of an expression using the specified algorithm.

Standard algorithms are md5, sha224, sha256, sha384, sha512, blake2s, blake2b, and blake3.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.digest(dfn.col("a"), dfn.lit("md5")).alias("d"))
>>> len(result.collect_column("d")[0].as_py()) > 0
True
datafusion.functions.element_at(map: datafusion.expr.Expr, key: datafusion.expr.Expr) datafusion.expr.Expr

Returns the value for a given key in the map.

Returns [None] if the key is absent.

See also

This is an alias for map_extract().

datafusion.functions.empty(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if the array is empty.

See also

This is an alias for array_empty().

datafusion.functions.encode(expr: datafusion.expr.Expr, encoding: datafusion.expr.Expr) datafusion.expr.Expr

Encode the input, using the encoding. encoding can be base64 or hex.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.encode(dfn.col("a"), dfn.lit("base64")).alias("enc"))
>>> result.collect_column("enc")[0].as_py()
'aGVsbG8'
datafusion.functions.ends_with(arg: datafusion.expr.Expr, suffix: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if the string ends with the suffix, false otherwise.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["abc","b","c"]})
>>> ends_with_df = df.select(
...     dfn.functions.ends_with(dfn.col("a"), dfn.lit("c")).alias("ends_with"))
>>> ends_with_df.collect_column("ends_with")[0].as_py()
True
datafusion.functions.exp(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the exponential of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.exp(dfn.col("a")).alias("exp"))
>>> result.collect_column("exp")[0].as_py()
1.0
datafusion.functions.extract(part: datafusion.expr.Expr, date: datafusion.expr.Expr) datafusion.expr.Expr

Extracts a subfield from the date.

See also

This is an alias for date_part().

datafusion.functions.factorial(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the factorial of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [3]})
>>> result = df.select(
...     dfn.functions.factorial(dfn.col("a")).alias("factorial")
... )
>>> result.collect_column("factorial")[0].as_py()
6
datafusion.functions.find_in_set(string: datafusion.expr.Expr, string_list: datafusion.expr.Expr) datafusion.expr.Expr

Find a string in a list of strings.

Returns a value in the range of 1 to N if the string is in the string list string_list consisting of N substrings.

The string list is a string composed of substrings separated by , characters.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["b"]})
>>> result = df.select(
...     dfn.functions.find_in_set(dfn.col("a"), dfn.lit("a,b,c")).alias("pos"))
>>> result.collect_column("pos")[0].as_py()
2
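
The 1-based position can be sketched in plain Python. The helper below is hypothetical and independent of DataFusion. Returning 0 when the string is absent is an assumption borrowed from the MySQL function of the same name; the text above does not state it.

```python
def find_in_set(string, string_list):
    """1-based position of string in a comma-separated list, else 0."""
    parts = string_list.split(",")
    # Assumption: a missing string yields 0 (not stated in the text above).
    return parts.index(string) + 1 if string in parts else 0


print(find_in_set("b", "a,b,c"))  # 2
print(find_in_set("z", "a,b,c"))  # 0
```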
datafusion.functions.first_value(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None, null_treatment: datafusion.common.NullTreatment = NullTreatment.RESPECT_NULLS) datafusion.expr.Expr

Returns the first value in a group of values.

This aggregate function will return the first value in the partition.

If using the builder functions described in the aggregation documentation, this function ignores the option distinct.

Parameters:
  • expression – Argument to perform calculation on

  • filter – If provided, only compute against rows for which the filter is True

  • order_by – Set the ordering of the expression to evaluate. Accepts column names or expressions.

  • null_treatment – Assign whether to respect or ignore null values.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30]})
>>> result = df.aggregate(
...     [], [dfn.functions.first_value(
...         dfn.col("a")
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
10
>>> df = ctx.from_pydict({"a": [None, 20, 10]})
>>> result = df.aggregate(
...     [], [dfn.functions.first_value(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(10),
...         order_by="a",
...         null_treatment=dfn.common.NullTreatment.IGNORE_NULLS,
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
20
datafusion.functions.flatten(array: datafusion.expr.Expr) datafusion.expr.Expr

Flattens an array of arrays into a single array.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[[1, 2], [3, 4]]]})
>>> result = df.select(dfn.functions.flatten(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2, 3, 4]
datafusion.functions.floor(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the nearest integer less than or equal to the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.9]})
>>> floor_df = df.select(dfn.functions.floor(dfn.col("a")).alias("floor"))
>>> floor_df.collect_column("floor")[0].as_py()
1.0
datafusion.functions.from_unixtime(arg: datafusion.expr.Expr) datafusion.expr.Expr

Converts an integer to RFC3339 timestamp format string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0]})
>>> result = df.select(
...     dfn.functions.from_unixtime(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'1970-01-01 00:00:00'
datafusion.functions.gcd(x: datafusion.expr.Expr, y: datafusion.expr.Expr) datafusion.expr.Expr

Returns the greatest common divisor.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [12], "b": [8]})
>>> result = df.select(
...     dfn.functions.gcd(dfn.col("a"), dfn.col("b")).alias("gcd")
... )
>>> result.collect_column("gcd")[0].as_py()
4
datafusion.functions.gen_series(start: datafusion.expr.Expr, stop: datafusion.expr.Expr, step: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Creates a list of values in the range between start and stop.

Unlike range(), this includes the upper bound.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0]})
>>> result = df.select(
...     dfn.functions.gen_series(
...         dfn.lit(1), dfn.lit(5),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2, 3, 4, 5]

Specify a custom step:

>>> result = df.select(
...     dfn.functions.gen_series(
...         dfn.lit(1), dfn.lit(10), step=dfn.lit(3),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 4, 7, 10]
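
The inclusive upper bound is the same off-by-one adjustment needed when working with Python's built-in range, which is half-open. A plain-Python sketch, assuming integer arguments and a positive step; the helper is hypothetical and does not call DataFusion.

```python
def gen_series(start, stop, step=1):
    """Inclusive integer series; this sketch assumes step > 0."""
    return list(range(start, stop + 1, step))


print(gen_series(1, 5))      # [1, 2, 3, 4, 5]
print(gen_series(1, 10, 3))  # [1, 4, 7, 10]

# Python's built-in range stops before the upper bound.
print(list(range(1, 5)))     # [1, 2, 3, 4]
```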
datafusion.functions.generate_series(start: datafusion.expr.Expr, stop: datafusion.expr.Expr, step: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Creates a list of values in the range between start and stop.

Unlike range(), this includes the upper bound.

See also

This is an alias for gen_series().

datafusion.functions.get_field(expr: datafusion.expr.Expr, name: datafusion.expr.Expr | str) datafusion.expr.Expr

Extracts a field from a struct or map by name.

When the field name is a static string, the bracket operator expr["field"] is a convenient shorthand. Use get_field when the field name is a dynamic expression.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1], "b": [2]})
>>> df = df.with_column(
...     "s",
...     dfn.functions.named_struct(
...         [("x", dfn.col("a")), ("y", dfn.col("b"))]
...     ),
... )
>>> result = df.select(
...     dfn.functions.get_field(dfn.col("s"), "x").alias("x_val")
... )
>>> result.collect_column("x_val")[0].as_py()
1

Equivalent using bracket syntax:

>>> result = df.select(
...     dfn.col("s")["x"].alias("x_val")
... )
>>> result.collect_column("x_val")[0].as_py()
1
datafusion.functions.greatest(*args: datafusion.expr.Expr) datafusion.expr.Expr

Returns the greatest value from a list of expressions.

Returns NULL if all expressions are NULL.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 3], "b": [2, 1]})
>>> result = df.select(
...     dfn.functions.greatest(dfn.col("a"), dfn.col("b")).alias("greatest"))
>>> result.collect_column("greatest")[0].as_py()
2
>>> result.collect_column("greatest")[1].as_py()
3
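
A plain-Python sketch of the row-wise comparison follows. The helper is hypothetical and independent of DataFusion, with None standing in for NULL. Skipping NULLs when at least one argument is non-NULL is an assumption inferred from the sentence about all-NULL input.

```python
def greatest(*args):
    """Largest non-None argument, or None when every argument is None."""
    present = [arg for arg in args if arg is not None]
    return max(present) if present else None


a = [1, 3]
b = [2, 1]
print([greatest(x, y) for x, y in zip(a, b)])  # [2, 3]
print(greatest(None, None))  # None
```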
datafusion.functions.grouping(expression: datafusion.expr.Expr, distinct: bool = False, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Indicates whether a column is aggregated across in the current row.

Returns 0 when the column is part of the grouping key for that row (i.e., the row contains per-group results for that column). Returns 1 when the column is not part of the grouping key (i.e., the row’s aggregate spans all values of that column).

This function is meaningful with GroupingSet.rollup, GroupingSet.cube, or GroupingSet.grouping_sets, where different rows are grouped by different subsets of columns. In a default aggregation without grouping sets every column is always part of the key, so grouping() always returns 0.

Warning

Due to an upstream DataFusion limitation (#21411), .alias() cannot be applied directly to a grouping() expression. Doing so will raise an error at execution time. To rename the column, use with_column_renamed() on the result DataFrame instead.

Parameters:
  • expression – The column to check grouping status for

  • distinct – If True, compute on distinct values only

  • filter – If provided, only compute against rows for which the filter is True

Examples

With rollup(), the result includes both per-group rows (grouping(a) = 0) and a grand-total row where a is aggregated across (grouping(a) = 1):

>>> from datafusion.expr import GroupingSet
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 1, 2], "b": [10, 20, 30]})
>>> result = df.aggregate(
...     [GroupingSet.rollup(dfn.col("a"))],
...     [dfn.functions.sum(dfn.col("b")).alias("s"),
...      dfn.functions.grouping(dfn.col("a"))],
... ).sort(dfn.col("a").sort(nulls_first=False))
>>> result.collect_column("s").to_pylist()
[30, 30, 60]
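
To make the grouping flag concrete, the sketch below builds the rows a single-column rollup would produce for the same data, in plain Python without DataFusion. The helper name and the tuple layout (key, sum, grouping flag) are illustrative assumptions.

```python
from collections import defaultdict


def rollup_sum(keys, values):
    """Per-key sums (flag 0) followed by a grand-total row (flag 1)."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
    rows = [(key, totals[key], 0) for key in sorted(totals)]
    # In the grand-total row the key is aggregated across, so it is None.
    rows.append((None, sum(values), 1))
    return rows


for row in rollup_sum([1, 1, 2], [10, 20, 30]):
    print(row)
# (1, 30, 0)
# (2, 30, 0)
# (None, 60, 1)
```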

See also

GroupingSet

datafusion.functions.ifnull(x: datafusion.expr.Expr, y: datafusion.expr.Expr) datafusion.expr.Expr

Returns x if x is not NULL. Otherwise returns y.

See also

This is an alias for nvl().

datafusion.functions.in_list(arg: datafusion.expr.Expr, values: list[datafusion.expr.Expr], negated: bool = False) datafusion.expr.Expr

Returns whether the argument is contained within the list values.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.select(
...     dfn.functions.in_list(
...         dfn.col("a"), [dfn.lit(1), dfn.lit(3)]
...     ).alias("in")
... )
>>> result.collect_column("in").to_pylist()
[True, False, True]
>>> result = df.select(
...     dfn.functions.in_list(
...         dfn.col("a"), [dfn.lit(1), dfn.lit(3)],
...         negated=True,
...     ).alias("not_in")
... )
>>> result.collect_column("not_in").to_pylist()
[False, True, False]
datafusion.functions.initcap(string: datafusion.expr.Expr) datafusion.expr.Expr

Set the initial letter of each word to capital.

Converts the first letter of each word in string to uppercase and the remaining characters to lowercase.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["the cat"]})
>>> cap_df = df.select(dfn.functions.initcap(dfn.col("a")).alias("cap"))
>>> cap_df.collect_column("cap")[0].as_py()
'The Cat'
datafusion.functions.isnan(expr: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if a given number is +NaN or -NaN, otherwise returns false.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, np.nan]})
>>> result = df.select(dfn.functions.isnan(dfn.col("a")).alias("isnan"))
>>> result.collect_column("isnan")[1].as_py()
True
datafusion.functions.iszero(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if a given number is +0.0 or -0.0, otherwise returns false.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0, 1.0]})
>>> result = df.select(dfn.functions.iszero(dfn.col("a")).alias("iz"))
>>> result.collect_column("iz")[0].as_py()
True
datafusion.functions.lag(arg: datafusion.expr.Expr, shift_offset: int = 1, default_value: Any | None = None, partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr

Create a lag window function.

The lag operation returns the value of the argument from the row shift_offset rows before the current row in the partition. For example, lag(col("b"), shift_offset=3, default_value=5) returns the value of column b from three rows earlier. At the beginning of the partition, where no such row exists, it returns the default value of 5.

Here is an example of both the lag and datafusion.functions.lead() functions on a simple DataFrame:

+--------+------+-----+
| points | lead | lag |
+--------+------+-----+
| 100    | 100  |     |
| 100    | 50   | 100 |
| 50     | 25   | 100 |
| 25     |      | 50  |
+--------+------+-----+
Parameters:
  • arg – Value to return

  • shift_offset – Number of rows before the current row.

  • default_value – Value to return if shift_offset row does not exist.

  • partition_by – Expressions to partition the window frame on.

  • order_by – Set ordering within the window frame. Accepts column names or expressions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.lag(
...         dfn.col("a"), shift_offset=1,
...         default_value=0, order_by="a"
...     ).alias("lag"))
>>> result.sort(dfn.col("a")).collect_column("lag").to_pylist()
[0, 1, 2]
>>> df = ctx.from_pydict({"g": ["a", "a", "b"], "v": [1, 2, 3]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.lag(
...         dfn.col("v"), shift_offset=1, default_value=0,
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("lag"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("lag").to_pylist()
[0, 1, 0]
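
The lag column of the table above can be reproduced with a small plain-Python model. This is only a sketch of the semantics for a single, already ordered partition; the function below is a hypothetical stand-in and does not call DataFusion.

```python
from typing import Any, Optional, Sequence


def lag(values: Sequence[Any], shift_offset: int = 1,
        default_value: Optional[Any] = None) -> list:
    """Hypothetical model: value shift_offset rows earlier, else the default."""
    out = []
    for i in range(len(values)):
        j = i - shift_offset
        # Rows before the start of the partition fall back to the default.
        out.append(values[j] if 0 <= j < len(values) else default_value)
    return out


points = [100, 100, 50, 25]
lag_column = lag(points)  # no default given, so the first row is None
lag_with_default = lag([1, 2, 3], shift_offset=1, default_value=0)
print(lag_column, lag_with_default)
```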
datafusion.functions.last_value(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None, null_treatment: datafusion.common.NullTreatment = NullTreatment.RESPECT_NULLS) datafusion.expr.Expr

Returns the last value in a group of values.

This aggregate function will return the last value in the partition.

If using the builder functions described in the aggregation section of the online documentation, this function ignores the option distinct.

Parameters:
  • expression – Argument to return the last value of

  • filter – If provided, only compute against rows for which the filter is True

  • order_by – Set the ordering of the expression to evaluate. Accepts column names or expressions.

  • null_treatment – Assign whether to respect or ignore null values.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30]})
>>> result = df.aggregate(
...     [], [dfn.functions.last_value(
...         dfn.col("a")
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
30
>>> df = ctx.from_pydict({"a": [None, 20, 10]})
>>> result = df.aggregate(
...     [], [dfn.functions.last_value(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(10),
...         order_by="a",
...         null_treatment=dfn.common.NullTreatment.IGNORE_NULLS,
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
20
datafusion.functions.lcm(x: datafusion.expr.Expr, y: datafusion.expr.Expr) datafusion.expr.Expr

Returns the least common multiple.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [4], "b": [6]})
>>> result = df.select(
...     dfn.functions.lcm(dfn.col("a"), dfn.col("b")).alias("lcm")
... )
>>> result.collect_column("lcm")[0].as_py()
12
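
For reference, the least common multiple can be derived from the greatest common divisor. The sketch below uses only the Python standard library; the helper is hypothetical, and returning 0 when either input is 0 is an assumption of this sketch rather than documented DataFusion behavior.

```python
import math


def lcm(x: int, y: int) -> int:
    """Hypothetical model: lcm(x, y) = |x * y| / gcd(x, y)."""
    if x == 0 or y == 0:
        return 0  # assumption of this sketch
    return abs(x * y) // math.gcd(x, y)


value = lcm(4, 6)
print(value)
```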
datafusion.functions.lead(arg: datafusion.expr.Expr, shift_offset: int = 1, default_value: Any | None = None, partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr

Create a lead window function.

The lead operation returns the value of the argument from the row shift_offset rows after the current row in the partition. For example, lead(col("b"), shift_offset=3, default_value=5) returns the value of column b from three rows later. At the end of the partition, where no such row exists, it returns the default value of 5.

Here is an example of both the lead and datafusion.functions.lag() functions on a simple DataFrame:

+--------+------+-----+
| points | lead | lag |
+--------+------+-----+
| 100    | 100  |     |
| 100    | 50   | 100 |
| 50     | 25   | 100 |
| 25     |      | 50  |
+--------+------+-----+

To set window function parameters, use the window builder approach described in the window functions section of the online documentation.

Parameters:
  • arg – Value to return

  • shift_offset – Number of rows following the current row.

  • default_value – Value to return if shift_offset row does not exist.

  • partition_by – Expressions to partition the window frame on.

  • order_by – Set ordering within the window frame. Accepts column names or expressions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.lead(
...         dfn.col("a"), shift_offset=1,
...         default_value=0, order_by="a"
...     ).alias("lead"))
>>> result.sort(dfn.col("a")).collect_column("lead").to_pylist()
[2, 3, 0]
>>> df = ctx.from_pydict({"g": ["a", "a", "b"], "v": [1, 2, 3]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.lead(
...         dfn.col("v"), shift_offset=1, default_value=0,
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("lead"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("lead").to_pylist()
[2, 0, 0]
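
The effect of partition_by on lead can be modeled in plain Python: the look-ahead never crosses a partition boundary. This is a sketch under the assumption that rows are already ordered within each partition; lead_by_partition is a hypothetical helper, not part of the library.

```python
from collections import defaultdict


def lead_by_partition(rows, shift_offset=1, default_value=None):
    """rows: (partition_key, value) pairs, already ordered within each partition."""
    positions = defaultdict(list)
    for idx, (key, _) in enumerate(rows):
        positions[key].append(idx)

    out = [default_value] * len(rows)
    for idxs in positions.values():
        for pos, idx in enumerate(idxs):
            nxt = pos + shift_offset
            # Only look ahead inside the same partition.
            if 0 <= nxt < len(idxs):
                out[idx] = rows[idxs[nxt]][1]
    return out


rows = [("a", 1), ("a", 2), ("b", 3)]
lead_column = lead_by_partition(rows, shift_offset=1, default_value=0)
print(lead_column)
```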
datafusion.functions.least(*args: datafusion.expr.Expr) datafusion.expr.Expr

Returns the least value from a list of expressions.

Returns NULL if all expressions are NULL.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 3], "b": [2, 1]})
>>> result = df.select(
...     dfn.functions.least(dfn.col("a"), dfn.col("b")).alias("least"))
>>> result.collect_column("least")[0].as_py()
1
>>> result.collect_column("least")[1].as_py()
1
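
A row-wise model of least in plain Python is sketched below. It assumes that NULL arguments are skipped when at least one argument is non-NULL, which is suggested by the statement above but not spelled out there; the function is a hypothetical stand-in.

```python
def least(*args):
    """Hypothetical model: smallest non-None argument, or None if all are None."""
    non_null = [a for a in args if a is not None]
    return min(non_null) if non_null else None


a = [1, 3]
b = [2, 1]
row_results = [least(x, y) for x, y in zip(a, b)]
all_null = least(None, None)
print(row_results, all_null)
```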
datafusion.functions.left(string: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr

Returns the first n characters in the string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["the cat"]})
>>> left_df = df.select(
...     dfn.functions.left(dfn.col("a"), dfn.lit(3)).alias("left"))
>>> left_df.collect_column("left")[0].as_py()
'the'
datafusion.functions.length(string: datafusion.expr.Expr) datafusion.expr.Expr

The number of characters in the string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.length(dfn.col("a")).alias("len"))
>>> result.collect_column("len")[0].as_py()
5
datafusion.functions.levenshtein(string1: datafusion.expr.Expr, string2: datafusion.expr.Expr) datafusion.expr.Expr

Returns the Levenshtein distance between the two given strings.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["kitten"]})
>>> result = df.select(
...     dfn.functions.levenshtein(dfn.col("a"), dfn.lit("sitting")).alias("d"))
>>> result.collect_column("d")[0].as_py()
3
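
The Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A standard dynamic-programming sketch in plain Python follows; it illustrates the metric and is not DataFusion's implementation.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
print(distance)
```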
datafusion.functions.list_any_value(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the first non-null element in the array.

See also

This is an alias for array_any_value().

datafusion.functions.list_append(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Appends an element to the end of an array.

See also

This is an alias for array_append().

datafusion.functions.list_cat(*args: datafusion.expr.Expr) datafusion.expr.Expr

Concatenates the input arrays.

See also

This is an alias for array_concat(), array_cat().

datafusion.functions.list_concat(*args: datafusion.expr.Expr) datafusion.expr.Expr

Concatenates the input arrays.

See also

This is an alias for array_concat(), array_cat().

datafusion.functions.list_contains(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if the element appears in the array, otherwise false.

See also

This is an alias for array_has().

datafusion.functions.list_dims(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns an array of the array’s dimensions.

See also

This is an alias for array_dims().

datafusion.functions.list_distance(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr

Returns the Euclidean distance between two numeric arrays.

See also

This is an alias for array_distance().

datafusion.functions.list_distinct(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns distinct values from the array after removing duplicates.

See also

This is an alias for array_distinct().

datafusion.functions.list_element(array: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr

Extracts the element with the index n from the array.

See also

This is an alias for array_element().

datafusion.functions.list_empty(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns a boolean indicating whether the array is empty.

See also

This is an alias for array_empty().

datafusion.functions.list_except(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr

Returns the elements that appear in array1 but not in array2.

See also

This is an alias for array_except().

datafusion.functions.list_extract(array: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr

Extracts the element with the index n from the array.

See also

This is an alias for array_element().

datafusion.functions.list_has(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if the element appears in the array, otherwise false.

See also

This is an alias for array_has().

datafusion.functions.list_has_all(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) datafusion.expr.Expr

Determines whether every element of second_array is present in first_array.

See also

This is an alias for array_has_all().

datafusion.functions.list_has_any(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) datafusion.expr.Expr

Determine if there is an overlap between first_array and second_array.

See also

This is an alias for array_has_any().

datafusion.functions.list_indexof(array: datafusion.expr.Expr, element: datafusion.expr.Expr, index: int | None = 1) datafusion.expr.Expr

Return the position of the first occurrence of element in array.

See also

This is an alias for array_position().

datafusion.functions.list_intersect(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr

Returns the intersection of array1 and array2.

See also

This is an alias for array_intersect().

datafusion.functions.list_join(expr: datafusion.expr.Expr, delimiter: datafusion.expr.Expr) datafusion.expr.Expr

Converts each element to its text representation.

See also

This is an alias for array_to_string().

datafusion.functions.list_length(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the length of the array.

See also

This is an alias for array_length().

datafusion.functions.list_max(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the maximum value in the array.

See also

This is an alias for array_max().

datafusion.functions.list_min(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the minimum value in the array.

See also

This is an alias for array_min().

datafusion.functions.list_ndims(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the number of dimensions of the array.

See also

This is an alias for array_ndims().

datafusion.functions.list_overlap(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if any element appears in both arrays.

See also

This is an alias for array_has_any().

datafusion.functions.list_pop_back(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the array without the last element.

See also

This is an alias for array_pop_back().

datafusion.functions.list_pop_front(array: datafusion.expr.Expr) datafusion.expr.Expr

Returns the array without the first element.

See also

This is an alias for array_pop_front().

datafusion.functions.list_position(array: datafusion.expr.Expr, element: datafusion.expr.Expr, index: int | None = 1) datafusion.expr.Expr

Return the position of the first occurrence of element in array.

See also

This is an alias for array_position().

datafusion.functions.list_positions(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Searches for an element in the array and returns all occurrences.

See also

This is an alias for array_positions().

datafusion.functions.list_prepend(element: datafusion.expr.Expr, array: datafusion.expr.Expr) datafusion.expr.Expr

Prepends an element to the beginning of an array.

See also

This is an alias for array_prepend().

datafusion.functions.list_push_back(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Appends an element to the end of an array.

See also

This is an alias for array_append().

datafusion.functions.list_push_front(element: datafusion.expr.Expr, array: datafusion.expr.Expr) datafusion.expr.Expr

Prepends an element to the beginning of an array.

See also

This is an alias for array_prepend().

datafusion.functions.list_remove(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Removes the first element from the array equal to the given value.

See also

This is an alias for array_remove().

datafusion.functions.list_remove_all(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr

Removes all elements from the array equal to the given value.

See also

This is an alias for array_remove_all().

datafusion.functions.list_remove_n(array: datafusion.expr.Expr, element: datafusion.expr.Expr, max: datafusion.expr.Expr) datafusion.expr.Expr

Removes the first max elements from the array equal to the given value.

See also

This is an alias for array_remove_n().

datafusion.functions.list_repeat(element: datafusion.expr.Expr, count: datafusion.expr.Expr) datafusion.expr.Expr

Returns an array containing element count times.

See also

This is an alias for array_repeat().

datafusion.functions.list_replace(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr

Replaces the first occurrence of from_val with to_val.

See also

This is an alias for array_replace().

datafusion.functions.list_replace_all(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr

Replaces all occurrences of from_val with to_val.

See also

This is an alias for array_replace_all().

datafusion.functions.list_replace_n(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr, max: datafusion.expr.Expr) datafusion.expr.Expr

Replaces up to max occurrences of from_val with to_val.

Replaces the first max occurrences of the specified element with another specified element.

See also

This is an alias for array_replace_n().
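
The "first max occurrences" rule can be illustrated with a plain-Python list. This is a sketch of the described semantics only; replace_n is a hypothetical helper and does not call DataFusion.

```python
def replace_n(array, from_val, to_val, max_count):
    """Hypothetical model: replace at most max_count matches, scanning left to right."""
    out = []
    replaced = 0
    for item in array:
        if item == from_val and replaced < max_count:
            out.append(to_val)
            replaced += 1
        else:
            out.append(item)
    return out


result = replace_n([1, 2, 1, 1], from_val=1, to_val=9, max_count=2)
print(result)
```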

datafusion.functions.list_resize(array: datafusion.expr.Expr, size: datafusion.expr.Expr, value: datafusion.expr.Expr) datafusion.expr.Expr

Returns the array resized to the specified size.

If size is greater than the array length, the additional entries will be filled with the given value.

See also

This is an alias for array_resize().
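
A plain-Python sketch of the padding rule is shown below. Growing the array follows the description above; truncating when size is smaller than the current length is an assumption of this sketch. The helper name resize is hypothetical.

```python
def resize(array, size, value=None):
    """Hypothetical model: pad with value when growing, truncate when shrinking."""
    if size <= len(array):
        return list(array[:size])  # truncation is an assumption of this sketch
    return list(array) + [value] * (size - len(array))


grown = resize([1, 2], 4, value=0)
shrunk = resize([1, 2, 3], 2, value=0)
print(grown, shrunk)
```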

datafusion.functions.list_reverse(array: datafusion.expr.Expr) datafusion.expr.Expr

Reverses the order of elements in the array.

See also

This is an alias for array_reverse().

datafusion.functions.list_slice(array: datafusion.expr.Expr, begin: datafusion.expr.Expr, end: datafusion.expr.Expr, stride: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Returns a slice of the array.

See also

This is an alias for array_slice().

datafusion.functions.list_sort(array: datafusion.expr.Expr, descending: bool = False, null_first: bool = False) datafusion.expr.Expr

Sorts the array.

See also

This is an alias for array_sort().

datafusion.functions.list_to_string(expr: datafusion.expr.Expr, delimiter: datafusion.expr.Expr) datafusion.expr.Expr

Converts each element to its text representation.

See also

This is an alias for array_to_string().

datafusion.functions.list_union(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr

Returns an array of the elements in the union of array1 and array2.

Duplicate elements are not returned.

See also

This is an alias for array_union().

datafusion.functions.list_zip(*arrays: datafusion.expr.Expr) datafusion.expr.Expr

Combines multiple arrays into a single array of structs.

See also

This is an alias for arrays_zip().

datafusion.functions.ln(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the natural logarithm (base e) of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0]})
>>> result = df.select(dfn.functions.ln(dfn.col("a")).alias("ln"))
>>> result.collect_column("ln")[0].as_py()
0.0
datafusion.functions.log(base: datafusion.expr.Expr, num: datafusion.expr.Expr) datafusion.expr.Expr

Returns the logarithm of a number for a particular base.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [100.0]})
>>> result = df.select(
...     dfn.functions.log(dfn.lit(10.0), dfn.col("a")).alias("log")
... )
>>> result.collect_column("log")[0].as_py()
2.0
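
Note the argument order: the base comes first and the number second, which is the reverse of Python's math.log(x, base). The change-of-base identity behind the result can be sketched with the standard library; the helper below is a hypothetical stand-in, not DataFusion code.

```python
import math


def log(base: float, num: float) -> float:
    """Hypothetical model: log_base(num) = ln(num) / ln(base)."""
    return math.log(num) / math.log(base)


value = log(10.0, 100.0)    # base first, then the number
swapped = log(100.0, 10.0)  # swapping the arguments gives the reciprocal
print(value, swapped)
```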
datafusion.functions.log10(arg: datafusion.expr.Expr) datafusion.expr.Expr

Base 10 logarithm of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [100.0]})
>>> result = df.select(dfn.functions.log10(dfn.col("a")).alias("log10"))
>>> result.collect_column("log10")[0].as_py()
2.0
datafusion.functions.log2(arg: datafusion.expr.Expr) datafusion.expr.Expr

Base 2 logarithm of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [8.0]})
>>> result = df.select(dfn.functions.log2(dfn.col("a")).alias("log2"))
>>> result.collect_column("log2")[0].as_py()
3.0
datafusion.functions.lower(arg: datafusion.expr.Expr) datafusion.expr.Expr

Converts a string to lowercase.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["THE CaT"]})
>>> lower_df = df.select(dfn.functions.lower(dfn.col("a")).alias("lower"))
>>> lower_df.collect_column("lower")[0].as_py()
'the cat'
datafusion.functions.lpad(string: datafusion.expr.Expr, count: datafusion.expr.Expr, characters: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Add left padding to a string.

Extends the string to length count by prepending characters (a space by default). If the string is already longer than count, it is truncated (on the right).

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["the cat", "a hat"]})
>>> lpad_df = df.select(
...     dfn.functions.lpad(
...         dfn.col("a"), dfn.lit(6)
...     ).alias("lpad"))
>>> lpad_df.collect_column("lpad")[0].as_py()
'the ca'
>>> lpad_df.collect_column("lpad")[1].as_py()
' a hat'
>>> result = df.select(
...     dfn.functions.lpad(
...         dfn.col("a"), dfn.lit(10), characters=dfn.lit(".")
...     ).alias("lpad"))
>>> result.collect_column("lpad")[0].as_py()
'...the cat'
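
The padding and truncation rules can be modeled in plain Python as follows. This is a sketch that assumes a non-empty fill string and assumes a multi-character fill is repeated and cut to fit, as in common SQL dialects; the function is a hypothetical stand-in.

```python
def lpad(string: str, count: int, characters: str = " ") -> str:
    """Hypothetical model of left padding; assumes characters is non-empty."""
    if count <= len(string):
        return string[:count]  # too long: truncate on the right
    missing = count - len(string)
    repeats = missing // len(characters) + 1
    return (characters * repeats)[:missing] + string


truncated = lpad("the cat", 6)
padded = lpad("a hat", 6)
dotted = lpad("the cat", 10, ".")
print(repr(truncated), repr(padded), repr(dotted))
```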
datafusion.functions.ltrim(arg: datafusion.expr.Expr) datafusion.expr.Expr

Removes all characters, spaces by default, from the beginning of a string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [" a  "]})
>>> trim_df = df.select(dfn.functions.ltrim(dfn.col("a")).alias("trimmed"))
>>> trim_df.collect_column("trimmed")[0].as_py()
'a  '
datafusion.functions.make_array(*args: datafusion.expr.Expr) datafusion.expr.Expr

Returns an array using the specified input expressions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.make_array(
...         dfn.lit(1), dfn.lit(2), dfn.lit(3)
...     ).alias("arr"))
>>> result.collect_column("arr")[0].as_py()
[1, 2, 3]
datafusion.functions.make_date(year: datafusion.expr.Expr, month: datafusion.expr.Expr, day: datafusion.expr.Expr) datafusion.expr.Expr

Make a date from year, month and day component parts.

Examples

>>> from datetime import date
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [2024], "m": [1], "d": [15]})
>>> result = df.select(
...     dfn.functions.make_date(dfn.col("y"), dfn.col("m"),
...     dfn.col("d")).alias("dt"))
>>> result.collect_column("dt")[0].as_py()
datetime.date(2024, 1, 15)
datafusion.functions.make_list(*args: datafusion.expr.Expr) datafusion.expr.Expr

Returns an array using the specified input expressions.

See also

This is an alias for make_array().

datafusion.functions.make_map(*args: Any) datafusion.expr.Expr

Returns a map expression.

Supports three calling conventions:

  • make_map({"a": 1, "b": 2}) — from a Python dictionary.

  • make_map([keys], [values]) — from a list of keys and a list of their associated values. Both lists must be the same length.

  • make_map(k1, v1, k2, v2, ...) — from alternating keys and their associated values.

Keys and values that are not already Expr are automatically converted to literal expressions.

Examples

From a dictionary:

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.make_map({"a": 1, "b": 2}).alias("m"))
>>> result.collect_column("m")[0].as_py()
[('a', 1), ('b', 2)]

From two lists:

>>> df = ctx.from_pydict({"key": ["x", "y"], "val": [10, 20]})
>>> df = df.select(
...     dfn.functions.make_map(
...         [dfn.col("key")], [dfn.col("val")]
...     ).alias("m"))
>>> df.collect_column("m")[0].as_py()
[('x', 10)]

From alternating keys and values:

>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.make_map("x", 1, "y", 2).alias("m"))
>>> result.collect_column("m")[0].as_py()
[('x', 1), ('y', 2)]
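
The three calling conventions all reduce to the same list of key/value pairs. The plain-Python sketch below shows one way such arguments could be normalized; normalize_map_args is a hypothetical helper written for illustration and is not how the library is known to implement make_map.

```python
def normalize_map_args(*args):
    """Hypothetical helper: reduce the three call styles to (key, value) pairs."""
    # Style 1: a single Python dictionary.
    if len(args) == 1 and isinstance(args[0], dict):
        return list(args[0].items())
    # Style 2: a list of keys and a list of values of equal length.
    if len(args) == 2 and isinstance(args[0], list) and isinstance(args[1], list):
        keys, values = args
        if len(keys) != len(values):
            raise ValueError("keys and values must be the same length")
        return list(zip(keys, values))
    # Style 3: alternating keys and values.
    if len(args) % 2 != 0:
        raise ValueError("alternating form needs an even number of arguments")
    return list(zip(args[0::2], args[1::2]))


from_dict = normalize_map_args({"a": 1, "b": 2})
from_lists = normalize_map_args(["x", "y"], [10, 20])
from_pairs = normalize_map_args("x", 1, "y", 2)
print(from_dict, from_lists, from_pairs)
```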
datafusion.functions.make_time(hour: datafusion.expr.Expr, minute: datafusion.expr.Expr, second: datafusion.expr.Expr) datafusion.expr.Expr

Make a time from hour, minute and second component parts.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"h": [12], "m": [30], "s": [0]})
>>> result = df.select(
...     dfn.functions.make_time(dfn.col("h"), dfn.col("m"),
...     dfn.col("s")).alias("t"))
>>> result.collect_column("t")[0].as_py()
datetime.time(12, 30)
datafusion.functions.map_entries(map: datafusion.expr.Expr) datafusion.expr.Expr

Returns a list of all entries (key-value struct pairs) in the map.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> df = df.select(
...     dfn.functions.make_map({"x": 1, "y": 2}).alias("m"))
>>> result = df.select(
...     dfn.functions.map_entries(dfn.col("m")).alias("entries"))
>>> result.collect_column("entries")[0].as_py()
[{'key': 'x', 'value': 1}, {'key': 'y', 'value': 2}]
datafusion.functions.map_extract(map: datafusion.expr.Expr, key: datafusion.expr.Expr) datafusion.expr.Expr

Returns the value for a given key in the map.

Returns [None] if the key is absent.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> df = df.select(
...     dfn.functions.make_map({"x": 1, "y": 2}).alias("m"))
>>> result = df.select(
...     dfn.functions.map_extract(
...         dfn.col("m"), dfn.lit("x")
...     ).alias("val"))
>>> result.collect_column("val")[0].as_py()
[1]
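
The result is always a one-element list, whether or not the key is present. A plain-Python sketch of that shape, using a dict as a stand-in for the map column (the helper is hypothetical):

```python
def map_extract(mapping: dict, key):
    """Hypothetical model: wrap the looked-up value (or None) in a list."""
    return [mapping.get(key)]


m = {"x": 1, "y": 2}
found = map_extract(m, "x")
missing = map_extract(m, "z")
print(found, missing)
```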
datafusion.functions.map_keys(map: datafusion.expr.Expr) datafusion.expr.Expr

Returns a list of all keys in the map.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> df = df.select(
...     dfn.functions.make_map({"x": 1, "y": 2}).alias("m"))
>>> result = df.select(
...     dfn.functions.map_keys(dfn.col("m")).alias("keys"))
>>> result.collect_column("keys")[0].as_py()
['x', 'y']
datafusion.functions.map_values(map: datafusion.expr.Expr) datafusion.expr.Expr

Returns a list of all values in the map.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> df = df.select(
...     dfn.functions.make_map({"x": 1, "y": 2}).alias("m"))
>>> result = df.select(
...     dfn.functions.map_values(dfn.col("m")).alias("vals"))
>>> result.collect_column("vals")[0].as_py()
[1, 2]
datafusion.functions.max(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Aggregate function that returns the maximum value of the argument.

If using the builder functions described in the aggregation section of the online documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – The value to find the maximum of

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.max(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3
>>> result = df.aggregate(
...     [], [dfn.functions.max(
...         dfn.col("a"),
...         filter=dfn.col("a") < dfn.lit(3)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2
datafusion.functions.md5(arg: datafusion.expr.Expr) datafusion.expr.Expr

Computes an MD5 128-bit checksum for a string expression.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.md5(dfn.col("a")).alias("md5"))
>>> result.collect_column("md5")[0].as_py()
'5d41402abc4b2a76b9719d911017c592'
datafusion.functions.mean(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Returns the average (mean) value of the argument.

See also

This is an alias for avg().

datafusion.functions.median(expression: datafusion.expr.Expr, distinct: bool = False, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the median of a set of numbers.

This aggregate function returns the median value of the expression within each group.

If using the builder functions described in the aggregation section of the online documentation, this function ignores the options order_by and null_treatment.

Parameters:
  • expression – The value to compute the median of

  • distinct – If True, a single entry for each distinct value will be in the result

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.median(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> df = ctx.from_pydict({"a": [1.0, 1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.median(
...         dfn.col("a"), distinct=True,
...         filter=dfn.col("a") < dfn.lit(3.0),
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.5
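
How distinct and filter interact can be checked with the standard library. The sketch below is a plain-Python model, not DataFusion code; the helper name and its predicate parameter are hypothetical.

```python
import statistics


def median(values, distinct=False, predicate=None):
    """Hypothetical model: filter rows, optionally deduplicate, then take the median."""
    kept = [v for v in values
            if v is not None and (predicate is None or predicate(v))]
    if distinct:
        kept = list(set(kept))
    return statistics.median(kept) if kept else None


data = [1.0, 1.0, 2.0, 3.0]
with_distinct = median(data, distinct=True, predicate=lambda v: v < 3.0)
without_distinct = median(data, distinct=False, predicate=lambda v: v < 3.0)
print(with_distinct, without_distinct)
```

With distinct the filtered values are {1.0, 2.0}; without it they are [1.0, 1.0, 2.0], so the duplicate pulls the median down.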
datafusion.functions.min(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Aggregate function that returns the minimum value of the argument.

If using the builder functions described in the aggregation section of the online documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – The value to find the minimum of

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.min(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1
>>> result = df.aggregate(
...     [], [dfn.functions.min(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2
datafusion.functions.named_struct(name_pairs: list[tuple[str, datafusion.expr.Expr]]) datafusion.expr.Expr

Returns a struct built from the given (name, expression) pairs.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.named_struct(
...         [("x", dfn.lit(10)), ("y", dfn.lit(20))]
...     ).alias("s")
... )
>>> result.collect_column("s")[0].as_py() == {"x": 10, "y": 20}
True
datafusion.functions.nanvl(x: datafusion.expr.Expr, y: datafusion.expr.Expr) datafusion.expr.Expr

Returns x if x is not NaN. Otherwise returns y.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [np.nan, 1.0], "b": [0.0, 0.0]})
>>> nanvl_df = df.select(
...     dfn.functions.nanvl(dfn.col("a"), dfn.col("b")).alias("nanvl"))
>>> nanvl_df.collect_column("nanvl")[0].as_py()
0.0
>>> nanvl_df.collect_column("nanvl")[1].as_py()
1.0
datafusion.functions.now() datafusion.expr.Expr

Returns the current timestamp in nanoseconds.

This will use the same value for all instances of now() in the same statement.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.now().alias("now")
... )

Use .value instead of .as_py() because nanosecond timestamps require pandas to convert to Python datetime objects.

>>> result.collect_column("now")[0].value > 0
True
datafusion.functions.nth_value(expression: datafusion.expr.Expr, n: int, filter: datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None, null_treatment: datafusion.common.NullTreatment = NullTreatment.RESPECT_NULLS) datafusion.expr.Expr

Returns the n-th value in a group of values.

This aggregate function will return the n-th value in the partition.

If using the builder functions described in the aggregation section of the online documentation, this function ignores the option distinct.

Parameters:
  • expression – Argument to return the n-th value of

  • n – Index of value to return. Starts at 1.

  • filter – If provided, only compute against rows for which the filter is True

  • order_by – Set the ordering of the expression to evaluate. Accepts column names or expressions.

  • null_treatment – Assign whether to respect or ignore null values.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30]})
>>> result = df.aggregate(
...     [], [dfn.functions.nth_value(
...         dfn.col("a"), 1
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
10
>>> result = df.aggregate(
...     [], [dfn.functions.nth_value(
...         dfn.col("a"), 1,
...         filter=dfn.col("a") > dfn.lit(10),
...         order_by="a",
...         null_treatment=dfn.common.NullTreatment.IGNORE_NULLS,
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
20
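
The combination of filter, ordering, and the 1-based index n can be modeled in plain Python. This sketch covers non-null inputs only, and returning None when n is out of range is an assumption; the helper and its parameters are hypothetical.

```python
def nth_value(values, n, predicate=None, ordered=False):
    """Hypothetical model: filter, optionally sort ascending, pick the n-th (1-based)."""
    kept = [v for v in values if predicate is None or predicate(v)]
    if ordered:
        kept = sorted(kept)
    # n starts at 1, so subtract one for the Python index.
    return kept[n - 1] if 1 <= n <= len(kept) else None


data = [10, 20, 30]
first = nth_value(data, 1)
first_above_ten = nth_value(data, 1, predicate=lambda v: v > 10, ordered=True)
print(first, first_above_ten)
```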
datafusion.functions.ntile(groups: int, partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr

Create an n-tile window function.

This window function divides the ordered window frame into a given number of groups based on the ordering criteria. It then returns which group the current row is assigned to. Here is an example of a DataFrame with a window ordered by descending points and the associated n-tile function:

+--------+-------+
| points | ntile |
+--------+-------+
| 120    | 1     |
| 100    | 1     |
| 80     | 2     |
| 60     | 2     |
| 40     | 3     |
| 20     | 3     |
+--------+-------+
Parameters:
  • groups – Number of groups for the n-tile to be divided into.

  • partition_by – Expressions to partition the window frame on.

  • order_by – Set ordering within the window frame. Accepts column names or expressions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30, 40]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.ntile(
...         2, order_by="a"
...     ).alias("nt"))
>>> result.sort(dfn.col("a")).collect_column("nt").to_pylist()
[1, 1, 2, 2]
>>> df = ctx.from_pydict(
...     {"g": ["a", "a", "b", "b"], "v": [1, 2, 3, 4]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.ntile(
...         2, partition_by=dfn.col("g"), order_by="v",
...     ).alias("nt"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("nt").to_pylist()
[1, 2, 1, 2]
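
The group assignment in the table above can be reproduced with a short plain-Python model. It assumes rows are already ordered and are spread across the groups as evenly as possible, with earlier groups taking any extra rows; the formula is a sketch, not a statement of how the library computes it.

```python
def ntile(n_rows: int, groups: int) -> list:
    """Hypothetical model: 1-based group number for each ordered row."""
    groups = min(groups, n_rows)  # never more groups than rows
    return [i * groups // n_rows + 1 for i in range(n_rows)]


six_rows_three_groups = ntile(6, 3)   # the table above
four_rows_two_groups = ntile(4, 2)    # the first doctest above
uneven = ntile(5, 2)                  # extra row goes to the first group
print(six_rows_three_groups, four_rows_two_groups, uneven)
```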
datafusion.functions.nullif(expr1: datafusion.expr.Expr, expr2: datafusion.expr.Expr) datafusion.expr.Expr

Returns NULL if expr1 equals expr2; otherwise it returns expr1.

This can be used to perform the inverse operation of the COALESCE expression.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2], "b": [1, 3]})
>>> result = df.select(
...     dfn.functions.nullif(dfn.col("a"), dfn.col("b")).alias("nullif"))
>>> result.collect_column("nullif").to_pylist()
[None, 2]
datafusion.functions.nvl(x: datafusion.expr.Expr, y: datafusion.expr.Expr) datafusion.expr.Expr

Returns x if x is not NULL. Otherwise returns y.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [None, 1], "b": [0, 0]})
>>> nvl_df = df.select(
...     dfn.functions.nvl(dfn.col("a"), dfn.col("b")).alias("nvl")
... )
>>> nvl_df.collect_column("nvl")[0].as_py()
0
>>> nvl_df.collect_column("nvl")[1].as_py()
1
datafusion.functions.nvl2(x: datafusion.expr.Expr, y: datafusion.expr.Expr, z: datafusion.expr.Expr) datafusion.expr.Expr

Returns y if x is not NULL. Otherwise returns z.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [None, 1], "b": [10, 20], "c": [30, 40]})
>>> result = df.select(
...     dfn.functions.nvl2(
...         dfn.col("a"), dfn.col("b"), dfn.col("c")).alias("nvl2")
... )
>>> result.collect_column("nvl2")[0].as_py()
30
>>> result.collect_column("nvl2")[1].as_py()
20
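The NULL-handling rules of nullif, nvl, and nvl2 can be summarized in plain Python, with `None` standing in for NULL. The function names below mirror the documented ones but are local stand-ins written for illustration; they operate on single Python values rather than on expressions.

```python
def nullif(expr1, expr2):
    """Return None when both values are equal, otherwise the first value."""
    return None if expr1 == expr2 else expr1


def nvl(x, y):
    """Return x unless it is None, in which case return y."""
    return x if x is not None else y


def nvl2(x, y, z):
    """Return y when x is not None, otherwise z."""
    return y if x is not None else z


# nullif followed by nvl restores the original value: the "inverse of COALESCE" idea.
value = 0
assert nvl(nullif(value, 0), 0) == value
```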
datafusion.functions.octet_length(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the number of bytes of a string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.octet_length(dfn.col("a")).alias("len"))
>>> result.collect_column("len")[0].as_py()
5
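The byte count differs from the character count as soon as the string contains non-ASCII characters. The sketch below illustrates this in plain Python, assuming UTF-8 encoded strings; `octet_length_sketch` is a hypothetical helper, not part of the library.

```python
def octet_length_sketch(text):
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


print(octet_length_sketch("hello"))  # ASCII only: one byte per character
print(octet_length_sketch("héllo"))  # "é" takes two bytes in UTF-8
```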
datafusion.functions.order_by(expr: datafusion.expr.Expr, ascending: bool = True, nulls_first: bool = True) datafusion.expr.SortExpr

Creates a new sort expression.

Examples

>>> sort_expr = dfn.functions.order_by(
...     dfn.col("a"), ascending=False)
>>> sort_expr.ascending()
False
>>> sort_expr = dfn.functions.order_by(
...     dfn.col("a"), ascending=True, nulls_first=False)
>>> sort_expr.nulls_first()
False
datafusion.functions.overlay(string: datafusion.expr.Expr, substring: datafusion.expr.Expr, start: datafusion.expr.Expr, length: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Replace a substring with a new substring.

Replaces the part of string that starts at character position start (the first position is 1) and extends for length characters with substring.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["abcdef"]})
>>> result = df.select(
...     dfn.functions.overlay(dfn.col("a"), dfn.lit("XY"), dfn.lit(3),
...     dfn.lit(2)).alias("o"))
>>> result.collect_column("o")[0].as_py()
'abXYef'
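The positional arithmetic can be sketched with ordinary string slicing. `overlay_sketch` is a hypothetical plain-Python helper, and its default for `length` (the length of the replacement) is an assumption borrowed from common SQL behavior rather than something stated above.

```python
def overlay_sketch(string, substring, start, length=None):
    """Replace `length` characters of `string`, beginning at 1-based `start`."""
    if length is None:
        # Assumed default: replace as many characters as the new substring has.
        length = len(substring)
    head = string[: start - 1]            # characters before the replaced span
    tail = string[start - 1 + length :]   # characters after the replaced span
    return head + substring + tail


print(overlay_sketch("abcdef", "XY", 3, 2))
```

With `length=0` the same slicing inserts the substring without removing anything.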
datafusion.functions.percent_rank(partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr

Create a percent_rank window function.

This window function is similar to rank() except that the returned values are relative ranks scaled to the range 0.0 to 1.0, from the first row to the last. Here is an example of a dataframe with a window ordered by descending points and the associated percent rank:

+--------+--------------+
| points | percent_rank |
+--------+--------------+
| 100    | 0.0          |
| 100    | 0.0          |
| 50     | 0.666667     |
| 25     | 1.0          |
+--------+--------------+
Parameters:
  • partition_by – Expressions to partition the window frame on.

  • order_by – Set ordering within the window frame. Accepts column names or expressions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.percent_rank(
...         order_by="a"
...     ).alias("pr"))
>>> result.sort(dfn.col("a")).collect_column("pr").to_pylist()
[0.0, 0.5, 1.0]
>>> df = ctx.from_pydict(
...     {"g": ["a", "a", "a", "b", "b"], "v": [1, 2, 3, 4, 5]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.percent_rank(
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("pr"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("pr").to_pylist()
[0.0, 0.5, 1.0, 0.0, 1.0]
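The values in the points table are consistent with the usual SQL definition (rank - 1) / (rows - 1). The following plain-Python sketch applies that formula to a single partition; the helper names are hypothetical, and returning 0.0 for a single-row partition is an assumption based on standard SQL behavior.

```python
def rank_sketch(values, descending=False):
    """Rank with gaps: tied values share the position of their first occurrence."""
    ordered = sorted(values, reverse=descending)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        first_position.setdefault(value, position)
    return [first_position[value] for value in values]


def percent_rank_sketch(values, descending=False):
    """Relative rank in [0.0, 1.0], computed as (rank - 1) / (rows - 1)."""
    n = len(values)
    if n <= 1:
        return [0.0] * n  # assumed result for a single-row partition
    return [(rank - 1) / (n - 1) for rank in rank_sketch(values, descending)]


print(percent_rank_sketch([100, 100, 50, 25], descending=True))
```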
datafusion.functions.percentile_cont(sort_expression: datafusion.expr.Expr | datafusion.expr.SortExpr, percentile: float, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the exact percentile of input values using continuous interpolation.

Unlike approx_percentile_cont(), this function computes the exact percentile value rather than an approximation.

If using the builder functions described in the aggregation section, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • sort_expression – Values for which to find the percentile

  • percentile – This must be between 0.0 and 1.0, inclusive

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.percentile_cont(
...         dfn.col("a"), 0.5
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3.0
>>> result = df.aggregate(
...     [], [dfn.functions.percentile_cont(
...         dfn.col("a"), 0.5,
...         filter=dfn.col("a") > dfn.lit(1.0),
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3.5
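Both results above follow from linear interpolation between the two nearest sorted values. A minimal plain-Python sketch of that interpolation is given here; it is not DataFusion's implementation, `percentile_cont_sketch` is a hypothetical name, and skipping `None` values is an assumption.

```python
import math


def percentile_cont_sketch(values, percentile):
    """Exact percentile using linear interpolation between sorted neighbors."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be between 0.0 and 1.0, inclusive")
    ordered = sorted(v for v in values if v is not None)  # assumed: nulls are skipped
    if not ordered:
        return None
    position = percentile * (len(ordered) - 1)  # fractional 0-based index
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


print(percentile_cont_sketch([1.0, 2.0, 3.0, 4.0, 5.0], 0.5))
print(percentile_cont_sketch([2.0, 3.0, 4.0, 5.0], 0.5))
```

The second call mirrors the filtered example: with four values the median position falls halfway between 3.0 and 4.0.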
datafusion.functions.pi() datafusion.expr.Expr

Returns an approximate value of π.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> from math import pi
>>> result = df.select(
...     dfn.functions.pi().alias("pi")
... )
>>> result.collect_column("pi")[0].as_py() == pi
True
datafusion.functions.pow(base: datafusion.expr.Expr, exponent: datafusion.expr.Expr) datafusion.expr.Expr

Returns base raised to the power of exponent.

See also

This is an alias of power().

datafusion.functions.power(base: datafusion.expr.Expr, exponent: datafusion.expr.Expr) datafusion.expr.Expr

Returns base raised to the power of exponent.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [2.0]})
>>> result = df.select(
...     dfn.functions.power(dfn.col("a"), dfn.lit(3.0)).alias("pow")
... )
>>> result.collect_column("pow")[0].as_py()
8.0
datafusion.functions.quantile_cont(sort_expression: datafusion.expr.Expr | datafusion.expr.SortExpr, percentile: float, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the exact percentile of input values using continuous interpolation.

See also

This is an alias for percentile_cont().

datafusion.functions.radians(arg: datafusion.expr.Expr) datafusion.expr.Expr

Converts the argument from degrees to radians.

Examples

>>> from math import pi
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [180.0]})
>>> result = df.select(
...     dfn.functions.radians(dfn.col("a")).alias("rad")
... )
>>> result.collect_column("rad")[0].as_py() == pi
True
datafusion.functions.random() datafusion.expr.Expr

Returns a random value in the range 0.0 <= x < 1.0.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.random().alias("r")
... )
>>> val = result.collect_column("r")[0].as_py()
>>> 0.0 <= val < 1.0
True
datafusion.functions.range(start: datafusion.expr.Expr, stop: datafusion.expr.Expr, step: datafusion.expr.Expr) datafusion.expr.Expr

Create a list of values in the range between start and stop.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.range(dfn.lit(0), dfn.lit(5), dfn.lit(2)).alias("r"))
>>> result.collect_column("r")[0].as_py()
[0, 2, 4]
datafusion.functions.rank(partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr

Create a rank window function.

Returns the rank based upon the window order. Consecutive equal values receive the same rank, and the next different value receives the number of rows that precede it plus one rather than the next consecutive rank. This is similar to Olympic medals: if two people tie for gold, the next place is bronze and there is no silver medal. You should set order_by to produce meaningful results.

Here is an example of a dataframe with a window ordered by descending points and the associated rank:

+--------+------+
| points | rank |
+--------+------+
| 100    | 1    |
| 100    | 1    |
| 50     | 3    |
| 25     | 4    |
+--------+------+
Parameters:
  • partition_by – Expressions to partition the window frame on.

  • order_by – Set ordering within the window frame. Accepts column names or expressions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 10, 20]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.rank(
...         order_by="a"
...     ).alias("rnk")
... )
>>> result.sort(dfn.col("a")).collect_column("rnk").to_pylist()
[1, 1, 3]
>>> df = ctx.from_pydict(
...     {"g": ["a", "a", "b", "b"], "v": [1, 1, 2, 3]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.rank(
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("rnk"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("rnk").to_pylist()
[1, 1, 1, 2]
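The "rows that precede it plus one" rule can be written out in a few lines of plain Python. This sketch handles a single partition and uses a hypothetical helper name; it illustrates the ranking rule and is not the library's implementation.

```python
def rank_sketch(values, descending=False):
    """Rank with gaps: each value gets 1 + the number of rows ordered before it."""
    ordered = sorted(values, reverse=descending)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        # Only the first occurrence sets the rank, so ties share it.
        first_position.setdefault(value, position)
    return [first_position[value] for value in values]


print(rank_sketch([100, 100, 50, 25], descending=True))
```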
datafusion.functions.regexp_count(string: datafusion.expr.Expr, pattern: datafusion.expr.Expr, start: datafusion.expr.Expr | None = None, flags: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Returns the number of matches in a string.

The optional start argument gives the position (the first position is 1) at which to begin searching for the regular expression.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["abcabc"]})
>>> result = df.select(
...     dfn.functions.regexp_count(
...         dfn.col("a"), dfn.lit("abc")
...     ).alias("c"))
>>> result.collect_column("c")[0].as_py()
2

Use start to begin searching from a position, and flags for case-insensitive matching:

>>> result = df.select(
...     dfn.functions.regexp_count(
...         dfn.col("a"), dfn.lit("ABC"),
...         start=dfn.lit(4), flags=dfn.lit("i"),
...     ).alias("c"))
>>> result.collect_column("c")[0].as_py()
1
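The effect of the 1-based start and the "i" flag can be imitated with Python's `re` module. This is a rough sketch only: Python's regex dialect differs from the one DataFusion uses, `regexp_count_sketch` is a hypothetical helper, and only the "i" flag is handled.

```python
import re


def regexp_count_sketch(string, pattern, start=1, flags=""):
    """Count non-overlapping matches, searching from 1-based position `start`."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    # Convert the 1-based start to a 0-based slice offset.
    return len(re.findall(pattern, string[start - 1 :], re_flags))


print(regexp_count_sketch("abcabc", "abc"))
print(regexp_count_sketch("abcabc", "ABC", start=4, flags="i"))
```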
datafusion.functions.regexp_instr(values: datafusion.expr.Expr, regex: datafusion.expr.Expr, start: datafusion.expr.Expr | None = None, n: datafusion.expr.Expr | None = None, flags: datafusion.expr.Expr | None = None, sub_expr: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Returns the position of a regular expression match in a string.

Parameters:
  • values – Data to search for the regular expression match.

  • regex – Regular expression to search for.

  • start – Optional position to start the search (the first position is 1).

  • n – Optional occurrence of the match to find (the first occurrence is 1).

  • flags – Optional regular expression flags to control regex behavior.

  • sub_expr – Optional capture group whose position is returned instead of that of the entire match.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello 42 world"]})
>>> result = df.select(
...     dfn.functions.regexp_instr(
...         dfn.col("a"), dfn.lit("\\d+")
...     ).alias("pos")
... )
>>> result.collect_column("pos")[0].as_py()
7

Use start to search from a position, n for the nth occurrence, and flags for case-insensitive mode:

>>> df = ctx.from_pydict({"a": ["abc ABC abc"]})
>>> result = df.select(
...     dfn.functions.regexp_instr(
...         dfn.col("a"), dfn.lit("abc"),
...         start=dfn.lit(2), n=dfn.lit(1),
...         flags=dfn.lit("i"),
...     ).alias("pos")
... )
>>> result.collect_column("pos")[0].as_py()
5

Use sub_expr to get the position of a capture group:

>>> result = df.select(
...     dfn.functions.regexp_instr(
...         dfn.col("a"), dfn.lit("(abc)"),
...         sub_expr=dfn.lit(1),
...     ).alias("pos")
... )
>>> result.collect_column("pos")[0].as_py()
1
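How start, n, and sub_expr interact can be approximated with Python's `re` module. The helper below is a hypothetical sketch under several assumptions: Python's regex dialect stands in for the real one, only the "i" flag is handled, and returning 0 when there is no match is a guess based on common SQL behavior.

```python
import re


def regexp_instr_sketch(string, pattern, start=1, n=1, flags="", sub_expr=0):
    """Return the 1-based position of the n-th match found from `start`."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    compiled = re.compile(pattern, re_flags)
    matches = compiled.finditer(string, start - 1)  # 0-based search offset
    for occurrence, match in enumerate(matches, start=1):
        if occurrence == n:
            # Group 0 is the whole match; a positive sub_expr selects a capture group.
            return match.start(sub_expr) + 1
    return 0  # assumed result when no such match exists


print(regexp_instr_sketch("hello 42 world", r"\d+"))
print(regexp_instr_sketch("abc ABC abc", "abc", start=2, n=1, flags="i"))
```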
datafusion.functions.regexp_like(string: datafusion.expr.Expr, regex: datafusion.expr.Expr, flags: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Find if any regular expression (regex) matches exist.

Tests a string against a regular expression, returning true if there is at least one match and false otherwise.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello123"]})
>>> result = df.select(
...     dfn.functions.regexp_like(
...         dfn.col("a"), dfn.lit("\\d+")
...     ).alias("m")
... )
>>> result.collect_column("m")[0].as_py()
True

Use flags for case-insensitive matching:

>>> result = df.select(
...     dfn.functions.regexp_like(
...         dfn.col("a"), dfn.lit("HELLO"),
...         flags=dfn.lit("i"),
...     ).alias("m")
... )
>>> result.collect_column("m")[0].as_py()
True
datafusion.functions.regexp_match(string: datafusion.expr.Expr, regex: datafusion.expr.Expr, flags: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Perform regular expression (regex) matching.

Returns an array in which each element contains the leftmost-first match of the corresponding index in regex against string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello 42 world"]})
>>> result = df.select(
...     dfn.functions.regexp_match(
...         dfn.col("a"), dfn.lit("(\\d+)")
...     ).alias("m")
... )
>>> result.collect_column("m")[0].as_py()
['42']

Use flags for case-insensitive matching:

>>> result = df.select(
...     dfn.functions.regexp_match(
...         dfn.col("a"), dfn.lit("(HELLO)"),
...         flags=dfn.lit("i"),
...     ).alias("m")
... )
>>> result.collect_column("m")[0].as_py()
['hello']
datafusion.functions.regexp_replace(string: datafusion.expr.Expr, pattern: datafusion.expr.Expr, replacement: datafusion.expr.Expr, flags: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Replaces substring(s) matching a PCRE-like regular expression.

The full list of supported features and syntax can be found at <https://docs.rs/regex/latest/regex/#syntax>

Supported flags with the addition of ‘g’ can be found at <https://docs.rs/regex/latest/regex/#grouping-and-flags>

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello 42"]})
>>> result = df.select(
...     dfn.functions.regexp_replace(
...         dfn.col("a"), dfn.lit("\\d+"),
...         dfn.lit("XX")
...     ).alias("r")
... )
>>> result.collect_column("r")[0].as_py()
'hello XX'

Use the g flag to replace all occurrences:

>>> df = ctx.from_pydict({"a": ["a1 b2 c3"]})
>>> result = df.select(
...     dfn.functions.regexp_replace(
...         dfn.col("a"), dfn.lit("\\d+"),
...         dfn.lit("X"), flags=dfn.lit("g"),
...     ).alias("r")
... )
>>> result.collect_column("r")[0].as_py()
'aX bX cX'
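The difference between replacing the first match and replacing every match can be imitated with Python's `re.sub`. This is a sketch under assumptions: Python's regex and replacement syntax are not identical to the Rust regex syntax linked above, `regexp_replace_sketch` is a hypothetical helper, and only the "g" and "i" flags are handled.

```python
import re


def regexp_replace_sketch(string, pattern, replacement, flags=""):
    """Replace the first match, or every match when the 'g' flag is present."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    # In re.sub, count=0 means "replace all"; count=1 stops after the first match.
    count = 0 if "g" in flags else 1
    return re.sub(pattern, replacement, string, count=count, flags=re_flags)


print(regexp_replace_sketch("a1 b2 c3", r"\d+", "X"))
print(regexp_replace_sketch("a1 b2 c3", r"\d+", "X", flags="g"))
```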
datafusion.functions.regr_avgx(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the average of the independent variable x.

This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.

If using the builder functions described in the aggregation section, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • y – The linear regression dependent variable

  • x – The linear regression independent variable

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [4.0, 5.0, 6.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_avgx(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
5.0
>>> result = df.aggregate(
...     [], [dfn.functions.regr_avgx(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
5.5
datafusion.functions.regr_avgy(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the average of the dependent variable y.

This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.

If using the builder functions described in the aggregation section, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • y – The linear regression dependent variable

  • x – The linear regression independent variable

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [4.0, 5.0, 6.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_avgy(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> result = df.aggregate(
...     [], [dfn.functions.regr_avgy(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.5
datafusion.functions.regr_count(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Counts the number of rows in which both expressions are not null.

This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.

If using the builder functions described in the aggregation section, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • y – The linear regression dependent variable

  • x – The linear regression independent variable

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [4.0, 5.0, 6.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_count(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3
>>> result = df.aggregate(
...     [], [dfn.functions.regr_count(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2
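The rule that only non-null pairs are evaluated affects regr_count, regr_avgx, and regr_avgy alike. A plain-Python sketch of that pairing rule follows, with `None` standing in for NULL; the helper names are hypothetical and this is not the library's implementation.

```python
def non_null_pairs(ys, xs):
    """Keep only the (y, x) pairs in which both values are present."""
    return [(y, x) for y, x in zip(ys, xs) if y is not None and x is not None]


def regr_count_sketch(ys, xs):
    return len(non_null_pairs(ys, xs))


def regr_avgx_sketch(ys, xs):
    pairs = non_null_pairs(ys, xs)
    return sum(x for _, x in pairs) / len(pairs) if pairs else None


def regr_avgy_sketch(ys, xs):
    pairs = non_null_pairs(ys, xs)
    return sum(y for y, _ in pairs) / len(pairs) if pairs else None


ys = [1.0, None, 3.0]
xs = [4.0, 5.0, None]
# Only the first row has both values, so it alone contributes.
print(regr_count_sketch(ys, xs), regr_avgx_sketch(ys, xs), regr_avgy_sketch(ys, xs))
```

Note that a row is dropped from the x average even when only its y value is missing.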
datafusion.functions.regr_intercept(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the intercept from the linear regression.

This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.

If using the builder functions described in the aggregation section, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • y – The linear regression dependent variable

  • x – The linear regression independent variable

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [2.0, 4.0, 6.0], "x": [4.0, 16.0, 36.0]})
>>> result = df.aggregate(
...     [],
...     [dfn.functions.regr_intercept(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.714...
>>> result = df.aggregate(
...     [],
...     [dfn.functions.regr_intercept(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(2.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.4
datafusion.functions.regr_r2(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the R-squared value from linear regression.

This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.

If using the builder functions described in the aggregation section, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • y – The linear regression dependent variable

  • x – The linear regression independent variable

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [2.0, 4.0, 6.0], "x": [4.0, 16.0, 36.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_r2(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.9795...
>>> result = df.aggregate(
...     [], [dfn.functions.regr_r2(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(2.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.0
datafusion.functions.regr_slope(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the slope from linear regression.

This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.

If using the builder functions described in the aggregation section, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • y – The linear regression dependent variable

  • x – The linear regression independent variable

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [2.0, 4.0, 6.0], "x": [4.0, 16.0, 36.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_slope(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.122...
>>> result = df.aggregate(
...     [], [dfn.functions.regr_slope(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(2.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.1
datafusion.functions.regr_sxx(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the sum of squared deviations of the independent variable x from its mean.

This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.

If using the builder functions described in the aggregation section, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • y – The linear regression dependent variable

  • x – The linear regression independent variable

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_sxx(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> result = df.aggregate(
...     [], [dfn.functions.regr_sxx(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.5
datafusion.functions.regr_sxy(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the sum of products of the paired deviations of x and y from their means.

This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.

If using the builder functions described in the aggregation section, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • y – The linear regression dependent variable

  • x – The linear regression independent variable

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_sxy(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> result = df.aggregate(
...     [], [dfn.functions.regr_sxy(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.5
datafusion.functions.regr_syy(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the sum of squared deviations of the dependent variable y from its mean.

This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.

If using the builder functions described in the aggregation section, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • y – The linear regression dependent variable

  • x – The linear regression independent variable

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_syy(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> result = df.aggregate(
...     [], [dfn.functions.regr_syy(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.5
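The regr_sxx, regr_sxy, regr_syy, regr_slope, regr_intercept, and regr_r2 results shown in the examples can be cross-checked with the textbook least-squares formulas. The sketch below is a plain-Python illustration, not DataFusion's implementation; `regr_summary_sketch` is a hypothetical helper, and returning `None` for degenerate inputs is an assumption.

```python
def regr_summary_sketch(ys, xs):
    """Least-squares summary statistics over the non-null (y, x) pairs."""
    pairs = [(y, x) for y, x in zip(ys, xs) if y is not None and x is not None]
    if not pairs:
        raise ValueError("need at least one non-null (y, x) pair")
    n = len(pairs)
    mean_y = sum(y for y, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    # Sums of squared deviations and of paired deviations from the means.
    sxx = sum((x - mean_x) ** 2 for _, x in pairs)
    syy = sum((y - mean_y) ** 2 for y, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for y, x in pairs)
    slope = sxy / sxx if sxx else None  # undefined when x has no variation
    intercept = mean_y - slope * mean_x if slope is not None else None
    r2 = sxy**2 / (sxx * syy) if sxx and syy else None
    return {"sxx": sxx, "syy": syy, "sxy": sxy,
            "slope": slope, "intercept": intercept, "r2": r2}


stats = regr_summary_sketch([2.0, 4.0, 6.0], [4.0, 16.0, 36.0])
print(stats["slope"], stats["intercept"], stats["r2"])
```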
datafusion.functions.repeat(string: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr

Repeats the string n times.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["ha"]})
>>> result = df.select(
...     dfn.functions.repeat(dfn.col("a"), dfn.lit(3)).alias("r"))
>>> result.collect_column("r")[0].as_py()
'hahaha'
datafusion.functions.replace(string: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr

Replaces all occurrences of from_val with to_val in the string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello world"]})
>>> result = df.select(
...     dfn.functions.replace(dfn.col("a"), dfn.lit("world"),
...     dfn.lit("there")).alias("r"))
>>> result.collect_column("r")[0].as_py()
'hello there'
datafusion.functions.reverse(arg: datafusion.expr.Expr) datafusion.expr.Expr

Reverse the string argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.reverse(dfn.col("a")).alias("r"))
>>> result.collect_column("r")[0].as_py()
'olleh'
datafusion.functions.right(string: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr

Returns the last n characters in the string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.right(dfn.col("a"), dfn.lit(3)).alias("r"))
>>> result.collect_column("r")[0].as_py()
'llo'
datafusion.functions.round(value: datafusion.expr.Expr, decimal_places: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Round the argument to the nearest integer.

If the optional decimal_places is specified, round to that number of decimal places. You can specify a negative number of decimal places. For example round(lit(125.2345), lit(-2)) would yield a value of 100.0.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.567]})
>>> result = df.select(dfn.functions.round(dfn.col("a"), dfn.lit(2)).alias("r"))
>>> result.collect_column("r")[0].as_py()
1.57
datafusion.functions.row(*args: datafusion.expr.Expr) datafusion.expr.Expr

Returns a struct with the given arguments.

See also

This is an alias for struct().

datafusion.functions.row_number(partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr

Create a row number window function.

Returns the number of the current row within its window partition, counting from 1.

Here is an example of the row_number on a simple DataFrame:

+--------+------------+
| points | row number |
+--------+------------+
| 100    | 1          |
| 100    | 2          |
| 50     | 3          |
| 25     | 4          |
+--------+------------+
Parameters:
  • partition_by – Expressions to partition the window frame on.

  • order_by – Set ordering within the window frame. Accepts column names or expressions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.row_number(
...         order_by="a"
...     ).alias("rn"))
>>> result.sort(dfn.col("a")).collect_column("rn").to_pylist()
[1, 2, 3]
>>> df = ctx.from_pydict(
...     {"g": ["a", "a", "b", "b"], "v": [1, 2, 3, 4]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.row_number(
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("rn"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("rn").to_pylist()
[1, 2, 1, 2]
datafusion.functions.rpad(string: datafusion.expr.Expr, count: datafusion.expr.Expr, characters: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Add right padding to a string.

Extends the string to length count by appending characters (a space by default). If the string is already longer than count then it is truncated.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hi"]})
>>> result = df.select(
...     dfn.functions.rpad(dfn.col("a"), dfn.lit(5), dfn.lit("!")).alias("r"))
>>> result.collect_column("r")[0].as_py()
'hi!!!'
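The pad-or-truncate behavior described above can be sketched in plain Python. `rpad_sketch` is a hypothetical helper; cycling through a multi-character fill and returning the string unchanged for an empty fill are assumptions based on common SQL behavior.

```python
def rpad_sketch(string, count, characters=" "):
    """Pad `string` on the right to `count` characters, truncating if it is longer."""
    if count <= len(string):
        return string[:count]  # already long enough: truncate
    if not characters:
        return string  # assumed: nothing to pad with
    missing = count - len(string)
    repeats = -(-missing // len(characters))  # ceiling division
    return string + (characters * repeats)[:missing]


print(repr(rpad_sketch("hi", 5, "!")))
print(repr(rpad_sketch("hello", 3)))
```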
datafusion.functions.rtrim(arg: datafusion.expr.Expr) datafusion.expr.Expr

Removes all characters, spaces by default, from the end of a string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [" a  "]})
>>> trim_df = df.select(dfn.functions.rtrim(dfn.col("a")).alias("trimmed"))
>>> trim_df.collect_column("trimmed")[0].as_py()
' a'
datafusion.functions.sha224(arg: datafusion.expr.Expr) datafusion.expr.Expr

Computes the SHA-224 hash of a binary string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.sha224(dfn.col("a")).alias("h")
... )
>>> result.collect_column("h")[0].as_py().hex()
'ea09ae9cc6768c50fcee903ed054556e5bfc8347907f12598aa24193'
datafusion.functions.sha256(arg: datafusion.expr.Expr) datafusion.expr.Expr

Computes the SHA-256 hash of a binary string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.sha256(dfn.col("a")).alias("h")
... )
>>> result.collect_column("h")[0].as_py().hex()
'2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'
datafusion.functions.sha384(arg: datafusion.expr.Expr) datafusion.expr.Expr

Computes the SHA-384 hash of a binary string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.sha384(dfn.col("a")).alias("h")
... )
>>> result.collect_column("h")[0].as_py().hex()
'59e1748777448c69de6b800d7a33bbfb9ff1b...
datafusion.functions.sha512(arg: datafusion.expr.Expr) datafusion.expr.Expr

Computes the SHA-512 hash of a binary string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.sha512(dfn.col("a")).alias("h")
... )
>>> result.collect_column("h")[0].as_py().hex()
'9b71d224bd62f3785d96d46ad3ea3d73319bfb...
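The SHA-2 digests in the examples above can be cross-checked against Python's standard `hashlib` module. This sketch assumes the input string is hashed as its UTF-8 bytes; `sha2_hex` is a hypothetical helper written for illustration.

```python
import hashlib


def sha2_hex(algorithm, text):
    """Hex digest of the UTF-8 bytes of `text` for a named hashlib algorithm."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


for name in ("sha224", "sha256", "sha384", "sha512"):
    digest = sha2_hex(name, "hello")
    # Two hex characters per byte, so the byte length is half the string length.
    print(name, len(digest) // 2, digest)
```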
datafusion.functions.signum(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the sign of the argument (-1, 0, +1).

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [-5.0, 0.0, 5.0]})
>>> result = df.select(dfn.functions.signum(dfn.col("a")).alias("s"))
>>> result.collect_column("s").to_pylist()
[-1.0, 0.0, 1.0]
datafusion.functions.sin(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the sine of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.sin(dfn.col("a")).alias("sin"))
>>> result.collect_column("sin")[0].as_py()
0.0
datafusion.functions.sinh(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the hyperbolic sine of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.sinh(dfn.col("a")).alias("sinh"))
>>> result.collect_column("sinh")[0].as_py()
0.0
datafusion.functions.split_part(string: datafusion.expr.Expr, delimiter: datafusion.expr.Expr, index: datafusion.expr.Expr) datafusion.expr.Expr

Split a string and return one part.

Splits a string on a delimiter and returns the field at the given index. The index is 1-based, so an index of 2 returns the second field.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["a,b,c"]})
>>> result = df.select(
...     dfn.functions.split_part(
...         dfn.col("a"), dfn.lit(","), dfn.lit(2)
...     ).alias("s"))
>>> result.collect_column("s")[0].as_py()
'b'
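
The indexing rule can be modelled in plain Python. The helper below is a hypothetical reference model, not DataFusion's implementation, and its handling of an out-of-range index is an assumption.

```python
def split_part(string: str, delimiter: str, index: int) -> str:
    """Pure-Python model: return the field at a 1-based index."""
    parts = string.split(delimiter)
    if 1 <= index <= len(parts):
        return parts[index - 1]
    # Assumption: an index past the last field yields an empty string.
    return ""


print(split_part("a,b,c", ",", 2))  # b
```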
datafusion.functions.sqrt(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the square root of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [9.0]})
>>> result = df.select(dfn.functions.sqrt(dfn.col("a")).alias("sqrt"))
>>> result.collect_column("sqrt")[0].as_py()
3.0
datafusion.functions.starts_with(string: datafusion.expr.Expr, prefix: datafusion.expr.Expr) datafusion.expr.Expr

Returns true if string starts with prefix.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello_from_datafusion"]})
>>> result = df.select(
...     dfn.functions.starts_with(dfn.col("a"), dfn.lit("hello")).alias("sw"))
>>> result.collect_column("sw")[0].as_py()
True
datafusion.functions.stddev(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the standard deviation of the argument.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – The value to compute the standard deviation of

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [2.0, 4.0, 6.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.stddev(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> result = df.aggregate(
...     [], [dfn.functions.stddev(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(2.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.41...
datafusion.functions.stddev_pop(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the population standard deviation of the argument.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – The value to compute the population standard deviation of

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0, 1.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.stddev_pop(
...         dfn.col("a")
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
1.247...
>>> df = ctx.from_pydict({"a": [0.0, 1.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.stddev_pop(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(0.0)
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
1.0
datafusion.functions.stddev_samp(arg: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the sample standard deviation of the argument.

See also

This is an alias for stddev().
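
The difference between the sample form (stddev, stddev_samp) and the population form (stddev_pop) is only the divisor. As an illustration outside DataFusion, Python's statistics module exposes the same two definitions; the input reuses the values from the stddev example.

```python
import math
import statistics

values = [2.0, 4.0, 6.0]

# Sample standard deviation: squared deviations are divided by n - 1.
sample_sd = statistics.stdev(values)

# Population standard deviation: squared deviations are divided by n.
population_sd = statistics.pstdev(values)

print(sample_sd, population_sd)
```

For these three values the sample result is 2.0, matching the stddev example, while the population result is smaller because it divides by 3 rather than 2.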

datafusion.functions.string_agg(expression: datafusion.expr.Expr, delimiter: str, filter: datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr

Concatenates the input strings.

This aggregate function will concatenate input strings, ignoring null values, and separating them with the specified delimiter. Non-string values will be converted to their string equivalents.

If using the builder functions described in the aggregation documentation, this function ignores the options distinct and null_treatment.

Parameters:
  • expression – The values to concatenate

  • delimiter – Text to place between each value of expression

  • filter – If provided, only compute against rows for which the filter is True

  • order_by – Set the ordering of the expression to evaluate. Accepts column names or expressions.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["x", "y", "z"]})
>>> result = df.aggregate(
...     [], [dfn.functions.string_agg(
...         dfn.col("a"), ",", order_by="a"
...     ).alias("s")])
>>> result.collect_column("s")[0].as_py()
'x,y,z'
>>> result = df.aggregate(
...     [], [dfn.functions.string_agg(
...         dfn.col("a"), ",",
...         filter=dfn.col("a") > dfn.lit("x"),
...         order_by="a",
...     ).alias("s")])
>>> result.collect_column("s")[0].as_py()
'y,z'
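
The null handling and string conversion described above can be sketched in plain Python. The helper is a hypothetical model of the documented behaviour, not the library's code, and it keeps values in input order.

```python
def string_agg(values, delimiter: str) -> str:
    """Pure-Python model: join values, skipping nulls."""
    # None stands in for NULL; other values are converted with str().
    kept = [str(v) for v in values if v is not None]
    return delimiter.join(kept)


print(string_agg(["x", None, "y", 3], ","))  # x,y,3
```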
datafusion.functions.string_to_array(string: datafusion.expr.Expr, delimiter: datafusion.expr.Expr, null_string: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Splits a string based on a delimiter and returns an array of parts.

Any parts matching the optional null_string will be replaced with NULL.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello,world"]})
>>> result = df.select(
...     dfn.functions.string_to_array(
...         dfn.col("a"), dfn.lit(","),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
['hello', 'world']

Replace parts matching a null_string with NULL:

>>> result = df.select(
...     dfn.functions.string_to_array(
...         dfn.col("a"), dfn.lit(","), null_string=dfn.lit("world"),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
['hello', None]
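
The effect of null_string can be modelled in a few lines of plain Python. This is an illustrative sketch with a hypothetical helper, where None plays the role of NULL.

```python
def string_to_array(string: str, delimiter: str, null_string=None):
    """Pure-Python model: split, then replace matching parts with None."""
    parts = string.split(delimiter)
    if null_string is None:
        return parts
    return [None if part == null_string else part for part in parts]


print(string_to_array("hello,world", ",", null_string="world"))  # ['hello', None]
```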
datafusion.functions.string_to_list(string: datafusion.expr.Expr, delimiter: datafusion.expr.Expr, null_string: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Splits a string based on a delimiter and returns an array of parts.

See also

This is an alias for string_to_array().

datafusion.functions.strpos(string: datafusion.expr.Expr, substring: datafusion.expr.Expr) datafusion.expr.Expr

Returns the 1-based position of the first occurrence of substring within string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.strpos(dfn.col("a"), dfn.lit("llo")).alias("pos"))
>>> result.collect_column("pos")[0].as_py()
3
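
In plain Python terms the result is the 0-based index of the match plus one. The sketch below is a hypothetical model; returning 0 when there is no match is an assumption based on the usual SQL convention.

```python
def strpos(string: str, substring: str) -> int:
    """Pure-Python model: 1-based position of the first match."""
    # str.find returns -1 when there is no match, which becomes 0 here
    # (assumption: 0 signals "not found", as in common SQL dialects).
    return string.find(substring) + 1


print(strpos("hello", "llo"))  # 3
```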
datafusion.functions.struct(*args: datafusion.expr.Expr) datafusion.expr.Expr

Returns a struct with the given arguments.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1], "b": [2]})
>>> result = df.select(
...     dfn.functions.struct(
...         dfn.col("a"), dfn.col("b")
...     ).alias("s")
... )

Children in the new struct will always be c0, …, cN-1 for N children.

>>> result.collect_column("s")[0].as_py() == {"c0": 1, "c1": 2}
True
datafusion.functions.substr(string: datafusion.expr.Expr, position: datafusion.expr.Expr) datafusion.expr.Expr

Returns the substring from position to the end of the string. The position is 1-based.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.substr(dfn.col("a"), dfn.lit(3)).alias("s"))
>>> result.collect_column("s")[0].as_py()
'llo'
datafusion.functions.substr_index(string: datafusion.expr.Expr, delimiter: datafusion.expr.Expr, count: datafusion.expr.Expr) datafusion.expr.Expr

Returns an indexed substring.

Returns the part of string that precedes the count-th occurrence of delimiter.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["a.b.c"]})
>>> result = df.select(
...     dfn.functions.substr_index(dfn.col("a"), dfn.lit("."),
...     dfn.lit(2)).alias("s"))
>>> result.collect_column("s")[0].as_py()
'a.b'
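
For a positive count, the result is the first count fields joined back together with the delimiter. The helper below is a hypothetical plain-Python model that covers only that case.

```python
def substr_index(string: str, delimiter: str, count: int) -> str:
    """Pure-Python model for positive counts only."""
    if count <= 0:
        # Zero and negative counts are outside the scope of this sketch.
        raise ValueError("this sketch only models positive counts")
    return delimiter.join(string.split(delimiter)[:count])


print(substr_index("a.b.c", ".", 2))  # a.b
```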
datafusion.functions.substring(string: datafusion.expr.Expr, position: datafusion.expr.Expr, length: datafusion.expr.Expr) datafusion.expr.Expr

Returns the substring of length characters starting at position. The position is 1-based.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello world"]})
>>> result = df.select(
...     dfn.functions.substring(
...         dfn.col("a"), dfn.lit(1), dfn.lit(5)
...     ).alias("s"))
>>> result.collect_column("s")[0].as_py()
'hello'
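
Both substr and substring count positions from 1, which maps onto Python slicing as shown in this sketch. The helpers are hypothetical models and assume position is at least 1.

```python
def substr(string: str, position: int) -> str:
    """Model of substr: from a 1-based position to the end."""
    return string[position - 1:]


def substring(string: str, position: int, length: int) -> str:
    """Model of substring: length characters from a 1-based position."""
    start = position - 1
    return string[start:start + length]


print(substr("hello", 3))              # llo
print(substring("hello world", 1, 5))  # hello
```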
datafusion.functions.sum(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the sum of a set of numbers.

This aggregate function expects a numeric expression.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – The values to sum

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.sum(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
6
>>> result = df.aggregate(
...     [], [dfn.functions.sum(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
5
datafusion.functions.tan(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the tangent of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.tan(dfn.col("a")).alias("tan"))
>>> result.collect_column("tan")[0].as_py()
0.0
datafusion.functions.tanh(arg: datafusion.expr.Expr) datafusion.expr.Expr

Returns the hyperbolic tangent of the argument.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.tanh(dfn.col("a")).alias("tanh"))
>>> result.collect_column("tanh")[0].as_py()
0.0
datafusion.functions.to_char(arg: datafusion.expr.Expr, formatter: datafusion.expr.Expr) datafusion.expr.Expr

Returns a string representation of a date, time, timestamp or duration.

For usage of formatter, see the strftime module of the Rust chrono crate.

[Documentation here.](https://docs.rs/chrono/latest/chrono/format/strftime/index.html)

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_char(
...         dfn.functions.to_timestamp(dfn.col("a")),
...         dfn.lit("%Y/%m/%d"),
...     ).alias("formatted")
... )
>>> result.collect_column("formatted")[0].as_py()
'2021/01/01'
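
Common specifiers such as %Y, %m, %d, %H, %M and %S behave the same way in Python's datetime.strftime, which makes it a convenient place to try a pattern before passing it as a formatter. This is only an analogy: chrono and Python differ on some less common specifiers.

```python
from datetime import datetime

ts = datetime(2021, 1, 1, 0, 0, 0)

# Same pattern as the to_char example above.
date_only = ts.strftime("%Y/%m/%d")

# A pattern that also includes the time of day.
date_and_time = ts.strftime("%Y-%m-%d %H:%M:%S")

print(date_only)      # 2021/01/01
print(date_and_time)  # 2021-01-01 00:00:00
```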
datafusion.functions.to_date(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr

Converts a value to a date (YYYY-MM-DD).

Supports strings, numeric and timestamp types as input. Integers and doubles are interpreted as days since the unix epoch. Strings are parsed as YYYY-MM-DD (e.g. ‘2023-07-20’) if formatters are not provided.

For usage of formatters, see the strftime module of the Rust chrono crate.

[Documentation here.](https://docs.rs/chrono/latest/chrono/format/strftime/index.html)

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-07-20"]})
>>> result = df.select(
...     dfn.functions.to_date(dfn.col("a")).alias("dt"))
>>> str(result.collect_column("dt")[0].as_py())
'2021-07-20'
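
Two of the input forms described above can be illustrated with Python's datetime module: a numeric input counted in days since the Unix epoch, and a string parsed with an explicit format. The names below are illustrative and the code is independent of DataFusion.

```python
from datetime import date, datetime, timedelta

EPOCH = date(1970, 1, 1)


def date_from_days(days: int) -> date:
    """Interpret an integer as days since the Unix epoch."""
    return EPOCH + timedelta(days=days)


# Number of days between the epoch and the date used in the example above.
days = (date(2021, 7, 20) - EPOCH).days

# Parsing a non-ISO string needs an explicit pattern, as with formatters.
parsed = datetime.strptime("20/07/2021", "%d/%m/%Y").date()

print(date_from_days(days), parsed)
```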
datafusion.functions.to_hex(arg: datafusion.expr.Expr) datafusion.expr.Expr

Converts an integer to a hexadecimal string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [255]})
>>> result = df.select(dfn.functions.to_hex(dfn.col("a")).alias("hex"))
>>> result.collect_column("hex")[0].as_py()
'ff'
datafusion.functions.to_local_time(*args: datafusion.expr.Expr) datafusion.expr.Expr

Converts a timestamp with a timezone to a timestamp without a timezone.

This function handles daylight saving time changes.
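
No example is given for this function. As a rough analogy in plain Python, the sketch below converts an instant to a zone's wall-clock time and then drops the zone information. It assumes that to_local_time keeps the local wall-clock reading, and it uses a hypothetical fixed UTC+02:00 offset, so it does not demonstrate daylight saving transitions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical zone with a fixed +02:00 offset (no daylight saving rules).
plus_two = timezone(timedelta(hours=2))

# An instant expressed in UTC, then viewed in the +02:00 zone.
aware = datetime(2021, 7, 1, 12, 0, tzinfo=timezone.utc).astimezone(plus_two)

# Dropping tzinfo keeps the local wall-clock fields unchanged.
local_naive = aware.replace(tzinfo=None)

print(local_naive)  # 2021-07-01 14:00:00
```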

datafusion.functions.to_time(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr

Converts a value to a time. Supports strings and timestamps as input.

If formatters are not provided, strings are parsed as HH:MM:SS, HH:MM, or HH:MM:SS.nnnnnnnnn.

For usage of formatters, see the strftime module of the Rust chrono crate.

[Documentation here.](https://docs.rs/chrono/latest/chrono/format/strftime/index.html)

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["14:30:00"]})
>>> result = df.select(
...     dfn.functions.to_time(dfn.col("a")).alias("t"))
>>> str(result.collect_column("t")[0].as_py())
'14:30:00'
datafusion.functions.to_timestamp(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr

Converts a string and optional formats to a Timestamp in nanoseconds.

For usage of formatters, see the strftime module of the Rust chrono crate.

[Documentation here.](https://docs.rs/chrono/latest/chrono/format/strftime/index.html)

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_timestamp(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'2021-01-01 00:00:00'
datafusion.functions.to_timestamp_micros(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr

Converts a string and optional formats to a Timestamp in microseconds.

See to_timestamp() for a description of how to use formatters.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_timestamp_micros(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'2021-01-01 00:00:00'
datafusion.functions.to_timestamp_millis(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr

Converts a string and optional formats to a Timestamp in milliseconds.

See to_timestamp() for a description of how to use formatters.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_timestamp_millis(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'2021-01-01 00:00:00'
datafusion.functions.to_timestamp_nanos(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr

Converts a string and optional formats to a Timestamp in nanoseconds.

See to_timestamp() for a description of how to use formatters.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_timestamp_nanos(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'2021-01-01 00:00:00'
datafusion.functions.to_timestamp_seconds(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr

Converts a string and optional formats to a Timestamp in seconds.

See to_timestamp() for a description of how to use formatters.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_timestamp_seconds(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'2021-01-01 00:00:00'
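
The to_timestamp variants print the same value here because they differ only in resolution. The sketch below shows the same instant as an integer count since the Unix epoch in each unit, assuming the Arrow convention of storing timestamps that way and treating the input as UTC.

```python
from datetime import datetime, timezone

# Assumption: the example string is interpreted as UTC.
instant = datetime(2021, 1, 1, tzinfo=timezone.utc)
seconds = int(instant.timestamp())

# One instant, four resolutions.
counts = {
    "seconds": seconds,
    "millis": seconds * 1_000,
    "micros": seconds * 1_000_000,
    "nanos": seconds * 1_000_000_000,
}

for unit, count in counts.items():
    print(unit, count)
```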
datafusion.functions.to_unixtime(string: datafusion.expr.Expr, *format_arguments: datafusion.expr.Expr) datafusion.expr.Expr

Converts a string and optional formats to a Unix timestamp, expressed as the number of seconds since 1970-01-01T00:00:00.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["1970-01-01T00:00:00"]})
>>> result = df.select(dfn.functions.to_unixtime(dfn.col("a")).alias("u"))
>>> result.collect_column("u")[0].as_py()
0
datafusion.functions.translate(string: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr

Replaces the characters in from_val with the counterpart in to_val.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.translate(dfn.col("a"), dfn.lit("helo"),
...     dfn.lit("HELO")).alias("t"))
>>> result.collect_column("t")[0].as_py()
'HELLO'
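
Python's built-in str.translate performs the same kind of character-by-character mapping, which may help when reasoning about the result. This is an analogy only, not DataFusion's implementation.

```python
# Map each character of "helo" to the character at the same position in "HELO".
table = str.maketrans("helo", "HELO")

translated = "hello".translate(table)

print(translated)  # HELLO
```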
datafusion.functions.trim(arg: datafusion.expr.Expr) datafusion.expr.Expr

Removes all characters, spaces by default, from both sides of a string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["  hello  "]})
>>> result = df.select(dfn.functions.trim(dfn.col("a")).alias("t"))
>>> result.collect_column("t")[0].as_py()
'hello'
datafusion.functions.trunc(num: datafusion.expr.Expr, precision: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Truncate the number toward zero with optional precision.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.567]})
>>> result = df.select(
...     dfn.functions.trunc(
...         dfn.col("a")
...     ).alias("t"))
>>> result.collect_column("t")[0].as_py()
1.0
>>> result = df.select(
...     dfn.functions.trunc(
...         dfn.col("a"), precision=dfn.lit(2)
...     ).alias("t"))
>>> result.collect_column("t")[0].as_py()
1.56
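
Truncation toward zero differs from flooring for negative inputs. The helper below is a hypothetical plain-Python model that scales by a power of ten before truncating; because it relies on binary floating point, it can be off by one unit in the last kept digit for some inputs.

```python
import math


def trunc(num: float, precision: int = 0) -> float:
    """Pure-Python model: drop digits beyond `precision`, toward zero."""
    scale = 10 ** precision
    # math.trunc discards the fractional part without rounding.
    return math.trunc(num * scale) / scale


print(trunc(1.567))       # 1.0
print(trunc(1.567, 2))    # 1.56
print(trunc(-1.567))      # -1.0
print(math.floor(-1.567)) # -2
```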
datafusion.functions.union_extract(union_expr: datafusion.expr.Expr, field_name: datafusion.expr.Expr | str) datafusion.expr.Expr

Extracts a value from a union type by field name.

Returns the value of the named field if it is the currently selected variant, otherwise returns NULL.

Examples

>>> ctx = dfn.SessionContext()
>>> types = pa.array([0, 1, 0], type=pa.int8())
>>> offsets = pa.array([0, 0, 1], type=pa.int32())
>>> arr = pa.UnionArray.from_dense(
...     types, offsets, [pa.array([1, 2]), pa.array(["hi"])],
...     ["int", "str"], [0, 1],
... )
>>> batch = pa.RecordBatch.from_arrays([arr], names=["u"])
>>> df = ctx.create_dataframe([[batch]])
>>> result = df.select(
...     dfn.functions.union_extract(dfn.col("u"), "int").alias("val")
... )
>>> result.collect_column("val").to_pylist()
[1, None, 2]
datafusion.functions.union_tag(union_expr: datafusion.expr.Expr) datafusion.expr.Expr

Returns the tag (active field name) of a union type.

Examples

>>> ctx = dfn.SessionContext()
>>> types = pa.array([0, 1, 0], type=pa.int8())
>>> offsets = pa.array([0, 0, 1], type=pa.int32())
>>> arr = pa.UnionArray.from_dense(
...     types, offsets, [pa.array([1, 2]), pa.array(["hi"])],
...     ["int", "str"], [0, 1],
... )
>>> batch = pa.RecordBatch.from_arrays([arr], names=["u"])
>>> df = ctx.create_dataframe([[batch]])
>>> result = df.select(
...     dfn.functions.union_tag(dfn.col("u")).alias("tag")
... )
>>> result.collect_column("tag").to_pylist()
['int', 'str', 'int']
datafusion.functions.upper(arg: datafusion.expr.Expr) datafusion.expr.Expr

Converts a string to uppercase.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.upper(dfn.col("a")).alias("u"))
>>> result.collect_column("u")[0].as_py()
'HELLO'
datafusion.functions.uuid() datafusion.expr.Expr

Returns a version 4 UUID as a string value.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.uuid().alias("u")
... )
>>> len(result.collect_column("u")[0].as_py()) == 36
True
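
The length check of 36 follows from the canonical text form of a UUID: 32 hexadecimal digits plus 4 hyphens. Python's uuid module produces the same form, as this independent sketch shows.

```python
import uuid

# A random (version 4) UUID in its canonical 8-4-4-4-12 text form.
value = uuid.uuid4()
text = str(value)

print(len(text), text.count("-"), value.version)
```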
datafusion.functions.var(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the sample variance of the argument.

See also

This is an alias for var_samp().

datafusion.functions.var_pop(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the population variance of the argument.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – The variable to compute the variance for

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [-1.0, 0.0, 2.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.var_pop(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.555...
>>> result = df.aggregate(
...     [], [dfn.functions.var_pop(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(-1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.0
datafusion.functions.var_population(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the population variance of the argument.

See also

This is an alias for var_pop().

datafusion.functions.var_samp(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the sample variance of the argument.

If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.

Parameters:
  • expression – The variable to compute the variance for

  • filter – If provided, only compute against rows for which the filter is True

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.var_samp(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.0
>>> result = df.aggregate(
...     [], [dfn.functions.var_samp(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.5
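
The relationship between the population and sample variance can be written out directly. The two helpers below are hypothetical plain-Python models of the textbook formulas, using the inputs from the var_pop and var_samp examples.

```python
def var_pop(xs):
    """Population variance: mean squared deviation, divided by n."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def var_samp(xs):
    """Sample variance: rescale the population variance by n / (n - 1)."""
    n = len(xs)
    return var_pop(xs) * n / (n - 1)


print(var_pop([-1.0, 0.0, 2.0]))
print(var_samp([1.0, 2.0, 3.0]))
```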
datafusion.functions.var_sample(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr

Computes the sample variance of the argument.

See also

This is an alias for var_samp().

datafusion.functions.version() datafusion.expr.Expr

Returns the DataFusion version string.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.empty_table()
>>> result = df.select(dfn.functions.version().alias("v"))
>>> "Apache DataFusion" in result.collect_column("v")[0].as_py()
True
datafusion.functions.when(when: datafusion.expr.Expr, then: datafusion.expr.Expr) datafusion.expr.CaseBuilder

Create a case expression that has no base expression.

Creates a CaseBuilder whose first branch yields then for rows where the condition when is true. See CaseBuilder for detailed usage.

Examples

>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.select(
...     dfn.functions.when(dfn.col("a") > dfn.lit(2),
...     dfn.lit("big")).otherwise(dfn.lit("small")).alias("c"))
>>> result.collect_column("c")[2].as_py()
'big'
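
The expression in the example corresponds to CASE WHEN a > 2 THEN 'big' ELSE 'small' END evaluated once per row. A plain-Python equivalent over the same input, shown only to illustrate the row-wise semantics:

```python
values = [1, 2, 3]

# Each row takes the "then" value when the condition holds,
# and the "otherwise" value when it does not.
labels = ["big" if v > 2 else "small" for v in values]

print(labels)  # ['small', 'small', 'big']
```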
datafusion.functions.today