datafusion.functions
User functions for operating on Expr.
Attributes
Functions
|
Return the absolute value of a given number. |
|
Returns the arc cosine or inverse cosine of a number. |
|
Returns inverse hyperbolic cosine. |
|
Creates an alias expression with an optional metadata dictionary. |
|
Returns the approximate number of distinct values. |
|
Returns the approximate median value. |
|
Returns the value that is approximately at a given percentile of |
|
Returns the value of the weighted approximate percentile. |
|
Returns an array using the specified input expressions. |
|
Aggregate values into an array. |
|
Returns the first non-null element in the array. |
|
Appends an element to the end of an array. |
|
Concatenates the input arrays. |
|
Concatenates the input arrays. |
|
Returns true if the element appears in the array, otherwise false. |
|
Returns an array of the array's dimensions. |
|
Returns the Euclidean distance between two numeric arrays. |
|
Returns distinct values from the array after removing duplicates. |
|
Extracts the element with the index n from the array. |
|
Returns a boolean indicating whether the array is empty. |
|
Returns the elements that appear in |
|
Extracts the element with the index n from the array. |
|
Returns true if the element appears in the first array, otherwise false. |
|
Determines if there is complete overlap |
|
Determine if there is an overlap between |
|
Return the position of the first occurrence of |
|
Returns the intersection of |
|
Converts each element to its text representation. |
|
Returns the length of the array. |
|
Returns the maximum value in the array. |
|
Returns the minimum value in the array. |
|
Returns the number of dimensions of the array. |
|
Returns the array without the last element. |
|
Returns the array without the first element. |
|
Return the position of the first occurrence of |
|
Searches for an element in the array and returns all occurrences. |
|
Prepends an element to the beginning of an array. |
|
Appends an element to the end of an array. |
|
Prepends an element to the beginning of an array. |
|
Removes the first element from the array equal to the given value. |
|
Removes all elements from the array equal to the given value. |
|
Removes the first |
|
Returns an array containing |
|
Replaces the first occurrence of |
|
Replaces all occurrences of |
|
Replace |
|
Returns an array with the specified size filled. |
|
Reverses the order of elements in the array. |
|
Returns a slice of the array. |
|
Sort an array. |
|
Converts each element to its text representation. |
|
Returns an array of the elements in the union of array1 and array2. |
|
Returns true if any element appears in both arrays. |
|
Combines multiple arrays into a single array of structs. |
|
Casts an expression to a specified data type. |
|
Returns the metadata of the input expression. |
|
Returns the Arrow type of the expression. |
|
Returns the numeric code of the first character of the argument. |
|
Returns the arc sine or inverse sine of a number. |
|
Returns inverse hyperbolic sine. |
|
Returns inverse tangent of a number. |
|
Returns inverse tangent of a division given in the argument. |
|
Returns inverse hyperbolic tangent. |
|
Returns the average value. |
|
Computes the bitwise AND of the argument. |
|
Returns the number of bits in the string argument. |
|
Computes the bitwise OR of the argument. |
|
Computes the bitwise XOR of the argument. |
|
Computes the boolean AND of the argument. |
|
Computes the boolean OR of the argument. |
|
Removes all characters, spaces by default, from both sides of a string. |
|
Returns the total number of elements in the array. |
|
Create a case expression. |
|
Returns the cube root of a number. |
|
Returns the nearest integer greater than or equal to argument. |
|
The number of characters in the |
|
Returns the number of characters in the argument. |
|
Converts the Unicode code point to a UTF8 character. |
|
Returns the value of the first expr in |
|
Creates a column reference expression. |
|
Concatenates the text representations of all the arguments. |
|
Concatenates the list |
|
Returns true if |
|
Returns the correlation coefficient between |
|
Returns the cosine of the argument. |
|
Returns the hyperbolic cosine of the argument. |
|
Returns the cotangent of the argument. |
|
Returns the number of rows that match the given arguments. |
|
Create a COUNT(1) aggregate expression. |
|
Computes the sample covariance. |
|
Computes the population covariance. |
|
Computes the sample covariance. |
|
Create a cumulative distribution window function. |
|
Returns current UTC date as a Date32 value. |
|
Returns current UTC time as a Time64 value. |
|
Returns the current timestamp in nanoseconds. |
|
Coerces an arbitrary timestamp to the start of the nearest specified interval. |
|
Returns a string representation of a date, time, timestamp or duration. |
|
Extracts a subfield from the date. |
|
Truncates the date to a specified level of precision. |
|
Return a specified part of a date. |
|
Truncates the date to a specified level of precision. |
|
Decode the |
|
Converts the argument from radians to degrees. |
|
Create a dense_rank window function. |
|
Computes the binary hash of an expression using the specified algorithm. |
|
Returns the value for a given key in the map. |
|
Returns true if the array is empty. |
|
Encode the |
|
Returns true if the |
|
Returns the exponential of the argument. |
|
Extracts a subfield from the date. |
|
Returns the factorial of the argument. |
|
Find a string in a list of strings. |
|
Returns the first value in a group of values. |
|
Flattens an array of arrays into a single array. |
|
Returns the nearest integer less than or equal to the argument. |
|
Converts an integer to RFC3339 timestamp format string. |
|
Returns the greatest common divisor. |
|
Creates a list of values in the range between start and stop. |
|
Creates a list of values in the range between start and stop. |
|
Extracts a field from a struct or map by name. |
|
Returns the greatest value from a list of expressions. |
|
Indicates whether a column is aggregated across in the current row. |
|
Returns |
|
Returns whether the argument is contained within the list |
|
Set the initial letter of each word to capital. |
|
Returns true if a given number is +NaN or -NaN otherwise returns false. |
|
Returns true if a given number is +0.0 or -0.0 otherwise returns false. |
|
Create a lag window function. |
|
Returns the last value in a group of values. |
|
Returns the least common multiple. |
|
Create a lead window function. |
|
Returns the least value from a list of expressions. |
|
Returns the first |
|
The number of characters in the |
|
Returns the Levenshtein distance between the two given strings. |
|
Returns the first non-null element in the array. |
|
Appends an element to the end of an array. |
|
Concatenates the input arrays. |
|
Concatenates the input arrays. |
|
Returns true if the element appears in the array, otherwise false. |
|
Returns an array of the array's dimensions. |
|
Returns the Euclidean distance between two numeric arrays. |
|
Returns distinct values from the array after removing duplicates. |
|
Extracts the element with the index n from the array. |
|
Returns a boolean indicating whether the array is empty. |
|
Returns the elements that appear in |
|
Extracts the element with the index n from the array. |
|
Returns true if the element appears in the array, otherwise false. |
|
Determines if there is complete overlap |
|
Determine if there is an overlap between |
|
Return the position of the first occurrence of |
|
Returns the intersection of |
|
Converts each element to its text representation. |
|
Returns the length of the array. |
|
Returns the maximum value in the array. |
|
Returns the minimum value in the array. |
|
Returns the number of dimensions of the array. |
|
Returns true if any element appears in both arrays. |
|
Returns the array without the last element. |
|
Returns the array without the first element. |
|
Return the position of the first occurrence of |
|
Searches for an element in the array and returns all occurrences. |
|
Prepends an element to the beginning of an array. |
|
Appends an element to the end of an array. |
|
Prepends an element to the beginning of an array. |
|
Removes the first element from the array equal to the given value. |
|
Removes all elements from the array equal to the given value. |
|
Removes the first |
|
Returns an array containing |
|
Replaces the first occurrence of |
|
Replaces all occurrences of |
|
Replace |
|
Returns an array with the specified size filled. |
|
Reverses the order of elements in the array. |
|
Returns a slice of the array. |
|
Sorts the array. |
|
Converts each element to its text representation. |
|
Returns an array of the elements in the union of array1 and array2. |
|
Combines multiple arrays into a single array of structs. |
|
Returns the natural logarithm (base e) of the argument. |
|
Returns the logarithm of a number for a particular |
|
Base 10 logarithm of the argument. |
|
Base 2 logarithm of the argument. |
|
Converts a string to lowercase. |
|
Add left padding to a string. |
|
Removes all characters, spaces by default, from the beginning of a string. |
|
Returns an array using the specified input expressions. |
|
Make a date from year, month and day component parts. |
|
Returns an array using the specified input expressions. |
|
Returns a map expression. |
|
Make a time from hour, minute and second component parts. |
|
Returns a list of all entries (key-value struct pairs) in the map. |
|
Returns the value for a given key in the map. |
|
Returns a list of all keys in the map. |
|
Returns a list of all values in the map. |
|
Aggregate function that returns the maximum value of the argument. |
|
Computes an MD5 128-bit checksum for a string expression. |
|
Returns the average (mean) value of the argument. |
|
Computes the median of a set of numbers. |
|
Aggregate function that returns the minimum value of the argument. |
|
Returns a struct with the given names and arguments pairs. |
|
Returns |
|
Returns the current timestamp in nanoseconds. |
|
Returns the n-th value in a group of values. |
|
Create a n-tile window function. |
|
Returns NULL if expr1 equals expr2; otherwise it returns expr1. |
|
Returns |
|
Returns |
|
Returns the number of bytes of a string. |
|
Creates a new sort expression. |
|
Replace a substring with a new substring. |
|
Create a percent_rank window function. |
|
Computes the exact percentile of input values using continuous interpolation. |
|
Returns an approximate value of π. |
|
Returns |
|
Returns |
|
Computes the exact percentile of input values using continuous interpolation. |
|
Converts the argument from degrees to radians. |
|
Returns a random value in the range |
|
Create a list of values in the range between start and stop. |
|
Create a rank window function. |
|
Returns the number of matches in a string. |
|
Returns the position of a regular expression match in a string. |
|
Find if any regular expression (regex) matches exist. |
|
Perform regular expression (regex) matching. |
|
Replaces substring(s) matching a PCRE-like regular expression. |
|
Computes the average of the independent variable |
|
Computes the average of the dependent variable |
|
Counts the number of rows in which both expressions are not null. |
|
Computes the intercept from the linear regression. |
|
Computes the R-squared value from linear regression. |
|
Computes the slope from linear regression. |
|
Computes the sum of squares of the independent variable |
|
Computes the sum of products of pairs of numbers. |
|
Computes the sum of squares of the dependent variable |
|
Repeats the |
|
Replaces all occurrences of |
|
Reverse the string argument. |
|
Returns the last |
|
Round the argument to the nearest integer. |
|
Returns a struct with the given arguments. |
|
Create a row number window function. |
|
Add right padding to a string. |
|
Removes all characters, spaces by default, from the end of a string. |
|
Computes the SHA-224 hash of a binary string. |
|
Computes the SHA-256 hash of a binary string. |
|
Computes the SHA-384 hash of a binary string. |
|
Computes the SHA-512 hash of a binary string. |
|
Returns the sign of the argument (-1, 0, +1). |
|
Returns the sine of the argument. |
|
Returns the hyperbolic sine of the argument. |
|
Split a string and return one part. |
|
Returns the square root of the argument. |
|
Returns true if string starts with prefix. |
|
Computes the standard deviation of the argument. |
|
Computes the population standard deviation of the argument. |
|
Computes the sample standard deviation of the argument. |
|
Concatenates the input strings. |
|
Splits a string based on a delimiter and returns an array of parts. |
|
Splits a string based on a delimiter and returns an array of parts. |
|
Finds the position from where the |
|
Returns a struct with the given arguments. |
|
Substring from the |
|
Returns an indexed substring. |
|
Substring from the |
|
Computes the sum of a set of numbers. |
|
Returns the tangent of the argument. |
|
Returns the hyperbolic tangent of the argument. |
|
Returns a string representation of a date, time, timestamp or duration. |
|
Converts a value to a date (YYYY-MM-DD). |
|
Converts an integer to a hexadecimal string. |
|
Converts a timestamp with a timezone to a timestamp without a timezone. |
|
Converts a value to a time. Supports strings and timestamps as input. |
|
Converts a string and optional formats to a |
|
Converts a string and optional formats to a |
|
Converts a string and optional formats to a |
|
Converts a string and optional formats to a |
|
Converts a string and optional formats to a |
|
Converts a string and optional formats to a Unixtime. |
|
Replaces the characters in |
|
Removes all characters, spaces by default, from both sides of a string. |
|
Truncate the number toward zero with optional precision. |
|
Extracts a value from a union type by field name. |
|
Returns the tag (active field name) of a union type. |
|
Converts a string to uppercase. |
|
Returns uuid v4 as a string value. |
|
Computes the sample variance of the argument. |
|
Computes the population variance of the argument. |
|
Computes the population variance of the argument. |
|
Computes the sample variance of the argument. |
|
Computes the sample variance of the argument. |
|
Returns the DataFusion version string. |
|
Create a case expression that has no base expression. |
Module Contents
- datafusion.functions.abs(arg: datafusion.expr.Expr) → datafusion.expr.Expr
Return the absolute value of a given number.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [-1, 0, 1]})
>>> result = df.select(dfn.functions.abs(dfn.col("a")).alias("abs"))
>>> result.collect_column("abs")[0].as_py()
1
- datafusion.functions.acos(arg: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the arc cosine or inverse cosine of a number.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0]})
>>> result = df.select(dfn.functions.acos(dfn.col("a")).alias("acos"))
>>> result.collect_column("acos")[0].as_py()
0.0
- datafusion.functions.acosh(arg: datafusion.expr.Expr) → datafusion.expr.Expr
Returns inverse hyperbolic cosine.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0]})
>>> result = df.select(dfn.functions.acosh(dfn.col("a")).alias("acosh"))
>>> result.collect_column("acosh")[0].as_py()
0.0
- datafusion.functions.alias(expr: datafusion.expr.Expr, name: str, metadata: dict[str, str] | None = None) → datafusion.expr.Expr
Creates an alias expression with an optional metadata dictionary.
- Parameters:
expr – The expression to alias
name – The alias name
metadata – Optional metadata to attach to the column
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2]})
>>> result = df.select(
...     dfn.functions.alias(
...         dfn.col("a"), "b"
...     )
... )
>>> result.collect_column("b")[0].as_py()
1
>>> result = df.select(
...     dfn.functions.alias(
...         dfn.col("a"), "b", metadata={"info": "test"}
...     )
... )
>>> result.schema()
b: int64
  -- field metadata --
  info: 'test'
- datafusion.functions.approx_distinct(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr
Returns the approximate number of distinct values.
This aggregate function is similar to count() with distinct set, but it will approximate the number of distinct entries. It may return significantly faster than count() for some DataFrames.
If using the builder functions described in the aggregation section of the documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
expression – Values to check for distinct entries
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.approx_distinct(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py() == 3
True
>>> result = df.aggregate(
...     [], [dfn.functions.approx_distinct(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py() == 2
True
- datafusion.functions.approx_median(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr
Returns the approximate median value.
This aggregate function is similar to median(), but it will only approximate the median. It may return significantly faster for some DataFrames.
If using the builder functions described in the aggregation section of the documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
expression – Values to find the median for
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.approx_median(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> result = df.aggregate(
...     [], [dfn.functions.approx_median(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.5
- datafusion.functions.approx_percentile_cont(sort_expression: datafusion.expr.Expr | datafusion.expr.SortExpr, percentile: float, num_centroids: int | None = None, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr
Returns the value that is approximately at a given percentile of expr.
This aggregate function assumes the input values form a continuous distribution. Suppose you have a DataFrame which consists of 100 different test scores. If you called this function with a percentile of 0.9, it would return the value of the test score that is above 90% of the other test scores. The returned value may be between two of the values.
This function uses the [t-digest](https://arxiv.org/abs/1902.04023) algorithm to compute the percentile. You can limit the number of bins used in this algorithm by setting the num_centroids parameter.
If using the builder functions described in the aggregation section of the documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
sort_expression – Values for which to find the approximate percentile
percentile – This must be between 0.0 and 1.0, inclusive
num_centroids – Max bin size for the t-digest algorithm
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.approx_percentile_cont(
...         dfn.col("a"), 0.5
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3.0
>>> result = df.aggregate(
...     [], [dfn.functions.approx_percentile_cont(
...         dfn.col("a"), 0.5,
...         num_centroids=10,
...         filter=dfn.col("a") > dfn.lit(1.0),
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3.5
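To make "the returned value may be between two of the values" concrete, here is a minimal plain-Python sketch of an exact continuous percentile using linear interpolation. It is not DataFusion's implementation: the library approximates this quantity with t-digest, and the helper name percentile_cont_sketch is hypothetical, introduced only for illustration.

```python
import math


def percentile_cont_sketch(values, percentile):
    """Exact continuous percentile via linear interpolation (illustration only)."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be between 0.0 and 1.0, inclusive")
    ordered = sorted(values)
    if not ordered:
        return None
    # Fractional rank of the requested percentile within the sorted values.
    rank = percentile * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    # Interpolate between the two neighbouring values.
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


scores = [1.0, 2.0, 3.0, 4.0, 5.0]
median_all = percentile_cont_sketch(scores, 0.5)
# Mirrors the filter used above: keep only values greater than 1.0.
median_filtered = percentile_cont_sketch([s for s in scores if s > 1.0], 0.5)
```

With four values left after filtering, the median falls halfway between 3.0 and 4.0, which is why a result can be a number that does not appear in the input.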
- datafusion.functions.approx_percentile_cont_with_weight(sort_expression: datafusion.expr.Expr | datafusion.expr.SortExpr, weight: datafusion.expr.Expr, percentile: float, num_centroids: int | None = None, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr
Returns the value of the weighted approximate percentile.
This aggregate function is similar to approx_percentile_cont() except that it uses the associated weights.
If using the builder functions described in the aggregation section of the documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
sort_expression – Values for which to find the approximate percentile
weight – Relative weight for each of the values in sort_expression
percentile – This must be between 0.0 and 1.0, inclusive
num_centroids – Max bin size for the t-digest algorithm
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0], "w": [1.0, 1.0, 1.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.approx_percentile_cont_with_weight(
...         dfn.col("a"), dfn.col("w"), 0.5
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> result = df.aggregate(
...     [], [dfn.functions.approx_percentile_cont_with_weight(
...         dfn.col("a"), dfn.col("w"), 0.5,
...         num_centroids=10,
...         filter=dfn.col("a") > dfn.lit(1.0),
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.5
- datafusion.functions.array(*args: datafusion.expr.Expr) → datafusion.expr.Expr
Returns an array using the specified input expressions.
See also
This is an alias for make_array().
- datafusion.functions.array_agg(expression: datafusion.expr.Expr, distinct: bool = False, filter: datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) → datafusion.expr.Expr
Aggregate values into an array.
Currently distinct and order_by cannot be used together. As a workaround, consider array_sort() after aggregation. [Issue Tracker](https://github.com/apache/datafusion/issues/12371)
If using the builder functions described in the aggregation section of the documentation, this function ignores the option null_treatment.
- Parameters:
expression – Values to combine into an array
distinct – If True, a single entry for each distinct value will be in the result
filter – If provided, only compute against rows for which the filter is True
order_by – Order the resultant array values. Accepts column names or expressions.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.array_agg(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
[1, 2, 3]
>>> df = ctx.from_pydict({"a": [3, 1, 2, 1]})
>>> result = df.aggregate(
...     [], [dfn.functions.array_agg(
...         dfn.col("a"), distinct=True,
...     ).alias("v")])
>>> sorted(result.collect_column("v")[0].as_py())
[1, 2, 3]
>>> result = df.aggregate(
...     [], [dfn.functions.array_agg(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1),
...         order_by="a",
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
[2, 3]
- datafusion.functions.array_any_value(array: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the first non-null element in the array.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[None, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_any_value(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
2
- datafusion.functions.array_append(array: datafusion.expr.Expr, element: datafusion.expr.Expr) → datafusion.expr.Expr
Appends an element to the end of an array.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_append(dfn.col("a"), dfn.lit(4)).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2, 3, 4]
- datafusion.functions.array_cat(*args: datafusion.expr.Expr) → datafusion.expr.Expr
Concatenates the input arrays.
See also
This is an alias for array_concat().
- datafusion.functions.array_concat(*args: datafusion.expr.Expr) → datafusion.expr.Expr
Concatenates the input arrays.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2]], "b": [[3, 4]]})
>>> result = df.select(
...     dfn.functions.array_concat(dfn.col("a"), dfn.col("b")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2, 3, 4]
- datafusion.functions.array_contains(array: datafusion.expr.Expr, element: datafusion.expr.Expr) → datafusion.expr.Expr
Returns true if the element appears in the array, otherwise false.
See also
This is an alias for array_has().
- datafusion.functions.array_dims(array: datafusion.expr.Expr) → datafusion.expr.Expr
Returns an array of the array’s dimensions.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(dfn.functions.array_dims(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[3]
- datafusion.functions.array_distance(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the Euclidean distance between two numeric arrays.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1.0, 2.0]], "b": [[1.0, 4.0]]})
>>> result = df.select(
...     dfn.functions.array_distance(
...         dfn.col("a"), dfn.col("b"),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
2.0
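For reference, the Euclidean distance is the square root of the sum of squared element-wise differences. The plain-Python sketch below reproduces the value from the example above. The helper euclidean_distance_sketch is a hypothetical name, and rejecting arrays of different lengths is an assumption of this sketch, not documented library behaviour.

```python
import math


def euclidean_distance_sketch(array1, array2):
    """Square root of the sum of squared element-wise differences."""
    # Assumption of this sketch: arrays of different lengths are rejected.
    if len(array1) != len(array2):
        raise ValueError("arrays must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(array1, array2)))


# Same inputs as the array_distance example: sqrt(0**2 + 2**2).
distance = euclidean_distance_sketch([1.0, 2.0], [1.0, 4.0])
```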
- datafusion.functions.array_distinct(array: datafusion.expr.Expr) → datafusion.expr.Expr
Returns distinct values from the array after removing duplicates.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_distinct(
...         dfn.col("a")
...     ).alias("result")
... )
>>> sorted(
...     result.collect_column("result")[0].as_py()
... )
[1, 2, 3]
- datafusion.functions.array_element(array: datafusion.expr.Expr, n: datafusion.expr.Expr) → datafusion.expr.Expr
Extracts the element with the index n from the array.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[10, 20, 30]]})
>>> result = df.select(
...     dfn.functions.array_element(dfn.col("a"), dfn.lit(2)).alias("result"))
>>> result.collect_column("result")[0].as_py()
20
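The example above returns 20 for index 2, which shows that n is 1-based, unlike Python list indexing. A small plain-Python sketch of that lookup follows. The helper array_element_sketch is hypothetical, and returning None for an out-of-range index is an assumption standing in for a null result.

```python
def array_element_sketch(array, n):
    """Return the n-th element, counting from 1 (illustration only)."""
    if 1 <= n <= len(array):
        # Shift the 1-based position to Python's 0-based index.
        return array[n - 1]
    # Assumption of this sketch: out-of-range positions yield None.
    return None


second = array_element_sketch([10, 20, 30], 2)
```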
- datafusion.functions.array_empty(array: datafusion.expr.Expr) → datafusion.expr.Expr
Returns a boolean indicating whether the array is empty.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2]]})
>>> result = df.select(dfn.functions.array_empty(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
False
- datafusion.functions.array_except(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the elements that appear in array1 but not in array2.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]], "b": [[2, 3, 4]]})
>>> result = df.select(
...     dfn.functions.array_except(dfn.col("a"), dfn.col("b")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1]
- datafusion.functions.array_extract(array: datafusion.expr.Expr, n: datafusion.expr.Expr) → datafusion.expr.Expr
Extracts the element with the index n from the array.
See also
This is an alias for array_element().
- datafusion.functions.array_has(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) → datafusion.expr.Expr
Returns true if the element appears in the first array, otherwise false.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_has(dfn.col("a"), dfn.lit(2)).alias("result"))
>>> result.collect_column("result")[0].as_py()
True
- datafusion.functions.array_has_all(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) → datafusion.expr.Expr
Determines if there is complete overlap of second_array in first_array.
Returns true if each element of the second array appears in the first array. Otherwise, it returns false.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]], "b": [[1, 2]]})
>>> result = df.select(
...     dfn.functions.array_has_all(dfn.col("a"), dfn.col("b")).alias("result"))
>>> result.collect_column("result")[0].as_py()
True
- datafusion.functions.array_has_any(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) → datafusion.expr.Expr
Determine if there is an overlap between first_array and second_array.
Returns true if at least one element of the second array appears in the first array. Otherwise, it returns false.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]], "b": [[2, 5]]})
>>> result = df.select(
...     dfn.functions.array_has_any(dfn.col("a"), dfn.col("b")).alias("result"))
>>> result.collect_column("result")[0].as_py()
True
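The difference between array_has_all() and array_has_any() is "every element" versus "at least one element" of the second array. The plain-Python sketch below models only that rule as stated above. The helper names are hypothetical, and null handling is deliberately left out.

```python
def has_all_sketch(first_array, second_array):
    """True if every element of second_array appears in first_array."""
    return all(item in first_array for item in second_array)


def has_any_sketch(first_array, second_array):
    """True if at least one element of second_array appears in first_array."""
    return any(item in first_array for item in second_array)


# Same inputs as the two examples above.
complete_overlap = has_all_sketch([1, 2, 3], [1, 2])
partial_overlap = has_any_sketch([1, 2, 3], [2, 5])
# [2, 5] overlaps only partially, so the stricter check fails.
partial_is_not_complete = has_all_sketch([1, 2, 3], [2, 5])
```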
- datafusion.functions.array_indexof(array: datafusion.expr.Expr, element: datafusion.expr.Expr, index: int | None = 1) → datafusion.expr.Expr
Return the position of the first occurrence of element in array.
See also
This is an alias for array_position().
- datafusion.functions.array_intersect(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the intersection of array1 and array2.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]], "b": [[2, 3, 4]]})
>>> result = df.select(
...     dfn.functions.array_intersect(
...         dfn.col("a"), dfn.col("b")
...     ).alias("result")
... )
>>> sorted(
...     result.collect_column("result")[0].as_py()
... )
[2, 3]
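array_except() and array_intersect() behave like set difference and set intersection over array elements. The plain-Python sketch below models them on the same inputs as the examples. The helper names are hypothetical, and both the de-duplication and the result order are assumptions of this sketch. The array_intersect() example sorts its output before comparing, so callers should probably not rely on element order.

```python
def array_except_sketch(array1, array2):
    """Elements of array1 that do not appear in array2."""
    # dict.fromkeys de-duplicates while keeping first-seen order (an assumption).
    return [item for item in dict.fromkeys(array1) if item not in array2]


def array_intersect_sketch(array1, array2):
    """Elements that appear in both array1 and array2."""
    return [item for item in dict.fromkeys(array1) if item in array2]


only_in_first = array_except_sketch([1, 2, 3], [2, 3, 4])
in_both = sorted(array_intersect_sketch([1, 2, 3], [2, 3, 4]))
```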
- datafusion.functions.array_join(expr: datafusion.expr.Expr, delimiter: datafusion.expr.Expr) → datafusion.expr.Expr
Converts each element to its text representation.
See also
This is an alias for array_to_string().
- datafusion.functions.array_length(array: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the length of the array.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(dfn.functions.array_length(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
3
- datafusion.functions.array_max(array: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the maximum value in the array.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_max(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
3
- datafusion.functions.array_min(array: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the minimum value in the array.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_min(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
1
- datafusion.functions.array_ndims(array: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the number of dimensions of the array.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(dfn.functions.array_ndims(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
1
- datafusion.functions.array_pop_back(array: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the array without the last element.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_pop_back(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2]
- datafusion.functions.array_pop_front(array: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the array without the first element.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[1, 2, 3]]})
>>> result = df.select(
...     dfn.functions.array_pop_front(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[2, 3]
- datafusion.functions.array_position(array: datafusion.expr.Expr, element: datafusion.expr.Expr, index: int | None = 1) datafusion.expr.Expr¶
Return the position of the first occurrence of element in array.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[10, 20, 30]]}) >>> result = df.select( ... dfn.functions.array_position( ... dfn.col("a"), dfn.lit(20) ... ).alias("result")) >>> result.collect_column("result")[0].as_py() 2
Use index to start searching from a given position:
>>> df = ctx.from_pydict({"a": [[10, 20, 10, 20]]})
>>> result = df.select(
...     dfn.functions.array_position(
...         dfn.col("a"), dfn.lit(20), index=3,
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
4
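The 1-based position and the effect of index can be illustrated with a plain-Python sketch. This is only a model of the documented examples, not DataFusion's implementation; the helper name array_position_sketch and the use of None for "not found" are assumptions.

```python
def array_position_sketch(values, element, index=1):
    """Return the 1-based position of the first match at or after `index`."""
    # `index` is 1-based, like the returned position.
    start = max(index, 1)
    for pos in range(start, len(values) + 1):
        if values[pos - 1] == element:
            return pos
    # Assumption: "not found" is represented as None (a null).
    return None


print(array_position_sketch([10, 20, 30], 20))               # 2
print(array_position_sketch([10, 20, 10, 20], 20, index=3))  # 4
```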
- datafusion.functions.array_positions(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr¶
Searches for an element in the array and returns all occurrences.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 1]]}) >>> result = df.select( ... dfn.functions.array_positions(dfn.col("a"), dfn.lit(1)).alias("result")) >>> result.collect_column("result")[0].as_py() [1, 3]
- datafusion.functions.array_prepend(element: datafusion.expr.Expr, array: datafusion.expr.Expr) datafusion.expr.Expr¶
Prepends an element to the beginning of an array.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2]]}) >>> result = df.select( ... dfn.functions.array_prepend(dfn.lit(0), dfn.col("a")).alias("result")) >>> result.collect_column("result")[0].as_py() [0, 1, 2]
- datafusion.functions.array_push_back(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr¶
Appends an element to the end of an array.
See also
This is an alias for array_append().
- datafusion.functions.array_push_front(element: datafusion.expr.Expr, array: datafusion.expr.Expr) datafusion.expr.Expr¶
Prepends an element to the beginning of an array.
See also
This is an alias for array_prepend().
- datafusion.functions.array_remove(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr¶
Removes the first element from the array equal to the given value.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 1]]}) >>> result = df.select( ... dfn.functions.array_remove(dfn.col("a"), dfn.lit(1)).alias("result")) >>> result.collect_column("result")[0].as_py() [2, 1]
- datafusion.functions.array_remove_all(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr¶
Removes all elements from the array equal to the given value.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 1]]}) >>> result = df.select( ... dfn.functions.array_remove_all( ... dfn.col("a"), dfn.lit(1) ... ).alias("result")) >>> result.collect_column("result")[0].as_py() [2]
- datafusion.functions.array_remove_n(array: datafusion.expr.Expr, element: datafusion.expr.Expr, max: datafusion.expr.Expr) datafusion.expr.Expr¶
Removes the first max elements from the array equal to the given value.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 1, 1]]}) >>> result = df.select( ... dfn.functions.array_remove_n(dfn.col("a"), dfn.lit(1), ... dfn.lit(2)).alias("result")) >>> result.collect_column("result")[0].as_py() [2, 1]
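The relationship between array_remove, array_remove_n, and array_remove_all can be modeled in plain Python. The sketch below is an illustration of the documented results only; array_remove_n_sketch is a hypothetical helper, not part of the library.

```python
def array_remove_n_sketch(values, element, max_count):
    """Drop at most `max_count` leading occurrences of `element`."""
    result = []
    removed = 0
    for v in values:
        if v == element and removed < max_count:
            removed += 1  # skip this occurrence
            continue
        result.append(v)
    return result


data = [1, 2, 1, 1]
print(array_remove_n_sketch(data, 1, 1))          # like array_remove: [2, 1, 1]
print(array_remove_n_sketch(data, 1, 2))          # like array_remove_n: [2, 1]
print(array_remove_n_sketch(data, 1, len(data)))  # like array_remove_all: [2]
```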
- datafusion.functions.array_repeat(element: datafusion.expr.Expr, count: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns an array containing element count times.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1]}) >>> result = df.select( ... dfn.functions.array_repeat(dfn.lit(3), dfn.lit(3)).alias("result")) >>> result.collect_column("result")[0].as_py() [3, 3, 3]
- datafusion.functions.array_replace(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr¶
Replaces the first occurrence of from_val with to_val.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 1]]}) >>> result = df.select( ... dfn.functions.array_replace(dfn.col("a"), dfn.lit(1), ... dfn.lit(9)).alias("result")) >>> result.collect_column("result")[0].as_py() [9, 2, 1]
- datafusion.functions.array_replace_all(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr¶
Replaces all occurrences of from_val with to_val.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 1]]}) >>> result = df.select( ... dfn.functions.array_replace_all(dfn.col("a"), dfn.lit(1), ... dfn.lit(9)).alias("result")) >>> result.collect_column("result")[0].as_py() [9, 2, 9]
- datafusion.functions.array_replace_n(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr, max: datafusion.expr.Expr) datafusion.expr.Expr¶
Replaces the first max occurrences of from_val with to_val.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 1, 1]]}) >>> result = df.select( ... dfn.functions.array_replace_n(dfn.col("a"), dfn.lit(1), dfn.lit(9), ... dfn.lit(2)).alias("result")) >>> result.collect_column("result")[0].as_py() [9, 2, 9, 1]
- datafusion.functions.array_resize(array: datafusion.expr.Expr, size: datafusion.expr.Expr, value: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the array resized to the specified size.
If size is greater than the array length, the additional entries will be filled with the given value.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2]]}) >>> result = df.select( ... dfn.functions.array_resize(dfn.col("a"), dfn.lit(4), ... dfn.lit(0)).alias("result")) >>> result.collect_column("result")[0].as_py() [1, 2, 0, 0]
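A plain-Python sketch of the padding behavior described above. The helper array_resize_sketch is hypothetical, and the truncation branch is an assumption: the text here only documents what happens when size exceeds the array length.

```python
def array_resize_sketch(values, size, fill=None):
    """Pad `values` with `fill` up to `size` elements."""
    if size <= len(values):
        # Assumption: a smaller size truncates; not stated in the text above.
        return values[:size]
    return values + [fill] * (size - len(values))


print(array_resize_sketch([1, 2], 4, 0))  # [1, 2, 0, 0]
```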
- datafusion.functions.array_reverse(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Reverses the order of elements in the array.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 3]]}) >>> result = df.select( ... dfn.functions.array_reverse(dfn.col("a")).alias("result")) >>> result.collect_column("result")[0].as_py() [3, 2, 1]
- datafusion.functions.array_slice(array: datafusion.expr.Expr, begin: datafusion.expr.Expr, end: datafusion.expr.Expr, stride: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Returns a slice of the array.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 3, 4]]}) >>> result = df.select( ... dfn.functions.array_slice( ... dfn.col("a"), dfn.lit(2), dfn.lit(3) ... ).alias("result")) >>> result.collect_column("result")[0].as_py() [2, 3]
Use stride to skip elements:
>>> result = df.select(
...     dfn.functions.array_slice(
...         dfn.col("a"), dfn.lit(1), dfn.lit(4),
...         stride=dfn.lit(2),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 3]
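The examples suggest that begin and end are 1-based and that end is inclusive. Under that reading, the slice maps onto Python slicing as sketched below. The helper array_slice_sketch is hypothetical and only handles positive indexes and a positive stride.

```python
def array_slice_sketch(values, begin, end, stride=1):
    """Slice with 1-based, inclusive bounds (positive values only)."""
    # begin - 1 converts to a 0-based start; Python's exclusive stop
    # equals the 1-based inclusive end.
    return values[begin - 1:end:stride]


data = [1, 2, 3, 4]
print(array_slice_sketch(data, 2, 3))            # [2, 3]
print(array_slice_sketch(data, 1, 4, stride=2))  # [1, 3]
```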
- datafusion.functions.array_sort(array: datafusion.expr.Expr, descending: bool = False, null_first: bool = False) datafusion.expr.Expr¶
Sort an array.
- Parameters:
array – The input array to sort.
descending – If True, sorts in descending order.
null_first – If True, nulls will be returned at the beginning of the array.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[3, 1, 2]]}) >>> result = df.select( ... dfn.functions.array_sort( ... dfn.col("a") ... ).alias("result")) >>> result.collect_column("result")[0].as_py() [1, 2, 3]
>>> df = ctx.from_pydict({"a": [[3, None, 1]]}) >>> result = df.select( ... dfn.functions.array_sort( ... dfn.col("a"), descending=True, null_first=True, ... ).alias("result")) >>> result.collect_column("result")[0].as_py() [None, 3, 1]
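How descending and null_first interact can be modeled in plain Python. This sketch reproduces the two examples above; array_sort_sketch is a hypothetical helper and does not reflect how DataFusion sorts internally.

```python
def array_sort_sketch(values, descending=False, null_first=False):
    """Sort non-null values, then place nulls at the chosen end."""
    nulls = [v for v in values if v is None]
    rest = sorted((v for v in values if v is not None), reverse=descending)
    return nulls + rest if null_first else rest + nulls


print(array_sort_sketch([3, 1, 2]))                                    # [1, 2, 3]
print(array_sort_sketch([3, None, 1], descending=True, null_first=True))  # [None, 3, 1]
```

Note that null placement is independent of the sort direction in this model: descending only reverses the non-null values.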
- datafusion.functions.array_to_string(expr: datafusion.expr.Expr, delimiter: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts each element to its text representation and joins them with the delimiter.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 3]]}) >>> result = df.select( ... dfn.functions.array_to_string(dfn.col("a"), dfn.lit(",")).alias("s")) >>> result.collect_column("s")[0].as_py() '1,2,3'
- datafusion.functions.array_union(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns an array of the elements in the union of array1 and array2.
Duplicate elements will not be returned.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 3]], "b": [[2, 3, 4]]}) >>> result = df.select( ... dfn.functions.array_union( ... dfn.col("a"), dfn.col("b") ... ).alias("result") ... ) >>> sorted( ... result.collect_column("result")[0].as_py() ... ) [1, 2, 3, 4]
- datafusion.functions.arrays_overlap(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns true if any element appears in both arrays.
See also
This is an alias for array_has_any().
- datafusion.functions.arrays_zip(*arrays: datafusion.expr.Expr) datafusion.expr.Expr¶
Combines multiple arrays into a single array of structs.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2]], "b": [[3, 4]]}) >>> result = df.select( ... dfn.functions.arrays_zip(dfn.col("a"), dfn.col("b")).alias("result")) >>> result.collect_column("result")[0].as_py() [{'c0': 1, 'c1': 3}, {'c0': 2, 'c1': 4}]
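The shape of the result, one struct per position with fields named c0, c1, and so on, can be reproduced in plain Python. The helper arrays_zip_sketch is hypothetical and assumes all input arrays have the same length; how unequal lengths are handled is not covered here.

```python
def arrays_zip_sketch(*arrays):
    """Combine equal-length lists into a list of dicts keyed c0, c1, ..."""
    return [
        {f"c{i}": value for i, value in enumerate(row)}
        for row in zip(*arrays)
    ]


print(arrays_zip_sketch([1, 2], [3, 4]))
# [{'c0': 1, 'c1': 3}, {'c0': 2, 'c1': 4}]
```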
- datafusion.functions.arrow_cast(expr: datafusion.expr.Expr, data_type: datafusion.expr.Expr | str | pyarrow.DataType) datafusion.expr.Expr¶
Casts an expression to a specified data type.
The data_type can be a string, a pyarrow.DataType, or an Expr. For simple types, Expr.cast() is more concise (e.g., col("a").cast(pa.float64())). Use arrow_cast when you want to specify the target type as a string using DataFusion’s type syntax, which can be more readable for complex types like "Timestamp(Nanosecond, None)".
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1]}) >>> result = df.select( ... dfn.functions.arrow_cast(dfn.col("a"), "Float64").alias("c") ... ) >>> result.collect_column("c")[0].as_py() 1.0
>>> result = df.select( ... dfn.functions.arrow_cast( ... dfn.col("a"), data_type=pa.float64() ... ).alias("c") ... ) >>> result.collect_column("c")[0].as_py() 1.0
- datafusion.functions.arrow_metadata(expr: datafusion.expr.Expr, key: datafusion.expr.Expr | str | None = None) datafusion.expr.Expr¶
Returns the metadata of the input expression.
If called with one argument, returns a Map of all metadata key-value pairs. If called with two arguments, returns the value for the specified metadata key.
Examples
>>> field = pa.field("val", pa.int64(), metadata={"k": "v"})
>>> schema = pa.schema([field])
>>> batch = pa.RecordBatch.from_arrays([pa.array([1])], schema=schema)
>>> ctx = dfn.SessionContext()
>>> df = ctx.create_dataframe([[batch]])
>>> result = df.select(
...     dfn.functions.arrow_metadata(dfn.col("val")).alias("meta")
... )
>>> ("k", "v") in result.collect_column("meta")[0].as_py()
True
>>> result = df.select( ... dfn.functions.arrow_metadata( ... dfn.col("val"), key="k" ... ).alias("meta_val") ... ) >>> result.collect_column("meta_val")[0].as_py() 'v'
- datafusion.functions.arrow_typeof(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the Arrow type of the expression.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1]}) >>> result = df.select(dfn.functions.arrow_typeof(dfn.col("a")).alias("t")) >>> result.collect_column("t")[0].as_py() 'Int64'
- datafusion.functions.ascii(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the numeric code of the first character of the argument.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": ["a","b","c"]}) >>> ascii_df = df.select(dfn.functions.ascii(dfn.col("a")).alias("ascii")) >>> ascii_df.collect_column("ascii")[0].as_py() 97
- datafusion.functions.asin(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the arc sine or inverse sine of a number.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [0.0]}) >>> result = df.select(dfn.functions.asin(dfn.col("a")).alias("asin")) >>> result.collect_column("asin")[0].as_py() 0.0
- datafusion.functions.asinh(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns inverse hyperbolic sine.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [0.0]}) >>> result = df.select(dfn.functions.asinh(dfn.col("a")).alias("asinh")) >>> result.collect_column("asinh")[0].as_py() 0.0
- datafusion.functions.atan(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns inverse tangent of a number.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [0.0]}) >>> result = df.select(dfn.functions.atan(dfn.col("a")).alias("atan")) >>> result.collect_column("atan")[0].as_py() 0.0
- datafusion.functions.atan2(y: datafusion.expr.Expr, x: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the inverse tangent of y / x.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"y": [0.0], "x": [1.0]}) >>> result = df.select( ... dfn.functions.atan2(dfn.col("y"), dfn.col("x")).alias("atan2")) >>> result.collect_column("atan2")[0].as_py() 0.0
- datafusion.functions.atanh(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns inverse hyperbolic tangent.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [0.0]}) >>> result = df.select(dfn.functions.atanh(dfn.col("a")).alias("atanh")) >>> result.collect_column("atanh")[0].as_py() 0.0
- datafusion.functions.avg(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Returns the average value.
This aggregate function expects a numeric expression and will return a float.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
expression – Values to average
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]}) >>> result = df.aggregate( ... [], [dfn.functions.avg( ... dfn.col("a") ... ).alias("v")]) >>> result.collect_column("v")[0].as_py() 2.0
>>> result = df.aggregate( ... [], [dfn.functions.avg( ... dfn.col("a"), ... filter=dfn.col("a") > dfn.lit(1.0) ... ).alias("v")]) >>> result.collect_column("v")[0].as_py() 2.5
- datafusion.functions.bit_and(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the bitwise AND of the argument.
This aggregate function will bitwise compare every value in the input partition.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
expression – Argument to perform bitwise calculation on
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [7, 3]}) >>> result = df.aggregate( ... [], [dfn.functions.bit_and( ... dfn.col("a") ... ).alias("v")]) >>> result.collect_column("v")[0].as_py() 3
>>> df = ctx.from_pydict({"a": [7, 5, 3]}) >>> result = df.aggregate( ... [], [dfn.functions.bit_and( ... dfn.col("a"), ... filter=dfn.col("a") > dfn.lit(3) ... ).alias("v")]) >>> result.collect_column("v")[0].as_py() 5
- datafusion.functions.bit_length(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the number of bits in the string argument.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": ["a","b","c"]}) >>> bit_df = df.select(dfn.functions.bit_length(dfn.col("a")).alias("bit_len")) >>> bit_df.collect_column("bit_len")[0].as_py() 8
- datafusion.functions.bit_or(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the bitwise OR of the argument.
This aggregate function will bitwise compare every value in the input partition.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
expression – Argument to perform bitwise calculation on
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1, 2]}) >>> result = df.aggregate( ... [], [dfn.functions.bit_or( ... dfn.col("a") ... ).alias("v")] ... ) >>> result.collect_column("v")[0].as_py() 3
>>> df = ctx.from_pydict({"a": [1, 2, 4]}) >>> result = df.aggregate( ... [], [dfn.functions.bit_or( ... dfn.col("a"), ... filter=dfn.col("a") > dfn.lit(1) ... ).alias("v")] ... ) >>> result.collect_column("v")[0].as_py() 6
- datafusion.functions.bit_xor(expression: datafusion.expr.Expr, distinct: bool = False, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the bitwise XOR of the argument.
This aggregate function will bitwise compare every value in the input partition.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by and null_treatment.
- Parameters:
expression – Argument to perform bitwise calculation on
distinct – If True, evaluate each unique value of expression only once
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [5, 3]}) >>> result = df.aggregate( ... [], [dfn.functions.bit_xor( ... dfn.col("a") ... ).alias("v")] ... ) >>> result.collect_column("v")[0].as_py() 6
>>> df = ctx.from_pydict({"a": [5, 5, 3]}) >>> result = df.aggregate( ... [], [dfn.functions.bit_xor( ... dfn.col("a"), distinct=True, ... filter=dfn.col("a") > dfn.lit(3), ... ).alias("v")] ... ) >>> result.collect_column("v")[0].as_py() 5
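The bit_and, bit_or, and bit_xor aggregates can each be modeled as a fold over the filtered input with the matching bitwise operator. The sketch below is plain Python, not DataFusion's implementation; bit_aggregate_sketch and its keep argument (standing in for filter) are hypothetical names.

```python
import operator
from functools import reduce


def bit_aggregate_sketch(values, op, distinct=False, keep=None):
    """Fold `values` with a bitwise operator after filtering."""
    rows = [v for v in values if v is not None and (keep is None or keep(v))]
    if distinct:
        rows = list(dict.fromkeys(rows))  # drop duplicates, keep order
    # Assumption: an empty input yields None (a null).
    return reduce(op, rows) if rows else None


print(bit_aggregate_sketch([7, 3], operator.and_))  # 3
print(bit_aggregate_sketch([1, 2], operator.or_))   # 3
print(bit_aggregate_sketch([5, 3], operator.xor))   # 6
# distinct matters for XOR because equal values cancel each other out.
print(bit_aggregate_sketch([5, 5, 3], operator.xor))                 # 3
print(bit_aggregate_sketch([5, 5, 3], operator.xor, distinct=True))  # 6
```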
- datafusion.functions.bool_and(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the boolean AND of the argument.
This aggregate function will compare every value in the input partition. These are expected to be boolean values.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
expression – Argument to perform calculation on
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [True, True, False]}) >>> result = df.aggregate( ... [], [dfn.functions.bool_and( ... dfn.col("a") ... ).alias("v")] ... ) >>> result.collect_column("v")[0].as_py() False
>>> df = ctx.from_pydict( ... {"a": [True, True, False], "b": [1, 2, 3]}) >>> result = df.aggregate( ... [], [dfn.functions.bool_and( ... dfn.col("a"), ... filter=dfn.col("b") < dfn.lit(3) ... ).alias("v")] ... ) >>> result.collect_column("v")[0].as_py() True
- datafusion.functions.bool_or(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the boolean OR of the argument.
This aggregate function will compare every value in the input partition. These are expected to be boolean values.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
expression – Argument to perform calculation on
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [False, False, True]}) >>> result = df.aggregate( ... [], [dfn.functions.bool_or( ... dfn.col("a") ... ).alias("v")] ... ) >>> result.collect_column("v")[0].as_py() True
>>> df = ctx.from_pydict( ... {"a": [False, False, True], "b": [1, 2, 3]}) >>> result = df.aggregate( ... [], [dfn.functions.bool_or( ... dfn.col("a"), ... filter=dfn.col("b") < dfn.lit(3) ... ).alias("v")] ... ) >>> result.collect_column("v")[0].as_py() False
- datafusion.functions.btrim(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Removes all characters, spaces by default, from both sides of a string.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [" a "]}) >>> trim_df = df.select(dfn.functions.btrim(dfn.col("a")).alias("trimmed")) >>> trim_df.collect_column("trimmed")[0].as_py() 'a'
- datafusion.functions.cardinality(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the total number of elements in the array.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [[1, 2, 3]]}) >>> result = df.select(dfn.functions.cardinality(dfn.col("a")).alias("result")) >>> result.collect_column("result")[0].as_py() 3
- datafusion.functions.case(expr: datafusion.expr.Expr) datafusion.expr.CaseBuilder¶
Create a case expression.
Create a CaseBuilder to match cases for the expression expr. See CaseBuilder for detailed usage.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1, 2, 3]}) >>> result = df.select( ... dfn.functions.case(dfn.col("a")).when(dfn.lit(1), ... dfn.lit("one")).otherwise(dfn.lit("other")).alias("c")) >>> result.collect_column("c")[0].as_py() 'one'
- datafusion.functions.cbrt(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the cube root of a number.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [27]}) >>> cbrt_df = df.select(dfn.functions.cbrt(dfn.col("a")).alias("cbrt")) >>> cbrt_df.collect_column("cbrt")[0].as_py() 3.0
- datafusion.functions.ceil(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the nearest integer greater than or equal to argument.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1.9]}) >>> ceil_df = df.select(dfn.functions.ceil(dfn.col("a")).alias("ceil")) >>> ceil_df.collect_column("ceil")[0].as_py() 2.0
- datafusion.functions.char_length(string: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the number of characters in the string.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": ["hello"]}) >>> result = df.select(dfn.functions.char_length(dfn.col("a")).alias("len")) >>> result.collect_column("len")[0].as_py() 5
- datafusion.functions.character_length(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the number of characters in the argument.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": ["abc","b","c"]}) >>> char_len_df = df.select( ... dfn.functions.character_length(dfn.col("a")).alias("char_len")) >>> char_len_df.collect_column("char_len")[0].as_py() 3
- datafusion.functions.chr(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts the Unicode code point to a UTF8 character.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [65]}) >>> result = df.select(dfn.functions.chr(dfn.col("a")).alias("chr")) >>> result.collect_column("chr")[0].as_py() 'A'
- datafusion.functions.coalesce(*args: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the value of the first expression in args that is not NULL.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [None, 1], "b": [2, 3]}) >>> result = df.select( ... dfn.functions.coalesce(dfn.col("a"), dfn.col("b")).alias("c")) >>> result.collect_column("c")[0].as_py() 2
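Evaluated row by row, coalesce picks the first non-null argument. The plain-Python sketch below models that on the same data as the example above; coalesce_sketch is a hypothetical helper, with None standing in for NULL.

```python
def coalesce_sketch(*values):
    """Return the first value that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


a = [None, 1]
b = [2, 3]
# Apply the function once per row, pairing up the two columns.
print([coalesce_sketch(x, y) for x, y in zip(a, b)])  # [2, 1]
```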
- datafusion.functions.col(name: str) datafusion.expr.Expr¶
Creates a column reference expression.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1, 2, 3]}) >>> df.select(dfn.functions.col("a")).collect_column("a")[0].as_py() 1
- datafusion.functions.concat(*args: datafusion.expr.Expr) datafusion.expr.Expr¶
Concatenates the text representations of all the arguments.
NULL arguments are ignored.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": ["hello"], "b": [" world"]}) >>> result = df.select( ... dfn.functions.concat(dfn.col("a"), dfn.col("b")).alias("c") ... ) >>> result.collect_column("c")[0].as_py() 'hello world'
- datafusion.functions.concat_ws(separator: str, *args: datafusion.expr.Expr) datafusion.expr.Expr¶
Concatenates the list args with the separator.
NULL arguments are ignored. separator should not be NULL.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": ["hello"], "b": ["world"]}) >>> result = df.select( ... dfn.functions.concat_ws("-", dfn.col("a"), dfn.col("b")).alias("c")) >>> result.collect_column("c")[0].as_py() 'hello-world'
- datafusion.functions.contains(string: datafusion.expr.Expr, search_str: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns true if search_str is found within string (case-sensitive).
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": ["the quick brown fox"]}) >>> result = df.select( ... dfn.functions.contains(dfn.col("a"), dfn.lit("brown")).alias("c")) >>> result.collect_column("c")[0].as_py() True
- datafusion.functions.corr(value_y: datafusion.expr.Expr, value_x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Returns the correlation coefficient between value_y and value_x.
This aggregate function expects both values to be numeric and will return a float.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
value_y – The dependent variable for correlation
value_x – The independent variable for correlation
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0], "b": [1.0, 2.0, 3.0]}) >>> result = df.aggregate( ... [], [dfn.functions.corr( ... dfn.col("a"), dfn.col("b") ... ).alias("v")]) >>> result.collect_column("v")[0].as_py() 1.0
>>> result = df.aggregate( ... [], [dfn.functions.corr( ... dfn.col("a"), dfn.col("b"), ... filter=dfn.col("a") > dfn.lit(1.0) ... ).alias("v")]) >>> result.collect_column("v")[0].as_py() 1.0
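For reference, the Pearson correlation coefficient that the examples above are consistent with can be computed by hand. This is a plain-Python sketch under the assumption that corr is the Pearson coefficient; corr_sketch is a hypothetical helper, and it does not handle constant inputs (zero variance).

```python
import math


def corr_sketch(ys, xs):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_x = sum(xs) / n
    # Sum of co-deviations and the two sums of squared deviations.
    co = sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    return co / (dev_y * dev_x)


print(corr_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(corr_sketch([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

The first call should print a value of (approximately) 1.0 and the second (approximately) -1.0, up to floating-point rounding.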
- datafusion.functions.cos(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the cosine of the argument.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [0,-1,1]}) >>> cos_df = df.select(dfn.functions.cos(dfn.col("a")).alias("cos")) >>> cos_df.collect_column("cos")[0].as_py() 1.0
- datafusion.functions.cosh(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the hyperbolic cosine of the argument.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [0,-1,1]}) >>> cosh_df = df.select(dfn.functions.cosh(dfn.col("a")).alias("cosh")) >>> cosh_df.collect_column("cosh")[0].as_py() 1.0
- datafusion.functions.cot(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the cotangent of the argument.
Examples
>>> from math import pi >>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [pi / 4]}) >>> result = df.select( ... dfn.functions.cot(dfn.col("a")).alias("cot") ... ) >>> result.collect_column("cot")[0].as_py() 1.0...
- datafusion.functions.count(expressions: datafusion.expr.Expr | list[datafusion.expr.Expr] | None = None, distinct: bool = False, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Returns the number of rows that match the given arguments.
This aggregate function will count the non-null rows provided in the expression.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by and null_treatment.
- Parameters:
expressions – Expressions whose non-null values are counted
distinct – If True, a single entry for each distinct value will be in the result
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1, 2, 3]}) >>> result = df.aggregate( ... [], [dfn.functions.count( ... dfn.col("a") ... ).alias("v")]) >>> result.collect_column("v")[0].as_py() 3
>>> df = ctx.from_pydict({"a": [1, 1, 2, 3]}) >>> result = df.aggregate( ... [], [dfn.functions.count( ... dfn.col("a"), distinct=True, ... filter=dfn.col("a") > dfn.lit(1), ... ).alias("v")]) >>> result.collect_column("v")[0].as_py() 2
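The combined effect of distinct and filter can be modeled in plain Python: filter first, drop nulls, then optionally deduplicate. This is a sketch of the documented behavior for a single expression only; count_sketch and its keep argument (standing in for filter) are hypothetical.

```python
def count_sketch(values, distinct=False, keep=None):
    """Count non-null values that pass `keep`, optionally deduplicated."""
    rows = [v for v in values if v is not None and (keep is None or keep(v))]
    if distinct:
        rows = set(rows)
    return len(rows)


print(count_sketch([1, 2, 3]))                                        # 3
print(count_sketch([1, 1, 2, 3], distinct=True, keep=lambda v: v > 1))  # 2
print(count_sketch([1, None, 3]))                                     # 2
```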
- datafusion.functions.count_star(filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Create a COUNT(1) aggregate expression.
This aggregate function will count all of the rows in the partition.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, distinct, and null_treatment.
- Parameters:
filter – If provided, only count rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1, 2, 3]}) >>> result = df.aggregate( ... [], [dfn.functions.count_star( ... ).alias("cnt")]) >>> result.collect_column("cnt")[0].as_py() 3
>>> result = df.aggregate( ... [], [dfn.functions.count_star( ... filter=dfn.col("a") > dfn.lit(1) ... ).alias("cnt")]) >>> result.collect_column("cnt")[0].as_py() 2
- datafusion.functions.covar(value_y: datafusion.expr.Expr, value_x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the sample covariance.
See also
This is an alias for covar_samp().
- datafusion.functions.covar_pop(value_y: datafusion.expr.Expr, value_x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the population covariance.
This aggregate function expects both values to be numeric and will return a float.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
value_y – The dependent variable for covariance
value_x – The independent variable for covariance
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1.0, 5.0, 10.0], "b": [1.0, 2.0, 3.0]}) >>> result = df.aggregate( ... [], ... [dfn.functions.covar_pop( ... dfn.col("a"), dfn.col("b") ... ).alias("v")] ... ) >>> result.collect_column("v")[0].as_py() 3.0
>>> df = ctx.from_pydict( ... {"a": [0.0, 1.0, 3.0], "b": [0.0, 1.0, 3.0]}) >>> result = df.aggregate( ... [], ... [dfn.functions.covar_pop( ... dfn.col("a"), dfn.col("b"), ... filter=dfn.col("a") > dfn.lit(0.0) ... ).alias("v")] ... ) >>> result.collect_column("v")[0].as_py() 1.0
- datafusion.functions.covar_samp(value_y: datafusion.expr.Expr, value_x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the sample covariance.
This aggregate function expects both values to be numeric and will return a float.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
value_y – The dependent variable for covariance
value_x – The independent variable for covariance
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}) >>> result = df.aggregate( ... [], [dfn.functions.covar_samp( ... dfn.col("a"), dfn.col("b") ... ).alias("v")]) >>> result.collect_column("v")[0].as_py() 1.0
>>> result = df.aggregate( ... [], [dfn.functions.covar_samp( ... dfn.col("a"), dfn.col("b"), ... filter=dfn.col("a") > dfn.lit(1.0) ... ).alias("v")]) >>> result.collect_column("v")[0].as_py() 0.5
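The only difference between population and sample covariance is the divisor: n for covar_pop and n - 1 for covar_samp. The plain-Python sketch below reproduces the example values above; covariance_sketch is a hypothetical helper, not DataFusion's implementation.

```python
def covariance_sketch(ys, xs, sample=True):
    """Covariance of two equal-length lists; sample=False gives population."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_x = sum(xs) / n
    total = sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs))
    # Sample covariance divides by n - 1, population covariance by n.
    return total / (n - 1 if sample else n)


print(covariance_sketch([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))                 # covar_samp
print(covariance_sketch([1.0, 5.0, 10.0], [1.0, 2.0, 3.0], sample=False))  # covar_pop
```

The two calls correspond to the first covar_samp and covar_pop examples, which report 1.0 and 3.0 respectively; the sketch may differ in the last floating-point digits.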
- datafusion.functions.cume_dist(partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr¶
Create a cumulative distribution window function.
This window function is similar to rank() except that the returned values are the ratio of the number of rows at or before the current row (counting ties) to the total number of rows. Here is an example of a dataframe with a window ordered by descending points and the associated cumulative distribution:
+--------+-----------+
| points | cume_dist |
+--------+-----------+
| 100    | 0.5       |
| 100    | 0.5       |
| 50     | 0.75      |
| 25     | 1.0       |
+--------+-----------+
- Parameters:
partition_by – Expressions to partition the window frame on.
order_by – Set ordering within the window frame. Accepts column names or expressions.
Examples
>>> ctx = dfn.SessionContext() >>> df = ctx.from_pydict({"a": [1., 2., 2., 3.]}) >>> result = df.select( ... dfn.col("a"), ... dfn.functions.cume_dist( ... order_by="a" ... ).alias("cd") ... ) >>> result.collect_column("cd").to_pylist() [0.25..., 0.75..., 0.75..., 1.0...]
>>> df = ctx.from_pydict( ... {"g": ["a", "a", "b", "b"], "v": [1, 2, 3, 4]}) >>> result = df.select( ... dfn.col("g"), dfn.col("v"), ... dfn.functions.cume_dist( ... partition_by=dfn.col("g"), order_by="v", ... ).alias("cd")) >>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("cd").to_pylist() [0.5, 1.0, 0.5, 1.0]
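The values in the table and examples above follow from counting, for each row, how many rows sort at or before it (ties included) and dividing by the partition size. A plain-Python sketch for a single partition follows; cume_dist_sketch is a hypothetical helper and uses a simple quadratic scan for clarity.

```python
def cume_dist_sketch(values, descending=False):
    """Cumulative distribution of each value within one partition."""
    n = len(values)
    result = []
    for v in values:
        if descending:
            at_or_before = sum(1 for other in values if other >= v)
        else:
            at_or_before = sum(1 for other in values if other <= v)
        result.append(at_or_before / n)
    return result


print(cume_dist_sketch([1.0, 2.0, 2.0, 3.0]))               # [0.25, 0.75, 0.75, 1.0]
print(cume_dist_sketch([100, 100, 50, 25], descending=True))  # [0.5, 0.5, 0.75, 1.0]
```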
- datafusion.functions.current_date() datafusion.expr.Expr¶
Returns current UTC date as a Date32 value.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.current_date().alias("d")
... )
>>> result.collect_column("d")[0].as_py() is not None
True
- datafusion.functions.current_time() datafusion.expr.Expr¶
Returns current UTC time as a Time64 value.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.current_time().alias("t")
... )

Use .value instead of .as_py() because nanosecond timestamps require pandas to convert to Python datetime objects.

>>> result.collect_column("t")[0].value > 0
True
- datafusion.functions.current_timestamp() datafusion.expr.Expr¶
Returns the current timestamp in nanoseconds.
See also
This is an alias for
now().
- datafusion.functions.date_bin(stride: datafusion.expr.Expr, source: datafusion.expr.Expr, origin: datafusion.expr.Expr) datafusion.expr.Expr¶
Coerces an arbitrary timestamp to the start of the nearest specified interval.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"timestamp": ['2021-07-15 12:34:56', '2021-01-01']})
>>> result = df.select(
...     dfn.functions.date_bin(
...         dfn.string_literal("15 minutes"),
...         dfn.col("timestamp"),
...         dfn.string_literal("2001-01-01 00:00:00")
...     ).alias("b")
... )
>>> str(result.collect_column("b")[0].as_py())
'2021-07-15 12:30:00'
>>> str(result.collect_column("b")[1].as_py())
'2021-01-01 00:00:00'
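To make the binning arithmetic concrete, here is a small sketch using only Python's standard library. It assumes that the source timestamp is floored to the start of the stride-wide bin that contains it, with bins counted from origin; this is a model of the behavior shown in the example, not the library's code.

```python
from datetime import datetime, timedelta


def date_bin(stride: timedelta, source: datetime, origin: datetime) -> datetime:
    """Floor `source` to the start of its bin (illustrative helper)."""
    # Floor division of two timedeltas yields an int and rounds toward
    # negative infinity, so timestamps before `origin` are handled too.
    bins = (source - origin) // stride
    return origin + bins * stride


origin = datetime(2001, 1, 1)
print(date_bin(timedelta(minutes=15), datetime(2021, 7, 15, 12, 34, 56), origin))
# 2021-07-15 12:30:00
```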
- datafusion.functions.date_format(arg: datafusion.expr.Expr, formatter: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns a string representation of a date, time, timestamp or duration.
See also
This is an alias for
to_char().
- datafusion.functions.date_part(part: datafusion.expr.Expr, date: datafusion.expr.Expr) datafusion.expr.Expr¶
Extracts a subfield from the date.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-07-15T00:00:00"]})
>>> df = df.select(dfn.functions.to_timestamp(dfn.col("a")).alias("a"))
>>> result = df.select(
...     dfn.functions.date_part(dfn.lit("year"), dfn.col("a")).alias("y"))
>>> result.collect_column("y")[0].as_py()
2021
- datafusion.functions.date_trunc(part: datafusion.expr.Expr, date: datafusion.expr.Expr) datafusion.expr.Expr¶
Truncates the date to a specified level of precision.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-07-15T12:34:56"]})
>>> df = df.select(dfn.functions.to_timestamp(dfn.col("a")).alias("a"))
>>> result = df.select(
...     dfn.functions.date_trunc(
...         dfn.lit("month"), dfn.col("a")
...     ).alias("t")
... )
>>> str(result.collect_column("t")[0].as_py())
'2021-07-01 00:00:00'
- datafusion.functions.datepart(part: datafusion.expr.Expr, date: datafusion.expr.Expr) datafusion.expr.Expr¶
Return a specified part of a date.
See also
This is an alias for
date_part().
- datafusion.functions.datetrunc(part: datafusion.expr.Expr, date: datafusion.expr.Expr) datafusion.expr.Expr¶
Truncates the date to a specified level of precision.
See also
This is an alias for
date_trunc().
- datafusion.functions.decode(expr: datafusion.expr.Expr, encoding: datafusion.expr.Expr) datafusion.expr.Expr¶
Decode the input, using the encoding. encoding can be base64 or hex.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["aGVsbG8="]})
>>> result = df.select(
...     dfn.functions.decode(dfn.col("a"), dfn.lit("base64")).alias("dec"))
>>> result.collect_column("dec")[0].as_py()
b'hello'
- datafusion.functions.degrees(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts the argument from radians to degrees.
Examples
>>> from math import pi
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0,pi,2*pi]})
>>> deg_df = df.select(dfn.functions.degrees(dfn.col("a")).alias("deg"))
>>> deg_df.collect_column("deg")[2].as_py()
360.0
- datafusion.functions.dense_rank(partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr¶
Create a dense_rank window function.
This window function is similar to rank() except that the returned values will be consecutive. Here is an example of a dataframe with a window ordered by descending points and the associated dense rank:

+--------+------------+
| points | dense_rank |
+--------+------------+
| 100    | 1          |
| 100    | 1          |
| 50     | 2          |
| 25     | 3          |
+--------+------------+
- Parameters:
partition_by – Expressions to partition the window frame on.
order_by – Set ordering within the window frame. Accepts column names or expressions.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 10, 20]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.dense_rank(
...         order_by="a"
...     ).alias("dr"))
>>> result.sort(dfn.col("a")).collect_column("dr").to_pylist()
[1, 1, 2]

>>> df = ctx.from_pydict(
...     {"g": ["a", "a", "b", "b"], "v": [1, 1, 2, 3]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.dense_rank(
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("dr"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("dr").to_pylist()
[1, 1, 1, 2]
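The "consecutive" ranking can be modeled in a few lines of plain Python. This is only a sketch of the semantics for one partition (the helper name and `descending` flag are illustrative), not how DataFusion computes it.

```python
def dense_rank(values, descending=False):
    """Model of dense_rank over one partition (illustrative helper).

    Ties share a rank and the next distinct value gets the next integer,
    so ranks never skip.
    """
    ordered = sorted(set(values), reverse=descending)
    rank_of = {value: position + 1 for position, value in enumerate(ordered)}
    return [rank_of[value] for value in values]


print(dense_rank([100, 100, 50, 25], descending=True))  # [1, 1, 2, 3]
print(dense_rank([10, 10, 20]))                         # [1, 1, 2]
```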
- datafusion.functions.digest(value: datafusion.expr.Expr, method: datafusion.expr.Expr) datafusion.expr.Expr¶
Computes the binary hash of an expression using the specified algorithm.
Standard algorithms are md5, sha224, sha256, sha384, sha512, blake2s, blake2b, and blake3.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.digest(dfn.col("a"), dfn.lit("md5")).alias("d"))
>>> len(result.collect_column("d")[0].as_py()) > 0
True
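For a sense of what a binary hash looks like, the sketch below computes raw digests with Python's standard hashlib module rather than with DataFusion. It covers only algorithms hashlib provides (blake3 is not among them), and the helper name is illustrative.

```python
import hashlib


def digest(value: bytes, method: str) -> bytes:
    """Return the raw hash bytes of `value` (illustrative hashlib wrapper)."""
    return hashlib.new(method, value).digest()


# Each algorithm produces a fixed number of bytes regardless of input size.
sizes = {m: len(digest(b"hello", m)) for m in ("md5", "sha256", "sha512", "blake2s")}
print(sizes)  # {'md5': 16, 'sha256': 32, 'sha512': 64, 'blake2s': 32}
```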
- datafusion.functions.element_at(map: datafusion.expr.Expr, key: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the value for a given key in the map.
Returns [None] if the key is absent.
See also
This is an alias for
map_extract().
- datafusion.functions.empty(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns true if the array is empty.
See also
This is an alias for
array_empty().
- datafusion.functions.encode(expr: datafusion.expr.Expr, encoding: datafusion.expr.Expr) datafusion.expr.Expr¶
Encode the input, using the encoding. encoding can be base64 or hex.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.encode(dfn.col("a"), dfn.lit("base64")).alias("enc"))
>>> result.collect_column("enc")[0].as_py()
'aGVsbG8'
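Note that the base64 output in the example carries no trailing = padding, while the decode example takes the padded form. The standard-library sketch below shows how the two forms relate; the helper name is illustrative and nothing here calls DataFusion.

```python
import base64

raw = b"hello"
padded = base64.b64encode(raw).decode("ascii")  # standard form with padding
unpadded = padded.rstrip("=")                   # form shown in the example above
hexed = raw.hex()                               # the other supported encoding


def b64decode_any(text: str) -> bytes:
    """Decode base64 text whether or not its padding is present."""
    return base64.b64decode(text + "=" * (-len(text) % 4))


print(padded, unpadded, hexed)  # aGVsbG8= aGVsbG8 68656c6c6f
```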
- datafusion.functions.ends_with(arg: datafusion.expr.Expr, suffix: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns true if the string ends with the suffix, false otherwise.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["abc","b","c"]})
>>> ends_with_df = df.select(
...     dfn.functions.ends_with(dfn.col("a"), dfn.lit("c")).alias("ends_with"))
>>> ends_with_df.collect_column("ends_with")[0].as_py()
True
- datafusion.functions.exp(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the exponential of the argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.exp(dfn.col("a")).alias("exp"))
>>> result.collect_column("exp")[0].as_py()
1.0
- datafusion.functions.extract(part: datafusion.expr.Expr, date: datafusion.expr.Expr) datafusion.expr.Expr¶
Extracts a subfield from the date.
See also
This is an alias for
date_part().
- datafusion.functions.factorial(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the factorial of the argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [3]})
>>> result = df.select(
...     dfn.functions.factorial(dfn.col("a")).alias("factorial")
... )
>>> result.collect_column("factorial")[0].as_py()
6
- datafusion.functions.find_in_set(string: datafusion.expr.Expr, string_list: datafusion.expr.Expr) datafusion.expr.Expr¶
Find a string in a list of strings.
Returns a value in the range of 1 to N if the string is in the string list string_list consisting of N substrings.

The string list is a string composed of substrings separated by , characters.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["b"]})
>>> result = df.select(
...     dfn.functions.find_in_set(dfn.col("a"), dfn.lit("a,b,c")).alias("pos"))
>>> result.collect_column("pos")[0].as_py()
2
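The lookup can be sketched in plain Python as a 1-based search over the comma-separated parts. Returning 0 when the string is absent is an assumption of this sketch; the description above only covers the found case.

```python
def find_in_set(string: str, string_list: str) -> int:
    """Model of find_in_set (illustrative helper).

    Returns the 1-based position of `string` among the comma-separated
    parts of `string_list`, or 0 if it is absent (assumed).
    """
    parts = string_list.split(",")
    return parts.index(string) + 1 if string in parts else 0


print(find_in_set("b", "a,b,c"))  # 2
```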
- datafusion.functions.first_value(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None, null_treatment: datafusion.common.NullTreatment = NullTreatment.RESPECT_NULLS) datafusion.expr.Expr¶
Returns the first value in a group of values.
This aggregate function will return the first value in the partition.
If using the builder functions described in the aggregation section of the online documentation, this function ignores the distinct option.
- Parameters:
expression – Argument to return the first value of
filter – If provided, only compute against rows for which the filter is True
order_by – Set the ordering of the expression to evaluate. Accepts column names or expressions.
null_treatment – Assign whether to respect or ignore null values.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30]})
>>> result = df.aggregate(
...     [], [dfn.functions.first_value(
...         dfn.col("a")
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
10

>>> df = ctx.from_pydict({"a": [None, 20, 10]})
>>> result = df.aggregate(
...     [], [dfn.functions.first_value(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(10),
...         order_by="a",
...         null_treatment=dfn.common.NullTreatment.IGNORE_NULLS,
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
20
- datafusion.functions.flatten(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Flattens an array of arrays into a single array.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [[[1, 2], [3, 4]]]})
>>> result = df.select(dfn.functions.flatten(dfn.col("a")).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2, 3, 4]
- datafusion.functions.floor(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the nearest integer less than or equal to the argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.9]})
>>> floor_df = df.select(dfn.functions.floor(dfn.col("a")).alias("floor"))
>>> floor_df.collect_column("floor")[0].as_py()
1.0
- datafusion.functions.from_unixtime(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts an integer to RFC3339 timestamp format string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0]})
>>> result = df.select(
...     dfn.functions.from_unixtime(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'1970-01-01 00:00:00'
- datafusion.functions.gcd(x: datafusion.expr.Expr, y: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the greatest common divisor.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [12], "b": [8]})
>>> result = df.select(
...     dfn.functions.gcd(dfn.col("a"), dfn.col("b")).alias("gcd")
... )
>>> result.collect_column("gcd")[0].as_py()
4
- datafusion.functions.gen_series(start: datafusion.expr.Expr, stop: datafusion.expr.Expr, step: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Creates a list of values in the range between start and stop.
Unlike range(), this includes the upper bound.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0]})
>>> result = df.select(
...     dfn.functions.gen_series(
...         dfn.lit(1), dfn.lit(5),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 2, 3, 4, 5]

Specify a custom step:

>>> result = df.select(
...     dfn.functions.gen_series(
...         dfn.lit(1), dfn.lit(10), step=dfn.lit(3),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
[1, 4, 7, 10]
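The inclusive upper bound is the only difference from an exclusive range. The sketch below models it with Python's built-in range, which is likewise exclusive; it is not the DataFusion range() function, and the helper name is illustrative.

```python
def gen_series(start: int, stop: int, step: int = 1) -> list[int]:
    """Model of an inclusive integer series (illustrative helper)."""
    # Push the exclusive bound one unit past `stop` in the direction of travel.
    return list(range(start, stop + (1 if step > 0 else -1), step))


print(list(range(1, 5)))      # [1, 2, 3, 4]       upper bound excluded
print(gen_series(1, 5))       # [1, 2, 3, 4, 5]    upper bound included
print(gen_series(1, 10, 3))   # [1, 4, 7, 10]
```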
- datafusion.functions.generate_series(start: datafusion.expr.Expr, stop: datafusion.expr.Expr, step: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Creates a list of values in the range between start and stop.
Unlike
range(), this includes the upper bound.See also
This is an alias for
gen_series().
- datafusion.functions.get_field(expr: datafusion.expr.Expr, name: datafusion.expr.Expr | str) datafusion.expr.Expr¶
Extracts a field from a struct or map by name.
When the field name is a static string, the bracket operator expr["field"] is a convenient shorthand. Use get_field when the field name is a dynamic expression.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1], "b": [2]})
>>> df = df.with_column(
...     "s",
...     dfn.functions.named_struct(
...         [("x", dfn.col("a")), ("y", dfn.col("b"))]
...     ),
... )
>>> result = df.select(
...     dfn.functions.get_field(dfn.col("s"), "x").alias("x_val")
... )
>>> result.collect_column("x_val")[0].as_py()
1

Equivalent using bracket syntax:

>>> result = df.select(
...     dfn.col("s")["x"].alias("x_val")
... )
>>> result.collect_column("x_val")[0].as_py()
1
- datafusion.functions.greatest(*args: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the greatest value from a list of expressions.
Returns NULL if all expressions are NULL.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 3], "b": [2, 1]})
>>> result = df.select(
...     dfn.functions.greatest(dfn.col("a"), dfn.col("b")).alias("greatest"))
>>> result.collect_column("greatest")[0].as_py()
2
>>> result.collect_column("greatest")[1].as_py()
3
- datafusion.functions.grouping(expression: datafusion.expr.Expr, distinct: bool = False, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Indicates whether a column is aggregated across in the current row.
Returns 0 when the column is part of the grouping key for that row (i.e., the row contains per-group results for that column). Returns 1 when the column is not part of the grouping key (i.e., the row’s aggregate spans all values of that column).
This function is meaningful with GroupingSet.rollup, GroupingSet.cube, or GroupingSet.grouping_sets, where different rows are grouped by different subsets of columns. In a default aggregation without grouping sets every column is always part of the key, so grouping() always returns 0.
Warning
Due to an upstream DataFusion limitation (#21411), .alias() cannot be applied directly to a grouping() expression. Doing so will raise an error at execution time. To rename the column, use with_column_renamed() on the result DataFrame instead.
- Parameters:
expression – The column to check grouping status for
distinct – If True, compute on distinct values only
filter – If provided, only compute against rows for which the filter is True
Examples
With rollup(), the result includes both per-group rows (grouping(a) = 0) and a grand-total row where a is aggregated across (grouping(a) = 1):

>>> from datafusion.expr import GroupingSet
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 1, 2], "b": [10, 20, 30]})
>>> result = df.aggregate(
...     [GroupingSet.rollup(dfn.col("a"))],
...     [dfn.functions.sum(dfn.col("b")).alias("s"),
...      dfn.functions.grouping(dfn.col("a"))],
... ).sort(dfn.col("a").sort(nulls_first=False))
>>> result.collect_column("s").to_pylist()
[30, 30, 60]
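To see where the 0 and 1 flags come from, the rollup can be modeled in plain Python: one row per value of a, plus a grand-total row in which a is no longer part of the key. This is a sketch of the semantics with illustrative names, not DataFusion code.

```python
from collections import defaultdict

rows = [(1, 10), (1, 20), (2, 30)]  # (a, b) pairs from the example above


def rollup_sum(rows):
    """Return (a, sum_b, grouping_a) tuples for a rollup over column a."""
    per_group = defaultdict(int)
    for a, b in rows:
        per_group[a] += b
    # Per-group rows: a is part of the grouping key, so the flag is 0.
    result = [(a, total, 0) for a, total in sorted(per_group.items())]
    # Grand-total row: a is aggregated across, so the flag is 1.
    result.append((None, sum(b for _, b in rows), 1))
    return result


print(rollup_sum(rows))  # [(1, 30, 0), (2, 30, 0), (None, 60, 1)]
```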
- datafusion.functions.ifnull(x: datafusion.expr.Expr, y: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns x if x is not NULL. Otherwise returns y.
See also
This is an alias for
nvl().
- datafusion.functions.in_list(arg: datafusion.expr.Expr, values: list[datafusion.expr.Expr], negated: bool = False) datafusion.expr.Expr¶
Returns whether the argument is contained within the list values.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.select(
...     dfn.functions.in_list(
...         dfn.col("a"), [dfn.lit(1), dfn.lit(3)]
...     ).alias("in")
... )
>>> result.collect_column("in").to_pylist()
[True, False, True]

>>> result = df.select(
...     dfn.functions.in_list(
...         dfn.col("a"), [dfn.lit(1), dfn.lit(3)],
...         negated=True,
...     ).alias("not_in")
... )
>>> result.collect_column("not_in").to_pylist()
[False, True, False]
- datafusion.functions.initcap(string: datafusion.expr.Expr) datafusion.expr.Expr¶
Set the initial letter of each word to capital.
Converts the first letter of each word in string to uppercase and the remaining characters to lowercase.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["the cat"]})
>>> cap_df = df.select(dfn.functions.initcap(dfn.col("a")).alias("cap"))
>>> cap_df.collect_column("cap")[0].as_py()
'The Cat'
- datafusion.functions.isnan(expr: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns true if a given number is +NaN or -NaN; otherwise returns false.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, np.nan]})
>>> result = df.select(dfn.functions.isnan(dfn.col("a")).alias("isnan"))
>>> result.collect_column("isnan")[1].as_py()
True
- datafusion.functions.iszero(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns true if a given number is +0.0 or -0.0; otherwise returns false.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0, 1.0]})
>>> result = df.select(dfn.functions.iszero(dfn.col("a")).alias("iz"))
>>> result.collect_column("iz")[0].as_py()
True
- datafusion.functions.lag(arg: datafusion.expr.Expr, shift_offset: int = 1, default_value: Any | None = None, partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr¶
Create a lag window function.
Lag operation will return the argument that is in the previous shift_offset-th row in the partition. For example lag(col("b"), shift_offset=3, default_value=5) will return the 3rd previous value in column b. At the beginning of the partition, where no values can be returned, it will return the default value of 5.

Here is an example of both the lag and datafusion.functions.lead() functions on a simple DataFrame:

+--------+------+-----+
| points | lead | lag |
+--------+------+-----+
| 100    | 100  |     |
| 100    | 50   | 100 |
| 50     | 25   | 100 |
| 25     |      | 50  |
+--------+------+-----+
- Parameters:
arg – Value to return
shift_offset – Number of rows before the current row.
default_value – Value to return if the shift_offset row does not exist.
partition_by – Expressions to partition the window frame on.
order_by – Set ordering within the window frame. Accepts column names or expressions.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.lag(
...         dfn.col("a"), shift_offset=1,
...         default_value=0, order_by="a"
...     ).alias("lag"))
>>> result.sort(dfn.col("a")).collect_column("lag").to_pylist()
[0, 1, 2]

>>> df = ctx.from_pydict({"g": ["a", "a", "b"], "v": [1, 2, 3]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.lag(
...         dfn.col("v"), shift_offset=1, default_value=0,
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("lag"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("lag").to_pylist()
[0, 1, 0]
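The lead and lag columns in the table above can be reproduced with one small helper. This plain-Python sketch models a single, already ordered partition; the `shift` function is illustrative and not part of the library.

```python
def shift(values, offset, default=None):
    """Model of lag/lead over one ordered partition (illustrative helper).

    A positive offset looks back (lag); a negative offset looks ahead (lead).
    Rows whose source position falls outside the partition get `default`.
    """
    n = len(values)
    return [values[i - offset] if 0 <= i - offset < n else default for i in range(n)]


points = [100, 100, 50, 25]
print(shift(points, 1))    # lag:  [None, 100, 100, 50]
print(shift(points, -1))   # lead: [100, 50, 25, None]
```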
- datafusion.functions.last_value(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None, null_treatment: datafusion.common.NullTreatment = NullTreatment.RESPECT_NULLS) datafusion.expr.Expr¶
Returns the last value in a group of values.
This aggregate function will return the last value in the partition.
If using the builder functions described in the aggregation section of the online documentation, this function ignores the distinct option.
- Parameters:
expression – Argument to return the last value of
filter – If provided, only compute against rows for which the filter is True
order_by – Set the ordering of the expression to evaluate. Accepts column names or expressions.
null_treatment – Assign whether to respect or ignore null values.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30]})
>>> result = df.aggregate(
...     [], [dfn.functions.last_value(
...         dfn.col("a")
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
30

>>> df = ctx.from_pydict({"a": [None, 20, 10]})
>>> result = df.aggregate(
...     [], [dfn.functions.last_value(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(10),
...         order_by="a",
...         null_treatment=dfn.common.NullTreatment.IGNORE_NULLS,
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
20
- datafusion.functions.lcm(x: datafusion.expr.Expr, y: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the least common multiple.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [4], "b": [6]})
>>> result = df.select(
...     dfn.functions.lcm(dfn.col("a"), dfn.col("b")).alias("lcm")
... )
>>> result.collect_column("lcm")[0].as_py()
12
- datafusion.functions.lead(arg: datafusion.expr.Expr, shift_offset: int = 1, default_value: Any | None = None, partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr¶
Create a lead window function.
Lead operation will return the argument that is in the next shift_offset-th row in the partition. For example lead(col("b"), shift_offset=3, default_value=5) will return the 3rd following value in column b. At the end of the partition, where no further values can be returned, it will return the default value of 5.

Here is an example of both the lead and datafusion.functions.lag() functions on a simple DataFrame:

+--------+------+-----+
| points | lead | lag |
+--------+------+-----+
| 100    | 100  |     |
| 100    | 50   | 100 |
| 50     | 25   | 100 |
| 25     |      | 50  |
+--------+------+-----+

To set window function parameters use the window builder approach described in the window functions section of the online documentation.
- Parameters:
arg – Value to return
shift_offset – Number of rows following the current row.
default_value – Value to return if the shift_offset row does not exist.
partition_by – Expressions to partition the window frame on.
order_by – Set ordering within the window frame. Accepts column names or expressions.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.lead(
...         dfn.col("a"), shift_offset=1,
...         default_value=0, order_by="a"
...     ).alias("lead"))
>>> result.sort(dfn.col("a")).collect_column("lead").to_pylist()
[2, 3, 0]

>>> df = ctx.from_pydict({"g": ["a", "a", "b"], "v": [1, 2, 3]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.lead(
...         dfn.col("v"), shift_offset=1, default_value=0,
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("lead"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("lead").to_pylist()
[2, 0, 0]
- datafusion.functions.least(*args: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the least value from a list of expressions.
Returns NULL if all expressions are NULL.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 3], "b": [2, 1]})
>>> result = df.select(
...     dfn.functions.least(dfn.col("a"), dfn.col("b")).alias("least"))
>>> result.collect_column("least")[0].as_py()
1
>>> result.collect_column("least")[1].as_py()
1
- datafusion.functions.left(string: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the first n characters in the string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["the cat"]})
>>> left_df = df.select(
...     dfn.functions.left(dfn.col("a"), dfn.lit(3)).alias("left"))
>>> left_df.collect_column("left")[0].as_py()
'the'
- datafusion.functions.length(string: datafusion.expr.Expr) datafusion.expr.Expr¶
The number of characters in the string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.length(dfn.col("a")).alias("len"))
>>> result.collect_column("len")[0].as_py()
5
- datafusion.functions.levenshtein(string1: datafusion.expr.Expr, string2: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the Levenshtein distance between the two given strings.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["kitten"]})
>>> result = df.select(
...     dfn.functions.levenshtein(dfn.col("a"), dfn.lit("sitting")).alias("d"))
>>> result.collect_column("d")[0].as_py()
3
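For readers unfamiliar with the metric, the Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The sketch below is the textbook dynamic-programming formulation in plain Python, shown for illustration rather than as DataFusion's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between `a` and `b` using two rolling rows."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```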
- datafusion.functions.list_any_value(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the first non-null element in the array.
See also
This is an alias for
array_any_value().
- datafusion.functions.list_append(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr¶
Appends an element to the end of an array.
See also
This is an alias for
array_append().
- datafusion.functions.list_cat(*args: datafusion.expr.Expr) datafusion.expr.Expr¶
Concatenates the input arrays.
See also
This is an alias for
array_concat(), array_cat().
- datafusion.functions.list_concat(*args: datafusion.expr.Expr) datafusion.expr.Expr¶
Concatenates the input arrays.
See also
This is an alias for
array_concat(), array_cat().
- datafusion.functions.list_contains(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns true if the element appears in the array, otherwise false.
See also
This is an alias for
array_has().
- datafusion.functions.list_dims(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns an array of the array’s dimensions.
See also
This is an alias for
array_dims().
- datafusion.functions.list_distance(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the Euclidean distance between two numeric arrays.
See also
This is an alias for
array_distance().
- datafusion.functions.list_distinct(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns distinct values from the array after removing duplicates.
See also
This is an alias for
array_distinct().
- datafusion.functions.list_element(array: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr¶
Extracts the element with the index n from the array.
See also
This is an alias for
array_element().
- datafusion.functions.list_empty(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns a boolean indicating whether the array is empty.
See also
This is an alias for
array_empty().
- datafusion.functions.list_except(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the elements that appear in array1 but not in array2.
See also
This is an alias for
array_except().
- datafusion.functions.list_extract(array: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr¶
Extracts the element with the index n from the array.
See also
This is an alias for
array_element().
- datafusion.functions.list_has(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns true if the element appears in the array, otherwise false.
See also
This is an alias for
array_has().
- datafusion.functions.list_has_all(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) datafusion.expr.Expr¶
Determines whether every element of second_array is present in first_array.
See also
This is an alias for
array_has_all().
- datafusion.functions.list_has_any(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) datafusion.expr.Expr¶
Determines whether there is an overlap between first_array and second_array.
See also
This is an alias for
array_has_any().
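The difference between the "has all" and "has any" checks is easiest to see side by side. The following plain-Python sketch models the two predicates on ordinary lists; the helper names are illustrative and this is not DataFusion code.

```python
def has_all(first_array, second_array):
    """True if every element of second_array appears in first_array."""
    return all(item in first_array for item in second_array)


def has_any(first_array, second_array):
    """True if at least one element appears in both arrays."""
    return any(item in first_array for item in second_array)


print(has_all([1, 2, 3], [1, 3]))  # True
print(has_all([1, 2, 3], [1, 4]))  # False
print(has_any([1, 2, 3], [4, 1]))  # True
print(has_any([1, 2, 3], [4, 5]))  # False
```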
- datafusion.functions.list_indexof(array: datafusion.expr.Expr, element: datafusion.expr.Expr, index: int | None = 1) datafusion.expr.Expr¶
Return the position of the first occurrence of element in array.
See also
This is an alias for
array_position().
- datafusion.functions.list_intersect(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the intersection of array1 and array2.
See also
This is an alias for
array_intersect().
- datafusion.functions.list_join(expr: datafusion.expr.Expr, delimiter: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts each element to its text representation.
See also
This is an alias for
array_to_string().
- datafusion.functions.list_length(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the length of the array.
See also
This is an alias for
array_length().
- datafusion.functions.list_max(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the maximum value in the array.
See also
This is an alias for
array_max().
- datafusion.functions.list_min(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the minimum value in the array.
See also
This is an alias for
array_min().
- datafusion.functions.list_ndims(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the number of dimensions of the array.
See also
This is an alias for
array_ndims().
- datafusion.functions.list_overlap(first_array: datafusion.expr.Expr, second_array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns true if any element appears in both arrays.
See also
This is an alias for
array_has_any().
- datafusion.functions.list_pop_back(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the array without the last element.
See also
This is an alias for
array_pop_back().
- datafusion.functions.list_pop_front(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the array without the first element.
See also
This is an alias for
array_pop_front().
- datafusion.functions.list_position(array: datafusion.expr.Expr, element: datafusion.expr.Expr, index: int | None = 1) datafusion.expr.Expr¶
Return the position of the first occurrence of element in array.
See also
This is an alias for
array_position().
- datafusion.functions.list_positions(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr¶
Searches for an element in the array and returns all occurrences.
See also
This is an alias for
array_positions().
- datafusion.functions.list_prepend(element: datafusion.expr.Expr, array: datafusion.expr.Expr) datafusion.expr.Expr¶
Prepends an element to the beginning of an array.
See also
This is an alias for
array_prepend().
- datafusion.functions.list_push_back(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr¶
Appends an element to the end of an array.
See also
This is an alias for
array_append().
- datafusion.functions.list_push_front(element: datafusion.expr.Expr, array: datafusion.expr.Expr) datafusion.expr.Expr¶
Prepends an element to the beginning of an array.
See also
This is an alias for
array_prepend().
- datafusion.functions.list_remove(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr¶
Removes the first element from the array equal to the given value.
See also
This is an alias for
array_remove().
- datafusion.functions.list_remove_all(array: datafusion.expr.Expr, element: datafusion.expr.Expr) datafusion.expr.Expr¶
Removes all elements from the array equal to the given value.
See also
This is an alias for
array_remove_all().
- datafusion.functions.list_remove_n(array: datafusion.expr.Expr, element: datafusion.expr.Expr, max: datafusion.expr.Expr) datafusion.expr.Expr¶
Removes the first max elements from the array equal to the given value.
See also
This is an alias for
array_remove_n().
- datafusion.functions.list_repeat(element: datafusion.expr.Expr, count: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns an array containing element count times.
See also
This is an alias for
array_repeat().
- datafusion.functions.list_replace(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr¶
Replaces the first occurrence of from_val with to_val.
See also
This is an alias for
array_replace().
- datafusion.functions.list_replace_all(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr¶
Replaces all occurrences of `from_val` with `to_val`.
See also
This is an alias for
array_replace_all().
- datafusion.functions.list_replace_n(array: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr, max: datafusion.expr.Expr) datafusion.expr.Expr¶
Replaces the first `max` occurrences of `from_val` with `to_val`.
See also
This is an alias for
array_replace_n().
- datafusion.functions.list_resize(array: datafusion.expr.Expr, size: datafusion.expr.Expr, value: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns an array with the specified size filled.
If `size` is greater than the `array` length, the additional entries will be filled with the given `value`.
See also
This is an alias for
array_resize().
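The fill behavior can be sketched on a plain Python list. The helper `resize` is hypothetical and only models the description above. The text covers growing an array; truncating when `size` is smaller than the current length is an assumption made here for completeness.

```python
def resize(values, size, fill=None):
    """Return a copy of values with exactly `size` entries."""
    if size <= len(values):
        # Assumption: a smaller size keeps only the leading entries.
        return values[:size]
    # Pad the tail with the fill value until the requested size is reached.
    return values + [fill] * (size - len(values))


grown = resize([1, 2, 3], 5, 0)
shrunk = resize([1, 2, 3], 2, 0)
```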
- datafusion.functions.list_reverse(array: datafusion.expr.Expr) datafusion.expr.Expr¶
Reverses the order of elements in the array.
See also
This is an alias for
array_reverse().
- datafusion.functions.list_slice(array: datafusion.expr.Expr, begin: datafusion.expr.Expr, end: datafusion.expr.Expr, stride: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Returns a slice of the array.
See also
This is an alias for
array_slice().
- datafusion.functions.list_sort(array: datafusion.expr.Expr, descending: bool = False, null_first: bool = False) datafusion.expr.Expr¶
Sorts the array.
See also
This is an alias for
array_sort().
- datafusion.functions.list_to_string(expr: datafusion.expr.Expr, delimiter: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts each element to its text representation.
See also
This is an alias for
array_to_string().
- datafusion.functions.list_union(array1: datafusion.expr.Expr, array2: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns an array of the elements in the union of array1 and array2.
Duplicate rows will not be returned.
See also
This is an alias for
array_union().
- datafusion.functions.list_zip(*arrays: datafusion.expr.Expr) datafusion.expr.Expr¶
Combines multiple arrays into a single array of structs.
See also
This is an alias for
arrays_zip().
- datafusion.functions.ln(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the natural logarithm (base e) of the argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0]})
>>> result = df.select(dfn.functions.ln(dfn.col("a")).alias("ln"))
>>> result.collect_column("ln")[0].as_py()
0.0
- datafusion.functions.log(base: datafusion.expr.Expr, num: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the logarithm of a number for a particular `base`.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [100.0]})
>>> result = df.select(
...     dfn.functions.log(dfn.lit(10.0), dfn.col("a")).alias("log")
... )
>>> result.collect_column("log")[0].as_py()
2.0
- datafusion.functions.log10(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Base 10 logarithm of the argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [100.0]})
>>> result = df.select(dfn.functions.log10(dfn.col("a")).alias("log10"))
>>> result.collect_column("log10")[0].as_py()
2.0
- datafusion.functions.log2(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Base 2 logarithm of the argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [8.0]})
>>> result = df.select(dfn.functions.log2(dfn.col("a")).alias("log2"))
>>> result.collect_column("log2")[0].as_py()
3.0
- datafusion.functions.lower(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts a string to lowercase.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["THE CaT"]})
>>> lower_df = df.select(dfn.functions.lower(dfn.col("a")).alias("lower"))
>>> lower_df.collect_column("lower")[0].as_py()
'the cat'
- datafusion.functions.lpad(string: datafusion.expr.Expr, count: datafusion.expr.Expr, characters: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Add left padding to a string.
Extends the string to length `count` by prepending the characters given in `characters` (a space by default). If the string is already longer than `count`, it is truncated (on the right).
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["the cat", "a hat"]})
>>> lpad_df = df.select(
...     dfn.functions.lpad(
...         dfn.col("a"), dfn.lit(6)
...     ).alias("lpad"))
>>> lpad_df.collect_column("lpad")[0].as_py()
'the ca'
>>> lpad_df.collect_column("lpad")[1].as_py()
' a hat'
>>> result = df.select(
...     dfn.functions.lpad(
...         dfn.col("a"), dfn.lit(10), characters=dfn.lit(".")
...     ).alias("lpad"))
>>> result.collect_column("lpad")[0].as_py()
'...the cat'
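The pad-or-truncate rule can be modeled on plain Python strings. This is a sketch of the behavior described above, not the library's code; the function below is a hypothetical stand-in. Cycling through a multi-character fill string is an assumption borrowed from PostgreSQL's `lpad`.

```python
def lpad(string, count, characters=" "):
    """Left-pad `string` to `count` characters, truncating on the right if too long."""
    if len(string) >= count:
        return string[:count]
    if not characters:
        return string  # nothing to pad with
    pad_len = count - len(string)
    # Assumption: a multi-character fill is repeated and cut to the needed length.
    pad = (characters * (pad_len // len(characters) + 1))[:pad_len]
    return pad + string


truncated = lpad("the cat", 6)
padded = lpad("a hat", 6)
dotted = lpad("the cat", 10, ".")
```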
- datafusion.functions.ltrim(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Removes all characters, spaces by default, from the beginning of a string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [" a "]})
>>> trim_df = df.select(dfn.functions.ltrim(dfn.col("a")).alias("trimmed"))
>>> trim_df.collect_column("trimmed")[0].as_py()
'a '
- datafusion.functions.make_array(*args: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns an array using the specified input expressions.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.make_array(
...         dfn.lit(1), dfn.lit(2), dfn.lit(3)
...     ).alias("arr"))
>>> result.collect_column("arr")[0].as_py()
[1, 2, 3]
- datafusion.functions.make_date(year: datafusion.expr.Expr, month: datafusion.expr.Expr, day: datafusion.expr.Expr) datafusion.expr.Expr¶
Make a date from year, month and day component parts.
Examples
>>> from datetime import date
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [2024], "m": [1], "d": [15]})
>>> result = df.select(
...     dfn.functions.make_date(dfn.col("y"), dfn.col("m"),
...     dfn.col("d")).alias("dt"))
>>> result.collect_column("dt")[0].as_py()
datetime.date(2024, 1, 15)
- datafusion.functions.make_list(*args: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns an array using the specified input expressions.
See also
This is an alias for
make_array().
- datafusion.functions.make_map(*args: Any) datafusion.expr.Expr¶
Returns a map expression.
Supports three calling conventions:
`make_map({"a": 1, "b": 2})`: from a Python dictionary.
`make_map([keys], [values])`: from a list of keys and a list of their associated values. Both lists must be the same length.
`make_map(k1, v1, k2, v2, ...)`: from alternating keys and their associated values.
Keys and values that are not already `Expr` are automatically converted to literal expressions.
Examples
From a dictionary:
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.make_map({"a": 1, "b": 2}).alias("m"))
>>> result.collect_column("m")[0].as_py()
[('a', 1), ('b', 2)]
From two lists:
>>> df = ctx.from_pydict({"key": ["x", "y"], "val": [10, 20]})
>>> df = df.select(
...     dfn.functions.make_map(
...         [dfn.col("key")], [dfn.col("val")]
...     ).alias("m"))
>>> df.collect_column("m")[0].as_py()
[('x', 10)]
From alternating keys and values:
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.make_map("x", 1, "y", 2).alias("m"))
>>> result.collect_column("m")[0].as_py()
[('x', 1), ('y', 2)]
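One way to picture how the three calling conventions reduce to the same key/value pairs is the pure-Python sketch below. The helper `normalize_map_args` is hypothetical and is not part of the library; it only illustrates the argument shapes listed above and skips the conversion to literal expressions.

```python
def normalize_map_args(*args):
    """Turn any of the three make_map argument shapes into (key, value) pairs."""
    if len(args) == 1 and isinstance(args[0], dict):
        # make_map({"a": 1, "b": 2})
        keys, values = list(args[0].keys()), list(args[0].values())
    elif len(args) == 2 and isinstance(args[0], list) and isinstance(args[1], list):
        # make_map([keys], [values])
        keys, values = args
        if len(keys) != len(values):
            raise ValueError("keys and values must have the same length")
    else:
        # make_map(k1, v1, k2, v2, ...)
        if len(args) % 2 != 0:
            raise ValueError("alternating form needs an even number of arguments")
        keys, values = list(args[0::2]), list(args[1::2])
    return list(zip(keys, values))


from_dict = normalize_map_args({"a": 1, "b": 2})
from_lists = normalize_map_args(["x", "y"], [10, 20])
from_pairs = normalize_map_args("x", 1, "y", 2)
```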
- datafusion.functions.make_time(hour: datafusion.expr.Expr, minute: datafusion.expr.Expr, second: datafusion.expr.Expr) datafusion.expr.Expr¶
Make a time from hour, minute and second component parts.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"h": [12], "m": [30], "s": [0]})
>>> result = df.select(
...     dfn.functions.make_time(dfn.col("h"), dfn.col("m"),
...     dfn.col("s")).alias("t"))
>>> result.collect_column("t")[0].as_py()
datetime.time(12, 30)
- datafusion.functions.map_entries(map: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns a list of all entries (key-value struct pairs) in the map.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> df = df.select(
...     dfn.functions.make_map({"x": 1, "y": 2}).alias("m"))
>>> result = df.select(
...     dfn.functions.map_entries(dfn.col("m")).alias("entries"))
>>> result.collect_column("entries")[0].as_py()
[{'key': 'x', 'value': 1}, {'key': 'y', 'value': 2}]
- datafusion.functions.map_extract(map: datafusion.expr.Expr, key: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the value for a given key in the map.
Returns `[None]` if the key is absent.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> df = df.select(
...     dfn.functions.make_map({"x": 1, "y": 2}).alias("m"))
>>> result = df.select(
...     dfn.functions.map_extract(
...         dfn.col("m"), dfn.lit("x")
...     ).alias("val"))
>>> result.collect_column("val")[0].as_py()
[1]
- datafusion.functions.map_keys(map: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns a list of all keys in the map.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> df = df.select(
...     dfn.functions.make_map({"x": 1, "y": 2}).alias("m"))
>>> result = df.select(
...     dfn.functions.map_keys(dfn.col("m")).alias("keys"))
>>> result.collect_column("keys")[0].as_py()
['x', 'y']
- datafusion.functions.map_values(map: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns a list of all values in the map.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> df = df.select(
...     dfn.functions.make_map({"x": 1, "y": 2}).alias("m"))
>>> result = df.select(
...     dfn.functions.map_values(dfn.col("m")).alias("vals"))
>>> result.collect_column("vals")[0].as_py()
[1, 2]
- datafusion.functions.max(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Aggregate function that returns the maximum value of the argument.
If using the builder functions described in the aggregation documentation, this function ignores the options `order_by`, `null_treatment`, and `distinct`.
- Parameters:
expression – The value to find the maximum of
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.max(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3
>>> result = df.aggregate(
...     [], [dfn.functions.max(
...         dfn.col("a"),
...         filter=dfn.col("a") < dfn.lit(3)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2
- datafusion.functions.md5(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Computes an MD5 128-bit checksum for a string expression.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.md5(dfn.col("a")).alias("md5"))
>>> result.collect_column("md5")[0].as_py()
'5d41402abc4b2a76b9719d911017c592'
- datafusion.functions.mean(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Returns the average (mean) value of the argument.
See also
This is an alias for
avg().
- datafusion.functions.median(expression: datafusion.expr.Expr, distinct: bool = False, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the median of a set of numbers.
This aggregate function returns the median value of the expression.
If using the builder functions described in the aggregation documentation, this function ignores the options `order_by` and `null_treatment`.
- Parameters:
expression – The value to compute the median of
distinct – If True, a single entry for each distinct value will be in the result
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.median(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0
>>> df = ctx.from_pydict({"a": [1.0, 1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.median(
...         dfn.col("a"), distinct=True,
...         filter=dfn.col("a") < dfn.lit(3.0),
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.5
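How `distinct` and `filter` interact can be followed by hand with the standard library. The sketch below is a pure-Python model, not DataFusion's implementation; `filtered_median` and its `predicate` parameter are hypothetical, and skipping `None` values is an assumption.

```python
from statistics import median


def filtered_median(values, distinct=False, predicate=None):
    """Median of the values that pass the predicate, optionally de-duplicated."""
    kept = [
        v for v in values
        if v is not None and (predicate is None or predicate(v))
    ]
    if distinct:
        kept = list(set(kept))  # one entry per distinct value
    return median(kept) if kept else None


plain = filtered_median([1.0, 2.0, 3.0])
# Filter keeps [1.0, 1.0, 2.0]; distinct reduces it to {1.0, 2.0}.
distinct_filtered = filtered_median(
    [1.0, 1.0, 2.0, 3.0], distinct=True, predicate=lambda v: v < 3.0
)
```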
- datafusion.functions.min(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Aggregate function that returns the minimum value of the argument.
If using the builder functions described in the aggregation documentation, this function ignores the options `order_by`, `null_treatment`, and `distinct`.
- Parameters:
expression – The value to find the minimum of
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.min(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1
>>> result = df.aggregate(
...     [], [dfn.functions.min(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2
- datafusion.functions.named_struct(name_pairs: list[tuple[str, datafusion.expr.Expr]]) datafusion.expr.Expr¶
Returns a struct built from the given name and argument pairs.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.named_struct(
...         [("x", dfn.lit(10)), ("y", dfn.lit(20))]
...     ).alias("s")
... )
>>> result.collect_column("s")[0].as_py() == {"x": 10, "y": 20}
True
- datafusion.functions.nanvl(x: datafusion.expr.Expr, y: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns `x` if `x` is not `NaN`. Otherwise returns `y`.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [np.nan, 1.0], "b": [0.0, 0.0]})
>>> nanvl_df = df.select(
...     dfn.functions.nanvl(dfn.col("a"), dfn.col("b")).alias("nanvl"))
>>> nanvl_df.collect_column("nanvl")[0].as_py()
0.0
>>> nanvl_df.collect_column("nanvl")[1].as_py()
1.0
- datafusion.functions.now() datafusion.expr.Expr¶
Returns the current timestamp in nanoseconds.
This will use the same value for all instances of now() in the same statement.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.now().alias("now")
... )
Use .value instead of .as_py() because nanosecond timestamps require pandas to convert to Python datetime objects.
>>> result.collect_column("now")[0].value > 0
True
- datafusion.functions.nth_value(expression: datafusion.expr.Expr, n: int, filter: datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None, null_treatment: datafusion.common.NullTreatment = NullTreatment.RESPECT_NULLS) datafusion.expr.Expr¶
Returns the n-th value in a group of values.
This aggregate function will return the n-th value in the partition.
If using the builder functions described in the aggregation documentation, this function ignores the option `distinct`.
- Parameters:
expression – Argument to retrieve the n-th value from
n – Index of value to return. Starts at 1.
filter – If provided, only compute against rows for which the filter is True
order_by – Set the ordering of the expression to evaluate. Accepts column names or expressions.
null_treatment – Assign whether to respect or ignore null values.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30]})
>>> result = df.aggregate(
...     [], [dfn.functions.nth_value(
...         dfn.col("a"), 1
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
10
>>> result = df.aggregate(
...     [], [dfn.functions.nth_value(
...         dfn.col("a"), 1,
...         filter=dfn.col("a") > dfn.lit(10),
...         order_by="a",
...         null_treatment=dfn.common.NullTreatment.IGNORE_NULLS,
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
20
- datafusion.functions.ntile(groups: int, partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr¶
Create an n-tile window function.
This window function orders the window frame into a given number of groups based on the ordering criteria. It then returns which group the current row is assigned to. Here is an example of a dataframe with a window ordered by descending `points` and the associated n-tile function:
+--------+-------+
| points | ntile |
+--------+-------+
| 120    | 1     |
| 100    | 1     |
| 80     | 2     |
| 60     | 2     |
| 40     | 3     |
| 20     | 3     |
+--------+-------+
- Parameters:
groups – Number of groups for the n-tile to be divided into.
partition_by – Expressions to partition the window frame on.
order_by – Set ordering within the window frame. Accepts column names or expressions.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30, 40]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.ntile(
...         2, order_by="a"
...     ).alias("nt"))
>>> result.sort(dfn.col("a")).collect_column("nt").to_pylist()
[1, 1, 2, 2]
>>> df = ctx.from_pydict(
...     {"g": ["a", "a", "b", "b"], "v": [1, 2, 3, 4]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.ntile(
...         2, partition_by=dfn.col("g"), order_by="v",
...     ).alias("nt"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("nt").to_pylist()
[1, 2, 1, 2]
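The group assignment can be reproduced by hand for an already ordered window. The sketch below is pure Python, and `ntile_buckets` is a hypothetical helper. It assumes the conventional SQL rule that any leftover rows go to the earliest groups; the source only shows evenly divisible cases, so the uneven behavior is an assumption.

```python
def ntile_buckets(row_count, groups):
    """Return the 1-based group number for each row position in an ordered window."""
    base, extra = divmod(row_count, groups)
    buckets = []
    for group in range(1, groups + 1):
        # Assumption: the first `extra` groups each take one additional row.
        size = base + (1 if group <= extra else 0)
        buckets.extend([group] * size)
    return buckets


six_rows_three_groups = ntile_buckets(6, 3)
four_rows_two_groups = ntile_buckets(4, 2)
```

The first result matches the `points` table above, and the second matches the unpartitioned example.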
- datafusion.functions.nullif(expr1: datafusion.expr.Expr, expr2: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns NULL if expr1 equals expr2; otherwise it returns expr1.
This can be used to perform the inverse operation of the COALESCE expression.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2], "b": [1, 3]})
>>> result = df.select(
...     dfn.functions.nullif(dfn.col("a"), dfn.col("b")).alias("nullif"))
>>> result.collect_column("nullif").to_pylist()
[None, 2]
- datafusion.functions.nvl(x: datafusion.expr.Expr, y: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns `x` if `x` is not `NULL`. Otherwise returns `y`.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [None, 1], "b": [0, 0]})
>>> nvl_df = df.select(
...     dfn.functions.nvl(dfn.col("a"), dfn.col("b")).alias("nvl")
... )
>>> nvl_df.collect_column("nvl")[0].as_py()
0
>>> nvl_df.collect_column("nvl")[1].as_py()
1
- datafusion.functions.nvl2(x: datafusion.expr.Expr, y: datafusion.expr.Expr, z: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns `y` if `x` is not NULL. Otherwise returns `z`.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [None, 1], "b": [10, 20], "c": [30, 40]})
>>> result = df.select(
...     dfn.functions.nvl2(
...         dfn.col("a"), dfn.col("b"), dfn.col("c")).alias("nvl2")
... )
>>> result.collect_column("nvl2")[0].as_py()
30
>>> result.collect_column("nvl2")[1].as_py()
20
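The three null-handling helpers nullif(), nvl(), and nvl2() are easy to confuse. As a rough mental model, they behave like the plain-Python functions below, with `None` standing in for NULL. These stand-ins are hypothetical and operate on scalars, not on `Expr` objects.

```python
def nullif(expr1, expr2):
    """None when both arguments are equal, otherwise the first argument."""
    return None if expr1 == expr2 else expr1


def nvl(x, y):
    """x unless it is None, in which case y."""
    return x if x is not None else y


def nvl2(x, y, z):
    """y when x is not None, otherwise z."""
    return y if x is not None else z


nullif_rows = [nullif(a, b) for a, b in zip([1, 2], [1, 3])]
nvl_rows = [nvl(a, b) for a, b in zip([None, 1], [0, 0])]
nvl2_rows = [nvl2(a, b, c) for a, b, c in zip([None, 1], [10, 20], [30, 40])]
```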
- datafusion.functions.octet_length(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the number of bytes of a string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.octet_length(dfn.col("a")).alias("len"))
>>> result.collect_column("len")[0].as_py()
5
- datafusion.functions.order_by(expr: datafusion.expr.Expr, ascending: bool = True, nulls_first: bool = True) datafusion.expr.SortExpr¶
Creates a new sort expression.
Examples
>>> sort_expr = dfn.functions.order_by(
...     dfn.col("a"), ascending=False)
>>> sort_expr.ascending()
False
>>> sort_expr = dfn.functions.order_by(
...     dfn.col("a"), ascending=True, nulls_first=False)
>>> sort_expr.nulls_first()
False
- datafusion.functions.overlay(string: datafusion.expr.Expr, substring: datafusion.expr.Expr, start: datafusion.expr.Expr, length: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Replace a substring with a new substring.
Replaces the substring of `string` that starts at the `start`-th character and extends for `length` characters with the new `substring`.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["abcdef"]})
>>> result = df.select(
...     dfn.functions.overlay(dfn.col("a"), dfn.lit("XY"), dfn.lit(3),
...     dfn.lit(2)).alias("o"))
>>> result.collect_column("o")[0].as_py()
'abXYef'
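The 1-based `start` position is the part most likely to cause off-by-one mistakes. The pure-Python sketch below mirrors the description on ordinary strings; the function is a hypothetical stand-in. Defaulting `length` to the length of the replacement is an assumption taken from PostgreSQL's `overlay`, since the text above does not state the default.

```python
def overlay(string, substring, start, length=None):
    """Replace `length` characters of `string`, beginning at 1-based `start`."""
    if length is None:
        # Assumption: default to the length of the replacement text.
        length = len(substring)
    head = string[: start - 1]           # characters before the replaced span
    tail = string[start - 1 + length:]   # characters after the replaced span
    return head + substring + tail


replaced = overlay("abcdef", "XY", 3, 2)
```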
- datafusion.functions.percent_rank(partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr¶
Create a percent_rank window function.
This window function is similar to `rank()` except that the returned values are the percentage from 0.0 to 1.0 from first to last. Here is an example of a dataframe with a window ordered by descending `points` and the associated percent rank:
+--------+--------------+
| points | percent_rank |
+--------+--------------+
| 100    | 0.0          |
| 100    | 0.0          |
| 50     | 0.666667     |
| 25     | 1.0          |
+--------+--------------+
- Parameters:
partition_by – Expressions to partition the window frame on.
order_by – Set ordering within the window frame. Accepts column names or expressions.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.percent_rank(
...         order_by="a"
...     ).alias("pr"))
>>> result.sort(dfn.col("a")).collect_column("pr").to_pylist()
[0.0, 0.5, 1.0]
>>> df = ctx.from_pydict(
...     {"g": ["a", "a", "a", "b", "b"], "v": [1, 2, 3, 4, 5]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.percent_rank(
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("pr"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("pr").to_pylist()
[0.0, 0.5, 1.0, 0.0, 1.0]
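The values in the table and examples above are consistent with the usual SQL definition, (rank - 1) / (rows - 1). The sketch below computes that on a plain list; it is a model of the formula rather than DataFusion's implementation. The helper name is hypothetical, and returning 0.0 for a single-row window is an assumption.

```python
def percent_rank(values, descending=False):
    """Fraction of the window that sorts strictly before each value."""
    n = len(values)
    if n <= 1:
        return [0.0] * n  # assumption: a lone row has percent rank 0.0

    def precedes(a, b):
        return a > b if descending else a < b

    # rank - 1 equals the number of rows that sort strictly before the value.
    return [
        sum(1 for other in values if precedes(other, v)) / (n - 1)
        for v in values
    ]


ascending = percent_rank([10, 20, 30])
points = [round(p, 6) for p in percent_rank([100, 100, 50, 25], descending=True)]
```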
- datafusion.functions.percentile_cont(sort_expression: datafusion.expr.Expr | datafusion.expr.SortExpr, percentile: float, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the exact percentile of input values using continuous interpolation.
Unlike `approx_percentile_cont()`, this function computes the exact percentile value rather than an approximation.
If using the builder functions described in the aggregation documentation, this function ignores the options `order_by`, `null_treatment`, and `distinct`.
- Parameters:
sort_expression – Values for which to find the percentile
percentile – This must be between 0.0 and 1.0, inclusive
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.percentile_cont(
...         dfn.col("a"), 0.5
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3.0
>>> result = df.aggregate(
...     [], [dfn.functions.percentile_cont(
...         dfn.col("a"), 0.5,
...         filter=dfn.col("a") > dfn.lit(1.0),
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3.5
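"Continuous interpolation" is commonly defined as linear interpolation between the two nearest sorted values. The sketch below assumes that definition and reproduces both results above on plain lists. It is not taken from DataFusion's source, and the function is a hypothetical stand-in.

```python
import math


def percentile_cont(values, percentile):
    """Linearly interpolated percentile of a list of numbers."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be between 0.0 and 1.0, inclusive")
    ordered = sorted(values)
    if not ordered:
        return None
    # Fractional index into the sorted values.
    position = percentile * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


odd_count = percentile_cont([1.0, 2.0, 3.0, 4.0, 5.0], 0.5)
even_count = percentile_cont([2.0, 3.0, 4.0, 5.0], 0.5)
```

With an even number of values the median falls between two entries, which is why the filtered example returns 3.5 rather than one of the inputs.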
- datafusion.functions.pi() datafusion.expr.Expr¶
Returns an approximate value of π.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> from math import pi
>>> result = df.select(
...     dfn.functions.pi().alias("pi")
... )
>>> result.collect_column("pi")[0].as_py() == pi
True
- datafusion.functions.pow(base: datafusion.expr.Expr, exponent: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns `base` raised to the power of `exponent`.
See also
This is an alias of
power().
- datafusion.functions.power(base: datafusion.expr.Expr, exponent: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns `base` raised to the power of `exponent`.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [2.0]})
>>> result = df.select(
...     dfn.functions.power(dfn.col("a"), dfn.lit(3.0)).alias("pow")
... )
>>> result.collect_column("pow")[0].as_py()
8.0
- datafusion.functions.quantile_cont(sort_expression: datafusion.expr.Expr | datafusion.expr.SortExpr, percentile: float, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the exact percentile of input values using continuous interpolation.
See also
This is an alias for
percentile_cont().
- datafusion.functions.radians(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts the argument from degrees to radians.
Examples
>>> from math import pi
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [180.0]})
>>> result = df.select(
...     dfn.functions.radians(dfn.col("a")).alias("rad")
... )
>>> result.collect_column("rad")[0].as_py() == pi
True
- datafusion.functions.random() datafusion.expr.Expr¶
Returns a random value in the range `0.0 <= x < 1.0`.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.random().alias("r")
... )
>>> val = result.collect_column("r")[0].as_py()
>>> 0.0 <= val < 1.0
True
- datafusion.functions.range(start: datafusion.expr.Expr, stop: datafusion.expr.Expr, step: datafusion.expr.Expr) datafusion.expr.Expr¶
Create a list of values in the range between start and stop.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.range(dfn.lit(0), dfn.lit(5), dfn.lit(2)).alias("r"))
>>> result.collect_column("r")[0].as_py()
[0, 2, 4]
- datafusion.functions.rank(partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr¶
Create a rank window function.
Returns the rank based upon the window order. Consecutive equal values receive the same rank, and the next distinct value receives a rank equal to the number of rows that precede it plus one, so ranks can contain gaps. This is similar to Olympic medals: if two people tie for gold, the next place is bronze and no silver medal is awarded.
You should set `order_by` to produce meaningful results. Here is an example of a dataframe with a window ordered by descending `points` and the associated rank:
+--------+------+
| points | rank |
+--------+------+
| 100    | 1    |
| 100    | 1    |
| 50     | 3    |
| 25     | 4    |
+--------+------+
- Parameters:
partition_by – Expressions to partition the window frame on.
order_by – Set ordering within the window frame. Accepts column names or expressions.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 10, 20]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.rank(
...         order_by="a"
...     ).alias("rnk")
... )
>>> result.sort(dfn.col("a")).collect_column("rnk").to_pylist()
[1, 1, 3]
>>> df = ctx.from_pydict(
...     {"g": ["a", "a", "b", "b"], "v": [1, 1, 2, 3]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.rank(
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("rnk"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("rnk").to_pylist()
[1, 1, 1, 2]
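To see why the partitioned example restarts at 1 for group `b`, the ranking can be recomputed by hand. The sketch below is a pure-Python model with a hypothetical helper name. It ranks in ascending order only and is not how DataFusion evaluates window functions.

```python
from collections import defaultdict


def rank_within_partitions(groups, values):
    """Ascending rank of each value among the rows that share its group."""
    by_group = defaultdict(list)
    for group, value in zip(groups, values):
        by_group[group].append(value)

    ranks = []
    for group, value in zip(groups, values):
        # Rank = rows in the same partition that sort strictly before, plus one.
        preceding = sum(1 for other in by_group[group] if other < value)
        ranks.append(preceding + 1)
    return ranks


partitioned = rank_within_partitions(["a", "a", "b", "b"], [1, 1, 2, 3])
single_partition = rank_within_partitions(["x", "x", "x"], [10, 10, 20])
```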
- datafusion.functions.regexp_count(string: datafusion.expr.Expr, pattern: datafusion.expr.Expr, start: datafusion.expr.Expr | None = None, flags: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Returns the number of matches in a string.
`start` is an optional position (the first position is 1) from which to begin searching for the regular expression.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["abcabc"]})
>>> result = df.select(
...     dfn.functions.regexp_count(
...         dfn.col("a"), dfn.lit("abc")
...     ).alias("c"))
>>> result.collect_column("c")[0].as_py()
2
Use `start` to begin searching from a position, and `flags` for case-insensitive matching:
>>> result = df.select(
...     dfn.functions.regexp_count(
...         dfn.col("a"), dfn.lit("ABC"),
...         start=dfn.lit(4), flags=dfn.lit("i"),
...     ).alias("c"))
>>> result.collect_column("c")[0].as_py()
1
- datafusion.functions.regexp_instr(values: datafusion.expr.Expr, regex: datafusion.expr.Expr, start: datafusion.expr.Expr | None = None, n: datafusion.expr.Expr | None = None, flags: datafusion.expr.Expr | None = None, sub_expr: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Returns the position of a regular expression match in a string.
- Parameters:
values – Data to search for the regular expression match.
regex – Regular expression to search for.
start – Optional position to start the search (the first position is 1).
n – Optional occurrence of the match to find (the first occurrence is 1).
flags – Optional regular expression flags to control regex behavior.
sub_expr – Optional capture group whose position is returned instead of that of the entire match.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello 42 world"]})
>>> result = df.select(
...     dfn.functions.regexp_instr(
...         dfn.col("a"), dfn.lit("\\d+")
...     ).alias("pos")
... )
>>> result.collect_column("pos")[0].as_py()
7
Use `start` to search from a position, `n` for the nth occurrence, and `flags` for case-insensitive mode:
>>> df = ctx.from_pydict({"a": ["abc ABC abc"]})
>>> result = df.select(
...     dfn.functions.regexp_instr(
...         dfn.col("a"), dfn.lit("abc"),
...         start=dfn.lit(2), n=dfn.lit(1),
...         flags=dfn.lit("i"),
...     ).alias("pos")
... )
>>> result.collect_column("pos")[0].as_py()
5
Use `sub_expr` to get the position of a capture group:
>>> result = df.select(
...     dfn.functions.regexp_instr(
...         dfn.col("a"), dfn.lit("(abc)"),
...         sub_expr=dfn.lit(1),
...     ).alias("pos")
... )
>>> result.collect_column("pos")[0].as_py()
1
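The 1-based positions in the examples above can be checked with Python's standard `re` module. The function below is a hypothetical stand-in that mimics the documented parameters on plain strings. Python's regex dialect differs from the engine DataFusion uses, and returning 0 when nothing matches is an assumption, so treat this as an illustration of the position arithmetic only.

```python
import re


def regexp_instr(string, pattern, start=1, n=1, flags="", sub_expr=0):
    """1-based position of the n-th match found at or after 1-based `start`."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    matches = list(re.finditer(pattern, string[start - 1:], re_flags))
    if len(matches) < n:
        return 0  # assumption: 0 signals that no such match exists
    # match.start() is a 0-based offset into the slice; adding `start`
    # converts it to a 1-based position in the full string.
    return matches[n - 1].start(sub_expr) + start


digits_at = regexp_instr("hello 42 world", r"\d+")
case_insensitive_at = regexp_instr("abc ABC abc", "abc", start=2, n=1, flags="i")
group_at = regexp_instr("abc ABC abc", "(abc)", sub_expr=1)
```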
- datafusion.functions.regexp_like(string: datafusion.expr.Expr, regex: datafusion.expr.Expr, flags: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Find if any regular expression (regex) matches exist.
Tests a string against a regular expression, returning true if there is at least one match and false otherwise.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello123"]})
>>> result = df.select(
...     dfn.functions.regexp_like(
...         dfn.col("a"), dfn.lit("\\d+")
...     ).alias("m")
... )
>>> result.collect_column("m")[0].as_py()
True
Use `flags` for case-insensitive matching:
>>> result = df.select(
...     dfn.functions.regexp_like(
...         dfn.col("a"), dfn.lit("HELLO"),
...         flags=dfn.lit("i"),
...     ).alias("m")
... )
>>> result.collect_column("m")[0].as_py()
True
- datafusion.functions.regexp_match(string: datafusion.expr.Expr, regex: datafusion.expr.Expr, flags: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Perform regular expression (regex) matching.
Returns an array with each element containing the leftmost-first match of the corresponding index in `regex` to string in `string`.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello 42 world"]})
>>> result = df.select(
...     dfn.functions.regexp_match(
...         dfn.col("a"), dfn.lit("(\\d+)")
...     ).alias("m")
... )
>>> result.collect_column("m")[0].as_py()
['42']
Use `flags` for case-insensitive matching:
>>> result = df.select(
...     dfn.functions.regexp_match(
...         dfn.col("a"), dfn.lit("(HELLO)"),
...         flags=dfn.lit("i"),
...     ).alias("m")
... )
>>> result.collect_column("m")[0].as_py()
['hello']
- datafusion.functions.regexp_replace(string: datafusion.expr.Expr, pattern: datafusion.expr.Expr, replacement: datafusion.expr.Expr, flags: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Replaces substring(s) matching a PCRE-like regular expression.
The full list of supported features and syntax can be found at <https://docs.rs/regex/latest/regex/#syntax>
Supported flags with the addition of ‘g’ can be found at <https://docs.rs/regex/latest/regex/#grouping-and-flags>
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello 42"]})
>>> result = df.select(
...     dfn.functions.regexp_replace(
...         dfn.col("a"), dfn.lit("\\d+"),
...         dfn.lit("XX")
...     ).alias("r")
... )
>>> result.collect_column("r")[0].as_py()
'hello XX'
Use the `g` flag to replace all occurrences:
>>> df = ctx.from_pydict({"a": ["a1 b2 c3"]})
>>> result = df.select(
...     dfn.functions.regexp_replace(
...         dfn.col("a"), dfn.lit("\\d+"),
...         dfn.lit("X"), flags=dfn.lit("g"),
...     ).alias("r")
... )
>>> result.collect_column("r")[0].as_py()
'aX bX cX'
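The effect of the `g` flag maps onto the `count` argument of Python's `re.sub`. The sketch below is a hypothetical stand-in built on that mapping and handles only the `g` and `i` flags. It uses Python's regex dialect, whose pattern and replacement syntax differs from the Rust `regex` crate linked above, so it illustrates the first-match versus all-matches distinction only.

```python
import re


def regexp_replace(string, pattern, replacement, flags=""):
    """Replace the first match, or every match when the 'g' flag is present."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    # In re.sub, count=0 means "replace all"; count=1 stops after the first match.
    count = 0 if "g" in flags else 1
    return re.sub(pattern, replacement, string, count=count, flags=re_flags)


first_only = regexp_replace("a1 b2 c3", r"\d+", "X")
all_matches = regexp_replace("a1 b2 c3", r"\d+", "X", flags="g")
```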
- datafusion.functions.regr_avgx(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the average of the independent variable `x`.
This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.
If using the builder functions described in the aggregation documentation, this function ignores the options `order_by`, `null_treatment`, and `distinct`.
- Parameters:
y – The linear regression dependent variable
x – The linear regression independent variable
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [4.0, 5.0, 6.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_avgx(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
5.0

>>> result = df.aggregate(
...     [], [dfn.functions.regr_avgx(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
5.5
- datafusion.functions.regr_avgy(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the average of the dependent variable y.
This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
y – The linear regression dependent variable
x – The linear regression independent variable
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [4.0, 5.0, 6.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_avgy(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0

>>> result = df.aggregate(
...     [], [dfn.functions.regr_avgy(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.5
- datafusion.functions.regr_count(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Counts the number of rows in which both expressions are not null.
This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
y – The linear regression dependent variable
x – The linear regression independent variable
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [4.0, 5.0, 6.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_count(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3

>>> result = df.aggregate(
...     [], [dfn.functions.regr_count(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2
- datafusion.functions.regr_intercept(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the intercept from the linear regression.
This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
y – The linear regression dependent variable
x – The linear regression independent variable
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [2.0, 4.0, 6.0], "x": [4.0, 16.0, 36.0]})
>>> result = df.aggregate(
...     [],
...     [dfn.functions.regr_intercept(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.714...

>>> result = df.aggregate(
...     [],
...     [dfn.functions.regr_intercept(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(2.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.4
- datafusion.functions.regr_r2(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the R-squared value from linear regression.
This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
y – The linear regression dependent variable
x – The linear regression independent variable
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [2.0, 4.0, 6.0], "x": [4.0, 16.0, 36.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_r2(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.9795...

>>> result = df.aggregate(
...     [], [dfn.functions.regr_r2(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(2.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.0
- datafusion.functions.regr_slope(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the slope from linear regression.
This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
y – The linear regression dependent variable
x – The linear regression independent variable
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [2.0, 4.0, 6.0], "x": [4.0, 16.0, 36.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_slope(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.122...

>>> result = df.aggregate(
...     [], [dfn.functions.regr_slope(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(2.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.1
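The slope, intercept, and R-squared values in the examples above can be cross-checked by hand. The following is a plain-Python sketch of the usual least-squares formulas, not DataFusion's implementation; the helper name regr_stats is hypothetical. It skips any pair in which either value is missing, mirroring the "only non-null pairs" rule.

```python
def regr_stats(ys, xs):
    """Least-squares slope, intercept and R-squared over non-null (y, x) pairs."""
    # Keep only the pairs where both values are present.
    pairs = [(y, x) for y, x in zip(ys, xs) if y is not None and x is not None]
    n = len(pairs)
    mean_y = sum(y for y, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for _, x in pairs)
    syy = sum((y - mean_y) ** 2 for y, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for y, x in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


slope, intercept, r2 = regr_stats([2.0, 4.0, 6.0], [4.0, 16.0, 36.0])
print(round(slope, 3), round(intercept, 3), round(r2, 4))  # 0.122 1.714 0.9796
```

The sketch assumes at least two distinct x values and two distinct y values; otherwise the divisions are undefined.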
- datafusion.functions.regr_sxx(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the sum of squares of the independent variable x.
This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
y – The linear regression dependent variable
x – The linear regression independent variable
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_sxx(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0

>>> result = df.aggregate(
...     [], [dfn.functions.regr_sxx(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.5
- datafusion.functions.regr_sxy(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the sum of products of pairs of numbers.
This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
y – The linear regression dependent variable
x – The linear regression independent variable
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_sxy(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0

>>> result = df.aggregate(
...     [], [dfn.functions.regr_sxy(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.5
- datafusion.functions.regr_syy(y: datafusion.expr.Expr, x: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the sum of squares of the dependent variable y.
This is a linear regression aggregate function. Only non-null pairs of the inputs are evaluated.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
y – The linear regression dependent variable
x – The linear regression independent variable
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"y": [1.0, 2.0, 3.0], "x": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.regr_syy(
...         dfn.col("y"), dfn.col("x")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0

>>> result = df.aggregate(
...     [], [dfn.functions.regr_syy(
...         dfn.col("y"), dfn.col("x"),
...         filter=dfn.col("y") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.5
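The example outputs for regr_sxx, regr_sxy, and regr_syy (2.0 for the values 1.0, 2.0, 3.0) are consistent with sums taken over deviations from the mean rather than over raw squares. A plain-Python sketch under that reading follows; the helper name regr_sums is hypothetical and this is not DataFusion's implementation.

```python
def regr_sums(ys, xs):
    """Return (sxx, sxy, syy) as sums over deviations from the mean."""
    # Only pairs with both values present take part in the sums.
    pairs = [(y, x) for y, x in zip(ys, xs) if y is not None and x is not None]
    n = len(pairs)
    mean_y = sum(y for y, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    sxx = sum((x - mean_x) ** 2 for _, x in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for y, x in pairs)
    syy = sum((y - mean_y) ** 2 for y, _ in pairs)
    return sxx, sxy, syy


print(regr_sums([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # (2.0, 2.0, 2.0)
# Same data restricted to y > 1.0, as in the filter examples.
print(regr_sums([2.0, 3.0], [2.0, 3.0]))  # (0.5, 0.5, 0.5)
```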
- datafusion.functions.repeat(string: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr¶
Repeats the string n times.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["ha"]})
>>> result = df.select(
...     dfn.functions.repeat(dfn.col("a"), dfn.lit(3)).alias("r"))
>>> result.collect_column("r")[0].as_py()
'hahaha'
- datafusion.functions.replace(string: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr¶
Replaces all occurrences of from_val with to_val in the string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello world"]})
>>> result = df.select(
...     dfn.functions.replace(dfn.col("a"), dfn.lit("world"),
...     dfn.lit("there")).alias("r"))
>>> result.collect_column("r")[0].as_py()
'hello there'
- datafusion.functions.reverse(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Reverse the string argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.reverse(dfn.col("a")).alias("r"))
>>> result.collect_column("r")[0].as_py()
'olleh'
- datafusion.functions.right(string: datafusion.expr.Expr, n: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the last n characters in the string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.right(dfn.col("a"), dfn.lit(3)).alias("r"))
>>> result.collect_column("r")[0].as_py()
'llo'
- datafusion.functions.round(value: datafusion.expr.Expr, decimal_places: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Round the argument to the nearest integer.
If the optional decimal_places is specified, round to that number of decimal places. You can specify a negative number of decimal places. For example, round(lit(125.2345), lit(-2)) would yield a value of 100.0.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.567]})
>>> result = df.select(dfn.functions.round(dfn.col("a"), dfn.lit(2)).alias("r"))
>>> result.collect_column("r")[0].as_py()
1.57
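Negative values of decimal_places round to the left of the decimal point. Python's built-in round accepts a negative digit count in the same way, so it serves as a rough stand-in here. This is only an illustration: Python rounds exact ties to the nearest even digit, and DataFusion's handling of ties may differ.

```python
# Positive digit count: keep two digits after the decimal point.
two_places = round(1.567, 2)

# Negative digit count: round to the nearest hundred.
nearest_hundred = round(125.2345, -2)

# -1 rounds to the nearest ten.
nearest_ten = round(125.2345, -1)

print(two_places)       # 1.57
print(nearest_hundred)  # 100.0
print(nearest_ten)      # 130.0
```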
- datafusion.functions.row(*args: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns a struct with the given arguments.
See also
This is an alias for struct().
- datafusion.functions.row_number(partition_by: list[datafusion.expr.Expr] | datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr¶
Create a row number window function.
Returns the number of the current row within its partition, counting from 1.
Here is an example of row_number on a simple DataFrame:

+--------+------------+
| points | row number |
+--------+------------+
| 100    | 1          |
| 100    | 2          |
| 50     | 3          |
| 25     | 4          |
+--------+------------+
- Parameters:
partition_by – Expressions to partition the window frame on.
order_by – Set ordering within the window frame. Accepts column names or expressions.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [10, 20, 30]})
>>> result = df.select(
...     dfn.col("a"),
...     dfn.functions.row_number(
...         order_by="a"
...     ).alias("rn"))
>>> result.sort(dfn.col("a")).collect_column("rn").to_pylist()
[1, 2, 3]

>>> df = ctx.from_pydict(
...     {"g": ["a", "a", "b", "b"], "v": [1, 2, 3, 4]})
>>> result = df.select(
...     dfn.col("g"), dfn.col("v"),
...     dfn.functions.row_number(
...         partition_by=dfn.col("g"), order_by="v",
...     ).alias("rn"))
>>> result.sort(dfn.col("g"), dfn.col("v")).collect_column("rn").to_pylist()
[1, 2, 1, 2]
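How partition_by and order_by interact can be sketched in plain Python: rows are grouped by the partition key, each group is ordered, and numbering restarts at 1 in every group. The helper row_numbers below is hypothetical and is not how DataFusion evaluates window functions.

```python
from collections import defaultdict


def row_numbers(rows, partition_key, order_key):
    """Return row numbers aligned with the input order of `rows`."""
    # Collect the positions of the rows belonging to each partition.
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[row[partition_key]].append(idx)

    numbers = [0] * len(rows)
    for indices in groups.values():
        # Order the rows inside the partition, then number them from 1.
        indices.sort(key=lambda i: rows[i][order_key])
        for n, i in enumerate(indices, start=1):
            numbers[i] = n
    return numbers


rows = [
    {"g": "a", "v": 1},
    {"g": "a", "v": 2},
    {"g": "b", "v": 3},
    {"g": "b", "v": 4},
]
print(row_numbers(rows, "g", "v"))  # [1, 2, 1, 2]
```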
- datafusion.functions.rpad(string: datafusion.expr.Expr, count: datafusion.expr.Expr, characters: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Add right padding to a string.
Extends the string to length count by appending characters (a space by default). If the string is already longer than count, it is truncated.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hi"]})
>>> result = df.select(
...     dfn.functions.rpad(dfn.col("a"), dfn.lit(5), dfn.lit("!")).alias("r"))
>>> result.collect_column("r")[0].as_py()
'hi!!!'
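The padding and truncation rules can be sketched in plain Python. The function below is an illustration rather than DataFusion's implementation. Cycling through a multi-character fill string and returning the input unchanged for an empty fill are assumptions modelled on PostgreSQL's rpad.

```python
def rpad(string, count, characters=" "):
    """Pad `string` on the right to `count` characters, or truncate it."""
    count = max(count, 0)
    if len(string) >= count:
        # Already long enough: keep only the first `count` characters.
        return string[:count]
    if not characters:
        # Nothing to pad with (assumed behaviour for an empty fill string).
        return string
    needed = count - len(string)
    # Repeat the fill string and cut it to the exact length needed.
    fill = (characters * (needed // len(characters) + 1))[:needed]
    return string + fill


print(rpad("hi", 5, "!"))      # hi!!!
print(rpad("hello world", 5))  # hello
print(rpad("hi", 5, "xy"))     # hixyx
```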
- datafusion.functions.rtrim(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Removes all characters, spaces by default, from the end of a string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [" a "]})
>>> trim_df = df.select(dfn.functions.rtrim(dfn.col("a")).alias("trimmed"))
>>> trim_df.collect_column("trimmed")[0].as_py()
' a'
- datafusion.functions.sha224(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Computes the SHA-224 hash of a binary string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.sha224(dfn.col("a")).alias("h")
... )
>>> result.collect_column("h")[0].as_py().hex()
'ea09ae9cc6768c50fcee903ed054556e5bfc8347907f12598aa24193'
- datafusion.functions.sha256(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Computes the SHA-256 hash of a binary string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.sha256(dfn.col("a")).alias("h")
... )
>>> result.collect_column("h")[0].as_py().hex()
'2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'
- datafusion.functions.sha384(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Computes the SHA-384 hash of a binary string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.sha384(dfn.col("a")).alias("h")
... )
>>> result.collect_column("h")[0].as_py().hex()
'59e1748777448c69de6b800d7a33bbfb9ff1b...
- datafusion.functions.sha512(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Computes the SHA-512 hash of a binary string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.sha512(dfn.col("a")).alias("h")
... )
>>> result.collect_column("h")[0].as_py().hex()
'9b71d224bd62f3785d96d46ad3ea3d73319bfb...
- datafusion.functions.signum(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the sign of the argument (-1, 0, +1).
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [-5.0, 0.0, 5.0]})
>>> result = df.select(dfn.functions.signum(dfn.col("a")).alias("s"))
>>> result.collect_column("s").to_pylist()
[-1.0, 0.0, 1.0]
- datafusion.functions.sin(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the sine of the argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.sin(dfn.col("a")).alias("sin"))
>>> result.collect_column("sin")[0].as_py()
0.0
- datafusion.functions.sinh(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the hyperbolic sine of the argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.sinh(dfn.col("a")).alias("sinh"))
>>> result.collect_column("sinh")[0].as_py()
0.0
- datafusion.functions.split_part(string: datafusion.expr.Expr, delimiter: datafusion.expr.Expr, index: datafusion.expr.Expr) datafusion.expr.Expr¶
Split a string and return one part.
Splits a string based on a delimiter and picks out the desired field based on the index.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["a,b,c"]})
>>> result = df.select(
...     dfn.functions.split_part(
...         dfn.col("a"), dfn.lit(","), dfn.lit(2)
...     ).alias("s"))
>>> result.collect_column("s")[0].as_py()
'b'
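The example above returns 'b' for index 2, so fields are counted from 1 rather than 0. A plain-Python sketch of that indexing follows. Returning an empty string for an out-of-range index is an assumption modelled on PostgreSQL's split_part, not something this page states.

```python
def split_part(string, delimiter, index):
    """Return the field at the 1-based `index` after splitting on `delimiter`."""
    parts = string.split(delimiter)
    if 1 <= index <= len(parts):
        # Convert the 1-based field number to Python's 0-based list index.
        return parts[index - 1]
    # Assumed behaviour for an index past the last field.
    return ""


print(split_part("a,b,c", ",", 2))  # b
print(split_part("a,b,c", ",", 1))  # a
```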
- datafusion.functions.sqrt(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the square root of the argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [9.0]})
>>> result = df.select(dfn.functions.sqrt(dfn.col("a")).alias("sqrt"))
>>> result.collect_column("sqrt")[0].as_py()
3.0
- datafusion.functions.starts_with(string: datafusion.expr.Expr, prefix: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns true if string starts with prefix.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello_from_datafusion"]})
>>> result = df.select(
...     dfn.functions.starts_with(dfn.col("a"), dfn.lit("hello")).alias("sw"))
>>> result.collect_column("sw")[0].as_py()
True
- datafusion.functions.stddev(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the sample standard deviation of the argument.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
expression – The value to compute the standard deviation of
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [2.0, 4.0, 6.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.stddev(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
2.0

>>> result = df.aggregate(
...     [], [dfn.functions.stddev(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(2.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.41...
- datafusion.functions.stddev_pop(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the population standard deviation of the argument.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
expression – The value to compute the population standard deviation of
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0, 1.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.stddev_pop(
...         dfn.col("a")
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
1.247...

>>> df = ctx.from_pydict({"a": [0.0, 1.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.stddev_pop(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(0.0)
...     ).alias("v")]
... )
>>> result.collect_column("v")[0].as_py()
1.0
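The stddev and stddev_pop examples differ in their divisor: the sample form divides the squared deviations by n - 1, the population form by n. Python's statistics module exposes both, which gives a quick cross-check of the outputs shown above. It is used here only as a reference calculation, not as a description of DataFusion internals.

```python
import statistics

# Sample standard deviation (divisor n - 1), matching the stddev examples.
sample_all = statistics.stdev([2.0, 4.0, 6.0])
sample_filtered = statistics.stdev([4.0, 6.0])

# Population standard deviation (divisor n), matching the stddev_pop examples.
population_all = statistics.pstdev([0.0, 1.0, 3.0])
population_filtered = statistics.pstdev([1.0, 3.0])

print(sample_all)                     # 2.0
print(round(sample_filtered, 2))      # 1.41
print(round(population_all, 3))       # 1.247
print(population_filtered)            # 1.0
```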
- datafusion.functions.stddev_samp(arg: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the sample standard deviation of the argument.
See also
This is an alias for stddev().
- datafusion.functions.string_agg(expression: datafusion.expr.Expr, delimiter: str, filter: datafusion.expr.Expr | None = None, order_by: list[datafusion.expr.SortKey] | datafusion.expr.SortKey | None = None) datafusion.expr.Expr¶
Concatenates the input strings.
This aggregate function will concatenate input strings, ignoring null values, and separating them with the specified delimiter. Non-string values will be converted to their string equivalents.
If using the builder functions described in the aggregation documentation, this function ignores the options distinct and null_treatment.
- Parameters:
expression – The values to concatenate
delimiter – Text to place between each value of expression
filter – If provided, only compute against rows for which the filter is True
order_by – Set the ordering of the expression to evaluate. Accepts column names or expressions.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["x", "y", "z"]})
>>> result = df.aggregate(
...     [], [dfn.functions.string_agg(
...         dfn.col("a"), ",", order_by="a"
...     ).alias("s")])
>>> result.collect_column("s")[0].as_py()
'x,y,z'

>>> result = df.aggregate(
...     [], [dfn.functions.string_agg(
...         dfn.col("a"), ",",
...         filter=dfn.col("a") > dfn.lit("x"),
...         order_by="a",
...     ).alias("s")])
>>> result.collect_column("s")[0].as_py()
'y,z'
- datafusion.functions.string_to_array(string: datafusion.expr.Expr, delimiter: datafusion.expr.Expr, null_string: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Splits a string based on a delimiter and returns an array of parts.
Any parts matching the optional null_string will be replaced with NULL.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello,world"]})
>>> result = df.select(
...     dfn.functions.string_to_array(
...         dfn.col("a"), dfn.lit(","),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
['hello', 'world']

Replace parts matching a null_string with NULL:

>>> result = df.select(
...     dfn.functions.string_to_array(
...         dfn.col("a"), dfn.lit(","), null_string=dfn.lit("world"),
...     ).alias("result"))
>>> result.collect_column("result")[0].as_py()
['hello', None]
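The null_string substitution can be pictured as a split followed by a per-part comparison. The plain-Python sketch below is only an illustration of that behaviour, with None standing in for NULL; it is not DataFusion's implementation.

```python
def string_to_array(string, delimiter, null_string=None):
    """Split `string` and replace parts equal to `null_string` with None."""
    parts = string.split(delimiter)
    if null_string is None:
        # No marker given: return the parts unchanged.
        return parts
    return [None if part == null_string else part for part in parts]


print(string_to_array("hello,world", ","))           # ['hello', 'world']
print(string_to_array("hello,world", ",", "world"))  # ['hello', None]
```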
- datafusion.functions.string_to_list(string: datafusion.expr.Expr, delimiter: datafusion.expr.Expr, null_string: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Splits a string based on a delimiter and returns an array of parts.
See also
This is an alias for string_to_array().
- datafusion.functions.strpos(string: datafusion.expr.Expr, substring: datafusion.expr.Expr) datafusion.expr.Expr¶
Finds the position at which the substring first matches the string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.strpos(dfn.col("a"), dfn.lit("llo")).alias("pos"))
>>> result.collect_column("pos")[0].as_py()
3
- datafusion.functions.struct(*args: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns a struct with the given arguments.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1], "b": [2]})
>>> result = df.select(
...     dfn.functions.struct(
...         dfn.col("a"), dfn.col("b")
...     ).alias("s")
... )

Children in the new struct will always be c0, …, cN-1 for N children.

>>> result.collect_column("s")[0].as_py() == {"c0": 1, "c1": 2}
True
- datafusion.functions.substr(string: datafusion.expr.Expr, position: datafusion.expr.Expr) datafusion.expr.Expr¶
Substring from the position to the end.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.substr(dfn.col("a"), dfn.lit(3)).alias("s"))
>>> result.collect_column("s")[0].as_py()
'llo'
- datafusion.functions.substr_index(string: datafusion.expr.Expr, delimiter: datafusion.expr.Expr, count: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns an indexed substring.
Returns the portion of string that precedes count occurrences of delimiter.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["a.b.c"]})
>>> result = df.select(
...     dfn.functions.substr_index(dfn.col("a"), dfn.lit("."),
...     dfn.lit(2)).alias("s"))
>>> result.collect_column("s")[0].as_py()
'a.b'
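For a positive count, the result is everything before the count-th delimiter, as in the 'a.b' example above. The plain-Python sketch below illustrates that case. Its handling of a negative count (counting delimiters from the right) and of a zero count (empty result) is an assumption modelled on MySQL's SUBSTRING_INDEX and is not stated on this page.

```python
def substr_index(string, delimiter, count):
    """Return the part of `string` before `count` occurrences of `delimiter`."""
    parts = string.split(delimiter)
    if count > 0:
        # Keep the first `count` fields and re-join them.
        return delimiter.join(parts[:count])
    if count < 0:
        # Assumed: a negative count keeps the last `-count` fields.
        return delimiter.join(parts[count:])
    # Assumed: a zero count yields an empty string.
    return ""


print(substr_index("a.b.c", ".", 2))   # a.b
print(substr_index("a.b.c", ".", 1))   # a
print(substr_index("a.b.c", ".", -1))  # c
```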
- datafusion.functions.substring(string: datafusion.expr.Expr, position: datafusion.expr.Expr, length: datafusion.expr.Expr) datafusion.expr.Expr¶
Substring of length characters starting at position.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello world"]})
>>> result = df.select(
...     dfn.functions.substring(
...         dfn.col("a"), dfn.lit(1), dfn.lit(5)
...     ).alias("s"))
>>> result.collect_column("s")[0].as_py()
'hello'
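Both substr and substring count positions from 1, as the examples show (position 1 with length 5 gives 'hello', position 3 gives 'llo'). The plain-Python sketch below translates that convention to Python's 0-based slices. It assumes position is at least 1 and is not DataFusion's implementation.

```python
def substr(string, position):
    """Characters from the 1-based `position` to the end (position >= 1 assumed)."""
    return string[position - 1:]


def substring(string, position, length):
    """`length` characters starting at the 1-based `position` (position >= 1 assumed)."""
    start = position - 1
    return string[start:start + length]


print(substr("hello", 3))               # llo
print(substring("hello world", 1, 5))   # hello
print(substring("hello world", 7, 5))   # world
```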
- datafusion.functions.sum(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) datafusion.expr.Expr¶
Computes the sum of a set of numbers.
This aggregate function expects a numeric expression.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
- Parameters:
expression – The numeric values to sum
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.aggregate(
...     [], [dfn.functions.sum(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
6

>>> result = df.aggregate(
...     [], [dfn.functions.sum(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
5
- datafusion.functions.tan(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the tangent of the argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.tan(dfn.col("a")).alias("tan"))
>>> result.collect_column("tan")[0].as_py()
0.0
- datafusion.functions.tanh(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns the hyperbolic tangent of the argument.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [0.0]})
>>> result = df.select(dfn.functions.tanh(dfn.col("a")).alias("tanh"))
>>> result.collect_column("tanh")[0].as_py()
0.0
- datafusion.functions.to_char(arg: datafusion.expr.Expr, formatter: datafusion.expr.Expr) datafusion.expr.Expr¶
Returns a string representation of a date, time, timestamp or duration.
For usage of formatter, see the strftime module of the Rust chrono package. [Documentation here.](https://docs.rs/chrono/latest/chrono/format/strftime/index.html)
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_char(
...         dfn.functions.to_timestamp(dfn.col("a")),
...         dfn.lit("%Y/%m/%d"),
...     ).alias("formatted")
... )
>>> result.collect_column("formatted")[0].as_py()
'2021/01/01'
- datafusion.functions.to_date(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts a value to a date (YYYY-MM-DD).
Supports string, numeric, and timestamp types as input. Integers and doubles are interpreted as days since the Unix epoch. Strings are parsed as YYYY-MM-DD (e.g. ‘2023-07-20’) if formatters are not provided.
For usage of formatters, see the strftime module of the Rust chrono package. [Documentation here.](https://docs.rs/chrono/latest/chrono/format/strftime/index.html)
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-07-20"]})
>>> result = df.select(
...     dfn.functions.to_date(dfn.col("a")).alias("dt"))
>>> str(result.collect_column("dt")[0].as_py())
'2021-07-20'
- datafusion.functions.to_hex(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts an integer to a hexadecimal string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [255]})
>>> result = df.select(dfn.functions.to_hex(dfn.col("a")).alias("hex"))
>>> result.collect_column("hex")[0].as_py()
'ff'
- datafusion.functions.to_local_time(*args: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts a timestamp with a timezone to a timestamp without a timezone.
This function handles daylight saving time changes.
- datafusion.functions.to_time(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts a value to a time. Supports strings and timestamps as input.
If formatters is not provided, strings are parsed as HH:MM:SS, HH:MM, or HH:MM:SS.nnnnnnnnn.
For usage of formatters, see the strftime module of the Rust chrono package. [Documentation here.](https://docs.rs/chrono/latest/chrono/format/strftime/index.html)
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["14:30:00"]})
>>> result = df.select(
...     dfn.functions.to_time(dfn.col("a")).alias("t"))
>>> str(result.collect_column("t")[0].as_py())
'14:30:00'
- datafusion.functions.to_timestamp(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts a string and optional formats to a Timestamp in nanoseconds.
For usage of formatters, see the strftime module of the Rust chrono package. [Documentation here.](https://docs.rs/chrono/latest/chrono/format/strftime/index.html)
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_timestamp(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'2021-01-01 00:00:00'
- datafusion.functions.to_timestamp_micros(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts a string and optional formats to a Timestamp in microseconds.
See to_timestamp() for a description of how to use formatters.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_timestamp_micros(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'2021-01-01 00:00:00'
- datafusion.functions.to_timestamp_millis(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts a string and optional formats to a Timestamp in milliseconds.
See to_timestamp() for a description of how to use formatters.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_timestamp_millis(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'2021-01-01 00:00:00'
- datafusion.functions.to_timestamp_nanos(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts a string and optional formats to a Timestamp in nanoseconds.
See to_timestamp() for a description of how to use formatters.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_timestamp_nanos(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'2021-01-01 00:00:00'
- datafusion.functions.to_timestamp_seconds(arg: datafusion.expr.Expr, *formatters: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts a string and optional formats to a Timestamp in seconds.
See to_timestamp() for a description of how to use formatters.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["2021-01-01T00:00:00"]})
>>> result = df.select(
...     dfn.functions.to_timestamp_seconds(
...         dfn.col("a")
...     ).alias("ts")
... )
>>> str(result.collect_column("ts")[0].as_py())
'2021-01-01 00:00:00'
- datafusion.functions.to_unixtime(string: datafusion.expr.Expr, *format_arguments: datafusion.expr.Expr) datafusion.expr.Expr¶
Converts a string and optional formats to a Unix time (seconds since 1970-01-01T00:00:00).
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["1970-01-01T00:00:00"]})
>>> result = df.select(dfn.functions.to_unixtime(dfn.col("a")).alias("u"))
>>> result.collect_column("u")[0].as_py()
0
- datafusion.functions.translate(string: datafusion.expr.Expr, from_val: datafusion.expr.Expr, to_val: datafusion.expr.Expr) datafusion.expr.Expr¶
Replaces each character in from_val with its counterpart at the same position in to_val.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(
...     dfn.functions.translate(dfn.col("a"), dfn.lit("helo"),
...     dfn.lit("HELO")).alias("t"))
>>> result.collect_column("t")[0].as_py()
'HELLO'
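The character-by-character mapping can be sketched with Python's str.translate. The sketch is an illustration only. Deleting characters of from_val that have no counterpart in to_val, and letting the first occurrence win when from_val repeats a character, are assumptions modelled on PostgreSQL's translate.

```python
def translate(string, from_val, to_val):
    """Map each character of `from_val` to the character at the same index in `to_val`."""
    mapping = {}
    for i, ch in enumerate(from_val):
        # None tells str.translate to delete the character (assumed behaviour
        # when `to_val` is shorter than `from_val`).
        target = to_val[i] if i < len(to_val) else None
        # setdefault keeps the first mapping if a character repeats.
        mapping.setdefault(ord(ch), target)
    return string.translate(mapping)


print(translate("hello", "helo", "HELO"))  # HELLO
print(translate("hello", "lo", "L"))       # heLL
```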
- datafusion.functions.trim(arg: datafusion.expr.Expr) datafusion.expr.Expr¶
Removes all characters, spaces by default, from both sides of a string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [" hello "]})
>>> result = df.select(dfn.functions.trim(dfn.col("a")).alias("t"))
>>> result.collect_column("t")[0].as_py()
'hello'
- datafusion.functions.trunc(num: datafusion.expr.Expr, precision: datafusion.expr.Expr | None = None) → datafusion.expr.Expr
Truncate the number toward zero with optional precision.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.567]})
>>> result = df.select(
...     dfn.functions.trunc(
...         dfn.col("a")
...     ).alias("t"))
>>> result.collect_column("t")[0].as_py()
1.0
>>> result = df.select(
...     dfn.functions.trunc(
...         dfn.col("a"), precision=dfn.lit(2)
...     ).alias("t"))
>>> result.collect_column("t")[0].as_py()
1.56
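Truncation toward zero differs from flooring only for negative inputs. The standard-library sketch below shows that difference and one way to apply a decimal precision. It is not DataFusion code: the helper name `trunc_toward_zero` is hypothetical, and the scale-then-truncate approach is an assumption chosen for illustration.

```python
import math


def trunc_toward_zero(value: float, precision: int = 0) -> float:
    """Drop digits beyond `precision` decimal places without rounding."""
    scale = 10 ** precision
    # math.trunc discards the fractional part, moving toward zero.
    return math.trunc(value * scale) / scale


whole = trunc_toward_zero(1.567)
two_places = trunc_toward_zero(1.567, precision=2)
negative = trunc_toward_zero(-1.567)
floored = math.floor(-1.567)
```

For `-1.567`, truncation gives `-1.0` while `math.floor` gives `-2`, so the two operations should not be used interchangeably.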
- datafusion.functions.union_extract(union_expr: datafusion.expr.Expr, field_name: datafusion.expr.Expr | str) → datafusion.expr.Expr
Extracts a value from a union type by field name.
Returns the value of the named field if it is the currently selected variant, otherwise returns NULL.
Examples
>>> ctx = dfn.SessionContext()
>>> types = pa.array([0, 1, 0], type=pa.int8())
>>> offsets = pa.array([0, 0, 1], type=pa.int32())
>>> arr = pa.UnionArray.from_dense(
...     types, offsets, [pa.array([1, 2]), pa.array(["hi"])],
...     ["int", "str"], [0, 1],
... )
>>> batch = pa.RecordBatch.from_arrays([arr], names=["u"])
>>> df = ctx.create_dataframe([[batch]])
>>> result = df.select(
...     dfn.functions.union_extract(dfn.col("u"), "int").alias("val")
... )
>>> result.collect_column("val").to_pylist()
[1, None, 2]
- datafusion.functions.union_tag(union_expr: datafusion.expr.Expr) → datafusion.expr.Expr
Returns the tag (active field name) of a union type.
Examples
>>> ctx = dfn.SessionContext()
>>> types = pa.array([0, 1, 0], type=pa.int8())
>>> offsets = pa.array([0, 0, 1], type=pa.int32())
>>> arr = pa.UnionArray.from_dense(
...     types, offsets, [pa.array([1, 2]), pa.array(["hi"])],
...     ["int", "str"], [0, 1],
... )
>>> batch = pa.RecordBatch.from_arrays([arr], names=["u"])
>>> df = ctx.create_dataframe([[batch]])
>>> result = df.select(
...     dfn.functions.union_tag(dfn.col("u")).alias("tag")
... )
>>> result.collect_column("tag").to_pylist()
['int', 'str', 'int']
- datafusion.functions.upper(arg: datafusion.expr.Expr) → datafusion.expr.Expr
Converts a string to uppercase.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": ["hello"]})
>>> result = df.select(dfn.functions.upper(dfn.col("a")).alias("u"))
>>> result.collect_column("u")[0].as_py()
'HELLO'
- datafusion.functions.uuid() → datafusion.expr.Expr
Returns a version 4 UUID as a string value.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1]})
>>> result = df.select(
...     dfn.functions.uuid().alias("u")
... )
>>> len(result.collect_column("u")[0].as_py()) == 36
True
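The length check of 36 follows from the canonical UUID text form: 32 hexadecimal digits plus 4 hyphens. The sketch below uses Python's standard `uuid` module, not DataFusion, to show the same shape and to confirm the version field.

```python
import uuid

# Generate a random (version 4) UUID and render it in canonical text form.
value = str(uuid.uuid4())

# Parsing the string back exposes the version number encoded in it.
parsed = uuid.UUID(value)
hyphen_count = value.count("-")
```

Because the value is random, examples can only assert on its shape, never on its exact content.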
- datafusion.functions.var(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr
Computes the sample variance of the argument.
See also
This is an alias for var_samp().
- datafusion.functions.var_pop(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr
Computes the population variance of the argument.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
Parameters:
expression – The variable to compute the variance for
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [-1.0, 0.0, 2.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.var_pop(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.555...
>>> result = df.aggregate(
...     [], [dfn.functions.var_pop(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(-1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.0
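The two results above can be checked by hand. The following sketch reproduces them in plain Python rather than DataFusion, using the population formula (sum of squared deviations divided by the number of values) and modelling the filter as a list comprehension. All variable names are illustrative.

```python
import statistics

values = [-1.0, 0.0, 2.0]

# Population variance: divide the squared deviations by n.
mean = sum(values) / len(values)
pop_var = sum((x - mean) ** 2 for x in values) / len(values)

# The filter keeps only rows where a > -1.0, leaving [0.0, 2.0].
filtered = [x for x in values if x > -1.0]
pop_var_filtered = statistics.pvariance(filtered)
```

Here `pop_var` is 14/9, which prints as `1.555...`, and the filtered result is exactly `1.0`.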
- datafusion.functions.var_population(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr
Computes the population variance of the argument.
See also
This is an alias for var_pop().
- datafusion.functions.var_samp(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr
Computes the sample variance of the argument.
If using the builder functions described in the aggregation documentation, this function ignores the options order_by, null_treatment, and distinct.
Parameters:
expression – The variable to compute the variance for
filter – If provided, only compute against rows for which the filter is True
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})
>>> result = df.aggregate(
...     [], [dfn.functions.var_samp(
...         dfn.col("a")
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
1.0
>>> result = df.aggregate(
...     [], [dfn.functions.var_samp(
...         dfn.col("a"),
...         filter=dfn.col("a") > dfn.lit(1.0)
...     ).alias("v")])
>>> result.collect_column("v")[0].as_py()
0.5
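Sample variance divides by n - 1 instead of n, so it is always larger than the population variance of the same data. The plain-Python sketch below, which is not DataFusion code, reproduces both example results. The helper name `sample_variance` is hypothetical.

```python
import statistics


def sample_variance(xs: list[float]) -> float:
    """Sum of squared deviations divided by n - 1 (Bessel's correction)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


values = [1.0, 2.0, 3.0]
samp_var = sample_variance(values)

# The filter keeps only rows where a > 1.0, leaving [2.0, 3.0].
samp_var_filtered = sample_variance([x for x in values if x > 1.0])

# For comparison: the population variance of the same data divides by n.
pop_var = statistics.pvariance(values)
```

With three values the sample variance is `1.0`, while the population variance of the same data is `2/3`.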
- datafusion.functions.var_sample(expression: datafusion.expr.Expr, filter: datafusion.expr.Expr | None = None) → datafusion.expr.Expr
Computes the sample variance of the argument.
See also
This is an alias for var_samp().
- datafusion.functions.version() → datafusion.expr.Expr
Returns the DataFusion version string.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.empty_table()
>>> result = df.select(dfn.functions.version().alias("v"))
>>> "Apache DataFusion" in result.collect_column("v")[0].as_py()
True
- datafusion.functions.when(when: datafusion.expr.Expr, then: datafusion.expr.Expr) → datafusion.expr.CaseBuilder
Create a case expression that has no base expression.
Creates a CaseBuilder whose first branch returns then for rows where the condition when is true. See CaseBuilder for detailed usage.
Examples
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 2, 3]})
>>> result = df.select(
...     dfn.functions.when(dfn.col("a") > dfn.lit(2),
...     dfn.lit("big")).otherwise(dfn.lit("small")).alias("c"))
>>> result.collect_column("c")[2].as_py()
'big'
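A case expression without a base expression evaluates its conditions in order and returns the result of the first one that holds. The sketch below models that standard SQL behaviour in plain Python. It is not the CaseBuilder implementation, and the helper name `searched_case` is hypothetical.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

Branch = Tuple[Callable[[Any], bool], Any]


def searched_case(value: Any, branches: Sequence[Branch],
                  otherwise: Optional[Any] = None) -> Any:
    """Return the result of the first branch whose condition is true."""
    for condition, result in branches:
        if condition(value):
            return result
    # No branch matched: fall back to the default (None models SQL NULL).
    return otherwise


column = [1, 2, 3]

# One branch plus a default, mirroring the example above.
labels = [searched_case(a, [(lambda a: a > 2, "big")], "small") for a in column]

# Two branches: order matters, because the first match wins.
graded = [
    searched_case(a, [(lambda a: a > 2, "big"), (lambda a: a > 1, "medium")], "small")
    for a in column
]

# Without a default, unmatched rows yield None.
no_default = [searched_case(a, [(lambda a: a > 2, "big")]) for a in column]
```

Placing the most specific condition first matters: if `a > 1` came before `a > 2`, the value 3 would be labelled `medium`.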
- datafusion.functions.today