Functions¶

DataFusion provides a large number of built-in functions for performing complex queries without requiring user-defined functions. In here we will cover some of the more popular use cases. If you want to view all the functions go to the Functions API Reference.

We’ll use the pokemon dataset in the following examples.

In [1]: from datafusion import SessionContext

In [2]: ctx = SessionContext()

In [3]: ctx.register_csv("pokemon", "pokemon.csv")

In [4]: df = ctx.table("pokemon")

Mathematical¶

DataFusion offers mathematical functions such as pow() or log()

In [5]: from datafusion import col, literal, string_literal, str_lit

In [6]: from datafusion import functions as f

In [7]: df.select(
   ...:     f.pow(col('"Attack"'), literal(2)) - f.pow(col('"Defense"'), literal(2))
   ...: ).limit(10)
   ...: 
Out[7]: 
DataFrame()
+------------------------------------------------------------------+
| power(pokemon.Attack,Int64(2)) - power(pokemon.Defense,Int64(2)) |
+------------------------------------------------------------------+
| 0                                                                |
| -125                                                             |
| -165                                                             |
| -5129                                                            |
| 855                                                              |
| 732                                                              |
| 972                                                              |
| 4579                                                             |
| 4732                                                             |
| -1921                                                            |
+------------------------------------------------------------------+

Conditional¶

There 3 conditional functions in DataFusion coalesce(), nullif() and case().

In [8]: df.select(
   ...:     f.coalesce(col('"Type 1"'), col('"Type 2"')).alias("dominant_type")
   ...: ).limit(10)
   ...: 
Out[8]: 
DataFrame()
+---------------+
| dominant_type |
+---------------+
| Grass         |
| Grass         |
| Grass         |
| Grass         |
| Fire          |
| Fire          |
| Fire          |
| Fire          |
| Fire          |
| Water         |
+---------------+

Temporal¶

For selecting the current time use now()

In [9]: df.select(f.now())
Out[9]: 
DataFrame()
+--------------------------------+
| now()                          |
+--------------------------------+
| 2025-07-12T11:49:57.473788797Z |
| 2025-07-12T11:49:57.473788797Z |
| 2025-07-12T11:49:57.473788797Z |
| 2025-07-12T11:49:57.473788797Z |
| 2025-07-12T11:49:57.473788797Z |
| 2025-07-12T11:49:57.473788797Z |
| 2025-07-12T11:49:57.473788797Z |
| 2025-07-12T11:49:57.473788797Z |
| 2025-07-12T11:49:57.473788797Z |
| 2025-07-12T11:49:57.473788797Z |
+--------------------------------+
Data truncated.

Convert to timestamps using to_timestamp()

In [10]: df.select(f.to_timestamp(col('"Total"')).alias("timestamp"))
Out[10]: 
DataFrame()
+---------------------+
| timestamp           |
+---------------------+
| 1970-01-01T00:05:18 |
| 1970-01-01T00:06:45 |
| 1970-01-01T00:08:45 |
| 1970-01-01T00:10:25 |
| 1970-01-01T00:05:09 |
| 1970-01-01T00:06:45 |
| 1970-01-01T00:08:54 |
| 1970-01-01T00:10:34 |
| 1970-01-01T00:10:34 |
| 1970-01-01T00:05:14 |
+---------------------+
Data truncated.

Extracting parts of a date using date_part() (alias extract())

In [11]: df.select(
   ....:    f.date_part(literal("month"), f.to_timestamp(col('"Total"'))).alias("month"),
   ....:    f.extract(literal("day"), f.to_timestamp(col('"Total"'))).alias("day")
   ....: )
   ....: 
Out[11]: 
DataFrame()
+-------+-----+
| month | day |
+-------+-----+
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
+-------+-----+
Data truncated.

String¶

In the field of data science, working with textual data is a common task. To make string manipulation easier, DataFusion offers a range of helpful options.

In [12]: df.select(
   ....:     f.char_length(col('"Name"')).alias("len"),
   ....:     f.lower(col('"Name"')).alias("lower"),
   ....:     f.left(col('"Name"'), literal(4)).alias("code")
   ....: )
   ....: 
Out[12]: 
DataFrame()
+-----+---------------------------+------+
| len | lower                     | code |
+-----+---------------------------+------+
| 9   | bulbasaur                 | Bulb |
| 7   | ivysaur                   | Ivys |
| 8   | venusaur                  | Venu |
| 21  | venusaurmega venusaur     | Venu |
| 10  | charmander                | Char |
| 10  | charmeleon                | Char |
| 9   | charizard                 | Char |
| 25  | charizardmega charizard x | Char |
| 25  | charizardmega charizard y | Char |
| 8   | squirtle                  | Squi |
+-----+---------------------------+------+
Data truncated.

This also includes the functions for regular expressions like regexp_replace() and regexp_match()

In [13]: df.select(
   ....:     f.regexp_match(col('"Name"'), literal("Char")).alias("dragons"),
   ....:     f.regexp_replace(col('"Name"'), literal("saur"), literal("fleur")).alias("flowers")
   ....: )
   ....: 
Out[13]: 
DataFrame()
+---------+---------------------------+
| dragons | flowers                   |
+---------+---------------------------+
|         | Bulbafleur                |
|         | Ivyfleur                  |
|         | Venufleur                 |
|         | VenufleurMega Venusaur    |
| [Char]  | Charmander                |
| [Char]  | Charmeleon                |
| [Char]  | Charizard                 |
| [Char]  | CharizardMega Charizard X |
| [Char]  | CharizardMega Charizard Y |
|         | Squirtle                  |
+---------+---------------------------+
Data truncated.

Casting¶

Casting expressions to different data types using arrow_cast()

In [14]: df.select(
   ....:     f.arrow_cast(col('"Total"'), string_literal("Float64")).alias("total_as_float"),
   ....:     f.arrow_cast(col('"Total"'), str_lit("Int32")).alias("total_as_int")
   ....: )
   ....: 
Out[14]: 
DataFrame()
+----------------+--------------+
| total_as_float | total_as_int |
+----------------+--------------+
| 318.0          | 318          |
| 405.0          | 405          |
| 525.0          | 525          |
| 625.0          | 625          |
| 309.0          | 309          |
| 405.0          | 405          |
| 534.0          | 534          |
| 634.0          | 634          |
| 634.0          | 634          |
| 314.0          | 314          |
+----------------+--------------+
Data truncated.

Other¶

The function in_list() allows to check a column for the presence of multiple values:

In [15]: types = [literal("Grass"), literal("Fire"), literal("Water")]

In [16]: (
   ....:     df.select(f.in_list(col('"Type 1"'), types, negated=False).alias("basic_types"))
   ....:       .limit(20)
   ....:       .to_pandas()
   ....: )
   ....: 
Out[16]: 
    basic_types
0          True
1          True
2          True
3          True
4          True
5          True
6          True
7          True
8          True
9          True
10         True
11         True
12         True
13        False
14        False
15        False
16        False
17        False
18        False
19        False

Handling Missing Values¶

DataFusion provides methods to handle missing values in DataFrames:

fill_null¶

The fill_null() method replaces NULL values in specified columns with a provided value:

# Fill all NULL values with 0 where possible
df = df.fill_null(0)

# Fill NULL values only in specific string columns
df = df.fill_null("missing", subset=["name", "category"])

The fill value will be cast to match each column’s type. If casting fails for a column, that column remains unchanged.

Joins

Aggregation