Functions

DataFusion provides a large number of built-in functions for performing complex queries without requiring user-defined functions. In here we will cover some of the more popular use cases. If you want to view all the functions go to the Functions API Reference.

We’ll use the pokemon dataset in the following examples.

In [1]: from datafusion import SessionContext

In [2]: ctx = SessionContext()

In [3]: ctx.register_csv("pokemon", "pokemon.csv")

In [4]: df = ctx.table("pokemon")

Mathematical

DataFusion offers mathematical functions such as pow() or log()

In [5]: from datafusion import col, literal, string_literal, str_lit

In [6]: from datafusion import functions as f

In [7]: df.select(
   ...:     f.pow(col('"Attack"'), literal(2)) - f.pow(col('"Defense"'), literal(2))
   ...: ).limit(10)
   ...: 
Out[7]: 
DataFrame()
+------------------------------------------------------------------+
| power(pokemon.Attack,Int64(2)) - power(pokemon.Defense,Int64(2)) |
+------------------------------------------------------------------+
| 0                                                                |
| -125                                                             |
| -165                                                             |
| -5129                                                            |
| 855                                                              |
| 732                                                              |
| 972                                                              |
| 4579                                                             |
| 4732                                                             |
| -1921                                                            |
+------------------------------------------------------------------+

Conditional

There 3 conditional functions in DataFusion coalesce(), nullif() and case().

In [8]: df.select(
   ...:     f.coalesce(col('"Type 1"'), col('"Type 2"')).alias("dominant_type")
   ...: ).limit(10)
   ...: 
Out[8]: 
DataFrame()
+---------------+
| dominant_type |
+---------------+
| Grass         |
| Grass         |
| Grass         |
| Grass         |
| Fire          |
| Fire          |
| Fire          |
| Fire          |
| Fire          |
| Water         |
+---------------+

Temporal

For selecting the current time use now()

In [9]: df.select(f.now())
Out[9]: 
DataFrame()
+--------------------------------+
| now()                          |
+--------------------------------+
| 2025-02-15T19:55:21.246940956Z |
| 2025-02-15T19:55:21.246940956Z |
| 2025-02-15T19:55:21.246940956Z |
| 2025-02-15T19:55:21.246940956Z |
| 2025-02-15T19:55:21.246940956Z |
| 2025-02-15T19:55:21.246940956Z |
| 2025-02-15T19:55:21.246940956Z |
| 2025-02-15T19:55:21.246940956Z |
| 2025-02-15T19:55:21.246940956Z |
| 2025-02-15T19:55:21.246940956Z |
+--------------------------------+

Convert to timestamps using to_timestamp()

In [10]: df.select(f.to_timestamp(col('"Total"')).alias("timestamp"))
Out[10]: 
DataFrame()
+---------------------+
| timestamp           |
+---------------------+
| 1970-01-01T00:05:18 |
| 1970-01-01T00:06:45 |
| 1970-01-01T00:08:45 |
| 1970-01-01T00:10:25 |
| 1970-01-01T00:05:09 |
| 1970-01-01T00:06:45 |
| 1970-01-01T00:08:54 |
| 1970-01-01T00:10:34 |
| 1970-01-01T00:10:34 |
| 1970-01-01T00:05:14 |
+---------------------+

Extracting parts of a date using date_part() (alias extract())

In [11]: df.select(
   ....:    f.date_part(literal("month"), f.to_timestamp(col('"Total"'))).alias("month"),
   ....:    f.extract(literal("day"), f.to_timestamp(col('"Total"'))).alias("day")
   ....: )
   ....: 
Out[11]: 
DataFrame()
+-------+-----+
| month | day |
+-------+-----+
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
| 1     | 1   |
+-------+-----+

String

In the field of data science, working with textual data is a common task. To make string manipulation easier, DataFusion offers a range of helpful options.

In [12]: df.select(
   ....:     f.char_length(col('"Name"')).alias("len"),
   ....:     f.lower(col('"Name"')).alias("lower"),
   ....:     f.left(col('"Name"'), literal(4)).alias("code")
   ....: )
   ....: 
Out[12]: 
DataFrame()
+-----+---------------------------+------+
| len | lower                     | code |
+-----+---------------------------+------+
| 9   | bulbasaur                 | Bulb |
| 7   | ivysaur                   | Ivys |
| 8   | venusaur                  | Venu |
| 21  | venusaurmega venusaur     | Venu |
| 10  | charmander                | Char |
| 10  | charmeleon                | Char |
| 9   | charizard                 | Char |
| 25  | charizardmega charizard x | Char |
| 25  | charizardmega charizard y | Char |
| 8   | squirtle                  | Squi |
+-----+---------------------------+------+

This also includes the functions for regular expressions like regexp_replace() and regexp_match()

In [13]: df.select(
   ....:     f.regexp_match(col('"Name"'), literal("Char")).alias("dragons"),
   ....:     f.regexp_replace(col('"Name"'), literal("saur"), literal("fleur")).alias("flowers")
   ....: )
   ....: 
Out[13]: 
DataFrame()
+---------+---------------------------+
| dragons | flowers                   |
+---------+---------------------------+
|         | Bulbafleur                |
|         | Ivyfleur                  |
|         | Venufleur                 |
|         | VenufleurMega Venusaur    |
| [Char]  | Charmander                |
| [Char]  | Charmeleon                |
| [Char]  | Charizard                 |
| [Char]  | CharizardMega Charizard X |
| [Char]  | CharizardMega Charizard Y |
|         | Squirtle                  |
+---------+---------------------------+

Casting

Casting expressions to different data types using arrow_cast()

In [14]: df.select(
   ....:     f.arrow_cast(col('"Total"'), string_literal("Float64")).alias("total_as_float"),
   ....:     f.arrow_cast(col('"Total"'), str_lit("Int32")).alias("total_as_int")
   ....: )
   ....: 
Out[14]: 
DataFrame()
+----------------+--------------+
| total_as_float | total_as_int |
+----------------+--------------+
| 318.0          | 318          |
| 405.0          | 405          |
| 525.0          | 525          |
| 625.0          | 625          |
| 309.0          | 309          |
| 405.0          | 405          |
| 534.0          | 534          |
| 634.0          | 634          |
| 634.0          | 634          |
| 314.0          | 314          |
+----------------+--------------+

Other

The function in_list() allows to check a column for the presence of multiple values:

In [15]: types = [literal("Grass"), literal("Fire"), literal("Water")]

In [16]: (
   ....:     df.select(f.in_list(col('"Type 1"'), types, negated=False).alias("basic_types"))
   ....:       .limit(20)
   ....:       .to_pandas()
   ....: )
   ....: 
Out[16]: 
    basic_types
0          True
1          True
2          True
3          True
4          True
5          True
6          True
7          True
8          True
9          True
10         True
11         True
12         True
13        False
14        False
15        False
16        False
17        False
18        False
19        False