Functions¶
DataFusion provides a large number of built-in functions for performing complex queries without requiring user-defined functions.
In here we will cover some of the more popular use cases. If you want to view all the functions go to the Functions
API Reference.
We’ll use the pokemon dataset in the following examples.
In [1]: import urllib.request
In [2]: from datafusion import SessionContext
In [3]: urllib.request.urlretrieve(
...: "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
...: "pokemon.csv",
...: )
...:
Out[3]: ('pokemon.csv', <http.client.HTTPMessage at 0x7f742bff5490>)
In [4]: ctx = SessionContext()
In [5]: ctx.register_csv("pokemon", "pokemon.csv")
In [6]: df = ctx.table("pokemon")
Mathematical¶
DataFusion offers mathematical functions such as pow()
or log()
In [7]: from datafusion import col, literal
In [8]: from datafusion import functions as f
In [9]: df.select(
...: f.pow(col('"Attack"'), literal(2)) - f.pow(col('"Defense"'), literal(2))
...: ).limit(10)
...:
Out[9]:
DataFrame()
+------------------------------------------------------------------+
| power(pokemon.Attack,Int64(2)) - power(pokemon.Defense,Int64(2)) |
+------------------------------------------------------------------+
| 0 |
| -125 |
| -165 |
| -5129 |
| 855 |
| 732 |
| 972 |
| 4579 |
| 4732 |
| -1921 |
+------------------------------------------------------------------+
Conditional¶
There 3 conditional functions in DataFusion coalesce()
, nullif()
and case()
.
In [10]: df.select(
....: f.coalesce(col('"Type 1"'), col('"Type 2"')).alias("dominant_type")
....: ).limit(10)
....:
Out[10]:
DataFrame()
+---------------+
| dominant_type |
+---------------+
| Grass |
| Grass |
| Grass |
| Grass |
| Fire |
| Fire |
| Fire |
| Fire |
| Fire |
| Water |
+---------------+
Temporal¶
For selecting the current time use now()
In [11]: df.select(f.now())
Out[11]:
DataFrame()
+--------------------------------+
| now() |
+--------------------------------+
| 2024-08-20T13:59:02.612506184Z |
| 2024-08-20T13:59:02.612506184Z |
| 2024-08-20T13:59:02.612506184Z |
| 2024-08-20T13:59:02.612506184Z |
| 2024-08-20T13:59:02.612506184Z |
| 2024-08-20T13:59:02.612506184Z |
| 2024-08-20T13:59:02.612506184Z |
| 2024-08-20T13:59:02.612506184Z |
| 2024-08-20T13:59:02.612506184Z |
| 2024-08-20T13:59:02.612506184Z |
+--------------------------------+
Convert to timestamps using to_timestamp()
In [12]: df.select(f.to_timestamp(col('"Total"')).alias("timestamp"))
Out[12]:
DataFrame()
+---------------------+
| timestamp |
+---------------------+
| 1970-01-01T00:05:18 |
| 1970-01-01T00:06:45 |
| 1970-01-01T00:08:45 |
| 1970-01-01T00:10:25 |
| 1970-01-01T00:05:09 |
| 1970-01-01T00:06:45 |
| 1970-01-01T00:08:54 |
| 1970-01-01T00:10:34 |
| 1970-01-01T00:10:34 |
| 1970-01-01T00:05:14 |
+---------------------+
String¶
In the field of data science, working with textual data is a common task. To make string manipulation easier, DataFusion offers a range of helpful options.
In [13]: df.select(
....: f.char_length(col('"Name"')).alias("len"),
....: f.lower(col('"Name"')).alias("lower"),
....: f.left(col('"Name"'), literal(4)).alias("code")
....: )
....:
Out[13]:
DataFrame()
+-----+---------------------------+------+
| len | lower | code |
+-----+---------------------------+------+
| 9 | bulbasaur | Bulb |
| 7 | ivysaur | Ivys |
| 8 | venusaur | Venu |
| 21 | venusaurmega venusaur | Venu |
| 10 | charmander | Char |
| 10 | charmeleon | Char |
| 9 | charizard | Char |
| 25 | charizardmega charizard x | Char |
| 25 | charizardmega charizard y | Char |
| 8 | squirtle | Squi |
+-----+---------------------------+------+
This also includes the functions for regular expressions like regexp_replace()
and regexp_match()
In [14]: df.select(
....: f.regexp_match(col('"Name"'), literal("Char")).alias("dragons"),
....: f.regexp_replace(col('"Name"'), literal("saur"), literal("fleur")).alias("flowers")
....: )
....:
Out[14]:
DataFrame()
+---------+---------------------------+
| dragons | flowers |
+---------+---------------------------+
| | Bulbafleur |
| | Ivyfleur |
| | Venufleur |
| | VenufleurMega Venusaur |
| [Char] | Charmander |
| [Char] | Charmeleon |
| [Char] | Charizard |
| [Char] | CharizardMega Charizard X |
| [Char] | CharizardMega Charizard Y |
| | Squirtle |
+---------+---------------------------+
Other¶
The function in_list()
allows to check a column for the presence of multiple values:
In [15]: types = [literal("Grass"), literal("Fire"), literal("Water")]
In [16]: (
....: df.select(f.in_list(col('"Type 1"'), types, negated=False).alias("basic_types"))
....: .limit(20)
....: .to_pandas()
....: )
....:
Out[16]:
basic_types
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 False
14 False
15 False
16 False
17 False
18 False
19 False