Window Functions¶
In this section you will learn about window functions. A window function utilizes values from one or multiple rows to produce a result for each individual row, unlike an aggregate function that provides a single value for multiple rows.
The functionality of window functions in DataFusion is supported by the dedicated window()
function.
We’ll use the pokemon dataset (from Ritchie Vink) in the following examples.
In [1]: import urllib.request
In [2]: from datafusion import SessionContext
In [3]: from datafusion import col
In [4]: from datafusion import functions as f
In [5]: urllib.request.urlretrieve(
...: "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
...: "pokemon.csv",
...: )
...:
Out[5]: ('pokemon.csv', <http.client.HTTPMessage at 0x7f7415f8d0d0>)
In [6]: ctx = SessionContext()
In [7]: df = ctx.read_csv("pokemon.csv")
Here is an example that shows how to compare each pokemons’s attack power with the average attack power in its "Type 1"
In [8]: df.select(
...: col('"Name"'),
...: col('"Attack"'),
...: f.alias(
...: f.window("avg", [col('"Attack"')], partition_by=[col('"Type 1"')]),
...: "Average Attack",
...: )
...: )
...:
Out[8]:
DataFrame()
+------------+--------+--------------------+
| Name | Attack | Average Attack |
+------------+--------+--------------------+
| Jynx | 50 | 50.0 |
| Articuno | 85 | 67.5 |
| Farfetch'd | 65 | 65.0 |
| Jigglypuff | 45 | 55.0 |
| Rattata | 56 | 55.333333333333336 |
| Raticate | 81 | 61.75 |
| Spearow | 60 | 61.4 |
| Fearow | 90 | 66.16666666666667 |
| Doduo | 85 | 68.85714285714286 |
| Wigglytuff | 70 | 69.0 |
+------------+--------+--------------------+
You can also control the order in which rows are processed by window functions by providing
a list of order_by
functions for the order_by
parameter.
In [9]: df.select(
...: col('"Name"'),
...: col('"Attack"'),
...: f.alias(
...: f.window(
...: "rank",
...: [],
...: partition_by=[col('"Type 1"')],
...: order_by=[f.order_by(col('"Attack"'))],
...: ),
...: "rank",
...: ),
...: )
...:
Out[9]:
DataFrame()
+------------+--------+------+
| Name | Attack | rank |
+------------+--------+------+
| Jynx | 50 | 1 |
| Articuno | 85 | 2 |
| Chansey | 5 | 1 |
| Jigglypuff | 45 | 2 |
| Meowth | 45 | 2 |
| Pidgey | 45 | 2 |
| Ditto | 48 | 5 |
| Lickitung | 55 | 6 |
| Eevee | 55 | 6 |
| Rattata | 56 | 8 |
+------------+--------+------+
The possible window functions are:
- Rank Functions
rank
dense_rank
row_number
ntile
- Analytical Functions
cume_dist
percent_rank
lag
lead
first_value
last_value
nth_value
- Aggregate Functions
All aggregate functions can be used as window functions.