Window Functions

In this section you will learn about window functions. A window function utilizes values from one or multiple rows to produce a result for each individual row, unlike an aggregate function that provides a single value for multiple rows.

The functionality of window functions in DataFusion is supported by the dedicated window() function.

We’ll use the pokemon dataset (from Ritchie Vink) in the following examples.

In [1]: import urllib.request

In [2]: from datafusion import SessionContext

In [3]: from datafusion import col

In [4]: from datafusion import functions as f

In [5]: urllib.request.urlretrieve(
   ...:     "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
   ...:     "pokemon.csv",
   ...: )
   ...: 
Out[5]: ('pokemon.csv', <http.client.HTTPMessage at 0x7f7415f8d0d0>)

In [6]: ctx = SessionContext()

In [7]: df = ctx.read_csv("pokemon.csv")

Here is an example that shows how to compare each pokemons’s attack power with the average attack power in its "Type 1"

In [8]: df.select(
   ...:     col('"Name"'),
   ...:     col('"Attack"'),
   ...:     f.alias(
   ...:         f.window("avg", [col('"Attack"')], partition_by=[col('"Type 1"')]),
   ...:         "Average Attack",
   ...:     )
   ...: )
   ...: 
Out[8]: 
DataFrame()
+------------+--------+--------------------+
| Name       | Attack | Average Attack     |
+------------+--------+--------------------+
| Jynx       | 50     | 50.0               |
| Articuno   | 85     | 67.5               |
| Farfetch'd | 65     | 65.0               |
| Jigglypuff | 45     | 55.0               |
| Rattata    | 56     | 55.333333333333336 |
| Raticate   | 81     | 61.75              |
| Spearow    | 60     | 61.4               |
| Fearow     | 90     | 66.16666666666667  |
| Doduo      | 85     | 68.85714285714286  |
| Wigglytuff | 70     | 69.0               |
+------------+--------+--------------------+

You can also control the order in which rows are processed by window functions by providing a list of order_by functions for the order_by parameter.

In [9]: df.select(
   ...:     col('"Name"'),
   ...:     col('"Attack"'),
   ...:     f.alias(
   ...:         f.window(
   ...:             "rank",
   ...:             [],
   ...:             partition_by=[col('"Type 1"')],
   ...:             order_by=[f.order_by(col('"Attack"'))],
   ...:         ),
   ...:         "rank",
   ...:     ),
   ...: )
   ...: 
Out[9]: 
DataFrame()
+------------+--------+------+
| Name       | Attack | rank |
+------------+--------+------+
| Jynx       | 50     | 1    |
| Articuno   | 85     | 2    |
| Chansey    | 5      | 1    |
| Jigglypuff | 45     | 2    |
| Meowth     | 45     | 2    |
| Pidgey     | 45     | 2    |
| Ditto      | 48     | 5    |
| Lickitung  | 55     | 6    |
| Eevee      | 55     | 6    |
| Rattata    | 56     | 8    |
+------------+--------+------+

The possible window functions are:

  1. Rank Functions
    • rank

    • dense_rank

    • row_number

    • ntile

  2. Analytical Functions
    • cume_dist

    • percent_rank

    • lag

    • lead

    • first_value

    • last_value

    • nth_value

  3. Aggregate Functions
    • All aggregate functions can be used as window functions.