Aggregation#
An aggregate or aggregation is a function where the values of multiple rows are processed together
to form a single summary value. To perform an aggregation, DataFusion provides the
aggregate() method on DataFrame.
from datafusion import SessionContext, col, lit, functions as f
ctx = SessionContext()
df = ctx.read_csv("pokemon.csv")
col_type_1 = col('"Type 1"')
col_type_2 = col('"Type 2"')
col_speed = col('"Speed"')
col_attack = col('"Attack"')
df.aggregate([col_type_1], [
f.approx_distinct(col_speed).alias("Count"),
f.approx_median(col_speed).alias("Median Speed"),
f.approx_percentile_cont(col_speed, 0.9).alias("90% Speed")])
DataFrame()
+----------+-------+--------------+--------------------+
| Type 1 | Count | Median Speed | 90% Speed |
+----------+-------+--------------+--------------------+
| Water | 21 | 70.0 | 90.0 |
| Rock | 8 | 55.0 | 140.0 |
| Ghost | 4 | 101.25 | 130.0 |
| Ice | 2 | 90.0 | 95.0 |
| Dragon | 3 | 70.0 | 80.0 |
| Grass | 8 | 55.0 | 80.0 |
| Fire | 8 | 91.75 | 100.25 |
| Normal | 20 | 71.0 | 110.70000000000002 |
| Poison | 12 | 55.0 | 85.5 |
| Fighting | 7 | 70.0 | 93.4 |
+----------+-------+--------------+--------------------+
Data truncated.
When group_by is None or an empty list, the aggregation is done over the whole
DataFrame. To group the rows, the group_by list must contain at least one column.
df.aggregate([col_type_1], [
f.max(col_speed).alias("Max Speed"),
f.avg(col_speed).alias("Avg Speed"),
f.min(col_speed).alias("Min Speed")])
DataFrame()
+----------+-----------+--------------------+-----------+
| Type 1 | Max Speed | Avg Speed | Min Speed |
+----------+-----------+--------------------+-----------+
| Water | 115 | 67.25806451612904 | 15 |
| Rock | 150 | 67.5 | 20 |
| Ghost | 130 | 103.75 | 80 |
| Ice | 95 | 90.0 | 85 |
| Dragon | 80 | 66.66666666666667 | 50 |
| Grass | 80 | 54.23076923076923 | 30 |
| Fire | 105 | 86.28571428571429 | 60 |
| Normal | 121 | 72.75 | 20 |
| Poison | 90 | 58.785714285714285 | 25 |
| Fighting | 95 | 66.14285714285714 | 35 |
+----------+-----------+--------------------+-----------+
Data truncated.
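The difference between an empty and a non-empty group_by can be modelled in plain Python. This is only a sketch of the semantics, not DataFusion code: the aggregate helper and the rows are hypothetical and are not taken from pokemon.csv. With an empty group_by every row lands in the single group keyed by the empty tuple, so the result has one row.

```python
from collections import defaultdict


def aggregate(rows, group_by, column, func):
    """Group rows by the group_by keys and reduce `column` with `func`."""
    groups = defaultdict(list)
    for row in rows:
        # With an empty group_by the key is always (), i.e. one global group.
        key = tuple(row[name] for name in group_by)
        groups[key].append(row[column])
    return {key: func(values) for key, values in groups.items()}


# Illustrative rows only.
rows = [
    {"type": "Water", "speed": 40},
    {"type": "Water", "speed": 60},
    {"type": "Fire", "speed": 90},
]

per_type = aggregate(rows, ["type"], "speed", max)
overall = aggregate(rows, [], "speed", max)

print(per_type)  # {('Water',): 60, ('Fire',): 90}
print(overall)   # {(): 90}
```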
More than one column can be used for grouping:
df.aggregate([col_type_1, col_type_2], [
f.max(col_speed).alias("Max Speed"),
f.avg(col_speed).alias("Avg Speed"),
f.min(col_speed).alias("Min Speed")])
DataFrame()
+--------+---------+-----------+-------------------+-----------+
| Type 1 | Type 2 | Max Speed | Avg Speed | Min Speed |
+--------+---------+-----------+-------------------+-----------+
| Water | | 90 | 68.05263157894737 | 40 |
| Poison | Ground | 85 | 80.5 | 76 |
| Grass | Psychic | 55 | 47.5 | 40 |
| Water | Flying | 81 | 81.0 | 81 |
| Rock | Flying | 150 | 140.0 | 130 |
| Ice | Flying | 85 | 85.0 | 85 |
| Dragon | | 70 | 60.0 | 50 |
| Dragon | Flying | 80 | 80.0 | 80 |
| Fire | | 105 | 81.8 | 60 |
| Fire | Flying | 100 | 96.66666666666667 | 90 |
+--------+---------+-----------+-------------------+-----------+
Data truncated.
Setting Parameters#
Each built-in aggregate function accepts arguments for the parameters that affect its
operation. You can also set any of the following parameters with the builder approach.
When you use the builder, you must call build() to finish. For example, these two
expressions are equivalent:
first_1 = f.first_value(col("a"), order_by=[col("a")])
first_2 = f.first_value(col("a")).order_by(col("a")).build()
Ordering#
You can control the order in which rows are processed by aggregate functions by providing
a list of sort expressions for the order_by parameter. In the following example, we
sort the Pokemon by their attack in increasing order and take the first value, which gives us the
Pokemon with the smallest attack value in each Type 1.
df.aggregate(
[col('"Type 1"')],
[f.first_value(
col('"Name"'),
order_by=[col('"Attack"').sort(ascending=True)]
).alias("Smallest Attack")
])
DataFrame()
+----------+-----------------+
| Type 1 | Smallest Attack |
+----------+-----------------+
| Water | Magikarp |
| Rock | Omanyte |
| Ghost | Gastly |
| Ice | Jynx |
| Dragon | Dratini |
| Grass | Exeggcute |
| Fire | Vulpix |
| Normal | Chansey |
| Poison | Zubat |
| Fighting | Mankey |
+----------+-----------------+
Data truncated.
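What an ordered first_value computes can be sketched in plain Python: sort the rows by the ordering key, then keep the first value seen for each group. The first_value_by helper and the rows below are hypothetical stand-ins for illustration, not part of the DataFusion API or the Pokemon data.

```python
from operator import itemgetter


def first_value_by(rows, group_key, value_key, order_key, descending=False):
    """Return the first `value_key` per group after sorting by `order_key`."""
    result = {}
    for row in sorted(rows, key=itemgetter(order_key), reverse=descending):
        # setdefault keeps only the first value encountered for each group.
        result.setdefault(row[group_key], row[value_key])
    return result


# Illustrative rows only.
rows = [
    {"group": "A", "name": "a1", "attack": 30},
    {"group": "A", "name": "a2", "attack": 10},
    {"group": "B", "name": "b1", "attack": 55},
    {"group": "B", "name": "b2", "attack": 70},
]

smallest_attack = first_value_by(rows, "group", "name", "attack")
largest_attack = first_value_by(rows, "group", "name", "attack", descending=True)

print(smallest_attack)  # {'A': 'a2', 'B': 'b1'}
```

Reversing the sort direction turns the same aggregate into "name with the largest attack", which is why the ordering is a parameter of the aggregate rather than of the DataFrame.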
Distinct#
When you set the parameter distinct to True, each unique value is evaluated only
once. Suppose we want to create an array of all of the Type 2 values for each Type 1 of our
Pokemon set. Since there will be many repeated entries of Type 2, we only want one of each distinct value.
df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True).alias("Type 2 List")])
DataFrame()
+----------+--------------------------------------------------+
| Type 1 | Type 2 List |
+----------+--------------------------------------------------+
| Water | [Fighting, Flying, , Poison, Psychic, Dark, Ice] |
| Rock | [Water, Ground, Flying] |
| Ghost | [Poison] |
| Ice | [Flying, Psychic] |
| Dragon | [, Flying] |
| Grass | [Psychic, , Poison] |
| Fire | [, Dragon, Flying] |
| Normal | [Fairy, Flying, ] |
| Poison | [Ground, Flying, ] |
| Fighting | [] |
+----------+--------------------------------------------------+
Data truncated.
In the output above we can see that there are some Type 1 groups for which a Type 2 entry
is null. In practice, we probably want to filter those out. We can do this in two ways. First,
we can filter out DataFrame rows that have no Type 2. If we do this, some Type 1
entries might be removed entirely. Second, we can use the filter argument described below.
df.filter(col_type_2.is_not_null()).aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True).alias("Type 2 List")])
df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True, filter=col_type_2.is_not_null()).alias("Type 2 List")])
DataFrame()
+----------+------------------------------------------------+
| Type 1 | Type 2 List |
+----------+------------------------------------------------+
| Water | [Fighting, Ice, Flying, Psychic, Dark, Poison] |
| Rock | [Flying, Ground, Water] |
| Ghost | [Poison] |
| Ice | [Psychic, Flying] |
| Dragon | [Flying] |
| Grass | [Psychic, Poison] |
| Fire | [Flying, Dragon] |
| Normal | [Fairy, Flying] |
| Poison | [Flying, Ground] |
| Fighting | |
+----------+------------------------------------------------+
Data truncated.
Which approach you take should depend on your use case.
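The practical difference between the two approaches is whether a group can disappear. The following plain-Python model is a sketch under that reading, with a hypothetical collect_distinct helper and made-up rows. Filtering rows first drops any group whose rows are all null, while filtering inside the aggregate keeps the group and leaves its value empty (modelled here as None, matching the blank Fighting cell in the output above).

```python
def collect_distinct(rows, row_filter=None, agg_filter=None):
    """Model a distinct array aggregate per group.

    row_filter drops rows before grouping; agg_filter only limits
    which values the aggregate itself gets to see.
    """
    groups = {}
    for key, value in rows:
        if row_filter is not None and not row_filter(value):
            continue  # the row is gone, and its group may vanish with it
        values = groups.setdefault(key, [])  # the group exists from here on
        if agg_filter is not None and not agg_filter(value):
            continue
        if value not in values:
            values.append(value)
    # An aggregate that collected nothing is reported as None.
    return {key: values or None for key, values in groups.items()}


def is_not_null(value):
    return value is not None


# Illustrative (Type 1, Type 2) pairs only.
rows = [("Water", "Ice"), ("Water", None), ("Water", "Ice"), ("Fighting", None)]

pre_filtered = collect_distinct(rows, row_filter=is_not_null)
agg_filtered = collect_distinct(rows, agg_filter=is_not_null)

print(pre_filtered)  # {'Water': ['Ice']}
print(agg_filtered)  # {'Water': ['Ice'], 'Fighting': None}
```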
Null Treatment#
This option allows you to either respect or ignore null values.
One common usage for handling nulls is the case where you want to find the first value within a partition. By setting the null treatment to ignore nulls, we can find the first non-null value in our partition.
from datafusion.common import NullTreatment
df.aggregate([col_type_1], [
f.first_value(
col_type_2,
order_by=[col_attack],
null_treatment=NullTreatment.RESPECT_NULLS
).alias("Lowest Attack Type 2")])
df.aggregate([col_type_1], [
f.first_value(
col_type_2,
order_by=[col_attack],
null_treatment=NullTreatment.IGNORE_NULLS
).alias("Lowest Attack Type 2")])
DataFrame()
+----------+----------------------+
| Type 1 | Lowest Attack Type 2 |
+----------+----------------------+
| Water | Poison |
| Rock | Water |
| Ghost | Poison |
| Ice | Psychic |
| Dragon | Flying |
| Grass | Psychic |
| Fire | Flying |
| Normal | Flying |
| Poison | Flying |
| Fighting | |
+----------+----------------------+
Data truncated.
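The two null treatments can be illustrated with a small plain-Python model. It assumes the values are already in the order given by order_by; the first_value function here is a hypothetical stand-in, not the DataFusion function of the same name.

```python
def first_value(values, ignore_nulls=False):
    """Return the first value, optionally skipping None entries."""
    for value in values:
        if value is not None or not ignore_nulls:
            return value
    return None  # empty input, or nothing but nulls while ignoring nulls


# Illustrative Type 2 values for one group, already sorted by attack.
ordered_type_2 = [None, "Poison", "Flying"]

respecting = first_value(ordered_type_2)
ignoring = first_value(ordered_type_2, ignore_nulls=True)

print(respecting)  # None
print(ignoring)    # Poison
```

A group whose values are all null still yields null when nulls are ignored, which is consistent with the empty Fighting cell in the output above.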
Filter#
The filter option restricts which rows an aggregate function evaluates. As the example above shows, this lets you filter the rows seen by a single aggregate function without filtering rows out of the entire DataFrame.
Filter takes a single expression.
Suppose we want the average speed of only those Pokemon that have low Attack values.
df.aggregate([col_type_1], [
f.avg(col_speed).alias("Avg Speed All"),
f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])
DataFrame()
+----------+--------------------+----------------------+
| Type 1 | Avg Speed All | Avg Speed Low Attack |
+----------+--------------------+----------------------+
| Water | 67.25806451612904 | 63.833333333333336 |
| Rock | 67.5 | 52.5 |
| Ghost | 103.75 | 80.0 |
| Ice | 90.0 | |
| Dragon | 66.66666666666667 | |
| Grass | 54.23076923076923 | 42.5 |
| Fire | 86.28571428571429 | 65.0 |
| Normal | 72.75 | 52.8 |
| Poison | 58.785714285714285 | 48.0 |
| Fighting | 66.14285714285714 | |
+----------+--------------------+----------------------+
Data truncated.
Comparing subsets within a group#
Sometimes you need to compare the full membership of a group against a
subset that meets some condition — for example, “which groups have at least
one failure, but not every member failed?”. The filter argument on an
aggregate restricts the rows that contribute to that aggregate without
dropping the group, so a single pass can produce both the full set and the
filtered subset side by side. Pairing
array_agg() with distinct=True and
filter= is a compact way to express this: collect the distinct values
of the group, collect the distinct values that satisfy the condition, then
compare the two arrays.
Suppose each row records a line item with the supplier that fulfilled it and a flag for whether that supplier met the commit date. We want to identify partially failed orders — orders where at least one supplier failed but not every supplier failed:
orders_df = ctx.from_pydict(
{
"order_id": [1, 1, 1, 2, 2, 3, 4, 4],
"supplier_id": [100, 101, 102, 200, 201, 300, 400, 401],
"failed": [False, True, False, False, False, True, True, True],
},
)
grouped = orders_df.aggregate(
[col("order_id")],
[
f.array_agg(col("supplier_id"), distinct=True).alias("all_suppliers"),
f.array_agg(
col("supplier_id"),
filter=col("failed"),
distinct=True,
).alias("failed_suppliers"),
],
)
grouped.filter(
(f.array_length(col("failed_suppliers")) > lit(0))
& (f.array_length(col("failed_suppliers")) < f.array_length(col("all_suppliers")))
).select(col("order_id"), col("failed_suppliers"))
DataFrame()
+----------+------------------+
| order_id | failed_suppliers |
+----------+------------------+
| 1 | [101] |
+----------+------------------+
Order 1 is partial (one of three suppliers failed). Order 2 is excluded because no supplier failed, order 3 because its only supplier failed, and order 4 because both of its suppliers failed.
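As a cross-check of the reasoning above, the same rule can be evaluated in plain Python with sets, using the order data from the example. This is only an independent sketch of the logic, not how DataFusion executes the query.

```python
from collections import defaultdict

order_ids = [1, 1, 1, 2, 2, 3, 4, 4]
supplier_ids = [100, 101, 102, 200, 201, 300, 400, 401]
failed_flags = [False, True, False, False, False, True, True, True]

all_suppliers = defaultdict(set)
failed_suppliers = defaultdict(set)
for order_id, supplier_id, failed in zip(order_ids, supplier_ids, failed_flags):
    all_suppliers[order_id].add(supplier_id)      # the full distinct set
    if failed:
        failed_suppliers[order_id].add(supplier_id)  # the filtered subset

# Partial failure: at least one failed supplier, but fewer than all of them.
partially_failed = {
    order_id: sorted(failed_suppliers[order_id])
    for order_id, suppliers in all_suppliers.items()
    if 0 < len(failed_suppliers[order_id]) < len(suppliers)
}

print(partially_failed)  # {1: [101]}
```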
Grouping Sets#
The default style of aggregation produces one row per group. Sometimes you want a single query to produce rows at multiple levels of detail, for example totals per type and an overall grand total, or subtotals for every combination of two columns plus the individual column totals. Writing separate queries and concatenating them is tedious and scans the data multiple times. Grouping sets solve this by letting you specify several grouping levels in one pass.
DataFusion supports three grouping set styles through the
GroupingSet class:
- rollup(): hierarchical subtotals, like a drill-down report
- cube(): every possible subtotal combination, like a pivot table
- grouping_sets(): an explicit list of exactly the grouping levels you want
Because result rows come from different grouping levels, a column that is not part of a
particular level will be null in that row. Use grouping() to
distinguish a real null in the data from one that means “this column was aggregated across.”
It returns 0 when the column is a grouping key for that row, and 1 when it is not.
Rollup#
rollup() creates a hierarchy. rollup(a, b) produces
grouping sets (a, b), (a), and () — like nested subtotals in a report. This is useful
when your columns have a natural hierarchy, such as region → city or type → subtype.
Suppose we want to summarize Pokemon stats by Type 1 with subtotals and a grand total. With
the default aggregation style we would need two separate queries. With rollup we get it all at
once:
from datafusion.expr import GroupingSet
df.aggregate(
[GroupingSet.rollup(col_type_1)],
[f.count(col_speed).alias("Count"),
f.avg(col_speed).alias("Avg Speed"),
f.max(col_speed).alias("Max Speed")]
).sort(col_type_1.sort(ascending=True, nulls_first=True))
DataFrame()
+----------+-------+-------------------+-----------+
| Type 1 | Count | Avg Speed | Max Speed |
+----------+-------+-------------------+-----------+
| | 163 | 71.65030674846626 | 150 |
| Bug | 14 | 66.78571428571429 | 145 |
| Dragon | 3 | 66.66666666666667 | 80 |
| Electric | 9 | 98.88888888888889 | 140 |
| Fairy | 2 | 47.5 | 60 |
| Fighting | 7 | 66.14285714285714 | 95 |
| Fire | 14 | 86.28571428571429 | 105 |
| Ghost | 4 | 103.75 | 130 |
| Grass | 13 | 54.23076923076923 | 80 |
| Ground | 8 | 58.125 | 120 |
+----------+-------+-------------------+-----------+
Data truncated.
The first row — where Type 1 is null — is the grand total across all types. But how do you
tell a grand-total null apart from a Pokemon that genuinely has no type? The
grouping() function returns 0 when the column is a grouping key
for that row and 1 when it is aggregated across.
Apply .alias() to the grouping() expression to give the column a readable name:
result = df.aggregate(
[GroupingSet.rollup(col_type_1)],
[f.count(col_speed).alias("Count"),
f.avg(col_speed).alias("Avg Speed"),
f.grouping(col_type_1).alias("Is Total")]
)
result.sort(col_type_1.sort(ascending=True, nulls_first=True))
DataFrame()
+----------+-------+-------------------+----------+
| Type 1 | Count | Avg Speed | Is Total |
+----------+-------+-------------------+----------+
| | 163 | 71.65030674846626 | 1 |
| Bug | 14 | 66.78571428571429 | 0 |
| Dragon | 3 | 66.66666666666667 | 0 |
| Electric | 9 | 98.88888888888889 | 0 |
| Fairy | 2 | 47.5 | 0 |
| Fighting | 7 | 66.14285714285714 | 0 |
| Fire | 14 | 86.28571428571429 | 0 |
| Ghost | 4 | 103.75 | 0 |
| Grass | 13 | 54.23076923076923 | 0 |
| Ground | 8 | 58.125 | 0 |
+----------+-------+-------------------+----------+
Data truncated.
With two columns the hierarchy becomes more apparent. rollup(Type 1, Type 2) produces:
- one row per (Type 1, Type 2) pair: the most detailed level
- one row per Type 1: subtotals
- one grand total row
df.aggregate(
[GroupingSet.rollup(col_type_1, col_type_2)],
[f.count(col_speed).alias("Count"),
f.avg(col_speed).alias("Avg Speed")]
).sort(
col_type_1.sort(ascending=True, nulls_first=True),
col_type_2.sort(ascending=True, nulls_first=True)
)
DataFrame()
+----------+--------+-------+--------------------+
| Type 1 | Type 2 | Count | Avg Speed |
+----------+--------+-------+--------------------+
| | | 163 | 71.65030674846626 |
| Bug | | 3 | 53.333333333333336 |
| Bug | | 14 | 66.78571428571429 |
| Bug | Flying | 3 | 93.33333333333333 |
| Bug | Grass | 2 | 27.5 |
| Bug | Poison | 6 | 73.33333333333333 |
| Dragon | | 3 | 66.66666666666667 |
| Dragon | | 2 | 60.0 |
| Dragon | Flying | 1 | 80.0 |
| Electric | | 6 | 112.5 |
+----------+--------+-------+--------------------+
Data truncated.
Cube#
cube() produces every possible subset. cube(a, b)
produces grouping sets (a, b), (a), (b), and () — one more than rollup because
it also includes (b) alone. This is useful when neither column is “above” the other in a
hierarchy and you want all cross-tabulations.
For our Pokemon data, cube(Type 1, Type 2) gives us stats broken down by the type pair,
by Type 1 alone, by Type 2 alone, and a grand total — all in one query:
df.aggregate(
[GroupingSet.cube(col_type_1, col_type_2)],
[f.count(col_speed).alias("Count"),
f.avg(col_speed).alias("Avg Speed")]
).sort(
col_type_1.sort(ascending=True, nulls_first=True),
col_type_2.sort(ascending=True, nulls_first=True)
)
DataFrame()
+--------+----------+-------+--------------------+
| Type 1 | Type 2 | Count | Avg Speed |
+--------+----------+-------+--------------------+
| | | 86 | 72.46511627906976 |
| | | 163 | 71.65030674846626 |
| | Dark | 1 | 81.0 |
| | Dragon | 1 | 100.0 |
| | Fairy | 3 | 51.666666666666664 |
| | Fighting | 1 | 70.0 |
| | Flying | 23 | 91.08695652173913 |
| | Grass | 2 | 27.5 |
| | Ground | 6 | 55.166666666666664 |
| | Ice | 3 | 66.66666666666667 |
+--------+----------+-------+--------------------+
Data truncated.
Compared to the rollup example above, notice the extra rows where Type 1 is null but
Type 2 has a value — those are the per-Type 2 subtotals that rollup does not include.
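The grouping sets that rollup and cube expand to, as described in the two sections above, can be enumerated in a few lines of plain Python. The rollup_sets and cube_sets helpers are hypothetical names used for illustration and are not part of the GroupingSet API.

```python
from itertools import combinations


def rollup_sets(*columns):
    """Prefixes of the column list, from most to least detailed."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube_sets(*columns):
    """Every subset of the columns, from most to least detailed."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


print(rollup_sets("a", "b"))  # [('a', 'b'), ('a',), ()]
print(cube_sets("a", "b"))    # [('a', 'b'), ('a',), ('b',), ()]
```

For n columns this gives n + 1 grouping sets for rollup and 2 to the power n for cube, so cube output grows quickly as columns are added.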
Explicit Grouping Sets#
grouping_sets() lets you list exactly which grouping levels
you need when rollup or cube would produce too many or too few. Each argument is a list of
columns forming one grouping set.
For example, if we want only the per-Type 1 totals and per-Type 2 totals — but not the
full (Type 1, Type 2) detail rows or the grand total — we can ask for exactly that:
df.aggregate(
[GroupingSet.grouping_sets([col_type_1], [col_type_2])],
[f.count(col_speed).alias("Count"),
f.avg(col_speed).alias("Avg Speed")]
).sort(
col_type_1.sort(ascending=True, nulls_first=True),
col_type_2.sort(ascending=True, nulls_first=True)
)
DataFrame()
+--------+----------+-------+--------------------+
| Type 1 | Type 2 | Count | Avg Speed |
+--------+----------+-------+--------------------+
| | | 86 | 72.46511627906976 |
| | Dark | 1 | 81.0 |
| | Dragon | 1 | 100.0 |
| | Fairy | 3 | 51.666666666666664 |
| | Fighting | 1 | 70.0 |
| | Flying | 23 | 91.08695652173913 |
| | Grass | 2 | 27.5 |
| | Ground | 6 | 55.166666666666664 |
| | Ice | 3 | 66.66666666666667 |
| | Poison | 22 | 71.5909090909091 |
+--------+----------+-------+--------------------+
Data truncated.
Each row belongs to exactly one grouping level. The grouping()
function tells you which level each row comes from:
result = df.aggregate(
[GroupingSet.grouping_sets([col_type_1], [col_type_2])],
[f.count(col_speed).alias("Count"),
f.avg(col_speed).alias("Avg Speed"),
f.grouping(col_type_1).alias("grouping(Type 1)"),
f.grouping(col_type_2).alias("grouping(Type 2)")]
)
result.sort(
col_type_1.sort(ascending=True, nulls_first=True),
col_type_2.sort(ascending=True, nulls_first=True)
)
DataFrame()
+--------+----------+-------+--------------------+------------------+------------------+
| Type 1 | Type 2 | Count | Avg Speed | grouping(Type 1) | grouping(Type 2) |
+--------+----------+-------+--------------------+------------------+------------------+
| | | 86 | 72.46511627906976 | 1 | 0 |
| | Dark | 1 | 81.0 | 1 | 0 |
| | Dragon | 1 | 100.0 | 1 | 0 |
| | Fairy | 3 | 51.666666666666664 | 1 | 0 |
| | Fighting | 1 | 70.0 | 1 | 0 |
| | Flying | 23 | 91.08695652173913 | 1 | 0 |
| | Grass | 2 | 27.5 | 1 | 0 |
| | Ground | 6 | 55.166666666666664 | 1 | 0 |
| | Ice | 3 | 66.66666666666667 | 1 | 0 |
| | Poison | 22 | 71.5909090909091 | 1 | 0 |
+--------+----------+-------+--------------------+------------------+------------------+
Data truncated.
Where grouping(Type 1) is 0 the row is a per-Type 1 total (and Type 2 is null).
Where grouping(Type 2) is 0 the row is a per-Type 2 total (and Type 1 is null).
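To see why the grouping flag is needed, consider a plain-Python model of a count over explicit grouping sets. This is a sketch only: the aggregate_grouping_sets helper, the column names, and the rows are all hypothetical. Columns outside the current grouping set are emitted as None with a flag of 1, while a None that comes from the data keeps a flag of 0.

```python
from collections import defaultdict


def aggregate_grouping_sets(rows, columns, grouping_sets):
    """Count rows once per grouping set and attach grouping flags."""
    output = []
    for grouping_set in grouping_sets:
        counts = defaultdict(int)
        for row in rows:
            # Columns that are not grouping keys at this level become None.
            key = tuple(row[c] if c in grouping_set else None for c in columns)
            counts[key] += 1
        # 0 = column is a grouping key for these rows, 1 = aggregated across.
        flags = tuple(0 if c in grouping_set else 1 for c in columns)
        for key, count in counts.items():
            output.append({"key": key, "count": count, "grouping": flags})
    return output


# Illustrative rows only; the second row has a real null in type_2.
rows = [
    {"type_1": "Bug", "type_2": "Flying"},
    {"type_1": "Bug", "type_2": None},
    {"type_1": "Fire", "type_2": "Flying"},
]

result = aggregate_grouping_sets(
    rows, ["type_1", "type_2"], [["type_1"], ["type_2"]]
)
for line in result:
    print(line)
```

The last row of the result has the key (None, None) with flags (1, 0): the first None means type_1 was aggregated across, while the second is a genuine null in the data. Without the flags the two would be indistinguishable.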
Aggregate Functions#
The available aggregate functions are:
- Comparison Functions
- Array Functions
- Statistical Functions
- Linear Regression Functions
- String Functions
- Grouping Set Functions
  - datafusion.functions.grouping()
  - datafusion.expr.GroupingSet.rollup()
  - datafusion.expr.GroupingSet.cube()
  - datafusion.expr.GroupingSet.grouping_sets()
The functions in the datafusion.functions.spark namespace mirror Apache
Spark semantics, which can differ from the DataFusion built-ins of the same
name. They live in a separate namespace so you opt in explicitly. See
Spark-Compatible Functions for the full catalog and the semantic differences.
User-Defined Aggregate Functions#
You can ship custom aggregations to the engine by subclassing
Accumulator and registering it via
udaf(). See datafusion.user_defined for
the accumulator interface and worked examples.
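The life cycle of an accumulator (update per batch, export state, merge partial states, evaluate) can be sketched with a plain-Python stand-in. This is not the real interface: it assumes the four method names update, state, merge, and evaluate, uses lists and floats where the real Accumulator exchanges Arrow arrays and scalars, and the SumAccumulator class is hypothetical and does not subclass anything.

```python
class SumAccumulator:
    """Plain-Python stand-in that sums values, skipping nulls."""

    def __init__(self):
        self._total = 0.0

    def update(self, values):
        # Fold one batch of input values into the running total.
        self._total += sum(v for v in values if v is not None)

    def state(self):
        # Export the partial result; one entry per state field.
        return [self._total]

    def merge(self, states):
        # states[0] holds the partial totals from other accumulators.
        self._total += sum(states[0])

    def evaluate(self):
        return self._total


# Three partitions are accumulated independently, then merged.
partitions = [[1.0, 2.0, None], [3.0], [4.0, 5.0]]
partials = []
for batch in partitions:
    acc = SumAccumulator()
    acc.update(batch)
    partials.append(acc.state())

final = SumAccumulator()
final.merge([[state[0] for state in partials]])
total = final.evaluate()

print(total)  # 15.0
```

Splitting the work into update and merge is what allows each partition to be aggregated separately before the partial states are combined.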
Note
Serialization
Python aggregate UDFs travel inline inside pickled or
to_bytes()-serialized expressions —
the accumulator class is captured by value via cloudpickle,
so worker processes do not need to pre-register the UDF. Any names
the accumulator resolves via import are captured by reference
and must be importable on the receiving worker. See
datafusion.ipc for the full IPC model and security caveats.