Joins

DataFusion supports the following join variants via the method join()

  • Inner Join

  • Left Join

  • Right Join

  • Full Join

  • Left Semi Join

  • Left Anti Join

For the examples in this section we’ll use the following two DataFrames

In [1]: from datafusion import SessionContext

In [2]: ctx = SessionContext()

In [3]: left = ctx.from_pydict(
   ...:     {
   ...:         "customer_id": [1, 2, 3],
   ...:         "customer": ["Alice", "Bob", "Charlie"],
   ...:     }
   ...: )
   ...: 

In [4]: right = ctx.from_pylist([
   ...:     {"id": 1, "name": "CityCabs"},
   ...:     {"id": 2, "name": "MetroRide"},
   ...:     {"id": 5, "name": "UrbanGo"},
   ...: ])
   ...: 

Inner Join

When using an inner join, only rows containing the common values between the two join columns present in both DataFrames will be included in the resulting DataFrame.

In [5]: left.join(right, left_on="customer_id", right_on="id", how="inner")
Out[5]: 
DataFrame()
+-------------+----------+----+-----------+
| customer_id | customer | id | name      |
+-------------+----------+----+-----------+
| 1           | Alice    | 1  | CityCabs  |
| 2           | Bob      | 2  | MetroRide |
+-------------+----------+----+-----------+

The parameter join_keys specifies the columns from the left DataFrame and right DataFrame that contains the values that should match.

Left Join

A left join combines rows from two DataFrames using the key columns. It returns all rows from the left DataFrame and matching rows from the right DataFrame. If there’s no match in the right DataFrame, it returns null values for the corresponding columns.

In [6]: left.join(right, left_on="customer_id", right_on="id", how="left")
Out[6]: 
DataFrame()
+-------------+----------+----+-----------+
| customer_id | customer | id | name      |
+-------------+----------+----+-----------+
| 1           | Alice    | 1  | CityCabs  |
| 2           | Bob      | 2  | MetroRide |
| 3           | Charlie  |    |           |
+-------------+----------+----+-----------+

Full Join

A full join merges rows from two tables based on a related column, returning all rows from both tables, even if there is no match. Unmatched rows will have null values.

In [7]: left.join(right, left_on="customer_id", right_on="id", how="full")
Out[7]: 
DataFrame()
+-------------+----------+----+-----------+
| customer_id | customer | id | name      |
+-------------+----------+----+-----------+
| 1           | Alice    | 1  | CityCabs  |
| 2           | Bob      | 2  | MetroRide |
|             |          | 5  | UrbanGo   |
| 3           | Charlie  |    |           |
+-------------+----------+----+-----------+

Left Semi Join

A left semi join retrieves matching rows from the left table while omitting duplicates with multiple matches in the right table.

In [8]: left.join(right, left_on="customer_id", right_on="id", how="semi")
Out[8]: 
DataFrame()
+-------------+----------+
| customer_id | customer |
+-------------+----------+
| 1           | Alice    |
| 2           | Bob      |
+-------------+----------+

Left Anti Join

A left anti join shows all rows from the left table without any matching rows in the right table, based on a the specified matching columns. It excludes rows from the left table that have at least one matching row in the right table.

In [9]: left.join(right, left_on="customer_id", right_on="id", how="anti")
Out[9]: 
DataFrame()
+-------------+----------+
| customer_id | customer |
+-------------+----------+
| 3           | Charlie  |
+-------------+----------+

Duplicate Keys

It is common to join two DataFrames on a common column name. Starting in version 51.0.0, datafusion-python` will now coalesce on column with identical names by default. This reduces problems with ambiguous column selection after joins. You can disable this feature by setting the parameter coalesce_duplicate_keys to False.

In [10]: left = ctx.from_pydict(
   ....:     {
   ....:         "id": [1, 2, 3],
   ....:         "customer": ["Alice", "Bob", "Charlie"],
   ....:     }
   ....: )
   ....: 

In [11]: right = ctx.from_pylist([
   ....:     {"id": 1, "name": "CityCabs"},
   ....:     {"id": 2, "name": "MetroRide"},
   ....:     {"id": 5, "name": "UrbanGo"},
   ....: ])
   ....: 

In [12]: left.join(right, "id", how="inner")
Out[12]: 
DataFrame()
+----+----------+-----------+
| id | customer | name      |
+----+----------+-----------+
| 1  | Alice    | CityCabs  |
| 2  | Bob      | MetroRide |
+----+----------+-----------+

In contrast to the above example, if we wish to get both columns:

In [13]: left.join(right, "id", how="inner", coalesce_duplicate_keys=False)
Out[13]: 
DataFrame()
+----+----------+----+-----------+
| id | customer | id | name      |
+----+----------+----+-----------+
| 1  | Alice    | 1  | CityCabs  |
| 2  | Bob      | 2  | MetroRide |
+----+----------+----+-----------+