Column Selections

Use select() for basic column selection.

DataFusion can work with several file types, to start simple we can use a subset of the TLC Trip Record Data, which you can download here.

In [1]: from datafusion import SessionContext

In [2]: ctx = SessionContext()

In [3]: df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")

In [4]: df.select("trip_distance", "passenger_count")
Out[4]: 
DataFrame()
+---------------+-----------------+
| trip_distance | passenger_count |
+---------------+-----------------+
| 2.1           | 1.0             |
| 0.2           | 1.0             |
| 14.7          | 1.0             |
| 10.6          | 0.0             |
| 4.94          | 1.0             |
| 1.6           | 1.0             |
| 4.1           | 1.0             |
| 5.7           | 1.0             |
| 9.1           | 1.0             |
| 2.7           | 2.0             |
+---------------+-----------------+

For mathematical or logical operations use col() to select columns, and give meaningful names to the resulting operations using alias()

In [5]: from datafusion import col, lit

In [6]: df.select((col("tip_amount") + col("tolls_amount")).alias("tips_plus_tolls"))
Out[6]: 
DataFrame()
+-----------------+
| tips_plus_tolls |
+-----------------+
| 0.0             |
| 0.0             |
| 8.65            |
| 6.05            |
| 4.06            |
| 2.35            |
| 0.0             |
| 0.0             |
| 0.0             |
| 3.15            |
+-----------------+

Warning

Please be aware that all identifiers are effectively made lower-case in SQL, so if your file has capital letters (ex: Name) you must put your column name in double quotes or the selection won’t work. As an alternative for simple column selection use select() without double quotes

For selecting columns with capital letters use '"VendorID"'

In [7]: df.select(col('"VendorID"'))
Out[7]: 
DataFrame()
+----------+
| VendorID |
+----------+
| 1        |
| 1        |
| 1        |
| 1        |
| 2        |
| 1        |
| 1        |
| 1        |
| 1        |
| 1        |
+----------+

To combine it with literal values use the lit()

In [8]: large_trip_distance = col("trip_distance") > lit(5.0)

In [9]: low_passenger_count = col("passenger_count") < lit(4)

In [10]: df.select((large_trip_distance & low_passenger_count).alias("lonely_trips"))
Out[10]: 
DataFrame()
+--------------+
| lonely_trips |
+--------------+
| false        |
| false        |
| true         |
| true         |
| false        |
| false        |
| false        |
| true         |
| true         |
| false        |
+--------------+