Column Selections

Use select() for basic column selection.

DataFusion can work with several file types, to start simple we can use a subset of the TLC Trip Record Data

In [1]: import urllib.request

In [2]: from datafusion import SessionContext

In [3]: urllib.request.urlretrieve("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
   ...:                            "yellow_trip_data.parquet")
   ...: 
Out[3]: ('yellow_trip_data.parquet', <http.client.HTTPMessage at 0x7f7415f15490>)

In [4]: ctx = SessionContext()

In [5]: df = ctx.read_parquet("yellow_trip_data.parquet")

In [6]: df.select_columns("trip_distance", "passenger_count")
Out[6]: 
DataFrame()
+---------------+-----------------+
| trip_distance | passenger_count |
+---------------+-----------------+
| 2.1           | 1.0             |
| 0.2           | 1.0             |
| 14.7          | 1.0             |
| 10.6          | 0.0             |
| 4.94          | 1.0             |
| 1.6           | 1.0             |
| 4.1           | 1.0             |
| 5.7           | 1.0             |
| 9.1           | 1.0             |
| 2.7           | 2.0             |
+---------------+-----------------+

For mathematical or logical operations use col() to select columns, and give meaningful names to the resulting operations using alias()

In [7]: from datafusion import col, lit

In [8]: df.select((col("tip_amount") + col("tolls_amount")).alias("tips_plus_tolls"))
Out[8]: 
DataFrame()
+-----------------+
| tips_plus_tolls |
+-----------------+
| 0.0             |
| 0.0             |
| 8.65            |
| 6.05            |
| 4.06            |
| 2.35            |
| 0.0             |
| 0.0             |
| 0.0             |
| 3.15            |
+-----------------+

Warning

Please be aware that all identifiers are effectively made lower-case in SQL, so if your file has capital letters (ex: Name) you must put your column name in double quotes or the selection won’t work. As an alternative for simple column selection use select_columns() without double quotes

For selecting columns with capital letters use '"VendorID"'

In [9]: df.select(col('"VendorID"'))
Out[9]: 
DataFrame()
+----------+
| VendorID |
+----------+
| 1        |
| 1        |
| 1        |
| 1        |
| 2        |
| 1        |
| 1        |
| 1        |
| 1        |
| 1        |
+----------+

To combine it with literal values use the lit()

In [10]: large_trip_distance = col("trip_distance") > lit(5.0)

In [11]: low_passenger_count = col("passenger_count") < lit(4)

In [12]: df.select((large_trip_distance & low_passenger_count).alias("lonely_trips"))
Out[12]: 
DataFrame()
+--------------+
| lonely_trips |
+--------------+
| false        |
| false        |
| true         |
| true         |
| false        |
| false        |
| false        |
| true         |
| true         |
| false        |
+--------------+