Column Selections¶
Use select()
for basic column selection.
DataFusion can work with several file types, to start simple we can use a subset of the TLC Trip Record Data, which you can download here.
In [1]: from datafusion import SessionContext
In [2]: ctx = SessionContext()
In [3]: df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")
In [4]: df.select("trip_distance", "passenger_count")
Out[4]:
DataFrame()
+---------------+-----------------+
| trip_distance | passenger_count |
+---------------+-----------------+
| 2.1 | 1.0 |
| 0.2 | 1.0 |
| 14.7 | 1.0 |
| 10.6 | 0.0 |
| 4.94 | 1.0 |
| 1.6 | 1.0 |
| 4.1 | 1.0 |
| 5.7 | 1.0 |
| 9.1 | 1.0 |
| 2.7 | 2.0 |
+---------------+-----------------+
For mathematical or logical operations use col()
to select columns, and give meaningful names to the resulting
operations using alias()
In [5]: from datafusion import col, lit
In [6]: df.select((col("tip_amount") + col("tolls_amount")).alias("tips_plus_tolls"))
Out[6]:
DataFrame()
+-----------------+
| tips_plus_tolls |
+-----------------+
| 0.0 |
| 0.0 |
| 8.65 |
| 6.05 |
| 4.06 |
| 2.35 |
| 0.0 |
| 0.0 |
| 0.0 |
| 3.15 |
+-----------------+
Warning
Please be aware that all identifiers are effectively made lower-case in SQL, so if your file has capital letters
(ex: Name) you must put your column name in double quotes or the selection won’t work. As an alternative for simple
column selection use select()
without double quotes
For selecting columns with capital letters use '"VendorID"'
In [7]: df.select(col('"VendorID"'))
Out[7]:
DataFrame()
+----------+
| VendorID |
+----------+
| 1 |
| 1 |
| 1 |
| 1 |
| 2 |
| 1 |
| 1 |
| 1 |
| 1 |
| 1 |
+----------+
To combine it with literal values use the lit()
In [8]: large_trip_distance = col("trip_distance") > lit(5.0)
In [9]: low_passenger_count = col("passenger_count") < lit(4)
In [10]: df.select((large_trip_distance & low_passenger_count).alias("lonely_trips"))
Out[10]:
DataFrame()
+--------------+
| lonely_trips |
+--------------+
| false |
| false |
| true |
| true |
| false |
| false |
| false |
| true |
| true |
| false |
+--------------+