Column Selections¶
Use select()
for basic column selection.
DataFusion can work with several file types, to start simple we can use a subset of the TLC Trip Record Data
In [1]: import urllib.request
In [2]: from datafusion import SessionContext
In [3]: urllib.request.urlretrieve("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
...: "yellow_trip_data.parquet")
...:
Out[3]: ('yellow_trip_data.parquet', <http.client.HTTPMessage at 0x7f7415f15490>)
In [4]: ctx = SessionContext()
In [5]: df = ctx.read_parquet("yellow_trip_data.parquet")
In [6]: df.select_columns("trip_distance", "passenger_count")
Out[6]:
DataFrame()
+---------------+-----------------+
| trip_distance | passenger_count |
+---------------+-----------------+
| 2.1 | 1.0 |
| 0.2 | 1.0 |
| 14.7 | 1.0 |
| 10.6 | 0.0 |
| 4.94 | 1.0 |
| 1.6 | 1.0 |
| 4.1 | 1.0 |
| 5.7 | 1.0 |
| 9.1 | 1.0 |
| 2.7 | 2.0 |
+---------------+-----------------+
For mathematical or logical operations use col()
to select columns, and give meaningful names to the resulting
operations using alias()
In [7]: from datafusion import col, lit
In [8]: df.select((col("tip_amount") + col("tolls_amount")).alias("tips_plus_tolls"))
Out[8]:
DataFrame()
+-----------------+
| tips_plus_tolls |
+-----------------+
| 0.0 |
| 0.0 |
| 8.65 |
| 6.05 |
| 4.06 |
| 2.35 |
| 0.0 |
| 0.0 |
| 0.0 |
| 3.15 |
+-----------------+
Warning
Please be aware that all identifiers are effectively made lower-case in SQL, so if your file has capital letters
(ex: Name) you must put your column name in double quotes or the selection won’t work. As an alternative for simple
column selection use select_columns()
without double quotes
For selecting columns with capital letters use '"VendorID"'
In [9]: df.select(col('"VendorID"'))
Out[9]:
DataFrame()
+----------+
| VendorID |
+----------+
| 1 |
| 1 |
| 1 |
| 1 |
| 2 |
| 1 |
| 1 |
| 1 |
| 1 |
| 1 |
+----------+
To combine it with literal values use the lit()
In [10]: large_trip_distance = col("trip_distance") > lit(5.0)
In [11]: low_passenger_count = col("passenger_count") < lit(4)
In [12]: df.select((large_trip_distance & low_passenger_count).alias("lonely_trips"))
Out[12]:
DataFrame()
+--------------+
| lonely_trips |
+--------------+
| false |
| false |
| true |
| true |
| false |
| false |
| false |
| true |
| true |
| false |
+--------------+