Concepts
This section walks through a basic example to introduce a few key concepts. The example reads taxi trip records from the Parquet file yellow_tripdata_2021-01.parquet.
In [1]: from datafusion import SessionContext, col, lit, functions as f
In [2]: ctx = SessionContext()
In [3]: df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")
In [4]: df = df.select(
...: "trip_distance",
...: col("total_amount").alias("total"),
...: (f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"),
...: )
...:
In [5]: df.show()
DataFrame()
+---------------+-------+-------------+
| trip_distance | total | tip_percent |
+---------------+-------+-------------+
| 2.1           | 11.8  | 0.0         |
| 0.2           | 4.3   | 0.0         |
| 14.7          | 51.95 | 16.7        |
| 10.6          | 36.35 | 16.6        |
| 4.94          | 24.36 | 16.7        |
| 1.6           | 14.15 | 16.6        |
| 4.1           | 17.3  | 0.0         |
| 5.7           | 21.8  | 0.0         |
| 9.1           | 28.8  | 0.0         |
| 2.7           | 18.95 | 16.6        |
| 6.11          | 24.3  | 0.0         |
| 1.21          | 10.79 | 23.1        |
| 7.4           | 33.92 | 0.0         |
| 1.7           | 14.16 | 16.7        |
| 0.81          | 8.3   | 0.0         |
| 1.01          | 10.3  | 9.7         |
| 0.73          | 12.09 | 23.1        |
| 1.17          | 12.36 | 16.7        |
| 0.78          | 9.96  | 16.7        |
| 1.66          | 12.3  | 0.0         |
+---------------+-------+-------------+
Session Context
The first statement group creates a SessionContext.
# Create a context (SessionContext is imported directly from datafusion above)
ctx = SessionContext()
A Session Context is the main interface for executing queries with DataFusion. It maintains the state of the connection between a user and an instance of the DataFusion engine. It also provides the following functionality:
Create a DataFrame from a data source.
Register a data source as a table that can be referenced from a SQL query.
Execute a SQL query.
DataFrame
The second statement group creates a DataFrame:
# Create a DataFrame from a file
df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")
A DataFrame refers to a (logical) set of rows that share the same column names, similar to a Pandas DataFrame.
DataFrames are typically created by calling a method on SessionContext, such as read_csv(), and can then be modified by calling transformation methods, such as filter(), select(), aggregate(), and limit(), to build up a query definition.
Expressions
The third statement uses Expressions to build up a query definition. You can find explanations of what the functions below do in the user documentation for col(), lit(), round(), and alias().
df = df.select(
"trip_distance",
col("total_amount").alias("total"),
(f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"),
)
Finally, the show() method converts the logical plan represented by the DataFrame into a physical plan and executes it, collecting all results and displaying them to the user. Note that DataFusion evaluates DataFrames lazily: until you call a method such as show() or collect(), DataFusion will not run the query.